Are there advantages of using the Pydantic package to obtain structured data from LLMs compared to just politely asking for JSON in the prompt?
Python
Pydantic
Data Validation
LLMs
Published
March 4, 2025
Validating LLM responses with Pydantic
I came across Pydantic while looking for better ways to validate data when working with LLMs. Asking for JSON output in the LLM prompt mostly works fine, but sometimes it doesn’t, and that breaks any kind of automation. Here, I’m using Pydantic to validate the response from an LLM. I’ll experiment with a method that, when Pydantic reports validation errors, sends a new request to the LLM API with the error details included in the prompt. When creating this retry function, it’s important to cap the number of attempts - otherwise costs can escalate quickly. 😃
In this article, I’ll experiment with the OpenAI API.
Load libraries and initialize the OpenAI API client

# Import necessary packages
from pydantic import BaseModel, ValidationError, Field, EmailStr
from typing import List, Literal, Optional
import json
from datetime import date

# Load environment variables from a .env file using the dotenv library -
# that is where the OpenAI API key is stored
from dotenv import load_dotenv
import openai

load_dotenv()

# Initialize the OpenAI client (it reads OPENAI_API_KEY from the environment)
client = openai.OpenAI()
In the next step, I’ll demonstrate how to use Pydantic models to validate and structure responses from LLMs. By parsing the LLM output with a Pydantic model, I can quickly identify formatting or data issues and provide targeted feedback for retries. This approach helps ensure that the data returned by the LLM is both reliable and ready for downstream automation or analysis.
# Create a sample user input
user_input_json = '''
{
    "name": "Jan Uzytkownik",
    "email": "jan.uzytkownik@example.com",
    "query": "I forgot my password. Treat this very urgently please.",
    "order_id": null,
    "transaction_date": null
}'''
I’ll first define a Pydantic model for the User Input.
Define the Pydantic model for validation
# Define the Pydantic model for validation
class UserInput(BaseModel):
    name: str
    email: EmailStr
    query: str
    order_id: Optional[int] = Field(
        None,
        description="5-digit order number (cannot start with 0)",
        ge=10000,
        le=99999,
    )
    transaction_date: Optional[date] = None
Validate sample user input using the Pydantic model
# Create an instance of the model using the sample user input
try:
    user_input = UserInput.model_validate_json(user_input_json)
    print("User input is valid:", user_input)
except ValidationError as e:
    print("Validation error:", e.json())
User input is valid: name='Jan Uzytkownik' email='jan.uzytkownik@example.com' query='I forgot my password. Treat this very urgently please.' order_id=None transaction_date=None
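Just to check that the Field constraints actually fire, here’s a quick test with a made-up, out-of-range order_id (illustrative values, not part of the sample data):

# Illustrative check: the ge=10000 constraint should reject a short order number
try:
    UserInput(
        name="Jan Uzytkownik",
        email="jan.uzytkownik@example.com",
        query="Where is my order?",
        order_id=999,  # out of range, expect a ValidationError
    )
except ValidationError as e:
    print(e)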
Next, I’ll demonstrate how to extend the initial Pydantic model to handle more complex user queries. By creating a new model that inherits from the original, I can add fields such as department, category, priority, and tags. This allows for more granular validation and routing of user requests, which is especially useful when automating support workflows or categorizing incoming queries for downstream processing.
# Define a new CustomerQuery model that inherits from UserInput
class CustomerQuery(UserInput):
    department: str = Field(..., description="Department to route the query to")
    category: Literal['billing', 'technical', 'general'] = Field(
        ..., description="Category of the query - billing, technical, or general"
    )
    priority: Literal['low', 'medium', 'high'] = Field(
        ..., description="Priority level of the query"
    )
    tags: Optional[List[str]] = Field(
        ..., max_length=5, description="Up to 5 keyword tags"
    )
Create valid sample customer query data in JSON format to guide the LLM
# Create valid sample customer query data in JSON format to guide the LLM
valid_customer_query_json = '''
{
    "name": "Jan Uzytkownik",
    "email": "jan.uzytkownik@example.com",
    "query": "I forgot my password.",
    "order_number": 12345,
    "purchase_date": "2023-01-01",
    "department": "Support",
    "category": "technical",
    "priority": "high",
    "tags": ["password", "login", "account"]
}'''
Next, I’ll create a prompt that combines the validated user input with the sample customer query JSON and asks the LLM to return JSON matching that structure. At the end I’ll add a request to return only valid JSON. This approach tends to fail…
Create prompt that includes validated user input and expected JSON structure
# Create prompt with user data and expected JSON structure
prompt = f"""Please analyze this user query
{user_input.model_dump_json(indent=2)}:
Return your analysis as a JSON object matching this exact structure
and data types:
{valid_customer_query_json}
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON object."""
print(prompt)
Please analyze this user query
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password. Treat this very urgently please.",
"order_id": null,
"transaction_date": null
}:
Return your analysis as a JSON object matching this exact structure
and data types:
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_number": 12345,
"purchase_date": "2023-01-01",
"department": "Support",
"category": "technical",
"priority": "high",
"tags": ["password", "login", "account"]
}
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON object.
I’ll define a function to call the OpenAI API, taking the prompt defined above as an argument.
Define function to call the OpenAI API with the prompt
# Define a function to call the LLM
def call_llm(prompt, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
Call the LLM with the prompt and print the response
# Get response from LLM
response_content = call_llm(prompt)
print(response_content)
This validation will fail, because the response we get has some additional “fluff” and is not proper JSON - the LLM wrapped its output in a markdown code fence (```json … ```). The validation attempt therefore results in an error. 😞
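As a side note, before re-asking the LLM it’s also possible to simply strip the fence locally. A minimal sketch (assuming the fence is the only problem with the response):

# Strip a markdown ```json ... ``` fence before validating (local fallback)
def strip_json_fence(text: str) -> str:
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening ```json line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return text.strip()

This would fix the specific failure seen here, but the retry-with-feedback approach below is more general.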
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[24], line 2
      1 # Attempt to parse the response into CustomerQuery model
----> 2 valid_data = CustomerQuery.model_validate_json(response_content)

File ~/Desktop/Developer/using_pydantic_with_llms/.venv/lib/python3.13/site-packages/pydantic/main.py:746, in BaseModel.model_validate_json(cls, json_data, strict, context, by_alias, by_name)
    740 if by_alias is False and by_name is not True:
    741     raise PydanticUserError(
    742         'At least one of `by_alias` or `by_name` must be set to True.',
    743         code='validate-by-alias-and-name-false',
    744     )
--> 746 return cls.__pydantic_validator__.validate_json(
    747     json_data, strict=strict, context=context, by_alias=by_alias, by_name=by_name
    748 )

ValidationError: 1 validation error for CustomerQuery
Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='```json\n{\n "name": ...in", "account"]\n}\n```', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/json_invalid
Function with a ‘retry’ prompt - passing an error back to the LLM
I’ll create a function that takes the validation error message as a parameter (in addition to the initial prompt and the response) and asks the LLM to correct the error. First, though, I need a helper that validates a response and returns details about any error so they can be passed into the prompt.
# Define a function to validate an LLM response
def validate_with_model(data_model, llm_response):
    try:
        validated_data = data_model.model_validate_json(llm_response)
        print("data validation successful!")
        print(validated_data.model_dump_json(indent=2))
        return validated_data, None
    except ValidationError as e:
        print("❌ Validation errors found:")
        print(e.json(indent=2))
        error_message = e.json(indent=2)
        return None, error_message
This code should print nicely formatted JSON error information that can be used later:
# Test the validation function with the LLM response
validated_data, validation_error = validate_with_model(
    CustomerQuery, response_content
)
❌ Validation errors found:
[
{
"type": "json_invalid",
"loc": [],
"msg": "Invalid JSON: expected value at line 1 column 1",
"input": "```json\n{\n \"name\": \"Jan Uzytkownik\",\n \"email\": \"jan.uzytkownik@example.com\",\n \"query\": \"I forgot my password.\",\n \"order_number\": null,\n \"purchase_date\": null,\n \"department\": \"Support\",\n \"category\": \"technical\",\n \"priority\": \"high\",\n \"tags\": [\"password\", \"login\", \"account\"]\n}\n```",
"ctx": {
"error": "expected value at line 1 column 1"
},
"url": "https://errors.pydantic.dev/2.11/v/json_invalid"
}
]
Great! As expected, we get an explanation of why the response data is not valid. The function also returns the error_message so it can be passed into a new prompt.
Now, I’ll create a function to build that new prompt. The next cell defines create_retry_prompt, which takes the original prompt, the LLM’s response, and the error message, and constructs a new prompt that asks the LLM to fix the error by comparing the invalid response with the error details. The function instructs the LLM to reply only with valid JSON, without any extra explanation or formatting.
Define a function to create a retry prompt with error feedback
# Define a function to create a retry prompt with error feedback
def create_retry_prompt(
    original_prompt, original_response, error_message
):
    retry_prompt = f"""This is a request to fix an error in the structure of an llm_response.
Here is the original request:
<original_prompt>
{original_prompt}
</original_prompt>
Here is the original llm_response:
<llm_response>
{original_response}
</llm_response>
This response generated an error:
<error_message>
{error_message}
</error_message>
Compare the error message and the llm_response and identify what
needs to be fixed or removed
in the llm_response to resolve this error.
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON string."""
    return retry_prompt
Test the function:

# Create a retry prompt for validation errors
validation_retry_prompt = create_retry_prompt(
    original_prompt=prompt,
    original_response=response_content,
    error_message=validation_error
)
print(validation_retry_prompt)
This is a request to fix an error in the structure of an llm_response.
Here is the original request:
<original_prompt>
Please analyze this user query
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password. Treat this very urgently please.",
"order_id": null,
"transaction_date": null
}:
Return your analysis as a JSON object matching this exact structure
and data types:
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_number": 12345,
"purchase_date": "2023-01-01",
"department": "Support",
"category": "technical",
"priority": "high",
"tags": ["password", "login", "account"]
}
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON object.
</original_prompt>
Here is the original llm_response:
<llm_response>
```json
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_number": null,
"purchase_date": null,
"department": "Support",
"category": "technical",
"priority": "high",
"tags": ["password", "login", "account"]
}
```
</llm_response>
This response generated an error:
<error_message>
[
{
"type": "json_invalid",
"loc": [],
"msg": "Invalid JSON: expected value at line 1 column 1",
"input": "```json\n{\n \"name\": \"Jan Uzytkownik\",\n \"email\": \"jan.uzytkownik@example.com\",\n \"query\": \"I forgot my password.\",\n \"order_number\": null,\n \"purchase_date\": null,\n \"department\": \"Support\",\n \"category\": \"technical\",\n \"priority\": \"high\",\n \"tags\": [\"password\", \"login\", \"account\"]\n}\n```",
"ctx": {
"error": "expected value at line 1 column 1"
},
"url": "https://errors.pydantic.dev/2.11/v/json_invalid"
}
]
</error_message>
Compare the error message and the llm_response and identify what
needs to be fixed or removed
in the llm_response to resolve this error.
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON string.
Call the LLM with the validation retry prompt and print the response
# Call the LLM with the validation retry prompt
validation_retry_response = call_llm(validation_retry_prompt)
print(validation_retry_response)
What if there was still an issue? It would be good to have a function that keeps sending new retry prompts until the problem is resolved - but one that doesn’t end up in an infinite loop, because that could get very co$$$$tly!
Create a second retry prompt for validation errors
# Create a second retry prompt for validation errors
second_validation_retry_prompt = create_retry_prompt(
    original_prompt=validation_retry_prompt,
    original_response=validation_retry_response,
    error_message=validation_error
)
print(second_validation_retry_prompt)
This is a request to fix an error in the structure of an llm_response.
Here is the original request:
<original_prompt>
This is a request to fix an error in the structure of an llm_response.
Here is the original request:
<original_prompt>
Please analyze this user query
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password. Treat this very urgently please.",
"order_id": null,
"transaction_date": null
}:
Return your analysis as a JSON object matching this exact structure
and data types:
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_number": 12345,
"purchase_date": "2023-01-01",
"department": "Support",
"category": "technical",
"priority": "high",
"tags": ["password", "login", "account"]
}
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON object.
</original_prompt>
Here is the original llm_response:
<llm_response>
```json
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_number": null,
"purchase_date": null,
"department": "Support",
"category": "technical",
"priority": "high",
"tags": ["password", "login", "account"]
}
```
</llm_response>
This response generated an error:
<error_message>
[
{
"type": "json_invalid",
"loc": [],
"msg": "Invalid JSON: expected value at line 1 column 1",
"input": "```json\n{\n \"name\": \"Jan Uzytkownik\",\n \"email\": \"jan.uzytkownik@example.com\",\n \"query\": \"I forgot my password.\",\n \"order_number\": null,\n \"purchase_date\": null,\n \"department\": \"Support\",\n \"category\": \"technical\",\n \"priority\": \"high\",\n \"tags\": [\"password\", \"login\", \"account\"]\n}\n```",
"ctx": {
"error": "expected value at line 1 column 1"
},
"url": "https://errors.pydantic.dev/2.11/v/json_invalid"
}
]
</error_message>
Compare the error message and the llm_response and identify what
needs to be fixed or removed
in the llm_response to resolve this error.
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON string.
</original_prompt>
Here is the original llm_response:
<llm_response>
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_number": 12345,
"purchase_date": "2023-01-01",
"department": "Support",
"category": "technical",
"priority": "high",
"tags": ["password", "login", "account"]
}
</llm_response>
This response generated an error:
<error_message>
None
</error_message>
Compare the error message and the llm_response and identify what
needs to be fixed or removed
in the llm_response to resolve this error.
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON string.
Call the LLM with the second validation retry prompt and print the response
# Call the LLM with the second validation retry prompt
second_validation_retry_response = call_llm(
    second_validation_retry_prompt
)
print(second_validation_retry_response)
The function below makes an initial call and then retries up to 5 times to obtain a response in a valid format. Because LLMs are probabilistic, they don’t do the same thing every time, so we can’t know in advance how many attempts will be required.
Code for validation function with retries
def validate_llm_response(
    prompt, data_model, n_retry=5, model="gpt-4o"
):
    # Initial LLM call
    response_content = call_llm(prompt, model=model)
    current_prompt = prompt

    # Try to validate with the model
    # attempt: 0=initial, 1=first retry, ...
    for attempt in range(n_retry + 1):
        validated_data, validation_error = validate_with_model(
            data_model, response_content
        )
        if validation_error:
            if attempt < n_retry:
                print(f"retry {attempt} of {n_retry} failed, trying again...")
            else:
                print(f"Max retries reached. Last error: {validation_error}")
                return None, (
                    f"Max retries reached. Last error: {validation_error}"
                )
            validation_retry_prompt = create_retry_prompt(
                original_prompt=current_prompt,
                original_response=response_content,
                error_message=validation_error
            )
            response_content = call_llm(
                validation_retry_prompt, model=model
            )
            current_prompt = validation_retry_prompt
            continue
        # If you get here, both parsing and validation succeeded
        return validated_data, None
Test the validation function:
# Test the complete solution with the original prompt
validated_data, error = validate_llm_response(
    prompt, CustomerQuery
)
❌ Validation errors found:
[
{
"type": "json_invalid",
"loc": [],
"msg": "Invalid JSON: expected value at line 1 column 1",
"input": "```json\n{\n \"name\": \"Jan Uzytkownik\",\n \"email\": \"jan.uzytkownik@example.com\",\n \"query\": \"I forgot my password. Treat this very urgently please.\",\n \"order_number\": null,\n \"purchase_date\": null,\n \"department\": \"Support\",\n \"category\": \"technical\",\n \"priority\": \"high\",\n \"tags\": [\"password\", \"login\", \"account\"]\n}\n```",
"ctx": {
"error": "expected value at line 1 column 1"
},
"url": "https://errors.pydantic.dev/2.11/v/json_invalid"
}
]
retry 0 of 5 failed, trying again...
data validation successful!
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password.",
"order_id": null,
"transaction_date": null,
"department": "Support",
"category": "technical",
"priority": "high",
"tags": [
"password",
"login",
"account"
]
}
What’s interesting is that sometimes it needs 2, 3, and sometimes 5 attempts. Let’s take a look at the validated data in JSON format.
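A quick way to print it with Pydantic v2:

# Print the validated data as indented JSON
print(validated_data.model_dump_json(indent=2))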
Include a Pydantic model schema as JSON in the LLM prompt
I’ll print out the JSON schema of a Pydantic model and include it in the prompt so that the LLM can see exactly what is expected. It’s difficult to read but an LLM will be able to parse it.
This is what the JSON schema of a Pydantic model looks like:
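Generating it with Pydantic’s model_json_schema (this cell defines the data_model_schema variable used in the prompt below):

# Generate the JSON schema for the CustomerQuery model
data_model_schema = json.dumps(
    CustomerQuery.model_json_schema(), indent=2
)
print(data_model_schema)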
Another variant of this method - include the Pydantic model schema in the prompt
# Create new prompt with user input and model_json_schema
prompt = f"""Please analyze this user query
{user_input.model_dump_json(indent=2)}:
Return your analysis as a JSON object matching the following schema:
{data_model_schema}
Respond ONLY with valid JSON. Do not include any explanations or
other text or formatting before or after the JSON object."""
This should work better than the first approach…
Test the validation function:
# Run the validate_llm_response function with the new prompt
final_analysis, error = validate_llm_response(
    prompt, CustomerQuery
)
❌ Validation errors found:
[
{
"type": "json_invalid",
"loc": [],
"msg": "Invalid JSON: expected value at line 1 column 1",
"input": "```json\n{\n \"name\": \"Jan Uzytkownik\",\n \"email\": \"jan.uzytkownik@example.com\",\n \"query\": \"I forgot my password. Treat this very urgently please.\",\n \"order_id\": null,\n \"transaction_date\": null,\n \"department\": \"technical support\",\n \"category\": \"technical\",\n \"priority\": \"high\",\n \"tags\": [\"password\", \"urgent\", \"technical issue\"]\n}\n```",
"ctx": {
"error": "expected value at line 1 column 1"
},
"url": "https://errors.pydantic.dev/2.11/v/json_invalid"
}
]
retry 0 of 5 failed, trying again...
data validation successful!
{
"name": "Jan Uzytkownik",
"email": "jan.uzytkownik@example.com",
"query": "I forgot my password. Treat this very urgently please.",
"order_id": null,
"transaction_date": null,
"department": "technical support",
"category": "technical",
"priority": "high",
"tags": [
"password",
"urgent",
"technical issue"
]
}
Conclusion
Using Pydantic to validate and structure LLM responses offers significant advantages over simply requesting JSON output in prompts. While LLMs can often generate well-formed JSON, inconsistencies and formatting errors are common, especially in complex workflows. By leveraging Pydantic models, you can enforce strict data validation, catch errors early, and automate retries with targeted feedback—making your pipeline more robust and reliable.
Including model schemas in prompts further guides the LLM toward producing valid outputs. However, there is apparently an even better way: including the Pydantic model directly in the API call. That’s what I’ll investigate next.
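As a teaser, this is roughly what it looks like with the OpenAI SDK’s structured-outputs helper - a sketch based on my reading of the SDK docs, not something I’ve tested here:

# Sketch: pass the Pydantic model directly and let the SDK enforce it
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format=CustomerQuery,
)
customer_query = completion.choices[0].message.parsed  # a CustomerQuery instance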