Pydantic Basics - Part 1

I explore using the Pydantic package to validate sample user input data

Python
Pydantic
Data Validation
Published

March 1, 2025

Using the Pydantic package to validate data in applications relying on LLM APIs

I’ve been experimenting with validating LLM input and output data with the Pydantic Python package. Several applications I’m working on require data validation at different stages: when accepting user input, when receiving responses from LLM APIs, and when using that data to call other functions.

The code I used for these experiments is available here.

User input data can be messy and needs to be validated before being passed to other parts of the application. When using LLM APIs, I’ve found that simply requesting a certain output format (like JSON) in the LLM prompt is not 100% reliable and the LLM response often contains some additional ‘fluff’ or misses key fields.

There are several ways in which the Pydantic package can be used to help with this. It can be used to validate data coming from the user or an LLM response.

When using Instructor, another useful Python library that helps bring structure to LLM outputs, a Pydantic model’s schema (in JSON format) can also be included directly in the LLM API call.

I’ll experiment with these methods. I’ll use the OpenAI, Anthropic and Gemini LLMs in the process.

My first objective is to create a Pydantic model to validate user input data in a fictional customer support system. In this scenario, users submit queries and I rely on the LLM to classify the request and to assign urgency and keyword tags. I’ll use venv to create a virtual environment, .venv, into which I’ll install the ipykernel and Pydantic packages (this is a best practice that prevents conflicts with packages installed in the global environment). I’ll also need the email validator, which is installed with pip install pydantic[email].

First, I’ll load the required libraries.

# Import the pydantic and json libraries
from pydantic import BaseModel, ValidationError, EmailStr, Field
from typing import Optional
from datetime import date
import json

Define a Pydantic model for user input and populate it with some sample data

In the next cell, I’ll define a Pydantic model to validate user input data for a customer support scenario. This model will ensure that the data provided by users—such as customer number, email address, and issue description—meets the expected format and requirements before it is processed further. By using Pydantic, I can catch common data issues early and provide clear feedback on any validation errors.

I’ll also create some sample data in a Python dictionary format.

# Create a Pydantic model to validate user input data consisting of a customer_number (int), email (EmailStr), and issue_description (str).
# The new UserInput class inherits from BaseModel, which provides the core functionality for data validation and parsing.
class UserInput(BaseModel):
    customer_number: int
    email: EmailStr
    issue_description: str
    
# Sample data to populate the model
sample_data = {
    "customer_number": 12345,
    "email": "  user@example.com  ",
    "issue_description": "I am unable to access my account."
}

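I can then use Pydantic to validate the sample data. The ** syntax unpacks the Python dictionary into keyword arguments. If validation succeeds, the result is an instance of the UserInput class. Note that the whitespace around the email address in the sample data is stripped during validation.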

# Validate sample_data

user_input = UserInput(**sample_data)

print(user_input)
print(type(user_input))
customer_number=12345 email='user@example.com' issue_description='I am unable to access my account.'
<class '__main__.UserInput'>
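
As a small aside, the validated instance behaves like a regular Python object, and its fields carry the declared types. The sketch below assumes Pydantic v2 and uses a simplified stand-in model (SimpleUserInput is a hypothetical name, and email is typed as a plain str so the snippet runs without the email extra). It shows that a numeric string is converted to an int in Pydantic's default mode, and that model_dump() turns the instance back into a dictionary.

```python
from pydantic import BaseModel


# Simplified stand-in for UserInput (hypothetical name); email is a plain str here
class SimpleUserInput(BaseModel):
    customer_number: int
    email: str
    issue_description: str


# The customer number arrives as a numeric string, as it might from a web form
raw = {
    "customer_number": "12345",
    "email": "user@example.com",
    "issue_description": "I am unable to access my account.",
}

user = SimpleUserInput(**raw)

# In Pydantic's default (lax) mode the numeric string is converted to an int
print(user.customer_number, type(user.customer_number))

# model_dump() converts the validated instance back into a plain dictionary
as_dict = user.model_dump()
print(as_dict)
```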

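Next, I’ll introduce errors on purpose: a string instead of a number for the customer number, a malformed email address, and an integer instead of a string for the issue description. Errors can be introduced in several places at once to test the model.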

# Populate model with invalid data to see validation errors
invalid_data = {
    "customer_number": "not_a_number",
    "email": "invalid_email",
    "issue_description": 6
}

# Run this cell to see the validation errors returned as JSON
try:
    # Use the UserInput model defined above to validate the data
    user_input = UserInput(**invalid_data)
except ValidationError as e:
    print(e.json()) 
[{"type":"int_parsing","loc":["customer_number"],"msg":"Input should be a valid integer, unable to parse string as an integer","input":"not_a_number","url":"https://errors.pydantic.dev/2.11/v/int_parsing"},{"type":"value_error","loc":["email"],"msg":"value is not a valid email address: An email address must have an @-sign.","input":"invalid_email","ctx":{"reason":"An email address must have an @-sign."},"url":"https://errors.pydantic.dev/2.11/v/value_error"},{"type":"string_type","loc":["issue_description"],"msg":"Input should be a valid string","input":6,"url":"https://errors.pydantic.dev/2.11/v/string_type"}]

In the next cell, I’ll create a reusable function that validates user input data against the Pydantic model. This function will help streamline the validation process by handling both successful validations and errors, making it easier to test different scenarios and demonstrate how Pydantic provides clear feedback when data doesn’t meet the expected requirements.

# Create a function that takes a dictionary as input, validates it against the UserInput model,
# and returns either the validated model instance or the validation errors as a JSON string.
def validate_user_input(data: dict):
    try:
        user_input = UserInput(**data)
        print("✅ Valid user input created:")
        # Print the JSON representation of the validated data, indented for readability
        print(user_input.model_dump_json(indent=2))
        return user_input
    except ValidationError as e:
        print("❌ Validation errors found:")
        print(e.json(indent=2))
        return e.json()

This is the data that is returned if there is an error. The function returns it as a JSON string; once that string is parsed (for example with json.loads), every element of each error can be accessed separately.

#Try to validate invalid data
validate_user_input(invalid_data)
❌ Validation errors found:
[
  {
    "type": "int_parsing",
    "loc": [
      "customer_number"
    ],
    "msg": "Input should be a valid integer, unable to parse string as an integer",
    "input": "not_a_number",
    "url": "https://errors.pydantic.dev/2.11/v/int_parsing"
  },
  {
    "type": "value_error",
    "loc": [
      "email"
    ],
    "msg": "value is not a valid email address: An email address must have an @-sign.",
    "input": "invalid_email",
    "ctx": {
      "reason": "An email address must have an @-sign."
    },
    "url": "https://errors.pydantic.dev/2.11/v/value_error"
  },
  {
    "type": "string_type",
    "loc": [
      "issue_description"
    ],
    "msg": "Input should be a valid string",
    "input": 6,
    "url": "https://errors.pydantic.dev/2.11/v/string_type"
  }
]
'[{"type":"int_parsing","loc":["customer_number"],"msg":"Input should be a valid integer, unable to parse string as an integer","input":"not_a_number","url":"https://errors.pydantic.dev/2.11/v/int_parsing"},{"type":"value_error","loc":["email"],"msg":"value is not a valid email address: An email address must have an @-sign.","input":"invalid_email","ctx":{"reason":"An email address must have an @-sign."},"url":"https://errors.pydantic.dev/2.11/v/value_error"},{"type":"string_type","loc":["issue_description"],"msg":"Input should be a valid string","input":6,"url":"https://errors.pydantic.dev/2.11/v/string_type"}]'
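
To illustrate how the individual elements of an error report can be reached without parsing the JSON string by hand, here is a minimal sketch that relies on the errors() method of ValidationError, which returns the same information as a list of dictionaries. The model SimpleUserInput and the helper collect_errors are hypothetical names used only for this example, and the model is trimmed down so the snippet is self-contained.

```python
from pydantic import BaseModel, ValidationError


# Trimmed-down stand-in model (hypothetical name) to keep the example self-contained
class SimpleUserInput(BaseModel):
    customer_number: int
    issue_description: str


def collect_errors(data: dict) -> dict:
    """Return a {field_name: message} mapping; empty if the data is valid."""
    try:
        SimpleUserInput(**data)
        return {}
    except ValidationError as e:
        # e.errors() yields one dict per problem, with keys such as "type", "loc" and "msg"
        return {str(err["loc"][0]): err["msg"] for err in e.errors()}


field_errors = collect_errors({"customer_number": "not_a_number", "issue_description": 6})
for field, message in field_errors.items():
    print(f"{field}: {message}")
```

A mapping like this could be shown to the user next to the corresponding form fields, which is one reason to prefer the structured form over the raw JSON string.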

In the following cell, I’ll check how Pydantic handles cases where required fields are missing from the user input data. This is a common scenario in real-world applications, where users may forget to provide all necessary information.

#Create invalid data with a missing field
incomplete_data = {
    "customer_number": 67890,
    "issue_description": "My order hasn't arrived yet."
}

And again we get some useful data about the error.

# Try validating the incomplete data
validate_user_input(incomplete_data)
❌ Validation errors found:
[
  {
    "type": "missing",
    "loc": [
      "email"
    ],
    "msg": "Field required",
    "input": {
      "customer_number": 67890,
      "issue_description": "My order hasn't arrived yet."
    },
    "url": "https://errors.pydantic.dev/2.11/v/missing"
  }
]
'[{"type":"missing","loc":["email"],"msg":"Field required","input":{"customer_number":67890,"issue_description":"My order hasn\'t arrived yet."},"url":"https://errors.pydantic.dev/2.11/v/missing"}]'
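
Because missing fields are reported with the error type "missing", they can be separated from other problems, for instance to ask the user only for what is still needed. The following is a rough sketch under that assumption; missing_fields and SimpleUserInput are hypothetical names, and email is a plain str to avoid the extra dependency.

```python
from pydantic import BaseModel, ValidationError


# Simplified stand-in model (hypothetical name); email is a plain str here
class SimpleUserInput(BaseModel):
    customer_number: int
    email: str
    issue_description: str


def missing_fields(data: dict) -> list[str]:
    """Return the names of required fields that are absent from the input."""
    try:
        SimpleUserInput(**data)
        return []
    except ValidationError as e:
        # Keep only the errors whose type is "missing"
        return [str(err["loc"][0]) for err in e.errors() if err["type"] == "missing"]


still_needed = missing_fields(
    {"customer_number": 67890, "issue_description": "My order hasn't arrived yet."}
)
print(still_needed)
```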

In the next cell, I’ll explore how Pydantic handles user input data that contains unexpected or extra fields not defined in the model. This is a common scenario when working with real-world data, where users or external systems might send additional information. I’ll demonstrate how Pydantic validates the known fields and, by default, silently ignores any extra fields, so that only the expected data is processed by the application.

# Create sample data with additional unexpected fields
extra_data = {
    "customer_number": 54321,
    "email": "extra@example.com",
    "issue_description": "I have an extra field.",
    "unexpected_field": "This field is not defined in the model."
}
# Validate the extra data (validated data will not include the extra field)
validate_user_input(extra_data)
✅ Valid user input created:
{
  "customer_number": 54321,
  "email": "extra@example.com",
  "issue_description": "I have an extra field."
}
UserInput(customer_number=54321, email='extra@example.com', issue_description='I have an extra field.')
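
Ignoring extra fields is the default behaviour, but it is configurable. If unexpected keys should be treated as errors instead, a model can set extra="forbid" in its configuration. The sketch below assumes Pydantic v2; StrictUserInput is a hypothetical, trimmed-down variant of the model used only to show the setting.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictUserInput(BaseModel):
    # extra="forbid" turns unknown keys into validation errors (the default is "ignore")
    model_config = ConfigDict(extra="forbid")

    customer_number: int
    issue_description: str


try:
    StrictUserInput(
        customer_number=54321,
        issue_description="I have an extra field.",
        unexpected_field="This field is not defined in the model.",
    )
    error_types = []
except ValidationError as e:
    # Collect the error type reported for each problem
    error_types = [err["type"] for err in e.errors()]

print(error_types)
```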

As the next step, I’ll enhance the user input validation model to include additional fields relevant to customer support scenarios, such as an optional order ID and transaction date. By specifying constraints and descriptions for these fields, I can ensure that the data collected is both accurate and meaningful. This approach demonstrates how Pydantic models can be easily extended to accommodate evolving requirements while maintaining robust data validation. Pydantic also allows for the use of custom functions (field validators) which I’ll experiment with later.

# Enhanced input model 
class UserInput(BaseModel):
    customer_number: int
    email: EmailStr
    issue_description: str
    order_id: Optional[int] = Field(
        #Default is None
        None,
        description="5-digit order number (cannot start with 0)",
        # Greater than or equal to 10000 and less than or equal to 99999
        ge=10000,
        le=99999
    )
    transaction_date: Optional[date] = None
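
The description and bounds passed to Field are not only used during validation. They also end up in the model's JSON schema, which is the form in which a model can be handed to an LLM API call. As a rough illustration, assuming Pydantic v2 and a hypothetical trimmed-down model named OrderInfo, the schema can be inspected like this:

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


# Hypothetical, trimmed-down model that keeps only the constrained field
class OrderInfo(BaseModel):
    customer_number: int
    order_id: Optional[int] = Field(
        None,
        description="5-digit order number (cannot start with 0)",
        ge=10000,
        le=99999,
    )


# model_json_schema() returns the JSON schema as a Python dictionary
schema = OrderInfo.model_json_schema()
print(json.dumps(schema, indent=2))
```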

I can then test the enhanced model with valid data.

# Test the enhanced model with valid data
valid_enhanced_data = {
    "customer_number": 11223,
    "email": "  valid@example.com  ",
    "issue_description": "My order hasn't arrived yet.",
    "order_id": 12345,
    "transaction_date": "2023-10-01"
}

validate_user_input(valid_enhanced_data)
✅ Valid user input created:
{
  "customer_number": 11223,
  "email": "valid@example.com",
  "issue_description": "My order hasn't arrived yet.",
  "order_id": 12345,
  "transaction_date": "2023-10-01"
}
UserInput(customer_number=11223, email='valid@example.com', issue_description="My order hasn't arrived yet.", order_id=12345, transaction_date=datetime.date(2023, 10, 1))
The optional fields can also be left out, in which case they default to None:
valid_enhanced_data_short = {
    "customer_number": 11223,
    "email": "  valid@example.com  ",
    "issue_description": "My order hasn't arrived yet."
}

validate_user_input(valid_enhanced_data_short)
✅ Valid user input created:
{
  "customer_number": 11223,
  "email": "valid@example.com",
  "issue_description": "My order hasn't arrived yet.",
  "order_id": null,
  "transaction_date": null
}
UserInput(customer_number=11223, email='valid@example.com', issue_description="My order hasn't arrived yet.", order_id=None, transaction_date=None)
#Inspect validated data structure
validated_data = validate_user_input(valid_enhanced_data)
print(validated_data)
✅ Valid user input created:
{
  "customer_number": 11223,
  "email": "valid@example.com",
  "issue_description": "My order hasn't arrived yet.",
  "order_id": 12345,
  "transaction_date": "2023-10-01"
}
customer_number=11223 email='valid@example.com' issue_description="My order hasn't arrived yet." order_id=12345 transaction_date=datetime.date(2023, 10, 1)
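
The tests above only cover values that satisfy the constraints. To see the ge and le bounds in action, the sketch below feeds order IDs that are too small and too large to a hypothetical trimmed-down model (OrderInfo) and collects the reported error types. The helper order_id_error_types is likewise an illustrative name, not part of the original code.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


# Hypothetical, trimmed-down model with the same bounds as the enhanced UserInput
class OrderInfo(BaseModel):
    customer_number: int
    order_id: Optional[int] = Field(None, ge=10000, le=99999)


def order_id_error_types(order_id) -> list[str]:
    """Return the error types raised for a given order_id (empty if valid)."""
    try:
        OrderInfo(customer_number=11223, order_id=order_id)
        return []
    except ValidationError as e:
        return [err["type"] for err in e.errors()]


too_small = order_id_error_types(123)      # below the lower bound
too_large = order_id_error_types(1234567)  # above the upper bound
in_range = order_id_error_types(12345)     # within the bounds
print(too_small, too_large, in_range)
```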

Validating JSON data

In the following cell, I’ll demonstrate how to handle user input data provided in JSON format. This is a common scenario when integrating with web APIs or front-end applications, where data is often exchanged as JSON strings. I’ll show how to parse the JSON string into a Python dictionary and then validate it using the Pydantic model, ensuring that the data meets all required constraints before further processing.

# What if the valid data is in JSON format?
valid_json_data = '''{
    "customer_number": 33445,
    "email": "  valid@example.com  ",
    "issue_description": "My order hasn't arrived yet.",
    "order_id": 12345,
    "transaction_date": "2023-10-01"
}'''
# Parse the JSON string into a Python dictionary that can be used as an argument in the validation function
data_dict = json.loads(valid_json_data)
print("Parsed JSON:", data_dict)
# Validate the parsed data
validate_user_input(data_dict)
Parsed JSON: {'customer_number': 33445, 'email': '  valid@example.com  ', 'issue_description': "My order hasn't arrived yet.", 'order_id': 12345, 'transaction_date': '2023-10-01'}
✅ Valid user input created:
{
  "customer_number": 33445,
  "email": "valid@example.com",
  "issue_description": "My order hasn't arrived yet.",
  "order_id": 12345,
  "transaction_date": "2023-10-01"
}
UserInput(customer_number=33445, email='valid@example.com', issue_description="My order hasn't arrived yet.", order_id=12345, transaction_date=datetime.date(2023, 10, 1))

In the next step, I’ll also try JSON data that parses correctly but contains invalid field values.

# Try JSON data with invalid field values
invalid_json_data = '''{
    "customer_number": "not_a_number",
    "email": "invalid_email",
    "issue_description": 6
}'''
# Parse the JSON string into a Python dictionary (the JSON itself is well-formed)
data_dict = json.loads(invalid_json_data)
print("Parsed JSON:", data_dict)
# Validate the parsed data
validate_user_input(data_dict)
Parsed JSON: {'customer_number': 'not_a_number', 'email': 'invalid_email', 'issue_description': 6}
❌ Validation errors found:
[
  {
    "type": "int_parsing",
    "loc": [
      "customer_number"
    ],
    "msg": "Input should be a valid integer, unable to parse string as an integer",
    "input": "not_a_number",
    "url": "https://errors.pydantic.dev/2.11/v/int_parsing"
  },
  {
    "type": "value_error",
    "loc": [
      "email"
    ],
    "msg": "value is not a valid email address: An email address must have an @-sign.",
    "input": "invalid_email",
    "ctx": {
      "reason": "An email address must have an @-sign."
    },
    "url": "https://errors.pydantic.dev/2.11/v/value_error"
  },
  {
    "type": "string_type",
    "loc": [
      "issue_description"
    ],
    "msg": "Input should be a valid string",
    "input": 6,
    "url": "https://errors.pydantic.dev/2.11/v/string_type"
  }
]
'[{"type":"int_parsing","loc":["customer_number"],"msg":"Input should be a valid integer, unable to parse string as an integer","input":"not_a_number","url":"https://errors.pydantic.dev/2.11/v/int_parsing"},{"type":"value_error","loc":["email"],"msg":"value is not a valid email address: An email address must have an @-sign.","input":"invalid_email","ctx":{"reason":"An email address must have an @-sign."},"url":"https://errors.pydantic.dev/2.11/v/value_error"},{"type":"string_type","loc":["issue_description"],"msg":"Input should be a valid string","input":6,"url":"https://errors.pydantic.dev/2.11/v/string_type"}]'

The parsing and validation steps can be rolled into a single function by using model_validate_json, which validates a JSON string directly:

# Use the model_validate_json method to validate JSON data directly
def validate_user_input_json(json_data: str):
    try:
        # model_validate_json is a Pydantic v2 method that parses and validates a JSON string in one step
        user_input = UserInput.model_validate_json(json_data)
        print("✅ Valid user input created:")
        print(user_input.model_dump_json(indent=2))
        return user_input
    except ValidationError as e:
        print("❌ Validation errors found:")
        print(e.json(indent=2))
        return e.json()
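
One practical difference from the json.loads approach is what happens when the string is not well-formed JSON at all. json.loads raises json.JSONDecodeError, whereas model_validate_json reports the problem as a ValidationError, so a single except clause covers both malformed JSON and invalid field values. A minimal sketch, assuming Pydantic v2 and a hypothetical stand-in model:

```python
from pydantic import BaseModel, ValidationError


# Trimmed-down stand-in model (hypothetical name)
class SimpleUserInput(BaseModel):
    customer_number: int
    issue_description: str


# The closing brace is missing, so this string is not valid JSON
malformed_json = '{"customer_number": 33445, "issue_description": "My order has not arrived yet."'

try:
    SimpleUserInput.model_validate_json(malformed_json)
    error_types = []
except ValidationError as e:
    # Malformed JSON surfaces as a regular validation error
    error_types = [err["type"] for err in e.errors()]

print(error_types)
```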

Validating user input with Pydantic - some thoughts

The Pydantic package offers a powerful and flexible approach to data validation in Python. By defining clear models, we can ensure that incoming data meets expected formats, catch errors early, and provide meaningful feedback to users or downstream processes.

Pydantic’s ability to handle various scenarios, such as missing fields, invalid types, extra data, and direct JSON validation, makes it a very useful tool. Because every validation error is reported with its type, location, and message, error handling also becomes simpler.

Next, I’ll try using Pydantic with LLM APIs to get structured responses.