Pydantic AI: how to review AI-generated code when typed agent results validate structure but not correctness

2026-05-04 · 5 min read · ZenCode

Pydantic AI is the agent framework from the team behind Pydantic, the most-downloaded Python data validation library. Where frameworks like LangGraph and CrewAI focus on orchestration graphs and multi-agent coordination, Pydantic AI’s design center is type safety: agents are parameterized with a result_type that the model must satisfy, the framework retries automatically until the model’s output validates against that schema, and tools receive a typed RunContext[T] dependency container that works with mypy and Pyright. The framework’s tight Pydantic integration means AI-generated code passes static type checks in ways that other agent frameworks often do not — and that type safety creates a specific set of review gaps that are easy to miss precisely because the code looks so well-validated.

Because Pydantic AI’s boilerplate is structured and repetitive — defining a result_type Pydantic model, instantiating Agent[T] with a model string, registering tools with @agent.tool decorators, constructing a typed deps object, and calling agent.run() or agent.run_stream() — developers frequently use AI coding tools to generate it. The assistant writes the agent and its tools, and the developer reviews the output. That output typically validates, passes the type checker, and runs correctly on test inputs. Three specific review gaps appear consistently in AI-generated Pydantic AI code — gaps that look like correctness because the schema validates and the type checker is satisfied, but carry real failure risks that only surface under production conditions.
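
A minimal sketch of that boilerplate, following the result_type API described above (the model string, the Deps shape, and the tool body are illustrative stand-ins, not a prescribed implementation):

```python
from dataclasses import dataclass

import httpx
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext


class SupportReply(BaseModel):
    answer: str
    escalate: bool


@dataclass
class Deps:
    http_client: httpx.AsyncClient


agent = Agent(
    "openai:gpt-4o",           # illustrative model string
    deps_type=Deps,
    result_type=SupportReply,  # the model's output must validate against this schema
)


@agent.tool
async def fetch_account(ctx: RunContext[Deps], account_id: str) -> str:
    """Fetch account details the model can consult while answering."""
    resp = await ctx.deps.http_client.get(f"https://api.example.com/accounts/{account_id}")
    return resp.text


async def main() -> None:
    async with httpx.AsyncClient() as client:
        result = await agent.run("Why was I billed twice?", deps=Deps(http_client=client))
        reply: SupportReply = result.data  # typed, validated result
```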

The three Pydantic AI code review traps

1. result_type validation masking semantic incorrectness

Pydantic AI’s result_type parameter instructs the agent to produce output that validates against a given Pydantic model. When the model’s response does not validate, the framework retries with a validation error message appended to the conversation. When the response does validate, agent.run() returns a RunResult[T] with a .data attribute containing the typed, validated result. This is where the first review gap appears.

Pydantic validation confirms structure: the required fields are present, the types match, and any explicit validators pass. It does not confirm that the values are semantically correct for the task the agent was asked to perform. An agent with result_type=InvoiceExtraction will return a valid InvoiceExtraction object whether or not the extracted total, vendor, line_items, and due_date fields correspond to the actual invoice the agent was processing. Reviewers who see a result_type declaration and a clean mypy output often assume the agent’s result is correct for the input — the type system has done the work. The type system confirmed shape; it said nothing about value.
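
To make that concrete, here is a hypothetical InvoiceExtraction model of the kind an assistant typically generates. Every value below passes validation, and none of it needs to match the invoice the agent actually read:

```python
from pydantic import BaseModel


class InvoiceExtraction(BaseModel):
    vendor: str
    total: float
    line_items: list[str]
    due_date: str


# Structurally valid, semantically wrong in every field.
extraction = InvoiceExtraction(
    vendor="",                 # empty vendor passes a bare str field
    total=-1.0,                # negative total passes a bare float field
    line_items=[],             # empty list passes a bare list field
    due_date="sometime soon",  # any string passes a str-typed date
)
```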

AI-generated result_type models compound the problem by omitting constraints that downstream code relies on. A field declared as total: float allows -1.0, 0.0, and float('inf'). The application code that receives the result may assume total >= 0 without checking. AI-generated Pydantic models rarely include @field_validator constraints or Field(ge=0) annotations for numeric bounds, non-empty requirements for string fields, or format checks for identifiers, dates, and codes — because the model that generated the schema was completing the structural pattern, not reasoning about what values the downstream code actually requires. The schema validates as a Pydantic model; the values it permits include the cases that break the application.
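
A hedged sketch of the constraints a reviewer should expect to find, using Pydantic v2's Field and field_validator; the specific bounds are assumptions about what the downstream code requires, not universal defaults:

```python
import math
from datetime import date

from pydantic import BaseModel, Field, field_validator


class InvoiceExtraction(BaseModel):
    vendor: str = Field(min_length=1)            # non-empty requirement
    total: float = Field(ge=0)                   # rejects -1.0 but not float('inf')
    line_items: list[str] = Field(min_length=1)  # at least one line item
    due_date: date                               # format check via the date type

    @field_validator("total")
    @classmethod
    def total_is_finite(cls, v: float) -> float:
        # ge=0 is a comparison, so infinity still passes it; reject explicitly.
        if not math.isfinite(v):
            raise ValueError("total must be finite")
        return v
```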

The review check: for every result_type in AI-generated Pydantic AI code, read the Pydantic model and ask what values pass validation that would break the code that consumes the result. Identify one field that should have a range constraint, non-empty requirement, or format check and verify whether it is present. Do not treat a clean mypy pass as confirmation that the schema enforces what the application requires.

2. RunContext dependency coupling and initialization trust

Pydantic AI’s RunContext[T] is the typed dependency container passed to tool functions. A tool decorated with @agent.tool receives a RunContext[MyDeps] as its first argument, where MyDeps is a dataclass or Pydantic model containing the dependencies — database connections, HTTP clients, configuration values, authentication tokens — that the tool needs to do its work. The tool accesses these as ctx.deps.database, ctx.deps.http_client, and so on. The type annotation on RunContext is enforced by the type checker: if a tool is typed RunContext[MyDeps], passing a RunContext[OtherDeps] is a type error.
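
A sketch of that contract; DatabaseConn is a hypothetical stand-in for whatever connection type the application actually uses:

```python
from dataclasses import dataclass

import httpx
from pydantic_ai import Agent, RunContext


class DatabaseConn:
    """Stand-in for the application's real connection type."""
    async def fetch_customer(self, customer_id: str) -> dict:
        raise NotImplementedError


@dataclass
class MyDeps:
    database: DatabaseConn
    http_client: httpx.AsyncClient


agent = Agent("openai:gpt-4o", deps_type=MyDeps)


@agent.tool
async def lookup_customer(ctx: RunContext[MyDeps], customer_id: str) -> str:
    # mypy/Pyright verify that ctx.deps is a MyDeps instance: shape, not state.
    row = await ctx.deps.database.fetch_customer(customer_id)
    return str(row)
```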

The review gap is not in the type annotation but in the dependency construction. AI-generated code creates the deps object at the call site: result = await agent.run(prompt, deps=MyDeps(database=db, http_client=client)). In test code generated alongside the agent, the deps object is typically constructed with a mock database, a test HTTP client, or — when the developer asks the AI to write a quick test — with None values substituted for fields that are awkward to instantiate. Tools that work with these minimal deps objects then surface AttributeError or silently wrong behavior in production, where the real deps object has a different initialization state.
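
Continuing the sketch above, these are the two constructions a reviewer should put side by side; FakeDatabase and the prompts are hypothetical stand-ins for what AI-generated tests typically produce:

```python
class FakeDatabase(DatabaseConn):
    """Hypothetical in-memory fake used by the AI-generated test."""
    async def fetch_customer(self, customer_id: str) -> dict:
        return {"id": customer_id}


async def run_in_tests() -> None:
    deps = MyDeps(
        database=FakeDatabase(),
        http_client=None,  # type: ignore[arg-type]  # the "awkward" field, stubbed out
    )
    await agent.run("look up customer 42", deps=deps)


async def run_in_production(db: DatabaseConn, client: httpx.AsyncClient, prompt: str) -> None:
    # A different initialization path entirely: a pooled connection and an
    # authenticated client, both created by the application's startup code.
    deps = MyDeps(database=db, http_client=client)
    await agent.run(prompt, deps=deps)
```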

A subtler variant: AI-generated tools assume their deps attributes are fully initialized and ready to use at the moment the tool is called. A database connection pool that is None until the application’s startup event completes, an HTTP client that requires authentication before any request, or a rate-limiter that is shared across concurrent agent runs — these are initialization and concurrency concerns that the type annotation cannot capture. RunContext[MyDeps] confirms that deps is a MyDeps instance; it does not confirm that deps.database is connected, that deps.http_client has valid credentials, or that deps.rate_limiter is safe to call concurrently. Reviewers who see a typed RunContext pass the type checker often treat the dependency concern as solved.
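
One way to surface that assumption in review is to check whether the not-yet-ready state is declared in the deps type at all. A sketch, with Pool as a hypothetical stand-in:

```python
from dataclasses import dataclass
from typing import Optional

from pydantic_ai import Agent, RunContext


class Pool:
    """Stand-in for a connection pool that exists only after startup."""
    async def fetch(self, query: str, *args: object) -> list[dict]:
        raise NotImplementedError


@dataclass
class AppDeps:
    db_pool: Optional[Pool] = None  # None until the startup event completes


agent = Agent("openai:gpt-4o", deps_type=AppDeps)


@agent.tool
async def query_orders(ctx: RunContext[AppDeps], customer_id: str) -> str:
    # RunContext[AppDeps] proves the type, not the state: check readiness explicitly.
    if ctx.deps.db_pool is None:
        raise RuntimeError("db_pool not initialized; agent ran before startup completed")
    rows = await ctx.deps.db_pool.fetch("SELECT * FROM orders WHERE customer = $1", customer_id)
    return str(rows)
```

Declaring the field Optional moves the initialization question into the type checker's view: any tool that touches db_pool without a None check now fails mypy instead of failing at runtime.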

The review check: for every @agent.tool function, read how the deps object is constructed at each call site where the agent runs in production. Verify that every ctx.deps attribute the tool accesses is initialized and ready at the point when the tool can be called during an agent run. Check whether the test-time deps construction matches the production-time construction closely enough that tests would surface initialization failures.

3. Retry cost amplification from non-idempotent tool calls

Pydantic AI retries the model call automatically when its output fails to validate against result_type. Each retry appends the validation error to the conversation and sends the full message history back to the model. The number of retries is controlled by the agent’s retries setting, which defaults to 1. AI-generated code rarely sets retries explicitly, which means the value that reaches production is whatever the installed framework version defaults to at import time — a coupling that changes silently when the library is updated.
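
Pinning the budget in code rather than inheriting the installed version's default is a one-line change; a sketch reusing the SupportReply agent from the first example:

```python
agent = Agent(
    "openai:gpt-4o",
    result_type=SupportReply,
    retries=1,  # explicit: one retry per failure, regardless of the library default
)
```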

The deeper review gap is not the retry count but what happens during a retry when the agent uses tools. When the model’s structured output fails validation and the framework retries, the retry does not replay only the final generation step. If the model decides during the retry that it needs additional tool results to produce a valid output, it will call tools again. AI-generated agents that use tools to fetch data, increment counters, write database records, send notifications, or call external APIs may perform these side effects on both the initial attempt and on each retry. A tool that creates a database record on each invocation will create duplicate records when a validation retry fires. A tool that charges a payment or sends an email will do so on every retry. Reviewers who see a result_type validator and a list of tool decorators do not typically ask whether the tools are safe to call multiple times per agent run.
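
A sketch of the idempotency-key pattern for a side-effecting tool; TicketAPI, the key scheme, and the dedup behavior are assumptions about the application's backend:

```python
import uuid
from dataclasses import dataclass, field

from pydantic_ai import Agent, RunContext


class TicketAPI:
    """Stand-in for a backend that deduplicates on an idempotency key."""
    async def create(self, summary: str, *, idempotency_key: str) -> str:
        raise NotImplementedError


@dataclass
class TicketDeps:
    tickets: TicketAPI
    # Generated once at the call site, so it stays stable across validation retries.
    run_key: str = field(default_factory=lambda: uuid.uuid4().hex)


ticket_agent = Agent("openai:gpt-4o", deps_type=TicketDeps)


@ticket_agent.tool
async def open_ticket(ctx: RunContext[TicketDeps], summary: str) -> str:
    # A retry that re-calls this tool hits the same backend record instead of
    # opening a duplicate. A per-call scheme would hash the arguments as well.
    return await ctx.deps.tickets.create(summary, idempotency_key=ctx.deps.run_key)
```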

Streaming agents introduce a related variant. agent.run_stream() returns an async context manager that yields result deltas as the model generates them, with a final validated result available via await result.get_data() after the stream completes. AI-generated streaming code frequently consumes the stream in a loop but does not call get_data() or handle the case where the stream terminates before a valid result is produced — either because the model ran out of tokens, because the connection was interrupted, or because the application code broke out of the stream loop early. The underlying model call was made, tokens were consumed, and any tool calls that fired during generation had their side effects — but the application never received a validated result and has no signal that the run was incomplete.
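
A sketch of a streaming call site that treats an incomplete stream as a first-class outcome (delta consumption elided for brevity; the point is that get_data() is where the validated result materializes). UnexpectedModelBehavior is the framework's exception for a run that cannot produce a valid output; a dropped connection may surface as a provider-specific exception instead:

```python
import logging

from pydantic_ai.exceptions import UnexpectedModelBehavior

log = logging.getLogger(__name__)


async def stream_reply(prompt: str, deps: Deps) -> SupportReply | None:
    try:
        async with agent.run_stream(prompt, deps=deps) as result:
            # ... consume result deltas here for progressive display ...
            # Reaching the end of the stream loop is not the same as having a
            # validated result; only get_data() confirms one exists.
            return await result.get_data()
    except UnexpectedModelBehavior:
        # The model call was made, tokens were consumed, and tool side effects
        # may already have fired, but no validated result exists. Record it.
        log.warning("agent run ended without a validated result: %r", prompt)
        return None
```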

The review check: for every @agent.tool registered on an agent that uses result_type, ask whether calling the tool twice within a single agent run produces a different application state than calling it once. If yes — if the tool writes, increments, sends, or charges — verify that the agent uses idempotency keys, that retries is explicitly set and the retry budget is acceptable, or that the tool has compensating logic for duplicate calls. For streaming agents, verify that the call site handles stream termination before get_data() returns a valid result.

Reviewing Pydantic AI code without treating type safety as correctness

Pydantic AI’s type-first design is a genuine improvement over untyped agent frameworks: result_type catches structural mismatches at the framework level, RunContext[T] makes dependency contracts explicit, and mypy integration surfaces coupling errors during development rather than at runtime. The review problem is that type safety is a narrower guarantee than it appears. The type checker confirms that code is structurally consistent; it does not confirm that values are semantically correct, that dependencies are initialized, or that tool side effects are safe to repeat.

A practical review approach for AI-generated Pydantic AI code: when you see a result_type model, read its field definitions and ask what valid-but-wrong values pass validation before reading anything else — the schema is the most trusted artifact in the codebase, but it only enforces what it explicitly declares. When you see a RunContext[T] tool, trace the deps construction to its production call site and verify initialization state. When you see a result_type agent that also uses @agent.tool, check whether the tools are idempotent before accepting the retry configuration as a background detail. The guarantees Pydantic AI provides are real; the review habit of treating them as broader than they are is the gap.


Related reading: LangChain on reviewing AI-generated chain code where inter-step trust and structured output parsing create similar correctness gaps at the chain boundary. LangGraph on reviewing stateful agent graphs where unbounded state accumulation and conditional edge routing introduce failure modes invisible during short test runs. LlamaIndex on reviewing AI-generated RAG pipelines where retrieval configuration silently degrades at scale while integration tests pass. CrewAI on reviewing multi-agent orchestration code where task output handoff between agents creates trust gaps similar to Pydantic AI’s result_type boundary. How to review AI-generated code for the general review framework that applies when AI generates code against any typed, structured agent abstraction.

Pydantic AI results validate. ZenCode checks whether they’re correct.

ZenCode surfaces one concrete review question before you commit — including when AI-generated Pydantic AI code passes the type checker and result_type validation but carries semantic incorrectness, unvalidated dependency state, or retry cost amplification from non-idempotent tools.

Try ZenCode free

More posts on AI-assisted coding habits