LangChain: how to review AI-generated code when chains connect LLM calls across multiple steps
LangChain has become the dominant framework for building LLM-powered applications in Python and JavaScript. It provides the chain primitives that connect prompts, models, and parsers; the LangChain Expression Language (LCEL) pipe syntax that makes multi-step LLM pipelines readable; the retrieval components for RAG applications; and the agent and tool abstractions that let a model choose and execute operations across multiple steps. Because LangChain code involves significant boilerplate — prompt templates, output parsers, retrieval chains, tool definitions, agent executors — developers frequently turn to AI coding tools to generate it. The assistant writes the chain, the developer reviews it.
AI coding tools generate LangChain code that runs correctly on the happy path. The chains execute, the agents call tools, the retrieval pipelines return documents, and a test with representative inputs passes. Three specific review gaps appear consistently in AI-generated LangChain code — gaps that look like correctness because the code runs, but carry real failure risks that only surface under specific input conditions, runtime error states, or extended operation.
The three LangChain code review traps
1. Inter-step output trust bypassing validation at chain boundaries
LCEL chains are defined using the pipe operator: a ChatPromptTemplate is piped into a model, the model output is piped into an StrOutputParser, the parsed string is piped into the next prompt template, and so on. Each step receives the output of the prior step as its input. AI coding tools generate these chains fluently — the template variables match the output format the prior step should produce, the parsers handle the expected output structure, and the chain composition looks clean and intentional.
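The shape of such a chain, as a minimal sketch: the import paths assume recent langchain-core and langchain-openai releases, and the prompt texts, model name, and variable names are illustrative.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")

summarize = ChatPromptTemplate.from_template("Summarize this ticket: {text}")
decide = ChatPromptTemplate.from_template(
    "Given this summary, answer APPROVE or REJECT: {summary}"
)

# Each step feeds the next. The lambda rebinds the parsed string to the
# variable name the second template expects; nothing checks its content.
chain = (
    summarize
    | model
    | StrOutputParser()
    | (lambda s: {"summary": s})
    | decide
    | model
    | StrOutputParser()
)

result = chain.invoke({"text": "Customer reports a duplicate charge."})
```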
The problem is that the output of one LLM call is a generated string. The chain’s next step assumes a specific format: a JSON object with required fields, a comma-separated list, a structured decision between exactly two options. AI-generated chains rarely add validation at chain boundaries — there is no schema check between the first model call and the second prompt that consumes its output. If the first model call produces output that deviates from the expected format — a JSON object missing a required field, a list with an unexpected delimiter, a decision phrased as a question rather than a command — the second prompt receives malformed input. The failure mode is not an immediate exception: the second model call works with the malformed input and generates a response that propagates the malformation further, often producing a chain output that is plausibly shaped but semantically incorrect.
The downstream failure is often silent. The chain completes without raising an exception; the output reaches the application layer; it is processed against business logic that fails on malformed data. The connection between the chain output and the upstream format deviation is not obvious without tracing back through the chain steps. Reviewers who see a clean LCEL chain assume that the pipe composition handles format negotiation; they do not check whether the format the first step produces matches the format the second step requires under non-ideal model outputs.
The review check: for each | operator in an LCEL chain, trace what format the left side is expected to produce and what format the right side explicitly requires. If the right side is a prompt template with variables like {summary} or {decision}, identify where those values come from and what happens if the model produces a valid-looking but incorrectly formatted output. Look for missing output parsers between steps that rely on structured model output, and for prompt templates that assume a specific format without an explicit parsing step to enforce it.
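One way to make that check enforceable in the code itself is a guard runnable at the boundary. A minimal sketch under the same assumptions as above; the JSON format and the required "category" field are illustrative:

```python
import json

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")
classify = ChatPromptTemplate.from_template(
    'Classify this ticket. Reply as JSON with a "category" field: {ticket}'
)

def require_category(text: str) -> dict:
    """Fail loudly at the chain boundary instead of letting malformed
    output flow into the next prompt."""
    data = json.loads(text)  # raises json.JSONDecodeError on non-JSON output
    if "category" not in data:
        raise ValueError(f"model output missing 'category': {text!r}")
    return data

chain = (
    classify
    | model
    | StrOutputParser()
    | RunnableLambda(require_category)
    # every downstream step can now rely on a dict with a "category" key
)
```

LangChain also ships structured alternatives for the same job, such as PydanticOutputParser or with_structured_output on chat models; the review question is simply whether any explicit mechanism exists at each boundary.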
2. LCEL Runnable type mismatch opacity at composition boundaries
LangChain’s Runnable protocol is the abstraction that makes LCEL work: every component that participates in a chain implements invoke, stream, and batch, accepting an input and producing an output. The pipe operator connects Runnables, and because every component exposes the same interface, the chain constructs without error and starts executing regardless of whether the types flowing through it are compatible. AI coding tools generate LCEL chains by composing components that are individually correct — the ChatPromptTemplate is set up correctly, the model is instantiated correctly, the parser handles the expected format — but the chain as a whole may have type mismatches at composition boundaries that are not visible in the code and do not raise errors until runtime.
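The uniform interface is easy to see in isolation; a minimal sketch using only a parser, so it runs without a model or API key:

```python
from langchain_core.messages import AIMessage
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

parser.invoke(AIMessage(content="hello"))     # one input -> "hello"
parser.batch([AIMessage(content="a"),
              AIMessage(content="b")])        # list in -> ["a", "b"]
for chunk in parser.stream(AIMessage(content="hello")):
    print(chunk)                              # incremental output

# Every component answers these same three calls, so any two compose with |
# even when the output type of one is not what the next expects.
```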
The most common mismatch in AI-generated chains involves message types. A chain that uses a ChatPromptTemplate produces a list of BaseMessage objects that are passed to a chat model; the chat model produces an AIMessage. An StrOutputParser extracts the string content. If the chain then passes this string into another ChatPromptTemplate via a RunnablePassthrough or a dictionary binding, the template receives a string where it expects a message list, or a message object where it expects a string. The error surfaces at runtime as a type error or an unexpected template variable, often with a stack trace that points to the template rather than to the composition boundary where the type mismatch was introduced. AI-generated code that uses RunnableParallel to merge outputs from multiple chain branches compounds this: each branch may produce a different type, and the merge step receives heterogeneous types that the downstream component did not expect.
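A minimal sketch of that mismatch, with illustrative names; the broken variant constructs without complaint and fails only when invoked:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")

first = (
    ChatPromptTemplate.from_template("Summarize: {text}")
    | model
    | StrOutputParser()
)
second = ChatPromptTemplate.from_template(
    "Rewrite this summary in a {tone} tone: {summary}"
)

# Broken: `first` emits a plain string, but `second` expects a mapping with
# "summary" and "tone" keys. Construction succeeds; invocation raises.
broken = first | second | model

# Fixed: the dict literal is coerced into a RunnableParallel, so each
# template variable is populated explicitly before the template runs.
fixed = {"summary": first, "tone": lambda _: "neutral"} | second | model
```

The dict literal in the fixed variant is where the RunnableParallel check described below applies: its keys must match the downstream template’s variables exactly.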
The visual linearity of LCEL — left to right with |, readable like a Unix pipe — creates the impression that the composition is type-safe in the same way that typed function composition would be. It is not. The Runnable protocol is dynamically typed at the composition level; the type checker does not verify that ComponentA’s output type matches ComponentB’s expected input type at chain construction time. Reviewers who read a four-step LCEL chain as a visually coherent pipeline may not check whether each step’s output type is explicitly compatible with the next step’s input contract.
The review check: for each | in an LCEL chain, identify explicitly what Python or JavaScript type the left side produces and what type the right side’s invoke method accepts. Check the LangChain documentation for the specific component if the type is not obvious from the code. Pay particular attention to transitions between prompt templates and models (which produce message lists), models and parsers (which expect AIMessage or a compatible type), and parsers and downstream components (which receive the parsed output type). For RunnableParallel branches, verify that the merged output dictionary has the key names and value types that the downstream component expects.
3. Agent loop termination delegated to the model’s self-report
LangChain agent executors and LangGraph agents operate by calling a model, reading the model’s output to determine whether to call a tool or return a final answer, executing the chosen tool, appending the result to the conversation, and repeating until the model signals that it is done. AI coding tools generate agent code that implements the tool list, the system prompt, and the executor correctly. The tools are defined with accurate schemas, the prompt instructs the model to use tools when needed and return a final answer when done, and the executor wires the loop together. The generated code runs correctly for queries where the model’s path to a final answer is short and clear.
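Concretely, the generated code tends to look like the following minimal sketch, assuming recent langchain import paths and an OpenAI key; the stub tool and prompt text are illustrative:

```python
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def search(query: str) -> str:
    """Look up a term in the knowledge base (stub)."""
    return f"no results for {query!r}"

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question. Use tools when needed; give a final "
               "answer when you have enough information."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(ChatOpenAI(model="gpt-4o-mini"), [search], prompt)

# Tools, prompt, executor: each piece individually correct, and it runs.
executor = AgentExecutor(agent=agent, tools=[search])
```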
The problem is that AI-generated agent code frequently does not set a max_iterations limit on the executor or implement an explicit termination condition at the graph level. The agent terminates when the model outputs an AgentFinish — a signal that the agent itself produces based on its own assessment of whether it has enough information to answer. For queries where the model gets stuck in a loop — calling the same search tool with slightly different queries because no result fully satisfies the model’s intermediate reasoning, or alternating between two tools because each result partially contradicts the other — the agent runs until an external timeout kills the process or the API budget is exhausted. There is no structural termination beyond the model’s self-report.
The failure mode is not a crash; it is silent, expensive, and hard to diagnose after the fact. The agent process runs, API calls accumulate at the configured model’s rate, and the application appears to hang from the caller’s perspective. When the host application’s HTTP timeout fires, the agent is killed mid-loop with no final answer produced. The logs show tool calls without a final answer entry, but the connection between the loop pattern and the missing max_iterations configuration is easy to miss in a post-mortem. Reviewers who see a correctly configured agent — right tools, right prompt, right executor — may not check whether the loop has a structural ceiling or whether the only termination guarantee is the model’s own judgment.
The review check: for any AgentExecutor instantiation, verify that max_iterations is set to a value that reflects the expected number of tool calls for the agent’s intended tasks, not left at the default (typically 15, which gives an agent plenty of room to loop on an ambiguous query and burn API budget before anything visibly fails). For LangGraph agents, verify that the graph includes an explicit edge or conditional that terminates after a maximum number of node visits, independent of the model’s AgentFinish output. Also check whether the agent prompt includes instructions for what to do when no tool call produces a satisfactory result — an explicit fallback instruction reduces the probability of indeterminate loops on underspecified queries.
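In code, the check translates into configuration like the following, reusing the agent and tool from the sketch above; the specific values are illustrative rather than recommendations:

```python
executor = AgentExecutor(
    agent=agent,
    tools=[search],
    max_iterations=5,               # ceiling sized to the task, not the default 15
    max_execution_time=60,          # wall-clock bound in seconds
    early_stopping_method="force",  # return a controlled answer at the limit
)

# LangGraph equivalent: bound the loop at invocation time. Exceeding the
# limit raises GraphRecursionError rather than looping until an outside
# timeout kills the process.
# result = graph.invoke(state, config={"recursion_limit": 12})
```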
Reviewing LangChain code without treating chain execution as correctness
LangChain’s abstractions — LCEL composition, the Runnable protocol, the agent executor loop — make multi-step LLM pipelines significantly easier to build and read. AI coding tools generate LangChain code that uses these abstractions correctly at the component level: each piece is right, and the chain runs. The review problem is that running is not the same as correct end-to-end: inter-step format assumptions can fail on non-ideal model outputs, LCEL’s dynamic typing can hide composition boundary mismatches until runtime, and agent loops without structural ceilings can run indefinitely on difficult queries.
A practical review approach for AI-generated LangChain code: when you see an LCEL chain with multiple steps, trace the output type of each step explicitly rather than reading the pipe as a type-safe composition. When you see a multi-step chain with no validation between steps, identify the format assumption the second step makes about the first step’s output and name what happens when that assumption fails. When you see an agent executor or LangGraph agent, check for max_iterations before evaluating anything else — structural termination is the single most important property to verify in agent code because its absence is invisible during correct-path testing and expensive during incorrect-path execution.
Related reading:
- Vercel AI SDK on reviewing AI-generated LLM integration code that uses SDK abstractions correctly at the component level while carrying gaps at integration boundaries; the same pattern in a TypeScript/JavaScript SDK context.
- OpenAI Codex agent on reviewing autonomous agent output that chains tool calls across multiple steps, including the same termination and loop-boundary concerns.
- GitHub Copilot Agent Mode on reviewing AI-generated code that wires together multi-step operations where earlier steps establish context that later steps depend on.
- Cursor on reviewing AI-generated Python that satisfies framework constraints correctly while carrying logic gaps that the framework cannot catch.
- How to review AI-generated code for the general review framework that applies when AI generates integration code against any multi-component abstraction layer.
LangChain chains run. ZenCode checks whether they’re correct.
ZenCode surfaces one concrete review question before you commit — including when AI-generated LangChain code executes on the happy path but carries inter-step format assumptions, LCEL type boundary gaps, or agent termination conditions that only the model enforces.
Try ZenCode free