LangGraph: how to review AI-generated code when stateful agent graphs coordinate multi-step reasoning
LangGraph is the framework for building stateful, multi-actor agent applications on top of LangChain. Where LangChain’s LCEL handles linear pipelines and simple agent executors, LangGraph handles the harder problem: graphs of nodes that read from and write to shared state, conditional edges that route execution based on the model’s output, cycles that let agents revisit earlier nodes, and checkpointers that persist state across invocations so a human can review or interrupt the agent mid-task. It is the standard choice when an agent needs to coordinate multiple tools or sub-agents across an extended, multi-step process rather than a single chain execution.
Because LangGraph code involves significant structural boilerplate — defining a TypedDict state schema, adding nodes with add_node, connecting them with add_edge and add_conditional_edges, attaching a checkpointer, compiling the graph, and invoking it with a thread config — developers frequently use AI coding tools to generate it: the assistant writes the graph, the developer reviews it. The generated graphs typically compile and run correctly on the happy path. But three specific review gaps appear consistently in AI-generated LangGraph code — gaps that look like correctness because the graph executes, yet carry real failure risks that only surface under sustained operation, unusual model outputs, or multi-session use.
The three LangGraph code review traps
1. State accumulation without bounds across node executions
LangGraph state is defined as a TypedDict with fields that each node can read and overwrite. The most common pattern in AI-generated graphs uses a messages field typed as Annotated[list[BaseMessage], operator.add] — the operator.add annotation tells LangGraph to append each node’s message output to the list rather than replace it. This is the correct pattern for building a conversation history that accumulates as the agent calls tools and receives results. It is also the source of a persistent review gap in AI-generated code.
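The reducer semantics can be illustrated in plain Python, with strings standing in for message objects and a hypothetical merge loop mimicking what LangGraph does when it applies a node's returned update to state:

```python
import operator

# Fields annotated with operator.add are appended; fields with no
# reducer are replaced wholesale by the node's returned value.
state = {"messages": ["human: hi"], "scratch": "old"}
update = {"messages": ["ai: hello"], "scratch": "new"}
reducers = {"messages": operator.add}  # scratch has no reducer

for key, value in update.items():
    reducer = reducers.get(key)
    state[key] = reducer(state[key], value) if reducer else value

# state["messages"] is now ["human: hi", "ai: hello"]; scratch was replaced.
```

The append behavior is exactly what makes the pattern correct for conversation history, and exactly what makes it grow without a bound unless something else removes entries.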
AI-generated graphs do not implement any mechanism to bound the size of the messages list. Every tool call appends a ToolMessage, every model call appends an AIMessage, and every human turn appends a HumanMessage. For a single short task, this is not a problem. For an agent that runs repeatedly, handles long-form tasks, or operates in a loop with many tool calls per invocation, the messages list grows without limit. The downstream failure is not an exception at graph compilation time or at the first invocation; it is a gradual degradation in response quality as the model’s context window fills, followed by a hard error when the accumulated messages exceed the model’s token limit, followed by increased API cost on every subsequent invocation as the model processes an ever-longer history even when most of it is not relevant to the current task.
The failure is invisible during development because tests typically run short tasks with small histories. The graph compiles, the agent completes the task, and the state looks correct because it contains a faithful record of everything that happened. Reviewers who see Annotated[list[BaseMessage], operator.add] recognize it as the standard LangGraph pattern and do not check whether the list has a length bound or a summarization step that keeps it from growing indefinitely.
The review check: for any messages field typed with operator.add, ask what happens after 50 tool calls, after 200 turns, after a week of continuous operation. Look for a summarization node that condenses old messages before the list exceeds the model’s context window. Look for a trim step that removes messages beyond a maximum count. If neither exists, the graph relies on callers to manage state size externally — which AI-generated calling code also typically does not implement. This is a structural gap, not a configuration option that can be adjusted after deployment.
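A bounding node can be sketched in plain Python — strings stand in for message objects, and MAX_MESSAGES and the keep-first-plus-recent policy are illustrative assumptions:

```python
MAX_MESSAGES = 30  # illustrative bound; tune to the model's context window


def trim_messages_node(state: dict) -> dict:
    """Keep the first (system) message plus the most recent MAX_MESSAGES
    entries, dropping the middle of the history. A production graph would
    more likely summarize the dropped span into one synthetic message
    instead of discarding it."""
    messages = state["messages"]
    if len(messages) <= MAX_MESSAGES + 1:
        return {}  # already bounded; no state update
    kept = [messages[0]] + messages[-MAX_MESSAGES:]
    # Caution: under an operator.add reducer, returning a shorter list
    # would APPEND rather than replace. Bounding requires either a field
    # with plain replace semantics (assumed here) or a reducer that
    # supports deletion, such as langgraph's add_messages reducer
    # combined with RemoveMessage.
    return {"messages": kept}


# Simulated long history, strings standing in for message objects.
history = ["system"] + [f"turn-{i}" for i in range(100)]
trimmed = trim_messages_node({"messages": history})["messages"]
```

The reducer caveat in the comment is itself a review point: a trim node that returns a shortened list into an operator.add field silently doubles the history instead of bounding it.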
2. Conditional edge routing that trusts model output as a reliable signal
LangGraph’s conditional edges are defined with add_conditional_edges: a function reads the current graph state and returns a string that names the next node to visit. AI coding tools generate this pattern by writing a router function that inspects the model’s last message — typically checking whether the message is a ToolMessage, whether its content matches a specific string, or whether a field in the parsed model output is set to a specific value — and routes to a tool-calling node if the model wants a tool call, or to an end node if the model is done.
The problem is that AI-generated router functions treat the model’s output as a reliable routing signal without validating that the output conforms to the structure the router expects. A common pattern: the router checks state["messages"][-1].tool_calls to decide whether to call a tool. This works when the model outputs a proper tool call. When the model outputs a malformed tool call — a tool name that does not exist in the registered tool list, arguments that do not match the tool schema, or an empty tool call list with a non-empty content field — the router either routes to the tool node with an invalid call (which raises an exception in the tool executor), routes to the end node prematurely (which terminates the task without a result), or raises a KeyError or AttributeError inside the router function itself (which terminates the graph with an unhandled exception).
AI-generated conditional edges also commonly route to a fixed set of named nodes without a default or fallback branch. If the model output does not match any expected pattern — because the model is in an unexpected state, because the prompt produced an output the developer did not anticipate, or because a new model version produces slightly different tool call formatting — the router returns a string that does not match any registered node name. LangGraph raises a ValueError at runtime that names the unrecognized node, which is debuggable but not graceful. Reviewers who see a clean conditional edge mapping — {"continue": "tools", "end": END} — often do not check whether the router function handles the full range of possible model outputs, including outputs that do not map to any defined branch.
The review check: for every add_conditional_edges call, read the router function and enumerate the set of model outputs it handles. Check whether it has a default branch for unexpected outputs. Check whether it validates the structure of the model output before accessing fields like tool_calls, content, or parsed sub-fields. Check whether the set of target node names in the routing map matches the set of nodes registered with add_node exactly — not every LangGraph version catches a mismatch at graph compilation time. The router function is the most load-bearing piece of a LangGraph graph; it is also the piece that AI coding tools generate with the least defensive validation.
3. Checkpointing that persists state in ways that replay stale context
LangGraph’s checkpointer abstraction — MemorySaver for in-process persistence, SqliteSaver or PostgresSaver for durable persistence — allows a graph to save its state after every node execution and resume from any prior checkpoint. This is the feature that enables human-in-the-loop workflows: the agent pauses after a tool call, a human reviews the output and provides feedback, and the graph resumes from the saved state with the feedback incorporated. It is also the feature that introduces the most subtle correctness gaps in AI-generated LangGraph code.
AI-generated graphs that use checkpointers invoke the graph with a thread_id in the config: {"configurable": {"thread_id": "session-123"}}. Each invocation with the same thread_id resumes from the last checkpoint for that thread. This is correct behavior for a conversational agent that maintains context across turns. The review gap is that AI-generated code frequently reuses thread_id values in contexts where resuming from a prior checkpoint is incorrect. A common example: a background job that processes records uses a fixed thread_id for all runs. On the first run, the graph accumulates state for a specific record. On the second run with the same thread_id, the graph resumes from the prior state — the messages list still contains the tool calls and results from the first run, the model receives context from a different record, and the second run’s output is influenced by state that has no relationship to the current task.
A subtler variant: AI-generated graphs that use SqliteSaver or another durable store do not implement checkpoint expiry or cleanup. Over time, the checkpoint store accumulates state for every thread that has ever run. If the graph’s state schema changes — a field is renamed, a type is changed, a new required field is added — resuming from an old checkpoint may produce a state object that does not match the current schema. Nothing catches this at invocation time: LangGraph does not re-validate checkpoint state against the current schema on load, and TypedDict annotations are not enforced at runtime. The mismatch surfaces later as an unexpected KeyError or AttributeError inside a node that reads a field that has been renamed.
The review check: for any graph that uses a checkpointer, read every invocation site and verify that the thread_id is scoped to the intended unit of continuity — a specific conversation, a specific user session, a specific long-running task. A thread_id that is reused across logically distinct invocations is a latent bug that does not surface until the second run. For graphs that use durable checkpointers, check whether there is a mechanism to expire or clean up old checkpoints when the graph’s state schema changes, and verify that the code handles the case where a checkpoint was created by a prior version of the graph schema.
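One way to make the scoping explicit is a small helper that derives the thread_id from the intended unit of continuity and embeds a schema version, so checkpoints written by an older graph schema are never silently resumed. This is a hypothetical convention, not a LangGraph feature; the names and version string are illustrative:

```python
import uuid

SCHEMA_VERSION = "v3"  # bump whenever the graph's state schema changes


def thread_config(record_id: str) -> dict:
    """Scope the thread_id to one record per batch-job run. Prefixing
    the schema version means a schema change starts fresh threads
    instead of resuming checkpoints shaped like the old schema."""
    return {"configurable": {"thread_id": f"{SCHEMA_VERSION}:record:{record_id}"}}


# Each record gets its own thread: re-running record order-1001 resumes
# its own history, never order-1002's.
cfg_a = thread_config("order-1001")
cfg_b = thread_config("order-1002")

# For one-shot jobs where resuming is never correct, a fresh UUID per
# run avoids checkpoint reuse entirely.
one_shot = {"configurable": {"thread_id": f"{SCHEMA_VERSION}:{uuid.uuid4()}"}}
```

Versioned thread IDs sidestep stale-schema resumes but do not clean up the old checkpoints themselves; durable stores still need an expiry or deletion policy.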
Reviewing LangGraph code without treating graph compilation as correctness
LangGraph’s graph abstraction — typed state, conditional edges, cycles, checkpointers — makes multi-agent coordination significantly more tractable than building the same logic with a bare LangChain executor or a custom agent loop. AI coding tools generate LangGraph code that uses these abstractions correctly at the structural level: the graph compiles, the edges connect, the checkpointer persists state, and a short test run completes correctly. The review problem is that compilation and short test runs do not reveal the failure modes that appear under sustained use.
A practical review approach for AI-generated LangGraph code: when you see a messages field with operator.add, check for a message pruning or summarization node before evaluating anything else — unbounded state is invisible during development and expensive in production. When you see an add_conditional_edges call, read the router function as a standalone piece of logic and check whether it handles the full range of model outputs, including malformed and unexpected ones. When you see a checkpointer with a thread_id, verify that the thread ID is scoped to the unit of continuity the developer intended — a single bad scoping decision silently corrupts every subsequent run on the affected thread.
Related reading:
- LangChain on reviewing AI-generated code when LCEL chains connect LLM calls across multiple steps — the same inter-step trust and agent termination concerns in the framework that LangGraph builds on.
- Microsoft AutoGen on reviewing multi-agent conversation code where termination conditions and code execution scope create review gaps similar to LangGraph’s conditional edge and state accumulation problems.
- CrewAI on reviewing multi-agent orchestration code where role delegation and task chaining introduce correctness gaps that only surface when the agent crew handles unexpected inputs.
- OpenAI Codex agent on reviewing autonomous agent output that chains tool calls across steps, including the same loop termination and state boundary concerns.
- How to review AI-generated code for the general review framework that applies when AI generates code against any stateful, multi-step agent abstraction.
LangGraph graphs compile. ZenCode checks whether they’re correct.
ZenCode surfaces one concrete review question before you commit — including when AI-generated LangGraph code runs correctly on the happy path but carries unbounded state accumulation, conditional edge routing gaps, or checkpointing behavior that replays stale context across sessions.
Try ZenCode free