CrewAI: how to review AI-generated code when multi-agent crews orchestrate task execution

2026-05-03 · 5 min read · ZenCode

CrewAI has become one of the dominant Python frameworks for building multi-agent systems. It provides the Agent and Task primitives that define what each agent is responsible for; the Crew orchestrator that sequences task execution and routes outputs between agents; the Tool abstraction that gives agents the ability to call external systems; and the Process configuration that determines whether tasks execute sequentially or hierarchically under a manager agent. Because CrewAI crews involve substantial boilerplate — agent role definitions, task descriptions with expected output specifications, tool wiring, crew instantiation with process configuration — developers frequently use AI coding tools to generate the initial crew. The assistant writes the crew; the developer reviews it.

AI coding tools generate CrewAI code that runs correctly on a well-defined task path. The agents execute in order, tools get called, tasks produce outputs, and the crew completes without raising an exception for the scenario the developer described in the prompt. Three specific review gaps appear consistently in AI-generated CrewAI code — gaps that look like correctness because the crew runs, but carry real failure risks that surface when agent outputs deviate from expectations, when tools return unexpected content, or when the crew is run in a configuration that differs from the one implicitly assumed during generation.

The three CrewAI code review traps

1. Task output handoff without schema enforcement between agents

In a CrewAI sequential or hierarchical crew, each task’s output becomes the input context for the downstream task. The connection is mediated by the expected_output field in each Task definition and by the context parameter that explicitly links tasks together. AI coding tools generate these task chains naturally: the researcher agent’s task has an expected_output of “a structured report with findings”, and the writer agent’s task uses the researcher task as context and expects to receive that structured report. The agent descriptions align, the task chain is coherent, and the crew runs end to end for a representative query.
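A minimal sketch of the kind of chain this describes, with illustrative agent and task names rather than anything from a real codebase: the handoff between the two tasks is carried entirely by the context link and the prose expected_output descriptions.

```python
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Research Analyst",
    goal="Gather findings on the given topic",
    backstory="An analyst who produces structured research reports.",
)
writer = Agent(
    role="Report Writer",
    goal="Turn research findings into a polished summary",
    backstory="A writer who expects structured input from the researcher.",
)

research_task = Task(
    description="Research the topic and organize the findings.",
    expected_output="A structured report with findings",  # prose description, not a schema
    agent=researcher,
)
write_task = Task(
    description="Summarize the findings section of the structured report.",
    expected_output="A two-paragraph summary",
    agent=writer,
    context=[research_task],  # upstream output is passed through as raw text
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
)
```

Nothing in this definition checks that the researcher's output actually contains a findings section before the writer's task consumes it.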

The problem is that expected_output is a plain-text description, not a schema. CrewAI does not validate that the output an agent actually produces matches the structure the downstream agent’s task description assumes. When an upstream agent produces output that deviates from the described format — the structured report is less structured than expected, key sections are missing, the format is prose where the downstream agent was prompted to parse bullet points — the downstream agent receives malformed context. The failure mode is not an exception; the downstream agent uses the available context as best it can, producing output that is plausibly shaped given what it received but semantically incorrect relative to the original task intent. The final crew output looks like a result and does not signal that an intermediate handoff carried the wrong content.

In AI-generated crews, this gap is compounded by the way expected_output descriptions are written. AI coding tools write expected_output values that match the happy-path output the agent would typically produce — “a JSON object with keys summary, findings, and recommendations” — but do not add any validation logic to check that the upstream agent actually produced that structure before passing it downstream. Reviewers who see a well-described task chain read the expected_output fields as constraints; they are descriptions, not enforced contracts.

The review check: for each task that uses another task as context, compare the upstream task’s expected_output description against the downstream task’s task description and system prompt. Identify what structure the downstream task assumes the upstream output has — keys it references, sections it expects to parse, formats it treats as given. Determine whether there is any validation logic between the two tasks that checks the upstream output before the downstream task runs. Look for post-processing callbacks or output parsers in the agent definition that would enforce the expected format before passing the output to the next task. If none exist, the handoff is a description with no enforcement.
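One way to make the handoff more than a description, assuming a CrewAI version that supports output_pydantic on Task (the option's availability should be confirmed against the version in use): attach a Pydantic model so the upstream output is parsed and validated before it becomes downstream context.

```python
from pydantic import BaseModel
from crewai import Task

class ResearchReport(BaseModel):
    summary: str
    findings: list[str]
    recommendations: list[str]

research_task = Task(
    description="Research the topic and organize the findings.",
    expected_output="A JSON object with keys summary, findings, and recommendations",
    agent=researcher,  # the illustrative researcher agent from the earlier sketch
    output_pydantic=ResearchReport,  # enforce the structure instead of merely describing it
)
```

If the model cannot be populated from the agent's output, the failure surfaces at the handoff rather than as a plausible-looking but semantically wrong final result.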

2. Tool return content entering agent reasoning without validation

CrewAI tools return string content that the agent’s LLM receives directly in its reasoning context. When an agent calls a tool — a web search tool, a database query tool, a file reader tool, an API integration tool — the tool’s return value is appended to the agent’s conversation as an observation, and the agent reasons about what to do next based on that observation. AI coding tools generate CrewAI tool definitions correctly: the tool class inherits from BaseTool, the _run method performs the operation, the return type is a string, and the agent is configured to use the tool. The tool executes and the agent processes its output.
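A sketch of the shape such a tool typically takes; the search endpoint and response format are assumptions for illustration, and the BaseTool import path has moved between crewai_tools and crewai.tools across versions.

```python
import requests
from crewai.tools import BaseTool

class WebSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web and return results for a query."

    def _run(self, query: str) -> str:
        # Hypothetical search API; the raw response body is returned unmodified
        resp = requests.get("https://api.example-search.com/search", params={"q": query})
        return resp.text  # whatever the source contains goes straight into the agent's context
```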

The problem is that tool return content is treated as factual by the agent’s reasoning process. The agent LLM receives the tool output as an authoritative observation and reasons from it directly, without a validation layer between the tool’s return and the agent’s next reasoning step. For tools that query external data sources — user-provided search queries, database records that may contain user-supplied strings, third-party API responses with variable content — this creates a prompt injection surface. A search tool that returns web content containing instruction-like text can cause the agent to follow those instructions as if they were part of its task. A database query tool that returns records containing strings formatted to look like task outputs can corrupt the agent’s assessment of what it has accomplished.

AI-generated tool definitions do not include sanitization of tool return values. The _run method fetches or queries the data and returns it as a string; there is no step that strips instruction-like content from returned strings before they enter the agent context. Reviewers who evaluate a CrewAI tool implementation check whether the tool performs its operation correctly — the query is structured, the API call has the right parameters, the result is returned. They do not typically check whether the returned content could be structured to influence the agent’s subsequent reasoning in unintended ways, because the tool looks correct in isolation and the risk is only visible when considering what the agent will do with the tool’s return value.

The review check: for each tool in a CrewAI crew, identify what data source the tool queries and who controls the content of that data source. If the tool returns content from user-supplied input, external web content, or database records that may contain user-supplied strings, check whether the _run method sanitizes the return value before returning it to the agent. Look for whether the agent’s system prompt includes instructions for handling tool observations that contain instruction-like content. For tools that query external APIs, verify that the return value is parsed to extract only the specific fields the agent needs, rather than passing the full API response as a raw string that the agent interprets end to end.
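A sketch of what narrowing the return value might look like for the same hypothetical tool: the _run method parses the response and passes along only the fields the agent's task actually needs.

```python
import requests
from crewai.tools import BaseTool

class WebSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web and return the top results for a query."

    def _run(self, query: str) -> str:
        resp = requests.get("https://api.example-search.com/search", params={"q": query})
        data = resp.json()  # assumed shape: {"results": [{"title": ..., "url": ...}, ...]}
        lines = []
        for hit in data.get("results", [])[:5]:
            title = str(hit.get("title", ""))[:200]  # truncate; drop snippets and page bodies
            url = str(hit.get("url", ""))
            lines.append(f"- {title} ({url})")
        return "\n".join(lines) or "No results found."
```

This does not eliminate the injection surface (titles can still carry instruction-like text), but it shrinks what an external source can place in front of the agent.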

3. Hierarchical crew configuration failures after expensive token consumption

CrewAI supports a hierarchical process where a manager agent orchestrates specialist agents, delegates subtasks, and synthesizes results. This requires configuring the Crew with process=Process.hierarchical and providing a manager_llm (or manager_agent) that handles delegation. AI coding tools generate hierarchical crews that look structurally complete: the agents are defined with their roles and tools, the tasks are defined with descriptions and expected outputs, the crew is instantiated with the hierarchical process, and the manager LLM is set. The crew definition is coherent and the code is syntactically valid.
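A minimal sketch of that configuration, reusing the illustrative agents and tasks from the first example; the model name is an assumption, and it is exactly the value the review needs to verify against the runtime environment.

```python
from crewai import Crew, Process

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.hierarchical,
    manager_llm="gpt-4o",  # assumed model name; nothing here confirms it is callable at runtime
)
```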

The problem is that hierarchical crew configuration errors fail at runtime after the manager agent has already consumed tokens making delegation decisions — not at crew initialization time. If the manager_llm is set to a model that is not available in the current environment, the crew initializes successfully and the manager agent begins reasoning; the failure occurs when the manager tries to call the model. If the manager agent’s delegation instructions reference an agent role that does not exactly match any agent’s role field in the crew, the manager delegates successfully but the subtask is assigned to the wrong agent or fails to find a matching agent after several reasoning cycles. If the task structure assumes that one agent will delegate to another but the actual delegation is controlled by the manager LLM’s reasoning rather than by explicit task wiring, the crew may execute a different task sequence than the developer intended, consuming tokens on an unexpected path before producing an incorrect result.

AI-generated hierarchical crews do not include explicit delegation validation or fallback configuration. The manager_llm value is set based on the model the developer mentioned in the prompt; the agent roles are written in prose that reads well but may not exactly match what the manager LLM will use to identify agents when delegating. There is no startup check that verifies agent role uniqueness, manager LLM availability, or delegation path reachability before the crew begins token-expensive execution. Reviewers who see a hierarchical crew with correctly named agents, properly described tasks, and an assigned manager LLM may not check whether the manager’s delegation instructions and the agents’ role definitions form an unambiguous matching that the manager LLM will resolve correctly under the actual runtime environment.

The review check: for any crew configured with Process.hierarchical, verify that manager_llm is set to a model that is explicitly available in the deployment environment and that the environment variable or API key required to call it is present. Check that each agent’s role field is unique within the crew and concise enough that the manager LLM will match it unambiguously when delegating. Look for whether the crew includes a max_rpm or token budget configuration to cap execution cost if the manager enters an unexpected delegation loop. Trace the intended delegation path explicitly — identify which agent should receive which subtask — and verify that the manager LLM’s system prompt or the task descriptions guide it to that path rather than leaving delegation fully open-ended.
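A sketch of a fail-fast check a reviewer could ask for; preflight_check is an assumed helper, not a CrewAI API, and the required environment variable depends on the manager LLM's provider.

```python
import os
from collections import Counter

def preflight_check(crew, required_env_vars=("OPENAI_API_KEY",)):
    """Run before kickoff() so configuration mismatches fail before tokens are spent."""
    roles = [agent.role for agent in crew.agents]
    duplicates = [role for role, count in Counter(roles).items() if count > 1]
    if duplicates:
        raise ValueError(f"Duplicate agent roles make delegation ambiguous: {duplicates}")
    missing = [var for var in required_env_vars if not os.environ.get(var)]
    if missing:
        raise RuntimeError(f"Manager LLM credentials not found: {missing}")

preflight_check(crew)
result = crew.kickoff()
```

Pairing this with a max_rpm or token budget on the Crew bounds the cost of any delegation loop the check cannot anticipate.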

Reviewing CrewAI code without treating crew completion as correctness

CrewAI’s abstractions — the agent/task/crew structure, the tool integration pattern, the hierarchical process — make multi-agent orchestration significantly easier to build and read. AI coding tools generate CrewAI code that uses these abstractions correctly at the component level: each agent is well-defined, each task has a description and expected output, each tool is wired correctly, and the crew instantiation follows the framework’s API. The review problem is that crew completion is not the same as correct output: task handoffs can carry semantically incorrect content when upstream output deviates from the described format, tool observations can influence agent reasoning in unintended ways when the tool returns external content without sanitization, and hierarchical crews can consume significant tokens on incorrect delegation paths before a configuration mismatch surfaces as a failure.

A practical review approach for AI-generated CrewAI code: when you see a sequential or hierarchical crew with tasks that use other tasks as context, trace each handoff by comparing the upstream expected_output description against what the downstream task’s prompt actually requires — identify what breaks if the upstream output is correct in shape but wrong in content. When you see tools that query external or user-influenced data sources, check whether the _run method returns raw content or extracts and sanitizes only the fields the agent needs. When you see a hierarchical crew, verify the manager LLM is available in the target environment and trace the intended delegation path against the agent role definitions before evaluating anything else.


Related reading:

LangChain: reviewing AI-generated multi-step LLM pipelines, where inter-step output trust and agent loop termination carry the same structural gaps as CrewAI task handoffs and crew execution bounds.

OpenAI Codex agent: reviewing autonomous agent output, where tool-calling chains and multi-step execution create similar handoff and trust boundary concerns.

Devin AI: reviewing fully autonomous agent code, where multi-step execution makes intermediate failures invisible until the final output is evaluated.

Roo Code: reviewing AI-generated multi-agent orchestration code, where agent delegation and tool trust carry similar review concerns.

How to review AI-generated code: the general review framework that applies when AI generates orchestration code against any multi-agent abstraction layer.

CrewAI crews complete. ZenCode checks whether they’re correct.

ZenCode surfaces one concrete review question before you commit — including when AI-generated CrewAI code runs on the happy path but carries task handoff gaps, tool trust assumptions, or hierarchical configuration issues that only surface after expensive token consumption.

Try ZenCode free

More posts on AI-assisted coding habits