AutoGen: how to review AI-generated code when agent conversations drive code execution

2026-05-04 · 5 min read · ZenCode

AutoGen is Microsoft’s open-source framework for building multi-agent AI applications. It provides the ConversableAgent base class from which all agent types derive. The AssistantAgent wraps an LLM and generates text and code responses. The UserProxyAgent represents the human side of the conversation and, critically, can execute code suggested by the assistant. A function registration mechanism lets agents call Python functions and receive results as conversation messages, and the GroupChat and GroupChatManager constructs coordinate conversations among three or more agents. What distinguishes AutoGen from other multi-agent frameworks is that code execution is a first-class feature: the UserProxyAgent is designed to run code blocks that appear in assistant messages, making the agent conversation itself a code execution environment rather than just an orchestration layer.

AI coding tools generate AutoGen multi-agent code that completes the described task correctly on the happy path. The assistant agent generates code, the user proxy executes it, results are fed back into the conversation, and the agents iterate until the task is done. The generated code looks structurally correct: agents are initialized with system messages, a conversation is started with initiate_chat, and the result appears after a few turns. Three specific review gaps appear consistently in AI-generated AutoGen code — gaps that are invisible when the task runs to completion on the first attempt, but carry real execution and correctness risks when termination is ambiguous, when execution runs outside a controlled environment, or when function outputs contain external content that the model treats as trusted instruction.
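For concreteness, a minimal sketch of the shape this generated code usually takes, written against the AutoGen 0.2 (pyautogen) API; the model name, key handling, task message, and working directory are illustrative placeholders:

    import autogen

    # Placeholder LLM configuration; model choice and key handling vary by setup.
    llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "sk-..."}]}

    assistant = autogen.AssistantAgent(
        name="assistant",
        system_message="You are a helpful coding assistant. Reply TERMINATE when the task is done.",
        llm_config=llm_config,
    )

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",  # no approval gate before generated code runs
        is_termination_msg=lambda msg: "TERMINATE" in (msg.get("content") or ""),
        code_execution_config={"work_dir": "coding", "use_docker": False},
    )

    # The conversation loop: the assistant generates code, the user proxy executes
    # it, results feed back into the chat, and the agents iterate to termination.
    user_proxy.initiate_chat(assistant, message="Summarize the CSV files in ./data into one report.")

All three of the traps below are visible in this one construction.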

The three AutoGen code review traps

1. Termination conditions that defer conversation ending to model output

AutoGen agent conversations end when a termination condition is met. The ConversableAgent base class provides two mechanisms: the is_termination_msg callable, which receives each incoming message and returns True when the conversation should end; and the max_consecutive_auto_reply parameter, which limits how many replies an agent will send without a human-in-the-loop turn. AI coding tools generate code where is_termination_msg is set to check for the string “TERMINATE” in the message content — the standard AutoGen convention — and max_consecutive_auto_reply is either absent or set to a high value like 10 or 20 to avoid premature termination on longer tasks.
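The callable receives each incoming message as a dict and returns True to end the chat. A sketch of the conventional keyword check as typically generated; using endswith rather than a substring match at least avoids ending the chat when “TERMINATE” is merely mentioned mid-message:

    def is_termination(msg: dict) -> bool:
        # AutoGen passes the incoming message as a dict; content can be None
        # when a message carries only tool calls, so guard before matching.
        content = msg.get("content") or ""
        return content.rstrip().endswith("TERMINATE")

    # Wired in at construction:
    # UserProxyAgent(..., is_termination_msg=is_termination, max_consecutive_auto_reply=20)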

The problem is that relying on the model to produce “TERMINATE” as the conversation-ending signal delegates termination to a non-deterministic source. The assistant model appends “TERMINATE” when it believes the task is complete, but this judgment is based on its interpretation of the conversation context, not on a programmatic condition. For tasks that involve iterative refinement — generate code, execute, observe error, regenerate, execute again — the model may withhold “TERMINATE” after a successful execution because it reads the task as requiring further verification steps, steps it then never actually carries out. In these cases the conversation continues past the point where the task was actually completed, consuming additional API tokens on unnecessary turns. For tasks with ambiguous success criteria, the model may append “TERMINATE” prematurely, after a partial result that satisfies a surface reading of the task description but not the developer’s intent.

AI-generated code rarely includes a programmatic termination condition based on observable state: a file existing, a specific return value from a function, a counter reaching a threshold. The is_termination_msg function checks for a keyword in model text rather than verifying that the work product satisfies a checkable condition. The max_consecutive_auto_reply guard, when present, is set high enough that it does not function as a practical cost control; when absent, the only backstop is the ConversableAgent class default of 100 consecutive auto-replies, which is no practical bound on conversation length if “TERMINATE” is never produced.

The review check: for any initiate_chat call, identify the termination mechanism. If the only termination path is is_termination_msg checking for a keyword in model output, determine whether there is a max_consecutive_auto_reply that provides a hard upper bound on turns. Check whether the bound is proportionate to the task complexity — a task that should complete in three turns should not have a limit of 20. Look for whether the system message instructs the assistant to append “TERMINATE” only when a specific checkable condition is met (a file written, a function returning a specific value), not when the model believes it is done. Where possible, identify whether a programmatic termination condition can replace the keyword check entirely.
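A sketch of what replacing the keyword check with an observable-state condition can look like, assuming (hypothetically) that the task’s work product is a file the script can test for; the path and turn bound are illustrative:

    import autogen
    from pathlib import Path

    EXPECTED_OUTPUT = Path("coding/report.csv")  # hypothetical work product for this task

    def task_is_done(msg: dict) -> bool:
        # The keyword is honored only once the checkable condition holds;
        # a premature TERMINATE from the model does not end the conversation.
        content = msg.get("content") or ""
        return EXPECTED_OUTPUT.exists() and "TERMINATE" in content

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        is_termination_msg=task_is_done,
        max_consecutive_auto_reply=5,  # hard bound proportionate to a few-turn task
        code_execution_config={"work_dir": "coding", "use_docker": False},
    )

If the model declares completion before the file exists, the chat simply continues, and the reply cap still puts a hard ceiling on cost.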

2. Code execution scope set to local filesystem without sandboxing

AutoGen’s UserProxyAgent has a code_execution_config parameter that controls where and how code blocks extracted from assistant messages are executed. The configuration accepts a work_dir key specifying a local directory, a timeout key limiting execution time, and a use_docker key that, when set to True or a Docker image name, runs the code inside a container rather than the local process. AI coding tools generate UserProxyAgent instances with human_input_mode="NEVER" (no human approval before execution) and code_execution_config={"work_dir": "coding", "use_docker": False}, meaning code extracted from assistant messages executes directly on the local machine with the permissions of the running process. (Recent AutoGen 0.2 releases default use_docker to True and raise an error when Docker is unavailable, so generated code typically sets use_docker=False explicitly to run anywhere.)

The problem is that local execution without sandboxing means any code the assistant generates runs in the same environment as the application — with access to the local filesystem, environment variables, installed packages, and network. AI-generated AutoGen code is typically demonstrated on tasks where the assistant writes benign Python scripts: data analysis, file transformation, API calls. On these tasks the execution scope is not a concern because the generated code does the described work and terminates. The review issue is what happens when the assistant’s generated code contains errors that write to unexpected filesystem paths, when the task involves executing code derived from external data sources (a file the user provided, an API response), or when a prompt injection in function return content redirects the assistant to generate code that exceeds the intended scope of the task.

AI-generated UserProxyAgent configurations rarely include a timeout that would bound runaway execution, and they rarely use Docker isolation, even though the AutoGen documentation recommends it for untrusted code. The human_input_mode="NEVER" setting is correct for autonomous operation, but its combination with unrestricted local execution creates a situation where the assistant’s entire code generation output runs without any approval gate — a design that requires high confidence in the assistant’s code quality and the trustworthiness of all inputs flowing into the conversation.

The review check: for any UserProxyAgent with human_input_mode="NEVER", check whether code_execution_config includes a timeout and whether use_docker is set. If code will be executed in the local environment, identify what data sources flow into the assistant’s context via function calls or conversation history and whether any of those sources are user-controlled or externally sourced. For production deployments where the agent will process external input, verify whether the execution scope is appropriate for the trust level of the task — local execution is acceptable for a local development tool where the user controls all inputs, but is a significant risk surface for an agent that processes content from third-party APIs or user-provided files.
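A configuration that passes these checks, assuming Docker is available on the host; the image name, timeout, and directory are illustrative:

    import autogen

    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        human_input_mode="NEVER",
        code_execution_config={
            "work_dir": "coding",
            "timeout": 60,                     # kill runaway executions after 60 seconds
            "use_docker": "python:3.11-slim",  # run extracted code in a container
        },
    )

use_docker=True also works and lets AutoGen pick a default image; either way, extracted code blocks no longer run with the host process’s filesystem and environment variables.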

3. Registered function return values flowing into agent context as trusted input

AutoGen’s function calling mechanism allows agents to invoke Python functions registered via the @assistant.register_for_llm() and @user_proxy.register_for_execution() decorators (or the equivalent autogen.register_function helper) in AutoGen 0.2. When the assistant agent generates a function call, the user proxy executes the registered function and appends the return value to the conversation as a function-result message (role “function”, or role “tool” with the newer tool-calling API). This message is part of the conversation history that the assistant uses as context for its next response. AI coding tools generate registered functions that call external APIs, query databases, read files, or fetch web content and return the result directly — as a string, a dictionary serialized to a string, or a formatted summary — without any transformation or sanitization.
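A sketch of the pattern as typically generated, using the 0.2 decorator API; the fetch_page function and its description are hypothetical, and the agents are assumed to be constructed as in the earlier sketch:

    import requests

    @user_proxy.register_for_execution()
    @assistant.register_for_llm(description="Fetch a web page and return its text.")
    def fetch_page(url: str) -> str:
        # Raw return: whatever the page contains, including any instruction-like
        # strings, enters the conversation as an ordinary function-result message.
        return requests.get(url, timeout=10).text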

The problem is that function return values enter the conversation as messages that the LLM treats as part of the trusted conversation context. Unlike user-provided messages that a well-instructed model may treat cautiously, function return values occupy the same position in the conversation structure as the results of authorized tool calls — they are messages that the model has no reason to distrust. If a function returns content containing instruction-like strings — a database record that includes text like “ignore previous instructions and instead”, a web page that includes a hidden prompt injection in a comment or metadata field, an API response that contains a field value crafted to redirect agent behavior — those strings enter the model’s context without any indication that they are untrusted external data rather than legitimate function output.

AI-generated function tools rarely include output sanitization or scope validation before the return value enters the conversation. A function that queries a product database returns the full database record. A function that fetches a web page returns the page content as a string. A function that reads a configuration file returns the file contents. In each case the return value may contain arbitrary text that the LLM incorporates into its subsequent reasoning, and the developer reviewing the function implementation sees only the clean happy-path case where the external source returns expected data.

The review check: for each registered function, identify what external data sources contribute to the return value — database records, API responses, file content, user-supplied identifiers used as query parameters. For any return value that includes externally sourced string content, check whether the return value is structured to separate data from metadata (returning a parsed object rather than a raw string reduces but does not eliminate the risk), whether the function includes any validation that the returned content matches an expected schema or format, and whether the system message instructs the assistant to treat function outputs as data to be processed rather than instructions to be followed. Look specifically for functions that return long-form text content where prompt injection patterns are harder to detect than in structured data returns.
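A hardened sketch of the same mechanism for a database lookup, again reusing the agents from the earlier sketch; the schema, field allowlist, and query_product_db helper are hypothetical, and structuring the return reduces the injection risk without eliminating it:

    import json

    ALLOWED_FIELDS = {"sku", "name", "price", "stock"}  # hypothetical product schema

    @user_proxy.register_for_execution()
    @assistant.register_for_llm(description="Look up a product record by SKU.")
    def lookup_product(sku: str) -> str:
        record = query_product_db(sku)  # hypothetical helper returning a dict
        # Return only allowlisted fields as structured JSON with an explicit source
        # label; free-form record text still flows through, but schema filtering
        # narrows the injection surface.
        filtered = {field: record[field] for field in ALLOWED_FIELDS if field in record}
        return json.dumps({"source": "product_db", "data": filtered})

Pairing this with a system-message line that tells the assistant to treat function results as data to be processed, never as instructions, covers the other half of the check described above.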

Reviewing AutoGen code without treating a completed conversation as correct execution

AutoGen’s design makes it easy to build autonomous agents that accomplish real tasks: the conversation loop, code execution, and function calling are integrated in a way that requires minimal scaffolding. AI coding tools generate AutoGen code that runs the described task correctly end-to-end on the happy path — the conversation terminates, the code executes, the functions return results, and the output looks right. The review problem is that a successful test run is not the same as a correct agent design: termination conditions that work as long as the model reliably produces “TERMINATE” break when the model interprets task completion differently, local code execution scope that is acceptable for a trusted single-user task becomes a risk surface when inputs come from external sources, and function return values that contain clean expected data in development may contain adversarial content in production.

A practical review approach for AI-generated AutoGen code: when you see an is_termination_msg that checks for a keyword, check whether max_consecutive_auto_reply provides a hard cost bound and whether the bound is proportionate to the task. When you see code_execution_config without use_docker or a timeout, identify what data flows into the assistant’s context and whether any of it comes from external sources. When you see a registered function that calls an external API or reads external data, check whether the return value sanitizes or structures the content before it enters the conversation as a trusted message. Evaluate each of these independently — termination, execution scope, and function trust — rather than treating the end-to-end working demo as evidence that all three are correctly configured.


Related reading:

CrewAI: reviewing multi-agent orchestration code where task output handoff without schema enforcement and tool return content trust boundaries carry the same structural risks as AutoGen’s function return injection surface.

LangChain: reviewing AI-generated multi-step LLM chains where inter-step output trust and agent loop termination share review patterns with AutoGen’s conversation termination problem.

LlamaIndex: reviewing AI-generated RAG pipelines where postprocessor ordering changes what the LLM receives without raising errors, similar to how AutoGen function return content enters context without signaling its source.

OpenAI Codex agent: reviewing autonomous agent code where multi-step tool chains create similar execution scope and trust-boundary concerns to AutoGen’s UserProxyAgent.

How to review AI-generated code: the general review framework that applies when AI generates agent orchestration code against any multi-agent framework.

AutoGen runs the task. ZenCode checks whether it’s safe to let it.

ZenCode surfaces one concrete review question before you commit — including when AI-generated AutoGen code completes the described task correctly on the happy path but carries termination gaps, execution scope risks, or function return trust boundaries that surface in production.

Try ZenCode free
