GitHub Copilot agent mode: how to review code when an in-IDE agent runs terminal commands and iterates autonomously
GitHub Copilot agent mode is a different kind of tool than the Copilot most developers know. Inline completions and chat are request-response: you write or ask, Copilot suggests, you accept or decline. Agent mode shifts that model entirely. You describe a goal, and Copilot’s agent takes over — reading your codebase, editing files, running terminal commands, observing the output, and iterating until it believes the task is done. You do not see each step as it happens. You see the accumulated result when the agent stops and hands control back.
That shift from request-response to goal-delegation creates a different review problem. The code you are reviewing was not generated in a single pass in response to a single prompt. It arrived through a sequence of actions the agent took autonomously, each informed by the output of the previous one. The final diff is a summary of that process, not a record of it. Three review traps follow directly from this architecture, and none of them appear when you use Copilot’s inline completions or chat.
The three Copilot agent mode attention traps
1. Terminal-output confirmation bias
When Copilot agent mode completes a task, it typically shows you what it did: which files it edited, which commands it ran, and the terminal output from those commands. If the agent ran your test suite and the tests passed, that output is visible. If it ran the linter and found no errors, that output is there too. The result looks like a completed CI pipeline — green across the board, verified and ready.
The problem is that the agent chose which commands to run. It decided to run the existing test suite, but it also decided which tests were “existing.” If the task required new behavior, the tests that verify that behavior may not exist yet, and the agent may not have written them. A passing test run against a test suite that does not cover the new code path is not validation — it is the absence of failure in the wrong place. The terminal output is accurate; the interpretation that it represents thorough verification is not.
This trap is structurally different from reviewing AI-generated code where you never see test output at all. When there is no evidence of testing, you know to evaluate coverage yourself. When there is visible green test output, the cognitive path to that same evaluation is blocked by the appearance of work already done. The fix is to treat agent-run tests as a baseline, not a ceiling. Before accepting the diff, ask which test cases would fail if the agent’s implementation were wrong — and verify those cases exist in the suite that ran.
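One way to make that evaluation concrete is to compare the lines the agent changed against the lines the test run actually executed. The sketch below is a hypothetical illustration of the core check, not a real tool: the file name, the data, and the `uncovered_changes` helper are all invented for this example, and in practice you would feed it line data from your diff and coverage tooling (utilities such as diff-cover automate this pattern for Python projects).

```python
def uncovered_changes(changed_lines, covered_lines):
    """Return changed lines that no test executed.

    changed_lines: {filename: set of line numbers the diff touched}
    covered_lines: {filename: set of line numbers executed during the test run}
    """
    gaps = {}
    for path, lines in changed_lines.items():
        missed = lines - covered_lines.get(path, set())
        if missed:
            gaps[path] = sorted(missed)
    return gaps

# Hypothetical data: the agent touched lines 10-14 of retry.py,
# but the green test run only executed lines 10-12.
changed = {"retry.py": {10, 11, 12, 13, 14}}
covered = {"retry.py": {10, 11, 12}}
print(uncovered_changes(changed, covered))  # {'retry.py': [13, 14]}
```

An empty result means every changed line was at least executed; a non-empty one means the green output you saw never touched part of the new code, which is exactly the gap the trap hides.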
2. Iteration-collapse illusion
Agent mode does not generate a solution in one step. It plans, edits, runs a command, observes the result, edits again in response to a failure, re-runs, and eventually arrives at a state it considers complete. You only ever see the final diff. The intermediate states — the approaches that were tried and discarded, the errors that caused the agent to reverse course, the temporary code that was written and then deleted — are invisible.
A diff generated through iteration looks different from a diff that was deliberately designed. Iterative code tends to have more defensive checks than necessary, guard clauses added in response to specific failures rather than systematic analysis, and structure that reflects the shape of the problem-solving path rather than the shape of the problem. To a reviewer seeing only the final state, these artifacts are indistinguishable from deliberate design choices. A null check at line 34 could be a correct invariant the developer considered, or it could be a patch the agent added because it hit a NullPointerException on iteration three and this was the fastest fix that made the test pass.
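The ambiguity is easy to show with a hypothetical example. The function and field names below are invented for illustration; the point is that the guard clause reads identically under both histories.

```python
def apply_discount(order_total, coupon):
    # Reading one: a deliberate invariant. Guest checkouts legitimately
    # have no coupon, and full price is the correct answer.
    # Reading two: a patch. The agent hit an error on an earlier iteration,
    # and this guard was the fastest change that made the tests pass,
    # leaving the question of why coupon was None unanswered.
    if coupon is None:
        return order_total
    return order_total * (1 - coupon["rate"])

print(apply_discount(100.0, None))           # full price
print(apply_discount(100.0, {"rate": 0.2}))  # discounted
```

Nothing in the diff tells you which reading is true. Only the task context does: if callers can legitimately pass no coupon, the check is design; if they never should, it is a symptom being suppressed.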
The fix is to read the diff for structural coherence, not just correctness. Code that was designed holds together — the abstractions are consistent, the error handling is uniform, the logic flows in one direction. Code that was iterated into existence tends to have local coherence without global coherence: each piece works on its own, but the pieces fit together awkwardly. If a diff looks like it was written by someone who changed their mind three times, it may have been — and the approaches that were discarded may have been discarded for reasons that were not fully resolved.
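Local-without-global coherence often shows up as mismatched conventions between pieces that were written on different iterations. The sketch below is hypothetical — the helpers and the config shape are invented — but each function is individually fine while the seam between them is not.

```python
# Each helper is locally coherent; together, their conventions clash.

def load_timeout(config):
    # Convention A: signal a missing value with a sentinel (None).
    return config.get("timeout")

def parse_timeout(raw):
    # Convention B: signal a bad value with an exception (ValueError).
    return float(raw)

def effective_timeout(config):
    raw = load_timeout(config)
    # The caller guards against convention B but not convention A:
    # a missing key sends None into float(), which raises TypeError,
    # an error path no one chose and no handler covers.
    try:
        return parse_timeout(raw)
    except ValueError:
        return 30.0
```

`effective_timeout({"timeout": "5"})` and `effective_timeout({"timeout": "oops"})` both behave; `effective_timeout({})` crashes. A reviewer checking each function in isolation would pass all three — which is why reading for how the pieces compose catches what reading for correctness misses.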
3. IDE-environment trust transfer
Copilot agent mode runs inside VS Code. Your editor, your terminal, your file tree, your keybindings — everything is exactly as you normally work. When the agent edits a file, it appears in the same editor pane you use for all your editing. When it runs a terminal command, the output appears in the same terminal you use for all your commands. The visual environment is identical to your normal development environment, and your normal development environment is the context in which you trust your own judgment.
That familiarity creates a subtle trust transfer. When a PR arrives from an external tool — from a GitHub bot, a CodeRabbit comment, an automated pipeline — there is a clear visual signal that this came from somewhere else. The code is foreign in its presentation even if it is syntactically indistinguishable. Agent mode removes that foreignness. The code appears in your editor, in your terminal, in your workspace. The cognitive association between “my trusted environment” and “this output is trustworthy” applies even though the agent’s decisions are as external as any bot’s.
The fix is to establish a deliberate handoff moment. When the agent stops and returns control, treat the transition as equivalent to opening a PR from an external contributor. Before reading the diff, close and reopen the changed files so you are seeing them as a reviewer rather than as a co-author. The goal is to break the continuity between your development session and the agent’s work so that the familiar environment stops lending its trust to the unfamiliar code.
What distinguishes in-IDE agents from external ones
The traps above are specific to agents that operate inside your development environment rather than outside it. Tools like Devin, Sweep AI, or Google Jules generate code and open PRs that you review through the normal code review interface — a pull request, a diff viewer, a review tool. The foreignness is built into the workflow. In-IDE agents like Copilot agent mode, Cline, or Claude Code operate inside the same environment you use for everything else, which removes the foreignness signal that normally prompts careful review.
The terminal-output confirmation bias and iteration-collapse illusion are present in any agentic tool. The IDE-environment trust transfer is unique to in-IDE agents, and it is the hardest trap to correct because it operates below the level of deliberate attention — you do not decide to trust the familiar environment, it happens automatically. The corrective is not to distrust VS Code but to recognize that the agent’s use of VS Code does not make the agent’s decisions part of your development environment. The environment is borrowed, not shared.
Copilot agent mode is a genuine capability step beyond inline completions and chat. The ability to delegate a multi-step task and return to a completed implementation is valuable, and it will become more central to how developers work. Using it well requires treating the output as what it is: the product of autonomous iteration inside your environment, reviewed as carefully as any external contribution, even when it arrives wearing your editor’s clothes.
Related reading: GitHub Copilot Edits review · GitHub Copilot chat code review · GitHub Copilot Workspace review · Cline AI agent review · Claude Code terminal agent review · GitHub Copilot Enterprise review · How to review AI-generated code
Build better habits for reviewing agentic AI output
ZenCode helps developers stay deliberate when reviewing code generated by autonomous agents in their IDE.
Get ZenCode