OpenAI Codex CLI: how to review code when an agent edits files autonomously in your terminal
Codex CLI is OpenAI’s open-source terminal coding agent. You give it a task in natural language — “add input validation to the signup endpoint” — and it reads your files, reasons about what to change, writes the code, and optionally runs tests to verify the result. It operates in a sandboxed environment: by default network access is disabled, filesystem writes are contained, and shell commands run in an isolated context. The goal is to give you the productivity of an autonomous coding agent with safety guardrails that prevent accidental damage to your development environment.
The sandbox is what makes Codex CLI feel trustworthy. The safety features are genuine: the agent cannot make API calls during development that hit your production systems, cannot delete files outside the working directory, and, depending on the approval policy you configure, cannot run arbitrary shell commands without your confirmation. But a trustworthy execution environment is not the same as a correct implementation, and the gap between those two things is exactly where the review traps form.
The three traps
1. Sandbox isolation as completeness proxy
Codex CLI’s sandbox is designed for safety, not for validation. When a Codex session completes — files written, tests run, agent reports success — the result is code that ran correctly in a clean, isolated, network-disabled environment. That is a different statement than “this code is correct in your production environment.”
Code that passes tests in the sandbox can silently fail against real conditions: an external API with a rate limit the sandbox never hit, a database schema with a constraint the sandbox didn’t check, an authentication middleware that runs in your actual request pipeline but not in the test fixture. The sandbox isolates the agent from causing unintended side effects. It does not simulate the full execution context the code will encounter after deployment.
The trap is that the sandbox’s safety story transfers to the code’s correctness story. “Codex ran it in a safe sandbox” and “this code is ready to ship” feel like the same statement because both describe the session outcome as controlled and verified. The safety mechanism is real; the correctness implication is a projection. Code that ran safely in a clean room still needs to be evaluated against what lives outside the clean room.
The fix: before evaluating the Codex output, name one real-environment condition the sandbox did not test. One concrete condition is enough — pick the production constraint most likely to interact with the change (authentication behavior, an external API call, a downstream schema assumption). Check that condition directly in the diff. The sandbox is a floor, not a ceiling.
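That one-condition check can be done directly against the working-tree diff. A minimal sketch, assuming the session left its changes unstaged in a git working tree; the search string "payments" and the idea that the change touches an external payments client are hypothetical examples, not anything Codex CLI produces:

```shell
# Pick one production condition the sandbox never exercised -- here,
# hypothetically, "the change touches the external payments client" --
# and pull only the diff hunks that mention it for direct inspection.
git diff --unified=5 | grep -n -B2 -A4 'payments'

# The pickaxe option -S lists only files whose change adds or removes
# the string itself, narrowing review to where the condition moved.
git diff -S 'payments' --name-only
```

Reading three relevant hunks closely beats skimming the whole diff; the point is to spend the review attention where the sandbox's coverage ends.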
2. Step-completion stream as oversight
Codex CLI shows its work. As the agent executes, you see each step stream to the terminal: “reading file src/auth/middleware.ts”, “searching codebase for session validation”, “writing file src/auth/middleware.ts”, “running tests”, “all tests passed”. The step log is detailed, real-time, and shows everything the agent did. Watching it creates a strong active-involvement feeling — you were present for the entire build.
This feeling is the same trap as watching an autonomous agent like OpenHands run bash commands in a Docker container. Watching an activity log is not the same as evaluating the output. The step stream tells you what the agent did. It does not tell you whether what it did was correct. Each “writing file” notification is a factual record of an action; it is not a review of the content of that action.
The deeper problem is that watching the step stream depletes the attention budget. A Codex session that reads six files, searches four code patterns, and writes three files produces a step log long enough to require real attention to follow. By the time the agent reports completion, you’ve spent cognitive resources on the log rather than on the diff. The watching itself is work; the work produces the feeling of review without the substance of it.
The fix: treat the step log as context, not as review. Glance at it to understand what the agent touched — which files were read, which were written. Then close the log and open the diff. The diff is the artifact; the step log is the audit trail. Read the audit trail after you’ve formed a view from the artifact, not before.
3. Auto-apply policy trust
Codex CLI has three approval policy modes. In suggest mode, the agent proposes every change and waits for your confirmation. In auto-edit mode, it applies file edits automatically but asks before running shell commands. In full-auto mode, it applies everything without prompting. The mode you choose determines when review happens — or whether it happens at all.
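The mode can be set per session or as a persistent default. A sketch of the config-file form, with the caveat that the file location and key names here match the original codex-cli release and may differ in your version — treat them as assumptions and confirm against `codex --help` and the project README:

```yaml
# ~/.codex/config.yaml -- default approval policy for every session.
# Key names are assumptions from the original codex-cli release.
model: o4-mini
approvalMode: suggest   # or: auto-edit, full-auto
```

Setting the default to suggest and escalating deliberately per task keeps the fastest mode an explicit choice rather than the ambient one.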
In suggest mode, each proposed change arrives as a diff you approve or reject. The review moment is explicit and mandatory. In auto-edit or full-auto mode, the only review moment is after the session completes, when you read the aggregate diff of everything that was applied. The difference matters because the review psychology is completely different: approving a proposed diff before it’s applied is an active evaluation; reviewing an already-applied diff under the glow of a successful session is a retrospective audit that has to fight against closure bias.
The trap activates when the policy mode is chosen for convenience rather than for the task at hand. Auto-edit mode makes multi-step sessions faster because you don’t approve individual file writes. For a well-understood codebase and a simple task, this is a reasonable trade. For a new codebase, a non-trivial feature, or a task where the scope can grow — where the agent might touch files you didn’t expect — the removed review friction is also the removed review moment. The session that completes cleanly feels done before the diff is read.
The fix: default to suggest mode for any task that touches code you haven’t looked at recently or that crosses module boundaries. Use auto-edit only for tasks where you can predict exactly which files will change. After any auto-edit or full-auto session, run git diff before doing anything else — before running the app, before writing a follow-up prompt, before closing the terminal. The diff-first discipline is the substitute for the per-change approval that suggest mode builds in.
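The diff-first discipline is short enough to run by reflex. A minimal sketch, assuming the session left its changes unstaged in a git working tree; the `src/auth/` path prefix in the scope check is an illustrative assumption, not a Codex CLI convention:

```shell
# Diff-first: run immediately after an auto-edit or full-auto session,
# before running the app or writing a follow-up prompt.
git diff --stat            # shape of the change: which files, how many lines
git diff                   # the full diff -- the artifact to review

# Hypothetical scope check: list anything the agent touched outside the
# module you expected it to change (the path prefix is illustrative).
git diff --name-only | grep -v '^src/auth/' || true
```

The `--stat` summary comes first on purpose: seeing an unexpected file in the summary reframes the full read from confirmation into investigation.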
How this differs from similar tools
Claude Code is the closest comparison: both are terminal-based autonomous coding agents that read your codebase, write files, and run shell commands. The key difference is that Claude Code’s --dangerously-skip-permissions mode and auto-approve settings are opt-in overrides, while Codex CLI’s policy model is a first-class configuration choice. Claude Code’s default is to ask permission for each action; the review friction is the baseline. In Codex CLI, the review friction is one of three modes, and the fastest mode has the least friction.
Aider shares the terminal diff model: both tools produce a git diff after a session that you review in your editor or terminal. Aider has no sandbox and no policy mode system — it applies changes directly to your filesystem and commits them. The re-prompt tax (rejecting an Aider diff means re-running the whole session) creates acceptance bias; Codex CLI’s suggest mode avoids this by showing individual changes before applying them, but only if you use it. Codex CLI’s sandbox is a genuine safety advantage over Aider for tasks that involve shell commands.
OpenHands shares the sandbox-isolation trap: its Docker container provides execution containment, not production-environment validation. OpenHands also shares the execution-watching trap, since its multi-turn session with bash command logs creates the same active-involvement feeling. The key difference is execution model: OpenHands runs a multi-turn agentic loop in a browser session; Codex CLI runs in your local terminal with direct access to your actual working directory (isolated by the sandbox, but still your files).
Cline uses the same per-action approval model as Codex CLI’s suggest mode, but in a VS Code panel rather than a terminal. Cline’s approval fatigue trap — clicking Approve 40 times per session turns conscious review into a reflex — applies to Codex CLI’s suggest mode as well. The terminal approval flow has the same reflex-hardening dynamic: after five correct suggestions in a row, the sixth approval fires before the diff is read.
What Codex CLI gets right
The sandbox is a genuine safety mechanism for a class of real risks. Shell commands that a less-constrained agent could run accidentally — deleting files, making production API calls, running database migrations against a live connection — are blocked by Codex CLI’s default isolation. The network-disabled mode prevents a category of accidental side effects that agentic coding tools without sandboxes are genuinely susceptible to. For developers working on codebases with external dependencies that would be expensive to hit accidentally, the sandbox is a real feature, not just a marketing claim.
The suggest mode, used consistently, is one of the better review-integrated approval flows among terminal coding agents. Showing the diff before applying it, as a default rather than an override, makes Codex CLI easier to use carefully than tools where the default path is apply-then-review. The review traps above emerge specifically when auto-edit or full-auto modes remove the built-in approval step — which is a choice you make, not a default the tool imposes.