OpenHands: how to review code when an autonomous agent builds the whole feature
OpenHands (formerly OpenDevin) is an open-source autonomous coding agent that operates differently from most AI coding tools. Where Copilot or Cursor suggest the next block of code, OpenHands executes across many turns: it reads your codebase, writes files, runs bash commands, installs packages, executes tests, reads the output, decides what to fix, and repeats. You give it a task description. It builds the thing. By the time it hands control back to you, it may have touched a dozen files, run thirty commands, and produced a passing test suite.
That autonomous execution is the entire point. And it is also the source of the review traps that are specific to OpenHands. The traps are not about whether OpenHands writes good code — it often does, particularly on well-scoped tasks with clear inputs and outputs. They are about what happens to your evaluation posture when you have been watching an agent build for fifteen minutes and it surfaces a working demo.
The three traps
1. Execution-log watching as oversight illusion
OpenHands shows you what it is doing: the bash commands it runs, the files it edits, the test output it reads. Watching this stream creates a strong feeling of active involvement. You see every git add, every npm install, every pytest invocation. The activity is visible and continuous. It feels like monitoring.
But there is a critical distinction between watching activity logs and reviewing code. The logs tell you what OpenHands did. They do not tell you whether what it did is correct. Seeing POST /api/users return a 201 in the test output tells you the route exists and responds. It does not tell you whether the route validates input before writing to the database, whether it respects authorization boundaries, or whether the user record it creates has the right field defaults. The execution log is an activity record. Code review is a correctness evaluation. Watching the former creates a strong psychological sense of having done the latter — without the actual content that would make it so.
This is a more intense version of the trap that Cline creates through its tool-use approval cadence: approving each tool call in a long sequence converts a deliberate review decision into a flow reflex. With OpenHands, you are not even approving individual steps — you are watching the agent operate with full autonomy, which makes the illusion of oversight even stronger while the actual oversight is zero.
2. Multi-turn sunk-cost accumulation
A typical OpenHands session for a non-trivial task runs ten to twenty turns. On each turn the agent decides what to read next, what to write, and what to run. After ten turns, you have invested significant time watching the session. The agent has produced a coherent result. Reverting and starting over means losing the entire build and re-investing that time. The sunk cost of the session creates a pressure toward acceptance that is unrelated to whether the code is correct.
The pressure is stronger than with Aider’s bulk-diff re-prompt tax, where rejecting a change means rewriting one prompt. With OpenHands, rejecting the result means abandoning a full autonomous session. The asymmetry is steeper. This does not make OpenHands’s output less reliable than Aider’s — it makes the human evaluation less reliable, because the cost of the rejection reflex has increased independently of the quality of the output.
The accumulation also affects what you are reviewing. After ten turns, you are not looking at a targeted change. You are reviewing a system: new files, modified files, new dependencies, a test suite, possibly configuration changes. The review surface has expanded beyond what a single session of focused attention can hold. Each component looks locally reasonable. The interactions between components — where integration bugs live — require holding the whole system in mind at once, exactly when your attention has been depleted by fifteen minutes of watching the build.
3. Sandbox isolation as correctness proxy
OpenHands runs in an isolated Docker container by default. The agent can install packages, run servers, execute destructive commands — all contained inside the sandbox, with no direct effect on your actual environment until you choose to apply the changes. This isolation is a genuine safety feature. It means you can let the agent work without worrying about it breaking your local setup or accidentally deleting files you need.
The trap is that the technical safety guarantee is not a code correctness guarantee. The sandbox contains the agent’s execution. It does not evaluate whether the agent’s choices were correct. Code that runs cleanly inside a Docker container with a clean database schema and no existing users can fail badly when deployed against a production database with six years of legacy data, an existing auth middleware chain, and rate limits on the external APIs it calls. The sandbox’s clean-room properties make the agent’s tests easier to pass — and easier to misread as production-validity tests.
Replit Agent creates the same trap through its managed platform: code that works in Replit’s environment depends on Replit’s infrastructure for CORS handling, secret injection, and port binding — and can fail outside that environment. OpenHands’s sandbox is more explicit and configurable, but the isolation-as-validation psychological mechanism is identical.
Three fixes
Set a turn checkpoint. Before starting an OpenHands session, decide in advance: you will interrupt the agent and review the current diff after turn five, or when the first test run completes, whichever comes first. Do not wait for the agent to declare it is finished. At the checkpoint, run git diff against the base and read the changes so far — not the test output, the actual diff. This breaks the session into reviewable segments before sunk-cost accumulation and attention depletion make review harder. If the diff at turn five has a correctness problem, addressing it with the agent now costs two turns. Addressing it after the agent declares done costs restarting the entire approach.
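A minimal sketch of what the checkpoint looks like in practice, assuming the agent is working on a branch cut from main (substitute your actual base branch):

```bash
# At the turn-five checkpoint (or after the first test run), pause the agent,
# then, in the sandbox workspace or a local checkout of its branch:

# 1. See how far the change surface has spread so far
git diff --stat main...HEAD

# 2. Read the actual changes, not the test output
git diff main...HEAD

# 3. If the diff has a correctness problem, redirect the agent now,
#    while the fix still costs two turns instead of a restart.
```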
Diff-first, tests-second. When OpenHands completes a task, the default evaluation path is: run the demo, run the tests, see them pass, accept the changes. Reverse that order. Open the diff first — git diff main or the equivalent — and read it as you would read any code change. Find the entry point the agent created or modified and read from there. Find the auth check, the input validation, the error paths. This takes fifteen to twenty minutes for a substantial change and it is not the same as watching the agent build for fifteen minutes. Watching is passive. Reading the diff is active evaluation. Run the tests second, after you have formed an opinion about what the code does.
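A sketch of the reversed order as a shell sequence; the file paths are hypothetical stand-ins for whatever entry point and middleware the agent actually touched:

```bash
# 1. Map the change surface before looking at any demo or test output
git diff --stat main

# 2. Read the entry point the agent created or modified
git diff main -- src/api/users.py                   # hypothetical entry point

# 3. Then the places where correctness problems live: auth, validation, errors
git diff main -- src/middleware/ src/validators/    # hypothetical paths

# 4. Only after forming an opinion about what the code does, run the tests
pytest
```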
Check the boundary conditions the sandbox did not test. OpenHands’s tests run against a clean-room sandbox. Before applying changes, name the three conditions that the sandbox does not reproduce: existing data in your database, your production auth middleware chain, and the external API rate limits or response shapes you depend on. For each one, read the relevant code path in the generated files and ask whether it handles the production condition correctly. The agent’s tests confirm the happy path in a clean environment. Your three boundary condition checks cover what the tests structurally cannot. The same principle applies across AI code review generally: tests passing is a floor condition, not a ceiling.
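No command can verify these conditions for you, but a few greps over the diff can at least point you at the code paths to read for each of the three. The patterns below are assumptions about common naming conventions, a starting point for reading rather than a check in themselves:

```bash
# The reading list: every file the agent touched
git diff --name-only main

# Existing data: does the new write path handle duplicates and legacy rows?
git diff main | grep -inE "unique|on conflict|integrity|migration"

# Auth: does the new route go through your middleware chain or around it?
git diff main | grep -inE "auth|permission|current_user|token"

# External APIs: any handling for rate limits, timeouts, or odd response shapes?
git diff main | grep -inE "retry|backoff|429|timeout"
```

An empty result on any of these greps is itself informative: it usually means the agent's code does not mention the concern at all, which is exactly the kind of gap a clean sandbox will never surface.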
What OpenHands gets right
OpenHands is genuinely capable on well-scoped tasks where the inputs and outputs are clear, the codebase is small enough for the agent to read in full, and the correctness criteria map cleanly to test assertions. For building a self-contained utility, scaffolding a new module from a spec, or implementing a documented API integration, the autonomous multi-turn approach can produce a working first draft faster than you would write it from scratch. The traps described above are not arguments against using it. They are arguments for treating its output as a first draft to be reviewed rather than a finished implementation to be deployed.
The comparison point is Devin, the commercial autonomous coding agent. Both operate across many turns with full execution autonomy. OpenHands is open-source and self-hostable, which means you control the sandbox configuration and can inspect every component. The core review traps — execution-log oversight illusion, sunk-cost accumulation, and sandbox isolation as correctness proxy — appear in both tools because they follow from the autonomous multi-turn architecture rather than from either tool’s specific implementation.
ZenCode — stay in review mode during AI generation gaps
A VS Code extension that surfaces a 10-second breathing pause during AI generation gaps — keeping you in active review mode instead of passive waiting mode when the output lands.
Get ZenCode free