Qodo Gen: how to review code when AI-generated tests make it feel already verified
Qodo Gen (formerly CodiumAI) adds test generation to VS Code. You write a function, click “Generate tests,” and 30 seconds later you have 8–15 unit tests that all pass. New file, clean test suite, full green. That is the design goal, and Qodo delivers on it. It is also where the review trap begins.
Qodo occupies a different position in the AI coding tool landscape than inline completion tools like GitHub Copilot or Tabnine. Those tools generate code you then decide to accept. Qodo generates tests that are supposed to tell you whether your code is correct; what they actually describe is what your code currently does. The distinction between “describes current behavior” and “verifies correctness” is where the three review traps live.
The three traps
1. Circular test validation
Qodo’s test generation works by analyzing the function’s implementation. The model reads your code, infers the expected inputs and outputs, and generates assertions that match the behavior it observed. This means the tests are derived from the same source as the code: when the model reads a function with a subtle bug, it generates tests that reflect that buggy behavior as the expected result.
The result is circular validation. Tests always pass because they were written to match the implementation rather than to verify it against an independent specification. When you see 12 green tests, your brain reads “verified.” But those 12 tests are behavioral descriptions of the current implementation — including any incorrect current behavior. The green signal is real; what it confirms is not what you think it confirms.
The trap is subtle because Qodo’s tests look correct. They use realistic input values, meaningful test names, and exercise multiple code paths. The tests are structurally plausible and syntactically valid. What they lack is an independent source of truth — a requirement or specification that exists outside the implementation and against which the implementation can be falsified. Without that independence, the tests describe the code; they do not verify it.
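To make the circularity concrete, here is a minimal sketch, not actual Qodo output; the function, file names, and values are hypothetical. The spec calls for free shipping on carts of five items or more, the implementation gets the boundary wrong, and a Jest-style suite derived from reading that implementation encodes the wrong boundary as the expected result:

```ts
// shipping.ts (hypothetical module under test)
// Spec: free shipping for carts of 5 items or more.
export function shippingFee(itemCount: number): number {
  return itemCount > 5 ? 0 : 7; // bug: the spec requires >= 5
}

// shipping.test.ts (the kind of suite a generator derives by reading
// the implementation: expected values are inferred from the code above)
import { shippingFee } from "./shipping";

test("free shipping for large carts", () => {
  expect(shippingFee(6)).toBe(0);
});

test("charges a fee at exactly 5 items", () => {
  expect(shippingFee(5)).toBe(7); // green, but it contradicts the spec
});
```

Both tests pass. The second one is the bug, captured as a passing expectation; nothing in the suite can fail because of it.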
2. PR review shorthand (Qodo Merge)
Qodo’s PR review feature (Qodo Merge, previously PR-Agent) posts automated code review comments on your GitHub pull request. The comments appear in the review tab, formatted like code review feedback, attached to specific diff lines, flagging potential issues: unused variables, type inconsistencies, missing null checks, security patterns.
The review interface is the trap. GitHub’s PR review UX was built for human reviewers who own their comments and reason through correctness end-to-end. Qodo posts its analysis in the same location, using the same visual language — comment thread, line attachment, “resolved” button. When you address Qodo’s 14 comments and the thread goes quiet, the review-complete signal fires on the same trigger as it would from a human reviewer: “no more open issues.”
Qodo’s review is genuine within its scope. It catches real issues that pattern-matching can reach. It cannot reason about whether the algorithm is correct for the problem, whether the business logic matches the specification, or whether an approach that looks locally valid creates a systemic problem upstream. The tool narrows the review; the interface makes it feel like the review is finished.
This is structurally similar to the trap in GitHub Copilot Chat: receiving articulate, well-formatted feedback from an AI creates a review-complete feeling that the same feedback from a tool without that interface would not create to the same degree.
3. Coverage-as-confidence
After Qodo generates tests, code coverage improves significantly. A function with 20% test coverage from your previous manual tests might jump to 85% after Qodo’s additions. The number is real — 85% of lines are now executed during the test run.
Coverage measures lines executed, not correctness of execution. A test that calls processOrder(input) and asserts the response code is 200 still contributes coverage regardless of whether the order was processed correctly. More specifically: generated tests that accurately describe incorrect behavior still increase coverage. The metric improves; the underlying problem remains.
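As a sketch of that pattern, here is a hypothetical processOrder with a pricing bug and a Jest-style test that raises coverage without noticing it (the names and values are illustrative, not taken from any real suite):

```ts
// order.ts (hypothetical handler with a pricing bug)
type Order = { sku: string; quantity: number; unitPrice: number };
type Result = { status: number; charged: number };

export function processOrder(order: Order): Result {
  const charged = order.unitPrice; // bug: should be unitPrice * quantity
  return { status: 200, charged };
}

// order.test.ts (a coverage-raising test that never inspects the charge)
import { processOrder } from "./order";

test("processOrder responds with 200", () => {
  const result = processOrder({ sku: "ABC-1", quantity: 3, unitPrice: 25 });
  expect(result.status).toBe(200); // passes; the wrong charge goes unnoticed
});
```

Every line of processOrder executes during this test, so the coverage report improves; the incorrect charge is never asserted on, so it survives.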
High coverage from AI-generated tests provides no more safety than the tests themselves provide, and those tests inherit whatever the model assumed when it read the code they are meant to check. The coverage number has been a quality proxy for so long that the inference “coverage went up, quality improved” fires automatically. Here, the proxy breaks. Coverage from Qodo-generated tests is not coverage in the traditional sense; it is a measurement of how well the model understood the code, expressed as a green bar.
Three fixes
Write one spec-derived test before running Qodo. Choose the edge case you are most uncertain about: the input your function should handle that you are not sure it handles correctly. Write one assertion that comes from the specification or requirement, not from reading the implementation. That test exists independently of Qodo’s analysis. When Qodo’s 12 tests pass alongside your 1 spec test, the picture is different: 1 independent verification plus 12 behavioral descriptions. Your spec test cannot be absorbed into circular validation because it was written before you knew what the implementation would produce. The cost is one minute before running Qodo; the payoff comes every time the implementation contains a silent assumption you had not caught.
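Continuing the hypothetical shippingFee sketch from earlier, the spec-derived test is one assertion taken from the requirement rather than from the code:

```ts
// spec-first.test.ts (written before running test generation)
import { shippingFee } from "./shipping";

// Requirement: free shipping starts at exactly 5 items.
test("spec: no fee for a 5-item cart", () => {
  expect(shippingFee(5)).toBe(0); // fails against the buggy implementation
});
```

Against the boundary bug above, this test fails while every generated test stays green, which is exactly the signal the generated suite cannot produce on its own.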
Read Qodo’s PR review comments as a triage pass, then read the diff anyway. Go through Qodo’s PR comments first — they are real signals and they narrow the diff space. Mark the issues, address them if needed. Then treat the end of the comment thread as the beginning of your own diff read, not the end of the review. The interface will not distinguish “triage complete” from “review complete” — you have to maintain that distinction yourself. Qodo’s analysis covers pattern-matchable issues; your diff read covers correctness and intent.
Name the falsification scenario after test generation. After Qodo generates tests, spend 30 seconds on one question: what input could make every generated test pass while the function is still wrong for your actual use case? Name one concrete scenario. If you can name it, add a test. If you genuinely cannot name one, the check itself is the value — you have done an independent evaluation pass rather than accepting the circular validation at face value. This habit collapses the generated-tests-as-verification assumption by forcing one moment of spec-thinking after Qodo runs, when the green signal is loudest and the review instinct is quietest.
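As an illustration, again using the hypothetical processOrder sketch: one named falsification scenario (“could every generated test pass while a zero-quantity order is still accepted?”) and the test it turns into. The expectation comes from the use case, so it can fail, unlike a test derived from the implementation:

```ts
// falsification.test.ts (added after reviewing the generated suite)
import { processOrder } from "./order";

test("rejects zero-quantity orders", () => {
  const result = processOrder({ sku: "ABC-1", quantity: 0, unitPrice: 25 });
  // A suite derived from the implementation would never contain this
  // expectation; here it fails and surfaces the gap.
  expect(result.status).toBe(400);
});
```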
What Qodo gets right
Qodo Gen is a genuine productivity tool. Test generation that would take 30–60 minutes by hand finishes in 30 seconds, and the tests it produces catch regressions reliably in code that is not changing: they record the behavior at the time of generation, which is exactly the right thing to test when the requirement is “this should not break.” The review traps appear specifically in the context of new or changed code, where the tests need to verify correctness rather than just lock in existing behavior.
The practical dividing line: if you are generating tests for stable code that already works and you want regression protection, Qodo’s circular validation property is a feature. The tests will accurately describe current behavior and fire if that behavior changes. If you are generating tests for new code to verify it works, the circular validation property is the trap — the tests will accurately describe current behavior regardless of whether that behavior is correct. The spec-test-first practice applies to the second case, not the first.
ZenCode — stay in review mode during AI generation gaps
A VS Code extension that surfaces a 10-second breathing pause during AI generation gaps — keeping you in active review mode instead of passive waiting mode when the output lands.
Get ZenCode free