Playwright AI: how to review end-to-end tests when AI generates your browser automation scenarios

2026-05-03 · 5 min read · ZenCode

GitHub Copilot, Cursor, and dedicated AI test generation tools can write Playwright end-to-end tests from a prompt, a description of a user flow, or an existing page component. The output is a .spec.ts file with page.goto(), page.click(), expect() assertions, and sometimes multi-step flows through your application. The tests look complete — they cover the scenario you described, they parse cleanly, and they pass on the first run against your development environment.

The traps in AI-generated Playwright tests are not about whether the tests run. They usually do run, at least locally. The traps are about what the tests are actually testing, whether they will keep running as the application changes, and whether they will behave the same way in your CI environment as they do on your machine. Three patterns appear consistently across AI-generated Playwright test suites.

The three Playwright AI review traps

1. Selector brittleness from inferred DOM

When you ask an AI tool to write a Playwright test for a feature, it does not have access to your live application’s rendered DOM unless you paste it into the prompt. The AI infers what the DOM structure probably looks like based on your component code, your description, or patterns from its training data. The selectors it generates reflect that inference, and inferred selectors tend to be structural: div.modal-container button:last-child, section.results > ul > li:first-child, or positional nth-child patterns that encode the current DOM shape.
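A minimal sketch of what that inference tends to produce, with the route and DOM shape assumed for illustration rather than taken from a real application:

```ts
import { test, expect } from '@playwright/test';

// Hypothetical AI-generated test: every selector encodes a guess about the DOM shape.
test('shows search results', async ({ page }) => {
  await page.goto('/search?q=widgets'); // assumed route
  await page.click('div.modal-container button:last-child'); // assumes the dismiss button is the last child
  const firstResult = page.locator('section.results > ul > li:first-child'); // assumes results render as a <ul>
  await expect(firstResult).toBeVisible();
});
```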

The specific problem with structural selectors is that they bind the test to the implementation, not the behavior. A UI redesign that does not change any user-visible behavior — restructuring a list into a grid, wrapping a button in a tooltip container, splitting a form into multiple steps — breaks the test at the selector level even though the feature is unchanged. The test fails not because the feature is broken but because the CSS path no longer matches.

AI tools also generate selectors by inferring attribute values that may not exist. A test that uses getByTestId('submit-form') implies that your form element has data-testid="submit-form". If the AI inferred that attribute from naming conventions in your codebase, it may not actually be there. The test fails immediately, but the failure mode is a missing attribute rather than a broken feature — and it is easy to fix the test by adding the attribute rather than by reviewing whether the selector strategy was sound.

The fix is to read every selector in an AI-generated test against your actual rendered DOM before committing it. Open the application in a browser, inspect the element the test targets, and verify that the selector matches and that it is stable — not dependent on child ordering, not relying on an inferred attribute that may not exist. Where possible, replace inferred structural selectors with role-based or text-based locators: getByRole('button', { name: 'Submit' }) survives DOM restructuring as long as the button’s accessible name is unchanged.
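A sketch of the same flow rewritten against the rendered page with role-based locators; the accessible names here ('Dismiss', 'Search results') are assumptions that have to be checked against what the real markup exposes:

```ts
import { test, expect } from '@playwright/test';

// Selectors tied to user-visible behavior rather than DOM structure.
test('shows search results', async ({ page }) => {
  await page.goto('/search?q=widgets');
  await page.getByRole('button', { name: 'Dismiss' }).click();
  const results = page.getByRole('list', { name: 'Search results' }); // requires an accessible name on the list
  await expect(results.getByRole('listitem').first()).toBeVisible();
});
```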

2. Happy-path coverage illusion

AI generates tests for the scenario you described. If you ask for a test for the login flow, you get a test that fills in valid credentials and asserts that the dashboard loads. If you ask for a test for the checkout flow, you get a test where the cart has items, payment succeeds, and the confirmation page renders. The test is complete for the happy path. It passes. It gives you a green check in your CI pipeline. And it creates a specific false signal: the feature is tested.
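A representative sketch of the kind of test this produces; the route, field labels, and credentials are placeholders, not anything prescribed by a specific tool:

```ts
import { test, expect } from '@playwright/test';

// Representative happy-path test: valid credentials, one success assertion, nothing else.
test('user can log in', async ({ page }) => {
  await page.goto('/login'); // assumed route
  await page.getByLabel('Email').fill('user@example.com'); // placeholder credentials
  await page.getByLabel('Password').fill('correct-password');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page).toHaveURL(/\/dashboard/); // asserts only the success path
});
```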

The feature is not tested. The happy path is tested. The error conditions — invalid credentials, expired sessions, failed payment responses, network timeouts, form validation failures, empty states, concurrent session conflicts — are invisible in the prompt and absent from the generated tests. This is not a failure of the AI to anticipate edge cases. It is the expected behavior of a tool that generates exactly what you asked for. The problem is that the generated test suite looks complete because it covers the scenario, and looking complete suppresses the impulse to enumerate what is missing.

The practical consequence is that error paths accumulate without test coverage while the happy path remains green. A regression in your error handling — the login form now shows a generic “Something went wrong” message instead of “Invalid credentials” for failed logins — will not be caught by an AI-generated test suite built entirely from success-flow prompts.

The fix is to treat AI-generated tests as a starting point rather than a coverage report. After reviewing the generated happy-path tests, explicitly enumerate the failure conditions for the same flow: what happens if the first API call fails, what happens if the user is unauthenticated, what happens if validation rejects the input. Each enumerated condition is either a missing test case or an acknowledged gap. AI can generate those additional tests once you specify the conditions — the issue is that specifying them requires you to think through the failure modes before prompting, not after reviewing the output.
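A sketch of what two of those enumerated conditions look like as tests, assuming the error message carries an alert role and that unauthenticated visits redirect to the login page; both details need to be confirmed against the real application:

```ts
import { test, expect } from '@playwright/test';

// Each enumerated failure condition becomes its own case.
test('rejected credentials show a specific error', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('wrong-password');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByRole('alert')).toContainText('Invalid credentials'); // assumed error copy
});

test('unauthenticated user is redirected from the dashboard', async ({ page }) => {
  await page.goto('/dashboard'); // no session set up
  await expect(page).toHaveURL(/\/login/); // assumed redirect behavior
});
```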

3. Optimistic async timing

Playwright is an async-first framework: every interaction is awaited, and expect() assertions include built-in auto-retry. AI-generated tests use await page.click(), await page.fill(), and await expect(locator).toBeVisible() correctly at the syntactic level. The tests run. They pass locally. In CI, they become flaky.

The timing problem is that AI generates tests calibrated to development environment conditions: fast local API responses, warm browser cache, single isolated test run, no resource contention. In CI — containerized Chromium, cold cache, parallel test workers, slower network responses from your staging API, occasionally degraded infrastructure — the same sequence of awaited interactions can hit timing windows that the default retry timeouts do not cover.

The signature is intermittent failures that are hard to reproduce locally. The test fails with TimeoutError: locator.click: Timeout 30000ms exceeded for a button that is clearly visible when you replay the CI trace, or with an assertion failure because the element was present but not yet interactive. The root cause is usually that the AI generated a sequence that assumes the page is in a stable state after each navigation, but the actual page has an in-flight network request or an animation in progress that makes the target element technically present but not actionable.
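A hypothetical example of the shape this takes: every interaction is awaited correctly, but nothing waits for the network request the click triggers:

```ts
import { test, expect } from '@playwright/test';

// Hypothetical flaky sequence: the filter fires a fetch, but nothing waits for it.
test('filtering narrows the results', async ({ page }) => {
  await page.goto('/orders');
  await page.getByRole('button', { name: 'Last 30 days' }).click(); // kicks off a network request
  // Passes locally where the response is near-instant; in CI it can assert against
  // stale rows still in the DOM, or time out while the loading overlay is up.
  await expect(page.getByTestId('order-row').first()).toBeVisible();
});
```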

The fix is to audit every async sequence for its implicit timing assumptions. For any test step that follows a navigation or a user action that triggers a network request, identify the explicit observable completion signal: the loading spinner disappearing, a specific response returning, a URL change completing, or a new element appearing. Replace optimistic waits with await page.waitForResponse() on the relevant request, or with a waitFor condition on the element that signals the state transition is complete — not the element you are about to interact with, but the indicator that the preceding async operation finished. The Playwright locator API retries automatically, but retry alone does not compensate for a missing explicit wait on a slow network response.
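A sketch of the same flow with the completion signal made explicit; the /api/orders endpoint and the loading-spinner test id are assumptions about the application under test:

```ts
import { test, expect } from '@playwright/test';

test('filtering narrows the results', async ({ page }) => {
  await page.goto('/orders');

  // Register the wait before the click so the response cannot be missed.
  const ordersReloaded = page.waitForResponse(
    (response) => response.url().includes('/api/orders') && response.ok()
  );
  await page.getByRole('button', { name: 'Last 30 days' }).click();
  await ordersReloaded; // the request the click triggered has finished

  await expect(page.getByTestId('loading-spinner')).toBeHidden(); // the UI signals the transition is done
  await expect(page.getByTestId('order-row').first()).toBeVisible();
});
```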

Reviewing AI-generated Playwright tests without mistaking passing for tested

The three traps share a structure: AI generates tests that pass for the conditions it could see or infer, and the tests look complete because they are syntactically correct, they run, and they cover the described scenario. The review gap is the distance between “tests that run” and “tests that protect against regressions in the feature.”

Selector brittleness means tests that pass today break on UI changes that do not affect behavior — your test coverage degrades silently every time a component is restyled. Happy-path coverage illusion means error paths accumulate without tests while the happy path stays green. Optimistic async timing means a test suite that passes locally becomes an intermittent failure generator in CI, training the team to treat red builds as noise rather than signal.

The practical review checklist for AI-generated Playwright tests: verify every selector against the live DOM and prefer role-based or text-based locators; enumerate the error conditions for each covered flow and confirm they have test cases; for every navigation and user action that triggers a network call, identify the explicit completion signal and add a waitForResponse or visible completion indicator rather than relying on default retry timeouts. These three checks do not replace the AI-generated baseline — they make it reliable enough to function as regression coverage rather than as documentation of the happy path.


Related reading: Qodo Gen on the analogous traps in AI-generated unit tests — where the test passes because it was generated to match the implementation rather than to verify behavior. CodeRabbit on automated PR review that sees your test diffs and can surface missing coverage. GitHub Copilot PR code review on using Copilot to review pull requests that include both implementation and test changes. Semgrep on static analysis that can catch structural patterns in test code including common async anti-patterns. How to review AI-generated code for the five-check framework that applies across all AI-generated test suites.

The AI generated tests that pass. ZenCode asks whether they test what you think they test.

ZenCode surfaces one concrete review question before you commit — separate from whether the selectors match, whether the coverage looks complete, or whether the tests passed locally.

Try ZenCode free

More posts on AI-assisted coding habits