AI Test Generation with Claude & Playwright
How to use Anthropic Claude to automatically generate Playwright test scripts from user stories and design specs — a practical guide with real code.
One of the most exciting applications of Generative AI in QE is automated test generation. Instead of manually writing test cases from requirements, you can use an LLM to do the heavy lifting — and then review, refine, and run them.
Here's a practical workflow I've been using with Anthropic Claude + Playwright.
The problem with manual test writing
Writing comprehensive test suites is time-consuming:
- A single feature can require 20–50 test cases when you account for edge cases
- Requirements change and tests go stale
- Engineers often skip writing negative tests due to time pressure
- Coverage gaps are discovered too late (in production)
The AI-assisted approach
The workflow looks like this:
```
Requirements / User Story
        ↓
Prompt Claude with context
        ↓
Generated test cases (JSON)
        ↓
Generated Playwright scripts
        ↓
Human review + refinement
        ↓
CI/CD integration
```
Step 1: Structure your prompt
The quality of generated tests depends heavily on your prompt. Here's a template that works well:
```typescript
const systemPrompt = `
You are an expert QA engineer specializing in Playwright test automation.
Given a user story or feature description, generate comprehensive test cases covering:
1. Happy path scenarios
2. Edge cases and boundary values
3. Negative/error scenarios
4. Accessibility checks

Output format: TypeScript Playwright test file with proper describe/test structure.
Use data-testid selectors where possible. Include meaningful test names.
`;

const featureDescription = `
Feature: User Login
- Users can log in with email and password
- Email must be valid format
- Password must be at least 8 characters
- After 5 failed attempts, account is locked for 30 minutes
- Successful login redirects to /dashboard
`;
```
Step 2: Call the API
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function generateTests(featureDescription: string): Promise<string> {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 4096,
    system: systemPrompt,
    messages: [
      {
        role: 'user',
        content: `Generate Playwright tests for this feature:\n\n${featureDescription}`,
      },
    ],
  });

  const content = response.content[0];
  if (content.type === 'text') {
    return content.text;
  }
  throw new Error('Unexpected response type');
}
```
Step 3: What gets generated
Claude typically generates something like this:
```typescript
import { test, expect } from '@playwright/test';

test.describe('User Login', () => {
  test.beforeEach(async ({ page }) => {
    await page.goto('/login');
  });

  test('successful login redirects to dashboard', async ({ page }) => {
    await page.fill('[data-testid="email-input"]', 'user@example.com');
    await page.fill('[data-testid="password-input"]', 'SecurePass123');
    await page.click('[data-testid="login-button"]');

    await expect(page).toHaveURL('/dashboard');
    await expect(page.locator('[data-testid="user-greeting"]')).toBeVisible();
  });

  test('invalid email format shows validation error', async ({ page }) => {
    await page.fill('[data-testid="email-input"]', 'not-an-email');
    await page.fill('[data-testid="password-input"]', 'SecurePass123');
    await page.click('[data-testid="login-button"]');

    await expect(page.locator('[data-testid="email-error"]'))
      .toHaveText('Please enter a valid email address');
  });

  test('account locks after 5 failed attempts', async ({ page }) => {
    for (let i = 0; i < 5; i++) {
      await page.fill('[data-testid="email-input"]', 'user@example.com');
      await page.fill('[data-testid="password-input"]', 'WrongPassword');
      await page.click('[data-testid="login-button"]');
    }

    await expect(page.locator('[data-testid="lockout-message"]'))
      .toContainText('account has been locked');
  });

  // ... more tests generated automatically
});
```
Step 4: Review and refine
AI-generated tests need human review for:
- Selector accuracy — the AI guesses `data-testid` values; you need to verify or adjust
- Assertion completeness — add business-specific assertions the AI couldn't know
- Test data — replace placeholder values with real test data
- Flakiness — add appropriate wait strategies where needed
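As a concrete example of the test-data point, here's a small post-processing pass that swaps the placeholder values Claude tends to emit for entries from a test-data map. The placeholder/replacement pairs below are illustrative assumptions, not values from a real project:

```typescript
// Swap placeholder values from generated tests for real test data.
// NOTE: the placeholder/replacement pairs here are illustrative
// assumptions, not values from a real project.
const TEST_DATA: Record<string, string> = {
  'user@example.com': 'qa-user-01@mycompany.test',
  'SecurePass123': process.env.TEST_USER_PASSWORD ?? 'local-dev-password',
};

function injectTestData(generatedCode: string): string {
  // split/join replaces every occurrence without regex-escaping concerns
  return Object.entries(TEST_DATA).reduce(
    (code, [placeholder, replacement]) => code.split(placeholder).join(replacement),
    generatedCode,
  );
}
```

Run this over the generated file before the first review pass; any placeholder that survives is a hint the prompt needs more project context.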
Measuring the impact
In a recent project, we tested this approach on a 12-feature sprint:
| Metric | Manual | AI-Assisted |
|---|---|---|
| Time to first test | 45 min | 8 min |
| Tests per feature | 8 avg | 22 avg |
| Edge case coverage | 40% | 78% |
| Review time | — | 15 min |
What's next: Agentic test maintenance
The next evolution is self-healing tests — AI agents that:
- Detect when tests fail due to UI changes (not bugs)
- Identify the changed selector or element
- Automatically update the test script
- Open a PR for human approval
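To make the "identify the changed selector" step concrete, here's a toy sketch: when a test fails because its `data-testid` no longer exists, match it against the ids currently in the DOM by edit distance. The function names and the acceptance threshold are my own illustration, not a reference implementation:

```typescript
// Toy sketch of selector-drift detection: suggest the closest currently
// existing data-testid for one that disappeared. Names and the acceptance
// threshold are illustrative assumptions.
function editDistance(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function suggestReplacement(missingId: string, currentIds: string[]): string | null {
  const ranked = currentIds
    .map(id => ({ id, d: editDistance(missingId, id) }))
    .sort((x, y) => x.d - y.d);
  // Only suggest when the best candidate is reasonably close to the old id
  return ranked.length > 0 && ranked[0].d <= missingId.length / 2
    ? ranked[0].id
    : null;
}
```

A real agent would combine this with DOM context and the failing test's intent, but even this crude heuristic narrows the candidates for a human (or a follow-up Claude call) to confirm.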
That's a topic for a future post. For now, start with generation — the ROI is immediate.
Want to see this in action? Check the GitHub repo for a working example with the full pipeline.
Building a production-grade test generation pipeline
The examples above cover the fundamentals. Here's how to build a pipeline that generates, validates, and commits tests automatically as part of your development workflow.
The generation pipeline architecture
```
Pull Request opened
        ↓
Extract changed files + requirements from PR description
        ↓
Claude analyzes changes → generates test scenarios
        ↓
Claude generates Playwright test code for each scenario
        ↓
Pipeline runs generated tests against staging
        ↓
Passing tests → auto-committed to test branch
Failing tests → posted as PR comments for QA review
```
Extracting context from a PR
```typescript
// scripts/generate-tests-from-pr.ts
import Anthropic from '@anthropic-ai/sdk'
import { readFileSync } from 'fs'

const client = new Anthropic()

async function generateTestsForPR(prDescription: string, changedFiles: string[]) {
  // Read the actual changed source files for context
  const sourceContext = changedFiles
    .filter(f => f.endsWith('.ts') || f.endsWith('.tsx'))
    .slice(0, 5) // Limit context size
    .map(f => {
      try { return `// ${f}\n${readFileSync(f, 'utf8').slice(0, 2000)}` }
      catch { return '' }
    })
    .join('\n\n')

  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `
You are a QA engineer writing Playwright tests.

PR Description:
${prDescription}

Changed source files (excerpt):
${sourceContext}

Generate Playwright TypeScript tests that:
1. Cover the happy path described in the PR
2. Cover at least 3 negative/edge cases
3. Use data-testid selectors where possible
4. Include meaningful assertions (not just "element is visible")
5. Follow the Page Object Model pattern

Output ONLY valid TypeScript code, no explanation.
      `
    }]
  })

  return response.content[0].type === 'text' ? response.content[0].text : ''
}
```
Validating generated tests before committing
Never commit AI-generated tests without validation:
```typescript
import { execSync } from 'child_process'
import { writeFileSync, unlinkSync } from 'fs'

async function validateGeneratedTest(testCode: string): Promise<boolean> {
  const tmpFile = `/tmp/generated-test-${Date.now()}.spec.ts`

  try {
    writeFileSync(tmpFile, testCode)

    // TypeScript type check
    execSync(`npx tsc --noEmit --skipLibCheck ${tmpFile}`, { stdio: 'pipe' })

    // Run the test (with timeout)
    execSync(
      `npx playwright test ${tmpFile} --timeout=30000`,
      { stdio: 'pipe', timeout: 60000, env: { ...process.env, BASE_URL: process.env.STAGING_URL } }
    )

    return true
  } catch {
    return false
  } finally {
    try { unlinkSync(tmpFile) } catch {}
  }
}
```
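With generation and validation in place, the routing step from the architecture diagram is a thin layer on top. In this sketch the validator is injected so the layer can be exercised without a browser, and the two outcomes are returned as labels; in a real pipeline each branch would call your Git provider's REST API (Azure DevOps, GitHub, etc.):

```typescript
// Routing step from the pipeline diagram: passing tests get committed to a
// test branch, failing ones become PR comments. The validator is injected;
// the Route labels stand in for real Git-provider API calls.
type Route = 'commit-to-test-branch' | 'post-pr-comment';

async function routeGeneratedTest(
  testCode: string,
  validate: (code: string) => Promise<boolean>,
): Promise<Route> {
  const passed = await validate(testCode);
  return passed ? 'commit-to-test-branch' : 'post-pr-comment';
}
```

In practice you'd pass the `validateGeneratedTest` function above as the validator.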
Azure DevOps pipeline integration
```yaml
trigger: none
pr:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

steps:
  - task: NodeTool@0
    inputs:
      versionSpec: '20.x'

  - script: npm ci

  - script: npx playwright install --with-deps chromium

  - script: |
      # Get changed files
      CHANGED=$(git diff --name-only origin/main...HEAD)
      PR_BODY="$(git log -1 --format='%B')"

      node scripts/generate-tests-from-pr.js \
        --changed-files "$CHANGED" \
        --pr-description "$PR_BODY" \
        --output generated-tests/
    displayName: Generate AI tests
    env:
      ANTHROPIC_API_KEY: $(ANTHROPIC_API_KEY)
      STAGING_URL: $(STAGING_URL)

  - script: |
      npx playwright test generated-tests/ --reporter=junit
    displayName: Run generated tests
    continueOnError: true

  - task: PublishTestResults@2
    inputs:
      testResultsFormat: JUnit
      testResultsFiles: playwright-results/results.xml
      testRunTitle: AI-Generated Tests — $(Build.BuildNumber)
    condition: always()
```
Quality control for AI-generated tests
AI-generated tests require a quality bar before they enter your permanent test suite.
Review checklist before accepting generated tests:
☑ Selectors use data-testid, not fragile CSS classes
☑ Assertions are specific (exact text, not just "is visible")
☑ Test is isolated (creates own data, cleans up)
☑ Edge cases are meaningful, not just random invalid inputs
☑ Test name follows [Feature — Action — Condition] pattern
☑ No hardcoded delays (no sleep/wait with fixed ms)
☑ Runs green in the pipeline independently
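Several of these checks can be automated as a pre-filter before human review. A rough sketch (the rule names and regex patterns are my own illustrations; they catch only the obvious cases and are no substitute for the checklist):

```typescript
// Automated pre-filter for generated tests. Rule names and patterns are
// illustrative; extend them to match your project's conventions.
interface QualityIssue {
  rule: string;
  match: string;
}

function auditGeneratedTest(code: string): QualityIssue[] {
  const rules: Array<[string, RegExp]> = [
    ['hardcoded-delay', /waitForTimeout\(\s*\d+\s*\)/g],
    ['fragile-css-selector', /locator\(\s*['"]\.[\w-]+/g], // class-based locator
    ['vague-assertion', /toBeVisible\(\)/g], // worth a human look, not always wrong
  ];
  const issues: QualityIssue[] = [];
  for (const [rule, pattern] of rules) {
    for (const match of code.match(pattern) ?? []) {
      issues.push({ rule, match });
    }
  }
  return issues;
}
```

Wire this into the validation step so flagged tests go to the QA-review path rather than the auto-commit path.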
Common AI generation mistakes to watch for:
- Fragile selectors: Claude may use `page.locator('.btn-primary')` instead of `page.getByRole('button', { name: 'Submit' })` if the source code doesn't have `data-testid` attributes
- Missing cleanup: Generated tests that create records often skip the `afterEach` cleanup
- Over-assertion: Sometimes generates 15+ assertions for a simple action — trim to the 3–4 that actually matter
- Wrong test boundaries: May generate one giant test that should be split into 5 focused tests
The generation step saves 60–70% of test writing time. The review step ensures the resulting tests are actually trustworthy.
Measuring the impact at scale
Track these metrics before and after introducing AI test generation to quantify the value:
| Metric | Before | After (typical) |
|---|---|---|
| Time to write new test suite | 3–4 hours | 45–60 minutes |
| Test case count per story | 5–8 | 12–18 |
| Edge cases covered per story | 2–3 | 6–8 |
| QA review time per test | 2 min | 5 min |
| Net time saving | — | 60–70% |
The review time increases because generated tests need checking. But the net savings are substantial — and coverage improves significantly because AI consistently generates more edge cases than a rushed QA engineer writing tests manually.