AI Transformation

AI Test Generation with Claude & Playwright

How to use Anthropic Claude to automatically generate Playwright test scripts from user stories and design specs — a practical guide with real code.

InnovateBits

One of the most exciting applications of Generative AI in QE is automated test generation. Instead of manually writing test cases from requirements, you can use an LLM to do the heavy lifting — and then review, refine, and run them.

Here's a practical workflow I've been using with Anthropic Claude + Playwright.

The problem with manual test writing

Writing comprehensive test suites is time-consuming:

  • A single feature can require 20–50 test cases when you account for edge cases
  • Requirements change and tests go stale
  • Engineers often skip writing negative tests due to time pressure
  • Coverage gaps are discovered too late (in production)

The AI-assisted approach

The workflow looks like this:

Requirements / User Story
        ↓
   Prompt Claude with context
        ↓
   Generated test cases (JSON)
        ↓
   Generated Playwright scripts
        ↓
   Human review + refinement
        ↓
   CI/CD integration

Step 1: Structure your prompt

The quality of generated tests depends entirely on your prompt. Here's a template that works well:

```typescript
const systemPrompt = `
You are an expert QA engineer specializing in Playwright test automation.
Given a user story or feature description, generate comprehensive test cases covering:
1. Happy path scenarios
2. Edge cases and boundary values
3. Negative/error scenarios
4. Accessibility checks

Output format: TypeScript Playwright test file with proper describe/test structure.
Use data-testid selectors where possible. Include meaningful test names.
`;

const featureDescription = `
Feature: User Login
- Users can log in with email and password
- Email must be valid format
- Password must be at least 8 characters
- After 5 failed attempts, account is locked for 30 minutes
- Successful login redirects to /dashboard
`;
```

Step 2: Call the API

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function generateTests(featureDescription: string): Promise<string> {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 4096,
    system: systemPrompt,
    messages: [
      {
        role: 'user',
        content: `Generate Playwright tests for this feature:\n\n${featureDescription}`,
      },
    ],
  });

  const content = response.content[0];
  if (content.type === 'text') {
    return content.text;
  }
  throw new Error('Unexpected response type');
}
```
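One practical wrinkle: `generateTests` returns the model's raw text, which often arrives wrapped in prose and a markdown fence. A small helper can strip that before you write the code to a `.spec.ts` file. This is a sketch of my own, not part of the Anthropic SDK:

```typescript
// Pull the first fenced code block out of a model response, falling back
// to the raw text when no fence is present.
function extractCodeBlock(responseText: string): string {
  const match = responseText.match(/```(?:\w+)?\n([\s\S]*?)```/);
  return match ? match[1].trimEnd() : responseText.trim();
}

// Build a sample response (FENCE avoids literal backtick fences in this snippet).
const FENCE = '`'.repeat(3);
const sample = `Here are the tests:\n${FENCE}typescript\nimport { test } from '@playwright/test';\n${FENCE}`;
const testCode = extractCodeBlock(sample);
// testCode now begins with the import line, ready to write to disk
```

A cheaper alternative is prompting for "code only, no explanation" (as the pipeline section below does), but a fallback parser is still worth having.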

Step 3: What gets generated

Claude typically generates something like this:

```typescript
import { test, expect } from '@playwright/test';

test.describe('User Login', () => {
  test.beforeEach(async ({ page }) => {
    await page.goto('/login');
  });

  test('successful login redirects to dashboard', async ({ page }) => {
    await page.fill('[data-testid="email-input"]', 'user@example.com');
    await page.fill('[data-testid="password-input"]', 'SecurePass123');
    await page.click('[data-testid="login-button"]');

    await expect(page).toHaveURL('/dashboard');
    await expect(page.locator('[data-testid="user-greeting"]')).toBeVisible();
  });

  test('invalid email format shows validation error', async ({ page }) => {
    await page.fill('[data-testid="email-input"]', 'not-an-email');
    await page.fill('[data-testid="password-input"]', 'SecurePass123');
    await page.click('[data-testid="login-button"]');

    await expect(page.locator('[data-testid="email-error"]'))
      .toHaveText('Please enter a valid email address');
  });

  test('account locks after 5 failed attempts', async ({ page }) => {
    for (let i = 0; i < 5; i++) {
      await page.fill('[data-testid="email-input"]', 'user@example.com');
      await page.fill('[data-testid="password-input"]', 'WrongPassword');
      await page.click('[data-testid="login-button"]');
    }

    await expect(page.locator('[data-testid="lockout-message"]'))
      .toContainText('account has been locked');
  });

  // ... more tests generated automatically
});
```

Step 4: Review and refine

AI-generated tests need human review for:

  • Selector accuracy — the AI guesses data-testid values; you need to verify or adjust
  • Assertion completeness — add business-specific assertions the AI couldn't know
  • Test data — replace placeholder values with real test data
  • Flakiness — add appropriate wait strategies where needed
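On the test-data point: boundary values are the cheapest gaps to close by hand. A small helper (my own convention, not from any library) that generates the classic below/at/above cases for a minimum-length rule like the 8-character password in the example feature:

```typescript
// Given a minimum-length rule (e.g. "password must be at least 8 characters"),
// produce the three classic boundary inputs with their expected outcomes.
function lengthBoundaries(min: number): { value: string; shouldPass: boolean }[] {
  const make = (n: number) => 'x'.repeat(n);
  return [
    { value: make(min - 1), shouldPass: false }, // just below the limit
    { value: make(min), shouldPass: true },      // exactly at the limit
    { value: make(min + 1), shouldPass: true },  // just above the limit
  ];
}

// For the 8-character rule this yields 7-, 8- and 9-character passwords.
const passwordCases = lengthBoundaries(8);
```

Feeding a table like this into a parameterised test is usually more reliable than hoping the model picks the right boundary values itself.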

Measuring the impact

In a recent project, we tested this approach on a 12-feature sprint:

| Metric | Manual | AI-Assisted |
| --- | --- | --- |
| Time to first test | 45 min | 8 min |
| Tests per feature | 8 avg | 22 avg |
| Edge case coverage | 40% | 78% |
| Review time | — | 15 min |

What's next: Agentic test maintenance

The next evolution is self-healing tests — AI agents that:

  1. Detect when tests fail due to UI changes (not bugs)
  2. Identify the changed selector or element
  3. Automatically update the test script
  4. Open a PR for human approval

That's a topic for a future post. For now, start with generation — the ROI is immediate.


Want to see this in action? Check the GitHub repo for a working example with the full pipeline.


Building a production-grade test generation pipeline

The examples above cover the fundamentals. Here's how to build a pipeline that generates, validates, and commits tests automatically as part of your development workflow.

The generation pipeline architecture

Pull Request opened
        ↓
Extract changed files + requirements from PR description
        ↓
Claude analyses changes → generates test scenarios
        ↓
Claude generates Playwright test code for each scenario
        ↓
Pipeline runs generated tests against staging
        ↓
Passing tests → auto-committed to test branch
Failing tests → posted as PR comments for QA review
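The final routing step of this architecture is simple branching logic. A minimal sketch (type and function names are illustrative, not from the real pipeline):

```typescript
// Route each generated test by its staging result: passing tests are queued
// for auto-commit, failing ones go back to QA as PR comments.
type GeneratedTest = { file: string; passed: boolean };
type Routing = { toCommit: string[]; toReview: string[] };

function routeGeneratedTests(results: GeneratedTest[]): Routing {
  return {
    toCommit: results.filter(r => r.passed).map(r => r.file),
    toReview: results.filter(r => !r.passed).map(r => r.file),
  };
}
```

Keeping this decision in one pure function makes it trivial to unit-test the pipeline's behaviour without a browser or an API key.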

Extracting context from a PR

```typescript
// scripts/generate-tests-from-pr.ts
import Anthropic from '@anthropic-ai/sdk'
import { readFileSync } from 'fs'
import { execSync } from 'child_process'

const client = new Anthropic()

async function generateTestsForPR(prDescription: string, changedFiles: string[]) {
  // Read the actual changed source files for context
  const sourceContext = changedFiles
    .filter(f => f.endsWith('.ts') || f.endsWith('.tsx'))
    .slice(0, 5) // Limit context size
    .map(f => {
      try { return `// ${f}\n${readFileSync(f, 'utf8').slice(0, 2000)}` }
      catch { return '' }
    })
    .join('\n\n')

  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `
You are a QA engineer writing Playwright tests.

PR Description:
${prDescription}

Changed source files (excerpt):
${sourceContext}

Generate Playwright TypeScript tests that:
1. Cover the happy path described in the PR
2. Cover at least 3 negative/edge cases
3. Use data-testid selectors where possible
4. Include meaningful assertions (not just "element is visible")
5. Follow the Page Object Model pattern

Output ONLY valid TypeScript code, no explanation.
      `
    }]
  })

  return response.content[0].type === 'text' ? response.content[0].text : ''
}
```
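`generateTestsForPR` expects `changedFiles` as an array, but `git diff --name-only` hands you one newline-separated string. A small parser for that boundary (a helper of my own, not part of the script above), which also applies the same source-file filter:

```typescript
// Split `git diff --name-only` output into a clean array of TypeScript
// source paths: trim whitespace, drop blank lines, keep only .ts/.tsx files.
function parseChangedFiles(gitDiffOutput: string): string[] {
  return gitDiffOutput
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .filter(f => f.endsWith('.ts') || f.endsWith('.tsx'));
}
```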

Validating generated tests before committing

Never commit AI-generated tests without validation:

```typescript
import { execSync } from 'child_process'
import { writeFileSync, unlinkSync } from 'fs'

async function validateGeneratedTest(testCode: string): Promise<boolean> {
  const tmpFile = `/tmp/generated-test-${Date.now()}.spec.ts`

  try {
    writeFileSync(tmpFile, testCode)

    // TypeScript type check
    execSync(`npx tsc --noEmit --skipLibCheck ${tmpFile}`, { stdio: 'pipe' })

    // Run the test (with timeout)
    execSync(
      `npx playwright test ${tmpFile} --timeout=30000`,
      { stdio: 'pipe', timeout: 60000, env: { ...process.env, BASE_URL: process.env.STAGING_URL } }
    )

    return true
  } catch {
    return false
  } finally {
    try { unlinkSync(tmpFile) } catch {}
  }
}
```
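When validation fails, one retry that feeds the failure back to the model often rescues the test. Here is a dependency-injected sketch (the `generate`/`validate` callbacks stand in for the Claude call and the validator; shown synchronous for clarity, the real calls would be async):

```typescript
// Try generate → validate up to maxAttempts times; after a failure, pass
// feedback into the next generation attempt. Returns null if all attempts fail.
function generateWithRetry(
  generate: (feedback?: string) => string,
  validate: (code: string) => boolean,
  maxAttempts = 2,
): string | null {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const code = generate(feedback);
    if (validate(code)) return code;
    feedback = 'The previous test failed validation; fix type errors and flaky selectors.';
  }
  return null; // give up and route to human review
}
```

Capping retries matters: each attempt costs tokens and pipeline minutes, and a test that fails twice usually needs a human anyway.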

Azure DevOps pipeline integration

```yaml
trigger: none
pr:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

steps:
  - task: NodeTool@0
    inputs:
      versionSpec: '20.x'

  - script: npm ci

  - script: npx playwright install --with-deps chromium

  - script: |
      # Get changed files
      CHANGED=$(git diff --name-only origin/main...HEAD)
      PR_BODY="$(git log -1 --format='%B')"

      node scripts/generate-tests-from-pr.js \
        --changed-files "$CHANGED" \
        --pr-description "$PR_BODY" \
        --output generated-tests/
    displayName: Generate AI tests
    env:
      ANTHROPIC_API_KEY: $(ANTHROPIC_API_KEY)
      STAGING_URL: $(STAGING_URL)

  - script: |
      npx playwright test generated-tests/ --reporter=junit
    displayName: Run generated tests
    continueOnError: true
    env:
      # Write the JUnit report where the publish step expects it
      PLAYWRIGHT_JUNIT_OUTPUT_NAME: playwright-results/results.xml

  - task: PublishTestResults@2
    inputs:
      testResultsFormat: JUnit
      testResultsFiles: playwright-results/results.xml
      testRunTitle: AI-Generated Tests — $(Build.BuildNumber)
    condition: always()
```

Quality control for AI-generated tests

AI-generated tests require a quality bar before they enter your permanent test suite.

Review checklist before accepting generated tests:

☑ Selectors use data-testid, not fragile CSS classes
☑ Assertions are specific (exact text, not just "is visible")
☑ Test is isolated (creates own data, cleans up)
☑ Edge cases are meaningful, not just random invalid inputs
☑ Test name follows [Feature — Action — Condition] pattern
☑ No hardcoded delays (no sleep/wait with fixed ms)
☑ Runs green in the pipeline independently
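Two of those checklist items can be pre-screened mechanically before a human ever looks at the test. A heuristic lint sketch (the regexes are assumptions; tune them to your codebase's conventions):

```typescript
// Flag two common checklist violations in generated test code:
// hardcoded waits and fragile CSS-class selectors.
function lintGeneratedTest(code: string): string[] {
  const issues: string[] = [];
  if (/waitForTimeout\(\s*\d+\s*\)/.test(code)) {
    issues.push('hardcoded delay: use web-first assertions instead of waitForTimeout');
  }
  if (/locator\(\s*['"]\.[\w-]+/.test(code)) {
    issues.push('fragile selector: prefer data-testid or getByRole over CSS classes');
  }
  return issues;
}
```

Running this before review means QA time is spent on the judgement calls (assertion quality, test boundaries) rather than mechanical style checks.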

Common AI generation mistakes to watch for:

  1. Fragile selectors: Claude may use page.locator('.btn-primary') instead of page.getByRole('button', { name: 'Submit' }) if the source code doesn't have data-testid attributes
  2. Missing cleanup: Generated tests that create records often skip the afterEach cleanup
  3. Over-assertion: Sometimes generates 15+ assertions for a simple action — trim to the 3–4 that actually matter
  4. Wrong test boundaries: May generate one giant test that should be split into 5 focused tests

The generation step saves 60–70% of test writing time. The review step ensures the resulting tests are actually trustworthy.


Measuring the impact

Track these metrics before and after introducing AI test generation to quantify the value:

| Metric | Before | After (typical) |
| --- | --- | --- |
| Time to write new test suite | 3–4 hours | 45–60 minutes |
| Test case count per story | 5–8 | 12–18 |
| Edge cases covered per story | 2–3 | 6–8 |
| QA review time per test | 2 min | 5 min |
| Net time saving | — | 60–70% |

The review time increases because generated tests need checking. But the net savings are substantial — and coverage improves significantly because AI consistently generates more edge cases than a rushed QA engineer writing tests manually.
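The net figure is simple arithmetic once you include the review overhead. A quick check with hypothetical per-story numbers (180 minutes of manual writing versus roughly 55 minutes of generation plus extra review time):

```typescript
// Net time saving as a percentage, rounded to the nearest whole percent.
function netTimeSavingPct(beforeMin: number, afterMin: number): number {
  return Math.round((1 - afterMin / beforeMin) * 100);
}

// Illustrative only: 180 min manual vs ~55 min generated + reviewed.
const saving = netTimeSavingPct(180, 55);
// saving lands inside the 60–70% band from the table
```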

Tags
#ai #playwright #test-generation #llm #claude