AI Test Generation with Claude & Playwright
How to use Anthropic Claude to automatically generate Playwright test scripts from user stories and design specs — a practical guide with real code.
One of the most exciting applications of Generative AI in QE is automated test generation. Instead of manually writing test cases from requirements, you can use an LLM to do the heavy lifting — and then review, refine, and run them.
Here's a practical workflow I've been using with Anthropic Claude + Playwright.
The problem with manual test writing
Writing comprehensive test suites is time-consuming:
- A single feature can require 20–50 test cases when you account for edge cases
- Requirements change and tests go stale
- Engineers often skip writing negative tests due to time pressure
- Coverage gaps are discovered too late (in production)
The AI-assisted approach
The workflow looks like this:
```
Requirements / User Story
        ↓
Prompt Claude with context
        ↓
Generated test cases (JSON)
        ↓
Generated Playwright scripts
        ↓
Human review + refinement
        ↓
CI/CD integration
```
Step 1: Structure your prompt
The quality of generated tests depends heavily on your prompt. Here's a template that works well:
```typescript
const systemPrompt = `
You are an expert QA engineer specializing in Playwright test automation.
Given a user story or feature description, generate comprehensive test cases covering:
1. Happy path scenarios
2. Edge cases and boundary values
3. Negative/error scenarios
4. Accessibility checks

Output format: TypeScript Playwright test file with proper describe/test structure.
Use data-testid selectors where possible. Include meaningful test names.
`;

const featureDescription = `
Feature: User Login
- Users can log in with email and password
- Email must be valid format
- Password must be at least 8 characters
- After 5 failed attempts, account is locked for 30 minutes
- Successful login redirects to /dashboard
`;
```
Step 2: Call the API
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function generateTests(featureDescription: string): Promise<string> {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 4096,
    system: systemPrompt,
    messages: [
      {
        role: 'user',
        content: `Generate Playwright tests for this feature:\n\n${featureDescription}`,
      },
    ],
  });

  const content = response.content[0];
  if (content.type === 'text') {
    return content.text;
  }
  throw new Error('Unexpected response type');
}
```
Step 3: What gets generated
Claude typically generates something like this:
```typescript
import { test, expect } from '@playwright/test';

test.describe('User Login', () => {
  test.beforeEach(async ({ page }) => {
    await page.goto('/login');
  });

  test('successful login redirects to dashboard', async ({ page }) => {
    await page.fill('[data-testid="email-input"]', 'user@example.com');
    await page.fill('[data-testid="password-input"]', 'SecurePass123');
    await page.click('[data-testid="login-button"]');

    await expect(page).toHaveURL('/dashboard');
    await expect(page.locator('[data-testid="user-greeting"]')).toBeVisible();
  });

  test('invalid email format shows validation error', async ({ page }) => {
    await page.fill('[data-testid="email-input"]', 'not-an-email');
    await page.fill('[data-testid="password-input"]', 'SecurePass123');
    await page.click('[data-testid="login-button"]');

    await expect(page.locator('[data-testid="email-error"]'))
      .toHaveText('Please enter a valid email address');
  });

  test('account locks after 5 failed attempts', async ({ page }) => {
    for (let i = 0; i < 5; i++) {
      await page.fill('[data-testid="email-input"]', 'user@example.com');
      await page.fill('[data-testid="password-input"]', 'WrongPassword');
      await page.click('[data-testid="login-button"]');
    }

    await expect(page.locator('[data-testid="lockout-message"]'))
      .toContainText('account has been locked');
  });

  // ... more tests generated automatically
});
```
Step 4: Review and refine
AI-generated tests need human review for:
- Selector accuracy — the AI guesses `data-testid` values; you need to verify or adjust
- Assertion completeness — add business-specific assertions the AI couldn't know
- Test data — replace placeholder values with real test data
- Flakiness — add appropriate wait strategies where needed
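As a concrete example of the test-data point, here's a small post-processing pass that swaps the placeholder values Claude tends to emit for entries from a test-data map. The placeholder/replacement pairs below are illustrative assumptions, not values from a real project:

```typescript
// Swap placeholder values from generated tests for real test data.
// NOTE: the placeholder/replacement pairs here are illustrative
// assumptions, not values from a real project.
const TEST_DATA: Record<string, string> = {
  'user@example.com': 'qa-user-01@mycompany.test',
  'SecurePass123': process.env.TEST_USER_PASSWORD ?? 'local-dev-password',
};

function injectTestData(generatedCode: string): string {
  // split/join replaces every occurrence without regex-escaping concerns
  return Object.entries(TEST_DATA).reduce(
    (code, [placeholder, replacement]) => code.split(placeholder).join(replacement),
    generatedCode,
  );
}
```

Run this over the generated file before the first review pass; any placeholder that survives is a hint the prompt needs more project context.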
Measuring the impact
In a recent project, we tested this approach on a 12-feature sprint:
| Metric | Manual | AI-Assisted |
|---|---|---|
| Time to first test | 45 min | 8 min |
| Tests per feature | 8 avg | 22 avg |
| Edge case coverage | 40% | 78% |
| Review time | — | 15 min |
What's next: Agentic test maintenance
The next evolution is self-healing tests — AI agents that:
- Detect when tests fail due to UI changes (not bugs)
- Identify the changed selector or element
- Automatically update the test script
- Open a PR for human approval
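To make the "identify the changed selector" step concrete, here's a toy sketch: when a test fails because its `data-testid` no longer exists, match it against the ids currently in the DOM by edit distance. The function names and the acceptance threshold are my own illustration, not a reference implementation:

```typescript
// Toy sketch of selector-drift detection: suggest the closest currently
// existing data-testid for one that disappeared. Names and the acceptance
// threshold are illustrative assumptions.
function editDistance(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function suggestReplacement(missingId: string, currentIds: string[]): string | null {
  const ranked = currentIds
    .map(id => ({ id, d: editDistance(missingId, id) }))
    .sort((x, y) => x.d - y.d);
  // Only suggest when the best candidate is reasonably close to the old id
  return ranked.length > 0 && ranked[0].d <= missingId.length / 2
    ? ranked[0].id
    : null;
}
```

A real agent would combine this with DOM context and the failing test's intent, but even this crude heuristic narrows the candidates for a human (or a follow-up Claude call) to confirm.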
That's a topic for a future post. For now, start with generation — the ROI is immediate.
Want to see this in action? Check the GitHub repo for a working example with the full pipeline.
Building a production-grade test generation pipeline
The examples above cover the fundamentals. Here's how to build a pipeline that generates, validates, and commits tests automatically as part of your development workflow.
The generation pipeline architecture
```
Pull Request opened
        ↓
Extract changed files + requirements from PR description
        ↓
Claude analyzes changes → generates test scenarios
        ↓
Claude generates Playwright test code for each scenario
        ↓
Pipeline runs generated tests against staging
        ↓
Passing tests → auto-committed to test branch
Failing tests → posted as PR comments for QA review
```
Extracting context from a PR
```typescript
// scripts/generate-tests-from-pr.ts
import Anthropic from '@anthropic-ai/sdk'
import { readFileSync } from 'fs'

const client = new Anthropic()

async function generateTestsForPR(prDescription: string, changedFiles: string[]) {
  // Read the actual changed source files for context
  const sourceContext = changedFiles
    .filter(f => f.endsWith('.ts') || f.endsWith('.tsx'))
    .slice(0, 5) // Limit context size
    .map(f => {
      try { return `// ${f}\n${readFileSync(f, 'utf8').slice(0, 2000)}` }
      catch { return '' }
    })
    .join('\n\n')

  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `
You are a QA engineer writing Playwright tests.

PR Description:
${prDescription}

Changed source files (excerpt):
${sourceContext}

Generate Playwright TypeScript tests that:
1. Cover the happy path described in the PR
2. Cover at least 3 negative/edge cases
3. Use data-testid selectors where possible
4. Include meaningful assertions (not just "element is visible")
5. Follow the Page Object Model pattern

Output ONLY valid TypeScript code, no explanation.
      `
    }]
  })

  return response.content[0].type === 'text' ? response.content[0].text : ''
}
```
Validating generated tests before committing
Never commit AI-generated tests without validation:
```typescript
import { execSync } from 'child_process'
import { writeFileSync, unlinkSync } from 'fs'

async function validateGeneratedTest(testCode: string): Promise<boolean> {
  const tmpFile = `/tmp/generated-test-${Date.now()}.spec.ts`

  try {
    writeFileSync(tmpFile, testCode)

    // TypeScript type check
    execSync(`npx tsc --noEmit --skipLibCheck ${tmpFile}`, { stdio: 'pipe' })

    // Run the test (with timeout)
    execSync(
      `npx playwright test ${tmpFile} --timeout=30000`,
      { stdio: 'pipe', timeout: 60000, env: { ...process.env, BASE_URL: process.env.STAGING_URL } }
    )

    return true
  } catch {
    return false
  } finally {
    try { unlinkSync(tmpFile) } catch {}
  }
}
```
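With generation and validation in place, the routing step from the architecture diagram is a thin layer on top. In this sketch the validator is injected so the layer can be exercised without a browser, and the two outcomes are returned as labels; in a real pipeline each branch would call your Git provider's REST API (Azure DevOps, GitHub, etc.):

```typescript
// Routing step from the pipeline diagram: passing tests get committed to a
// test branch, failing ones become PR comments. The validator is injected;
// the Route labels stand in for real Git-provider API calls.
type Route = 'commit-to-test-branch' | 'post-pr-comment';

async function routeGeneratedTest(
  testCode: string,
  validate: (code: string) => Promise<boolean>,
): Promise<Route> {
  const passed = await validate(testCode);
  return passed ? 'commit-to-test-branch' : 'post-pr-comment';
}
```

In practice you'd pass the `validateGeneratedTest` function above as the validator.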
Azure DevOps pipeline integration
```yaml
trigger: none
pr:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

steps:
  - task: NodeTool@0
    inputs:
      versionSpec: '20.x'

  - script: npm ci

  - script: npx playwright install --with-deps chromium

  - script: |
      # Get changed files
      CHANGED=$(git diff --name-only origin/main...HEAD)
      PR_BODY="$(git log -1 --format='%B')"

      node scripts/generate-tests-from-pr.js \
        --changed-files "$CHANGED" \
        --pr-description "$PR_BODY" \
        --output generated-tests/
    displayName: Generate AI tests
    env:
      ANTHROPIC_API_KEY: $(ANTHROPIC_API_KEY)
      STAGING_URL: $(STAGING_URL)

  - script: |
      npx playwright test generated-tests/ --reporter=junit
    displayName: Run generated tests
    continueOnError: true

  - task: PublishTestResults@2
    inputs:
      testResultsFormat: JUnit
      testResultsFiles: playwright-results/results.xml
      testRunTitle: AI-Generated Tests — $(Build.BuildNumber)
    condition: always()
```
Quality control for AI-generated tests
AI-generated tests require a quality bar before they enter your permanent test suite.
Review checklist before accepting generated tests:
☑ Selectors use data-testid, not fragile CSS classes
☑ Assertions are specific (exact text, not just "is visible")
☑ Test is isolated (creates own data, cleans up)
☑ Edge cases are meaningful, not just random invalid inputs
☑ Test name follows [Feature — Action — Condition] pattern
☑ No hardcoded delays (no sleep/wait with fixed ms)
☑ Runs green in the pipeline independently
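Several of these checks can be automated as a pre-filter before human review. A rough sketch (the rule names and regex patterns are my own illustrations; they catch only the obvious cases and are no substitute for the checklist):

```typescript
// Automated pre-filter for generated tests. Rule names and patterns are
// illustrative; extend them to match your project's conventions.
interface QualityIssue {
  rule: string;
  match: string;
}

function auditGeneratedTest(code: string): QualityIssue[] {
  const rules: Array<[string, RegExp]> = [
    ['hardcoded-delay', /waitForTimeout\(\s*\d+\s*\)/g],
    ['fragile-css-selector', /locator\(\s*['"]\.[\w-]+/g], // class-based locator
    ['vague-assertion', /toBeVisible\(\)/g], // worth a human look, not always wrong
  ];
  const issues: QualityIssue[] = [];
  for (const [rule, pattern] of rules) {
    for (const match of code.match(pattern) ?? []) {
      issues.push({ rule, match });
    }
  }
  return issues;
}
```

Wire this into the validation step so flagged tests go to the QA-review path rather than the auto-commit path.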
Common AI generation mistakes to watch for:
- Fragile selectors: Claude may use `page.locator('.btn-primary')` instead of `page.getByRole('button', { name: 'Submit' })` if the source code doesn't have `data-testid` attributes
- Missing cleanup: Generated tests that create records often skip the `afterEach` cleanup
- Over-assertion: Sometimes generates 15+ assertions for a simple action — trim to the 3–4 that actually matter
- Wrong test boundaries: May generate one giant test that should be split into 5 focused tests
The generation step saves 60–70% of test writing time. The review step ensures the resulting tests are actually trustworthy.
Measuring the impact at scale
Track these metrics before and after introducing AI test generation to quantify the value:
| Metric | Before | After (typical) |
|---|---|---|
| Time to write new test suite | 3–4 hours | 45–60 minutes |
| Test case count per story | 5–8 | 12–18 |
| Edge cases covered per story | 2–3 | 6–8 |
| QA review time per test | 2 min | 5 min |
| Net time saving | — | 60–70% |
The review time increases because generated tests need checking. But the net savings are substantial — and coverage improves significantly because AI consistently generates more edge cases than a rushed QA engineer writing tests manually.