
Test Data Generation: Strategies and Best Practices for QA Engineers

A comprehensive guide to test data generation for QA engineers. Learn when to use static fixtures vs generated data, how to create realistic fake data without touching production, and strategies for managing test data across environments.

InnovateBits · 8 min read

Test data is the foundation of meaningful testing. You can have the most sophisticated automation framework in the world, but if your test data is poor — too limited, too rigid, or copied from production — your tests will be slow to write, brittle to maintain, and dangerous to run.

This guide covers the full spectrum of test data strategies: from static fixtures to dynamic generators, from unit test data to E2E test data, and from local development data to CI pipeline data.


Why test data management is harder than it looks

Test data problems are among the most common causes of flaky and unreliable tests. Specifically:

Shared mutable data — when multiple tests read and write to the same records, tests start interfering with each other. A test that expects a user to have 0 orders fails when a previous test created an order on that user.

Production data in test environments — copying production data to staging or development environments creates compliance risk (GDPR, HIPAA, PCI-DSS), security risk, and maintenance overhead. Production data also changes over time, making tests that depend on specific production records increasingly fragile.

Hardcoded test data — tests that depend on a specific user ID, specific email address, or specific record that was manually created in a shared environment become the responsibility of whoever created that record. When they leave the team, the test breaks.

Insufficient variety — tests that only run against "happy path" data miss the edge cases that cause real production bugs: names with Unicode characters, addresses in countries with different formats, phone numbers with country codes, prices in currencies with no decimal places.
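Most of these failure modes share one remedy: each test constructs its own data instead of reaching for shared records. As a minimal sketch of that idea (the `makeUser` helper and its field names are illustrative, not from any particular framework):

```typescript
// Illustrative factory: every call returns a brand-new user object,
// so no two tests ever share (and mutate) the same record.
let counter = 0

interface TestUser {
  id: string
  email: string
  orders: string[]
}

function makeUser(overrides: Partial<TestUser> = {}): TestUser {
  counter += 1
  return {
    id: `usr_test_${counter}`,
    email: `qa-user-${counter}@example.com`, // RFC 2606 reserved domain
    orders: [],                              // fresh state: zero orders
    ...overrides,
  }
}

const a = makeUser()
const b = makeUser({ orders: ['ord_1'] })
```

A test that expects a user with 0 orders now gets one every time, regardless of what ran before it.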


The three approaches to test data

1. Static fixtures

Static fixtures are JSON or CSV files committed to your repository. They're loaded before tests run and provide a known, consistent state.

Best for:

  • Unit tests where you need precise control over the data
  • Contract tests that validate specific response shapes
  • Tests for edge cases that would be hard to generate dynamically (specific Unicode sequences, maximum field lengths, malformed data)
// tests/fixtures/users.json
[
  { "id": "usr_001", "name": "Alice Chen",   "role": "admin",   "email": "alice@example.com" },
  { "id": "usr_002", "name": "Bob Smith",    "role": "member",  "email": "bob@example.com" },
  { "id": "usr_003", "name": "Carol López",  "role": "viewer",  "email": "carol@example.com" },
  { "id": "usr_004", "name": "Dariusz Wójcik", "role": "member", "email": "dariusz@example.com" }
]

Note that good fixture files include edge cases: Unicode names, varied roles, and a range of formats.

When to avoid static fixtures:

  • Tests that need unique data per run (to avoid conflicts in parallel execution)
  • Tests that need large volumes of data (1,000+ rows)
  • Integration tests that hit a real database (the fixture might get stale)
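When you do use fixtures, validating their shape at load time turns a stale or malformed fixture into an immediate, readable failure rather than a confusing one deep inside a test. A small loader sketch for the format above (the `parseUserFixture` helper is illustrative; in a real suite you would read the committed file rather than an inline string):

```typescript
// Parse and sanity-check a fixture in the users.json format shown above.
interface FixtureUser {
  id: string
  name: string
  role: 'admin' | 'member' | 'viewer'
  email: string
}

function parseUserFixture(json: string): FixtureUser[] {
  const data = JSON.parse(json)
  if (!Array.isArray(data)) throw new Error('fixture must be a JSON array')
  for (const u of data) {
    if (!u.id || !u.name || !u.role || !u.email) {
      throw new Error(`incomplete fixture row: ${JSON.stringify(u)}`)
    }
  }
  return data as FixtureUser[]
}

// In a real suite, read the file instead of inlining the JSON:
//   parseUserFixture(fs.readFileSync('tests/fixtures/users.json', 'utf8'))
const users = parseUserFixture(
  '[{ "id": "usr_001", "name": "Alice Chen", "role": "admin", "email": "alice@example.com" }]'
)
```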

2. Programmatic generation in tests

Generate data directly in your test code using a library or custom functions. Each test generates its own unique data at runtime.

// Playwright test with inline data generation
import { faker } from '@faker-js/faker'
 
test('user can complete registration', async ({ page }) => {
  const user = {
    firstName: faker.person.firstName(),
    lastName:  faker.person.lastName(),
    email:     faker.internet.email(),
    password:  faker.internet.password({ length: 12, memorable: false })
  }
 
  await page.goto('/register')
  await page.fill('[name="firstName"]', user.firstName)
  await page.fill('[name="lastName"]',  user.lastName)
  await page.fill('[name="email"]',     user.email)
  await page.fill('[name="password"]',  user.password)
  await page.click('[type="submit"]')
 
  await expect(page.locator('.welcome-message')).toContainText(user.firstName)
})

This pattern guarantees each test run uses unique email addresses, preventing "user already exists" errors when tests run in parallel or are re-run without a database reset.

When to avoid programmatic generation:

  • When you need specific, reproducible values for debugging
  • When the data needs to meet complex inter-field constraints (e.g., a date of birth that makes the user exactly 18 years old)
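The "exactly 18 years old" case is worth dwelling on: constrained values like this are usually easier to compute than to generate randomly. A minimal helper, with illustrative names:

```typescript
// Compute a date of birth that makes a user exactly `age` years old
// on `today` — the kind of inter-field constraint that random
// generation struggles with.
// Caveat: if `today` is Feb 29, setFullYear on a non-leap year
// rolls the result to Mar 1; handle that case explicitly if it matters.
function dateOfBirthForAge(age: number, today: Date): Date {
  const dob = new Date(today)
  dob.setFullYear(today.getFullYear() - age)
  return dob
}

const dob = dateOfBirthForAge(18, new Date(2025, 5, 15)) // June 15, 2025
```

The same idea works for "trial expires tomorrow", "account older than 90 days", and similar derived fields.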

3. Pre-seeded environments

For E2E tests that hit a real backend, the most reliable approach is a seeded database reset before each test run. This combines the strengths of the previous two approaches: known, consistent state that you also control programmatically.

// playwright.config.ts — global setup seeds the database
export default defineConfig({
  globalSetup: './tests/global-setup.ts',
})
 
// tests/global-setup.ts
export default async function globalSetup() {
  await fetch('http://localhost:3000/api/test/reset-db', { method: 'POST' })
  await fetch('http://localhost:3000/api/test/seed', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ scenario: 'standard' })
  })
}

Realistic fake data: what it looks like and why it matters

The difference between testuser1@test.com and alice.chen47@gmail.com matters more than it might seem. Realistic fake data:

  • Exercises real validation — actual email domains, phone number formats, and postal codes will expose validation bugs that test@test.com won't
  • Makes screenshots and demos usable — if the product team ever sees your test environment, it looks professional
  • Catches display bugs — names like "Günther Müller" or "José García" expose character encoding issues that John Smith won't

Data types and realistic generation strategies

Names — mix first and last names from diverse cultural origins. Real users include names with apostrophes (O'Brien), hyphens (Mary-Jane), Unicode characters (José, Björn, 王 Wei), and varying lengths.

Email addresses — use realistic domain names and avoid @test.com for anything that goes into a real email validation workflow. Format: {firstname}.{lastname}{number}@{domain}.

Phone numbers — always include the country code format. US: +1-555-XXX-XXXX, UK: +44 7XXX XXXXXX, India: +91 XXXXX XXXXX.

Addresses — include variations in line 2 (apartment numbers, suite numbers), different postal code formats by country, and city names with accents.

Dates — generate dates that test boundary conditions: past dates, future dates, today, yesterday, leap day (Feb 29), end of month (Jan 31), end of year (Dec 31).
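Those boundary dates can be produced mechanically rather than hand-picked per test. A sketch (the function name and the choice of 2024 as the reference leap year are assumptions):

```typescript
// Generate the boundary dates listed above, relative to a reference date,
// for feeding into parameterised tests.
function boundaryDates(ref: Date): Record<string, Date> {
  const yesterday = new Date(ref); yesterday.setDate(ref.getDate() - 1)
  const tomorrow = new Date(ref); tomorrow.setDate(ref.getDate() + 1)
  return {
    today: new Date(ref),
    yesterday,
    tomorrow,
    leapDay: new Date(2024, 1, 29),                 // Feb 29 of a known leap year
    endOfMonth: new Date(ref.getFullYear(), 0, 31), // Jan 31
    endOfYear: new Date(ref.getFullYear(), 11, 31), // Dec 31
  }
}

const b = boundaryDates(new Date(2025, 5, 15))
```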


Using the InnovateBits Test Data Generator

The Test Data Generator tool lets you define a custom schema, choose field types, set a row count, and download the result as CSV or JSON — directly in your browser with no server involved.

It's useful for:

  • Loading test databases — generate 500 user records as CSV and import them into your test database before a test run
  • Seeding fixtures — generate 20–30 rows, download as JSON, and commit as a fixture file
  • Demo data — generate realistic-looking data for screenshots, mockups, and stakeholder demos
  • Performance testing — generate 10,000 rows as CSV for load testing a bulk import feature

To use it effectively: map the column names to match your database column names exactly, then import the CSV directly without transformation.


Test data for common QA scenarios

Boundary value testing

Always test the boundaries of valid input ranges. For a quantity field that accepts 1–999:

Value   Category         What it tests
0       Below minimum    Validation rejects it
1       Minimum valid    Accepted and processed
500     Middle valid     Normal operation
999     Maximum valid    Accepted and processed
1000    Above maximum    Validation rejects it
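The five-value pattern in the table generalises to any numeric range. A small helper (illustrative, not from any particular library):

```typescript
// Derive the five standard boundary values for an inclusive numeric range:
// just below minimum, minimum, middle, maximum, just above maximum.
function boundaryValues(min: number, max: number): number[] {
  return [min - 1, min, Math.floor((min + max) / 2), max, max + 1]
}

const qty = boundaryValues(1, 999)
```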

Equivalence partitioning

Divide valid inputs into groups where all members behave the same way, then test one value from each group:

  • User tier: free, pro, enterprise — test one of each
  • File size: 0 KB, 1–10 MB (valid), 10–50 MB (valid with warning), >50 MB (rejected)
  • Currency: USD (2 decimal places), JPY (no decimal places), BHD (3 decimal places)
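In code, equivalence partitioning amounts to pairing one representative value with each expected behaviour. A sketch for the file-size example (the thresholds and the handling of a 0 KB file are assumptions, not product rules):

```typescript
// One representative per partition for the file-size example above.
type SizeClass = 'empty' | 'valid' | 'valid-with-warning' | 'rejected'

function classifyFileSize(mb: number): SizeClass {
  if (mb === 0) return 'empty'
  if (mb <= 10) return 'valid'
  if (mb <= 50) return 'valid-with-warning'
  return 'rejected'
}

// Test one value from each partition, not every possible size.
const representatives: Array<[number, SizeClass]> = [
  [0, 'empty'],
  [5, 'valid'],
  [25, 'valid-with-warning'],
  [60, 'rejected'],
]
```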

Negative data

Test data that should be rejected is as important as data that should be accepted:

[
  { "email": "notanemail",              "expected": "invalid" },
  { "email": "missing@tld",            "expected": "invalid" },
  { "email": "spaces in@email.com",    "expected": "invalid" },
  { "email": "double@@domain.com",     "expected": "invalid" },
  { "email": "valid@example.com",      "expected": "valid"   },
  { "email": "valid+tag@example.co.uk","expected": "valid"   }
]
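A fixture like this pairs naturally with a table-driven test. The sketch below runs the cases through a deliberately simple validator (the regex is illustrative — full RFC 5322 email validation is far more involved — but it is enough to classify the cases in the fixture):

```typescript
// Table-driven check: run every row of the negative-data fixture
// through the validator and compare against the expected outcome.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/

const cases: Array<{ email: string; expected: 'valid' | 'invalid' }> = [
  { email: 'notanemail',              expected: 'invalid' },
  { email: 'missing@tld',             expected: 'invalid' },
  { email: 'spaces in@email.com',     expected: 'invalid' },
  { email: 'double@@domain.com',      expected: 'invalid' },
  { email: 'valid@example.com',       expected: 'valid' },
  { email: 'valid+tag@example.co.uk', expected: 'valid' },
]

const results = cases.map(c =>
  (EMAIL_RE.test(c.email) ? 'valid' : 'invalid') === c.expected
)
```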

Managing test data in CI

The biggest challenge with test data in CI is isolation: tests must not interfere with each other, and a test run must always start from a known state.

Strategy 1: Transactional rollback

Wrap each test in a database transaction and roll it back at the end. No data persists between tests.

// Jest with Prisma
// Caveat: raw BEGIN/ROLLBACK only works when Prisma is pinned to a
// single connection (e.g. connection_limit=1 in the datasource URL);
// with a connection pool the two statements can land on different
// connections and the rollback silently does nothing.
beforeEach(async () => {
  await prisma.$executeRaw`BEGIN`
})
 
afterEach(async () => {
  await prisma.$executeRaw`ROLLBACK`
})

Strategy 2: Unique identifiers per test run

Prefix all generated data with a run-specific prefix so parallel runs don't collide:

const runId = process.env.CI_RUN_ID ?? Date.now().toString()
const testEmail = `qa-${runId}-${faker.internet.email()}`

Strategy 3: Per-test database

For integration tests that are too complex for transaction rollback, spin up a fresh database per test using Docker:

# GitHub Actions
services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_PASSWORD: test
    options: >-
      --health-cmd pg_isready
      --health-interval 10s

Common test data mistakes

Using production email addresses in tests — even in staging, test emails sometimes get sent. Use @example.com (RFC 2606 reserved) or @mailinator.com for throwaway addresses.

Hardcoding IDs — IDs like user_id = 42 that were manually created break when the test database is reset. Always look up records by a stable attribute (email, username) rather than an auto-incremented ID.

Too little data variety — a test suite that only ever creates users with ASCII names, US addresses, and USD transactions will miss localisation and internationalisation bugs.

Forgetting cleanup — tests that create records but don't clean them up cause slow data growth in long-lived environments. Always implement teardown in afterEach or afterAll.
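Teardown is easiest when creation and cleanup are registered together. One pattern is a cleanup registry that runs undo actions in reverse order, so child records are deleted before their parents (names here are illustrative; in Jest or Playwright you would call `runCleanups` from `afterEach` or `afterAll`):

```typescript
// Each test registers an undo action for every record it creates;
// teardown drains the registry last-in-first-out.
type Cleanup = () => Promise<void> | void

const cleanups: Cleanup[] = []

function registerCleanup(fn: Cleanup): void {
  cleanups.push(fn)
}

async function runCleanups(): Promise<void> {
  while (cleanups.length > 0) {
    await cleanups.pop()!() // LIFO: last created, first deleted
  }
}
```

A test that creates a user and then an order would register the user's deletion first and the order's second; teardown then deletes the order before the user, respecting foreign-key constraints.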
