Agentic AI in Quality Engineering: Intro

What agentic AI means for QE teams, how autonomous testing agents work, and how to start building your first self-healing test pipeline.

We've gone through waves of automation in QE — from record-and-playback tools to code-based frameworks to AI-assisted test generation. The next wave is agentic AI: systems that don't just generate tests but autonomously plan, execute, adapt, and repair them.

What is agentic AI?

An AI agent is a system that:

Perceives its environment (your application, test results, CI logs)
Reasons about what to do next
Acts by calling tools (browsers, APIs, code editors)
Learns from outcomes and adjusts

The key difference from traditional AI assistance: agents are autonomous loops, not single-shot prompts. They keep going until the task is done or a stop condition, such as a retry limit, is reached.
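To make the loop concrete, here is a minimal sketch of perceive, reason, act, and learn as plain Python. The names `run_agent_loop`, `LoopResult`, and the three callables are illustrative and not part of any library. In a real agent `reason` would call an LLM and `act` would call a tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    done: bool
    steps: int
    history: list[str] = field(default_factory=list)


def run_agent_loop(
    perceive: Callable[[], str],
    reason: Callable[[str, list[str]], str],
    act: Callable[[str], bool],
    max_steps: int = 5,
) -> LoopResult:
    """Repeat perceive -> reason -> act until act() succeeds or the step budget runs out."""
    history: list[str] = []
    for step in range(1, max_steps + 1):
        observation = perceive()                # e.g. read test results or CI logs
        action = reason(observation, history)   # e.g. ask an LLM what to try next
        succeeded = act(action)                 # e.g. call a tool and check the outcome
        # "Learn": keep the outcome so the next reasoning step can see it.
        history.append(f"{action}: {'ok' if succeeded else 'failed'}")
        if succeeded:
            return LoopResult(done=True, steps=step, history=history)
    return LoopResult(done=False, steps=max_steps, history=history)


# Toy usage: the "agent" tries candidate selectors in order until one matches.
candidates = iter(["#old-id", "text=Submit", "[data-testid=submit]"])
result = run_agent_loop(
    perceive=lambda: "button not found",
    reason=lambda observation, history: next(candidates),
    act=lambda selector: selector == "[data-testid=submit]",
)
print(result.done, result.steps)
```

The `max_steps` budget is the part that matters most in practice: without it, a loop that never succeeds keeps spending tokens.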

What agentic QE looks like in practice

Self-healing tests

The most immediate use case. When a Playwright test fails due to a selector change, an agent can:

1. Receive failing test + error message
2. Fetch the current DOM snapshot
3. Identify the element that moved/changed
4. Generate a new selector using semantic understanding
5. Update the test file
6. Run the test to verify the fix
7. Open a PR with the change

All without human intervention until the PR, where a reviewer decides whether to merge.
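Step 1 usually starts with pulling the broken selector out of the error text. The sketch below assumes the error contains a `locator('...')` call, as in the sample message. That message shape is an assumption, so check it against your own logs. `extract_failing_selector` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed message shape, e.g. "... waiting for locator('#submit-btn')".
LOCATOR_PATTERN = re.compile(r"""locator\((['"])(?P<selector>.+?)\1\)""")


def extract_failing_selector(error_message: str) -> Optional[str]:
    """Return the first selector found inside a locator(...) call, or None."""
    match = LOCATOR_PATTERN.search(error_message)
    return match.group("selector") if match else None


sample_error = (
    "TimeoutError: locator.click: Timeout 30000ms exceeded.\n"
    "Call log:\n"
    "  - waiting for locator('#submit-btn')"
)
print(extract_failing_selector(sample_error))
```

If the function returns `None`, the failure is probably not a selector problem and the agent should hand it to a human rather than guess.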

Autonomous test planning

Given a new feature spec, an agent can:

Break the spec into testable scenarios
Determine priority based on risk
Generate test scripts for each scenario
Schedule them in the appropriate test suite
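The prioritization and scheduling steps can be sketched without any LLM at all. This example assumes a simple likelihood-times-impact risk score and a threshold for the smoke suite. The scoring scale, the threshold, and the names `Scenario` and `assign_suites` are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # 1 (rarely breaks) to 5 (breaks often)
    impact: int      # 1 (cosmetic) to 5 (blocks a core flow)

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact


def assign_suites(scenarios: list[Scenario], smoke_threshold: int = 15) -> dict[str, list[str]]:
    """Rank scenarios by risk (highest first) and split them into two suites."""
    ranked = sorted(scenarios, key=lambda s: s.risk, reverse=True)
    suites: dict[str, list[str]] = {"smoke": [], "regression": []}
    for scenario in ranked:
        suite = "smoke" if scenario.risk >= smoke_threshold else "regression"
        suites[suite].append(scenario.name)
    return suites


plan = assign_suites([
    Scenario("update avatar", likelihood=2, impact=1),
    Scenario("checkout with saved card", likelihood=4, impact=5),
    Scenario("password reset email", likelihood=3, impact=4),
    Scenario("login with valid credentials", likelihood=3, impact=5),
])
print(plan)
```

In an agentic setup the LLM would propose the scenarios and the two scores. Keeping the ranking itself deterministic makes the plan easy to review.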

Continuous monitoring agents

A background agent that:

Watches your staging environment 24/7
Runs smoke tests on every deployment
Identifies regressions and creates tickets
Correlates failures with recent code changes
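Correlating failures with code changes can start as a plain path match before any LLM is involved. The sketch assumes a repo convention where `tests/e2e/<feature>/...` covers `src/<feature>/...`. That convention and the function name `suspect_changes` are hypothetical. The changed-file list could come from `git diff --name-only` between the last green commit and the current one.

```python
from pathlib import PurePosixPath


def suspect_changes(failing_tests: list[str], changed_files: list[str]) -> dict[str, list[str]]:
    """Map each failing test to the changed files that share its feature directory."""
    suspects: dict[str, list[str]] = {}
    for test_path in failing_tests:
        parts = PurePosixPath(test_path).parts
        if len(parts) < 4:
            # Path does not follow tests/e2e/<feature>/<file>; nothing to match on.
            suspects[test_path] = []
            continue
        feature = parts[2]
        suspects[test_path] = [
            changed for changed in changed_files
            if feature in PurePosixPath(changed).parts
        ]
    return suspects


failing = ["tests/e2e/checkout/pay.spec.ts", "tests/e2e/search/filters.spec.ts"]
changed = ["src/checkout/PaymentForm.tsx", "src/profile/Avatar.tsx", "README.md"]
print(suspect_changes(failing, changed))
```

A failing test with no matching change (the search test here) is a hint that the cause is environmental or flaky rather than a regression.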

Building your first agent with LangGraph

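LangGraph models an agent as a graph of nodes that read and update a shared, typed state, which makes the decision loop explicit. The example below wires an analyze step and a verify step into a retry loop capped at three attempts.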

PYTHON
from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langgraph.graph import StateGraph, END


class TestAgentState(TypedDict):
    failing_test: str
    error_message: str
    dom_snapshot: str
    proposed_fix: str
    fix_verified: bool
    attempts: int


llm = ChatAnthropic(model="claude-opus-4-6")


def analyze_failure(state: TestAgentState) -> TestAgentState:
    """Agent node: understand what broke and propose corrected test code."""
    response = llm.invoke(f"""
    Failing test:
    {state['failing_test']}

    Error:
    {state['error_message']}

    Current DOM (relevant section):
    {state['dom_snapshot']}

    Identify what changed and propose a fix to the test selector or assertion.
    Output only the corrected test code.
    """)
    return {**state, "proposed_fix": response.content}


def verify_fix(state: TestAgentState) -> TestAgentState:
    """Tool node: run the fixed test."""
    # run_playwright_test is a helper you supply (not defined here). It should run
    # Playwright, e.g. via subprocess, and return an object with a boolean `.passed`.
    result = run_playwright_test(state['proposed_fix'])
    return {**state, "fix_verified": result.passed, "attempts": state['attempts'] + 1}


def should_retry(state: TestAgentState) -> str:
    """Router: decide what happens after a verification run."""
    if state['fix_verified']:
        return "commit"
    if state['attempts'] >= 3:
        return "escalate"
    return "retry"


# Build the graph
graph = StateGraph(TestAgentState)
graph.add_node("analyze", analyze_failure)
graph.add_node("verify", verify_fix)
graph.add_edge("analyze", "verify")
# "commit" and "escalate" both end the run here; in a real pipeline they would
# route to nodes that open a PR or notify a human.
graph.add_conditional_edges("verify", should_retry, {
    "commit": END,
    "retry": "analyze",
    "escalate": END,
})
graph.set_entry_point("analyze")

agent = graph.compile()
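The verify step calls `run_playwright_test`, which is not defined in the example. One possible shape for it is sketched here, under several assumptions: the proposed fix is a complete test file, the default path `tests/repaired.spec.ts` is a placeholder, and Playwright is invoked as `npx playwright test <file>`. `RunResult` is a hypothetical type. The command is a parameter so the function can be exercised without Playwright installed.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence


@dataclass
class RunResult:
    passed: bool
    output: str


def run_playwright_test(
    proposed_fix: str,
    test_path: str = "tests/repaired.spec.ts",  # placeholder location (assumption)
    command: Sequence[str] = ("npx", "playwright", "test"),
    timeout_s: int = 300,
) -> RunResult:
    """Write the proposed test code to disk, run it, and report pass or fail."""
    path = Path(test_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(proposed_fix, encoding="utf-8")
    try:
        completed = subprocess.run(
            [*command, str(path)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hung browser should count as a failed attempt, not block the agent.
        return RunResult(passed=False, output=f"timed out after {timeout_s}s")
    # Exit code 0 is treated as "all tests passed".
    return RunResult(passed=completed.returncode == 0, output=completed.stdout + completed.stderr)
```

Writing to a scratch path rather than over the original test keeps a bad proposal from damaging the suite. The agent only replaces the real file once the run passes.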

The agent observability problem

Autonomous agents are powerful but opaque. You need to know:

What decisions did the agent make and why?
Where did it go wrong?
What tools did it call?

Always instrument your agents with tracing:

PYTHON
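from langsmith import traceable


@traceable(name="test-repair-agent")
def run_repair_agent(failing_test: str, error: str):
    # `agent` is the compiled graph from the previous example.
    # fetch_dom_snapshot is a helper you supply (not defined here) that returns
    # the relevant part of the current DOM as a string.
    return agent.invoke({
        "failing_test": failing_test,
        "error_message": error,
        "dom_snapshot": fetch_dom_snapshot(),
        "proposed_fix": "",
        "fix_verified": False,
        "attempts": 0,
    })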

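With tracing enabled (this needs a LangSmith API key and the tracing setting in your environment), LangSmith records each traced step with its inputs, outputs, timing, and token usage, so you can replay how the agent reached a decision.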

Where to start

Don't try to build a full autonomous QE system on day one. The pragmatic path:

Week 1-2:  AI test generation (single-shot, human reviews)
Week 3-4:  Automated failure analysis (agent reads logs, suggests fixes)
Month 2:   Self-healing selector repair (limited scope)
Month 3+:  Full autonomous test maintenance pipeline

Start small, measure the time savings, and expand from there.

I'm running workshops on Agentic AI for QE teams. If your organization wants to explore this, reach out.

Agentic AI vs traditional test automation

It helps to understand exactly where agentic AI sits in relation to what you already do.

Traditional automated tests are deterministic scripts: they follow exactly the steps you write, fail at the step where an expectation is not met, and require a human to update them when the application changes. They are fast, repeatable, and well understood, but they break when the UI shifts, need constant maintenance, and can't adapt to unexpected application states.

Agentic AI tests are goal-driven workflows: you tell the agent what to verify, and it determines how to verify it. When the UI changes, the agent adapts. When it encounters an unexpected state, it reasons about what to do next rather than throwing an error.

The important nuance: agentic AI doesn't replace scripted tests. It augments them. Your regression suite remains scripted and deterministic — predictable, fast, and trustworthy. Agentic AI handles the expensive maintenance work and the exploratory discovery that scripted tests can't do economically.

Dimension	Scripted tests	Agentic tests
Maintenance	Manual, constant	Self-healing, reduced
Determinism	High	Lower
Coverage	Known, explicit	Can discover unknowns
Speed	Fast	Slower (LLM calls)
Cost per test	Low	Higher
Best for	Regression, smoke	Exploratory, maintenance

Real-world agentic QE patterns

Pattern 1: The selector maintenance agent

The most immediately valuable agent. Runs after a deployment, detects broken selectors, and proposes fixes as a pull request — without paging anyone at 2 AM.

Integration with Azure DevOps:

YAML
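# azure-pipelines.yml: post-deploy agent stage (fragment, goes under `stages:`)
- stage: SelectorMaintenance
  # failed('E2ETests') can only see stages listed in dependsOn, so this
  # assumes E2ETests is a stage in the same pipeline.
  dependsOn:
    - Deploy
    - E2ETests
  condition: failed('E2ETests')
  jobs:
    - job: RepairAgent
      steps:
        - script: |
            python agents/selector_repair.py \
              --test-results test-results/results.xml \
              --base-url $(STAGING_URL)
          env:
            # Secret variables must be mapped explicitly to reach the script.
            ANTHROPIC_API_KEY: $(ANTHROPIC_API_KEY)
          displayName: Run selector repair agent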
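The pipeline passes `test-results/results.xml` to the repair script. The first thing such a script has to do is find out which tests failed. This sketch assumes the results file is JUnit-style XML with `<testcase>` elements that carry a `<failure>` or `<error>` child. The function name `failed_tests` is illustrative and not taken from the actual script.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml: str) -> list[dict[str, str]]:
    """Return name, classname, and message for every failed or errored test case."""
    root = ET.fromstring(junit_xml)
    failures = []
    for case in root.iter("testcase"):
        # Compare against None explicitly: an Element with no children is falsy.
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is None:
            continue
        failures.append({
            "name": case.get("name", ""),
            "classname": case.get("classname", ""),
            "message": problem.get("message") or (problem.text or "").strip(),
        })
    return failures


sample = """<testsuites>
  <testsuite name="checkout" tests="2" failures="1">
    <testcase classname="checkout.spec.ts" name="pays with saved card"/>
    <testcase classname="checkout.spec.ts" name="applies coupon">
      <failure message="Timeout 30000ms exceeded">waiting for locator('#coupon')</failure>
    </testcase>
  </testsuite>
</testsuites>"""
print(failed_tests(sample))
```

Each returned entry gives the agent a test to open and an error message to reason about.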

Pattern 2: The requirement coverage agent

Given a product requirements document (PRD) or set of user stories, the agent generates test scenarios and flags which requirements have no automated coverage:

PYTHON
import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def analyse_coverage(requirements: list[str], existing_tests: list[str]) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"""
            Requirements:
            {chr(10).join(f'- {r}' for r in requirements)}

            Existing test cases:
            {chr(10).join(f'- {t}' for t in existing_tests)}

            Identify:
            1. Requirements with no test coverage
            2. Requirements with weak coverage (only happy path)
            3. Suggested new test scenarios for gaps

            Output as JSON.
            """
        }]
    )
    # json.loads raises if the model wraps the JSON in prose or a code fence.
    return json.loads(response.content[0].text)
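Models asked to "output as JSON" sometimes wrap the object in a Markdown code fence or add a sentence before it, and a bare `json.loads` call fails on that. A more tolerant parser might look like the sketch below. `parse_json_reply` is a hypothetical helper, and it only handles a single top-level JSON object.

```python
import json
import re

# Matches a Markdown code fence (three backticks), optionally tagged "json".
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a code fence or surrounding prose."""
    fenced = FENCE.search(text)
    candidate = fenced.group(1) if fenced else text
    # Keep only the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(candidate[start:end + 1])


fence = "`" * 3
reply = f'Here is the analysis:\n{fence}json\n{{"uncovered": ["REQ-2"]}}\n{fence}'
print(parse_json_reply(reply))
```

If parsing still fails, treat it as an agent error to log and retry rather than letting the exception take down the pipeline step.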

Pattern 3: The flakiness detective agent

Reads pipeline failure history, identifies patterns in flaky tests, and diagnoses root causes:

PYTHON
def analyse_flakiness(test_name: str, failure_history: list[dict]) -> str:
    # Assumes failure_history holds one record per failed run, across all tests,
    # shaped like {"test": ..., "error": ...}. `client` is the Anthropic client
    # created in the previous example.
    failures = [f for f in failure_history if f['test'] == test_name]

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""
            Test: {test_name}
            Failures: {len(failures)} of {len(failure_history)} recorded failures across all tests

            Failure messages (first 10):
            {chr(10).join(f['error'] for f in failures[:10])}

            Diagnose the likely root cause and suggest a fix.
            Common causes: timing, shared state, environment, selector.
            """
        }]
    )
    return response.content[0].text
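Before asking an LLM why a test is flaky, it helps to confirm that it is flaky rather than simply broken. One common working definition is a test that both passes and fails on the same commit. The sketch assumes run records shaped like `{"test", "commit", "passed"}`. That record shape and the name `flaky_tests` are assumptions.

```python
from collections import defaultdict


def flaky_tests(runs: list[dict]) -> dict[str, float]:
    """Return {test name: failure rate} for tests with mixed results on one commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [failures, runs]
    for run in runs:
        outcomes[(run["test"], run["commit"])].add(run["passed"])
        totals[run["test"]][0] += 0 if run["passed"] else 1
        totals[run["test"]][1] += 1
    # Flaky: the same code produced both a pass and a fail.
    flaky = {test for (test, _commit), seen in outcomes.items() if seen == {True, False}}
    return {test: totals[test][0] / totals[test][1] for test in sorted(flaky)}


runs = [
    {"test": "login", "commit": "a1", "passed": True},
    {"test": "login", "commit": "a1", "passed": False},
    {"test": "login", "commit": "b2", "passed": True},
    {"test": "login", "commit": "b2", "passed": True},
    {"test": "search", "commit": "a1", "passed": False},
    {"test": "search", "commit": "a1", "passed": False},
]
print(flaky_tests(runs))
```

Here `search` fails every time on the same commit, so it is reported as broken by omission rather than flaky. Only `login` goes on to the diagnosis step.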

Guardrails for agentic QE

Autonomous agents that modify test code or create work items need guardrails. Without them, a misconfigured agent can flood your backlog with false bug reports or introduce subtle regressions in the test suite itself.

Always require human approval for:

PRs that change production test code
Bugs created with P1 or P2 severity
Changes to shared test utilities or fixtures
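An approval rule like this is easiest to enforce as a small deterministic check that runs before the agent acts. The sketch below is one way to encode the list. The action kinds, the path prefixes, and the names `AgentAction` and `requires_human_approval` are hypothetical and would need to match your repository layout.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical layout: production tests and shared helpers live under these prefixes.
PROTECTED_PREFIXES = ("tests/e2e/", "tests/utils/", "tests/fixtures/")


@dataclass(frozen=True)
class AgentAction:
    kind: str                            # "modify_files" or "create_bug"
    touched_files: tuple[str, ...] = ()
    severity: Optional[str] = None       # e.g. "P1" to "P4" for bug reports


def requires_human_approval(action: AgentAction) -> bool:
    """Return True when a person must sign off before the action is applied."""
    if action.kind == "modify_files":
        return any(path.startswith(PROTECTED_PREFIXES) for path in action.touched_files)
    if action.kind == "create_bug":
        return action.severity in {"P1", "P2"}
    # Unknown action kinds default to review rather than to autonomy.
    return True


print(requires_human_approval(AgentAction("create_bug", severity="P3")))
```

Defaulting unknown actions to review means a new agent capability cannot bypass the gate just because nobody updated the policy.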

Always log agent reasoning: Every decision an agent makes should be logged — what it observed, what it considered, and why it chose the action it took. This makes debugging agent failures tractable.
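A decision log does not need special tooling. One JSON object per line covers the three things listed above. The field names and the function `log_decision` in this sketch are illustrative.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def log_decision(
    stream: TextIO,
    *,
    step: str,
    observed: str,
    considered: list[str],
    chosen: str,
    reason: str,
) -> dict:
    """Append one decision as a JSON line and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "observed": observed,      # what the agent saw
        "considered": considered,  # the options it weighed
        "chosen": chosen,          # the action it took
        "reason": reason,          # why it chose that action
    }
    stream.write(json.dumps(record) + "\n")
    return record


# Usage: an in-memory stream stands in for a log file.
log = io.StringIO()
log_decision(
    log,
    step="propose_selector",
    observed="locator('#submit') matched 0 elements",
    considered=["text=Submit", "[data-testid=submit]"],
    chosen="[data-testid=submit]",
    reason="test id is stable across copy changes",
)
print(log.getvalue())
```

JSON lines are easy to grep and to load into whatever log tooling the team already uses.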

Set a scope limit: Agents should operate on a clearly defined scope. A selector repair agent should only touch selector strings in test files, not test logic or assertions. A bug-reporting agent should only create work items, not modify code.
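A scope limit is only useful if something checks it. For a selector repair agent, one rough check is to mask the selector strings in the original and patched files and require everything else to be identical. The sketch assumes selectors appear as the first quoted argument of a small set of Playwright-style calls. That list is incomplete, and the regex does not handle escaped quotes or template literals.

```python
import re

# Calls whose first quoted argument is treated as a selector (assumed, incomplete list).
SELECTOR_CALL = re.compile(r"""\b(locator|getByTestId|getByText|getByLabel)\(\s*(['"]).*?\2""")


def mask_selectors(source: str) -> str:
    """Replace each selector string with a fixed placeholder."""
    return SELECTOR_CALL.sub(lambda m: f"{m.group(1)}(<SELECTOR>", source)


def only_selectors_changed(original: str, patched: str) -> bool:
    """True when the two sources differ at most in their selector strings."""
    return mask_selectors(original) == mask_selectors(patched)


original = (
    "await page.locator('#submit').click();\n"
    "await expect(page.locator('.msg')).toHaveText('Saved');"
)
in_scope = original.replace("'#submit'", "'[data-testid=submit]'")
out_of_scope = original.replace("'Saved'", "'Save'")  # touches an assertion

print(only_selectors_changed(original, in_scope))
print(only_selectors_changed(original, out_of_scope))
```

A patch that fails this check should be rejected or routed to a human, since the agent has stepped outside the job it was given.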

Run agents in a sandbox first: Before deploying an agent to your main pipeline, run it against a test branch or staging environment. Verify its outputs match expectations before giving it write access to production repositories.

Getting started this week

You don't need a multi-week implementation project to start benefiting from agentic AI in QE. The simplest entry point:

Set up Claude API access — create an account at anthropic.com, get an API key
Write a failure analysis script — when your pipeline fails, call the API with the error message and get a diagnosis
Integrate it into your pipeline — post the diagnosis as a PR comment or Teams notification

That single addition — an AI that reads test failure messages and explains what they mean — can save your team hours of investigation time per week. From there, you expand: selector repair, test generation, coverage analysis.
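A first version of that failure analysis script can be very small. The sketch below separates prompt building and comment formatting, which are plain string functions, from the API call. The function names, the prompt wording, and the comment format are illustrative. The `model` argument is left to the caller (the examples in this post use `"claude-opus-4-6"`). How you post the comment depends on your CI system and is not shown.

```python
def build_diagnosis_prompt(test_name: str, error_message: str, max_chars: int = 4000) -> str:
    """Build the prompt, keeping only the tail of long logs where the error usually is."""
    tail = error_message[-max_chars:]
    return (
        f"A CI test failed.\nTest: {test_name}\n\nError output:\n{tail}\n\n"
        "Explain the most likely cause in two or three sentences and suggest one next step."
    )


def diagnose(test_name: str, error_message: str, model: str) -> str:
    """Ask the model for a diagnosis. Needs the anthropic package and ANTHROPIC_API_KEY."""
    # Imported here so the string helpers work without the SDK installed.
    from anthropic import Anthropic

    client = Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": build_diagnosis_prompt(test_name, error_message)}],
    )
    return response.content[0].text


def format_pr_comment(test_name: str, diagnosis: str) -> str:
    """Wrap the diagnosis in a short comment body for a PR or chat notification."""
    return (
        f"AI failure analysis for {test_name}\n\n{diagnosis}\n\n"
        "Generated automatically. Verify before acting on it."
    )


print(format_pr_comment("login", "The submit button id changed from #submit to #login-submit."))
```

Trimming the log to its tail keeps the request small and cheap, at the cost of occasionally cutting off context that appears earlier in the output.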

The agentic QE revolution is not about replacing QA engineers. It's about removing the tedious, repetitive parts of the work so QA engineers can focus on what matters: understanding the product, designing meaningful tests, and making quality decisions.