DevOps 5 min read

Self-Healing Test Automation in Azure DevOps

How to implement self-healing test automation in Azure DevOps. Covers Healenium for Selenium, Playwright's resilient locators, automatic retry strategies.

I
InnovateBits
InnovateBits

Self-healing test automation reduces maintenance overhead by automatically adapting to UI changes. Combined with Azure DevOps's retry mechanisms and analytics, you can build a test suite that stays green even as the application evolves.


Why tests break unnecessarily

The majority of test maintenance work falls into these categories:

CauseFrequencyPreventable?
CSS selector changed (UI refactor)HighYes — with self-healing
Text changed (copy update)MediumYes — with flexible matchers
Timing (flakiness)HighYes — with waits and retries
API response structure changedMediumYes — with contract tests
Actual bug (genuine regression)LowNo — this is what we want to catch

Self-healing addresses the first three categories — reducing noise so genuine failures stand out.


Playwright's resilient locator strategy

Playwright's built-in locators are already more resilient than raw CSS selectors:

TYPESCRIPT
1// Fragile — breaks when CSS class changes 2await page.locator('.btn-checkout-v2').click() 3 4// Resilient — tied to role and accessible name 5await page.getByRole('button', { name: 'Place order' }).click() 6 7// Resilient — tied to data attribute (never changes if dev keeps it) 8await page.getByTestId('place-order-btn').click() 9 10// Resilient — tied to user-visible text 11await page.getByText('Place order').click()

Add data-testid to your development Definition of Done: every interactive element must have a data-testid attribute. This single convention eliminates 80% of selector-related test breaks.


Automatic retries in Azure Pipelines

Configure retries at the pipeline level:

YAML
1# playwright.config.ts 2export default defineConfig({ 3 retries: process.env.CI ? 2 : 0, // Retry failed tests up to 2 times in CI 4 expect: { 5 timeout: 10_000, 6 }, 7})
YAML
1# azure-pipelines.yml — retry the entire job if flakiness is suspected 2jobs: 3 - job: E2ETests 4 timeoutInMinutes: 30 5 retryCountOnTaskFailure: 1 # Retry the entire job once if it fails 6 steps: 7 - script: npx playwright test

Use job-level retries sparingly — they mask systemic issues. Prefer test-level retries via retries: 2 in Playwright config.


Healenium for Selenium self-healing

Healenium uses ML to find elements when their locators break:

YAML
1# docker-compose.healenium.yml 2version: '3' 3services: 4 healenium-db: 5 image: postgres:15 6 environment: 7 POSTGRES_DB: healenium 8 POSTGRES_USER: healenium 9 POSTGRES_PASSWORD: healenium 10 11 healenium-proxy: 12 image: healenium/hlm-proxy:3.5.0 13 ports: 14 - "8085:8085" 15 environment: 16 SPRING_DATASOURCE_URL: jdbc:postgresql://healenium-db:5432/healenium 17 depends_on: 18 - healenium-db 19 20 healenium-backend: 21 image: healenium/hlm-backend:3.5.0 22 ports: 23 - "7878:7878" 24 environment: 25 DB_HOST: healenium-db 26 depends_on: 27 - healenium-db
YAML
1# azure-pipelines.yml — start Healenium before Selenium tests 2steps: 3 - script: | 4 docker-compose -f docker-compose.healenium.yml up -d 5 sleep 10 # Wait for services to start 6 displayName: Start Healenium 7 8 - script: | 9 mvn test \ 10 -Dselenide.remote=http://localhost:8085/wd/hub \ 11 -DBASE_URL=$(STAGING_URL) 12 displayName: Run Selenium with self-healing 13 14 - script: docker-compose -f docker-compose.healenium.yml down 15 displayName: Stop Healenium 16 condition: always()

When a locator fails, Healenium:

  1. Takes a screenshot of the current page
  2. Compares it to the last successful screenshot
  3. Finds the element using visual similarity and nearby text
  4. Updates the locator for future runs
  5. Logs the healed locator so developers can update the code

Flakiness detection in Azure DevOps

Azure Pipelines automatically tracks test flakiness. Go to Pipelines → [Pipeline] → Analytics → Test flakiness:

  • Tests flagged as flaky: pass rate between 5–95% across multiple runs
  • Flakiness rate trend over time
  • Most flaky tests list

Act on flakiness immediately:

YAML
1# Mark known flaky tests and quarantine them 2- script: | 3 # Generate flakiness report from pipeline analytics API 4 az pipelines runs list \ 5 --pipeline-id $(System.DefinitionId) \ 6 --result failed \ 7 --query "[].{id:id, tests:url}" \ 8 --output table 9 displayName: Check flakiness trend

Quarantine flaky tests so they don't block CI, then fix them:

TYPESCRIPT
1// Temporarily skip flaky test while fixing 2test.skip('TC-204: Wishlist limit — flaky timing issue', async ({ page }) => { 3 // TODO: fix timing issue — tracked in bug #891 4})

Create a bug in Azure Boards for each quarantined test and track resolution.


Proactive maintenance: test health dashboard

Build a dashboard that shows test suite health over time:

Test Suite Health — Week of 2025-10-20

Total tests:        284
Passing rate:        97.2%
Flaky tests:           4  (1.4% — threshold: < 3%)
Quarantined:           2
Average duration:  11m 34s  (target: < 15m)

Top flaky tests:
  checkout › payment › 3DS redirect     (fails 23% of runs)
  auth › session › token refresh        (fails 8% of runs)
  profile › avatar › upload large file  (fails 12% of runs)
  search › filters › price range        (fails 6% of runs)

Actions needed:
  ⚠ Fix 3DS redirect timing (priority: high)
  ⚠ Add explicit wait to token refresh test

Common errors and fixes

Error: Healenium doesn't heal broken locators — falls through to original failure Fix: Healenium requires a previous successful run to compare against. On the first run after a UI change, it will fail. After one successful run with the new UI, future runs self-heal.

Error: Playwright retries inflate test counts in the Tests tab Fix: Configure the JUnit reporter to only report final results: reporter: [['junit', { includeProjectInTestName: true }]]. Each retry shows as a separate row in the raw XML but Azure DevOps collapses them by default.

Error: Job-level retries run all tests again even when only 2 out of 200 failed Fix: Use Playwright's built-in --retries instead of job-level retry. Playwright retries only the failed tests, not the entire suite.

Error: Flakiness analytics show 0 flaky tests despite known intermittent failures Fix: Azure DevOps needs at least 7 pipeline runs to calculate flakiness. Also ensure test results are being published consistently — a pipeline that stops publishing results when it fails won't accumulate enough data.

Tags
#self-healing-tests#azure-devops#healenium#playwright#flaky-tests#test-maintenance#test-automation

Share this article

Follow for more

Follow me on social media for more developer tips, tricks, and tutorials. Let's connect and build something great together!