Should a failed accessibility test break the CI build?

Gate on a curated rule set, not the full default. Fail the build on a stable allowlist of high-confidence rules (image-alt, label, color-contrast, button-name, aria-required-attr, document-title) and report-only on the rest. This keeps the signal high and stops noisy rules from blocking unrelated merges. Expand the blocking set as the codebase gets cleaner.

Why does axe-core report fewer issues than a manual audit?

axe-core is deliberately conservative to keep false positives near zero, so it only flags violations it can verify deterministically from the DOM and computed styles. It cannot judge whether alt text is meaningful, whether focus order is logical, or whether an ARIA pattern behaves correctly with a screen reader. Those require human and assistive-technology testing. Roughly half of WCAG 2.2 success criteria have no reliable automated check.

Is Lighthouse enough for accessibility testing in CI?

No. Lighthouse runs a subset of axe-core rules and produces a 0-100 score, which is useful as a trend signal but weak as a gate. A score of 100 only means the automated subset found nothing, not that the page is accessible. Use axe-core directly through Playwright for assertions and reserve Lighthouse for performance and a coarse accessibility trendline.

How do I test accessibility in CI without flaky tests?

Run axe against fully settled DOM states, not mid-render. Wait for network idle and key elements before scanning, scope scans to the component or region under test, and exclude known third-party widgets you do not control. Disabling the color-contrast rule for regions that animate or transition also removes a common source of nondeterminism, because the colors axe reads depend on the instant the scan runs.
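A minimal sketch of that sequence, written as a helper that receives the page and the axe builder as arguments. The helper name `scanSettled`, its option names, and the small structural interfaces are hypothetical and exist only so the sketch type-checks on its own; the methods it calls (`waitForLoadState`, `locator().waitFor`, `include`, `exclude`, `disableRules`, `analyze`) are the ones Playwright and @axe-core/playwright expose.

```typescript
// Structural stand-ins for the parts of Playwright's Page and AxeBuilder used here.
// Assumption: a real `page` and `new AxeBuilder({ page })` satisfy these shapes.
interface LocatorLike {
  waitFor(options?: { state?: "visible" }): Promise<void>;
}
interface PageLike {
  waitForLoadState(state?: "networkidle"): Promise<void>;
  locator(selector: string): LocatorLike;
}
interface AxeBuilderLike {
  include(selector: string): AxeBuilderLike;
  exclude(selector: string): AxeBuilderLike;
  disableRules(rules: string[]): AxeBuilderLike;
  analyze(): Promise<{ violations: { id: string }[] }>;
}

// Hypothetical options for this sketch.
interface ScanOptions {
  readySelector: string; // a key element that proves the view has rendered
  scope: string; // the component or region under test
  thirdParty: string[]; // widgets you do not control
  skipContrast?: boolean; // set for regions that animate or transition
}

async function scanSettled(
  page: PageLike,
  axe: AxeBuilderLike,
  opts: ScanOptions,
): Promise<{ id: string }[]> {
  // 1. Settle: network idle first, then the key element must be visible.
  await page.waitForLoadState("networkidle");
  await page.locator(opts.readySelector).waitFor({ state: "visible" });

  // 2. Scope to the region under test and drop third-party widgets.
  let builder = axe.include(opts.scope);
  for (const selector of opts.thirdParty) {
    builder = builder.exclude(selector);
  }

  // 3. Optionally skip the contrast rule where animation makes it unstable.
  if (opts.skipContrast) {
    builder = builder.disableRules(["color-contrast"]);
  }

  return (await builder.analyze()).violations;
}
```

In a Playwright spec this would be called as `scanSettled(page, new AxeBuilder({ page }), { ... })`, with the returned violations passed to an `expect` assertion.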

Does passing automated accessibility checks make a site WCAG or EAA compliant?

No. Automated checks cover only the machine-testable slice of WCAG Level A and AA. EN 301 549, the European standard used to show conformance with the European Accessibility Act, currently incorporates WCAG 2.1 Level A and AA; WCAG 2.2 builds on 2.1, so testing against 2.2 covers that baseline. Conformance claims require manual evaluation, keyboard testing, and screen-reader testing on top of automation. Treat green CI as the floor, not the ceiling.

Accessibility Testing in CI: axe-core & Playwright

Accessibility testing in CI works best when you treat it as a fast, deterministic regression net, not a compliance verdict. Wired correctly, it catches the broken label, the 3.8:1 button, and the empty heading before they reach review. Wired badly, it floods pull requests with noise until the team disables it.

This guide covers the three tools engineering teams actually reach for (axe-core, Playwright, and Lighthouse): what each genuinely catches, where they go silent, and how to gate a pipeline without blocking unrelated work. The honest framing up front: automation verifies roughly half of WCAG 2.2 success criteria, so CI is one layer in a system that still needs human testing.

The three tools and what each one is actually for

These tools are often discussed as alternatives. They are not. They sit at different layers, and the strongest setups run all three with distinct jobs.

axe-core is the rules engine. It parses the rendered DOM and computed styles, applies deterministic rules, and returns violations with the offending selector, the WCAG criterion, and remediation help. It is the engine behind Lighthouse's accessibility category and many other checkers, and it is tuned for near-zero false positives.
Playwright (or Cypress) is the driver. axe-core needs a real, fully rendered page to evaluate, including client-side state, opened modals, and expanded menus. The @axe-core/playwright package injects axe into a live browser context so you can scan the DOM after interactions, not just the static HTML.
Lighthouse is the trend gauge. Its accessibility category runs a curated subset of axe-core rules and rolls them into a 0-100 score. That score is useful for spotting regressions over time, but it is a coarse signal, not an assertion you should block a merge on.

The practical division of labor: assert with axe-core through Playwright, watch the trend with Lighthouse, and never mistake a Lighthouse 100 for a conformance claim. A perfect score means the automated subset found nothing, which is a much weaker statement than it sounds.
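To make the axe-core output described above concrete, here is a sketch that flattens a violations array into one log line per offending node. The interfaces mirror only the subset of the axe-core result shape this example reads, and the function names `successCriteria` and `formatViolations` are hypothetical. It assumes success criteria appear in `tags` as entries such as `wcag111` (1.1.1) or `wcag143` (1.4.3), alongside level tags such as `wcag2aa`.

```typescript
// Subset of the axe-core result shape used by this sketch.
interface AxeNode {
  // One entry per frame level; a nested array is a shadow DOM path.
  target: (string | string[])[];
}
interface AxeViolation {
  id: string;
  impact?: string | null;
  help: string;
  helpUrl: string;
  tags: string[];
  nodes: AxeNode[];
}

// Turn tags like "wcag143" into "1.4.3"; level tags like "wcag2aa" are skipped.
function successCriteria(tags: string[]): string[] {
  const criteria: string[] = [];
  for (const tag of tags) {
    const match = /^wcag(\d)(\d)(\d{1,2})$/.exec(tag);
    if (match) {
      criteria.push(`${match[1]}.${match[2]}.${match[3]}`);
    }
  }
  return criteria;
}

// One human-readable line per offending node, suitable for CI logs.
function formatViolations(violations: AxeViolation[]): string[] {
  const lines: string[] = [];
  for (const v of violations) {
    const criteria = successCriteria(v.tags);
    const scLabel = criteria.length > 0 ? `WCAG ${criteria.join(", ")}` : "no WCAG SC tag";
    const impact = v.impact ?? "unknown";
    for (const node of v.nodes) {
      const selector = node.target
        .map((part) => (Array.isArray(part) ? part.join(" ") : part))
        .join(" ");
      lines.push(`[${impact}] ${v.id} (${scLabel}) at ${selector}: ${v.help} <${v.helpUrl}>`);
    }
  }
  return lines;
}
```

Printing these lines when an assertion fails keeps the selector, the criterion, and the help link in the CI log, so a developer does not have to rerun the scan locally to see what broke.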

What accessibility testing in CI reliably catches

Automated engines excel at anything expressible as a measurable condition on the DOM. Run on every page and component, they catch a real and worthwhile class of defect:

Missing programmatic names: an img with no alt attribute, an input with no associated label, an icon button with no accessible name.
Color contrast below a numeric threshold: text under 4.5:1 (or under 3:1 for large text, meaning at least 18pt, or at least 14pt bold) and UI components or graphics under 3:1, per WCAG 1.4.3 Contrast (Minimum) and 1.4.11 Non-text Contrast. The AccessScan contrast checker covers the same math on individual pairs.
Structural defects: missing page language, empty headings, skipped heading levels, and duplicate IDs in the rendered output.
Invalid ARIA: nonexistent roles, missing required ARIA attributes, and aria-* references pointing at IDs that are not present on the page.
Coarse keyboard signals: positive tabindex values, or elements with click handlers but no focusable role.

Because axe is conservative, a violation is almost always a genuine bug. That reliability is exactly why it belongs in CI: the false-positive rate is low enough that a failure is worth a developer's attention. For the full machine-testable picture in seconds, an automated scan covers this class across a whole page.
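The contrast thresholds above come from a fixed formula, which is what makes them machine-checkable. The sketch below follows the WCAG definitions of relative luminance and contrast ratio for 6-digit sRGB hex colors; the function names are hypothetical, and it ignores alpha, gradients, and background images, which a real engine has to resolve from computed styles.

```typescript
// Convert one 8-bit sRGB channel to its linear-light value (WCAG relative luminance).
function channelToLinear(value8bit: number): number {
  const c = value8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a color given as "#rrggbb" or "rrggbb".
function relativeLuminance(hex: string): number {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex);
  if (!match) {
    throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  }
  const rgb = parseInt(match[1], 16);
  const r = channelToLinear((rgb >> 16) & 0xff);
  const g = channelToLinear((rgb >> 8) & 0xff);
  const b = channelToLinear(rgb & 0xff);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  return (Math.max(l1, l2) + 0.05) / (Math.min(l1, l2) + 0.05);
}

// 1.4.3 thresholds: 4.5:1 for normal text, 3:1 for large text.
function passesTextContrast(
  foreground: string,
  background: string,
  isLargeText = false,
): boolean {
  return contrastRatio(foreground, background) >= (isLargeText ? 3 : 4.5);
}
```

For example, `#767676` on white clears 4.5:1 while the barely lighter `#777777` falls just short, so `passesTextContrast('#777777', '#ffffff')` is false for normal text and true with `isLargeText` set.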

What it cannot catch, and why that matters

The hard limit is that a machine cannot judge meaning, intent, or experience. These are not edge cases; they are the majority of real-world accessibility failures, and they pass every automated check cleanly:

Alt text that exists but is wrong. alt="image" on a product photo passes axe and fails a user. Quality of alternative text is a human judgment.
Illogical focus order or focus that gets trapped. axe sees focusable elements but cannot tell whether tabbing through them makes sense, which is the heart of keyboard accessibility.
ARIA that is syntactically valid but semantically wrong: a tab pattern that announces incorrectly, a live region that never fires. Validity is checkable; correct behavior is not. This maps to criteria like 2.4.7 Focus Visible and 4.1.3 Status Messages that need observation, not parsing.
Whether captions match the audio, whether a custom control is operable, whether an error message is actually understandable.
Newer WCAG 2.2 criteria such as 2.5.7 Dragging Movements (AA), 2.5.8 Target Size (Minimum) (AA, 24x24 CSS px), and 3.3.8 Accessible Authentication (Minimum) (AA), which generally require manual verification of interaction.

This is why a green pipeline is a floor, not a finish line. Automation clears the measurable defects so human testers spend their limited time on the judgment calls only they can make.

Wiring it into the pipeline

A workable setup looks like this. Inject axe through Playwright after the page reaches a settled state, then assert on the results.

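import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('checkout has no automatically detectable a11y violations', async ({ page }) => {
  await page.goto('/checkout');
  // Let network activity settle so axe scans the final DOM, not a mid-render state.
  await page.waitForLoadState('networkidle');

  const results = await new AxeBuilder({ page })
    // axe tags are per WCAG version and level, not cumulative, so list each one.
    .withTags(['wcag2a', 'wcag2aa', 'wcag21a', 'wcag21aa', 'wcag22aa'])
    .analyze();

  // Fail on any violation in the selected rule set.
  expect(results.violations).toEqual([]);
});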

Three decisions separate a setup teams keep from one they rip out within a month:

Scan settled DOM, not mid-render. Wait for networkidle and key elements before analyzing, and scan after interactions (open the modal, expand the menu) so dynamic states are covered. Mid-render scans are the top cause of flaky accessibility tests.
Gate on a curated rule set. Block merges on a high-confidence allowlist (image-alt, label, color-contrast, button-name, aria-required-attr, document-title) and run everything else in report-only mode. Expand the blocking set as debt clears.
Scope and exclude deliberately. Use AxeBuilder's include/exclude to skip third-party widgets you do not control, and scope component-level scans to the component. This keeps failures attributable to the code under review.
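The curated gate from the list above can be sketched as a small pure function that splits a violations array into a blocking set and a report-only set. The names `BLOCKING_RULES` and `partitionViolations` are hypothetical; the rule IDs are the allowlist named in this guide, and the only assumption about the input is that each violation carries the rule `id` that axe-core reports.

```typescript
// Allowlist of high-confidence rules that are allowed to fail the build.
const BLOCKING_RULES: ReadonlySet<string> = new Set([
  "image-alt",
  "label",
  "color-contrast",
  "button-name",
  "aria-required-attr",
  "document-title",
]);

// Split violations into those that block the merge and those that are only reported.
function partitionViolations<T extends { id: string }>(
  violations: readonly T[],
  blockingRules: ReadonlySet<string> = BLOCKING_RULES,
): { blocking: T[]; reportOnly: T[] } {
  const blocking: T[] = [];
  const reportOnly: T[] = [];
  for (const violation of violations) {
    if (blockingRules.has(violation.id)) {
      blocking.push(violation);
    } else {
      reportOnly.push(violation);
    }
  }
  return { blocking, reportOnly };
}
```

In a spec, the test would assert that `blocking` is empty and log or attach `reportOnly` without failing. Keeping the allowlist in one exported constant makes expanding the gate a one-line, reviewable change.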

Add a separate Lighthouse CI job for the accessibility trendline and performance, but keep its threshold advisory. Run the blocking axe job on every pull request; reserve full-site crawls for nightly builds where runtime is less precious.

Combining automation with manual testing

CI handles the regression net. The criteria automation cannot reach need a deliberate human layer, scheduled so it does not bottleneck releases:

Keyboard pass on new or changed interactive components: tab through, operate with Enter, Space, and arrows, confirm visible focus and no traps.
Screen-reader smoke test on critical flows (signup, checkout, search) with at least one real assistive technology.
A structured checklist for the judgment criteria, so coverage is consistent across reviewers. The AccessScan accessibility checklist maps to WCAG 2.2 A and AA.

This layering matters for conformance, not just quality. The European Accessibility Act applies from 28 June 2025, and conformance is assessed against EN 301 549, which currently incorporates WCAG 2.1 Level A and AA; WCAG 2.2 builds on 2.1, so testing against 2.2 covers that baseline. The same standard sits behind national laws like Germany's BFSG. A conformance claim or an accessibility statement rests on the full evaluation, automated plus manual, not on a passing CI run alone. Treat accessibility testing in CI as the fast, cheap first line that makes the expensive human testing count.
