Video Accessibility Captions: A Practical Guide for Content Teams

Video and audio carry more of your message every year, and every clip you publish is a place where accessibility either works or quietly breaks. Get it right and a Deaf viewer reads your launch announcement, a commuter follows your tutorial on mute, and a search engine indexes your transcript. Get it wrong and you exclude real users and risk falling short of legal requirements.

This guide is written for content and marketing teams who own the videos, podcasts, and social clips but do not write the player code. It explains captions, transcripts, audio description, and autoplay controls in concrete terms, maps each to the exact WCAG criterion it satisfies, and shows where the common shortcuts fail. For most platforms, none of it requires a developer.

Why media accessibility is now a baseline, not a nice-to-have

Most media accessibility requirements sit under the first of WCAG's four principles, Perceivable (the others are Operable, Understandable, and Robust); the one exception in this guide, 2.2.2 Pause, Stop, Hide, belongs to Operable. The idea is simple: if a user cannot perceive the content, nothing else about the experience matters. A video with no captions is unperceivable to someone who is Deaf; an autoplaying ad with no off switch is hostile to someone using a screen reader.

The legal context sharpened in 2025. The European Accessibility Act (Directive (EU) 2019/882) applies from 28 June 2025 to a wide range of private-sector products and services, with media-heavy sectors like streaming, e-commerce, and banking squarely in scope. In practice its technical baseline is WCAG 2.1 Level A and AA, as incorporated in the current version of the European standard EN 301 549 (v3.2.1). Public-sector bodies have faced equivalent rules since the EU Web Accessibility Directive (2016/2102), which uses the same standard. Enforcement varies by member state and can include penalties, so the safe target is clear: meet every Level A and AA media criterion. For how these frameworks fit together, see our overview of the European Accessibility Act.

Captions: the criterion most people get half-right

Captions are time-synchronized text overlaid on a video. Two WCAG criteria govern them, and the difference matters.

  • 1.2.2 Captions (Prerecorded), Level A: every prerecorded video with audio needs captions. This is the floor, not a stretch goal.
  • 1.2.4 Captions (Live), Level AA: live streams, webinars, and broadcasts need real-time captions too, typically via a CART provider or a vetted live ASR feed.

The frequent mistake is treating captions as a transcript of speech only. Proper captions also identify who is speaking and convey meaningful non-speech audio: [door slams], [applause], [ominous music]. A scene where dialogue stops and a phone rings off-screen is incomprehensible without that cue. Closed captions, which the user can toggle, are strongly preferred over open captions burned permanently into the frame, because burned-in text cannot be resized, translated, or turned off and often clashes with player controls.
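If someone on the team is comfortable running a short script, a rough heuristic can flag caption files that look like speech-only transcripts. The sketch below is an illustration under stated assumptions, not part of any standard tool: the function name caption_cue_report is hypothetical, it assumes sound cues are written in square brackets and speakers appear as "Name: " prefixes or WebVTT voice tags (<v Name>), and it cannot judge whether the cues are accurate.

```python
import re

# Assumption: non-speech audio is written in square brackets, e.g. [door slams].
SOUND_CUE = re.compile(r"\[[^\]]+\]")
# Assumption: speakers are marked as a WebVTT voice tag (<v Maria>) or a
# "Maria: " prefix at the start of a caption line. This is a heuristic.
SPEAKER = re.compile(r"^(?:<v [^>]+>|[A-Z][\w .'-]{0,30}:\s)", re.MULTILINE)


def caption_cue_report(vtt_text: str) -> dict:
    """Count caption text lines, bracketed sound cues, and speaker labels."""
    text_lines = [
        line
        for line in vtt_text.splitlines()
        # Skip blank lines, timing lines, and the WEBVTT header.
        if line.strip() and "-->" not in line and not line.startswith("WEBVTT")
    ]
    body = "\n".join(text_lines)
    return {
        "text_lines": len(text_lines),
        "sound_cues": len(SOUND_CUE.findall(body)),
        "speaker_labels": len(SPEAKER.findall(body)),
    }
```

A file that reports zero sound cues and zero speaker labels is worth a manual review; a non-zero count does not prove the captions are complete.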

On the production side, captions should be a real, edited track, not raw auto-generated output. Use a sidecar file in WebVTT (.vtt) or SRT format so the player exposes a captions button. Most hosted platforms accept either format; the native HTML5 video element expects WebVTT, referenced from a track element with kind set to captions and srclang set to the correct language. Start from the platform's auto-captions if you must, but always edit names, jargon, and punctuation before publishing, and verify that the timing is in sync.
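When a vendor delivers SRT and the player needs WebVTT, the conversion is mostly mechanical: add the WEBVTT header, drop the cue counters, and write milliseconds after a dot instead of a comma. The following is a minimal sketch under those assumptions; srt_to_vtt and track_tag are hypothetical helper names, the file name in the usage note is made up, and the code does not handle SRT styling tags or positioning.

```python
import re
from html import escape

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200".
SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT caption text to WebVTT (simplified sketch)."""
    normalized = srt_text.replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # SRT cues start with a numeric counter; WebVTT does not need it.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # WebVTT writes milliseconds after a dot, not a comma.
        lines[0] = SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        cues.append("\n".join(lines))
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


def track_tag(src: str, srclang: str, label: str) -> str:
    """Build the HTML track element that points a video at a caption file."""
    return (
        f'<track kind="captions" src="{escape(src)}" '
        f'srclang="{escape(srclang)}" label="{escape(label)}">'
    )
```

For example, track_tag("launch.en.vtt", "en", "English") produces the markup a developer or CMS template would place inside the video element. Always play the converted file against the video once to confirm the timing survived.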

Transcripts and audio description: covering what captions miss

Transcripts

A transcript is a standalone text version of the content. For audio-only media like a podcast, a transcript is the whole accessibility story: criterion 1.2.1 (Audio-only and Video-only, Prerecorded, Level A) is satisfied by a text alternative that captures the spoken content, identifies the speakers, and notes any meaningful sounds. Transcripts also help search engines and let users skim, so they earn their keep beyond compliance. Place the transcript on the same page or one clearly labeled click away, never behind an unlabeled download link.
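If you already have an edited caption file, it can seed a transcript draft rather than starting from a blank page. This is a minimal sketch, assuming WebVTT input with cues separated by blank lines; vtt_to_transcript_draft is a hypothetical helper name, and the output still needs a human pass to add paragraphing and, for video, descriptions of important visuals.

```python
import re


def vtt_to_transcript_draft(vtt_text: str) -> str:
    """Flatten WebVTT cues into one line of plain text per cue."""
    normalized = vtt_text.replace("\r\n", "\n").strip()
    lines_out = []
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            if "-->" not in line:
                continue  # header, cue identifier, or note line
            # Everything after the timing line is the caption text.
            text = " ".join(part.strip() for part in lines[index + 1:] if part.strip())
            # Turn WebVTT voice tags (<v Sam>) into "Sam: " prefixes,
            # then drop any remaining inline tags such as </v> or <i>.
            text = re.sub(r"<v ([^>]+)>", r"\1: ", text)
            text = re.sub(r"</?[^>]+>", "", text)
            if text:
                lines_out.append(text)
            break
    return "\n".join(lines_out)
```

Blocks without a timing line, such as the WEBVTT header, are skipped, so only caption text reaches the draft.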

Audio description

Audio description narrates the important visual information a sighted viewer gets for free: on-screen text, a chart that appears silently, a character's gesture. Two criteria apply. 1.2.3 (Audio Description or Media Alternative, Prerecorded, Level A) lets you cover prerecorded video with either audio description or a full text alternative that includes the visual detail. 1.2.5 (Audio Description, Prerecorded, Level AA) raises the bar and requires actual audio description for prerecorded video. Because AA is the EAA and EN 301 549 baseline, plan for 1.2.5.

The cheapest way to comply is to design the problem out: write scripts so the narrator speaks any on-screen text and describes key actions aloud. A tutorial that says "click the blue Save button in the top right" instead of silently pointing needs no separate description track. When you cannot, produce a described version of the video. A detailed transcript that conveys the visuals meets 1.2.3 at Level A, but it does not replace audio description under 1.2.5 at Level AA. Our deeper video and audio accessibility guide walks through each option.

Autoplay and audio controls

Media that moves or makes noise without consent is one of the most common and most fixable failures. Two criteria apply, both Level A.

  • 1.4.2 Audio Control: if any audio plays automatically for more than three seconds, you must provide a way to pause or stop it, or to control its volume independently of the system volume. Auto-playing sound over a screen reader's own speech is disorienting and can drown it out entirely.
  • 2.2.2 Pause, Stop, Hide: any moving, blinking, or scrolling content that starts automatically, lasts more than five seconds, and appears alongside other content must be pausable, stoppable, or hideable. Auto-updating content needs an equivalent control. This catches autoplaying hero videos, looping background clips, and carousels.

The practical rule for marketing pages: if a background video autoplays, mute it by default and give it a visible pause control. A silent, decorative loop with a working pause button is fine; an autoplaying clip with sound and no off switch is a Level A failure. Autoplay also drains data and battery on mobile, so the accessible choice is usually the better UX choice.
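The rule above can be spot-checked on a page's HTML before publishing. The sketch below uses Python's standard html.parser module to flag video and audio elements that autoplay without the muted or controls attributes. It is a heuristic under stated assumptions: AutoplayAudit and audit_autoplay are hypothetical names, the check only sees native attributes in static markup, and it cannot detect custom pause buttons, script-driven playback, or clip length.

```python
from html.parser import HTMLParser


class AutoplayAudit(HTMLParser):
    """Collect (line, tag, message) findings for autoplaying media elements."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("video", "audio"):
            return
        # Boolean attributes such as autoplay arrive as (name, None).
        names = {name for name, _ in attrs}
        if "autoplay" not in names:
            return
        line, _ = self.getpos()
        if "muted" not in names:
            self.findings.append((line, tag, "autoplays with sound; review against 1.4.2"))
        if tag == "video" and "controls" not in names:
            self.findings.append(
                (line, tag, "autoplays without native controls; confirm a pause button exists (2.2.2)")
            )


def audit_autoplay(html_text: str) -> list:
    """Return the findings for one HTML document or fragment."""
    parser = AutoplayAudit()
    parser.feed(html_text)
    parser.close()
    return parser.findings
```

An empty result means no native autoplay attribute was found without muted and controls; it is a prompt for manual review, not proof of conformance.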

A quick checklist before you publish

  • Every video with audio has accurate, edited captions in a .vtt or .srt track, including speaker labels and key sounds (1.2.2 / 1.2.4).
  • Audio-only content has a transcript on or near the page (1.2.1).
  • Visual-only information is narrated in the audio or covered by audio description or a detailed transcript (1.2.3 / 1.2.5).
  • Nothing autoplays with sound without a mute, pause, or stop control (1.4.2), and any motion over five seconds is pausable (2.2.2).
  • Captions and controls remain operable by keyboard and visible against the video.

These items map directly to the official success criteria; you can look up the full text for each on our WCAG reference. When you are ready to confirm the rest of a page passes, run it through a free scan with AccessScan to catch missing controls, contrast issues, and structural problems around your embeds.

FAQ

Are auto-generated YouTube captions good enough for WCAG?

No. Automatic speech recognition routinely misreads names, technical terms, and anything spoken over background noise, and it usually omits speaker labels and most sound cues. WCAG 1.2.2 requires captions that accurately convey dialogue and meaningful non-speech audio, so treat auto-captions as a first draft and edit them before publishing. The good news is that YouTube and most platforms let you download the auto-generated file, correct it, and re-upload it as a proper caption track.

What is the difference between captions and a transcript?

Captions are time-synced text shown on top of the video as it plays, so a Deaf or hard-of-hearing viewer follows along in real time. A transcript is a single text document containing all the dialogue plus descriptions of important visuals and sounds, read separately from the player. Captions satisfy 1.2.2; a full transcript helps satisfy 1.2.3 for prerecorded video and is the complete solution for audio-only content like podcasts.

Does a silent background video in my hero section need captions?

If it has no audio and conveys no information beyond decoration, it is exempt from the captions and audio-description criteria. It still falls under 2.2.2: motion that starts automatically and lasts more than five seconds must be pausable, stoppable, or hideable. If the clip carries meaning, for example a silent product demo with on-screen text, it counts as video-only content under 1.2.1, so provide an equivalent such as a text description next to the player or an audio track that conveys the same information.

When do we need audio description?

You need audio description when the video shows information the soundtrack alone does not convey, such as on-screen text, a chart, or a silent action. Level AA, the EAA and EN 301 549 baseline, requires audio description for prerecorded video under 1.2.5. Many teams avoid a separate description track by writing scripts where the narrator speaks any on-screen text aloud, which makes the original video accessible without extra production.
