I wanted to see how hard it would be to use a combination of
WebVTT, Intl.Segmenter, and the Custom Highlight API to make an
audio transcript that highlights the current word you're on
without explicitly marking up the words. Look, Ma, no <span>s!
Here’s a recording of me reading the example transcript. Try
playing it. (You may need to reload this page before it’ll work for
inscrutable service worker reasons I should really try to figure out
at some point.)
This demo requires the Custom Highlight API and Intl.Segmenter to
be implemented. If your browser doesn't have those things, maybe
try this demo in a recent STP (Safari Technology Preview)?
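If you want to check from the console first, a test along these
lines will do it (just an illustration; the demo may do its own
checking differently):

const supported =
  typeof Intl !== "undefined" && "Segmenter" in Intl &&  // Intl.Segmenter
  typeof CSS !== "undefined" && "highlights" in CSS &&   // highlight registry
  typeof Highlight === "function";                       // Highlight constructor

if (!supported) {
  console.warn("This demo needs Intl.Segmenter and the Custom Highlight API.");
}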
Example: a transcript of the <audio>
element on this page
this is the title of the article.
hello, world
this is a test of the emergency broadcast
system.
This isn’t a real thing, it’s just a test.
私はオコーナーです。元気ですか? ("I'm O'Connor. How are you?")
How it works
This is implemented in words.js,
which defines two functions:
find_words({within}) uses Intl.Segmenter to find all the words in the
provided DOM node. It returns a
Highlight object containing a StaticRange for each word.
sing_along({track, words}) associates each cue
in the given track
with each word, and highlights each word as the media element’s
current time changes.
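In sketch form, the idea is roughly this (a simplification, not
the code in words.js: among other things, the real find_words has
to cope with words that cross element boundaries, which segmenting
one text node at a time, as below, does not):

// Sketch of find_words({within}): segment each text node into words
// and represent every word-like segment as a StaticRange.
function find_words({within}) {
  const segmenter = new Intl.Segmenter(undefined, {granularity: "word"});
  const ranges = [];
  const walker = document.createTreeWalker(within, NodeFilter.SHOW_TEXT);
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    for (const {segment, index, isWordLike} of segmenter.segment(node.data)) {
      if (!isWordLike) continue; // skip whitespace and punctuation
      ranges.push(new StaticRange({
        startContainer: node, startOffset: index,
        endContainer: node, endOffset: index + segment.length,
      }));
    }
  }
  return new Highlight(...ranges);
}

// Sketch of sing_along({track, words}): assuming one cue per word, in
// document order, move a registered highlight to whichever word's cue
// is currently active. ("current-word" is a made-up highlight name.)
function sing_along({track, words}) {
  const word_ranges = [...words]; // Highlight is set-like, so spreadable
  const current = new Highlight();
  CSS.highlights.set("current-word", current);

  track.addEventListener("cuechange", () => {
    const cues = Array.from(track.cues); // snapshot of the live cue list
    current.clear();
    for (const cue of Array.from(track.activeCues)) {
      const i = cues.indexOf(cue);
      if (i !== -1 && word_ranges[i]) current.add(word_ranges[i]);
    }
  });
}

For anything to actually show up, the page's CSS also needs a
::highlight(current-word) rule (with whatever name the real code
registers its highlight under).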
Here’s how this page calls them:
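Something along these lines (the selector names here are
illustrative, not necessarily the real ones):

const transcript = document.querySelector("#transcript"); // guessed id
const audio = document.querySelector("audio");
const track = audio.textTracks[0];
track.mode = "hidden"; // cuechange doesn't fire on a disabled track

const words = find_words({within: transcript});
sing_along({track, words});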
Observations
The word highlighting handles cases where words span element
boundaries, like "emergency" in the demo.
The word highlighting is locale-aware; notice how both the
Japanese and English text get correctly segmented into words, and
how the apostrophes in English contractions get treated correctly.
Highlighting does the right thing when you scrub the audio. This
worked the first time and felt like magic.
I wrote the WebVTT by hand, so some of the timing is a bit
off. 🙇🏻‍♀️