I wanted to see how hard it would be to use a combination of
WebVTT, Intl.Segmenter, and the Custom Highlight API to make an
audio transcript that highlights the current word you're on
without explicitly marking up the words. Look, Ma, no <span>s!
Here’s a recording of me reading the example transcript. Try
playing it. (You may need to reload this page before it’ll work for
inscrutable service worker reasons I should really try to figure out
at some point.)
This demo requires the Custom Highlight API and Intl.Segmenter to
be implemented. If your browser doesn't have those things, maybe
try this demo in a recent STP (Safari Technology Preview)?
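If you want to check from the console first, a test along these
lines will do it (just an illustration; the demo may do its own
checking differently):

const supported =
  typeof Intl !== "undefined" && "Segmenter" in Intl &&  // Intl.Segmenter
  typeof CSS !== "undefined" && "highlights" in CSS &&   // highlight registry
  typeof Highlight === "function";                       // Highlight constructor

if (!supported) {
  console.warn("This demo needs Intl.Segmenter and the Custom Highlight API.");
}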
Example: a transcript of the <audio>
element on this page
this is the title of the article.
hello, world
this is a test of the emergency broadcast
system.
This isn’t a real thing, it’s just a test.
私はオコーナーです。元気ですか? ("I'm O'Connor. How are you?")
How it works
This is implemented in words.js,
which defines two functions:
find_words({within}) uses Intl.Segmenter to find all the words in the
provided DOM node. It returns a
Highlight object containing a StaticRange for each word.
sing_along({track, words}) associates each cue
in the given track
with each word, and highlights each word as the media element’s
current time changes.
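In sketch form, the idea is roughly this (a simplification, not
the code in words.js: among other things, the real find_words has
to cope with words that cross element boundaries, which segmenting
one text node at a time, as below, does not):

// Sketch of find_words({within}): segment each text node into words
// and represent every word-like segment as a StaticRange.
function find_words({within}) {
  const segmenter = new Intl.Segmenter(undefined, {granularity: "word"});
  const ranges = [];
  const walker = document.createTreeWalker(within, NodeFilter.SHOW_TEXT);
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    for (const {segment, index, isWordLike} of segmenter.segment(node.data)) {
      if (!isWordLike) continue; // skip whitespace and punctuation
      ranges.push(new StaticRange({
        startContainer: node, startOffset: index,
        endContainer: node, endOffset: index + segment.length,
      }));
    }
  }
  return new Highlight(...ranges);
}

// Sketch of sing_along({track, words}): assuming one cue per word, in
// document order, move a registered highlight to whichever word's cue
// is currently active. ("current-word" is a made-up highlight name.)
function sing_along({track, words}) {
  const word_ranges = [...words]; // Highlight is set-like, so spreadable
  const current = new Highlight();
  CSS.highlights.set("current-word", current);

  track.addEventListener("cuechange", () => {
    const cues = Array.from(track.cues); // snapshot of the live cue list
    current.clear();
    for (const cue of Array.from(track.activeCues)) {
      const i = cues.indexOf(cue);
      if (i !== -1 && word_ranges[i]) current.add(word_ranges[i]);
    }
  });
}

For anything to actually show up, the page's CSS also needs a
::highlight(current-word) rule (with whatever name the real code
registers its highlight under).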
Here’s how this page calls them:
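Something along these lines (the selector names here are
illustrative, not necessarily the real ones):

const transcript = document.querySelector("#transcript"); // guessed id
const audio = document.querySelector("audio");
const track = audio.textTracks[0];
track.mode = "hidden"; // cuechange doesn't fire on a disabled track

const words = find_words({within: transcript});
sing_along({track, words});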
Observations
The word highlighting handles cases where words span element
boundaries, like "emergency" in the demo.
The word highlighting is locale-aware; notice how both the
Japanese and English text get correctly segmented into words, and
how the apostrophes in English contractions get treated correctly.
Highlighting does the right thing when you scrub the audio. This
worked the first time and felt like magic.
I wrote the WebVTT by hand, so some of the timing is a bit
off. 🙇🏻‍♀️