FiXato · @FiXato
393 followers · 9413 posts · Server toot.cat

Do you have a YouTube ¹ video link, but you would rather just read the subtitles / closed captions for it?

Then with a couple of command-line tools such as youtube-dl, grep, sed and tail, you can extract and re-format these into somewhat legible text:

youtube-dl --sub-lang en --sub-format vtt --skip-download --write-sub --output "tmp-mysubs" YOUTUBE_VIDEO_URL_HERE && grep -Pv '^(([0-9]{2}:){2}[0-9]{2}.[0-9]{3}( --> ([0-9]{2}:){2}[0-9]{2}.[0-9]{3})?|)$' tmp-mysubs*.vtt | sed 's/&nbsp;/ /g' | sed -z 's/[, ]\n/ /g' | sed --regexp-extended -z 's/([^\. ])\n([a-z])/\1 \2/g' | tail -n +4

See my earlier toot containing the captions for Tom Scott's video about Torpenhow Hill for an example of the output.

The youtube-dl command will extract the captions in Web Video Text Tracks (VTT) format and store them in a file prefixed with 'tmp-mysubs' ² (likely 'tmp-mysubs.en.vtt').

The grep command will remove the timestamps and empty lines.
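
To see what the grep step does on its own, here is a small sketch run against a made-up caption file. The sample text is hypothetical and only resembles what a downloaded .vtt file might contain; the pattern is the one from the one-liner, and it needs GNU grep for the -P option.

```shell
# Hypothetical sample resembling a downloaded .vtt file (not real captions).
sample_vtt='WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.500
Hello and welcome

00:00:02.500 --> 00:00:05.000
to this video.'

# Drop lines that consist only of a timestamp (or a timestamp range),
# as well as empty lines. -v inverts the match, -P enables Perl regexes.
filtered=$(printf '%s\n' "$sample_vtt" \
  | grep -Pv '^(([0-9]{2}:){2}[0-9]{2}.[0-9]{3}( --> ([0-9]{2}:){2}[0-9]{2}.[0-9]{3})?|)$')

printf '%s\n' "$filtered"
```

On this sample, the three header lines and the two caption lines remain; the header is only removed later by the tail step.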

The sed commands will try their best to make the text more legible by replacing non-breaking-space HTML entities (&nbsp;) with regular spaces and merging lines that belong to the same sentence.
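
The three sed steps can be tried in isolation on a few made-up lines. This is only a sketch: the input is hypothetical, it assumes the first expression targets the &nbsp; entity as described above, and it relies on GNU sed for the -z and --regexp-extended options.

```shell
# Hypothetical caption lines as they might look after the grep step.
lines='Hello&nbsp;and welcome,
to this video. Today
we look at subtitles.
Next sentence.'

merged=$(printf '%s\n' "$lines" \
  | sed 's/&nbsp;/ /g' \
  | sed -z 's/[, ]\n/ /g' \
  | sed --regexp-extended -z 's/([^\. ])\n([a-z])/\1 \2/g')
# Step 1 turns each &nbsp; entity into a plain space.
# Step 2 joins a line ending in a comma or space with the next line;
#        the comma or space itself is replaced by the single joining space.
# Step 3 joins a line with the next one when it does not end in a period
#        or space and the next line starts with a lowercase letter.

printf '%s\n' "$merged"
```

With this input the first three lines end up on one line, while "Next sentence." stays on its own line because the line before it ends in a period. Note that the trailing comma after "welcome" is lost in the second step.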

Finally, the tail command will strip the WebVTT header info such as language and format. Note that 'tail -n +4' starts its output at the fourth line, so it drops the first three lines; I'm not 100% certain the header will always be exactly that long though, so you might want to leave it out.
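
The sketch below shows the tail step on a made-up example, followed by one possible alternative that does not depend on the header length. The alternative is not part of the original one-liner; it assumes that the first line containing '-->' is the first cue timing line, and the sample text is hypothetical.

```shell
# Hypothetical output of the grep step: three header lines, then caption text.
filtered='WEBVTT
Kind: captions
Language: en
Hello and welcome
to this video.'

# tail -n +4 starts printing at line 4, so the first three lines are dropped.
body=$(printf '%s\n' "$filtered" | tail -n +4)
printf '%s\n' "$body"

# Alternative sketch (an assumption, not the original method): delete
# everything up to and including the first timing line before filtering,
# so the number of header lines no longer matters.
raw_vtt='WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:02.500
Hello and welcome

00:00:02.500 --> 00:00:05.000
to this video.'

body_alt=$(printf '%s\n' "$raw_vtt" \
  | sed '1,/-->/d' \
  | grep -Pv '^(([0-9]{2}:){2}[0-9]{2}.[0-9]{3}( --> ([0-9]{2}:){2}[0-9]{2}.[0-9]{3})?|)$')
printf '%s\n' "$body_alt"
```

Both variants leave only the two caption lines for this sample. If the header of a real file were longer or shorter than three lines, only the second variant would still cut at the right place.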

¹: While I've written it with YouTube videos in mind, it will likely also work on other media sites supported by youtube-dl.

²: You can replace the prefix 'tmp-mysubs' with any prefix you see fit; just be sure to replace it in both the youtube-dl and grep commands.

Related hashtags:

#youtube #youtubedl #subtitles #subs #closedcaptions #captions #streamingmedia #accessibility #a11y #video #grep #sed #tail #cli #commandline #oneliner #vtt #webvtt #WebVideoTextTracks

Last updated 5 years ago