Automated transcription can get you 80-90% of the way now, meaning hand transcription (if needed) only requires a light touch. Transcription has gone from being very expensive to submitting your video to an API and getting your transcript back in a few hours for a few dollars ($0.10 to $1 per minute). Even at a very high rate, that's only $26 for the linked video.
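To make the arithmetic concrete, here is a rough sketch of the cost range. The 26-minute length is an assumption inferred from the $26 figure at $1 per minute, and `transcription_cost` is just an illustrative helper, not any particular service's pricing API.

```python
def transcription_cost(minutes: float, rate_per_minute: float) -> float:
    """Return the cost in dollars, rounded to the nearest cent."""
    return round(minutes * rate_per_minute, 2)


# Assumed video length: 26 minutes (inferred, not stated in the thread).
VIDEO_MINUTES = 26

low = transcription_cost(VIDEO_MINUTES, 0.10)   # cheapest quoted rate
high = transcription_cost(VIDEO_MINUTES, 1.00)  # most expensive quoted rate

print(f"${low:.2f} to ${high:.2f}")  # prints "$2.60 to $26.00"
```

So under that assumption the same video would cost somewhere between a couple of dollars and $26, depending on the rate.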
I don't know, automatic subtitles on YouTube suck for anything that is not average in tone, voice, speed or vocabulary. Not to mention if you don't speak English.
Paying $26 for something similar doesn't feel like good value.
We tried something like this during a French Python conf to get subtitles for deaf people. It was impossible to even understand the result.
Have you checked lately? Automated subtitles on YouTube are getting ridiculously good. They're currently almost as good as I am at recognizing speech, including accented speech and in many cases specialized technical terms.
Recently (as recently as a couple of months ago) they've still been terrible enough to be great entertainment to me. (I don't need them to understand videos, but I love to turn them on and laugh at the nonsense they generate.)
Are you sure you're thinking of the automated subtitles? Some YouTube videos have manually written captions too.
In my experience, they're only good when I don't need them. If the audio is clear, it's astonishingly accurate (except for the lack of punctuation). But for heavily accented speech, or if there's any background noise, or there are multiple speakers, it completely fails, and it's better to go find a pair of headphones and turn up the volume.
Also, I hate how the auto-generated CC spits out one word at a time while scrolling the lines up, and how it keeps the previous caption on screen mixed in with the next one during long moments of non-speech (if the speaker isn't in frame, it's hard to know what the context is). That's not how I read—it's not how anyone reads. I would rather it showed a single block at a time, with the current word highlighted.
And what's with the random "yeah yeah yeah", "[applause]", or "[music]" captions when it's just guessing? Better to put question marks than be completely wrong.
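For what it's worth, the display I have in mind could be sketched roughly like this: show one whole block, mark the current word, and print a placeholder instead of a low-confidence guess. Everything here is hypothetical (the `Word` type, the `confidence` score, the 0.5 threshold, the `*word*` highlighting); it is not how YouTube's captions actually work.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical recognizer score, 0.0 to 1.0


def render_block(words: list[Word], current: int, min_confidence: float = 0.5) -> str:
    """Render one caption block, highlighting the current word.

    Words below min_confidence are shown as "[?]" rather than as a guess.
    """
    parts = []
    for i, word in enumerate(words):
        shown = word.text if word.confidence >= min_confidence else "[?]"
        parts.append(f"*{shown}*" if i == current else shown)
    return " ".join(parts)


block = [Word("turn", 0.9), Word("up", 0.95), Word("yeah", 0.2), Word("volume", 0.8)]
print(render_block(block, current=1))  # prints "turn *up* [?] volume"
```

The whole block stays on screen while the highlight moves, so you can read ahead, and the uncertain word is flagged instead of being silently wrong.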
lack of punctuation and capitalisation also makes informal speech quite difficult to understand sentences just run into another makes it hard to know what's referring to what gets split into unnatural sentences... you get the point.
However, automated CC is good when you're trying to find a particular part of a long video, and you don't want to watch the whole thing just to find it. It's easy to skim the captions to find a relevant part.
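If you can get the captions out as text, the skimming can even be automated. A minimal sketch, assuming the captions are already available as (start time in seconds, text) pairs; that format and the helper names are made up for illustration.

```python
def find_in_captions(captions, query):
    """Return (start_seconds, text) pairs whose text contains query, ignoring case."""
    needle = query.lower()
    return [(start, text) for start, text in captions if needle in text.lower()]


def as_timestamp(seconds: int) -> str:
    """Format a number of seconds as H:MM:SS."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Hypothetical caption data, unpunctuated like auto-generated captions.
captions = [
    (0, "welcome everyone to the talk"),
    (754, "now lets look at speech recognition"),
    (3725, "speech models still struggle with accents"),
]

for start, text in find_in_captions(captions, "speech"):
    print(as_timestamp(start), text)
```

Even with imperfect captions, a keyword hit is usually close enough to jump to the right part of the video.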
There are a few attempts, but I have no idea how good they are. And they're manual transcription only.
There are several open source speech recognition systems (like DeepSpeech and CMUSphinx), but there don't seem to be any user-friendly frontends for them.