For a software engineer with no experience in machine learning / AI, what does it mean to build your own language model? Does it require coding? Hundreds of hours of audio data from your own voice? A significant amount of computing power?
With the tools available, you only need to provide a list of normal sentences, and they should include the words you'd like it to know about.
In my case I only wanted to train it on about 30 different sentences, which took less than a second. For a general assistant à la Google Home you'd want a much larger set of sentences, and I hear that can take a while (an hour or a few?).
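For reference, a sketch of what generating such a language model can look like with the KenLM tools, which Mozilla's DeepSpeech used (file names are placeholders, and exact integration with the decoder varies by DeepSpeech version):

```shell
# sentences.txt holds one plain sentence per line, containing the
# words you want the model to know about.

# Build a 3-gram ARPA language model. With a tiny corpus (a few dozen
# sentences) KenLM's default discounting fails, hence --discount_fallback.
lmplz -o 3 --discount_fallback < sentences.txt > lm.arpa

# Convert to KenLM's binary format so the decoder can load it quickly.
build_binary lm.arpa lm.binary
```

On a corpus this small both steps finish in well under a second, which matches my experience above.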
Because it works with probabilities it can still match words outside the sentences you give it, but from my understanding it will be partial to the ones you fed it, so it can recover when DeepSpeech misclassifies a character or two.
> Hundreds of hours of audio data from your own voice?
I should clarify this. As I mentioned, training the neural net part requires tons of audio and the corresponding text (and people should totally contribute[1]; the resulting data sets are released to the public). The neural net in DeepSpeech is then run on an audio stream and outputs a stream of characters.
Turning that stream of characters into sentences is what the language model is for.
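To make that concrete, here's a toy sketch (not DeepSpeech's actual decoder) of how a language model rescues a transcript when the neural net gets a character wrong. The shape of the scoring, acoustic score plus a weighted LM score, mimics what CTC beam-search decoders do; the sentences, acoustic scores, and the `alpha` weight are all made up for the demo:

```python
import math
from collections import Counter

# The "training" sentences you feed the language-model builder.
sentences = [
    "turn on the light",
    "turn off the light",
    "play some music",
]

# Build a unigram model with add-one smoothing (real tools use n-grams).
words = [w for s in sentences for w in s.split()]
counts = Counter(words)
vocab = set(words)

def lm_score(text):
    """Log-probability of a transcript under the toy unigram model."""
    total = sum(counts.values()) + len(vocab) + 1  # +1 for unknown words
    return sum(math.log((counts.get(w, 0) + 1) / total)
               for w in text.split())

# Two hypotheses decoded from the character stream; the acoustic model
# slightly prefers the misspelled one. Scores are invented for the demo.
hypotheses = {
    "turn on the lihgt": -1.0,  # acoustic log-score (made up)
    "turn on the light": -1.2,
}

alpha = 1.0  # LM weight (made up)
best = max(hypotheses, key=lambda h: hypotheses[h] + alpha * lm_score(h))
print(best)  # the LM penalty on the unknown word "lihgt" flips the result
```

The unknown word "lihgt" gets the smoothed floor probability, so the correctly spelled hypothesis wins despite its slightly worse acoustic score, which is the "partial to the sentences you feed it" behavior described above.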
Training the neural net is very data and compute intensive, but fortunately Mozilla provides pre-trained models.
Generating the language model is relatively cheap. And if your target language shares sounds with English, you may get away with using the English-trained neural net but with a non-English language model.