This can be trained using only 5 seconds of reference audio: https://google.github.io/tacotron/publications/speaker_adapt... https://arxiv.org/pdf/1806.04558.pdf

It's been mentioned a bit already, but I thought it was worth calling out. This may be one of the lowest-overhead ways to start experimenting, at least in terms of data collection.