In recent years, vocoders for deep learning audio applications have advanced substantially. WaveGAN and MelGAN emerged as promising approaches that use generative adversarial networks (GANs) to produce high-fidelity audio. Parallel WaveGAN and HiFi-GAN then improved efficiency, offering faster inference while maintaining high audio quality.
I feel like the vocoder stage is where a lot of artifacts get introduced (at least for TTS), and a while ago the best methods were slow and autoregressive.