One of the popular speech-to-text models is Whisper, which starts with the conventional spectral analysis of the speech signal, and then feeds the data into a Transformer model. It works quite well.
Such approach dates back to 1940s, when people were trained to read the speech from spectrograms. There is a 1947 book "Visible Speech" by Potter, Kopp, and Green describing these experiments. Here is a more slightly recent 1988 review of the subject: "Formalizing Knowledge Used in Spectrogram Reading"
https://openai.com/index/whisper/
Such approach dates back to 1940s, when people were trained to read the speech from spectrograms. There is a 1947 book "Visible Speech" by Potter, Kopp, and Green describing these experiments. Here is a more slightly recent 1988 review of the subject: "Formalizing Knowledge Used in Spectrogram Reading"
https://apps.dtic.mil/sti/tr/pdf/ADA206826.pdf