Voice has the potential to be awesome. This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant. It doesn't have to be this way! I have a local demo using Llama 2 that responds in about half a second and it feels like talking to an actual person instead of like Siri or something.
I really should package it up so people can try it. The one problem that makes it a little unnatural is that determining when the user is done talking is tough. What's needed is a speech conversation turn-taking dataset and model; that's missing from off the shelf speech recognition systems. But it should be trivial for a company like OpenAI to build. That's what I'd work on right now if I was there, because truly natural voice conversations are going to unlock a whole new set of users and use cases for these models.
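To make the gap concrete, here's a minimal sketch (not my actual code) of the silence-threshold endpointing that off-the-shelf systems fall back on today; the frame size and thresholds are arbitrary assumptions, and this is exactly the heuristic a learned turn-taking model would replace:

```python
# A minimal sketch, assuming frame-level speech probabilities from any off-the-shelf
# VAD (frame size and thresholds are arbitrary). This is the heuristic a learned
# turn-taking model would replace.
def user_is_done(vad_probs, frame_ms=30, silence_ms=700, threshold=0.5):
    """Return True once the tail of the stream contains `silence_ms` of non-speech."""
    needed_frames = silence_ms // frame_ms
    trailing_silence = 0
    for p in reversed(vad_probs):
        if p < threshold:
            trailing_silence += 1
        else:
            break
    return trailing_silence >= needed_frames
```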
Completely agree, latency is key for unlocking great voice experiences. Here's a quick demo I'm working on for voice ordering https://youtu.be/WfvLIEHwiyo
Total end-to-end latency is a few hundred milliseconds: starting from speech-to-text, to the LLM, then to a POS to validate the SKU (no hallucinations are possible!), and finally back to generated speech. The latency is starting to feel really natural. Building out a general system to achieve this low latency will, I think, end up being a big unlock for enabling diverse applications.
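For illustration, here's roughly the shape of that loop; the asr/llm/tts callables and the tiny menu are placeholders, not our actual stack:

```python
# Hypothetical sketch of the loop above; the asr/llm/tts callables and the tiny
# menu are placeholders, not the actual stack.
MENU = {"LATTE-12OZ": "12 oz latte", "DRIP-16OZ": "16 oz drip coffee"}  # from the POS

def handle_utterance(audio, asr, llm, tts):
    text = asr(audio)                 # speech -> text
    order = llm(text)                 # text -> structured order, e.g. {"sku": "LATTE-12OZ", "qty": 1}
    if order.get("sku") not in MENU:  # validate against the POS catalog: unknown SKUs are rejected,
        return tts("Sorry, that's not on our menu. Could you pick something else?")
    return tts(f"Got it, {order['qty']} {MENU[order['sku']]}. Anything else?")
```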
Since this is getting a bit of interest, here's one more demo of this https://youtu.be/cvKUa5JpRp4 This demo shows even lower latency, plus the ability to handle very large menus with lots of complicated sub-options (this restaurant has over a billion option combinations to order a coffee). The latency is negative in some places, meaning the system finishes predicting before I finish speaking.
We've built something similar that allows you to tweak/update notes & reminders: https://qwerki.com/ (private beta). Here's the video demo: https://www.youtube.com/shorts/2hpBTxjplIE. We've since moved to training our own Llama, as it's more responsive and we get better reliability.
This is pretty good. Do you think locally run models will be able to match cloud-based ones on performance (getting the task done successfully)? I'm assuming that for a drive-through scenario it should be OK, but more complex systems might need external information.
Definitely depends on the application, agreed. The more open-ended the application, the more dependent it is on larger LLMs (and other systems) that don't easily fit on the edge. At the same time, progress is happening that keeps increasing the size of LLM that can be run on the edge. I imagine we end up in a hybrid world for many applications, where local models take a first pass (and also handle speech transcription) and only small requests are made to big cloud-based models as needed.
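As a rough sketch of what that hybrid split could look like (an assumption on my part, not a description of any shipped system): the local model answers when it's confident and escalates otherwise:

```python
# A rough sketch of the hybrid split (an assumption, not a shipped system): the
# local model answers when it's confident and escalates otherwise.
def respond(transcript, local_llm, cloud_llm, confidence_threshold=0.8):
    draft, confidence = local_llm(transcript)  # small on-device model returns (text, score)
    if confidence >= confidence_threshold:
        return draft                           # fast path: the request never leaves the device
    return cloud_llm(transcript)               # slow path: only for the hard requests
```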
Lots of work around speculative decoding, optimizing across the ASR->LLM->TTS interfaces, fine-tuning smaller models while maintaining accuracy (lots of investment here), good old fashioned engineering around managing requests to the GPU, etc. We're considering commercializing this so I can't open source just yet, but if we end up not selling it I'll definitely think about opening it up.
We're designing the stack to be fairly flexible. It's Python/PyTorch under the hood, with the ability to plug and play various off-the-shelf models. For ASR we support GCP/AssemblyAI/etc., as well as a customized self-hosted version of Whisper that is tailored for stream processing. For the LLM we support fine-tuned GPT-3 models, fine-tuned Google text-bison models, or locally hosted fine-tuned Llama models (and a lot of the project goes into how to do the fine-tuning to ensure accuracy and low latency). For the TTS we support ElevenLabs/GCP/etc., and they all tie into the latency-reducing approaches.
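Structurally, "plug and play" amounts to something like the following; the class and method names are illustrative rather than our actual API:

```python
# A guess at the structure, not the project's actual API: each stage sits behind a
# small interface so providers can be swapped without touching the pipeline.
from abc import ABC, abstractmethod

class ASR(ABC):
    @abstractmethod
    def transcribe(self, audio_chunk: bytes) -> str: ...

class LLM(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class TTS(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Wires any ASR/LLM/TTS implementation together behind one interface."""
    def __init__(self, asr: ASR, llm: LLM, tts: TTS):
        self.asr, self.llm, self.tts = asr, llm, tts

    def turn(self, audio_chunk: bytes) -> bytes:
        return self.tts.synthesize(self.llm.complete(self.asr.transcribe(audio_chunk)))
```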
Neat! I appreciate your approach to preventing hallucinations. I've used something similar in a different context. People make a big deal about hallucinations but I've found that validation is one of the easier aspects of AI architecture.
Good question. Off the shelf TTS systems tend to enunciate every phoneme more like a radio talk show host rather than a regular person, which I find a bit off putting. I've been playing around with trying to get the voice to be more colloquial/casual. But I haven't gotten it to really sound natural yet.
100% Python but with a good deal of multiprocessing, speculative decoding, etc. As we move to production we can probably shave another 100ms off by moving over to a compiled system, but Python is great for rapid iteration.
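The rough shape is something like this (a simplified sketch with stand-in stages, not the real code): each stage runs in its own process and streams results forward through queues, so ASR, LLM, and TTS overlap instead of running back to back:

```python
# Simplified sketch (stand-in stages, not the real code): each stage runs in its
# own process and streams results forward through queues so the stages overlap.
from multiprocessing import Process, Queue

def asr_stage(audio_q: Queue, text_q: Queue):
    while (chunk := audio_q.get()) is not None:
        text_q.put(f"transcript of {len(chunk)} bytes")  # stand-in for streaming ASR
    text_q.put(None)

def llm_stage(text_q: Queue, reply_q: Queue):
    while (text := text_q.get()) is not None:
        reply_q.put(f"reply to: {text}")                 # stand-in for LLM generation
    reply_q.put(None)

def tts_stage(reply_q: Queue):
    while (reply := reply_q.get()) is not None:
        print("speaking:", reply)                        # stand-in for audio playback

if __name__ == "__main__":
    audio_q, text_q, reply_q = Queue(), Queue(), Queue()
    stages = [Process(target=asr_stage, args=(audio_q, text_q)),
              Process(target=llm_stage, args=(text_q, reply_q)),
              Process(target=tts_stage, args=(reply_q,))]
    for p in stages: p.start()
    audio_q.put(b"\x00" * 3200)   # a fake 100 ms audio chunk
    audio_q.put(None)             # end of stream
    for p in stages: p.join()
```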
Can I ask what your background is, and what things you're used to working with? I don't have the chops to build what you built, but I'd love to get there.
My advice is always to jump in and start building! My background is math originally, so I had some of the tools in my tool box, but I'm mostly self-taught in computer science and machine learning. I read textbooks, research papers, code repos, but most importantly I build a lot of stuff. Once I'm excited about an idea I'll figure out how to become an expert to make it a reality. Over the years the skills start to compound, so it also helps that I'm an old man!
That demo is pretty slick. What happens when you go totally off book? Like, ask it to recite the digits of pi? Or if you become abusive? Will it call the cops?
It's trained to ignore everything else. That way background conversations are ignored as well (like your kids talking in the back of the car while you order).
> This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant.
Yep - it needs to be ready as soon as I'm done talking and I need to be able to interrupt it. If those things can be done then it can also start tentatively talking if I pause and immediately stop if I continue.
I don't want to have to think about how to structure the interaction in terms of explicit call/response chain, nor do I want to have to be super careful to always be talking until I've finished my thought to prevent it from doing its thing at the wrong time.
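Something like this tiny state machine captures the behaviour I mean (the states and timings are made up for illustration):

```python
# A tiny hypothetical state machine for "start tentatively, stop if they continue";
# the states and timings are made up for illustration.
def next_state(state, user_speaking, pause_ms, tentative_after_ms=400, commit_after_ms=900):
    """state is one of 'listening', 'tentative', 'speaking'."""
    if user_speaking:
        return "listening"                          # always yield to the user immediately
    if state == "listening" and pause_ms >= tentative_after_ms:
        return "tentative"                          # begin a response, ready to abandon it
    if state == "tentative" and pause_ms >= commit_after_ms:
        return "speaking"                           # the pause held, commit to the response
    return state
```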
The interruption is an important point yeah. It's so annoying when Siri misunderstands again and starts rattling off a whole host of options. And keeps getting stuck in a loop if you don't respond.
In fact I'm really surprised these assistants are still as crap as they are. Totally scripted, zero AI. It seems like low-hanging fruit to implement an LLM, but none of the big three have done so. Not even sure about the fringe ones like Cortana and Bixby.
Yeah when I was developing it, it quickly became apparent that I needed to be able to interrupt it. So I implemented that. Pretty easy to implement actually. Much harder would be to have the model interrupt the human. But I think it is actually desirable for natural conversation, so I do think a turn-taking model should be able to signal the LLM to interrupt the human.
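The user-interrupts-bot direction really is roughly this simple, assuming playback is chunked and a VAD flag is available while the bot is speaking (a sketch, not my actual implementation):

```python
# A sketch of the easy direction (user interrupts bot), assuming chunked playback
# and a VAD flag available while speaking; not the actual implementation.
def speak_interruptibly(tts_chunks, play_chunk, user_is_speaking):
    """Play TTS audio chunk by chunk, aborting as soon as the user starts talking.

    tts_chunks:       iterable of audio chunks from the TTS engine
    play_chunk:       callable that plays one chunk
    user_is_speaking: callable returning True when the VAD detects speech
    """
    for chunk in tts_chunks:
        if user_is_speaking():   # barge-in: stop mid-sentence
            return False         # caller should also truncate the LLM's context here
        play_chunk(chunk)
    return True
```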
> determining when the user is done talking is tough.
Sometimes that task is tough for the speaker too, not just the listener. Courteous interruptions or the lack thereof might be a shibboleth for determining when we are speaking to an AI.
Yes interruptions are key, both ways. Having the user interrupt the bot is easy, but to have the bot interrupt the human will again require a model to predict when that should happen. But I do believe it is desirable for natural conversation.
Can you share a GitHub link to this? Where are you reducing the latency? Are you processing the raw audio to text? In my experience ChatGPT generation time is much faster than local Llama unless you're using something potato like a 7B model.
Unfortunately it has a really high "works on my machine" factor. I'm using Llama2-chat-13B via mlc-llm + whisper-streaming + coqui TTS. I just have a bunch of hardcoded paths and these projects tend to be a real pain to set up, so figuring out a nice way to package it up with its dependencies in a portable way is the hard part.
I'm mostly using llama2 because I wanted it to work entirely offline, not because it's necessarily faster, although it is quite fast with mlc-llm. Calling out to GPT-4 is something I'd like to add. I think the right thing is actually to have the local model generate the first few words (even filler words sometimes maybe) and then switch to the GPT-4 answer whenever it comes back.
I wonder when computers will start taking our intonation into account too. That would really help with understanding the end of a phrase. And there’s SO MUCH information in intonation that doesn’t exist in pure text. Any AI that doesn’t understand that part of language will always still be kinda dumb, however clever they are.
You're right. Ultimately the only way this will really work is as an end-to-end model. Text will only get you so far. We could approximate it now with screenplay-like emotion annotations on text, which LLMs should both easily understand and be able to produce themselves (though you'd have to train a new speech recognition system to produce them). But end-to-end will be required eventually to reach human level fluency.
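As a toy illustration of the annotation idea (the bracket-tag format here is made up): the LLM emits inline tags and the TTS layer strips them out and maps them to whatever style controls the voice supports:

```python
# A toy illustration of the annotation idea; the bracket-tag format is made up.
import re

TAG = re.compile(r"^\[(\w+)\]\s*")

def split_annotation(line: str):
    """Split '[hesitant] I... think so?' into ('hesitant', 'I... think so?')."""
    m = TAG.match(line)
    emotion = m.group(1) if m else "neutral"
    return emotion, TAG.sub("", line, count=1)

print(split_annotation("[hesitant] I... think that's everything?"))
```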
Don’t they do it already? There are a lot of languages where intonation is absolutely necessary to distinguish between some words, so I would be surprised that this not already taken into account by the major voice assistants.
In English, intonation changes the meaning of the word but not the word itself. From what I understand, in tonal languages tone changes the whole word. I don't think ML understands that difference yet.
Yeah they do. I was able to get ChatGPT-4 to transcribe 我哥哥高過他的哥哥, which says that they can. I did have to set the app to Chinese, and the original didn't work so I had to modify what I said slightly.
Of course speech recognition works for Chinese. What it doesn't do is transcribe intonation and prosody in non-tonal languages. It's not even clear how one would transcribe such a thing as I'm not aware of a standard notation.
Maybe? I thought IPA was just phonetic but I see that it does have some optional prosody stuff that could in theory cover some of it. I'm not sure how standard or complete it really is in practice.
I haven't heard of any large datasets of IPA transcripts of speech with the detail necessary to train a fully realistic STT->LLM->TTS system. If you know of some that would be interesting to look at.
Also curious to hear about your setup. Using whisper too? When I was experimenting with it there was still a lot of annoyance about hallucinations and I was hard coding some "if last phrase is 'thanks for watching', ignore last phrase"
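The heuristic was basically this (a cleaned-up illustration; the phrase list beyond the one I actually used is made up): drop trailing phrases Whisper tends to hallucinate when it's fed silence:

```python
# Cleaned-up illustration of the heuristic; the phrase list beyond
# "thanks for watching" is made up.
HALLUCINATED_TAILS = {"thanks for watching", "thank you for watching",
                      "don't forget to subscribe"}

def strip_hallucinated_tail(transcript: str) -> str:
    phrases = [p.strip() for p in transcript.strip().split(".") if p.strip()]
    if phrases and phrases[-1].lower().rstrip("!") in HALLUCINATED_TAILS:
        phrases = phrases[:-1]
    return ". ".join(phrases)
```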
I was just googling a bit to see what's out there now for whisper/llama combos and came across this: https://github.com/yacineMTB/talk
There's a demo linked on the github page that seems relatively fast at responding conversationally, but still maybe 1-2 seconds at times. Impressive it's entirely offline.
Lol yeah the hallucinations are a huge problem. Likely solvable; I think there are probably some bugs in various whisper implementations that are making the problem worse than it should be. I haven't really dug in on that yet though. I was hoping I could switch to a different STT model more designed for real time, like Meta's SeamlessM4T, but it's still under a non-commercial license and I did have an idea that I might want to try making a product sometime. I did see that yacine made that version but I haven't tried it, so I don't know how it compares to mine.
Is there any extra work OpenAI’s product might be doing contributing to this latency that yours isn’t? Considering the scale they operate at and any reputational risks to their brand?
If you're suggesting that OpenAI's morality filters are responsible for a significant part of their voice response latency, then no. I think that's unlikely to be a relevant factor.
To me this is the cleanest and most efficient solution to the problem.
Tbh, ever since voice assistants landed I’ve wanted a handheld mic with a hardware button. No wake command, no (extra) surveillance, just snappy low-latency responses.
I'm using Llama2-chat-13B via mlc-llm @ 4bit quantization + whisper-streaming + coqui TTS, all running simultaneously on one 4090 in real time.
It didn't take long to prototype. Polishing and shipping it to non-expert users would take much longer than I've spent on it so far. I'd have to test for and solve a ton of installation problems, find better workarounds for whisper-streaming's hallucination issues, improve the heuristics for controlling when to start and stop talking, tweak the prompts to improve the suitability of the LLM responses for speech, fixup the LLM context when the LLM's speech is interrupted, probably port the whole thing to Windows for broader reach in the installed base of 4090s, possibly introduce a low-memory mode that can support 12GB GPUs that are much more common, document the requirements and installation process, and figure out hosting for the ginormous download it would be. I'd estimate at least 10x the effort I've spent so far on the prototype before I'd really be satisfied with the result.
I'd honestly love to do all that work. I've been prioritizing other projects because I judged that it was so obvious as a next step that someone else was probably working on the same thing with a lot more resources and would release before I could finish as a solo dev. But maybe I'm wrong...
> It didn't take long to prototype. Polishing and shipping it to non-expert users would take much longer than I've spent on it so far. I'd have to test for and solve a ton of installation problems
I've found some success at this by using Nix... but Nix is a whole 'nother ball of yarn to learn. It WILL get you to declarative/deterministic installs of any piece of the toolchain it covers, though, and it does a hell of a lot better job managing dependencies than anything in Python's ecosystem ever will (in fact, I am pretty sure that Python's being terrible at this is actually driving Nix adoption)
As an example of the power Nix might enable, check out https://nixified.ai/ (which is a project that hasn't been updated in some months and I hope is not dead... It does have some forks on Github, though). Assuming you already have Nix installed, you can get an entire ML toolchain up including a web frontend with a single command. I have dozens of projects on my work laptop, all with their own flake.nix files, all using their own versions of dependencies (which automatically get put on the PATH thanks to direnv), nothing collides with anything else, everything is independently updateable. I'm actually the director of engineering at a small startup and having our team's dev environments all controlled via Nix has been a godsend already (as in, a massive timesaver).
I do think that you could walk a live demo of this into, say, McDonald's corporate, and walk out with a very large check and a contract to hire a team to look into building it out into a product, though. (If you're going to look at chains, I'd suggest Wawa first though, as they seem to embrace new ordering tech earlier than other chains.)
Nix sounds good for duplicating my setup on other machines I control. But I'd like a way to install it on user machines, users who probably don't want to install Nix just for my thing. Nix probably doesn't have a way to make self contained packages, right?
> But I'd like a way to install it on user machines, users who probably don't want to install Nix just for my thing. Nix probably doesn't have a way to make self contained packages, right?
I mean... that's the heart of the problem right there. You can either have all statically compiled binaries (which don't need Nix to run), which have no outside dependencies but result in a ton of wasted disk space with duplicate dependency data everywhere, or you can share dependencies via some scheme, and the only scheme that makes real sense (because it creates real isolation between projects but also lets you share identical dependencies with zero conflicts) is Nix's; all of the others have flaws and nondeterminism.
I wish docker could be used more easily with graphic cards and other hardware peripherals (speakers/mic in this case). It would solve a lot of these issues.
Actually I do think this is a good idea. For best latency there should be multiple LLMs involved, a fast one to generate the first few words and then GPT-4 or similar for the rest of the response. In the case that the fast model is unsure, it could absolutely generate filler words while it waits for the big model to return the actual answer. I guess that's pretty much how humans use filler words too!
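A sketch of that handoff, assuming both models are exposed as simple callables (the names and timeout are made up):

```python
# Sketch of the two-model handoff described above (an assumption, not a spec):
# start both models, speak the fast model's opening words (or a filler) right
# away, then continue with the big model's answer once it arrives.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def respond_with_handoff(prompt, fast_llm, big_llm, speak, timeout_s=0.3):
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast_future = pool.submit(fast_llm, prompt)   # e.g. local Llama: a few opening words
        big_future = pool.submit(big_llm, prompt)     # e.g. GPT-4: the full answer
        try:
            opener = fast_future.result(timeout=timeout_s)
        except FutureTimeout:
            opener = "Hmm, let me think..."           # filler while both models are slow
        speak(opener)                                 # start talking immediately
        speak(big_future.result())                    # finish with the big model's answer
```

In practice you'd want to splice the two outputs so the big model's answer continues from the opener rather than repeating it, but that's the gist.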