Voice has the potential to be awesome. This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant. It doesn't have to be this way! I have a local demo using Llama 2 that responds in about half a second and it feels like talking to an actual person instead of like Siri or something.
I really should package it up so people can try it. The one problem that makes it a little unnatural is that determining when the user is done talking is tough. What's needed is a speech conversation turn-taking dataset and model; that's missing from off the shelf speech recognition systems. But it should be trivial for a company like OpenAI to build. That's what I'd work on right now if I was there, because truly natural voice conversations are going to unlock a whole new set of users and use cases for these models.
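To make the gap concrete, here's a minimal sketch (not my actual code) of the silence-threshold endpointing that off-the-shelf systems fall back on today; the frame size and thresholds are arbitrary assumptions, and this is exactly the heuristic a learned turn-taking model would replace:

```python
# A minimal sketch, assuming frame-level speech probabilities from any off-the-shelf
# VAD (frame size and thresholds are arbitrary). This is the heuristic a learned
# turn-taking model would replace.
def user_is_done(vad_probs, frame_ms=30, silence_ms=700, threshold=0.5):
    """Return True once the tail of the stream contains `silence_ms` of non-speech."""
    needed_frames = silence_ms // frame_ms
    trailing_silence = 0
    for p in reversed(vad_probs):
        if p < threshold:
            trailing_silence += 1
        else:
            break
    return trailing_silence >= needed_frames
```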
Completely agree, latency is key for unlocking great voice experiences. Here's a quick demo I'm working on for voice ordering https://youtu.be/WfvLIEHwiyo
Total end-to-end latency is a few hundred milliseconds: starting from speech-to-text, to the LLM, then to a POS to validate the SKU (no hallucinations are possible!), and finally back to generated speech. The latency is starting to feel really natural. Building out a general system to achieve this low latency will, I think, end up being a big unlock for enabling diverse applications.
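For illustration, here's roughly the shape of that loop; the asr/llm/tts callables and the tiny menu are placeholders, not our actual stack:

```python
# Hypothetical sketch of the loop above; the asr/llm/tts callables and the tiny
# menu are placeholders, not the actual stack.
MENU = {"LATTE-12OZ": "12 oz latte", "DRIP-16OZ": "16 oz drip coffee"}  # from the POS

def handle_utterance(audio, asr, llm, tts):
    text = asr(audio)                 # speech -> text
    order = llm(text)                 # text -> structured order, e.g. {"sku": "LATTE-12OZ", "qty": 1}
    if order.get("sku") not in MENU:  # validate against the POS catalog: unknown SKUs are rejected,
        return tts("Sorry, that's not on our menu. Could you pick something else?")
    return tts(f"Got it, {order['qty']} {MENU[order['sku']]}. Anything else?")
```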
Since this is getting a bit of interest, here's one more demo of this https://youtu.be/cvKUa5JpRp4 This demo shows even lower latency, plus the ability to handle very large menus with lots of complicated sub-options (this restaurant has over a billion option combinations to order a coffee). The latency is negative in some places, meaning the system finishes predicting before I finish speaking.
We've built something similar that allows you to tweak/update notes & reminders: https://qwerki.com/ (private beta). Here's the video demo: https://www.youtube.com/shorts/2hpBTxjplIE. We've since moved to training our own Llama, as it's more responsive and we get better reliability.
This is pretty good. Do you think locally run models will be able to match cloud-based ones on performance (getting the task done successfully)? I'm assuming that for a drive-through scenario it should be OK, but more complex systems might need external information.
Definitely depends on the application, agreed. The more open-ended the application, the more dependent it is on larger LLMs (and other systems) that don't easily fit on the edge. At the same time, progress is happening that keeps increasing the size of LLM that can be run on the edge. I imagine we end up in a hybrid world for many applications, where local models take a first pass (and also handle speech transcription) and only small requests are made to big cloud-based models as needed.
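As a rough sketch of what that hybrid split could look like (an assumption on my part, not a description of any shipped system): the local model answers when it's confident and escalates otherwise:

```python
# A rough sketch of the hybrid split (an assumption, not a shipped system): the
# local model answers when it's confident and escalates otherwise.
def respond(transcript, local_llm, cloud_llm, confidence_threshold=0.8):
    draft, confidence = local_llm(transcript)  # small on-device model returns (text, score)
    if confidence >= confidence_threshold:
        return draft                           # fast path: the request never leaves the device
    return cloud_llm(transcript)               # slow path: only for the hard requests
```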
Lots of work around speculative decoding, optimizing across the ASR->LLM->TTS interfaces, fine-tuning smaller models while maintaining accuracy (lots of investment here), good old fashioned engineering around managing requests to the GPU, etc. We're considering commercializing this so I can't open source just yet, but if we end up not selling it I'll definitely think about opening it up.
We're designing the stack to be fairly flexible. It's Python/PyTorch under the hood, with the ability to plug and play various off-the-shelf models. For ASR we support GCP/AssemblyAI/etc., as well as a customized self-hosted version of Whisper that is tailored for stream processing. For the LLM we support fine-tuned GPT-3 models, fine-tuned Google text-bison models, or locally hosted fine-tuned Llama models (and a lot of the project goes into how to do the fine-tuning to ensure accuracy and low latency). For the TTS we support ElevenLabs/GCP/etc., and they all tie into the latency-reducing approaches.
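Structurally, "plug and play" amounts to something like the following; the class and method names are illustrative rather than our actual API:

```python
# A guess at the structure, not the project's actual API: each stage sits behind a
# small interface so providers can be swapped without touching the pipeline.
from abc import ABC, abstractmethod

class ASR(ABC):
    @abstractmethod
    def transcribe(self, audio_chunk: bytes) -> str: ...

class LLM(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class TTS(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Wires any ASR/LLM/TTS implementation together behind one interface."""
    def __init__(self, asr: ASR, llm: LLM, tts: TTS):
        self.asr, self.llm, self.tts = asr, llm, tts

    def turn(self, audio_chunk: bytes) -> bytes:
        return self.tts.synthesize(self.llm.complete(self.asr.transcribe(audio_chunk)))
```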
Neat! I appreciate your approach to preventing hallucinations. I've used something similar in a different context. People make a big deal about hallucinations but I've found that validation is one of the easier aspects of AI architecture.
Good question. Off the shelf TTS systems tend to enunciate every phoneme more like a radio talk show host rather than a regular person, which I find a bit off putting. I've been playing around with trying to get the voice to be more colloquial/casual. But I haven't gotten it to really sound natural yet.
100% Python but with a good deal of multiprocessing, speculative decoding, etc. As we move to production we can probably shave another 100ms off by moving over to a compiled system, but Python is great for rapid iteration.
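The rough shape is something like this (a simplified sketch with stand-in stages, not the real code): each stage runs in its own process and streams results forward through queues, so ASR, LLM, and TTS overlap instead of running back to back:

```python
# Simplified sketch (stand-in stages, not the real code): each stage runs in its
# own process and streams results forward through queues so the stages overlap.
from multiprocessing import Process, Queue

def asr_stage(audio_q: Queue, text_q: Queue):
    while (chunk := audio_q.get()) is not None:
        text_q.put(f"transcript of {len(chunk)} bytes")  # stand-in for streaming ASR
    text_q.put(None)

def llm_stage(text_q: Queue, reply_q: Queue):
    while (text := text_q.get()) is not None:
        reply_q.put(f"reply to: {text}")                 # stand-in for LLM generation
    reply_q.put(None)

def tts_stage(reply_q: Queue):
    while (reply := reply_q.get()) is not None:
        print("speaking:", reply)                        # stand-in for audio playback

if __name__ == "__main__":
    audio_q, text_q, reply_q = Queue(), Queue(), Queue()
    stages = [Process(target=asr_stage, args=(audio_q, text_q)),
              Process(target=llm_stage, args=(text_q, reply_q)),
              Process(target=tts_stage, args=(reply_q,))]
    for p in stages: p.start()
    audio_q.put(b"\x00" * 3200)   # a fake 100 ms audio chunk
    audio_q.put(None)             # end of stream
    for p in stages: p.join()
```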
Can I ask what your background is, and what things you're used to working with? I don't have the chops to build what you built, but I'd love to get there.
My advice is always to jump in and start building! My background is math originally, so I had some of the tools in my tool box, but I'm mostly self-taught in computer science and machine learning. I read textbooks, research papers, code repos, but most importantly I build a lot of stuff. Once I'm excited about an idea I'll figure out how to become an expert to make it a reality. Over the years the skills start to compound, so it also helps that I'm an old man!
That demo is pretty slick. What happens when you go totally off book? Like, ask it to recite the digits of pi? Or if you become abusive? Will it call the cops?
It's trained to ignore everything else. That way background conversations are ignored as well (like your kids talking in the back of the car while you order).
> This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant.
Yep - it needs to be ready as soon as I'm done talking and I need to be able to interrupt it. If those things can be done then it can also start tentatively talking if I pause and immediately stop if I continue.
I don't want to have to think about how to structure the interaction in terms of explicit call/response chain, nor do I want to have to be super careful to always be talking until I've finished my thought to prevent it from doing its thing at the wrong time.
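Something like this tiny state machine captures the behaviour I mean (the states and timings are made up for illustration):

```python
# A tiny hypothetical state machine for "start tentatively, stop if they continue";
# the states and timings are made up for illustration.
def next_state(state, user_speaking, pause_ms, tentative_after_ms=400, commit_after_ms=900):
    """state is one of 'listening', 'tentative', 'speaking'."""
    if user_speaking:
        return "listening"                          # always yield to the user immediately
    if state == "listening" and pause_ms >= tentative_after_ms:
        return "tentative"                          # begin a response, ready to abandon it
    if state == "tentative" and pause_ms >= commit_after_ms:
        return "speaking"                           # the pause held, commit to the response
    return state
```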
The interruption is an important point yeah. It's so annoying when Siri misunderstands again and starts rattling off a whole host of options. And keeps getting stuck in a loop if you don't respond.
In fact I'm really surprised these assistants are still as crap as they are. Totally scripted, zero AI. It seems like low-hanging fruit to implement an LLM, but none of the big three have done so. Not even sure about the fringe ones like Cortana and Bixby.
Yeah when I was developing it, it quickly became apparent that I needed to be able to interrupt it. So I implemented that. Pretty easy to implement actually. Much harder would be to have the model interrupt the human. But I think it is actually desirable for natural conversation, so I do think a turn-taking model should be able to signal the LLM to interrupt the human.
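The user-interrupts-bot direction really is roughly this simple, assuming playback is chunked and a VAD flag is available while the bot is speaking (a sketch, not my actual implementation):

```python
# A sketch of the easy direction (user interrupts bot), assuming chunked playback
# and a VAD flag available while speaking; not the actual implementation.
def speak_interruptibly(tts_chunks, play_chunk, user_is_speaking):
    """Play TTS audio chunk by chunk, aborting as soon as the user starts talking.

    tts_chunks:       iterable of audio chunks from the TTS engine
    play_chunk:       callable that plays one chunk
    user_is_speaking: callable returning True when the VAD detects speech
    """
    for chunk in tts_chunks:
        if user_is_speaking():   # barge-in: stop mid-sentence
            return False         # caller should also truncate the LLM's context here
        play_chunk(chunk)
    return True
```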
> determining when the user is done talking is tough.
Sometimes that task is tough for the speaker too, not just the listener. Courteous interruptions or the lack thereof might be a shibboleth for determining when we are speaking to an AI.
Yes interruptions are key, both ways. Having the user interrupt the bot is easy, but to have the bot interrupt the human will again require a model to predict when that should happen. But I do believe it is desirable for natural conversation.
Can you share a GitHub link to this? Where are you reducing the latency? Are you processing the raw audio to text? In my experience ChatGPT generation time is much faster than local Llama unless you're using something potato like a 7B model.
Unfortunately it has a really high "works on my machine" factor. I'm using Llama2-chat-13B via mlc-llm + whisper-streaming + coqui TTS. I just have a bunch of hardcoded paths and these projects tend to be a real pain to set up, so figuring out a nice way to package it up with its dependencies in a portable way is the hard part.
I'm mostly using llama2 because I wanted it to work entirely offline, not because it's necessarily faster, although it is quite fast with mlc-llm. Calling out to GPT-4 is something I'd like to add. I think the right thing is actually to have the local model generate the first few words (even filler words sometimes maybe) and then switch to the GPT-4 answer whenever it comes back.
I wonder when computers will start taking our intonation into account too. That would really help with understanding the end of a phrase. And there’s SO MUCH information in intonation that doesn’t exist in pure text. Any AI that doesn’t understand that part of language will always still be kinda dumb, however clever they are.
You're right. Ultimately the only way this will really work is as an end-to-end model. Text will only get you so far. We could approximate it now with screenplay-like emotion annotations on text, which LLMs should both easily understand and be able to produce themselves (though you'd have to train a new speech recognition system to produce them). But end-to-end will be required eventually to reach human level fluency.
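As a toy illustration of the annotation idea (the bracket-tag format here is made up): the LLM emits inline tags and the TTS layer strips them out and maps them to whatever style controls the voice supports:

```python
# A toy illustration of the annotation idea; the bracket-tag format is made up.
import re

TAG = re.compile(r"^\[(\w+)\]\s*")

def split_annotation(line: str):
    """Split '[hesitant] I... think so?' into ('hesitant', 'I... think so?')."""
    m = TAG.match(line)
    emotion = m.group(1) if m else "neutral"
    return emotion, TAG.sub("", line, count=1)

print(split_annotation("[hesitant] I... think that's everything?"))
```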
Don’t they do it already? There are a lot of languages where intonation is absolutely necessary to distinguish between some words, so I would be surprised that this not already taken into account by the major voice assistants.
In English, intonation changes the meaning of the word but not the word itself. From what I understand, in tonal languages tone changes the whole word. I don't think ML understands that difference yet.
Yeah they do. I was able to get ChatGPT-4 to transcribe 我哥哥高過他的哥哥, which says that they can. I did have to set the app to Chinese, and the original didn't work so I had to modify what I said slightly.
Of course speech recognition works for Chinese. What it doesn't do is transcribe intonation and prosody in non-tonal languages. It's not even clear how one would transcribe such a thing as I'm not aware of a standard notation.
Maybe? I thought IPA was just phonetic but I see that it does have some optional prosody stuff that could in theory cover some of it. I'm not sure how standard or complete it really is in practice.
I haven't heard of any large datasets of IPA transcripts of speech with the detail necessary to train a fully realistic STT->LLM->TTS system. If you know of some that would be interesting to look at.
Also curious to hear about your setup. Using whisper too? When I was experimenting with it there was still a lot of annoyance about hallucinations and I was hard coding some "if last phrase is 'thanks for watching', ignore last phrase"
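The heuristic was basically this (a cleaned-up illustration; the phrase list beyond the one I actually used is made up): drop trailing phrases Whisper tends to hallucinate when it's fed silence:

```python
# Cleaned-up illustration of the heuristic; the phrase list beyond
# "thanks for watching" is made up.
HALLUCINATED_TAILS = {"thanks for watching", "thank you for watching",
                      "don't forget to subscribe"}

def strip_hallucinated_tail(transcript: str) -> str:
    phrases = [p.strip() for p in transcript.strip().split(".") if p.strip()]
    if phrases and phrases[-1].lower().rstrip("!") in HALLUCINATED_TAILS:
        phrases = phrases[:-1]
    return ". ".join(phrases)
```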
I was just googling a bit to see what's out there now for whisper/llama combos and came across this: https://github.com/yacineMTB/talk
There's a demo linked on the github page that seems relatively fast at responding conversationally, but still maybe 1-2 seconds at times. Impressive it's entirely offline.
Lol yeah the hallucinations are a huge problem. Likely solvable; I think there are probably some bugs in various whisper implementations that are making the problem worse than it should be. I haven't really dug in on that yet though. I was hoping I could switch to a different STT model more designed for real time, like Meta's SeamlessM4T, but it's still under a non-commercial license and I did have an idea that I might want to try making a product sometime. I did see that yacine made that version but I haven't tried it, so I don't know how it compares to mine.
Is there any extra work OpenAI’s product might be doing contributing to this latency that yours isn’t? Considering the scale they operate at and any reputational risks to their brand?
If you're suggesting that OpenAI's morality filters are responsible for a significant part of their voice response latency, then no. I think that's unlikely to be a relevant factor.
To me this is the cleanest and most efficient solution to the problem.
Tbh, ever since voice assistants landed I’ve wanted a handheld mic with a hardware button. No wake command, no (extra) surveillance, just snappy low-latency responses.
I'm using Llama2-chat-13B via mlc-llm @ 4bit quantization + whisper-streaming + coqui TTS, all running simultaneously on one 4090 in real time.
It didn't take long to prototype. Polishing and shipping it to non-expert users would take much longer than I've spent on it so far. I'd have to test for and solve a ton of installation problems, find better workarounds for whisper-streaming's hallucination issues, improve the heuristics for controlling when to start and stop talking, tweak the prompts to improve the suitability of the LLM responses for speech, fixup the LLM context when the LLM's speech is interrupted, probably port the whole thing to Windows for broader reach in the installed base of 4090s, possibly introduce a low-memory mode that can support 12GB GPUs that are much more common, document the requirements and installation process, and figure out hosting for the ginormous download it would be. I'd estimate at least 10x the effort I've spent so far on the prototype before I'd really be satisfied with the result.
I'd honestly love to do all that work. I've been prioritizing other projects because I judged that it was so obvious as a next step that someone else was probably working on the same thing with a lot more resources and would release before I could finish as a solo dev. But maybe I'm wrong...
> It didn't take long to prototype. Polishing and shipping it to non-expert users would take much longer than I've spent on it so far. I'd have to test for and solve a ton of installation problems
I've found some success at this by using Nix... but Nix is a whole 'nother ball of yarn to learn. It WILL get you to declarative/deterministic installs of any piece of the toolchain it covers, though, and it does a hell of a lot better job managing dependencies than anything in Python's ecosystem ever will (in fact, I am pretty sure that Python's being terrible at this is actually driving Nix adoption)
As an example of the power Nix might enable, check out https://nixified.ai/ (which is a project that hasn't been updated in some months and I hope is not dead... It does have some forks on Github, though). Assuming you already have Nix installed, you can get an entire ML toolchain up including a web frontend with a single command. I have dozens of projects on my work laptop, all with their own flake.nix files, all using their own versions of dependencies (which automatically get put on the PATH thanks to direnv), nothing collides with anything else, everything is independently updateable. I'm actually the director of engineering at a small startup and having our team's dev environments all controlled via Nix has been a godsend already (as in, a massive timesaver).
I do think that you could walk a live demo of this into, say, McDonald's corporate, and walk out with a very large check and a contract to hire a team to look into building it out into a product, though. (If you're going to look at chains, I'd suggest Wawa first though, as they seem to embrace new ordering tech earlier than other chains.)
Nix sounds good for duplicating my setup on other machines I control. But I'd like a way to install it on user machines, users who probably don't want to install Nix just for my thing. Nix probably doesn't have a way to make self contained packages, right?
> But I'd like a way to install it on user machines, users who probably don't want to install Nix just for my thing. Nix probably doesn't have a way to make self contained packages, right?
I mean... that's the heart of the problem right there. You can either have all statically compiled binaries (which don't need Nix to run), which have no outside dependencies but result in a ton of wasted disk space with duplicate dependency data everywhere, or you can share dependencies via some scheme, and the only scheme that makes real sense (because it creates real isolation between projects but also lets you share identical dependencies with zero conflicts) is Nix's; all of the others have flaws and nondeterminism.
I wish docker could be used more easily with graphic cards and other hardware peripherals (speakers/mic in this case). It would solve a lot of these issues.
Actually I do think this is a good idea. For best latency there should be multiple LLMs involved, a fast one to generate the first few words and then GPT-4 or similar for the rest of the response. In the case that the fast model is unsure, it could absolutely generate filler words while it waits for the big model to return the actual answer. I guess that's pretty much how humans use filler words too!
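A sketch of that handoff, assuming both models are exposed as simple callables (the names and timeout are made up):

```python
# Sketch of the two-model handoff described above (an assumption, not a spec):
# start both models, speak the fast model's opening words (or a filler) right
# away, then continue with the big model's answer once it arrives.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def respond_with_handoff(prompt, fast_llm, big_llm, speak, timeout_s=0.3):
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast_future = pool.submit(fast_llm, prompt)   # e.g. local Llama: a few opening words
        big_future = pool.submit(big_llm, prompt)     # e.g. GPT-4: the full answer
        try:
            opener = fast_future.result(timeout=timeout_s)
        except FutureTimeout:
            opener = "Hmm, let me think..."           # filler while both models are slow
        speak(opener)                                 # start talking immediately
        speak(big_future.result())                    # finish with the big model's answer
```

In practice you'd want to splice the two outputs so the big model's answer continues from the opener rather than repeating it, but that's the gist.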