
I think training on pause tokens or something similar would probably be the key to something like this. Or maybe it's not even necessary: maybe if you just tell GPT-4 to output something like "...." every time it thinks it should keep waiting for a response (then you wouldn't need to wait for the user to finish), things would be a lot smoother.
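
Something along these lines, just a sketch with the OpenAI Python client; the system prompt, the "...." sentinel, and the way fragments are fed in are all made up for illustration:

    # Sketch of the sentinel idea: ask the model to reply with "...." when it
    # thinks the speaker isn't finished yet, and only surface real replies.
    from openai import OpenAI

    client = OpenAI()
    PAUSE = "...."

    SYSTEM = (
        "You are listening to someone speak, delivered in fragments. "
        f"If they do not seem finished, reply with exactly '{PAUSE}'. "
        "Otherwise, reply normally."
    )

    def maybe_respond(transcript_so_far: str) -> str | None:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": transcript_so_far},
            ],
        )
        text = resp.choices[0].message.content.strip()
        return None if text == PAUSE else text  # None means "keep listening"
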


Yes, you could probably fine-tune (or even zero-shot) an LLM to handle the "knowing when to jump in" use case.

The real problem is that it's simply too computationally expensive to continually feed audio and video into one of these massive LLMs just in case it might decide to jump in.

I was wondering if you could train a lightweight monitoring model that continually watches the audio/video input and only tries to work out when the full-sized LLM might want to jump in and generate a response.
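
Roughly this kind of two-tier setup; the gate_score heuristic and respond_with_big_llm are just stand-ins for a small trained classifier and the expensive model call:

    # A cheap "should I jump in?" gate runs on every transcribed chunk, and
    # only when it fires do we pay for the big LLM on the accumulated context.
    from collections import deque

    RECENT = deque(maxlen=50)   # rolling window of recent transcript chunks
    THRESHOLD = 0.8             # arbitrary; would be tuned on real data

    def gate_score(chunk: str) -> float:
        """Placeholder for a lightweight model that estimates how likely it
        is that the assistant should respond right now."""
        return 1.0 if chunk.rstrip().endswith("?") else 0.1  # toy heuristic

    def respond_with_big_llm(context: str):
        print(f"[big LLM invoked on {len(context)} chars of context]")

    def on_new_chunk(chunk: str):
        RECENT.append(chunk)
        if gate_score(chunk) >= THRESHOLD:
            respond_with_big_llm(" ".join(RECENT))  # the rare, expensive call
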


As the human brain is a clump of regions, all interconnected and interacting (for example, one may focus their attention elsewhere until their name is called), having a light model wait for an important cue makes sense for more than just fiscal reasons.

One time I was so distracted, I missed an entire paragraph someone said to me, walked to my car, drove away, and 5 minutes later processed it.


Yeah, one thing I've noticed myself do is that when I'm focused on something else and someone suddenly gets my attention, I'll replay the last few seconds of the conversation in my head to get context on what was being talked about before I respond. That seems pretty trivial to do with an LLM; it doesn't need to be using 100% of its "brainpower" at all times.
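
The "replay" part is basically a small timestamped buffer; a toy version (the window size and the trigger that calls replay_context are arbitrary):

    # Keep only the last few seconds of what was heard, and hand it to the
    # model only when something (a wake word, a gate model, etc.) triggers.
    import time
    from collections import deque

    BUFFER = deque()          # (timestamp, text) pairs
    WINDOW_SECONDS = 15.0     # how much recent context to "replay"

    def hear(text: str):
        now = time.monotonic()
        BUFFER.append((now, text))
        while BUFFER and now - BUFFER[0][0] > WINDOW_SECONDS:
            BUFFER.popleft()  # forget anything older than the window

    def replay_context() -> str:
        """Called only at the moment the assistant decides to respond."""
        return " ".join(text for _, text in BUFFER)
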



