None of the inference frameworks (vLLM/SGLang) supports the full model, let alone no...

AndreSlavescu · 2025-12-10T20:50:21 1765399821

We actually deployed working speech to speech inference that builds on top of vLLM as the backbone. The main thing was to support the "Talker" module, which is currently not supported on the qwen3-omni branch for vLLM.

Check it out here: https://models.hathora.dev/model/qwen3-omni

sosodev · 2025-12-10T21:49:08 1765403348

Is your work open source?

AndreSlavescu · 2025-12-15T19:15:48 1765826148

At the moment, no, unfortunately. However, as far as open source alternatives go, the vLLM team has now published a separate repository for omni models:

https://github.com/vllm-project/vllm-omni

I have not yet tested whether this does full speech to speech, but it seems like a promising workspace for omni-modal models.

red2awn · 2025-12-10T21:08:52 1765400932

Nice work. Are you working on streaming input/output?

AndreSlavescu · 2025-12-10T21:14:31 1765401271

Yeah, that's something we currently support. Feel free to try the platform out! No cost to you for now, you just need a valid email to sign up on the platform.

valleyer · 2025-12-11T10:07:57 1765447677

I tried this out, and it's not passing the record (n.) vs. record (v.) test mentioned elsewhere in this thread. (I can ask it to repeat one, and it often repeats the other.) Am I not enabling the speech-to-speech-ness somehow?

AndreSlavescu · 2025-12-15T19:14:35 1765826075

From my understanding of the above problem, this would be something to do with the model weights. Have you tested this with the transformers inference baseline that is shown on huggingface?

In our deployment, we do not actually tune the model in any way; this is all just using the base instruct model provided on huggingface:

https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct

As for the potential concern around conversation turns, our platform is designed for one-off record -> response flows. Via the API, though, you can build your own conversation agent on top of the model.

sosodev · 2025-12-10T20:29:12 1765398552

That's unfortunate but not too surprising. This type of model is very new to the local hosting space.

whimsicalism · 2025-12-11T02:23:42 1765419822

Makes sense, I think streaming audio->audio inference is a relatively big lift.

red2awn · 2025-12-11T09:21:18 1765444878

Correct, it breaks the single prompt, single completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low latency response you have to do streaming KV cache prefill with a websocket server.
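
To make the idea concrete, here is a rough toy sketch of incremental prefill, not how vLLM or SGLang actually do it. Everything in it is an assumption for illustration: `StreamingSession`, `serve`, the `None` end-of-turn marker, and the use of an `asyncio.Queue` as a stand-in for the websocket are all made up, and the "KV cache" is just a list of tokens.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StreamingSession:
    """Toy per-connection state (hypothetical, not a vLLM/SGLang API)."""
    kv_cache: list = field(default_factory=list)  # one entry per prefilled token
    prefill_calls: int = 0

    def prefill(self, tokens):
        # A real engine would run a forward pass over `tokens`, attending to
        # the existing cache, and append the new keys/values. Here we only
        # record the tokens.
        self.kv_cache.extend(tokens)
        self.prefill_calls += 1

    def decode(self, max_new_tokens):
        # Placeholder decode: a real engine would sample from the model.
        return [f"out{i}" for i in range(max_new_tokens)]


async def serve(incoming: asyncio.Queue, session: StreamingSession):
    """Prefill each chunk as it arrives; decode only once the turn ends."""
    while True:
        chunk = await incoming.get()
        if chunk is None:  # end-of-turn marker (assumed protocol)
            return session.decode(max_new_tokens=3)
        session.prefill(chunk)


async def demo():
    queue = asyncio.Queue()  # stands in for messages read off a websocket
    session = StreamingSession()
    for chunk in (["a1", "a2"], ["a3"], ["a4", "a5"], None):
        queue.put_nowait(chunk)
    reply = await serve(queue, session)
    return session, reply


session, reply = asyncio.run(demo())
print(session.prefill_calls, len(session.kv_cache), reply)
```

The point is only that the prompt is no longer a single request: by the time the end-of-turn marker arrives, the cache already covers all earlier chunks, so the only work left before the first output token is the decode itself.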

whimsicalism · 2025-12-11T17:37:21 1765474641

I imagine you have to start decoding many speculative completions in parallel to have true low latency.
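
One possible reading of that, as a toy sketch: kick off a speculative decode after every input chunk, throw it away if more input arrives, and keep whichever one was started last. `decode` and `run_turn` are hypothetical names, the sleep is a placeholder for model latency, and this is not taken from any real framework.

```python
import asyncio


async def decode(prefix):
    # Placeholder for model decoding (hypothetical); sleeps to mimic latency.
    await asyncio.sleep(0.01)
    return f"reply to {len(prefix)} chunks"


async def run_turn(chunks):
    """Start a speculative decode after every chunk; keep only the last one.

    Assumes `chunks` contains at least one chunk.
    """
    prefix, task, cancelled = [], None, 0
    for chunk in chunks:
        if task is not None:
            task.cancel()  # more input arrived, so the earlier guess is stale
            cancelled += 1
        prefix.append(chunk)
        task = asyncio.create_task(decode(list(prefix)))
        await asyncio.sleep(0)  # yield so the speculative task starts running
    return await task, cancelled


reply, cancelled = asyncio.run(run_turn(["c1", "c2", "c3"]))
print(reply, cancelled)
```

The trade-off is wasted compute: every cancelled decode is work thrown away in exchange for having a response already in flight when the speaker actually stops.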