Hacker News

A 10M context window at such a low cost, while also scoring among the top on LMArena, is really impressive.

The choice of 128 experts is also unprecedented as far as I know, right? But it seems to have worked pretty well.



I suppose the question is, are they also training a 288B x 128 expert (16T) model?

Llama 4 Colossus when?


What does it mean to have 128 experts? I feel like it's more like 128 slightly dumb intelligences that average out to something expert-like.

Like, if you consulted 128 actual experts, you'd get something way better than any LLM output.
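It's less "128 separate brains" and more one network where a learned router picks a few expert sub-layers per token. A minimal sketch of generic top-k mixture-of-experts routing (the general technique, not Llama 4's actual implementation; all dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 128, 2  # assumed toy sizes, not real model dims

gate_w = rng.standard_normal((d, n_experts)) * 0.02      # router weights
experts = rng.standard_normal((n_experts, d, d)) * 0.02  # each expert reduced to one matrix, for brevity

def moe_forward(x):
    logits = x @ gate_w                # one routing score per expert
    top = np.argsort(logits)[-k:]      # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the chosen k only
    # Only k of the 128 experts run for this token -- that's why
    # active parameters stay small while total parameters are huge.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d)
out = moe_forward(token)
print(out.shape)
```

So the "experts" aren't independent consultants averaging their opinions; the router composes a different small subset of the network for each token.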


Let's see how that 10M context holds up. The 128k pretraining context is a good indicator it's not a scam, but we're yet to see any numbers on this "iRoPE" architecture. At 17B active parameters, and with 800G fabrics hitting the market, I think it could work. I'm sure next year it'll be considered idiotic to keep the K/V cache in actual memory.
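To see why keeping K/V in memory at 10M tokens is painful, here's a back-of-envelope cache-size calculation. The layer count, KV-head count, and head dimension below are assumptions for illustration, not Llama 4's published config:

```python
# Hypothetical model dimensions (assumptions, not Llama 4's real config)
layers = 48          # transformer layers
kv_heads = 8         # grouped-query KV heads
head_dim = 128       # per-head dimension
bytes_per = 2        # fp16/bf16 element size
ctx = 10_000_000     # 10M-token context

# K and V each store layers * kv_heads * head_dim values per token
per_token = 2 * layers * kv_heads * head_dim * bytes_per
total_bytes = per_token * ctx

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_bytes / 2**40:.2f} TiB for the full context")
```

Under these assumptions that's roughly 192 KiB per token and on the order of a couple of TiB for the full window, which is why offloading or restructuring the cache starts to look attractive.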



