Hacker News

A 10M context window at such a low cost, while also scoring among the top on LMArena, is really impressive.

The choice of 128 experts is also unprecedented as far as I know, right? But it seems to have worked pretty well.



I suppose the question is, are they also training a 288B x 128 expert (16T) model?

Llama 4 Colossus when?


What does it mean to have 128 experts? I feel like it's more like 128 slightly dumb intelligences that average out to something expert-like.

Like, if you consulted 128 actual experts, you'd get something way better than any LLM output.
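It's less "128 separate brains" and more one network where a learned router picks a few expert sub-layers per token. A minimal sketch of generic top-k mixture-of-experts routing (the general technique, not Llama 4's actual implementation; all dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 128, 2  # assumed toy sizes, not real model dims

gate_w = rng.standard_normal((d, n_experts)) * 0.02      # router weights
experts = rng.standard_normal((n_experts, d, d)) * 0.02  # each expert reduced to one matrix, for brevity

def moe_forward(x):
    logits = x @ gate_w                # one routing score per expert
    top = np.argsort(logits)[-k:]      # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()           # softmax over the chosen k only
    # Only k of the 128 experts run for this token -- that's why
    # active parameters stay small while total parameters are huge.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d)
out = moe_forward(token)
print(out.shape)
```

So the "experts" aren't independent consultants averaging their opinions; the router composes a different small subset of the network for each token.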


Let's see how that 10M context holds up. The 128k pretraining context is a good indicator it's not a scam, but we're yet to see any numbers on this "iRoPE" architecture. At 17B active parameters, and with 800G fabrics hitting the market, I think it could work. I'm sure next year it'll be considered idiotic to keep the K/V cache in actual memory.
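To see why keeping K/V in memory at 10M tokens is painful, here's a back-of-envelope cache-size calculation. The layer count, KV-head count, and head dimension below are assumptions for illustration, not Llama 4's published config:

```python
# Hypothetical model dimensions (assumptions, not Llama 4's real config)
layers = 48          # transformer layers
kv_heads = 8         # grouped-query KV heads
head_dim = 128       # per-head dimension
bytes_per = 2        # fp16/bf16 element size
ctx = 10_000_000     # 10M-token context

# K and V each store layers * kv_heads * head_dim values per token
per_token = 2 * layers * kv_heads * head_dim * bytes_per
total_bytes = per_token * ctx

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_bytes / 2**40:.2f} TiB for the full context")
```

Under these assumptions that's roughly 192 KiB per token and on the order of a couple of TiB for the full window, which is why offloading or restructuring the cache starts to look attractive.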



