
What sticks out to me is the 60% win rate vs GPT-4o when it comes to actual usage by humans for programming tasks. So in reality it's barely better than GPT-4o at real-world programming. That the figure is higher for mathematical calculation isn't surprising, because LLMs were much worse at that than at programming to begin with.


I'm not sure that's the right way to interpret it.

If some tasks are too easy, both models might give satisfactory answers, in which case the human preference might as well be a coin toss.

I don't know the specifics of their methodology though.
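To make the coin-toss point concrete: if some fraction of tasks are ties (both answers satisfactory, preference ~50/50), those tasks dilute the observed win rate. A quick back-of-envelope sketch, with the tie fraction as a hypothetical free parameter since we don't know the actual methodology:

```python
# Assumption: observed win rate = 0.5 * tie_fraction + q * (1 - tie_fraction),
# where q is the win rate on tasks where humans can actually tell the
# models apart. Solving for q shows how much a 60% figure could understate
# the gap on "decisive" tasks.

def decisive_win_rate(observed, tie_fraction):
    """Back out the win rate on distinguishable tasks from the overall rate."""
    return (observed - 0.5 * tie_fraction) / (1 - tie_fraction)

for t in (0.0, 0.5, 0.8):
    print(f"tie fraction {t:.1f} -> decisive win rate {decisive_win_rate(0.6, t):.2f}")
```

With no ties, 60% overall means 60% on decisive tasks; but if 80% of tasks were too easy to discriminate, a 60% overall rate would mean winning essentially every task where humans could tell the difference.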



