Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult (simonwillison.net)
7 points by janpio 21 days ago | 2 comments


The writing is quite Claude-y. I don't know if it's just me, but Claude models spend more time talking about their feelings to me than they do giving me smart answers.

> And honestly, this is mainly on me. I’ve fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they’ve fallen one-by-one and now I’m embarrassingly lacking in suitable challenges to help evaluate new models.

> “Here’s an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5” would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.




