Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

astrange · 2025-11-26T04:43:58 1764132238

The writing is quite Claude-y. I don't know if it's just me, but Claude models spend more time talking to me about their feelings than giving me smart answers.

> And honestly, this is mainly on me. I’ve fallen behind on maintaining my own collection of tasks that are just beyond the capabilities of the frontier models. I used to have a whole bunch of these but they’ve fallen one-by-one and now I’m embarrassingly lacking in suitable challenges to help evaluate new models.

> “Here’s an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5” would excite me a lot more than some single digit percent improvement on a benchmark with a name like MMLU or GPQA Diamond.

ChrisArchitect · 2025-11-24T20:50:29 1764017429

More discussion: https://news.ycombinator.com/item?id=46037637