Hacker News
GodelNumbering
25 days ago
on: Claude Opus 4.5
The fact that the post singled out SWE-bench at the top makes the opposite impression of what they probably intended.
grantpitt
25 days ago
do say more
GodelNumbering
25 days ago
Makes it sound like a one-trick pony
jascha_eng
25 days ago
Anthropic is leaning heavily into agentic coding, so it makes sense to use SWE-bench Verified as their main benchmark. It is also the one benchmark where Google did not take the top spot last week. Claude remains king; that's all that matters here.
Mkengin
25 days ago
I am eagerly awaiting swe-rebench results for November with all the new models:
https://swe-rebench.com/
grantpitt
25 days ago
well, it's a big trick