
Exactly. We don't make claims about humans, but there is room for improvement on current LLMs. For researchers to be able to improve LLMs, we first need to know how to evaluate them. We can only improve what we can measure, so we studied how to measure them :)




So it's cyclical grading? Like elementary math students grading each other's solutions, right?

How can this even be valid scientifically?
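
For concreteness, the round-robin setup being described (each model grading every other model's answers, with a model's score being the average grade its peers assign) might look roughly like the sketch below. This is purely illustrative: the model names, grading prompt, and 0-10 scale are assumptions, not the actual protocol from the paper.

    from statistics import mean
    from typing import Callable, Dict, List

    def peer_grade(
        models: List[str],
        questions: List[str],
        query: Callable[[str, str], str],
    ) -> Dict[str, float]:
        """Round-robin peer grading: every model grades every other model's answers.

        `query(model, prompt)` is a caller-supplied function returning the model's
        text response; graders are asked to reply with a number from 0 to 10.
        """
        # Each model answers every question once.
        answers = {m: {q: query(m, q) for q in questions} for m in models}
        scores: Dict[str, List[float]] = {m: [] for m in models}
        for answerer in models:
            for grader in models:
                if grader == answerer:
                    continue  # no self-grading
                for q in questions:
                    prompt = (
                        f"Question: {q}\nAnswer: {answers[answerer][q]}\n"
                        "Grade the answer from 0 to 10. Reply with the number only."
                    )
                    scores[answerer].append(float(query(grader, prompt)))
        # A model's score is the mean grade its answers receive from its peers.
        return {m: mean(s) for m, s in scores.items()}

    # Toy usage with a dummy backend that always answers "42" and grades 7.
    if __name__ == "__main__":
        dummy = lambda model, prompt: "7" if "Grade the answer" in prompt else "42"
        print(peer_grade(["model-a", "model-b"], ["What is 6 * 7?"], dummy))

Whether that is scientifically valid is exactly the open question: the grades are only as trustworthy as the graders, which is why measuring the evaluators themselves matters.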



