
Exactly. We don't make claims about humans, but there is room for improvement on current LLMs. For researchers to be able to improve LLMs, we first need to know how to evaluate them. We can only improve what we can measure, so we studied how to measure them :)




So it's cyclical grading? Like elementary math students grading each other's solutions, right?

How can this even be valid scientifically?
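
For concreteness, the round-robin setup being described (each model grading every other model's answers, with a model's score being the average grade its peers assign) might look roughly like the sketch below. This is purely illustrative: the model names, grading prompt, and 0-10 scale are assumptions, not the actual protocol from the paper.

    from statistics import mean
    from typing import Callable, Dict, List

    def peer_grade(
        models: List[str],
        questions: List[str],
        query: Callable[[str, str], str],
    ) -> Dict[str, float]:
        """Round-robin peer grading: every model grades every other model's answers.

        `query(model, prompt)` is a caller-supplied function returning the model's
        text response; graders are asked to reply with a number from 0 to 10.
        """
        # Each model answers every question once.
        answers = {m: {q: query(m, q) for q in questions} for m in models}
        scores: Dict[str, List[float]] = {m: [] for m in models}
        for answerer in models:
            for grader in models:
                if grader == answerer:
                    continue  # no self-grading
                for q in questions:
                    prompt = (
                        f"Question: {q}\nAnswer: {answers[answerer][q]}\n"
                        "Grade the answer from 0 to 10. Reply with the number only."
                    )
                    scores[answerer].append(float(query(grader, prompt)))
        # A model's score is the mean grade its answers receive from its peers.
        return {m: mean(s) for m, s in scores.items()}

    # Toy usage with a dummy backend that always answers "42" and grades 7.
    if __name__ == "__main__":
        dummy = lambda model, prompt: "7" if "Grade the answer" in prompt else "42"
        print(peer_grade(["model-a", "model-b"], ["What is 6 * 7?"], dummy))

Whether that is scientifically valid is exactly the open question: the grades are only as trustworthy as the graders, which is why measuring the evaluators themselves matters.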



