Hacker News

The benchmark you're linking in 2 is genuinely meaningless because it covers one specific task. I could just as easily build a benchmark around another task (one I'm personally working on) where, say, Gemini far outperforms GPT-4 Vision and every Claude model (not sure about GPT-4o yet) and post that as a benchmark. Would that mean Gemini is better at image reasoning? No.

These benchmarks really miss the mark, and I hope people here are smart enough to do their own testing, or to rely on tests covering a much broader variety of tasks, if they want to measure overall performance. Right now the big three (GPT, Claude, Gemini) each have tasks where they beat the other two.
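To illustrate the point about task variety: a minimal sketch, with entirely made-up scores and placeholder model/task names, showing how a ranking based on one task can flip once you average over several.

```python
# Hypothetical illustration: why a single-task benchmark misleads.
# All scores, model names, and task names below are invented.
scores = {
    "model_a": {"task_1": 0.95, "task_2": 0.60, "task_3": 0.70},
    "model_b": {"task_1": 0.70, "task_2": 0.90, "task_3": 0.72},
}

# Ranking on task_1 alone picks model_a...
best_on_task_1 = max(scores, key=lambda m: scores[m]["task_1"])

# ...but the macro-average over all three tasks picks model_b.
macro_avg = {m: sum(t.values()) / len(t) for m, t in scores.items()}
best_overall = max(macro_avg, key=macro_avg.get)

print(best_on_task_1, best_overall)
```

Neither ranking is "wrong"; they just answer different questions, which is why a single-task leaderboard says little about overall capability.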



It's a test designed for humans. I'm personally not a big fan of the popular benchmarks because they are, ironically, exactly the narrow tasks these models are trained on. In fact, GPT-4o's performance on key benchmarks has been higher, but on real-world tasks it has flopped at everything we used other models for.

They're best tested on the kinds of tasks you would give humans. GPT-4 is still the best contender on AP Biology, which is a legitimately difficult benchmark.

GPT tends to work with whatever you throw at it, while Gemini just hides behind arbitrary benchmarks. If there are tasks that some models are better at than others, then by all means let's highlight them, rather than acting defensive when another model does much better at a certain task.



