Hacker News

The benchmark you're linking in 2 is genuinely meaningless because it covers one specific task. I could just as easily build a benchmark around another task (one I'm personally working on) where, say, Gemini far outperforms GPT-4 Vision and every Claude model (not sure about GPT-4o yet) and post that as a benchmark. Would that mean Gemini is better at image reasoning? No.

These benchmarks really miss the mark, and I hope people here are smart enough to do their own testing, or to rely on tests covering a much broader variety of tasks, if they want to measure overall performance. Right now the big three (GPT, Claude, Gemini) each have tasks where they beat the other two.
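To illustrate the point about task variety: a minimal sketch, with entirely made-up scores and placeholder model/task names, showing how a ranking based on one task can flip once you average over several.

```python
# Hypothetical illustration: why a single-task benchmark misleads.
# All scores, model names, and task names below are invented.
scores = {
    "model_a": {"task_1": 0.95, "task_2": 0.60, "task_3": 0.70},
    "model_b": {"task_1": 0.70, "task_2": 0.90, "task_3": 0.72},
}

# Ranking on task_1 alone picks model_a...
best_on_task_1 = max(scores, key=lambda m: scores[m]["task_1"])

# ...but the macro-average over all three tasks picks model_b.
macro_avg = {m: sum(t.values()) / len(t) for m, t in scores.items()}
best_overall = max(macro_avg, key=macro_avg.get)

print(best_on_task_1, best_overall)
```

Neither ranking is "wrong"; they just answer different questions, which is why a single-task leaderboard says little about overall capability.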



It's a test designed for humans. I'm personally not a big fan of the popular benchmarks because they are, ironically, exactly the narrow tasks these models are trained on. In fact, GPT-4o's performance on key benchmarks has been higher, but on real-world tasks it has flopped at everything we used other models for.

They're best tested on the kinds of tasks you would give humans. GPT-4 is still the best contender on AP Biology, which is a legitimately difficult benchmark.

GPT tends to work with whatever you throw at it, while Gemini just hides behind arbitrary benchmarks. If there are tasks that some models are better at than others, then by all means let's highlight them, rather than acting defensive when another model does much better at a certain task.



