What sticks out to me is the 60% win rate vs GPT-4o when it comes to actual usage by humans for programming tasks. So in reality it's barely better than GPT-4o. That the figure is higher for mathematical calculation isn't surprising because LLMs were much worse at that than at programming to begin with.