According to the benchmarks here they're claiming up to 97% accuracy. That ought...

JimDabell · 2025-12-12T10:49:40 1765536580

Something that is 97% accurate is wrong 3% of the time, so pointing out that it has gotten something wrong does not contradict 97% accuracy in the slightest.

refactor_master · 2025-12-12T09:28:11 1765531691

Gemini routinely makes up stuff about BigQuery’s workings. “It’s poorly documented”. Well, read the open source code, reason it out.

Makes you wonder what 97% is worth. Would we accept a different service with only 97% availability, and all downtime during lunch break?

TeMPOraL · 2025-12-12T12:11:47 1765541507

I.e., like most restaurants and food delivery? :) Though a 3% problem rate is optimistic.

AstroBen · 2025-12-11T23:32:32 1765495952

Does code work if it's 97% correct?

It's not okay if claims are totally made up 1 in 30 times.

Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them

fooker · 2025-12-12T09:29:06 1765531746

> Does code work if it's 97% correct?

Of course it does. The vast majority of software has bugs. Yes, even critical ones like compilers and operating systems.

mbesto · 2025-12-12T15:11:23 1765552283

> Or maybe these benchmarks are all wrong

You must be new to LLM benchmarks.