Again, I'm not claiming malicious intent. But model performance depends on a number of factors, and the end-user only sees benchmarks for a specific configuration. For me to have a high degree of confidence in a provider, I would need to see open and continuous benchmarking of the end-user API.
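For what it's worth, here's roughly what I mean by that: a minimal sketch that hits an OpenAI-compatible chat endpoint with a fixed eval set on a schedule and logs the score. The URL, model name, and prompts are made up for illustration; a real harness would use a much larger, held-out eval set.

```python
import time
import requests

# Hypothetical OpenAI-compatible endpoint and model name; swap in real values.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = "sk-..."
MODEL = "some-model-2025-01-01"

# A small fixed eval set with known answers (toy examples only).
EVAL_SET = [
    ("What is 17 * 23?", "391"),
    ("Spell 'accommodate' backwards.", "etadommocca"),
]

def run_eval() -> float:
    """Send each prompt at temperature 0 and score exact-match on the answer."""
    correct = 0
    for prompt, expected in EVAL_SET:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": MODEL,
                "temperature": 0,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        answer = resp.json()["choices"][0]["message"]["content"]
        correct += expected in answer
    return correct / len(EVAL_SET)

if __name__ == "__main__":
    # Re-run the same eval periodically and publish the scores. A sustained
    # drop on an unchanged model version is exactly what continuous
    # benchmarking of the end-user API would surface.
    while True:
        print(time.strftime("%Y-%m-%d %H:%M"), run_eval())
        time.sleep(6 * 60 * 60)  # every six hours
```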
All of those are completely irrelevant. Quantization is just a cost optimization.
People are claiming that Anthropic et al. change the quality of the model after the initial release, which is an entirely different claim and one the industry as a whole has denied. When a model is released under a certain version, the model doesn't change.
The only people who believe this are in the vibe-coding community, convinced there's some kind of big conspiracy, and any time you point out that benchmarks show the performance stays consistent, you're told you're licking corporate ass.
I might be misunderstanding your point, but quantization can have a dramatic impact on the quality of the model's output.
For example, with diffusion models, a Q8 quant can dramatically change what you can achieve compared to fp16 (I'm thinking of the Wan video models). The point I'm trying to make is that quantization is a noticeable model change, and it can be make-or-break.
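To make that concrete, here's a toy numpy sketch. It's not how real quant schemes like Q8 GGUF work (those are more sophisticated, e.g. per-block scales); it's just naive per-tensor int8, but it shows that a quantized layer already computes something measurably different from fp16, and that difference compounds across dozens of layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "layer": a weight matrix and an activation vector, purely illustrative.
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)

# fp16 reference output for this layer.
y_fp16 = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

# Naive symmetric int8 quantization of the weights (one scale per tensor).
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
y_int8 = (W_q.astype(np.float32) * scale) @ x

# The per-layer error looks small, but it accumulates through the network
# and can change which token (or pixel) wins at the end.
rel_err = np.linalg.norm(y_int8 - y_fp16) / np.linalg.norm(y_fp16)
print(f"relative error of int8 vs fp16 layer output: {rel_err:.4f}")
```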
Of course, no one is debating that. What's being debated is whether this is done after a model's initial release, e.g. that Anthropic will quietly swap the new Opus model for a worse but more cost-efficient version a few weeks after launch.
That's not the point. Tweaking your system to improve resource utilization and performance is just a day in the life of ops, and those tweaks can cause bugs you don't expect in LLMs. It's a lot easier to monitor performance in a deterministic system; it's much harder to see the true impact a change has on an LLM's output.
There was one well-documented case of performance degradation, and it arose from a stupid bug, not some secret cost-cutting measure.