Friend, the creator of this new advance is a machine learning PhD with a decade of experience pushing machine learning forward. He knows a lot of math, too. Maybe there is a chance that he, too, can tell the difference between a meaningless advance and an important one?
That is as pure an example of the fallacy of argument from authority[1] as I have ever seen, especially when you consider that any nuance in the supposed letter from the researchers to the board will have been lost in translation from "sources" to the journalist to the article.
That fallacy's existence alone doesn't discount anything (nor have you shown it applies here); otherwise we'd have to throw out the entire idea of authorities, and we'd be in trouble.
When the person arguing uses their own authority (job, education) to give their answer relevance, it is valid to point out that another person's authority (job, education) is greater and that their answer therefore deserves more weight.
I am neither a mathematician nor an LLM creator, but I do know how to evaluate interesting tech claims.
The absolute best-case scenario for a new technology is when it seems like a toy for nerds and doesn't outperform anything we have today, but the scaling path is clear.
Its problems just won't matter if it does that one thing with scaling. The web is a pretty good hypermedia platform, but a disastrously bad platform for most other computer applications. Nevertheless, the scaling of URIs and internet protocols has caused us to reorganize our lives around it. And if there really are unsolvable problems with the platform, they just get offloaded onto users. Passwords? Privacy? Your problem now. Surely you know to use a password manager?
I think this new wave of AI is going to be like that. If they never solve the hallucination/confabulation issue, it's just going to become your problem. If they never really gain insight, it's going to become your problem to instruct them carefully. Your peers will chide you for not using a robust AI-guardrail thing or not learning the basics of prompt engineering like all the kids do instinctively these days.
How on earth could you evaluate the scaling path with so little information? That's my point. You can't possibly know that a technology can solve a given kind of problem when, so far, it has only solved a completely different kind of problem that is largely unrelated!
Saying that performance on grade-school problems is predictive of performance on complex reasoning tasks (including theorem proving) is like saying that a new kind of mechanical engine that has 90% efficiency can be scaled 10x.
These kinds of scaling claims drive investment, I get it. But to someone who understands (and is actually working on) the actual problem that needs solving, this kind of claim is perfectly transparent!
Any claim of an objective, quantitative measurement of "scaling" in LLMs is voodoo snake oil when it's measured against benchmarks consisting of "which questions does it answer correctly". Any machine learning PhD will admit this, albeit only in a quiet corner of a noisy bar, after a few more drinks than is advisable, when they're earning money from companies that claim scaling wins on such benchmarks.
For the current generative AI wave, this is how I understand it:
1. The scaling path is decreased val/test loss during training.
2. We have seen multiple times that large decreases in this loss have resulted in very impressive improvements in model capability across a diverse set of tasks (e.g. gpt-1 through gpt-4, and many other examples).
3. By now, there's tons of robust data demonstrating really nice relationships between model size, quantity of data, length of training, quality of data, etc. and decreased loss (a minimal sketch of what fitting such a curve looks like follows this list). Evidence keeps building that most multi-billion-param LLMs are probably undertrained, perhaps significantly so.
4. Ergo, we should expect continued capability improvement with continued scaling. Make a bigger model, get more data, get higher data quality, and/or train for longer and we will see improved capabilities. The graphs demand that it is so.
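To make item 3 concrete, here is a minimal sketch of what fitting and extrapolating such a scaling law can look like, assuming a Chinchilla-style parametric form L(N, D) = E + A/N^alpha + B/D^beta. Every constant and data point below is an illustrative placeholder, not a real measurement or anyone's actual fitted values.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric scaling law: predicted loss as a function of
# parameter count N and training tokens D. E, A, B, alpha, beta are the
# quantities you would fit from many training runs; every number in this
# sketch is an illustrative placeholder, not a real measurement.
def scaling_law(x, E, A, B, alpha, beta):
    N, D = x
    return E + A / N**alpha + B / D**beta

# Hypothetical grid of small training runs: (params, tokens) -> observed loss.
N_grid, D_grid = np.meshgrid([1e8, 1e9, 1e10], [2e9, 2e10, 2e11])
N_obs, D_obs = N_grid.ravel(), D_grid.ravel()
loss_obs = scaling_law((N_obs, D_obs), 1.7, 400.0, 410.0, 0.34, 0.28)

# Fit the law to the small-scale runs...
popt, _ = curve_fit(scaling_law, (N_obs, D_obs), loss_obs,
                    p0=[1.5, 300.0, 300.0, 0.3, 0.3], maxfev=20000)

# ...then "extend the lines": predict the loss of a much bigger run
# before anyone pays to train it.
print(scaling_law((7e10, 1.4e12), *popt))
```

The bet described in the numbered points is essentially that a curve fit on small runs keeps holding at much larger scales, and that lower predicted loss keeps translating into broader capability.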
---
This is the fundamental scaling hypothesis that labs like OpenAI and Anthropic have been operating off of for the past 5+ years. They looked at the early versions of the curves mentioned above, extended the lines, and said, "Huh... These lines are so sharp. Why wouldn't it keep going? It seems like it would."
And they were right. The scaling curves may break at some point. But they don't show indications of that yet.
Lastly, all of this is largely just taking existing model architectures and scaling up. Neural nets are a very young technology. There will be better architectures in the future.
I don't think they will go anywhere. Europe doesn't have the ruthlessness required to compete in such an arena; it would need far more unification before that could happen. And we're only drifting further apart, it seems.
But he also has an incentive to exaggerate the AI's ability.
The whole idea of the double-blind test (and really, the whole scientific method) is based on one simple thing: even the most experienced and informed professionals can be comfortably wrong.
We'll only know when we see it. Or at least when several independent research groups see it.
> even the most experienced and informed professionals can be comfortably wrong
That's the human hallucination problem. In science it's a very difficult issue to deal with; only in hindsight can you tell which papers from a given period were the good ones. It takes a whole scientific community to arrive at the truth, and sometimes we fail.
I don't think so. The truth is advanced by individuals, not by the collective. The collective is usually wrong about things for as long as it possibly can be. Usually the collective first has to die before it accepts the truth.
> I thought (and could be wrong) that all of these concerns are based on a very low probability of a very bad outcome.
Among knowledgeable people who have concerns in the first place, I'd say putting the probability of a very bad outcome from cumulative advances at "very low" is a fringe position. It seems to vary more between "significant" and "close to unity".
There are some knowledgeable people, like Yann LeCun, who have no concerns whatsoever, but they seem singularly bad at communicating why this would be a rational position to take.
Given how dismissive LeCun is of the capabilities of SotA models, I think he thinks the state of the art is very far from human, and will never be human-like.
Myself, I think I count as a massive optimist, as my P(doom) is only about 15%, basically the 1-in-6 (~17%) odds of Russian Roulette, and half of that is humans using AI to do bad things directly.
Ah, finally, the engineer's approach to the news. I'm not sure why we have to have hot takes instead of dissecting the news and trying to tease out the how.