1) The "bitter lesson" may not be true, and there is a fundamental limit to transformer intelligence.
2) The "bitter lesson" is true, and there just isn't enough data/compute/energy to train AGI.
All the cognition should be happening inside the transformer. Attention is all you need. The possible cognition and reasoning occurring "inside" in high dimensions is much more advanced than any possible cognition that you output into text tokens.
This feels like a sidequest/hack on what was otherwise a promising path to AGI.
On the contrary, this suggests that the bitter lesson is alive and kicking. The bitter lesson doesn't say "compute is all you need", it says "only those methods which allow you to make better use of hardware as hardware itself scales are relevant".
This chain of thought / reflection method allows you to make better use of the hardware as the hardware itself scales. If a given transformer is N billion parameters, and to solve a harder problem we estimate we need 10N billion parameters, one way to do it is to build a GPU cluster 10x larger.
This method shows that there might be another way: instead, train the N billion model differently so that we can spend 10x its compute at inference time. Say hardware gets 2x better in 2 years -- then this method will be 20x better than it is now!
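To make that arithmetic concrete, here's a toy calculation under the assumptions above (the function name and the simple multiplicative model are mine, not an established scaling law):

```python
# Toy arithmetic behind the "20x" claim above. Purely illustrative: the simple
# multiply-the-factors model is an assumption, not a real scaling law.

def effective_inference_compute(inference_multiplier: float, hardware_speedup: float) -> float:
    """Rough proxy for how much more compute a fixed-size model can spend per query."""
    return inference_multiplier * hardware_speedup

today = effective_inference_compute(inference_multiplier=10, hardware_speedup=1)         # 10x
in_two_years = effective_inference_compute(inference_multiplier=10, hardware_speedup=2)  # 20x
print(today, in_two_years)  # 10.0 20.0
```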
I'd be shocked if we don't see diminishing returns in the inference-compute scaling laws. We already didn't deserve how clean and predictive the pre-training scaling laws were; there's no way the universe grants us another boon of that magnitude.
The similarity is cosmetic only. The reason it is used is because it's easy to leverage existing work in LLMs, and scaling (although not cheap) is an obvious approach.
> Does that mean human intelligence is cheapened when you talk out a problem to yourself?
In a sense, maybe yeah. Of course, if one were to be really absolute about that statement it would be absurd; it would greatly overfit reality.
But it is interesting to assume this statement as true. Oftentimes when we think of ideas "off the top of our heads" they are not as profound as ideas that "come to us" in the shower. The subconscious may be doing 'more' 'computation' in a sense. Lakoff said the subconscious was 98% of the brain, and that the conscious mind is the tip of the iceberg of thought.
lol come on, it’s not the exact same thing. At best this is like gagging yourself while you talk about it, then ungagging yourself when you say the answer. And that's presupposing LLMs are thinking in, your words, exactly the same way as humans.
Admittedly not my most articulate, my exasperation showed through. To some extent it seems warranted as it tends to be the most effective tactic against hyperbole. Still trying to find a better solution.
Karpathy himself believes that neural networks are perfectly plausible as a key component of AGI. He has said that they don't need to be superseded by something better; it's just that everything else around them (especially infrastructure) needs to improve. Since his is one of the most valuable opinions in the entire world on the subject, I tend to trust what he says.
I think it's too soon to tell. Training the next generation of models means building out entire datacenters. So while they wait they have engineers build these sidequests/hacks.
Attention is about similarity/statistical correlation, which is fundamentally stochastic, while reasoning needs to be truthful and exact to be successful.
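A minimal sketch of scaled dot-product attention (a toy illustration of the general mechanism, not any particular model's code) makes the "similarity" part concrete: the attention weights are just softmax-normalized dot products between query and key vectors.

```python
# Minimal NumPy sketch of scaled dot-product attention. Toy shapes only.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # similarity-weighted mix of values

Q, K, V = (np.random.randn(4, 8) for _ in range(3))  # 4 tokens, 8-dim vectors
out = attention(Q, K, V)                             # shape (4, 8)
```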
Imagine instead that the bitter lesson says: we can expand a circle outwards in many dimensions, each dimension a different way to continuously, mathematically manipulate data to adjust outputs.
Even the attention-token approach is, in the grand scheme of things, a single line drawn outwards from the centre; we have not even explored around the centre (with the same compute spend) for things like non-token generation; different layers, activation functions, and norming; the query/key/value set-up (why do we only use the three matrices inherent to contextualising tokens, why not add a 4th matrix for something else? see the toy sketch below); one-shot generation of characters, sentences, whole thoughts, or paragraphs; or positional embeddings that work differently.
The bitter lesson says there is a whole world, almost completely untouched by our findings, for us to explore. The temporary work of non-data approaches can piggyback off a point on the line; it cannot expand outward from the circle the way we can.
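For concreteness, here is a purely hypothetical sketch of the "4th matrix" idea above, written as a gating projection alongside Q/K/V. The class name, the gating choice, and everything else here are my own invention for illustration, not a published architecture.

```python
# Hypothetical sketch: alongside the usual query/key/value projections, learn a
# fourth projection ("gate") that modulates the attended output. Toy construction only.
import torch
import torch.nn as nn

class FourWayAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.g = nn.Linear(d_model, d_model)            # the speculative 4th matrix

    def forward(self, x):                               # x: (seq_len, d_model)
        q, k, v, g = self.q(x), self.k(x), self.v(x), self.g(x)
        weights = (q @ k.T / x.shape[-1] ** 0.5).softmax(dim=-1)
        return torch.sigmoid(g) * (weights @ v)         # gate the similarity-weighted values

x = torch.randn(4, 8)
out = FourWayAttention(8)(x)                            # shape (4, 8)
```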
This kind of short-sighted, simplistic reasoning/behaviour is what I worry about the most in terms of where our society is going. I always wonder: who will be the people buying or using your software (built very cheaply and efficiently with AI) once they can do the same, or get replaced by AI, or bankrupt themselves?
Everybody seems to be so focused on how to get ahead in the race to profitability that they don't consider that the shortcut they are taking might be leading to a cliff.
Except that these aren't thoughts. These techniques are improvements to how the model breaks down input data, and how it evaluates its responses to arrive at a result that most closely approximates patterns it was previously rewarded for. Calling this "thinking" is anthropomorphizing what's really happening. "AI" companies love to throw these phrases around, since it obviously creates hype and pumps up their valuation.
Human thinking is much more nuanced than this mechanical process. We rely on actually understanding the meaning of what the text represents. We use deduction, intuition and reasoning that involves semantic relationships between ideas. Our understanding of the world doesn't require "reinforcement learning" and being trained on all the text that's ever been written.
Of course, this isn't to say that machine learning methods can't be useful, or that we can't keep improving them to yield better results. But they are still methods that mimic human intelligence, and I think it's disingenuous to label them as intelligence itself.
Idk if I'm "feeling the AGI" if I'm being honest.
Also... telling that they choose to benchmark against CodeForces rather than SWE-bench.