I continue to be surprised by the talk of general artificial intelligence when it comes to LLMs. At their core, they are text predictors, and they're often pretty good at that. But at anything beyond that, they are decidedly unimpressive.
I use Copilot on a daily basis, which uses GPT-4 on the backend. It's wrong so often that I only really use it for boilerplate autocomplete, which I still have to review. I've had colleagues brag about the code ChatGPT produces, but when I ask how long the prompting took, the answer is usually around a day, and that's with them feeding it fragments of my code as prompts. Then I point out that it would probably take me less than an hour to do from scratch what took them and ChatGPT a full day.
So I just don't understand the hype. I'm using Copilot and GPT-4 through ChatGPT. What is everyone else using that gives them this idea that AGI is just around the corner? AI isn't even here. It's just advanced autocomplete. I can't understand where the disconnect is.
Look at the sample chain-of-thought for o1-preview under this blog post, for decoding "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz". At this point, I think the "fancy autocomplete" comparisons are getting a little untenable.
I’m not seeing anything convincing here. OpenAI says that its models are better at reasoning, and asserts that it tested this by comparing how o1 and “experts” do on some problems, but it doesn’t show the experts’ or o1’s responses to these questions, nor does it even deign to share what the problems are. And, crucially, it doesn’t specify whether writings on these subjects were part of the training data.
Call me a cynic here, but I just don’t find it too compelling to read about OpenAI being excited about how smart OpenAI’s smart AI is in a test designed by OpenAI and run by OpenAI.
Especially given this tech's well-documented history of using rigged demos, if OpenAI insists on doing and posting their own testing and absolutely nothing else, a little insight into their methodology should be treated as the bare fucking minimum.
It depends on how well you understand how the fancy autocomplete is working under the hood.
You could compare GPT-o1's chain of thought to something like IBM's Deep Blue chess-playing computer, which used game-tree search (minimax with alpha-beta pruning in Deep Blue's case; more modern engines such as AlphaGo use MCTS)... at the end of the day it's just using built-in knowledge (pre-training) to predict what move would most likely be made by a winning player. It's not unreasonable to characterize this as "fancy autocomplete".
In the case of an LLM, given that the model was trained with the singular goal of autocomplete (i.e. mimicking the training data), it seems highly appropriate to call that autocomplete, even though that obviously includes mimicking training data that came from a far more general intelligence than the LLM itself.
All GPT-o1 is adding beyond the base LLM's fancy autocomplete is an MCTS-like exploration of possible continuations. GPT-o1's ability to solve complex math problems is not much different from Deep Blue's ability to beat Garry Kasparov. Call it intelligent if you want, but better to do so with an understanding of what's really under the hood, and therefore what it can't do as well as what it can.
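To make that comparison concrete, here's a toy sketch of what "autocomplete plus search" looks like in general. To be clear: OpenAI hasn't published how o1 actually works, and the vocabulary and probabilities below are made up; the point is just that the search layer only re-ranks continuations that the underlying predictor proposes.

    # Toy "autocomplete + search": a stub next-token table stands in for an
    # LLM, and a beam search keeps the most promising partial continuations.
    import heapq

    # Hypothetical next-token "probabilities", keyed on the previous token.
    NEXT = {
        "the": {"cat": 0.5, "dog": 0.3, "answer": 0.2},
        "cat": {"sat": 0.6, "ran": 0.4},
        "dog": {"sat": 0.3, "ran": 0.7},
        "answer": {"is": 1.0},
        "sat": {"down": 1.0},
        "ran": {"away": 1.0},
        "is": {"42": 1.0},
        "down": {}, "away": {}, "42": {},
    }

    def beam_search(prompt, width=2, depth=3):
        """Keep the `width` highest-scoring continuations at each step."""
        beams = [(1.0, [prompt])]          # (probability so far, token sequence)
        for _ in range(depth):
            candidates = []
            for prob, seq in beams:
                options = NEXT.get(seq[-1], {})
                if not options:            # dead end: carry this beam forward as-is
                    candidates.append((prob, seq))
                    continue
                for tok, p in options.items():
                    candidates.append((prob * p, seq + [tok]))
            beams = heapq.nlargest(width, candidates, key=lambda b: b[0])
        return beams

    for prob, seq in beam_search("the"):
        print(f"{prob:.2f}  {' '.join(seq)}")

However you dress it up, everything the search can find still comes from the base predictor's notion of what a "winning" continuation looks like.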
Saying "it's just autocomplete" is not really saying anything meaningful since it doesn't specify the complexity of completion. When completion is a correct answer to the question that requires logical reasoning, for example, "just autocomplete" needs to be able to do exactly that if it is to complete anything outside of its training set.
It's just a shorthand way of referring to how transformer-based LLMs work. It should go without saying that there are hundreds of layers of hierarchical representation, induction heads at work, etc., under the hood. However, with all that understood (and hopefully not needing to be explicitly stated every time anyone wants to talk about LLMs in a technical forum), at the end of the day they are just doing autocomplete - trying to mimic the training sources.
The only caveat to "just autocomplete" (which again hopefully does not need to be repeated every time we discuss them), is that they are very powerful pattern matchers, so all that transformer machinery under the hood is being used to determine what (deep, abstract) training data patterns the input pattern best matches for predictive purposes - exactly what pattern(s) it is that should be completed/predicted.
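If a concrete picture helps, the no-frills version of that "autocomplete" is something like the toy bigram model below. The corpus is obviously made up, and a real LLM swaps the counting for a transformer over subword tokens, but the objective - predict the next token so as to mimic the training text - has the same shape.

    # Minimal "autocomplete": count which token follows which in a tiny
    # corpus, then always predict the most frequent follower.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat and the cat slept on the mat".split()

    # "Training": tally bigram counts.
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def complete(prompt, n=4):
        """Greedily append the most likely next token, n times."""
        tokens = prompt.split()
        for _ in range(n):
            options = follows.get(tokens[-1])
            if not options:
                break
            tokens.append(options.most_common(1)[0][0])
        return " ".join(tokens)

    print(complete("the cat"))   # -> "the cat sat on the cat" (stitched entirely from corpus patterns)

All the transformer machinery buys you is that the "pattern it has seen before" can be vastly deeper and more abstract than a bigram.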
That's the tough part to tell - do any such questions exist that haven't already been asked?
The reason ChatGPT works is its scale. To me, that makes me question how "smart" it is. Even the most idiotic idiot could be pretty decent if he had access to the entire works of mankind and infinite memory. It doesn't matter if his IQ is 50, because you ask him something and he's probably seen it before.
How confident are we this is not just the case with LLMs?
I'm highly confident that we haven't learnt everything that can be learnt about the world, and that human intelligence, curiosity and creativity are still being used to make new scientific discoveries, create things that have never been seen before, and master new skills.
I'm highly confident that the "adjacent possible" of what is achievable/discoverable today, leveraging what we already know, is constantly changing.
I'm highly confident that AGI will never reach superhuman levels of creativity and discovery if we model it only on artifacts representing what humans have done in the past, rather than modelling it on human brains and what we'll be capable of achieving in the future.
Of course there are such questions. When it comes to even simple puzzles, there are infinitely many permutations possible wrt how the pieces are arranged, for example - hell, you could generate such puzzles with a script. No amount of precanned training data can possibly cover all such combinations, meaning that the model has to learn how to apply the concepts that make a solution possible (which includes things such as causality or spatial reasoning).
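For instance, something like this (plaintext and seed chosen arbitrarily) will churn out more distinct substitution-cipher puzzles than any training set could contain verbatim - there are 26! (~4e26) possible keys - so the only way through is to actually apply the method:

    # Throwaway generator for puzzles no training set can exhaust.
    import random
    import string

    def make_cipher_puzzle(plaintext, seed=None):
        rng = random.Random(seed)
        letters = list(string.ascii_lowercase)
        shuffled = letters[:]
        rng.shuffle(shuffled)
        key = dict(zip(letters, shuffled))           # random substitution key
        ciphertext = "".join(key.get(c, c) for c in plaintext.lower())
        return ciphertext, key

    ciphertext, key = make_cipher_puzzle("step by step reasoning is required", seed=42)
    print("Decode this:", ciphertext)                # the key lets you grade the answer automatically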
Right, but typically LLMs are really poor at this. I can come up with some arbitrary system of equations for one to solve, and odds are it will be wrong. Maybe even very wrong.
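And it only takes a few lines to generate a fresh system with a known answer, so the response can be checked rather than eyeballed (a quick sketch; the coefficient ranges are arbitrary):

    # Generate an arbitrary 3x3 system of equations with a known solution.
    import numpy as np

    rng = np.random.default_rng()
    A = rng.integers(-9, 10, size=(3, 3))       # random integer coefficients
    x_true = rng.integers(-9, 10, size=3)       # pick the solution first...
    b = A @ x_true                              # ...so the right-hand side is consistent

    print("Solve for x, y, z:")
    for row, rhs in zip(A, b):
        print(f"  {row[0]}x + {row[1]}y + {row[2]}z = {rhs}")

    print("Expected solution:", x_true)         # ground truth to grade the LLM against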
That is more indicative of the quality of their reasoning than of their ability to reason in principle, though. And maybe even the quality of their reasoning specifically in this domain - e.g. it's no secret that most major models are notoriously bad at tasks involving things like counting letters, but we also know that if you specifically train a model to do that, its performance does in fact improve drastically.
On the whole, I think it shouldn't be surprising that even top-of-the-line LLMs today can't reason as well as a human - they aren't anywhere near as complex as our brains. But if it is a question of quality rather than a fundamental inability, then larger models and better NN designs should be able to gradually push the envelope.
Well, tons of ways. I can't imagine what an "autocomplete only" human would look like, but it'd be pretty dire - maybe like an idiot savant with a brain injury who could recite whole books given the opening sentence, but never learn anything new.
No, it doesn't. You can read more in the discussion from when that was first posted to Hacker News. If I recall and understand correctly, they're just using the output of sublayers as training data for the outermost layer. So in other words, they're faking it and hiding that behind layers of complexity.
The other day, I asked Copilot to verify a unit conversion for me. It gave an answer different from mine. Upon review, I had the right number. Copilot had even written code that would actually give the right answer, but its example of using that code got the arithmetic wrong. It refused to accept my input that the calculation was wrong.
So not only did it not understand what I was asking and communicating to it, it didn't even understand its own output! This is not reasoning at any level. This happens all the time with these LLMs. And it's no surprise, really. They are fancy statistical copycats.
From an intelligence and reasoning perspective, it's all smoke and mirrors. It also clearly has no relation to biological intelligent thinking. A primate or cetacean brain doesn't take billions of dollars and enormous amounts of energy to train on terabytes of data. While it's fine that AI might be artificial and not an analog of biological intelligence, these LLMs bear no resemblance to anything remotely close to intelligence. We tell students all the time to "stop guessing". That's what I want to yell at these LLMs all the time.
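For what it's worth, the sanity check here is a couple of lines of code - the units below are stand-ins, not the actual conversion I was doing - which is exactly why arguing with a chatbot about arithmetic is so absurd:

    # Hypothetical example of checking a unit conversion directly.
    PSI_PER_KPA = 0.145038          # approximate conversion factor

    def kpa_to_psi(kpa):
        return kpa * PSI_PER_KPA

    print(kpa_to_psi(250))          # ~36.26 psi: run the code, don't trust the model's worked example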
Dude, it's not the LLM that does the reasoning. Rather, it's the layers and layers of scaffolding around the LLM that simulate reasoning.
The moment 'tooling' became a thing for LLMs, it reminded me of the 'rules' for expert systems that caused one of the AI winters. The number of 'tools' you need to solve real use cases will become untenable soon enough.
Well, I agree that the part that does the reasoning isn't an LLM in the naive form.
But that "scaffolding" seems to be an integral part of the neural net that has been built. It's not some Python for-loop that has been built on top of the neural network to brute force the search pattern.
If that part isn't part of the LLM, then o1 isn't really an LLM anymore, but a new kind of model. One that can do reasoning.
And if we choose to call it an LLM, well, then now LLMs can also do reasoning intrinsically.
Reasoning, just like intelligence (of which it is a part), isn't an all-or-nothing capability. o1 can now reason better than before (in a way that is more useful in some contexts than others), but it's not like a more basic LLM can't reason at all (i.e. generate an output that looks like reasoning - copy reasoning present in the training set), or that o1's reasoning is human level.
From the benchmarks, it seems like o1-style reasoning enhancement works best in mathematical or scientific domains, where the domain is self-consistent and axiom-driven, such that combining different sources for each step works. It might also be expected to help in strict rule-based logical domains such as puzzles and games (it wouldn't be surprising to see it do well as a component of a submission to Chollet's ARC Prize).
o1 has moved "reasoning" from training time to something that partly happens at inference time.
I'm thinking of this difference as analogous to the difference between my (as a human) first intuition (or memory) about a problem and what I can achieve by carefully thinking about it for a while, where I can gradually build much more powerful arguments, verify whether they work, and reject the parts that don't.
If you're familiar with chess terminology, it's moving from a model that can just "know" what the best move is to one that combines that with the ability to "calculate": exploring each of the most promising moves several moves deep.
Consider Magnus Carlsen. If all he did was play the first move that came to his mind, he could still beat 99% of humanity at chess. But to beat 2700+ rated GMs, he needs to combine that with "calculation".
Not only that, but the skill of calculating must itself be trained - not just calculating with speed and accuracy, but knowing which parts of the search tree will be useful to analyze.
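To make the intuition-vs-calculation split concrete, here's a toy sketch - my own illustration using one-pile Nim rather than chess, and nothing to do with how o1 is actually implemented. The "intuition" is a crude prior over moves; the "calculation" searches only the most promising of them, a few plies deep:

    # Nim: one pile, each player removes 1-3 stones, taking the last stone wins.

    def prior(moves):
        """Fake intuition: prefer taking more stones (deliberately crude)."""
        return sorted(moves, reverse=True)

    def calculate(stones, depth, top_k=2):
        """Return (score, move) for the player to move: +1 forced win found
        within `depth` plies, -1 forced loss, 0 unknown (ran out of depth)."""
        if stones == 0:
            return -1, None                 # opponent took the last stone: we lost
        if depth == 0:
            return 0, None                  # out of thinking time
        best_score, best_move = -2, None
        moves = [m for m in (1, 2, 3) if m <= stones]
        for move in prior(moves)[:top_k]:   # only calculate the most promising moves
            opp_score, _ = calculate(stones - move, depth - 1, top_k)
            score = -opp_score              # good for the opponent is bad for us
            if score > best_score:
                best_score, best_move = score, move
        return best_score, best_move

    print(calculate(6, depth=1))   # (0, 3): can't see far enough, so it plays the prior's favourite
    print(calculate(6, depth=4))   # (1, 2): calculation finds the move that leaves a losing pile of 4

The depth of the search and the choice of which branches to bother with are both where the extra skill lives - same as the point about knowing which parts of the tree are worth analyzing.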
o1 is certainly optimized for STEM problems, but not necessarily only for using strict rule-based logic. In fact, even most hard STEM problems need more than the ability to perform deductive logic to solve, just like chess does. They require strategic thinking and intuition about which solution paths are likely to be fruitful. (Especially if you go beyond problems that can be solved by software such as WolframAlpha.)
I think the main reason STEM problems were used for training is not so much that they're solved using strict rule-based strategies, but rather that a large number of such problems exist that have a single correct answer.
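And the reason a single correct answer matters so much is that it gives you an automatic grader. The sketch below is a generic rejection-sampling illustration - textbook stuff, not inside knowledge of OpenAI's pipeline: sample candidate solutions, keep only the ones whose final answer checks out, and you have verified worked solutions with no expert in the loop.

    # Toy rejection sampling against verifiable answers.
    import random

    problems = [                       # problems with a single checkable answer
        ("17 * 23", 391),
        ("144 / 12", 12),
        ("2 ** 10", 1024),
    ]

    def sample_candidate(question):
        """Stand-in for a model: usually right, sometimes off by one."""
        answer = eval(question)        # safe here: the questions are our own literals
        return answer + random.choice([0, 0, 0, 1, -1])

    kept = []
    for question, truth in problems:
        for _ in range(8):             # several attempts per problem
            candidate = sample_candidate(question)
            if candidate == truth:     # the automatic check - no expert needed
                kept.append((question, candidate))
                break

    print(f"Verified solutions kept: {len(kept)} / {len(problems)}")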
Here now, you just need a few more ice-cold glasses of the Kool-Aid. Drink up!
LLMs are not on the path to AGI. They’re a really cool parlor trick and will be powerful tools for lots of tasks, but won’t be sci-fi cool.
Copilot is useful and has definitely sped up coding, but like you said, only in a boilerplate sort of way, and I need to clean up almost everything it writes.
LLMs let the massively stupid and incompetent produce something that on the surface looks like a useful output. Most massively stupid incompetent people don't know they are that. You can work out the rest.