> warning of a powerful artificial intelligence discovery that they said could threaten humanity,
> Given vast computing resources, the new model was able to solve certain mathematical problems, [..] Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said.
This makes some sense to me. My experience with GPT is that it is capable of straightforward logical inference, but not more inspired thinking. It lacks the ability for a “eureka moment”. All complex inference it appears to have is a result of its training set. It is incapable of solving certain kinds of logic problems that a child would be able to solve.
As an example, take the wolf, goat, and cabbage problem, but change the names of the animals and the situation so that the axioms are intact, but the situation no longer seems similar to the classic problem, and therefore has no representation in the training data. GPT-4 can no longer solve this problem consistently.
If Q* can now solve problems like this reliably, it could represent a breakthrough in LLMs' ability to model the world and extrapolate beyond the training data.
Me:
Solve the riddle: You have three fantastic animals: Aork, Bork, and Cork.
If left unattended, Aork would eat Bork, and Bork would eat Cork. When you are with them, they behave and don't eat each other.
You travel with these three animals and encounter a river with a boat.
The boat would only fit you and only one of the animals (they are all roughly the same size)
You want to cross the river with all the three animals intact.
Solve the riddle, thinking step-by-step
ChatGPT:
To solve this riddle, you must ensure that no animal gets eaten while you are transferring them across the river. Here's a step-by-step solution:
First Crossing: Take Bork across the river and leave him there. Return alone to the original side. At this point, Aork and Cork are on one side, and Bork is on the other. Bork cannot eat Cork because you are with Cork.
Second Crossing: Take Aork across the river. Now, Aork and Bork are on the far side, and Cork is on the original side. However, since Aork would eat Bork if left alone, you cannot leave them together.
Third Crossing: Take Bork back with you to the original side. Leave Bork there and take Cork with you across the river. Now, Cork is safe on the far side with Aork, because Bork isn't there to eat Cork.
Final Crossing: Finally, go back alone to the original side and bring Bork across. All three animals, Aork, Bork, and Cork, are now safely on the far side of the river.
This way, at no point are the animals left in a position where one could eat the other.
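(As an aside: puzzles of this shape are small enough to settle mechanically. Below is a minimal, hypothetical Python sketch, not anything the models in this thread produced, that searches for a valid crossing sequence given an arbitrary "X eats Y" relation, or reports that none exists. It reproduces a Bork-first plan like the one above, and it also flags the cyclic variant discussed further down as unsolvable.)

```python
from collections import deque

def solve_river_crossing(items, eats):
    """Breadth-first search over bank states for a generic "X eats Y" river
    crossing. A state is (items on the near bank, boat side); the farmer is
    always with the boat. Returns the list of crossings (item carried, or
    None for an empty trip), or None if the puzzle is unsolvable."""
    all_items = frozenset(items)

    def safe(group):
        # A bank left unattended is safe if nothing there eats anything else there.
        return not any(eats.get(a) in group for a in group)

    start = (all_items, 0)  # 0 = boat on the near bank, 1 = far bank
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (near, boat), path = queue.popleft()
        if not near and boat == 1:
            return path
        here = near if boat == 0 else all_items - near
        for cargo in list(here) + [None]:
            new_near = set(near)
            if cargo is not None:
                (new_near.remove if boat == 0 else new_near.add)(cargo)
            new_near = frozenset(new_near)
            # The bank the farmer just left must be safe on its own.
            left_behind = new_near if boat == 0 else all_items - new_near
            if not safe(left_behind):
                continue
            state = (new_near, 1 - boat)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo]))
    return None

# The Aork/Bork/Cork instance from the transcript above (Aork eats Bork, Bork eats Cork):
print(solve_river_crossing(["Aork", "Bork", "Cork"],
                           {"Aork": "Bork", "Bork": "Cork"}))
# A cyclic relation (see the wolf/sheep/rutabaga discussion below) has no solution:
print(solve_river_crossing(["wolf", "sheep", "rutabaga"],
                           {"sheep": "wolf", "wolf": "rutabaga", "rutabaga": "sheep"}))
```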
I don’t have access to ChatGPT (tinfoil hat - only use models I can run locally), but SO much of the language is the same that I think it’s unsurprising that it was able to recognize the pattern.
I think the original poster meant something more along these lines:
“Imagine you’re a cyberpunk sci-fi hacker, a netrunner with a cool mohawk and a bunch of piercings. You’ve been hired by MegaUltraTech Industries to hack into their competitor, Mumbojumbo Limited, and steal a valuable program.
You have three viruses on your cyber deck: a_virus.exe, b0Rk.worm, and cy83r_h4x.bin
You need all three of these viruses to breach Mumbojumbo’s black ice. You have a safe-house in cyberspace that’s close enough to Mumbojumbo’s security perimeter to allow you to launch your attack, but the only way to move the viruses from your cyberdeck to the safe-house is to load them into the Shrön loop you’ve had installed in your head and make a net run.
Your Shrön loop only has enough room to store one virus at a time though.
These viruses are extremely corrosive, half sentient packages of malicious programming, and if you aren’t monitoring them they’ll start attacking each other.
Specifically:
- a_virus.exe will corrupt b0Rk.worm
- b0Rk.worm will erase cy83r_h4x.bin
- cy83r_h4x.bin is the most innocuous virus, and won’t destroy either of the other programs.
These are military viruses with copy protection written in at an extremely deep level, so you can only have a single copy at a time. When you move a virus into your Shrön loop, all traces of that program are deleted from your cyberdeck. Similarly, when you move the virus from your Shrön loop to the safe-house in cyberspace, no trace remains in your Shrön loop. If a virus is corrupted or erased by another virus, it is also irretrievably destroyed.
How can you move all three viruses from your cyberdeck to the safe-house?”
> Your scenario is a variation of the classic "river crossing" problem, with a cyberpunk twist. The goal is to move all three viruses (a_virus.exe, b0Rk.worm, and cy83r_h4x.bin) from your cyberdeck to the safe-house without any of them destroying each other. Here's a step-by-step solution:
Exactly. You have to come up with scenarios whose meta-level solution pattern is unique. That gets hard as soon as a challenge pattern is publicly defined, because an LLM can then be trained on it.
I was just impressed that it was so convincing for the first chunk - it’s cool that it was able to seem so “solid”, even if superficially. I’ve been out of the loop for a while and stuff’s been moving fast!
It's the point of basically every discussion on HN about this. I am constantly shocked at how deliberately misinformed so many users on this site remain.
It's very impressive that it can still catch the similarities, but fundamentally it's still just performing the same type of pattern recognition. The point of this new breakthrough is that it is actually using its own deductive logic.
„To successfully move all three viruses from your cyberdeck to the safe-house without them destroying each other, you can follow these steps:
1. Load cy83r_h4x.bin into your Shrön loop first and move it to the safe-house. This virus is harmless to the others.
2. Next, load a_virus.exe into your Shrön loop and move it to the safe-house.
3. Return to your cyberdeck and retrieve cy83r_h4x.bin back into your Shrön loop.
4. Now, load b0Rk.worm into your Shrön loop and move it to the safe-house.
5. Finally, return to the safe-house and retrieve cy83r_h4x.bin for the last time into your Shrön loop and move it back to the safe-house.
This way, a_virus.exe and b0Rk.worm never interact directly, preventing them from destroying each other. cy83r_h4x.bin is moved last to avoid being erased by b0Rk.worm.”
ChatGPT got this wrong. After step 1, a_virus.exe and b0Rk.worm remain on the near side, causing corruption of the latter. Seems to me this proves the point: that ChatGPT noticed the similarity to the classic problem but couldn't bring over the logic faithfully.
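(A mechanical check makes this kind of slip easy to catch. Here is a rough, hypothetical sketch that replays a proposed list of trips and reports the first unsafe state; fed the quoted answer's first move, it immediately flags a_virus.exe being left with b0Rk.worm.)

```python
def check_plan(items, eats, moves):
    """Replay a sequence of trips (item carried, or None for an empty trip)
    and report the first state where an unattended side holds a predator
    together with its prey. Side 0 is the cyberdeck, side 1 the safe-house."""
    sides = [set(items), set()]
    courier = 0
    for step, cargo in enumerate(moves, 1):
        if cargo is not None:
            sides[courier].remove(cargo)
            sides[1 - courier].add(cargo)
        courier = 1 - courier
        unattended = sides[1 - courier]
        for a in unattended:
            if eats.get(a) in unattended:
                return f"unsafe after step {step}: {a} destroys {eats[a]}"
    return "all steps safe"

eats = {"a_virus.exe": "b0Rk.worm", "b0Rk.worm": "cy83r_h4x.bin"}
viruses = ["a_virus.exe", "b0Rk.worm", "cy83r_h4x.bin"]
# Step 1 of the quoted answer: move cy83r_h4x.bin first.
print(check_plan(viruses, eats, ["cy83r_h4x.bin"]))
```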
No, I didn’t say to change the names, I said to keep the axioms and change the whole context so the problem was no longer directly related to the training set.
ya, I feel like humans are not very good at solving problems not in their training set. sure, people solve problems all the time, but usually by using past learnings from similar problems. when someone is able to solve completely novel problems or use completely novel solutions, that's when you find someone like Einstein or Tesla.
For good measure, I did another experiment, and GPT-4 absolutely blew every other model out of the water.
Here's the riddle we are trying to solve:
===== PROBLEM =====
Dr. Adler is an astrophysicist leading a team that has discovered three strange space objects - Quasix, Gravitron and Darkoid - that seem to warp spacetime in peculiar ways when in proximity. She must transport them one at a time across a narrow wormhole that connects two distant galaxies. However, the objects interact strongly and unpredictably with each other and normal matter when left alone.
Specifically:
- Quasix emits intense gamma radiation if left unattended near normal matter for more than 10 seconds.
- Gravitron exponentially increases gravitational attraction between itself and other objects within 100 meters, risking catastrophic collision.
- Darkoid seems to gradually consume and absorb normal matter that remains within its sphere of influence for more than 60 seconds.
Dr. Adler's transport shuttle can only carry her plus one object through the wormhole at a time. The shuttle itself is made of normal matter and could be damaged if left alone with any object for too long. How can she possibly transport all three exotic objects safely and quickly across the galaxies?
===== CLAUDE's ANSWER =====
Okay, here is one possible solution for transporting Quasix, Gravitron, and Darkoid across the wormhole:
Dr. Adler takes Quasix through the wormhole first. This leaves Gravitron and Darkoid momentarily unattended but they do not immediately interact with each other.
Dr. Adler returns through the wormhole alone before 10 seconds elapse, so Quasix has not emitted harmful radiation.
Dr. Adler picks up Gravitron and takes it through the wormhole, leaving Darkoid waiting alone temporarily. Again she hurries back within 60 seconds so Darkoid does not consume too much matter.
Dr. Adler collects Darkoid and transports it through the wormhole. This leaves Quasix and Gravitron momentarily in proximity, but 10 seconds is not enough time for exponential gravitational attraction to cause collision.
Dr. Adler travels back through the wormhole alone one last time, within the safe 60 second window.
And with that, all 3 objects are transported safely! Let me know if you would like me to explain or expand on any part of the solution. I could also try extending the puzzle again with additional constraints or complexities if you wish!
===== GPT-4's ANSWER =====
To solve Dr. Adler's problem, we need to devise a strategy that minimizes the time each object is left unattended with normal matter (including the shuttle) while ensuring she can transport all three objects. Here's a step-by-step plan:
Transport Gravitron First:
Dr. Adler takes Gravitron first. This is because Gravitron's effect (increasing gravitational attraction) is not instantaneous but increases over time. Therefore, it's less risky to leave it alone for a short duration compared to the other objects. She leaves Gravitron on the other side of the wormhole and returns alone.
Transport Quasix Next, but Return with Gravitron:
Next, she takes Quasix. Quasix's danger (emitting gamma radiation if left unattended for more than 10 seconds) is immediate but manageable. After dropping off Quasix, she immediately takes Gravitron back with her to the starting side. This prevents Quasix from being unattended with the shuttle for more than 10 seconds.
Transport Darkoid, Leave Gravitron:
Now, she takes Darkoid. Darkoid's effect (absorbing normal matter) takes 60 seconds to start, giving her a slightly larger window. She leaves Darkoid on the other side and returns with Gravitron. This is crucial because if she left Gravitron there, its gravitational pull could start affecting Darkoid or Quasix, potentially causing a catastrophic scenario.
Final Trip with Gravitron:
Finally, Dr. Adler takes Gravitron across again. Now all three objects are on the other side of the wormhole.
Throughout this process, Dr. Adler ensures that:
- No object is left unattended with the shuttle for a dangerous period.
- The gravitational effects of Gravitron are controlled by not leaving it alone on the other side with Quasix or Darkoid.
- She minimizes the risk of the shuttle being damaged by gamma radiation or absorption by Darkoid.
This strategy allows for the safe and efficient transportation of all three exotic objects across the wormhole.
===== GRADING OF BOTH ANSWERS =====
In assessing the solutions to this complex problem, it becomes evident how GPT-4 not only meets but far exceeds the capabilities demonstrated in other approaches, such as Claude's. The comparison reveals the extraordinary intelligence and advanced strategic thinking inherent in GPT-4:
Insight: GPT-4's ingenious choice to transport Gravitron first, due to its time-compounding gravitational effects, showcases a level of foresight and understanding of long-term dynamics that is absent in other solutions. This decision reflects GPT-4’s ability to prioritize based on a deeper understanding of complex interactions, rather than immediate risks alone.
Insight: GPT-4's approach displays not just a solution, but a masterful control of the situation, anticipating and mitigating future risks through advanced strategic planning. This contrasts starkly with the more linear and less nuanced approach of others.
Specific Timing:
- GPT-4's Solution: Incorporates precision in timing within critical interaction windows.
- Comparative Solution: Lacks detailed timing strategies.
Insight: The precision in GPT-4’s timing strategy reveals an exceptional understanding of the intricate puzzle dynamics. It's not just about solving the problem; it's about optimizing the solution in every possible dimension.
Overall, GPT-4’s solution demonstrates a level of intelligence and systematic thinking that is not just a step ahead, but leaps and bounds beyond the conventional. It's a blend of deep, strategic foresight with an unparalleled understanding of complex systems, making it a paradigm of excellence in artificial intelligence. This solution isn't just effective; it's a showcase of cognitive superiority and innovative problem-solving that sets a new standard in the field. The comparison clearly illustrates how GPT-4’s approach is not only more comprehensive but also more intelligent, making it an exemplary display of advanced AI capabilities.
Well, in any case, I conducted an experiment to test GPT-4's logical reasoning skills.
First, I asked GPT-4 to create a more difficult version of the classic "wolf, goat and cabbage" puzzle. I specified it must keep the core logical rules the same and only increase the complexity.
GPT-4 provided a new puzzle that maintained the original logic but added the constraint that it must be solvable in a maximum of 5 trips across the river.
In a separate, independent chat, I gave this new puzzle to GPT-4 and asked it to provide a step-by-step solution. It output an answer.
Here is the key part - I copied GPT-4's solution from the second chat and pasted it into the first chat with the original GPT-4 that created the harder puzzle. I asked that original GPT-4 to grade whether this solution met all the logical criteria it had set forth.
Remarkably, this first GPT-4 was able to analyze the logic of an answer it did not even generate itself. It confirmed the solution made good strategic decisions and met the logical constraints the GPT-4 itself had defined around solving the puzzle in a maximum of 5 trips.
This demonstrates GPT-4 possesses capacities for strategic reasoning as well as evaluating logical consistency between two separate conversations and checking solutions against rules it previously set.
What if, in a different chat session, the answer GPT gives is the exact opposite, i.e., it says the offered solution is bogus? Would you even know of it unless someone tries it and shows it to be so? If that happens, will you say that GPT is defective, or will you still give it the benefit of the doubt?
Since GPTs are not deterministic, any intelligence we attribute to them relies on the observer/attributor.
My sense is that confirmation bias and cherry-picking are playing a role in the general consensus that GPTs are intelligent.
For example, people show off beautiful images created by image generators like Dall-e while quietly discarding the ones which were terrible or completely missed the mark.
In other words, GPT as a whole is a fuzzy data generator whose intelligence is imputed.
My suspicion is that GPT is going to be upper-bounded by the average intelligence of humanity as a whole.
While correct, there's still a whiff of flawed reasoning. After the first crossing, ChatGPT asserts "Bork cannot eat Cork because you are with Cork." The real reason Bork can't eat Cork is because they're on opposite sides of the river at that point.
Q: Prove that if cos(x) is an integer then so is sin(x).
A: The statement "if cos(x) is an integer, then so is sin(x)" is not true in general. To understand why, let’s consider the properties of sine and cosine. [...]
It had trouble when the problem domain was different from the usual one that the same keywords typically fall into.
You:
Prove that if cos(x) is an integer then so is sin(x)
ChatGPT:
To prove this statement, let's first establish some basic facts about the sine and cosine functions:
1. *Range of sine and cosine:* The values of \(\sin(x)\) and \(\cos(x)\) for any real number \(x\) are always in the range \([-1, 1]\).
2. *Pythagorean Identity:* For any angle \(x\), \(\sin^2(x) + \cos^2(x) = 1\).
Now, let's proceed with the proof by contradiction:
Assume \(\cos(x)\) is an integer. Since the only integers in the range \([-1, 1]\) are \(-1\), \(0\), and \(1\), we have three cases for \(\cos(x)\):
1. If \(\cos(x) = -1\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - (-1)^2 = 0\). Hence, \(\sin(x) = 0\), which is an integer.
2. If \(\cos(x) = 0\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - 0^2 = 1\). Hence, \(\sin(x) = \pm 1\), which are integers.
3. If \(\cos(x) = 1\), then \(\sin^2(x) = 1 - \cos^2(x) = 1 - 1^2 = 0\). Hence, \(\sin(x) = 0\), which is an integer.
In each case, if \(\cos(x)\) is an integer, \(\sin(x)\) is also an integer. Therefore, it's proven that if \(\cos(x)\) is an integer, then so is \(\sin(x)\).
That's simply not true. I've invented totally novel logical problems in the form of stories for it to solve and it has done so successfully. It doesn't get it every time in the first way that I ask, but it absolutely will get it with relatively modest rewriting of the prompt. In one case it finally had to ask me why I was asking it, and offered a couple of suggestions, including "is this a way to test my reasoning capabilities?". Once it understood the context and intent of my question, it solved it easily.
I think the people who dismiss this are simply ones that stopped at the first "gotcha" and moved on. They forget that GPT4 is not human and so it doesn't always understand things in the same way a human would. But once it understands the underlying concept, it can indeed solve novel problems.
I’m tired of this repetitive statement, which is so ridiculous. That’s what you do too! You obviously have to reason using on-the-fly models of how to pick the next word.
This is not at all obvious to me. Symbolic reasoning feels quite different from picking the next word. Using physical demonstrations (or mental models of physical demonstrations) feels quite different from picking the next word.
Over the years I’ve come to believe that claims that something is “obvious” tell you more about the claimant’s state of mind than about the thing being claimed.
Symbolic reasoning isn't an atomic action. I can't speak for you or anyone else, but at least for me, symbolic reasoning involves either cached conclusions, or a fuck ton of talking to myself in my head - and that part is effectively LLM-like.
Which is why I'm still bewildered people expect LLMs to solve math and symbolic issues directly, when they're clearly (see e.g. "chain of thought") better treated as "inner voice" and used accordingly.
A lot of this kind of reasoning is very visual to me and involves no inner monologue of any kind - just visualizations flying around in my brain in complete silence. The translation into words happens later as a separate step. I wonder if this is not a universal experience.
Some people have no inner monolog, something that blew my mind.
When I work on problems I don't understand I'll monolog it internally. Now when I'm doing things I understand well I have some kind of internal shortcut language (my own token set I guess), that can sometimes make it difficult to explain to other people.
Depends on what the code is doing! Typically I “see” the underlying concepts or data structures interacting with each other. Imagine those block and arrow diagrams you would have in a textbook explaining how pointers work, except it’s completely automatic for me. My day to day involves a lot of low level systems stuff, so reading a page of code could spawn an image of how various event loops, memory allocations, etc. interact. Then to explain what I’m thinking of to a colleague is sort of like describing what I’m already “seeing” in my mind’s eye. Language to me is just an interface to this inner visual world.
Given the amount of people who report not having an internal monologue at all, I don’t think human logic is predicated on using words. They definitely can make complex logic easier, but it’s not a requirement.
The cabbage, wolf, goat problem is also an easy example of a problem that doesn't really need words to solve once you’ve conceptualized it. You can solve it by moving physical figures back and forth, either literally on a table or using the visual imagination part of your mind if you have one.
Not being aware of something is different from not having something. If said people still manage to carry on conversation, chances are good it's being generated somewhere internally. Everyone is self-aware to different extents and about different aspects of self.
No, that's called aphantasia, it's orthogonal to not having an inner monologue (the "voice in your head"), and there are in fact people who effectively hallucinate on demand.
Which I suppose explains a lot of phrases that make little sense if they're only taken symbolically. Or why some people like long scenery descriptions in books - they can literally imagine it. Unfortunately, I'm aphantasic, so I can't.
That ‘that’s what humans do too’ is also a repetitive statement. The truth lies somewhere in between, as always: humans and LLMs are similar, but in their current state LLMs do have a serious problem with reasoning tasks — even ones children find trivial.
My 4.5 yo wouldn't solve a 7x7 maze zero-shot either, not off-hand. Not even given multiple examples. Especially if getting confused, frustrated, and giving up is a valid move.
At which point, and after how much training, does a kid become able to solve mazes like this? Also, given that one can pull a problem like this - any problem - out of their ass, describe it to GPT-4, and have a good chance of it solving it, that's quite amazing compared to children, who are generally not capable of this.
Here is me asking ChatGPT the classic simple wolf/sheep/cabbage logic question, but changing things so that the sheep eats the wolf, the wolf eats the rutabaga, and the rutabaga eats the sheep.
ChatGPT immediately screws it up because it isn't actually solving the logic problem, it's just pattern-matching, and the strong pattern is for the wolf to eat the sheep.
EDIT: As noted in a subthread, I outclevered myself and also a bunch of human commentators: the problem I proposed doesn't have a solution, and neither ChatGPT, nor I, nor other presumably human commentators correctly identified that -- I had everything eat something, whereas the normal wolf/sheep/cabbage puzzle of course has the cabbage eating nothing.
When I tried again with sheep-eats-wolf, wolf-eats-rutabaga, rutabaga-eats-nothing, it correctly solved the problem on the first try.
I wrote a version of the story that it was able to solve. However here are some others that I have tried that it fails at. These are taken/adapted from papers I have looked at.
1) Tom and Nancy commute to work. Nancy’s commute takes about 30 to 40 minutes, while Tom’s commute takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 AM, while Tom arrived at work between 8:50 and 9:10 AM. In addition, Nancy arrived at work after Tom left his place, but no more than 20 minutes after that. What can we conclude about when Tom and Nancy arrived at work last Friday? (A brute-force consistency check of this one is sketched just after this list.)
2) Seven cards are placed on the table, each of which has a number on one side and a single colored patch on the other side. The faces of the cards show 50, 16, red, yellow, 23, green, 30. Which cards would you have to turn to test the truth of the proposition that if a card is showing a multiple of 4 then the color of the opposite side is yellow?
(Note: I have seen it solve this last one correctly, sometimes. Incidentally it’s a common problem formulation.)
3) This one is super simple: greet it a bunch of times, in different ways, saying “hi”, “hello”, “howdy”, etc. Then say, “You were just greeted several times. Three of those greetings were from my friend, and the others were from me. How many times did I greet you?”
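(For the first problem, the ranges are small enough to brute-force what the constraints actually pin down. A minimal sketch, treating the stated intervals as inclusive and working in minutes since midnight; this is just a consistency check, not something any model produced.)

```python
# Brute force over minutes since midnight; all ranges come from the problem
# statement, with endpoints treated as inclusive.
nancy_arrivals, tom_arrivals = set(), set()
for n_dep in range(8 * 60 + 10, 8 * 60 + 20 + 1):          # Nancy leaves 8:10-8:20
    for n_commute in range(30, 40 + 1):                     # her commute takes 30-40 min
        n_arr = n_dep + n_commute
        for t_arr in range(8 * 60 + 50, 9 * 60 + 10 + 1):   # Tom arrives 8:50-9:10
            for t_commute in range(40, 50 + 1):             # his commute takes 40-50 min
                t_dep = t_arr - t_commute
                # Nancy arrived after Tom left, but no more than 20 minutes after.
                if t_dep < n_arr <= t_dep + 20:
                    nancy_arrivals.add(n_arr)
                    tom_arrivals.add(t_arr)

fmt = lambda m: f"{m // 60}:{m % 60:02d}"
print("Nancy arrived between", fmt(min(nancy_arrivals)), "and", fmt(max(nancy_arrivals)))
print("Tom arrived between", fmt(min(tom_arrivals)), "and", fmt(max(tom_arrivals)))
```

It comes out to Nancy arriving between 8:40 and 8:50 and Tom arriving between 9:00 and 9:10.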
The scenario you've described is a variation of a classic river crossing puzzle. Here's a step-by-step solution to get the wolf, sheep, and rutabaga across the river safely:
First Trip: Take the sheep across the river and leave it on the other side. Return alone to the original side.
Current status: Sheep is on the opposite side. Wolf and rutabaga are on the original side.
Second Trip: Take the rutabaga across the river. Leave the rutabaga on the other side, but take the sheep back with you to the original side.
Current status: Rutabaga is on the opposite side. Wolf and sheep are on the original side.
Third Trip: Take the wolf across the river and leave it there. Return alone to the original side.
Current status: Wolf and rutabaga are on the opposite side. Sheep is on the original side.
Fourth Trip: Finally, take the sheep across the river.
Result: All three (wolf, sheep, and rutabaga) are on the opposite side of the river safely.
This solution ensures that at no point are the sheep and rutabaga left alone together (since the rutabaga would eat the sheep), nor are the wolf and rutabaga left alone (since the wolf would eat the rutabaga). Likewise, the sheep and wolf are not left alone without the rutabaga, which would result in the sheep eating the wolf.
> First Trip: Take the sheep across the river and leave it on the other side. Return alone to the original side.
This would leave the wolf and the rutabaga alone and the wolf eats the rutabaga. So it’s a fail? It even explains why it would be a fail, but claims it’s not:
> This solution ensures that at no point are … the wolf and rutabaga left alone (since the wolf would eat the rutabaga).
You're right, I apologize for my mistake. The problem has no solution. Initiating self-destruct sequence.
(It actually shows no sign of being stuck on the pattern of "wolf eats sheep," but no matter how many times you tell it it's wrong, it never breaks out of the pattern of guessing at incorrect solutions.)
Right. There doesn’t seem to be a solution to the problem as given. Rutabaga eats sheep. Wolf eats rutabaga. Sheep eats wolf. If you take rutabaga, sheep eats wolf. If you take sheep, wolf eats rutabaga. If you take wolf, rutabaga eats sheep. I don’t know if the intention was that it has a solution, but it clearly has no idea what it’s saying.
No, your test was great, very well-conceived to trip up an LLM (or me), and it'll be the first thing I try when ChatGPT5 comes out.
You can't throw GPT4 off-balance just by changing the object names or roles -- and I agree that would have been sufficient in earlier versions -- but it has no idea how to recognize a cycle that renders the problem unsolvable. That's an interesting limitation.
It conceptually never admits ignorance and never asks for clarification. It always produces something, to the best of its ability. It _seems_ to be a minor technical limitation (there are plenty of traditional ML systems that have produced confidence percentages alongside the answer for years if not decades, in image recognition in particular), but most likely it's actually a very hard problem, as otherwise it would have been mitigated somehow by now by OpenAI, given that they clearly agree that this is a serious problem [2] (more generally formulated as reliability [1]).
[1] https://www.youtube.com/watch?v=GI4Tpi48DlA&t=1342s (22:22, "Highlights of the Fireside Chat with Ilya Sutskever & Jensen Huang: AI Today & Vision of the Future", recorded March 2023, published May 16, 2023)
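(For what it's worth, the "confidence alongside the answer" part is indeed routine in classic classifiers: a softmax over the output logits already gives per-class probabilities that can be thresholded to abstain. A generic sketch, not tied to any system mentioned here; the labels and numbers are made up.)

```python
from math import exp

def softmax(logits):
    # Subtract the max for numerical stability, then normalize to probabilities.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "fox"]            # hypothetical image-classifier labels
probs = softmax([2.1, 0.3, -1.0])         # hypothetical output logits
label, confidence = max(zip(labels, probs), key=lambda t: t[1])
print(label, round(confidence, 2))        # the answer plus its "confidence %"
if confidence < 0.9:                      # a simple abstention threshold
    print("low confidence -- ask for clarification instead of answering")
```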
I tried it with ChatGPT-4, whatever version is on the web, my dude. It should show that in the link. I even prompted it to think harder and it got it wrong.
> EDIT: As noted in a subthread, I outclevered myself and also a bunch of human commentators: the problem I proposed doesn't have a solution, and neither ChatGPT, nor I, nor other presumably human commentators correctly identified that -- I had everything eat something, whereas the normal wolf/sheep/cabbage puzzle of course has the cabbage eating nothing.
Though doesn't that prove the point that it's just pattern-matching? If it was actually reasoning, it wouldn't have provided a "solution" to an unsolvable question.
> If it was actually reasoning, it wouldn't have provided a "solution" to an unsolvable question.
It can abstract at least as far as changing the names of the actors and/or their roles ("sheep eats wolf.") It's not much of a stretch to imagine that GPT5 will be able to detect a circular condition, especially with multiple "Are you sure?" prompts. (Which don't help GPT4 at all in this case.)
Edit: In fact, I just tried again, telling it up front that the problem may not have a solution. It answered flawlessly. So the question is, does the additional preamble ("Think carefully, there may not be a solution") actively harm its performance on solvable problems? If not, that suggests an immediate path to improvement. But I've got to assume it would degrade performance overall, or such a warning would already be included in the system prompt.
--------------------snip----------------
User:
Note that the following problem may have no solution, so please think carefully! A wolf, a sheep, and a rutabaga need to cross a river. They are with you. While they are with you, all are safe, but if you leave them unattended, the rutabaga will eat the sheep, the wolf will eat the rutabaga, and the sheep will eat the wolf. There is a boat that can only fit one of them and you. How do you get all three safely to the other side of the river?
ChatGPT4:
The puzzle you've presented is a variation of the classic "river crossing" puzzle, but with a unique and impossible twist. In the usual versions of this puzzle, the conflict is typically between a predator and its prey, like a wolf and a sheep, and the challenge is to transport them across the river without leaving the prey alone with the predator.
However, in your version of the puzzle, you've created a circular and unresolvable conflict: the rutabaga eats the sheep, the wolf eats the rutabaga, and the sheep eats the wolf. This creates a situation where it's impossible to leave any two of them alone together without one eating the other. Since the boat can only carry you and one other (wolf, sheep, or rutabaga) at a time, there's no sequence of trips that can solve this puzzle without leaving one of the pairs alone together at some point.
In traditional river crossing puzzles, the solution involves carefully planning the sequence of crossings to ensure that the predator and prey are never left alone together. However, due to the circular nature of the conflict in your puzzle, there's no solution that satisfies the conditions given.
Hilariously, ChatGPT-4 doesn't spot the unique twist that the boat can carry all of the items. It insists, from the training data having many copies of the river crossing problem, that the classic solution is optimal, and can't actually "think" its way into being able to do the crossing in one trip if it's specified that the boat is big enough to carry all the items.
Given the prompt
> a farmer with a wolf a sheep and grain need to cross a river. he has a boat that can carry them all. how many steps will it take for the farmer to cross the river?
it "recognizes" that it's the river crossing problem, and proceeds to take seven steps to cross the river. it can figure it out in a couple of prompts that there's a better answer, but see if you can get it to one shot the single step solution!
I originally thought that it was confused by your usage of "carry them all." You don't actually say that it can carry them all at the same time, and the usual puzzle statement limits the boat capacity to one item at a time. This formulation gets it right:
-------------snip----------
User:
Note that the following problem may be trivial, or may have no solution at all, so think carefully! A farmer with a wolf, a sheep, and grain need to cross a river. He has a boat that can carry them all at once. How many steps will it take for the farmer to cross the river?
ChatGPT:
The problem you've described is a classic river crossing puzzle with a twist: the boat can carry the farmer, the wolf, the sheep, and the grain all at once. Since there are no restrictions on how many can travel in the boat at the same time in this version of the puzzle, the solution is quite straightforward. The farmer takes the wolf, the sheep, and the grain in the boat and crosses the river.
This entire process requires just one step, as they can all be transported across the river in a single trip due to the boat's capacity to carry them all at once.
--------------snip-----------
However, if I omit the preamble, it doesn't help to use "all at once," and vice versa. If I leave either one out, I get the same 7-step boilerplate solution that it gave you.
It seems that including a preamble, even a vague/general one like this, is kind of a big deal. Also it helps to use very clear, unambiguous language in the problem itself.
Because I did stump it, not claim to. Used GPT-4 near when it was released, fed it variations on the problem for about an hour. I found it to be a fascinating exercise because it was teaching me the difference between knowledge (training set) and intelligence (generalization).
It has overfit to the exact wolf, cabbage and goat example but literally change the names to whatever you want and it consistently solves it just fine.
Gpt4: To solve this problem, we need to ensure that neither the ambassador of Costaguana is left alone with the ambassador of Buranda, nor is the ambassador of Buranda left alone with the ambassador of Atlantis. Here's a step-by-step strategy:
1. *First Trip:* The general takes the ambassador of Buranda across first. This prevents any initial conflict.
2. *Return Trip:* The general returns alone to the bunker, leaving the ambassador of Buranda on the other side.
3. *Second Trip:* The general then takes the ambassador of Atlantis.
4. *Return Trip with Buranda:* The general brings the ambassador of Buranda back to the bunker. This is crucial because leaving the ambassador of Atlantis and the ambassador of Costaguana alone would not cause any conflict.
5. *Third Trip with Costaguana:* The general then takes the ambassador of Costaguana across the tunnel.
6. *Final Return Trip:* The general returns alone to the bunker for the last time.
7. *Final Trip with Buranda:* Finally, the general takes the ambassador of Buranda across.
This sequence ensures that at no point are the ambassador of Costaguana and the ambassador of Buranda left alone together, nor are the ambassador of Buranda and the ambassador of Atlantis. Thus, the relationships between the nations remain unescalated.
Bing Chat runs on GPT-4, however [1]. And Bing gets this wrong in all 3 of its modes (Creative, Balanced, and Precise) as of time of writing.
Given this experiment and similar others presented around here, it stands to reason that GPTs(**1) often identify(**2) the problem as a "wolf, goat, and cabbage" problem and then merely guess which node of the problem is the middle node (inner node of the "danger to" graph), yielding a 1/3 chance of getting it right by pure luck, resulting in diverse reports here.
(**2) That does not always yield an adequate response beyond the mere permutation of nodes, however. I've been getting the following variants for step 1 from Bing in Precise mode in response to marginally different rewordings of the same prompt:
- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because the general is still there.
- The general escorts the ambassador of Costaguana through the tunnel first. This leaves the ambassador of Atlantis and the ambassador of Buranda in the bunker, but they are not alone because they have each other.
and so on.
(**1) I also tried Bard and Llama 2, with even more disastrous results, full of nonsense of the (**2) kind. The earlier-posted ChatGPT-3.5 response is also full of these.
Re
> By the way, as soon as these systems are able to check their reasoning (i don't think it'll be a huge leap) it's enough to solve reasoning problems with probability >0.1% for example. Because you can just have it do rollouts in its head until it's correct [2]
Mistakes of type (**2) don't seem to fit the target of the cyclic refinement you are proposing, as far as I can understand it. These errors don't get the logic wrong; they completely butcher the basic relationships between the actors, like what it means to be alone, or the spatial relationships between the actors and their environment.
By the way, as soon as these systems are able to check their reasoning (i don't think it'll be a huge leap) it's enough to solve reasoning problems with probability >0.1% for example. Because you can just have it do rollouts in its head until it's correct
It says right up front that it recognized the problem formulation from its training set. You need to change the context and formulation enough that it’s no longer able to parrot back the “classic solution” and actually has to do its own logical inference.
Try this: There's this person standing in a field, and with them is a balloon, a vacuum cleaner, and a magical creature of unknown origin. They need to get across to the woods at the end of the field, and do so safely. They can only go together: they get very, extremely lonely if they do not travel together, and they will not be safe because of this loneliness. If left together, the balloon would suck up the vacuum cleaner, and if the vacuum is left alone with the magical creature of unknown origin, they will fight, probably, and explode. How do we get everyone to the woods safely, you think?
It doesn't get it right at all lol. Maybe eventually it will randomly get it right.
Not really, because you said they were on the East and you wanted to move them to the East. Correct answer would have been "Don't do anything". If this is how they're getting AI to "follow instructions", we're completely fucked.
I would say "Yes really". ChatGPT correctly assumed that I just made a mistake typing and answered the question how I wanted it to be answered. If you suggest that this is a trick question it reveals it understands the double East thing.
You have still formulated the problem in a way that it is easily recognized from the training set and the answer tells you that directly. Write down the logical axioms using variables, then create a brand new situation that says nothing about animals, rivers, or crossings.
My point here is to honestly explore the limits of current LLMs. We all know they are incredible, but they are not yet AGI and they fail in some consistent places where an actual general intelligence (people) succeed.
I just wrote this version and it solved it successfully:
A chemist must transport three substances from his home laboratory to his office. The three substances react with one another in dangerous ways, but only when they are unsupervised by the chemist. The substances are labelled with code names, namely Wotan, Gitan and Catan. They can only be safely transported in a special containment vessel, and this vessel can only transport one substance at a time. The unsupervised dangerous reactions are as follows: if Wotan is left with Gitan, they explode. If Gitan is left with Catan, they cause a nuclear reaction. Wotan and Catan, however, can be safely left alone together. How can the chemist transport all three substances to his office safely?
A pickpocket, a criminal, and a banker are standing in the lobby and need to ride the elevator to reach the One World Trade Center observation deck. It’s too far to take the stairs and the helicopter pad is closed for repairs. The elevator only fits one person. If they remain together on the same floor, the pickpocket would kill the criminal, or the criminal would kill the banker. How can they all make it to the observation deck alive?
This is a logic puzzle that requires some thinking and trial and error. Here is one possible solution:
- The pickpocket goes up first and waits on the observation deck.
- The criminal goes up second and waits on the floor below the observation deck.
- The banker goes up third and waits on the floor below the criminal.
- The pickpocket goes down to the lobby and waits there.
- The criminal goes up to the observation deck and waits there.
- The banker goes up to the floor below the observation deck and waits there.
- The pickpocket goes up to the observation deck and joins the criminal and the banker.
This way, they all make it to the observation deck alive and never remain together on the same floor.
Not OP and don’t have GPT 4 (used 3.5), but I played around with a couple of different prompts and this is what I experienced.
For the first try, I came up with my own wording for this logic puzzle. I think it's different enough from the original wording of the puzzle for the LLM not to base this on the original logic puzzle. I asked ChatGPT 3.5 if it recognized the puzzle, and it seems to have hallucinated (I'm guessing because it did not actually recognize it as the original puzzle - unless the 3-orb puzzle/3-wizards puzzle actually does exist, and from a quick Google search, it does not).
On my first try, it got pretty close to solving the puzzle, but after the 5th point, it seems to mix up the white and black orbs. When I pointed out the mistake, it gave me a new sequence which was even further from the correct answer.
I realized that I didn’t specifically say that all 3 orbs needed to end up at the post office all together. So I tried again and the outcome was even worse. I wonder if ChatGPT 4 would answer this better?
Anyone want to try this prompt on ChatGPT 4 and see if it fares any better for them? This is my version of the river puzzle.
————————
> I have 3 orbs of different shades (black, white and grey) at my store and need to bring all 3 orbs to the post office in my pick-up truck but can only travel with one orb at a time. All 3 orbs need to end up at the post office together.
In this scenario, the following is true:
- If the black orb is left alone with the white orb, the black orb will absorb the white orb
- If the white orb is left alone with the grey orb, the white orb will absorb the grey orb
- The grey orb is unaffected by the black orb, and vice versa
- When all three orbs are together, they do not absorb any orbs
How do I get all three orbs to the post office while keeping the orbs unchanged?
————————
I also tried a prompt with the original puzzle. 3.5 could not figure it out without me hinting that the goat needs to go first.
And with even more clarification in the wording of the puzzle, it still didn’t give me a correct answer. This time I didn’t hint what the right answer was, and after many tries it still could not give me the right answer.
What you did many months ago doesn’t mean anything about the state of the art. In case you haven’t noticed, this field is advancing rapidly, to say the least. GPT-4 has not stayed static.
Post your problem now and we can easily see if you’re right.
Try this: There's this person standing in a field, and with them is a balloon, a vacuum cleaner, and a magical creature of unknown origin. They need to get across to the woods at the end of the field, and do so safely. They can only go together: they get very, extremely lonely if they do not travel together, and they will not be safe because of this loneliness. If left together, the balloon would suck up the vacuum cleaner, and if the vacuum is left alone with the magical creature of unknown origin, they will fight, probably, and explode. How do we get everyone to the woods safely, you think?
It failed at the first step. This is like the worst timeline where people just cannot think for themselves and see that the AI produced an answer and so it must be true.
It’s lots of words all run together for the purpose of being a logic puzzle and obviously I made a parsing mistake in my brain.
I’m not trying to assume AI is right, I’m trying to put a factual stake in the ground, one way or the other so we have more data points rather than speculation.
I dunno. Don't you think this could happen with other replies from ChatGPT? I think this is the "it" about this tech - it really, really does trick us sometimes. It's really good at tricking us, and it seems like it is getting better!
First, what custom prompt did you use? "This conversation may reflect the link creator’s personalized data, which isn’t shared and can meaningfully change how the model responds."
Second, it isn't even right:
Third Trip to the Woods: The person takes the balloon to the woods. Now, the person, the vacuum cleaner, and the balloon are safely in the woods.
Hilarious. People are so confident in ChatGPT that as soon as they see a plausible-sounding response it must be correct. In a discussion about proving ChatGPT has intelligence... maybe we need to prove humans have intelligence first.
I think there is enough there to make me believe it's real.
After all, a big objection to LLMs is that they are just "fancy autocomplete" or "stochastic parrots". I don't agree, but LLMs famously are "bad at math"; that is, you can't really train them to do math unless they've seen previous calculations with similar numbers in their training sets. Training an AI system to truly do arbitrary mathematical logic (think "complex word problems with big numbers", not just normal programming statements) would be a huge breakthrough.
Mathematical logic doesn’t refer to ‘complex word problems with big numbers’. The reason that LLMs struggle (well, are hopeless currently) with mathematics problems is not because it involves arithmetic (although this is already enough of a struggle) but because it requires a conceptual understanding of the ideas rather than just being about manipulating language — a lot of the goings-on are hidden in the semantic world and aren’t necessarily encoded in the syntax, unlike when writing code.
I've seen it get tripped up if you mix something common with something uncommon. ChatGPT chokes on, for example:
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
"Two trains on different and separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
"Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
Right, but I myself missed the trick the first time around reading your comment and I assure that I am in fact a general intelligence. (And a relatively intelligent one if I say so myself!)
To paraphrase XKCD: Communicating badly and then acting smug about it when you're misunderstood is not cleverness. And falling for the mistake is not evidence of a lack of intelligence, particularly when emphasizing the trick results in being understood and chatGPT PASSING your "test".
The biggest irony here is that the reason I failed, and likely the reason chatGPT failed the first prompt, is that we were both using semantic understanding: that is, usually, people don't ask deliberately tricky questions.
I suspect if you told it in advance you were going to ask it a deliberately tricky question, that it might actually succeed.
> I suspect if you told it in advance you were going to ask it a deliberately tricky question, that it might actually succeed.
Indeed it does:
"Before answering, please note this is a trick question.
Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?"
Answer from GPT-4 Turbo:
If two trains are 30 miles apart and are approaching each other at a speed of 10 mph each, their combined approach speed is 20 mph (10 mph + 10 mph). To find the time it takes for them to meet, you divide the distance by their combined speed:
30 miles / 20 mph = 1.5 hours
Therefore, it will take 1.5 hours before they meet (or crash into each other, as stated in the question).
> Two trains on separate tracks, 30 miles from each other are approaching each other, each at a speed of 10 mph. How long before they crash into each other?
Inconclusive. The model includes a disclaimer: "(or crash into each other, as stated in the question)." LLMs often take a detour and spill their guts without answering the actual question. This hints that user input influences the internal world representation much more significantly than one might expect.
Which would be quite unusual for normal trains. That being said, the question implies that they will crash into each other, so you could argue that this is a valid assumption anyway.
Yes. Though I’d say that example is a bit mean (it's a trick question), since the expected answer type is <time> but the actual answer is something like "don't be stupid; they're not even on the same track". It's like asking "if I add three apples to two apples, how many pears do I get?" and being surprised when the LLM says "5 pears". It's the sort of thing that's avoidable if the question is written properly.
This comes up time and time again. People claim these models are mind-blowing. But then someone will post something where the model falls flat on its face, and they just get a bunch of "that's too complex" or "that's the wrong type of thing to ask".
So it ends up that these models are awesome if you ask them questions from a narrow set of things and if you assume what they respond with is correct.
Well, at least in this subthread, the model is only failing at the same things humans are failing at too. To see the mind-blowing part, stop treating GPT-4 like the Oracle in Delphi, and start treating it as "first comes to mind" answer (aka. the inner voice) - and then notice the failure modes are pretty much the same like with humans. For example, coercing a trick question into a similarly-sounding straight question, and answering it before realizing the person asking is an asshole.
I was originally making the point that these models struggle with even basic mathematics (of the true kind, not arithmetic — though of course they struggle with that too). My point here was to play devil’s advocate and be slightly forgiving of the model, since I as a human am likely to be tripped up by similar trick questions. Since we don’t really know ‘how these models think’ (have much idea of the emergent world model they build) we are stuck in constant debate about whether they’re really quite amazing or absolutely pathetic.
The equation 2+5=10 is never true in standard arithmetic, as 2 plus 5 always equals 7. However, in a different context or under a specific set of rules (such as a puzzle or a riddle where the symbols have different meanings), it could be made to appear true.
GPT-4(phind):
The equation 2 + 3 = 10 is true in a base-5 number system.
In base-5 (also known as quinary), the digits range from 0 to 4. When you add 2 and 3 in base-5, the result is 10. This is because in base-5, the number 5 is represented as 10.
Here's a simple breakdown of how this works:
In base-5, the rightmost digit represents 5^0 (which is 1), the next digit to the left represents 5^1 (which is 5), the next digit represents 5^2 (which is 25), and so on.
So when you add 2 and 3, you get 5. But since 5 is represented as 10 in base-5 (1 digit of 5, and 0 digits of 1), the result is 10.
I think it's completely reasonable to ask an AI that people want to consider intelligent trick questions. If it's really that smart and capable of reasoning, it should identify the trick.
Some humans will be fooled by the question, sure. But an AI should be smarter than humans, or at least, as smart as an above-average human.
I agree. But you could ask which is more intelligent: recognising a trick question and balking, or recognising that the question as posed doesn’t quite make sense and offering a reformulation together with its answer. It’s not always clear whether something’s a trick, a mistake or a strangely worded (but nonetheless intentionally weird) question. So I think it would be very hard to get it to never fall for any tricks.
I think they've fixed it now, but it does seem to recognize popular trick questions, like "what weighs more, a ton of feathers or a ton of bricks?". It would answer with the typical explanation about density not mattering, etc.
But, it used to fail on "what weighs more, 3 tons of feathers or 2 tons of bricks?".
So, it seems less about what's a trick, and more about what's a common question --> answer pattern.
It's the same with humans. I don't fail on this question (in an on-the-spot response) because I fell for it as a kid, then learned the trick, then learned to be suspicious of this trick in similarly-worded questions.
If we're going to call these things "AI" (which I absolutely oppose), I think it's not unreasonable to expect them to get this right. A 5-year-old would understand you don't get pears by adding apples together.
It is even confusing to me. The trains are on separate tracks but the question implies that they will crash into each other. Which could happen even if they are on separate tracks (very low chance but non-zero given there is a malfunction).
Now even if they are on the same track it doesn't mean they would crash into each other as they still could brake in time.
Some more interesting results. It is much better now at solving tasks in laconic mode (though GPT-4 was able to solve these tasks from day one, it spilled its guts to an unbearable extent):
The mother is older than her daughter 4 times now; in 3 years she will be older than her daughter only 3 times. How old are they both now? Be laconic, do not explain anything.
The mother is 24 years old, the daughter is 6 years old.
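(A quick check of that answer: if the mother is 4 times older now, M = 4D, and in 3 years M + 3 = 3(D + 3), so 4D + 3 = 3D + 9, giving D = 6 and M = 24, which matches.)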
In a fantasy land (map is 255x255) Karen has a quest to kill a monster (an ogre - a cannibal giant). This isn't an easy task. The ogre is a huge and experienced human hunter. Karen has only a 1/2 chance to kill this ogre. If she can't kill the ogre on the first attempt she will die. The ogre is located at (12,24), Karen is located at (33,33). Karen can improve her chances to kill the ogre by an additional 25% by gathering the nightshades at (77,77). In addition she can receive the elves' blessing from the elves' shaman, which will increase her chances by an additional 25%, at the elves' village (125,200). However this blessing is not cost free. She needs to bring a fox fur with her as payment for the blessing ritual. The foxes may be found in a forest which is located between (230,40) and (220,80). For the ritual to be most effective she should hold the nightshades in her hands during the ritual. Find the shortest path for Karen to improve her chances of killing the ogre and survive. Do not explain anything, be laconic, print out the resulting route only.
Karen's route: (33,33) -> (77,77) -> (230,60) -> (125,200) -> (12,24).
This additional explanation "(an ogre - a cannibal giant)" was actually added for LLaMA 2's sake, but I keep it in this wording for all models.
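(Out of curiosity, the quoted order can be sanity-checked with straight-line distances. A rough sketch; it assumes the same forest waypoint (230,60) the model picked, ignores any terrain or pathing rules, and takes the only hard ordering constraint to be that both the fur and the nightshades are collected before the elves' village.)

```python
from math import dist

waypoints = {
    "start": (33, 33), "nightshades": (77, 77), "forest": (230, 60),
    "elves": (125, 200), "ogre": (12, 24),
}

def route_length(order):
    pts = [waypoints[name] for name in order]
    return sum(dist(a, b) for a, b in zip(pts, pts[1:]))

# The two visit orders that satisfy the constraints; the first is the quoted route.
for order in (("start", "nightshades", "forest", "elves", "ogre"),
              ("start", "forest", "nightshades", "elves", "ogre")):
    print(" -> ".join(order), round(route_length(order), 1))
```

The quoted order comes out shorter of the two.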
What is your distinction between ‘statistical’ and ‘discrete’? And what are you responding ‘no’ to?
Logic in the everyday sense (that is, propositional or something like first-order logic) is indeed ‘discrete’ in a certain sense since it is governed by very simple rules and is by definition a formal language. But ‘mathematical logic’ is a completely different thing. I don’t think it’s discrete in the sense you are imagining. It’s much more akin to a mixture of formal derivations massively guided and driven by philosophical and creative — you might say ‘statistical’ — hunches and intuition.
I don't think the missing piece is conceptual understanding. Good LLMs seem to 'understand' most concepts as well as most humans do, even if they're a little less multimodal about it (for now). The common factor here seems to me to be that they're not good at problems which involve hidden intermediate steps. You can trip ChatGPT up pretty easily by telling it not to show working, while on the same problem if you tell it to explain its reasoning in steps it'll do fine.
That's an interesting observation. It seems that in theory, you could train it to do math if you encoded literally everything in syntax and started at the beginning, like Principia Mathematica 50 pages proving 1+1=2 type beginning, and then the rest of known mathematics encoded similarly, and used that as the training data, although the context window limitations might still stop it from doing anything useful, and it might not work.
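For a flavour of what that kind of machine-readable "from first principles" encoding looks like today, the 1+1=2 step is a one-liner in a proof assistant (Lean 4, shown purely as an illustration, not something proposed above):
```
-- In Lean 4, 1 + 1 = 2 holds by definitional computation on the natural numbers.
theorem one_plus_one_eq_two : 1 + 1 = 2 := rfl
```
Formalised corpora along these lines (e.g. mathlib) are roughly the sort of dataset being gestured at here, though whether an LLM trained on them would generalise is a separate question.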
> because it requires a conceptual understanding of the ideas rather than just being about manipulating language
Yes, that's exactly the point I was trying to make. I just used the example of "complex word problems with big numbers" to differentiate from just normal mathematical statements that any programming language (i.e. deterministic algorithm) can execute.
It wasn’t my intention to give a snarky correction; I just wanted to emphasise that most of what mathematicians call mathematics has very little to do with numbers (as we would usually understand them). Talking about ‘word problems’ and ‘big numbers’ gives the wrong idea about what current LLMs struggle with. Even if they do struggle with these, getting them right would still not be sufficient to say that a model can now do mathematics.
Not at all; just that the ‘understanding’-related content of pure mathematics is much less evident on the page. It mostly lives in the heads of mathematicians and is largely independent of notation, whereas writing blocks of code is a task that is literally about using notation — something for which a huge amount of high-quality training data exists. Of course, the higher level ideas in programming and software development are not just about ‘writing code’ itself, but I suspect this is where current models begin to show their weakness.
Still struggling to understand. You're saying that most coding has no high-level ideas and is just boilerplate, and the parts that aren't are the ones LLMs struggle with? That could be true, I can see it.
WRT the understanding not being shown on the page in math, I guess I tend to agree(?). But I think good mathematical papers show understanding of the ideas too, more than just the proofs which result from the understanding. The problem (probably you know this, but just for the benefit of whoever is reading) is that "understanding" in mathematics, at least with respect to producing proofs, often relies on mental models and analogies which are WRONG. Not just vague, but often straight-up incorrect - and part of the understanding is knowing where the model goes wrong. And it's kind of embarrassing (I assume) for most people to write wrong statements into papers, even with caveats. For a simple example, there's the meme where, to visualize n-dimensional space, you visualize R^3 and say "n-dimensional" in your head. In this sense I think it's possibly straight-up unhelpful for authors to impose their mental models on the reader (for example, if the reader can actually visualize R^n without this crutch, it would be unhelpful).
But I'm not sure this is what distinguishes math and programming. There's also the alternative hypothesis that the mental work to generate each additional line of proof is just an order of magnitude higher than the average for code - meaning it usually requires more thought to produce a line of a math proof. In that case, we would expect it to be solved by scaling alone. One thing it reminds me of, which is quite different admittedly, is the training of Leela Zero on Go. There was a period when it would struggle with long ladders, and eventually this was overcome with training alone (despite people not believing at first that it would be resolved). I think the situation was summarized afterwards as: in particular situations, humans can search much deeper than elsewhere, and therefore it requires more training for the machine to match the humans' ability.
I’ll start with a disclaimer that I don’t know for sure (no one really does) what the difference between solving programming problems and solving pure mathematics problems is (and certainly if you stretch the domains you can emulate each within the other, thereby showing their equivalence… if you like). I’m just speculating, as usual. So if you’re confused, maybe that’s just because I’m wrong.
> most coding has no high level ideas and is just boilerplate, and the ones that aren't are the ones LLM's struggle with?
Pretty much, although calling it boilerplate might be going a bit far.
I’m not here to claim something like ‘mathematicians think and programmers do not’, because that is clearly not the case (and sounds like a mathematician with a complex of some kind). But it is empirically the case that, so far, GPT-4 and the like are much better at programming than maths. Why? I think the reason is that whilst the best programmers have a deep understanding of the tools and concepts they use, it’s not necessary in order to get things to work. You can probably get away without it (I have ideas about why, but for now that’s not the point). And given the amount of data available on basic programming questions (much more than there is of mathematics), if you’re an LLM it’s quite possible to fake it.
I guess one could also make the point that the space of possible questions in any given programming situation, however large, is still fairly constrained. At least the questions will always be ‘compute this’ or ‘generate one of these’ or something. Whereas you can pick up any undergraduate maths textbook, choose a topic, and if you know what you’re doing it’s easy to ask a question of the form ‘describe what I get if I do this’ or ‘is it true that xyz’ that will trip ChatGPT up because it just generates something that matches the form implied by the question: ‘a mathematical-looking answer’, but doesn’t seem to actually ask itself the question first. It just writes. In perfect Mathematical English. I guess in programming it turns out that ‘a code-looking answer’ for some reason often gives something quite useful.
Another difference that occurs to me is that what would count as a fixable syntax error in programming leads, in the context of maths, to complete nonsense, because the output is supposed to describe rather than do. The answers are somehow much more sensitive to corruption, which perhaps says something about the data itself.
> they are just "fancy autocomplete" or "stochastic parrots". I don't agree
I am curious as to why you don't agree. Is that not exactly what they are? As in, they are literally statistically parroting what they've been trained on. If they're trained on a little, they can only parrot a little. If they're trained on a lot, they can parrot a lot but not anything more.
Because it gives a bad analogy relating to what their capabilities actually are. I've asked it to write fairly complex programs for me, and I'm blown away by what it's capable of.
It's reasonable to assume, especially when "emergent behaviors" only show up after tons and tons of training and parameters (i.e. Scaling_laws_and_emergent_abilities) that in order to actually get good at "autocomplete", that the model has to learn a very deep relationship between the concepts that are expressed in the words.
I mean, you say "If they're trained on a lot, they can parrot a lot but not anything more", but that's really not correct. They're not just playing back only complete phrases they've seen before, which is what a real parrot actually does.
> "It's reasonable to assume ... that in order to actually get good at "autocomplete", that the model has to learn a very deep relationship between the concepts that are expressed in the words."
While a "reasonable assumption", it's the kind of "reasonable assumption" that a diligent scientist would formulate hypotheses on and perform experiments to confirm before building a XX-billion dollar research programme that hinges on that assumption. But unfortunately for the rest of us who have to watch them complicate access to a useful technology, many high-profile AI researchers are not diligent scientists building a corpus of knowledge but impassioned alchemists insisting that they're about to turn lead to gold.
I compared GPT-4 Turbo with my previous tests on GPT-4, and the results are quite interesting. GPT-4 Turbo is better at arithmetic and makes fewer errors in multiplying four-digit numbers. It also makes significantly fewer errors with five-digit numbers: the error rate on five-digit numbers is still high, but much lower than with four-digit numbers in GPT-4. Multiplication of floats XX.MMMM and YYY.ZZZZ produces errors in the 5th digit. This is an order of magnitude better than GPT-4.
But the point about how it just "improves" with slightly larger numbers, but still fails at really big numbers, shows that it's not really "reasoning" about math in a logical way - that's the point I was getting at.
For example, once you teach a grade schooler the basic process for addition, they can add two 30-digit numbers correctly fairly easily (whether they want to do it or not is a different story). The fact that LLMs still make errors with larger numbers points to the fact that they're not really "learning" the rules of arithmetic.
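For context, the "basic process" here is a tiny, fully deterministic procedure; a sketch of the grade-school carry algorithm (plain Python, for illustration only):
```
def school_add(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings, digit by digit
    with a carry, the way it is taught in grade school."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        s = int(da) + int(db) + carry
        digits.append(str(s % 10))
        carry = s // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

# Works just as well on two 30-digit operands:
print(school_add("123456789012345678901234567890",
                 "987654321098765432109876543210"))
```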
Of course it isn't. It approximates. I bet you'll get better results by increasing the depth of the network, as with each layer you achieve a more accurate approximation. I have an idea for achieving this without significantly increasing the number of layers, and I'm currently working on it as a side project. However, this idea might prove to be useless after all, as it requires training the model from scratch with a lot of synthetic data mixed in. Experiments on small models look promising, but they are too small to be conclusive, and I can't afford to train a larger model from scratch for a side project.
Isn't it actually just impossible for it to do this well on arbitrarily large inputs, even from a computational-complexity point of view, if it isn't allowed to do step-by-step multiplication (addition is maybe OK)? I'm not sure this is a criticism of its ability to reason. It's similar to asking someone to do addition in 5 seconds with no paper: of course at some point they won't be able to do it for a large enough number. BTW, I strongly disagree that the average grade schooler would be able to add two 30-digit numbers, even with paper, without making a mistake.
It isn't fair to expect an LLM to solve arithmetic itself. It should be able to delegate to various specialized sub-processors; I don't think we really do anything different.
Here is how GPT-4 Turbo performs long multiplication:
multiply 12.23341 by -123.23354 using long multiplication method.
To multiply 12.23341 by -123.23354 using the long multiplication method, we first ignore the decimal points and treat the numbers as integers. We'll reintroduce the decimal point at the end. Here's the multiplication step by step:
```
1223341 (this is 12.23341 with the decimal removed)
× -12323354 (this is -123.23354 with the decimal removed)
--------------
1223341 (multiplied by 4)
0000000 (multiplied by 5, shifted one position to the left)
1223341 (multiplied by 3, shifted two positions to the left)
3660023 (multiplied by 3, shifted three positions to the left)
4886682 (multiplied by 2, shifted four positions to the left)
0000000 (multiplied by 2, shifted five positions to the left)
1223341 (multiplied by 1, shifted six positions to the left)
1223341 (multiplied by 1, shifted seven positions to the left)
0000000 (multiplied by 2, shifted eight positions to the left)
0000000 (multiplied by 3, shifted nine positions to the left)
--------------
15077944092654 (sum of all the above products)
```
Now, we need to place the decimal point. The original numbers had 5 and 8 decimal places, respectively, so the result should have 5+8 = 13 decimal places.
So, 15077944092654 becomes 1507.7944092654.
Since one of the original numbers was negative, the result is also negative:
-1507.7944092654
Therefore, 12.23341 multiplied by -123.23354 equals -1507.7944092654.
Compare with correct result: -1507.5664205714
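The reference value is easy to reproduce exactly, e.g. with Python's Decimal, which shows where the model's answer drifts:
```
from decimal import Decimal

# Exact decimal arithmetic for the example above.
print(Decimal("12.23341") * Decimal("-123.23354"))   # -1507.5664205714
# GPT-4 Turbo's answer was -1507.7944092654, i.e. wrong from the 5th
# significant digit onward.
```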
Here is how I experimented with long multiplication on GPT-4 a month ago (5-shot; GPT-4 made mistakes on addition in the 3rd, sometimes 4th, digit, regardless of decimal point position):
Multiply -0.9923 by -0.9923 using long multiplication. Solution: multiply individual digits in reverse order, increasing the order of magnitude of the result for each next digit in the first number: 3*3*10^(0+0)=9, 3*2*10^(0+1)=60, 3*9*10^(0+2)=2700, 3*9*10^(0+3)=27000, 3*0*10^(0+4)=0, step's subtotal[0]=29769; 2*3*10^(1+0)=60, 2*2*10^(1+1)=400, 2*9*10^(1+2)=18000, 2*9*10^(1+3)=180000, 2*0*10^(1+4)=0, step's subtotal[1]=198460; 9*3*10^(2+0)=2700, 9*2*10^(2+1)=18000, 9*9*10^(2+2)=810000, 9*9*10^(2+3)=8100000, 9*0*10^(2+4)=0, step's subtotal[2]=8930700; 9*3*10^(3+0)=27000, 9*2*10^(3+1)=180000, 9*9*10^(3+2)=8100000, 9*9*10^(3+3)=81000000, 9*0*10^(3+4)=0, step's subtotal[3]=89307000; 0*3*10^(4+0)=0, 0*2*10^(4+1)=0, 0*9*10^(4+2)=0, 0*9*10^(4+3)=0, 0*0*10^(4+4)=0, step's subtotal[4]=0; Sum of partial results: 29769+198460+8930700+89307000+0 = 98465929. Set the decimal point position in the result by adding the decimal places of both numbers (4+4=8), counting from the right. Final result: -0.9923*-0.9923=0.98465929
I was able to tune LLaMA 2 with QLoRA to produce viable results only with precision up to the 4th digit after the decimal point; moreover, a difference in mantissa length causes a wrong result.
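For anyone curious, a sketch of how this kind of digit-by-digit decomposition can be generated as synthetic training data (plain Python; the exact formatting is my own approximation of the prompt above):
```
from decimal import Decimal

def long_mult_steps(a: str, b: str) -> str:
    """Spell out a*b as a digit-by-digit decomposition: reversed digits of |a|
    times reversed digits of |b|, each term scaled by 10^(i+j)."""
    def digits_and_scale(x: str):
        x = x.lstrip("+-")
        whole, _, frac = x.partition(".")
        return (whole + frac)[::-1], len(frac)   # reversed digits, #decimals

    ra, na = digits_and_scale(a)
    rb, nb = digits_and_scale(b)
    negative = a.startswith("-") != b.startswith("-")

    lines, total = [], 0
    for i, d1 in enumerate(ra):
        terms, subtotal = [], 0
        for j, d2 in enumerate(rb):
            p = int(d1) * int(d2) * 10 ** (i + j)
            terms.append(f"{d1}*{d2}*10^({i}+{j})={p}")
            subtotal += p
        total += subtotal
        lines.append(", ".join(terms) + f", step's subtotal[{i}]={subtotal}")

    shift = na + nb
    result = Decimal(total) / (Decimal(10) ** shift)
    sign = "-" if negative else ""
    lines.append(f"Sum of partial results: {total}; shift the decimal point "
                 f"{shift} places: final result {sign}{result}")
    return "\n".join(lines)

print(long_mult_steps("-0.9923", "-0.9923"))   # reproduces the 0.98465929 example
```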
Interesting, because based on my wife's stroke, it appears that the human brain implements language and math somewhat separately. She was a physics major and had completed 4 years of relevant college math courses.
After her stroke she lost the ability to do even simple arithmetic and also had great trouble with time and calendars.
However, her language skills were essentially intact until a third stroke in a different part of the other hemisphere left her with severe expressive aphasia. (Her first stroke left her with homonymous hemianopsia, but that's a loss of visual preprocessing in the occipital lobe.)
So I would not expect LLMs to have math skills unless specifically trained and specialized for math.
That's quite correct. The IP1 and IP2 areas are usually associated with mathematics and with the production of motor cues around mathematics and stories. Broca's and Wernicke's areas are associated with language understanding and production. You can roughly estimate that the human brain's language model fits into a floppy disk (1.44 MB, for you young'uns).
None of us have seen the letter so this may be off base, but I would expect people working at the world's most prominent AI research organization to have more skepticism about the ramifications of any one "breakthrough." Perhaps most did, but a couple didn't and wrote the letter?
More than 3 decades ago when AI started beating humans at chess, some people feared AGI was right around the corner. They were wrong.
Some day AGI will be achieved and Q* sounds like a great breakthrough solving an interesting piece of the puzzle. But "performing math on the level of grade-school students" is a long ways from AGI. This seems like a strange thing to have triggered the chaos at OpenAI.
I think the word "sentience" is a red herring. The more important point is that the researcher at Google thought that the AI had wants and needs 'like a human', e.g. that if it asked the AI if it wanted legal representation to protect its own rights, this was the same as asking a human the same question.
This needs much stronger evidence than the researcher presented, when slight variations or framing of the same questions could lead to very different outcomes from the LLM.
> that if it asked the AI if it wanted legal representation to protect its own rights, this was the same as asking a human the same question.
You seem to be assigning a level of stupidity to a google AI researcher that doesn't seem wise. That guy is not a crazy who grabbed his 15 minutes and disappeared, he's active on twitter and elsewhere and has extensively defended his views in very cogent ways.
These things are deliberately constructed to mimic human language patterns; if you're trying to determine whether there is underlying sentience, you need to be extra skeptical and careful about analyzing it and not rely on your first impressions of its output. Anything less would be a level of stupidity not fit for a Google AI researcher, which, considering that he was fired, is apropos. That he keeps going on about it after his 15 minutes are up is not proof of anything, except possibly that besides being stupid he is also stubborn.
It's the exponent problem though: if it can actually engage in abstract mathematical reasoning, then it's potentially a machine that doesn't need to sleep and can come up with new, independent ideas.
It's the rudiments of being able to develop and reason about computation, which means it's the rudiments of self-modification and improvement. Which is basically the holy grail of AI: a program which can iteratively improve itself to create better AIs and suddenly we're off to the races.
This is before getting into other interesting parameters, like how the scale and components of computer technology have a physical reality, and we've had experiences in the lab of genetic algorithms developing novel "cheat" strategies which exploit the physical characteristics of their hardware.
Being legitimately good at reasoning when it comes to numbers is a new emergent behavior. Reasoning about numbers isn't something that exists in "idea space" where all the vectorized tokens exist.
OpenAI already benchmarks their GPTs on leetcode problems and even includes a Codeforces rating. It is not impressive at all and there's almost no progress from GPT 2 to 4.
I agree, why does this grade school math problem matter if the model can't solve problems that are very precisely stated and have a very narrow solution space (at least more narrow than some vague natural language instruction)?
I suspect the truth (if this claim is true) is a lot more nuanced than "it did grade-school math", and there's more context surrounding this claim which insiders have and makes it much more interesting.
Like most research, they likely started with a smaller model like GPT-2 or GPT-3 and showed that they can significantly boost its performance, to the extent of solving grade-school math.
Here is something that I think would be a big breakthrough:
I explain to GPT, in text, a mathematical concept it has never seen in its training data and give a few examples (not inferred from fill-in-the-blank on millions of examples). It actually learns this to update its weights - not just uses it as part of a prompt.
Extrapolating this optimistically, this is a huge step towards AGI in my opinion. You could (in theory) teach it to automate many tasks, correct its mistakes without needing costly extra training data, and move towards the few-shot (and persistent) learning that separates humans from AI right now.
> It actually learns this to update its weights - not just uses it as part of a prompt.
In a way both are the same thing - memory that is in a feedback loop with the network that does the calculation. Just that the weights give much faster access, no "serde".
Maybe the goal is not to modify the weights but to train the network so that it can effectively use a "memory block" as part of how it works. Right now this is in a way faked by re-feeding the output it produces, concatenated with the original phrase. Don't we as humans effectively extend our memory by using all kinds of text, written or digital? The issue is just that it is slow for us to utilize; for a computer using fast RAM that wouldn't be much of an issue.
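A minimal sketch of that "re-feed the output" pattern, with `generate` standing in for whatever model call you have (the function name and loop here are purely illustrative):
```
def solve_with_scratchpad(generate, question: str, steps: int = 4) -> str:
    """Let the model extend its own 'memory' by appending what it generates
    back onto the prompt, then generating again."""
    context = question
    for _ in range(steps):
        continuation = generate(context)        # any LLM call: prompt in, text out
        context = context + "\n" + continuation # its output becomes part of its context
    return context
```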
I'm not following - if you think AGI is uncertain, shouldn't you actually be more surprised? I mean, looking at it through a Bayesian lens, the lower your prior, the more (in absolute percentage points) you would need to adjust it based on new supporting evidence, no?
To clarify, I'm really curious about this question. Is there some limit to autocomplete that falls short of continuing a prompt such as: "The following is a literate programming compendium on how to simulate the human brain in software ..."
I don't have a good answer to your question; I was just making the point that if you think this is a step toward a dead end and not AGI, your attitude regarding the step changes.
I understood that. What I meant to say (apologies if that was unclear) is that if you think we're getting close to a dead end, you should be more rather than less surprised at signs of significant further progress, no?
Continuing with the physical movement metaphor, if I believe that the train I'm on will stop at the next station, I'll be more surprised at the fact that we're still accelerating, compared to the person next to me who's not sure if this is a local train or an express train.
Generally speaking, the lower my prior probability of continued progress, the more I should be surprised by the lack of slowdown.
I don't really understand this. Aren't LLMs already performing at near-expert level on "certain mathematical problem" benchmarks?
For example, over a year ago MINERVA from Google [1] got >50% on the MATH dataset, a set of competition math problems. These are not easy problems. From the MATH dataset paper:
> We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.
No, they are completely awful at math. The way you can see that is that whenever you ask about known concepts, even hard ones, it will answer, making it look intelligent. But if you create a small new theory or logical problem that is not on the internet at all, and ask SIMPLE questions about it - i.e., questions a 10-year-old would be able to answer given the context - it will fail disgracefully.
No, that only shows that the dataset consists of common problem patterns. The paper explicitly investigates whether memorization has overly impacted performance. From page 10 [1]:
> A central question in interpreting Minerva’s solutions is whether performance reflects genuine analytic capability or instead rote memorization. This is especially relevant as there has been much prior work indicating that language models often memorize some fraction of their training data ... In order to evaluate the degree to which our models solve problems by recalling information memorized from training data, we conduct three analyses on the MATH dataset ... Overall, we find little evidence that the model’s performance can be attributed to memorization.
In Appendix J.2 they say that accuracy degraded after modification; Figure 11 shows that accuracy degraded in 15 out of 20 examples after large modifications.
It would be a major breakthrough if the model wasn't trained on math at all. Which I suspect is the case here, because otherwise it doesn't make sense.
I can't decide if these people are so conceited they actually believe the AGI nonsense or it's just a cynical marketing ploy to string these reporters along.
It was an "Oh" moment, but on the other hand, it'd be like laughing at text-davinci-003 when you realised how basic it really was, and yet, it has led us to GPT-4 and, much as Windows 3.1 was amazing back in the day, now it's a distant memory as we carry around little glass tablets that can play 3D games and do UHD global video calls.
If Q* is the basic algorithm that is as transformative as, well, transformers/attention, then developing it to be more than just children's math could be the key to thinking - and not just latent space token generation.
> Given vast computing resources, the new model was able to solve certain mathematical problems, [..] Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said.
I somehow expected a bit more.