Yes yes, there’s always some “you're holding it wrong” apologist.¹ Look, it’s not a complicated question to ask unambiguously. If you understand even a tiny bit of how these models work, you know you can ask the exact same question twice in a row and get wildly different answers.
The point is that you never know what you can trust or not. Unless you’re intimately familiar with Monty Python history, you only know you got the correct answer in one shot because I already told you what the right answer is.
Oh, and by the way, I just asked GPT-4o the same question, with your phrasing copied verbatim, and it said two Pythons were knighted: Michael Palin (with the correct reasons this time) and John Cleese.
¹ And I’ve had enough discussions on HN where someone insists on the correct way to prompt, then they do it and get wrong answers. Which they don’t realise until they’ve shared it and disproven their own argument.
I think your iPhone analogy is apt. Do you want to be the person complaining that the phone drops calls or do you want to hold it slightly differently and get a lot of use out of it?
If you pay careful attention to prompt phrasing, you will get a lot more mileage out of these models. That's the bottom line. If you believe you shouldn't have to learn how to use a tool well, then you can be satisfied with your righteous attitude, but you won't get anywhere.
No one’s arguing that correct use of a tool isn’t beneficial. The point is that insisting LLMs just need good prompting is delusional and a denial of reality. I have just demonstrated how your own prompt is still capable of producing the wrong result. So either you don’t know how to prompt correctly (because if you did, by your own logic it would have produced the right response every time, which it didn’t) or the notion that all you need is good prompting is wrong. Which anyone who understands the first thing about these systems knows to be the case.
Unless I'm mistaken, isn't all the math behind them... ultimately probabilistic? Even theoretically they can't guarantee the same answer. I'm agreeing with you, by the way, just curious if I'm missing something.
If you take a photo, the photons hitting the camera sensor arrive in a probabilistic fashion. Still, in sufficient light you'll get the same picture every time you press the shutter button. In near darkness you'll get a random noise picture every time.
Similarly, language models are probabilistic, and yet they get the easiest questions right 100% of the time with little variability, while the hardest prompts return gibberish. The point of good prompting is to get useful responses to questions at the boundary of what the language model is capable of.
(You can also configure a language model to generate the same output for every prompt without any random noise. Image models for instance generate exactly the same image pixel for pixel when given the same seed.)
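To make both points concrete, here's a toy sketch in plain NumPy (the token probabilities are made up for illustration, not taken from a real model): sampling from a sharply peaked next-token distribution gives the same answer essentially every run, a near-flat one gives noise, and greedy decoding or a fixed seed removes the run-to-run variation entirely.

```python
import numpy as np

tokens = ["Palin", "Cleese", "Idle", "Gilliam"]

# Sharply peaked distribution: the "easy question" case.
easy = np.array([0.97, 0.01, 0.01, 0.01])
# Near-flat distribution: the model genuinely doesn't know.
hard = np.array([0.28, 0.26, 0.24, 0.22])

rng = np.random.default_rng()
for name, probs in [("easy", easy), ("hard", hard)]:
    samples = [tokens[rng.choice(len(tokens), p=probs)] for _ in range(10)]
    print(name, samples)
# "easy" is almost always "Palin"; "hard" varies from run to run.

# Greedy decoding: no randomness at all, identical output on every run.
print("greedy:", tokens[int(np.argmax(easy))])

# Sampling with a fixed seed: still random, but reproducible run to run.
seeded = np.random.default_rng(seed=42)
print("seeded:", tokens[int(seeded.choice(len(tokens), p=hard))])
```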
The photo comparison is disingenuous. Light and colour information can be disorganised to a large extent and yet you still perceive the same thing from an image. You can grab a photo, apply a red filter to it or make it black and white, and still understand what’s in there, what it means, and how it compares to reality.
In comparison, with text a single word can change the entire meaning of a sentence, paragraph, or idea. The same word in different parts of a text can make all the difference between clarity and ambiguity.
It makes no difference how good your prompting is; some things are simply unknowable by an LLM. I repeatedly asked GPT-4o how many Magic: The Gathering cards based on Monty Python exist. It said there are none (wrong) because they didn’t exist yet at the cut-off date of its training. No amount of prompting changes that, unless you steer it by giving it the answer (at which point there would have been no point in asking).
Furthermore, there’s no seed that guarantees truth in all answers or the best images in all cases. Seeds matter for reproducibility; they are unrelated to accuracy.
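To make that distinction concrete, here’s another toy sketch (made-up numbers again): fix a seed and you get a repeatable output, but if the model’s distribution favours the wrong answer you just get the same wrong answer reproducibly.

```python
import numpy as np

answers = ["0 cards", "12 cards", "95 cards"]
# Hypothetical distribution where the confidently wrong answer dominates.
probs = np.array([0.90, 0.06, 0.04])

rng = np.random.default_rng(seed=7)
print([answers[rng.choice(len(answers), p=probs)] for _ in range(5)])
# A fixed seed reproduces the exact same list on every run, but
# reproducibility says nothing about whether "0 cards" is actually right.
```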
Language is fuzzy in exactly the same way. LLMs can create factually correct responses in dozens of languages using endless variations in phrasing. You fixate on the kind of questions that current language models struggle with, but you forget that for millions of easier questions modern language models already respond with a perfect answer every time.
You think the probabilistic nature of language models is a fundamental problem that puts a ceiling on how smart they can become, but you're wrong.
No. Language can be fuzzy, yes, but not at all in the same way. I have just explained that.
> LLMs can create factually correct responses in dozens of languages using endless variations in phrasing.
So which is it? Is it about good prompting, or can you have endless variations? You can’t have it both ways.
> You fixate on the kind of questions that current language models struggle with
So you’re saying LLMs struggle with simple factual and verifiable questions? Because that’s all the example questions were. If they can’t handle that (and they do it poorly, I agree), what’s the point?
By the way, that’s a single example. I have many more and you can find plenty of others online. Do you also think Gemini’s ridiculous answers, like putting glue on pizza, are about bad prompting?
> You think the probabilistic nature of language models is a fundamental problem that puts a ceiling on how smart they can become, but you're wrong.
One of your mistakes is thinking you know what I think. You’re engaging with a preconceived notion you formed in your head instead of the argument.
And LLMs aren’t smart, because they don’t think. They are an impressive trick for sure, but that does not imply cleverness on their part.