
They really lie.

Not on purpose, but because they are trained on rewards that favor lying as a strategy.

Othello-GPT is a good example for understanding this. Without being explicitly trained to do so, just from the task of predicting moves on an Othello board, Othello-GPT spontaneously developed the strategy of simulating the entire board internally. Lying is a similarly emergent, very effective strategy for earning reward.
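For context, that "simulates the board internally" claim comes from probing experiments: train a small classifier on the model's hidden activations and check whether the board state can be read off them. A minimal sketch with made-up data (the names, shapes, and sklearn here are my stand-ins, not the original setup):

  # Sketch of a linear probe: can the board state be read off the activations?
  # The real work probed the transformer's residual stream; this uses random arrays.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  n_positions, d_model = 10_000, 512
  hidden = np.random.randn(n_positions, d_model)       # stand-in for activations
  square_state = np.random.randint(0, 3, n_positions)  # 0=empty, 1=black, 2=white

  probe = LogisticRegression(max_iter=1000)
  probe.fit(hidden[:8000], square_state[:8000])
  print("probe accuracy:", probe.score(hidden[8000:], square_state[8000:]))
  # ~chance here on random data; high accuracy on real activations is the
  # evidence that the model tracks the board internally.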





> They really lie. Not on purpose

You can't lie by accident. You can tell a falsehood, however.

But where LLMs are concerned, they don't tell truths or falsehoods either, as "telling" also requires intent. Moreover, LLMs don't actually contain propositional content.


I think you’re saying this with unwarranted confidence.

Reference: https://www.science.org/content/article/ai-hallucinates-beca...

If you don't know the answer, and are only rewarded for correct answers, guessing, rather than saying "I don't know", is the optimal approach.
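The expected-value arithmetic behind that, as a toy sketch (the reward numbers are hypothetical, not from the article):

  # Hypothetical grading scheme: 1 point for a correct answer, 0 otherwise,
  # including 0 for "I don't know".
  def expected_reward(p_correct, r_correct=1.0, r_wrong=0.0, r_idk=0.0):
      guess = p_correct * r_correct + (1 - p_correct) * r_wrong
      return guess, r_idk

  for p in (0.05, 0.25, 0.5):
      guess, abstain = expected_reward(p)
      print(f"p(correct)={p}: guess={guess:.2f} vs 'I don't know'={abstain:.2f}")
  # Even a 5% shot at being right beats abstaining, unless wrong answers are
  # explicitly penalized (r_wrong < 0).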


It's more than just that, but thanks for that link; I've been meaning to dig it up and revisit it. Beyond hallucinations, there are also deceptive behaviors like hiding uncertainty, omitting caveats, or doubling down on previous statements even when weaknesses are pointed out. Plus there will necessarily be lies in the training data as well, sometimes enough of them to skew the pretrained/unaligned model itself.

Not sure if that counts as lying, but I've heard that an ML model (way before all this GPT LLM stuff) learned to classify images based on the text written in them. For an obfuscated example, it learned to read "stop", "arrêt", "alto", etc. on a stop sign instead of recognizing the red octagon with white letters, which naturally does not work when the actual dataset has different text.
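That failure mode is usually called shortcut learning. A toy sketch with entirely made-up features (nothing here is from the original study):

  # Toy shortcut: in training, English "STOP" text always co-occurs with stop
  # signs, so the classifier can lean on it and ignore the (noisier) shape.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n = 2000
  is_stop_sign = rng.integers(0, 2, n)                 # ground truth
  has_english_text = is_stop_sign.astype(float)        # spurious, perfectly correlated
  shape_score = is_stop_sign + rng.normal(0, 2.0, n)   # real signal, but noisy

  X_train = np.column_stack([has_english_text, shape_score])
  clf = LogisticRegression().fit(X_train, is_stop_sign)

  # Same signs, but labelled "arrêt"/"alto": the English-text feature vanishes.
  X_test = np.column_stack([np.zeros(n), shape_score])
  print("train accuracy:", clf.score(X_train, is_stop_sign))
  print("foreign-text accuracy:", clf.score(X_test, is_stop_sign))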

Typographic attacks against vision-language models are still a thing with more recent models like GPT-4V: https://arxiv.org/abs/2402.00626

That does feel a little more like over-fitting, but you might be able to argue that there's some philosophical proximity to lying.

I think, largely, the

  Pre-training -> Post-training -> Safety/Alignment training
pipeline would obviously produce 'lying'. The training stages are in a sort of mutual dissonance.


