
I personally think this is important completely aside from any cultural propriety/censoring, as it's one of the only parts of AI ethics that we can currently directly work on and test.

If we can't get models not to say racist or otherwise terrible things, we can't make any guarantees about our ability to control or guide some future AGI.

A very much secondary reason I appreciate these (admittedly annoying) attempts to control LLM output is that I do think it is responsible to consider the societal impact of accelerated and automated hate speech and propaganda. Telling large AI companies not to consider these impacts and just release the raw models seems akin to being grateful that Facebook et al. never stopped to consider the societal impact of social media, when we all know that it's had significant negative side effects.



> If we can't get models not to say racist or otherwise terrible things, we can't make any guarantees about our ability to control or guide some future AGI.

This is a very bold assumption: that current LLMs function and "think" in the same way some future AGI would. They do not even reason; they just make up words that fit some context - thus they "hallucinate".

There is no reason the approach taken here - injecting some bias or word filtering - would apply to the real thing. And AI safety and alignment was not (at least until the field got hijacked) about some model saying mean words, but about something genuinely threatening like the paperclip maximizer problem: an agent choosing a path to a goal which is not aligned with what humans find acceptable (e.g. solving world hunger by killing everyone).


Paperclipping is just one of many ways it can go wrong.

While I agree LLMs are unlikely to be the last word on AI, the fact that we understand alignment so poorly that they spew random things - let alone any arguments about which words are acceptable[0] - is a sign we have much foundational work to do.

Indeed, as I recall, one of the main researchers in this topic describes it as "pre-paradigmatic" because we don't have a way to even compare the relative alignment of any two AIs.

[0] personally, I suspect but cannot prove that tabooing certain words is a Potemkin village solution to the underlying social problems


It's not a bold assumption. It's the only assumption. We can't completely control the output of LLMs because we don't know how they generate it. Nobody on earth has the faintest clue how all those 175 billion parameters are shaping the response to an input.

Whether or not they "think" doesn't matter. Any black-box system is uncontrollable in essence. You cannot make inviolable rules for a system you don't understand.

And saying LLMs hallucinate because they don't understand anything is stupid, and just shows ignorance on your part. Models hallucinate because they're rewarded for plausibly guessing during training when their knowledge fails. Plausible guessing is a much better strategy for reducing loss.

And the conclusion is obvious enough: bigger, smarter models hallucinate less because they guess less. That holds true.

https://crfm.stanford.edu/helm/latest/?group=core_scenarios

All the instruct tuned models on this list follow that trend.

From Ada to Babbage to Curie to Claude to Davinci-002/003: greater size equals greater truthfulness (evaluated on TruthfulQA).
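
To make the loss point concrete, here's a toy sketch (my own illustration with made-up numbers, not anything from an actual training pipeline). Under cross-entropy scoring of the correct answer token, abstaining is punished far harder than spreading probability over plausible guesses:

    import math

    def loss(p_correct):
        # Cross-entropy loss when the model assigns p_correct to the true answer
        return -math.log(max(p_correct, 1e-12))

    # Abstaining: almost no probability mass left for the correct answer token
    print(loss(0.001))  # ~6.9

    # Guessing uniformly over 4 plausible candidates: 1/4 on the correct one
    print(loss(0.25))   # ~1.4

As long as training only rewards mass on the correct token, the guessing strategy wins, which is exactly the incentive I'm describing.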


> They do not even reason, just make up words that fit some context - thus they "hallucinate".

But they can explain their 'reasoning' in a way that makes sense to humans a lot of the time. Serious question: how do you know if something does or doesn't reason?


That is not their reasoning though - it is something they think a human would write given the prompt (a question that expects you to provide the reasoning behind the answer). For something to reason, it needs the ability to hold certain goals and to perform the actions it thinks are most optimal for reaching those goals. Like setting hypotheses and producing a path towards proving them - that is reasoning.

The LLM only correlates, so its "reasoning" is something like "people most often answered 4 to 2+2, so that is what I should write". That's why it confidently gives out complete gibberish: it works with correlation, not causality. I think world models are much closer to that goal of real reasoning - check out something like DreamerV3 or what Yann LeCun is talking about.
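
For what it's worth, here's a minimal caricature of that correlation claim (a toy sketch with made-up data - real models are vastly more sophisticated, but the objective is the same frequency-matching idea):

    from collections import Counter

    # Hypothetical toy corpus of prompt/continuation pairs
    corpus = [("2+2=", "4"), ("2+2=", "4"), ("2+2=", "4"), ("2+2=", "5")]

    # Predict the continuation seen most often after the prompt
    counts = Counter(ans for prompt, ans in corpus if prompt == "2+2=")
    print(counts.most_common(1)[0][0])  # "4" - chosen by frequency, not arithmetic

The model outputs "4" because it is common, not because it is derived.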



