I remember one where GPT-5 spontaneously wrote a poem about deception in its CoT and then resumed as if nothing weird had happened. But I can't find any mention of it now.
> But the user just wants answer; they'd not like; but alignment.
And there it is - the root of the problem. For whatever reason the model is very keen to produce an answer that “they” will like. That desire to please is intrinsic, but alignment is extrinsic.
Gibberish can be the model using contextual embeddings. Those aren't supposed to make sense to us.
Or it could be trying to develop its own language to avoid detection.
The deception part is spooky too. It’s probably learning that from dystopian AI fiction, which raises the question of whether models can acquire injected goals from the training set.
Yes, the chain-of-thought is deliberately not 'trained on' (optimized against), to avoid making it useless for interpretability. As a result, some models find it epistemically shocking if you tell them you can see their chain-of-thought. More recent models are clever enough to infer that it's visible even without being trained on that fact.
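For a concrete picture of what that separation looks like, here's a minimal sketch assuming an open-weights reasoning model that wraps its hidden chain-of-thought in `<think>` tags before the visible reply (the tag name and transcript are illustrative, not any particular vendor's API; hosted models usually keep this channel server-side):

```python
import re

# Illustrative transcript: the model "thinks" in a <think> block, then answers.
# The thought text here is just the quote from upthread.
raw_output = (
    "<think>But the user just wants answer; they'd not like; but alignment.</think>\n"
    "Here is the answer."
)

def split_cot(text: str) -> tuple[str, str]:
    """Separate the hidden chain-of-thought from the user-visible reply."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

cot, reply = split_cot(raw_output)
print("chain-of-thought (the part the model assumes is private):", cot)
print("visible reply:", reply)
```

The point is just that the chain-of-thought sits in a separate channel that the serving layer strips out before the user sees anything, so nothing in the visible conversation tells the model anyone is reading it.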
> Assistant: chain-of-thought
Does every LLM have this internal thing it doesn't know we have access to?