I'm not sure you understood what the paper was saying.
The LLM in the paper isn't being trained to output filler tokens until it finds an answer; it's trained to provide a better answer when it's given filler tokens. The only tokens the paper's LLM will predict are "true" and "false": the filler tokens are input-only.
And the paper doesn't find that filler tokens are "purer" than chain-of-thought: it describes them as less effective than CoT, though still a performance boost over answering directly on certain types of tasks.
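To make the setup concrete, here's a rough sketch of what I mean (my own illustration, assuming a HuggingFace-style causal LM and tokenizer; the model name and function are hypothetical, not the paper's code): the filler tokens sit in the input, and the model is only ever scored on a single "true"/"false" choice at the final position.

```python
# Hypothetical sketch of the filler-token setup (not the paper's code):
# "." fillers are appended to the *input*, and the answer is read off
# at the last position as a choice between "true" and "false" --
# the model never has to generate the fillers itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def answer_with_filler(problem: str, n_filler: int = 30) -> str:
    prompt = problem + " " + "." * n_filler
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits at the last input position

    true_id = tokenizer(" true", add_special_tokens=False).input_ids[0]
    false_id = tokenizer(" false", add_special_tokens=False).input_ids[0]
    return "true" if logits[true_id] > logits[false_id] else "false"

print(answer_with_filler("2 + 3 * 4 = 14."))
```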
I think you may need to read my comment again. It's even mentioned in the summary:
> In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.
The paper discusses an unexplained benefit of additional computation regardless of which token is selected, be it symbols or Lorem Ipsum.
I didn't mention anything about training. I'm speaking based on how the Transformer architecture itself is designed.
The "unauditable" computation is simply a result of how machine learning models work. The extra computation made available is mentioned in my explanation.