
I get the feeling this works due to the following:

1. An input is processed and the answer is generated token by token.

2. At each step the model outputs, based on probability, either an answer token or a filler token. Low-probability answer tokens are passed over in favour of higher-probability filler tokens ("I don't have a better token to answer with than ....."), as in the toy sketch just after this list.

3. At a certain point, the context aligns with something learnt during training, which raises the probability of outputting a better token.
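
To make that concrete, here's a toy greedy-decoding sketch of the intuition. The probabilities are completely invented, and the drift toward the answer token stands in for point 3 above:

    # Toy illustration with invented probabilities: greedy decoding keeps
    # emitting the filler token while it outranks both answer tokens, then
    # switches once an answer token becomes the most probable choice.
    def toy_next_token_distribution(step):
        p_true = min(0.10 + 0.15 * step, 0.90)   # hypothetical drift toward "true"
        return {".": 1.0 - p_true - 0.05, "true": p_true, "false": 0.05}

    output = []
    for step in range(8):
        dist = toy_next_token_distribution(step)
        token = max(dist, key=dist.get)          # greedy pick of the next token
        output.append(token)
        if token in ("true", "false"):           # stop once an answer surfaces
            break

    print(" ".join(output))                      # prints ". . . true"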

The intuition here is that I've noticed models respond differently based on where in the context information appears. I can't speak for different embedding methods, however, as I'm sure they would change my thinking above.

If chain-of-thought prompting is used instead, the additional generated tokens may interfere with the output probabilities.

So, further to this, I'm thinking filler tokens give a model a purer way to surface the best answer it has been trained on without introducing more noise. Alternatively, we can use methods that resample multiple times and keep the highest-scoring output; a rough sketch of that is below.
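
By resampling I mean something like self-consistency: sample several answers and keep the most frequent one. A minimal sketch, with a random stand-in where a real sampled model call would go (the weights are invented):

    import random
    from collections import Counter

    random.seed(0)

    # Stand-in for a sampled LLM call; a real version would sample from the
    # model's own answer distribution rather than these made-up weights.
    def sample_answer(prompt):
        return random.choices(["true", "false"], weights=[0.6, 0.4])[0]

    # Resample many times and keep the most frequent answer (majority vote).
    samples = [sample_answer("Does some triple in the list sum to 0?") for _ in range(25)]
    answer, votes = Counter(samples).most_common(1)[0]
    print(answer, votes / len(samples))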

These LLMs are practically search engines in disguise.



I'm not sure you understood what the paper was saying.

The LLM in the paper isn't being trained to output filler tokens until it finds an answer; it's trained to provide a better answer when it's given filler tokens. The only tokens the paper's LLM will predict are "true" and "false"; the filler tokens are input-only.

And the paper doesn't find that filler tokens are "purer" than chain-of-thought: it describes them as less effective than CoT, though still a performance boost over producing the answer directly on certain types of tasks.
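
Concretely, the setup as I read it looks more like the sketch below (the prompt strings and scoring function are placeholders, not the authors' code): the dots are appended to the input, and the model's output is constrained to a single "true"/"false" token.

    # The filler dots are part of the *input*; the model is only ever asked
    # for one output token, "true" or "false".
    question = "Does some triple in the sequence sum to 0?"

    prompt_immediate = question + " Answer:"
    prompt_filler    = question + " " + ". " * 30 + "Answer:"

    # Placeholder scorer; a real one would return the model's log-probability
    # of `tok` as the next token after `prompt`.
    def score(prompt, tok):
        return 0.7 if tok == "true" else 0.3

    def predict(prompt):
        return max(["true", "false"], key=lambda tok: score(prompt, tok))

    print(predict(prompt_immediate), predict(prompt_filler))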


I think you may need to read my comment again. It's even mentioned in the summary:

> In summary, our results show that additional tokens can provide computational benefits independent of token choice. The fact that intermediate tokens can act as filler tokens raises concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought tokens.

The paper discusses an unexplained benefit from additional computation regardless of which token is selected, be it symbols or Lorem Ipsum.

I didn't mention anything about training. I'm speaking based on how the Transformer architecture itself is designed.

The "unauditable" computation is simply a result of how machine learning models work. The extra computation made available is mentioned in my explanation.

Keen to hear your thoughts on it though.



