
My intuitive understanding of transformers is that each token (input or output) gives the model some "thinking space" that can be used to store reasoning information (independently of the token itself): thus, a transformer should be more clever when completing a 1000-token sequence than a 2-token sequence. This seems in line with the paper's findings.
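A rough way to see this intuition quantitatively: in a vanilla decoder, every layer keeps one hidden vector per prefix position (the KV cache), and the next token's attention step can read all of them. A minimal sketch below counts those slots; the function name, d_model, and layer count are illustrative assumptions, not from the comment or the paper.

  # Sketch: amount of cached hidden state the next token can attend over,
  # assuming a plain decoder-only transformer with a KV cache per layer.
  def attention_workspace(prefix_len: int, d_model: int = 64, n_layers: int = 12) -> int:
      # One d_model-sized vector per prefix position per layer.
      return prefix_len * d_model * n_layers

  print(attention_workspace(2))     # 2-token prefix:    1536 scalar slots
  print(attention_workspace(1000))  # 1000-token prefix: 768000 scalar slots

The absolute numbers are arbitrary; the point is that the readable "workspace" grows linearly with the prefix length, so a longer sequence gives the model more places to stash intermediate computation.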

