
My intuitive understanding of transformers is that each token (input or output) gives the model some "thinking space" that can be used to store reasoning information (independently of the token itself): thus, a transformer should be more clever when completing a 1000-token sequence than a 2-token sequence. This seems in line with the paper's findings.
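A rough way to see this intuition quantitatively: in a vanilla decoder, every layer keeps one hidden vector per prefix position (the KV cache), and the next token's attention step can read all of them. A minimal sketch below counts those slots; the function name, d_model, and layer count are illustrative assumptions, not from the comment or the paper.

  # Sketch: amount of cached hidden state the next token can attend over,
  # assuming a plain decoder-only transformer with a KV cache per layer.
  def attention_workspace(prefix_len: int, d_model: int = 64, n_layers: int = 12) -> int:
      # One d_model-sized vector per prefix position per layer.
      return prefix_len * d_model * n_layers

  print(attention_workspace(2))     # 2-token prefix:    1536 scalar slots
  print(attention_workspace(1000))  # 1000-token prefix: 768000 scalar slots

The absolute numbers are arbitrary; the point is that the readable "workspace" grows linearly with the prefix length, so a longer sequence gives the model more places to stash intermediate computation.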

