I think the general idea is that since partial embeddings can be copied laterally (between token positions) from one layer of a transformer to the next, additional work done at filler positions can also be copied into following real token positions. There's obviously a limit to how useful this can be, since filler tokens add parallel per-layer compute rather than extra sequential depth, and results from different experiments seem to be mixed. A rough back-of-envelope comparison is sketched below.
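To make the parallel-vs-sequential point concrete, here's a rough FLOP estimate (my own sketch, not from any paper; sizes are GPT-2-small-ish and the formulas are the standard approximations): filler tokens increase the work done per layer, but the number of sequential layer steps stays the same.

```python
# Back-of-envelope FLOPs for one decoder layer, to show filler tokens add
# *parallel* compute (more positions per layer) but no extra *sequential*
# depth (layer count is unchanged). Constants are illustrative.

def layer_flops(n_tokens, d_model, d_ff):
    attn = 2 * (2 * n_tokens**2 * d_model)        # QK^T scores + attention-weighted sum of V
    proj = 4 * (2 * n_tokens * d_model**2)        # Q/K/V/output projections
    mlp  = 2 * (2 * n_tokens * d_model * d_ff)    # two MLP matmuls
    return attn + proj + mlp

d_model, d_ff, n_layers = 768, 3072, 12           # GPT-2-small-ish sizes (assumed)
real, filler = 100, 50                            # hypothetical prompt + filler run

base   = n_layers * layer_flops(real, d_model, d_ff)
padded = n_layers * layer_flops(real + filler, d_model, d_ff)

print(f"extra compute from fillers: {padded / base:.2f}x")
print(f"sequential depth either way: {n_layers} layers")
```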
Still, I don't see how this really works... more compute / embedding transformations can potentially be applied to the prediction, but under what circumstances do these filler positions actually get used in a useful way? The filler token embeddings themselves presumably aren't matching attention keys, but the positional encodings of adjacent tokens will be similar, which is maybe what triggers lateral copying into (and perhaps out of) the filler positions?
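Just to make that speculation concrete, here's a toy single-head attention in NumPy (random weights, made-up dimensions, not a trained model): the key at a filler position is W_k applied to (filler embedding + positional encoding), so a later real token's query can only latch onto fillers via one of those two components, and since the filler embeddings are all identical, the per-position differences come entirely from the positional encodings.

```python
# Toy single-head attention illustrating where filler-position keys come from.
# Everything here (dims, sinusoidal encoding, random projections) is illustrative;
# it shows the mechanism's shape, not what a trained model actually learns.

import numpy as np

rng = np.random.default_rng(0)
d = 16
seq = ["real", "real", "fill", "fill", "fill", "real"]   # a run of filler tokens
n = len(seq)

def pos_enc(i, d):
    # standard sinusoidal positional encoding for position i
    k = np.arange(d // 2)
    angles = i / (10000 ** (2 * k / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

filler_emb = rng.normal(size=d) * 0.02        # one shared, near-zero filler embedding
real_embs = rng.normal(size=(n, d))           # distinct embeddings for real tokens

# residual-stream input at each position: token embedding + positional encoding
x = np.stack([
    (filler_emb if t == "fill" else real_embs[i]) + pos_enc(i, d)
    for i, t in enumerate(seq)
])

W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)

# query from the last real token; it may attend over the entire prefix,
# including the filler positions
q_last = x[-1] @ W_q
scores = (x @ W_k) @ q_last / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

for t, w in zip(seq, weights):
    print(f"{t}: {w:.3f}")
```

With identical filler embeddings, whatever spread you see across the three "fill" weights is driven by the positional encodings, which is the only handle a learned query would have for copying into or out of specific filler positions.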