Oh huh. Why not make it stateful, like reuse the previous computation and only compute the “diff” when you add a new token? Assuming it’s not that easy because each new token can affect attention globally.
I think I’ve read something about this but I wonder if you could abstract attention to sentence/page levels and then only recalculate the parts that are relevant.
I.e. the KV cache is 'just' a time-saving measure, because the LLM would otherwise go back and recalculate those same values anyway. (Which is why per-token compute grows quadratically with context length without it.)
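To make the "diff" idea concrete, here's a minimal single-head sketch (NumPy, no batching, no actual model). The names `Wq`/`Wk`/`Wv`, `attend`, and `step` are illustrative, not from any library; the point is just that the cached K/V are exactly what a full pass over the prefix would recompute.

```python
import numpy as np

D = 64  # head dimension, arbitrary for the sketch
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = K @ q / np.sqrt(D)           # one score per previous token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # weighted sum of cached values

def step(x_new, K_cache, V_cache):
    """Process one new token: compute only its q/k/v, reuse the cache."""
    q = Wq @ x_new
    K_cache = np.vstack([K_cache, (Wk @ x_new)[None, :]])
    V_cache = np.vstack([V_cache, (Wv @ x_new)[None, :]])
    return attend(q, K_cache, V_cache), K_cache, V_cache

# Feeding tokens one at a time with the cache gives the same outputs as
# recomputing attention over the whole prefix at every step; the cache just
# avoids redoing the K/V projections for the prefix, it doesn't change the math.
K_cache, V_cache = np.empty((0, D)), np.empty((0, D))
for x in rng.standard_normal((5, D)):     # five dummy token embeddings
    out, K_cache, V_cache = step(x, K_cache, V_cache)
```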
You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:
a) be far more compute-intensive to train and run (especially to train),
b) be susceptible to all of the issues that RNNs have, and
c) most importantly, it would almost certainly just converge with transformers at scale. Labs run small-scale internal tests of architectures all the time, and most of them basically come to this conclusion and abandon it.