After reading all these, I've landed at the following conclusion:
Regardless of its architecture, there is only a finite amount of information a language model can work with at any given time. Which way of "forgetting" causes the least problems depends on the task at hand.
For coding and math, a precise context with a well-defined maximum length of 16k–256k tokens, paired with high-quality ICL, would work better than automated "random" forgetting. However, it requires a good strategy for selecting only the task-relevant information so that it fits within the maximum context length.
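One simple version of such a strategy is greedy context packing: score candidate snippets by relevance and keep the best ones that still fit the token budget. A minimal sketch, assuming a crude characters-per-token estimate (all names and the scoring inputs here are hypothetical, not any particular tool's API):

```python
# Hypothetical greedy context-packing sketch: keep the most relevant
# snippets that still fit within a fixed token budget.

def estimate_tokens(text: str) -> int:
    # Crude approximation: roughly 4 characters per token for English-like text.
    return max(1, len(text) // 4)

def pack_context(snippets: list[tuple[float, str]], budget: int) -> list[str]:
    """Select (relevance, text) snippets greedily, highest relevance first,
    skipping anything that would overflow the token budget."""
    chosen: list[str] = []
    used = 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen

snippets = [
    (0.9, "def relevant_function(): ..."),
    (0.2, "unrelated changelog entry " * 50),  # long and low-relevance
    (0.7, "docstring for the module under test"),
]
print(pack_context(snippets, budget=40))
```

Real systems would use an actual tokenizer and a learned relevance score (e.g. embedding similarity), but the budget-constrained selection step looks the same.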
For free-form literature and other non-technical work, automated forgetting is likely beneficial: you don't need to come up with a strategy for choosing what stays in context. What you get is gradual forgetting and a "mixing up" of past memories, much like in humans.
Since I'm a software-developer geek I strongly prefer the first approach, but as you can see, it depends on the task at hand.