If you want to really dive into the math and motivation for this entirely new class of models (state space models), I highly recommend reading Albert Gu’s thesis.
I tried to read the Mamba paper and was lost, but after reading his thesis it was a lot easier to understand.
The part I’m still struggling with is the section on how Mamba trains so efficiently. It’s surprising, since the model is autoregressive and updates its state step by step, and I thought that was exactly why LSTMs were dropped.
Due to the particular form of the recurrent update of the hidden state, there's actually a parallel algorithm for computing the recurrence over a length-N sequence in O(log N) time via dynamic programming. Note, you don't save FLOPs, you just save "sequential depth" in the computation through clever parallelization. It's sort of an extension of the fast parallel scan described here: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co....
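To make that concrete, here's a minimal toy sketch (my own illustrative code, not the actual Mamba kernel) of how a diagonal recurrence h_t = a_t * h_{t-1} + b_t can be treated as a scan over pairs (a_t, b_t) with an associative combine, so a divide-and-conquer scan reaches every h_t in O(log N) sequential depth while doing roughly the same total work. The names combine, sequential_scan, and parallel_scan are just placeholders I made up:

    import numpy as np

    # Each step h_t = a_t * h_{t-1} + b_t is an affine map h -> a_t * h + b_t.
    # Composing two such maps is associative, which is what makes a
    # parallel prefix scan possible.
    def combine(left, right):
        a1, b1 = left
        a2, b2 = right
        return a1 * a2, a2 * b1 + b2

    def sequential_scan(a, b):
        # Reference implementation: N strictly sequential steps.
        h, out = 0.0, []
        for a_t, b_t in zip(a, b):
            h = a_t * h + b_t
            out.append(h)
        return np.array(out)

    def parallel_scan(pairs):
        # Recursive inclusive scan: O(log N) levels, each level fully
        # parallelizable. Total work is still O(N) combines, so no FLOPs saved.
        n = len(pairs)
        if n == 1:
            return list(pairs)
        reduced = [combine(pairs[2 * j], pairs[2 * j + 1]) for j in range(n // 2)]
        scanned = parallel_scan(reduced)   # scanned[j] = prefix of pairs[0 .. 2j+1]
        out = [None] * n
        out[0] = pairs[0]
        for j in range(n // 2):
            out[2 * j + 1] = scanned[j]
            if 2 * j + 2 < n:
                out[2 * j + 2] = combine(scanned[j], pairs[2 * j + 2])
        return out

    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.0, size=7)      # per-step decay coefficients
    b = rng.normal(size=7)                 # per-step inputs (e.g. B_t * x_t)
    h_seq = sequential_scan(a, b)
    h_par = np.array([h for _, h in parallel_scan(list(zip(a, b)))])
    print(np.allclose(h_seq, h_par))       # True

The real implementation fuses this into a single GPU pass over the sequence, but the structure is the same: the recurrence is recovered from prefix compositions of per-step affine maps, and those compositions can be tree-reduced in parallel.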
https://searchworks.stanford.edu/view/14784021