SSMs' strengths are with continuous data like audio and video; they struggle with discrete data like text and DNA. This newest architecture uses a selection mechanism to try to address the weaknesses around discrete data, with some loss of performance on continuous tasks, shown empirically here with audio. The empirical exploration was limited to smaller models; performance at larger scales has yet to be explored in practice.
I found my deepest understanding of the selection mechanism came from struggling with the discretization in section 2, followed by the deeper explanations of the variables involved in 3.5.2. This video gives excellent background on SSMs, along with a detailed walkthrough of the paper itself.[a]
I am still coming to understand S4 and SSMs in general, but the video suggested this annotated explainer, which has been helping a lot.[b]
I would also point out section 3.1 and its discussion of the tradeoffs between compression and effectiveness as particularly interesting.
I do wonder how many different GPUs / hardware architectures will be able to execute the optimizations that are described as critical. I think the nature of the optimizations is the part of the paper I understand least well.
Taken at face value, the paper looks very exciting. The promise of a very large context window with 5x inference throughput would be huge if it proves to scale well. I do wonder whether it will make sense to train SSMs without this selection mechanism for specific continuous use cases where they seem to perform better, or whether other architectures will prove to better serve those cases.
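To make the selection mechanism concrete for myself, here is a much-simplified, single-channel sketch in JAX. This is my own illustration, not the authors' code; the names and shapes are assumptions. The point is just that B, C, and the step size delta become functions of the current input rather than fixed parameters, so the recurrence can decide per token what to write into and read out of its state:

```python
import jax
import jax.numpy as jnp

def selective_step(h, x_t, A_diag, w_B, w_C, w_delta, b_delta):
    # h: (d_state,) hidden state for one channel; x_t: scalar input.
    # Input-dependent ("selected") parameters -- the core change vs. plain S4.
    B_t = w_B * x_t                                     # (d_state,)
    C_t = w_C * x_t                                     # (d_state,)
    delta_t = jax.nn.softplus(w_delta * x_t + b_delta)  # scalar step size

    # Per-token discretization of a diagonal A (simplified).
    A_bar = jnp.exp(delta_t * A_diag)                   # (d_state,)
    B_bar = delta_t * B_t

    # Usual SSM recurrence, but with parameters that now depend on x_t.
    h_new = A_bar * h + B_bar * x_t
    y_t = jnp.dot(C_t, h_new)
    return h_new, y_t
```

Scanning this step over a sequence is the recurrent (inference-time) view; the hardware-aware part of the paper is about computing the same thing efficiently during training, which is the part I understand least.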
The distinction I remember from the paper is that discrete data (text, DNA) could be modeled effectively with only real-valued components in the state, while continuous data (audio, video) benefited from complex numbers. This is evidently summarized from the authors' statements characterizing existing work and preexisting wisdom on SSM/S4 models.
> Other model details: the authors note that most prior state space models use complex numbers in their state, but it has been empirically observed that completely real-valued state space models seem to work fine, and possibly even better in some settings, so they use real values as the default, which works well for all but one of their tasks.

Just following this, another impressive snippet:

> It succeeds on test sequence lengths of up to a million tokens, which is 4,000 times longer than it saw during training, while none of the other methods compared to generalize beyond twice their training length.
If you want to really dive into the math and motivation for this entirely new class of models (state space models), I highly recommend reading Albert Gu’s thesis.
I tried to read the Mamba paper and I was lost, but after reading his thesis it was a lot easier to understand.
The part I’m still struggling with is the section on how Mamba trains so efficiently. It’s surprising, since the model is actually autoregressive and steps through the sequence one element at a time, and I thought that was why LSTMs were dropped.
Due to the particular form of the recurrent update of the hidden state, there's actually a parallel algorithm for computing the recurrence over length N in log(N) time via dynamic programming. Note, you don't save FLOPs, you just save "sequential depth" in the computation through clever parallelization. It's sort of an extension of the fast parallel scan described here: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co....
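A toy version of that idea, as a sketch rather than the paper's actual kernel: a linear recurrence h_t = a_t * h_{t-1} + b_t can be computed for every t with an associative scan (here via jax.lax.associative_scan), because composing two affine updates is itself an affine update:

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    # Composing h -> a1*h + b1 followed by h -> a2*h + b2
    # gives h -> (a1*a2)*h + (a2*b1 + b2), which is again affine.
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def parallel_linear_recurrence(a, b):
    # a, b: arrays of shape (N,). Returns h_1..h_N assuming h_0 = 0.
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h

# Sanity check against the sequential recurrence.
a = jnp.array([0.9, 0.8, 0.7, 0.6])
b = jnp.array([1.0, 2.0, 3.0, 4.0])
h_par = parallel_linear_recurrence(a, b)

h, h_seq = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    h_seq.append(h)
print(jnp.allclose(h_par, jnp.array(h_seq)))  # True
```

The total work is comparable, but the sequential depth drops from N to O(log N), which is what the parent comment means by saving "sequential depth" rather than FLOPs.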
“Everywhere we tried” does not include actually beating the performance of (or getting particularly close to) SOTA transformers of the same size. That said, these models were trained for less time than the SOTA models, so there is promise.
What models in particular do you think it should have been compared to? It seems like they chose some of the most understood / cited models at around 1 B parameters.
Question for anyone reading the paper: what exactly is the notation in these two stages below supposed to mean?
y(t) = C h(t)    (1b)
y_t = C h_t    (2b)
This is from the section describing state space models. The paragraphs before seem to describe, e.g., y(t) as presumably the real-valued element of the function/sequence y evaluated at time t. However, it then adds the y_t notation (note the t is supposed to be a subscript) for what looks to be a similar equivalence, and I’m not sure how that’s supposed to differ in meaning from y(t). Am I dumb?
y(t) is likely continuous, and y_t, the subscript form, is likely the discrete version.
For example, it is common in digital devices to sample something continuous (analog) at regular intervals and use whatever value was captured at that moment for whatever the goal is. Ideally, if the signal is sampled frequently enough, you can even recreate the continuous version from the sampled values.
As the other comment suggested, it's the discrete subscript. Normally t is notation for continuous time, and k the discrete time subscript, but all this is just convention.
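Right, and to spell out the connection for anyone else puzzling over it: the paper goes from the continuous y(t) = C h(t) to the discrete y_t = C h_t by fixing a step size Δ and discretizing A and B (the zero-order-hold rule in section 2). A rough sketch of what that looks like, with made-up shapes and assuming A is invertible (my own illustration, not the authors' code):

```python
import jax.numpy as jnp
from jax.scipy.linalg import expm

def discretize_zoh(A, B, dt):
    # Zero-order hold: A_bar = exp(dt*A), B_bar = A^{-1} (exp(dt*A) - I) B.
    n = A.shape[0]
    A_bar = expm(dt * A)
    B_bar = jnp.linalg.solve(A, (A_bar - jnp.eye(n)) @ B)
    return A_bar, B_bar

def run_discrete_ssm(A_bar, B_bar, C, xs):
    # Discrete recurrence: h_t = A_bar h_{t-1} + B_bar x_t ;  y_t = C h_t.
    # xs is the input sampled on the grid t = dt, 2*dt, 3*dt, ...
    h = jnp.zeros(A_bar.shape[0])
    ys = []
    for x in xs:
        h = A_bar @ h + B_bar * x   # x is a scalar sample; B_bar has shape (n,)
        ys.append(C @ h)
    return jnp.stack(ys)
```

So y_t plays the role of the continuous output sampled on that grid, which is why the two equations look almost identical apart from the subscript.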
Out of curiosity, does anyone feel as though there's any benefit to linking to reddit when we can link to whatever the reddit post actually links to? I for one do not click the link and read the discussion on reddit - if I wanted that sort of discussion, I would browse there, not HN.
> does anyone feel as though there's any benefit to linking to reddit
You were certainly right in this case, but I'm not ready to ban reddit.com because I think there are still some positive threads that come in through that window. If anyone disagrees strongly enough to peruse https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... and make the case against, the court will grant standing :)
Speaking purely for myself (curious what other people think), my preference is _always_ for the link to go to the most "principal" source. Then, in the comments, add a link to a pop news article or the comments section.
I hate links to twitter posts; it feels like undeserved promotion of Elon Musk. Just find what the tweet links to, or the actual research paper, and submit that.
Links to reddit are okay, it's devolving, for sure, but depending on topic sometimes I find more intelligent comments there. (RARELY)
> I hate links to twitter posts; it feels like undeserved promotion of Elon Musk. Just find what the tweet links to, or the actual research paper, and submit that.
Treat knowledge with respect regardless of where you get it from.
We might be on the cusp of dethroning the attention & transformer dynasty! Anyone have context on how the Mamba architecture fares with multimodal models?
----
[a] https://www.youtube.com/watch?v=ouF-H35atOY
[b] https://srush.github.io/annotated-s4/