[dupe] Mamba outperforms transformers "everywhere we tried" (reddit.com)
66 points by behnamoh on Dec 11, 2023 | 25 comments


Still chewing but my takeaways thus far:

SSM models' strengths are with continuous data like audio and video; they struggle with discrete data like text and DNA. This newest architecture uses a selection mechanism to try to address the weaknesses around discrete data, with some loss of performance on continuous tasks, shown empirically here with audio. The empirical exploration was limited to smaller models, so performance at larger scales has yet to be explored in practice.

I found my deepest understanding of the selection mechanism came from struggling with the discretization in section 2, followed by the deeper explanation of the variables involved in section 3.5.2. This video gives excellent background on SSMs, along with a detailed walkthrough of the paper itself [a].

I am still coming to understand S4 and SSMs in general, but the video suggested this annotated explainer, which has been helping a lot [b].

I would also point out section 3.1 and its discussion of the tradeoffs between compression and effectiveness as particularly interesting.

I do wonder how many different GPUs / hardware architectures will be able to execute the optimizations that are described as critical. I think the nature of the optimizations is the part of the paper I understand least well.

Taken at face value, the paper looks very exciting. The promise of a very large context window with 5x inference throughput would be huge if it proves to scale well. I do wonder whether it will make sense to train SSMs without this selection mechanism for specific continuous use cases, where it seems to perform better, or whether other architectures will prove to serve those cases better.

----

[a] https://www.youtube.com/watch?v=ouF-H35atOY

[b] https://srush.github.io/annotated-s4/


The distinction I remember from the paper is that discrete data (text, DNA) could be modeled effectively with only real-valued components in the state, while continuous data (audio, video) benefited from complex numbers. This is evidently summarized from the authors' statements characterizing existing work and preexisting wisdom on SSM/S4 models.

https://youtubetranscript.com/?v=ouF-H35atOY (the same video; I'd watched it prior to seeing this HN post)

  Other model details: the authors note that most prior state space models use
  complex numbers in their state, but it has been empirically observed that
  completely real-valued state space models seem to work fine, and possibly
  even better in some settings, so they use real values as the default, which
  works well for all but one of their tasks.
Just following this, another impressive snippet:

  It succeeds on test sequence lengths of up to a million tokens, which is
  4,000 times longer than it saw during training, while none of the other
  methods compared generalize beyond twice their training length.


[flagged]


Who are you and why should we take your questions so seriously?


If you want to really dive into the math and motivation for this entirely new class of models (state space models), I highly recommend reading Albert Gu's thesis.

https://searchworks.stanford.edu/view/14784021

I tried to read the Mamba paper and was lost, but after reading his thesis it was a lot easier to understand.

The part I'm still struggling with is the section on how Mamba trains so efficiently. It's surprising, since the model is actually autoregressive and runs step by step, and I thought that was why LSTMs were dropped.


Due to the particular form of the recurrent update of the hidden state, there's actually a parallel algorithm for computing the recurrence over length N in log(N) time via dynamic programming. Note: you don't save FLOPs, you just save "sequential depth" in the computation through clever parallelization (a rough sketch is below). It's sort of an extension of the fast parallel scan described here: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co....
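To make that concrete, here's a minimal NumPy sketch, my own illustration rather than the actual Mamba kernel (which is a hardware-aware, work-efficient scan over the full SSM parameters), of how a linear recurrence h_t = a_t*h_{t-1} + b_t can be evaluated with a log-depth associative scan. The doubling formulation below does more total work than the sequential loop, but needs only about log2(N) sequential steps, which is the depth-vs-FLOPs tradeoff mentioned above.

    import numpy as np

    def combine(left, right):
        # Compose two affine maps h -> a*h + b (left applied first, then right).
        a1, b1 = left
        a2, b2 = right
        return a1 * a2, a2 * b1 + b2

    def sequential_recurrence(a, b):
        # Reference: h_t = a_t * h_{t-1} + b_t, with h_{-1} = 0.
        h, out = 0.0, np.empty_like(b)
        for t in range(len(b)):
            h = a[t] * h + b[t]
            out[t] = h
        return out

    def parallel_scan(a, b):
        # Inclusive prefix scan over `combine` (Hillis-Steele doubling):
        # O(N log N) total work, but only O(log N) sequential steps.
        a, b = a.copy(), b.copy()
        n, shift = len(b), 1
        while shift < n:
            # Combine each element with the one `shift` positions to its left;
            # the first `shift` elements pair with the identity map (1, 0).
            a_left = np.concatenate([np.ones(shift), a[:-shift]])
            b_left = np.concatenate([np.zeros(shift), b[:-shift]])
            a, b = combine((a_left, b_left), (a, b))
            shift *= 2
        return b  # cumulative maps applied to h_{-1} = 0

    rng = np.random.default_rng(0)
    a, b = rng.uniform(0.5, 1.0, 16), rng.normal(size=16)
    assert np.allclose(sequential_recurrence(a, b), parallel_scan(a, b))

The GPU Gems chapter linked above covers the work-efficient (Blelloch) variant, which brings the total work back down to O(N) while keeping the logarithmic depth.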


Title is misleading.

“Everywhere we tried” does not include actually beating (or getting particularly close to) the performance of SOTA transformers of the same size. That said, these models were trained for less time than the SOTA models, so there is promise.


What models in particular do you think it should have been compared to? It seems like they chose some of the most understood/cited models at around 1B parameters.


This is the reddit title almost verbatim. If you have a problem with that, you'd better leave a comment in that thread.


“please use the original title, unless it is misleading or linkbait” --https://news.ycombinator.com/newsguidelines.html


Yes, that's right. And I used the original title (though it was way longer).


That part was accurately in scare quotes.


Question for anyone reading the paper: what exactly is the notation in these two equations below supposed to mean?

    y(t) = Ch(t)    (1b)
    yₜ = Chₜ        (2b)
This is from the section describing state space models. The paragraphs before seem to describe, e.g., y(t) as presumably the real-valued element of the function/sequence y evaluated at time t. However, it then adds the subscript notation yₜ for what looks to be a similar relationship, and I'm not sure how that's supposed to differ in meaning from y(t). Am I dumb?


y(t) is likely continuous, and yₜ, the subscript form, is likely the discrete version.

For example, it is common in digital devices to sample something continuous (analog) every unit of time and use whatever value was captured at that moment for whatever the goal is. Ideally, if the signal is sampled frequently enough, you can even recreate the continuous version from the values. A small sketch of how this plays out for the SSM equations is below.
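In the SSM setting, this is what the step from equations (1) to (2) does: pick a step size Δ and discretize. Here is a minimal scalar sketch, assuming the zero-order-hold discretization used in the S4/Mamba line of work; the values of a, b, c, and delta are made up for illustration.

    import numpy as np

    # Scalar SSM (illustrative values):
    #   continuous: h'(t) = a*h(t) + b*x(t),        y(t) = c*h(t)    -- eq. (1)
    #   discrete:   h_t = abar*h_{t-1} + bbar*x_t,  y_t = c*h_t      -- eq. (2)
    # Zero-order hold at step size `delta`:
    #   abar = exp(delta*a),  bbar = (exp(delta*a) - 1)/a * b
    a, b, c = -0.5, 1.0, 2.0
    delta = 0.01                  # sampling interval; smaller -> closer to continuous
    ts = np.arange(0.0, 10.0, delta)
    x = np.sin(ts)                # continuous input x(t), sampled every `delta`

    abar = np.exp(delta * a)
    bbar = (abar - 1.0) / a * b

    h = 0.0
    y = np.empty_like(ts)
    for i, xt in enumerate(x):
        h = abar * h + bbar * xt  # discrete state update, h_t
        y[i] = c * h              # y_t, roughly y(t) sampled at t = i*delta

As delta shrinks, the sampled yₜ tracks the continuous y(t) ever more closely, which is the "recreate the continuous version" intuition above.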


Thanks for clarifying that! I think this makes sense. I wish I had a stronger math background so that this kind of thing wouldn't trip me up as much.


As the other comment suggested, it's the discrete subscript. Normally t is notation for continuous time, and k the discrete time subscript, but all this is just convention.

yₖ = Chₖ


This is a reddit post, which links to a Twitter post [0], which links to this paper [1] and this repo [2].

[0] - https://nitter.net/_albertgu/status/1731727672286294400

[1] - https://arxiv.org/abs/2312.00752

[2] - https://github.com/state-spaces/mamba

Out of curiosity, does anyone feel there's any benefit to linking to reddit when we can link to whatever the reddit post itself links to? I, for one, do not click the link and read the discussion on reddit; if I wanted that sort of discussion, I would browse there, not HN.


Ah yes, plus your link [1] already had significant attention here (thanks!):

Mamba: Linear-Time Sequence Modeling with Selective State Spaces - https://news.ycombinator.com/item?id=38522428 - Dec 2023 (34 comments)

> does anyone feel as though there's any benefit to linking to reddit

You were certainly right in this case, but I'm not ready to ban reddit.com because I think there are still some positive threads that come in through that window. If anyone disagrees strongly enough to peruse https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... and make the case against, the court will grant standing :)


I am with you; I don't see the point of linking to the reddit post in a situation like this.


Speaking purely for myself (curious what other people think), my preference is _always_ for the link to go to the most "principal" source. Then, in the comments, add a link to a pop news article or the comments section.

I hate links to twitter posts, feels like undeserved promotion of Elon Musk. Just find what the twitter links to, or the actual research paper, and submit that.

Links to reddit are okay; it's devolving, for sure, but depending on the topic I sometimes find more intelligent comments there. (RARELY)


> I hate links to twitter posts, feels like undeserved promotion of Elon Musk. Just find what the twitter links to, or the actual research paper, and submit that.

Treat knowledge with respect regardless of where you get it from.



We might be on the cusp of dethroning the attention & transformer dynasty! Anyone have context on how the Mamba architecture fares with multimodal models?



This is pretty awesome and I'm curious to try the code out. Has anybody been able to run this locally?


dead in the water.



