Mamba Explained: The State Space Model Taking On Transformers

Straw · on Feb 25, 2024

The SSMs papers and blogs always have unnecessarily complicated explanations. At this point I almost wonder if its to hide how simple the underlying algorithms are, or to make them seem fancy.

SSMs are doing exponentially weighted moving averages (EMA). That's it- to summarize the past, you take an EMA of a variable output at each time step. Mamba changes one key thing- instead of decaying the past by a fixed amount each step as in a constant-time EMA, we have another output which decides how much to forget, or equivalently, how much 'time' has passed since the last observation in our EMA.

All of the matrix equations, continuous time, discretization, etc, will end up with a dynamic-forgetting EMA as I describe above. This also makes the benefits and limitations clear- finite state size, has to decide at a given layer what to forget before it sees the past at that layer.

logicchains · on Feb 25, 2024

Are there any fundamental differences between Mamba, Retnet and RWKV, or are they all variants of this same architecture?

Straw · on Feb 26, 2024

No, all of these use the same fundamental architecture with minor tweaks, such as the dynamic gate for mamba or an outer product paramterization of the values for RWKV-v5

pama · on Feb 26, 2024

A dynamic gate is a pretty distinct feature from previous SSM architectures in my opinion. In a sense, the overall fundamental architecture of mamba is still that of the transformer but with attention replaced by an SSM with dynamic gating. All of deep learning uses closely related ideas, but the SSM class of models took advantage of stability guarantees from integrators in control theory and created a class of RNN that don’t have to worry about exploding gradients. Mamba is one of the ways to make these SSM models much more expressive.

Straw · on Feb 29, 2024

Its distinct, but not very- its an EMA without assuming uniform time. The stability of EMA has nothing to do with integrators in control theory and neither do these models.

These models aren't really RNNs- they have only a linear gate which cannot depend on previous tokens at this layer, so they cant update their state in a way which depends on the current state very much.

ogogmad · on Feb 25, 2024

That might explain the motivation for why the Δ variable is used and varied; but not the "Selectivity", which the article says is expressed by how the matrices B and C vary while consuming input.

Something I've noticed is that B, C and Δ depend only on the current token. See this: https://www.kolaayonrinde.com/blog/images/mamba/ssm_algorith... -- Another thing is that I've noticed that the definition of "SSM" in the image I've linked to is apparently recursive. This is also in the Arxiv paper. Strange.

+1 though for making me go back to the article and read it more carefully! +1 also to the article.

ogogmad · on Feb 25, 2024

OK, I've noticed that the pseudo-code above is vectorised, and so there's no recursion. The SSM function is actually described at the start of the paper, and an efficient hardware-aware implementation is suggested in section 3: https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf

binarymax · on Feb 25, 2024

I hadn’t heard of Mamba before reading this article, and I was wondering if anyone has tried setting importance of a token as a TF-IDF or BM25 lookup. Requires a first pass to construct the token index but otherwise it seems like it would address the big issue that all these architectures have - they don’t know how “important” a token is. Interestingly this seems to be the crux of Mamba - deciding what tokens to forget! EMA other treats all tokens equally at sequence time. What if the tokens were weighted beforehand and the weights were passed as an attention mechanism? I wonder if anyone has tried something like this.

halflings · on Feb 25, 2024

The importance (e.g. attention) needs to be dynamic, e.g. one token will be important to some other tokens but not others.

tf-idf and similar heuristics are what we were using before attention came along, e.g. tf-idf weighted bag-of-words representation of word2vec embeddings. That approaches fails in so many cases.

binarymax · on Feb 25, 2024

Attention in transformers works because over time the model learns token importance based on frequency and context.

If you don’t have attention and need a fast substitute for “forgetting” non important tokens, then BM25 is an intuitive hypothesis.

curious_cat_163 · on Feb 25, 2024

To use your metaphor, TF-IDF will result in ‘fixed’ weights.

Attention makes it so that the weights of each token can be different in each sequence of tokens. Same token gets different weights depending on who its ‘neighbors’ in the sequence end up being.

This property allows the models to solve a variety of natural language problems and gets ‘used’ by the model to express context-aware dependencies.

littlestymaar · on Feb 26, 2024

Given that GP explicitly said “if you don't have attention”, and we're in a thread about a language model whose main characteristics is not to use attention, I don't understand why you insist in talking about attention …

curious_cat_163 · on Feb 26, 2024

I mean, if we are going to get past attention (very much on board with the idea!), then it might help to know what it is really contributing to a model.

My response was trying to clarify some confusion.

I am all for alternatives to attention. I don’t think BM25 cuts it. I don’t think anything that samples tokens based on BM25 weights (the idea in this subthread) would cut it.

binarymax · on Feb 26, 2024

What confusion? I know exactly how BM25 works and how Transformers work. I stated a hypothesis and asked if anyone has tried it. You say it won’t work. That’s just your opinion. Do you have proof or evidence? This is science. Dismissal of ideas without evidence goes against scientific principles.

curious_cat_163 · on March 2, 2024

Just catching up to this thread again. You had said:

"I was wondering if anyone has tried setting importance of a token as a TF-IDF or BM25 lookup."

So, I take it back. This is not a confusion. You are right to call it out. :)

I like this idea directionally. A lot of energy (literally) would be saved if we could get to the model accuracy outcomes with static weights like this.

However, I do think that this (as stated in your original message) would not work as well as transformer or SSM and I explained my reasoning as to why, already. I don't have an empirical proof (not having run the experiment) but if you believe in it, you should try it and share your findings.

nelsondev · on Feb 26, 2024

Not exactly related, but in the same vein - Deep Impact - deep learning to find term impacts in the context of their document.

https://arxiv.org/abs/2104.12016

torginus · on Feb 26, 2024

Is this analogous to digital filters, where Transformers are the FIR filters that operate on the history of input, and IIR filters, which take past inputs into account with an exponentially decaying importance?

CrypticShift · on Feb 25, 2024

> In other words, you can drag and drop downloaded states into your model, like literal plug-in cartridges

The same could be said of "control vectors" [1]. Both ideas are still experimental, but is seems to me IINM that they could replace "system prompts" and "RAG" respectively.

[1] https://news.ycombinator.com/item?id=39414532

refulgentis · on Feb 25, 2024

Can control vectors replace RAG?

i.e. if I want the model to give me a summary of the news today, and the model was trained before today, can control vectors help?

p1esk · on Feb 25, 2024

No technique can get you the news other than actually searching for and then parsing the published news.

refulgentis · on Feb 25, 2024

Can a control vector replace system prompts?

i.e. can it do in-context learning without the context?

jncfhnb · on Feb 25, 2024

It more or less is the same as a system prompt

refulgentis · on Feb 25, 2024

So, no

jncfhnb · on Feb 26, 2024

So, yes, but not in a meaningfully different form

Der_Einzige · on Feb 25, 2024

Whoever is downvoting this post needs to stop.

The concepts behind control vectors, i.e. "representation engineering" are not especially new and have been highly effective in the diffusion space. I always find it entertaining when LLM folks act like they're discovering stuff that waifu stable diffusion folks knew for 6 months + about - like "concept slider loras".

CuriouslyC · on Feb 25, 2024

You are right that playing with AI image generation models is really good for building intuition about AI models in general, even if they seem superficially different. It's kind of like surveying a battlefield from the air.

refulgentis · on Feb 25, 2024

I don't know what you mean, can you help me?

I'm familiar with our intrepid stable diffusion sailors.

I don't know why you think the post is being downvoted.

I don't know why it would be verboten to downvote it, or indicative of the downvoter being an LLM fanatic who thinks they discovered everything.

I am puzzled by the post because it claims RAG can be replaced by control vectors.

I'm also puzzled because it claims prompts can be replaced by control vectors.

I get that if system prompts were only to shift output tone, control vectors could replace that case, but that seems narrow compared to the full set of things prompt input enables (inter alia, the in-context learning)

jncfhnb · on Feb 25, 2024

Most of these things aren’t much better than a single weighted token though

Der_Einzige · on Feb 25, 2024

First it was longformer, and linear attention models. Then it was RWKV and now it's Mamba. So many bombastic claims of improved architectural performance - and no open source models that beat the thing they purport to beat. The proof is always in the pudding, and these models will remain a curiosity for most until their weights are being benchmarked favorably on LLM leaderboards.

digdugdirk · on Feb 25, 2024

Yes, that's technically accurate. But I prefer to think of the entire LLM space as a new scientific field that started when OpenAI released ChatGPT.

In that context, all new research directions are valuable simply for the fact that they're expanding the foundation of the field. 5 years from now, who knows what the most effective models will use under the hood, but the more we can learn about them in general, the better.

lettergram · on Feb 25, 2024

lol I think in general, LLM research traces its origins back to all the standard deep learning techniques: NNs, CNNs, LSTMs, RNNs, etc.

In 2018, with the release of transformers (via google) it enabled much more rapid training of models and more generalization with less data. 100% of the LLMs (as you’d probably thing of them)trace their origins to BERT.

That said, my team was working with hundred million to low billions of parameter LSTMs & CNNs back in 2016-2017 that were comparable to some lighter weight LLMs today.

In my opinion, the greatest strides in the space has less to do with the underlying architecture, and more to do with improved data formatting, accessibility and compute improvements.

CityOfThrowaway · on Feb 25, 2024

The field of research here is far older than ChatGPT's release. Neural network research has been going on for at least 50 years.

Most of the research that enabled ChatGPT was also already known. "Attention is all you need" was a 2017 paper.

It still is a fast evolving field, but not one that just kicked off.

sigmoid10 · on Feb 25, 2024

True, but bear in mind the Mamba preprint is less than three months old. A lot of people are probably experimenting with these ideas right now and training a completely new, large foundation model with a different architecture will take a significant amount of time.

nickpsecurity · on Feb 26, 2024

GPT3-176B cost $30 million dollars in compute plus millions in design, preprocessing, and operations. Then, it was able to perform as much better than prior architectures as it does today. You might want to include that in your challenge for competing models.

Let’s rephrase it. If their architecture is superior, and they have $30 million dollars, and similar preparation for training, and similar operational teams during training, then we can see if they can beat the model they’re comparing themselves to. Except, the alternatives don’t have tens of millions of dollars with the best support teams. So, the proof you seek hasn’t had a chance to happen due to severe lack of resources.

Hence, comparisons to GPT2 and small versions of GPT3. Even that might not be fair given the money and teams behind even small GPT3’s. Execution of the project is as critical for success as the model architecture.

imjonse · on Feb 25, 2024

Most (all?) open-ish 7B+ models today are finetunes of proprietary/semi-closed/bigbudget LLMs. There is no such foundation model for Mamba yet.

imjonse · on Feb 25, 2024

Explaining Mamba is a rite of passage, like the monad tutorials of yore.

SkyMarshal · on Feb 25, 2024

Mamba is like a burrito...

kekebo · on Feb 25, 2024

It gets soggy and disintegrates when not consumed swiftly?

sja · on Feb 25, 2024

Or Balks[0]:

BALK RULES! IMPORTANT! 1. You can’t just be up there and just doin’ a balk like that.

1a. A balk is when you

1b. Okay well listen. A balk is when you balk the

1c. Let me start over

1c-a. The pitcher is not allowed to do a motion to the, uh, batter, that prohibits the batter from doing, you know, just trying to hit the ball. You can’t do that.

1c-b. Once the pitcher is in the stretch, he can’t be over here and say to the runner, like, “I’m gonna get ya! I’m gonna tag you out! You better watch your butt!” and then just be like he didn’t even do that.

1c-b(1). Like, if you’re about to pitch and then don’t pitch, you have to still pitch. You cannot not pitch. Does that make any sense?

1c-b(2). You gotta be, throwing motion of the ball, and then, until you just throw it.

1c-b(2)-a. Okay, well, you can have the ball up here, like this, but then there’s the balk you gotta think about.

1c-b(2)-b. Fairuza Balk hasn’t been in any movies in forever. I hope she wasn’t typecast as that racist lady in American History X.

1c-b(2)-b(i). Oh wait, she was in The Waterboy too! That would be even worse.

1c-b(2)-b(ii). “get in mah bellah” – Adam Water, “The Waterboy.” Haha, classic…

1c-b(3). Okay seriously though. A balk is when the pitcher makes a movement that, as determined by, when you do a move involving the baseball and field of

2. Do not do a balk please.

[0]: https://justinbee.tumblr.com/post/15309101943/best-explanati...

hyperbovine · on Feb 25, 2024

Similar market share too.

behnamoh · on Feb 25, 2024

Can the low adoption of Mamba be attributed to what is being discussed today on HN (https://news.ycombinator.com/item?id=39491863)?

Basically, Nvidia et al. don't want the AI research to move in a direction that requires less GPU compute, less training data, and less inference compute.

Someone on HN (I don't remember the name) mentioned that the idea of deep learning is backed by big tech because it benefits them the most as they are the only players in town with huge amounts of data. If the AI community would find entirely different approaches to AGI (maybe not even learning), who do you think would suffer the most from the implications?

p1esk · on Feb 25, 2024

This doesn’t make sense - there are literally thousands of academic AI research labs who are severely limited by compute resources. If anything could work better than transformers and require less compute they would be all over that.

behnamoh · on Feb 25, 2024

I guess the argument is that most AI research is supported by the big tech, and they have heavily invested in the deep learning approach.

If the fundings were funneled to research groups working on alternative approaches, maybe we'd see the same amount of progress in AI only using another approach.

kettleballroll · on Feb 25, 2024

As a member of the research community: that's nonsense. Like already pointed out: academic groups (who by no means are dependent on big tech) would jump all over that. Mamba has been out long enough that you'd already see tons of papers at arxiv showing mamba dominating transformers in all sorts of applications. But that's not happening, despite the ton of hype. That doesn't mean that mamba is nonsense. Just that it isn't the immediate transformer killer. It remains to be seen if something comes from it, eventually.

godelski · on Feb 25, 2024

As a member of the research community: that's nonsense. Publishing is an extremely noisy process in ML and is getting increasingly difficult for smaller non big tech collaborating labs. Reviewers' go to are: more datasets, scale, not novel. The easiest way to approach this is to work off of pretrained models. This is probably more obvious in the NLP world.

I agree that Mamba doesn't solve everything and it still needs work. But I disagree with the logic that there isn't an issue of railroading.

p1esk · on Feb 25, 2024

What’s the main difference between an ape’s brain and a human brain? Scale. So that’s the train we’re riding at the moment. No roadblocks yet, aside from cost.

godelski · on Feb 25, 2024

> What’s the main difference between an ape’s brain and a human brain? Scale.

This is incredibly naive with absolutely no scientific basis. There is no evidence that this is in scale of data nor scale of architecture.

There are a number of animals with larger brains in terms of both mass and total number of neurons. An African Elephant has roughly 3x the number of neurons humans have. Dolphins beat humans in total surface area. Neanderthals are estimated to have had larger brains too! It isn't mass, neurons, neuron density, surface area. We aren't just scaled up chimps.

p1esk · on Feb 25, 2024

Other animals with larger brains might have other bottlenecks preventing them from reaching full potential of their intelligence. Neanderthals might have been smarter than us, but went extinct for reasons not related to intelligence.

But my point stands - our brains have evolved directly from apes brains and the main difference between them and us is brain size.

godelski · on Feb 25, 2024

> Other animals with larger brains might have other bottlenecks

>>> What’s the main difference between an ape’s brain and a human brain? Scale.

Your argument is inconsistent. Very clearly everything isn't scale or we'd use other things besides transformers. Different architectures scale in different ways and everything has different inductive biases. No one doubts scale is important, but there's a lot more.

p1esk · on Feb 25, 2024

Scale is all we need for transformers (so far). It might also be all we need for ape brains. It’s not all we need for whatever elephant or dolphin brains evolved from.

When this stops being the case for transformers, we will need something else. I’m just pointing out it’s not the case yet.

godelski · on Feb 25, 2024

I see no evidence of this in biology nor in ML. I've read those scale papers. I've worked on scale myself. I'll bet the farm that scale isn't all you need. But I won't be surprised if people say that it is all scale.

If you really think it is all scale, train a 7T ResNet MLP based model for NLP. If scale is all you need, make a LLM without DPO or RLHF. If scale is all you need, make SD3 with a GAN. Or what about a VAE, Normalizing Flow, HMM? Do it with different optimizers. Do it with gradient free methods. Do it with different loss functions.

The bitter lesson wasn't "scale is all you need." That's just a misinterpretation.

Edit: It's fine to disagree. We can compete on the ideas and methods. That's good for our community. So continue down yours, and I'll continue down mine. All I ask is that since your camp is more popular, you don't block those on my side. If you're right, we'll get to AGI soon. If we're right, we still might. But if we're right and you all block us, we'll get another AI winter in between. If we're right and you all don't block us, we can progress without skipping a beat. Just don't put all your eggs in one basket. It isn't good for anyone.

p1esk · on Feb 25, 2024

I said “scale is all you need for transformers”. That has been true since GPT1. The best way to improve our best model today still seems to be “make it larger and train it on more data”.

If you disagree please suggest a better way, or at least provide evidence that scaling up no longer works for transformers.

algo_trader · on Feb 25, 2024

> at least provide evidence that scaling up no longer works for transformers.

Isnt the Mixture-of-Experts trend (GPT4 is MoE?) kinda of a proof ?

godelski · on Feb 26, 2024

Of scale? I would think not. I would say they are evidence against scale because they are more an argument for multi agent systems. Scale is about a singular framework. What that means is debatable though (I mean anything we call a singular network can be decomposed into sub networks. It's messy), hence the other part of my comment about not scale solutions being claimed as scale.

> I said “scale is all you need for transformers”

No you didn't. What kicked this all off was

> What’s the main difference between an ape’s brain and a human brain? Scale.

Don't retcon.

p1esk · on Feb 25, 2024

I think they went MoE purely because straight up scaling from 175B to 1.8T is just too expensive. But it’s still 10x scaling, right?

frozenseven · on Feb 26, 2024

Scary how someone can be so confident in their wrong information.

> An African Elephant has roughly 3x the number of neurons humans have.

An African elephant's brain is not a scaled up chimp brain in any way. African elephants have less cortical neurons than a chimp, and roughly a third of the amount that humans have.

> Dolphins beat humans in total surface area.

Animals even less related to humans and chimps, with no prehensile appendages, living in an environment where building stuff is exceedingly difficult. And of course their brains are obviously different from any great ape.

> Neanderthals are estimated to have had larger brains too!

And were just as smart as us and also had a scaled up chimp brain.

godelski · on Feb 26, 2024

animal :: cortical neurons (b) :: total neurons (b)

Human :: 16 :: 86

Gorilla :: 9.1 :: 33

Chimp :: 6 :: 22

African Elephant :: 5.6 :: 251

Chimps are generally considered more intelligent than gorillas.

Bottlenose Dolphins have 11-15b cortical neurons while humans are in the range 14-18 (range is measurement uncertainty). It's also worth noting these dolphins have a larger brain mass (1.6 kg) and larger cortical surface (3700 cm2) than humans (1.3 kg and 2400 cm2, respectively).

> with no prehensile appendages, living in...

So more than scale. Glad we agree. Seems you also agree that architecture matters too.

frozenseven · on Feb 26, 2024

> Chimps are generally considered more intelligent than gorillas.

And chimps are genetically more similar to humans than gorillas. A chimp brain is more similar to a human brain than a gorilla brain.

> So more than scale. Glad we agree.

We absolutely do not agree. Notice how nobody suggested that a human brain is a scaled up version of an axolotl's brain? Yeah, that didn't happen. I do wonder why?

algo_trader · on Feb 25, 2024

> Just that it isn't the immediate transformer killer.

What is the best/stable-ish linear alternative for transformer right now? Especially for text generation and summarization.

We have domain specific ways of over sampling and search, so we much prefer less expensive models.

anon291 · on Feb 26, 2024

As someone who's worked at several NVIDIA competitors, including Groq, I can guarantee you that, based on my knowledge, of existing products, they would be able to make much more money should they have lower memory footprint models. Given the amount of VC capital deployed for this (on the order of 100s of millions), I don't believe this is a reasonable take.

Sure, NVIDIA et al may not want that (although, again I don't see why... they too can't produce chips fast enough so being able to provide models for customers now ought to be good), but there's so much money out there that does...

landryraccoon · on Feb 25, 2024

Why would Meta, Microsoft, Amazon and Google want Nvidia to remain dominant in hardware? Are you treating “big tech” like they all have one hive mind?

behnamoh · on Feb 25, 2024

For MSFT, AMZN, GOOG, the competitive advantage comes from having huge datasets (that Nvidia doesn't have). It's a symbiosis that benefits the data-rich and GPU-rich players.

pixl97 · on Feb 25, 2024

This still makes little sense as that scale will always matter. If you can drop the compute cost of a model by 10x it means you can increase model integrity/intelligence/speed etc beyond what your compute bound competitors have.

Simply put, for the time being huge datasets are going to be needed and those with bigger (cleaner?) datasets will have a better behaving model.

fauigerzigerk · on Feb 25, 2024

Where is the symbiosis? If data is the differentiator, how do the data owners benefit from Nvidia eating into their margins?

coldtea · on Feb 26, 2024

It's sumbiosis based on a common factor they both appreciate:

- the data/processing having to be large means the data-owners have a benefit

- the data/processing having to be large means NVIDIA also has a benefit (sells more GPUs to handle all that load)

fauigerzigerk · on Feb 26, 2024

Data owners are benefitting from having access to data while others don't.

High processing cost is not a benefit to them at all. It's just a cost eating into their margins.

nyrikki · on Feb 25, 2024

The fact that 'removing the “quadratic bottleneck”' involves either reduced expressability compared to self attention or disproving SETH is another reason.

The quadratic bottleneck is due to the lower bounds of exhaustive search.

The papers on this only ever seem to reference perplexity.

The fact it can append a word to "I'm going to the beach" that sounds good doesn't mean it is useful.

There is no free lunch, and this project hasn't shown that the costs are acceptable.

"I'm going to the beach" + house

Doesn't help if what you needed was

"I'm going to the beach" + tomorrow

I do hope that there is more information on the costs, or that they have disproven SETH soon.

soVeryTired · on Feb 25, 2024

What's SETH in this context? I googled to no avail.

nyrikki · on Feb 25, 2024

Strong Exponential Time Hypothesis

Here is how it relates to attention.

https://arxiv.org/abs/2209.04881

fbdab103 · on Feb 25, 2024

It is a really recent development. Even if this architecture is technically superior, it could take time before a model using it becomes competitive.

Or maybe it does not pan out at all. We are still at the stage where people are throwing everything at the wall to see what sticks. Some promising ideas which work at small scale do not work at bigger.

CuriouslyC · on Feb 25, 2024

This. Hyperparameter tuning and training include a lot of model specific black magic. Transformers have had time to mature, it'll take a while for other stuff to catch up even if they have a higher potential ceiling.

koayon · on Feb 25, 2024

Definitely agree that a lot of work going into hyperparameter tuning and maturing the ecosystem will be key here!

I'm seeing the Mamba paper as the `Attention Is All You Need` of Mamba - it might take a little while before we get everything optimised to the point of a GPT-4 (it took 6 years for transformers but should be faster than that now with all the attention on ML)

koayon · on Feb 25, 2024

Another interesting one is that the hardware isn't really optimised for Mamba yet either - ideally we'd want more of the fast SRAM so that we can store more larger hidden states efficiently

godelski · on Feb 25, 2024

Yes and no.

The thing is that Mamba is not perfect. There's no neural architecture to rule them all, if you will. I think the bigger issue is that we more act like there is and get on bandwagons. Let me give a clearer example from the past so we can see. The predecessor to DDPM (the work that kicked off the diffusion model era) was published in 2015[0], only a year after GANs[1]. Diffusion then showed promise but why did DDPM come out in 2020[2]? Because everyone was working on GANs. GANs produced far better images and diffusion (still is) was a lot more computationally intensive. Plus, all the people working on these diffusion models were in the same camp as those working on Normalizing Flows and other density based models, and fewer people are interested in understanding density estimation.

So the entire problem is that the community hopped on a singular railroad for research direction. There was still working going on in that direction but it wasn't nearly getting the attention that GANs got. It's hard to know if things were blocked from publication because they weren't as good as GANs. I can say from personal experience I had a Flow publication blocked because reviewers were concerned with its quality compared to GANs (this was 2019/2020, this paper will never be published because now it is even hard to publish a GAN work).

So yes and no because there is certainly railroading happening but there are also real critiques to Mamba. But what people often forget is that it is incredibly naive to compare new methods to existing methods on a direct one-to-one comparison. You're comparing something that has hundreds of hours to thousands of hours from a handful to a few dozen eyes against works with millions of hours and millions of eyes. Evaluation is just a really fucking hard thing to do but it is easy to just look at some benchmarks, even if they don't mean much. This is a fairly generalization notion though, so take the lesson to heart. But Mamba seems a bit different than our diffusion/GAN story, in that it is getting more attention than diffusion did in the 2016-2019.

[0] https://arxiv.org/abs/1503.03585

[1] https://arxiv.org/abs/1406.2661

[2] https://arxiv.org/abs/2006.11239

lamson · on Feb 25, 2024

I don't really think big tech have that much control. They are aiming for optimal profit solution, thing likes AI monopoly with huge self-supervised learning just popped up recently when ChatGPT performs really well, a couple years ago, people still believed modular and supervised learning is the key to AI application. So simply current scaling deep learning/llm is most promising and it works while tradional methods don't. If there is something that works as good as current solution and requires less resource, they will go for it very fast, see the implementation of Flash attention as an example.

imjonse · on Feb 25, 2024

Low adoption is primarily caused by it being relatively recent and there are no 7B or larger public Mamba-based models to start a comparison in earnest with widely used transformer based LLMs.

ssivark · on Feb 25, 2024

It’s less compute for the same model sizes. Rest assured that there will still be a race to scale model sizes (and data) to achieve better performance.

jahewson · on Feb 25, 2024

This early in the game? No. If LLMs become vastly cheaper and faster then adoption (and model size) will increase in line with that.

nbardy · on Feb 25, 2024

No these things just take time.

There is no conspiracy again efficient training. Companies aren’t going to lower compute budgets with more efficiency.

All the top labs are increasing efficiency, but they are using that to get more out of their large runs not spend less. Most companies have a relatively fixed training budget for their large runs and are trying to get the most out of it, bot save money,

Mamba is actually being scaled up and tested across other fields(bio) at a rapid pace compared to other architectures

godelski · on Feb 25, 2024

> There is no conspiracy

Fwiw, the OP isn't suggesting conspiracy. The notion is more about convergent thinking.

kgeist · on Feb 26, 2024

If Mamba selectively forgets "unnecessary" details, can it repeat the input verbatim (if asked)?

hackerlight · on Feb 26, 2024

If you ask at the end of the prompt then it may have already deliberately tossed the information it deemed irrelevant prior to the question. These aren't transformers. In general the recall for arbitrary information will be worse.

kgeist · on Feb 26, 2024

So the questions should come before the content and it might work?

I think that's how also RWKV works.

hackerlight · on Feb 27, 2024

It's known to help although I wouldn't expect it to be perfect recall unless the network is big enough.

The network will read the data token by token. So if you put the question at the beginning it will know what information it needs to pay attention to inside the rest of your context. Of course, if the network is too small, it still won't be perfect recall for a sufficiently complicated/large question/context.

alok-g · on Feb 26, 2024

A potentially naive question. Isn't this modeled like a Kalman Filter?

Edit: Sounds like it is. https://openreview.net/pdf?id=AL1fq05o7H

kken · on Feb 25, 2024

There is also this: https://jackcook.com/2024/02/23/mamba.html

fancyfredbot · on Feb 25, 2024

See also: https://jackcook.com/2024/02/23/mamba.html

thecolorgreen · on Feb 25, 2024

Why doesn't Equation 1b use the h' defined in Equation 1a?

koayon · on Feb 25, 2024

Hey! OP here Great question - h' in Equation 1a refers to the derivative of h with respect to time (t). This is a differential equation which we can solve mathematically when we have x in order to get a closed-form solution for h. We would then plug in that h (the hidden state) into equation 1b.

In our case, we don't actually wait for a closed-form solution but instead compute the discrete representation (Equation 2)

Hope that helps!

atlacatl_sv · on Feb 25, 2024

I believe h' is for the next state. y(t) is to predict the next word so it uses the current hidden state h(t).

givemeethekeys · on Feb 25, 2024

Transformers, Rise of the Mambas, coming to a theater near you!

vanjajaja1 · on Feb 26, 2024

I really enjoyed this article, thanks

AndrewKemendo · on Feb 25, 2024

Someone is going to re-invent Bellman's equations and call it Learnformer