Overview
The Transformer architecture, dominant in large-scale AI models since its introduction in 2017, faces scalability limits as sequence lengths grow. A key bottleneck is that self-attention computes pairwise interactions between all tokens, so its cost scales quadratically with context length, making very long sequences expensive in both compute and memory.
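To make the quadratic term concrete, here is a minimal single-head attention sketch in NumPy. It is illustrative only, not the paper's code, and the function and weight names are made up here: the point is that the score matrix has shape (n, n), so doubling the sequence length quadruples its size.

```python
import numpy as np

def naive_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a length-n sequence.

    The score matrix q @ k.T has shape (n, n), so compute and memory
    both grow quadratically with the sequence length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # each (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) <- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = naive_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (1024, 64); the hidden (n, n) score matrix is what hurts at long n
```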
What Mamba Proposes
Recent research introduces a new architecture called Mamba, built on a "selective state space model" (selective SSM). This design generalizes the earlier S4 architecture (Structured State Space sequence model) by letting the model selectively propagate or ignore information depending on the current input. Concretely, several of the SSM parameters become functions of the input rather than being fixed after training, a small change that the authors report yields significant benefits.
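As a rough illustration of what "input-dependent parameters" means, below is a toy selective scan for a single output channel, written in NumPy. It is a simplified sketch, not the paper's implementation: the parameter names (W_B, W_C, w_dt, and so on) are assumptions introduced here, the discretization is abbreviated, and the real model replaces this O(T) Python loop with a hardware-aware parallel scan.

```python
import numpy as np

def selective_ssm(u, A, W_x, W_B, W_C, w_dt):
    """Toy selective state space scan for one output channel.

    u    : (T, d) sequence of input tokens
    A    : (N,)   diagonal (negative) state matrix
    W_x  : (d,)   projects the token to the scalar SSM input
    W_B  : (N, d) input-dependent B  <- the "selective" part
    W_C  : (N, d) input-dependent C
    w_dt : (d,)   input-dependent step size (via softplus)

    In S4 these parameters are fixed after training; making B, C and the
    step size functions of u_t is the small change the paper highlights.
    This is an O(T) sequential reference, not the paper's parallel scan.
    """
    T, _ = u.shape
    N = A.shape[0]
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        x_t = u[t] @ W_x                         # scalar SSM input
        dt = np.logaddexp(0.0, u[t] @ w_dt)      # softplus step size > 0
        B_t = W_B @ u[t]                         # selective input matrix
        C_t = W_C @ u[t]                         # selective output matrix
        A_bar = np.exp(dt * A)                   # discretized state decay
        h = A_bar * h + (dt * B_t) * x_t         # state update
        y[t] = C_t @ h                           # readout
    return y

rng = np.random.default_rng(0)
T, d, N = 8, 16, 4
y = selective_ssm(
    rng.standard_normal((T, d)),
    -np.exp(rng.standard_normal(N)),             # negative A keeps the state stable
    rng.standard_normal(d),
    rng.standard_normal((N, d)),
    rng.standard_normal((N, d)),
    rng.standard_normal(d),
)
print(y.shape)  # (8,)
```

The contrast with S4 is that B, C, and the step size would there be constants learned once, so every token is processed the same way regardless of its content.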
Performance Claims
According to the authors, Mamba can match or outperform Transformer models on language modeling tasks. It is reported to scale linearly with context length and to handle sequences up to million-token lengths in practice, with up to 5x inference throughput improvements in some settings. The authors also report state-of-the-art results across multiple modalities including language, audio, and genomics. Their Mamba-3B model is reported to outperform Transformer models of comparable size and to be competitive with Transformer models roughly twice its size on language modeling benchmarks.
Relation to Prior Work
S4 previously demonstrated strong results on long-range dependency benchmarks such as the Long Range Arena suite, including its hardest Path-X task. S4 and related state space models can be viewed through the lens of RNNs, CNNs, and classical state space models. The Mamba paper compares its approach against other SSM-based methods and efficient attention variants, including linear attention, H3, Hyena, RetNet, and RWKV, several of which serve as baselines in the study.
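The RNN/CNN connection is easiest to see in the non-selective (linear time-invariant) case that S4 exploits: the same discretized SSM can be unrolled step by step like an RNN or applied in one shot as a long convolution. The sketch below checks that equivalence numerically for a diagonal toy SSM; it is an illustration with made-up parameter names, not S4's actual training path. Making the parameters input-dependent, as Mamba does, breaks this convolutional form, which is why the paper relies on a recurrent scan instead.

```python
import numpy as np

def lti_ssm_recurrent(x, A_bar, B_bar, C):
    """Run a discretized LTI SSM as an RNN: h_t = A_bar*h_{t-1} + B_bar*x_t."""
    h = np.zeros_like(A_bar)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

def lti_ssm_convolutional(x, A_bar, B_bar, C):
    """Same LTI SSM as a convolution with kernel K_k = C * A_bar**k * B_bar."""
    T = len(x)
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(T)])
    full = np.convolve(x, K)          # causal convolution, then truncate
    return full[:T]

rng = np.random.default_rng(0)
T, N = 16, 4
x = rng.standard_normal(T)
A_bar = np.exp(-np.exp(rng.standard_normal(N)))   # diagonal, 0 < A_bar < 1
B_bar = rng.standard_normal(N)
C = rng.standard_normal(N)
assert np.allclose(lti_ssm_recurrent(x, A_bar, B_bar, C),
                   lti_ssm_convolutional(x, A_bar, B_bar, C))
```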
Authors and Contributions
The paper is authored by Albert Gu (Assistant Professor of Machine Learning, Carnegie Mellon University) and Tri Dao (Chief Scientist at Together AI and incoming Assistant Professor of Computer Science at Princeton University). Albert Gu is credited with the selective SSM generalization of S4. Tri Dao is known for FlashAttention and follow-up work such as FlashAttention-2 and Flash-Decoding, which optimize attention computation and memory usage for long-context inference. The two authors have collaborated previously.
Resources
Model code and pretrained checkpoints are available at https://github.com/state-spaces/mamba (a brief usage sketch follows below).
Paper: https://arxiv.org/abs/2312.00752
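For readers who want to try the released code, the repository's README includes, at the time of writing, a minimal example along these lines. The mamba_ssm package name, the Mamba class, and its arguments are taken from that README and may change between releases; a CUDA GPU is required.

```python
# Assumes `pip install mamba-ssm` and a CUDA GPU, following the repository's
# README; the class name and arguments below may change between releases.
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    d_model=dim,  # model (channel) dimension
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # local convolution width
    expand=2,     # block expansion factor
).to("cuda")
y = model(x)      # (batch, length, dim) -> (batch, length, dim)
assert y.shape == x.shape
```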