
MambaQuant: Post-Training Quantization for Mamba Models

Author: Adrian | February 25, 2026

Overview

MambaQuant implements W8A8 and W4A8 post-training quantization (PTQ) for the Mamba series of models, achieving near-floating-point accuracy and outperforming methods such as QuaRot. This work was accepted at ICLR 2025.

Abstract

Mamba is an efficient sequence model comparable to Transformer architectures and shows strong potential as a foundation model across tasks. Quantization is commonly applied to neural networks to reduce model size and lower inference latency. However, research on quantizing Mamba models is limited. Existing quantization techniques that work well on CNNs and Transformers do not translate directly to Mamba: for example, even in a W8A8 configuration, QuaRot reduced accuracy by 21% on Vim-T. We investigate this problem and identify several key challenges.
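To make the W8A8 setting concrete, here is a minimal sketch (our own illustration, not code from the paper) of symmetric per-tensor int8 quantization. It shows why outliers are the central PTQ difficulty: a single large value dominates the scale and collapses the resolution left for the remaining values.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Symmetric uniform quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit
    scale = np.abs(x).max() / qmax                  # per-tensor scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * scale

# One outlier (8.0) sets the scale; the small values all round to the
# same integer bin and their information is lost.
x = np.array([0.01, -0.02, 0.03, 8.0], dtype=np.float32)
q, s = quantize_symmetric(x)
x_hat = dequantize(q, s)
```

Running this, the small entries dequantize to zero while the outlier is preserved, which is exactly the failure mode that rotation- and smoothing-based PTQ methods try to avoid.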

Key Challenges

First, there are substantial outliers in gate projections, output projections, and matrix multiplications. Second, Mamba's unique parallel scan operation further amplifies these outliers, causing uneven and long-tailed data distributions. Third, even after applying Hadamard transforms, weight and activation variances remain inconsistent across channels. To address these issues, we propose MambaQuant, a PTQ framework that includes: 1) enhanced rotation based on the Karhunen-Loève transform (KLT) to make the rotation matrix adapt to different channel distributions; and 2) smoothed fused rotation to balance channel variances and optionally merge extra parameters into model weights.
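The variance-balancing idea behind point 2) can be illustrated with a toy equivalence (a sketch under our own assumptions, not the paper's implementation): dividing each input channel by a per-channel factor and folding those factors into the next layer's weights leaves the layer output unchanged while evening out channel variances before quantization.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 4, 3
# Imbalanced activations: one channel has ~20x the spread of the others.
x = rng.normal(size=(16, d_in)) * np.array([1.0, 1.0, 1.0, 20.0])
W = rng.normal(size=(d_in, d_out))

s = x.std(axis=0)            # per-channel smoothing factors (one simple choice)
x_s = x / s                  # balanced activations: unit variance per channel
W_s = W * s[:, None]         # factors fused into the weights, no runtime cost

# Equivalence: x @ W == (x / s) @ (W * s) up to floating-point error.
assert np.allclose(x @ W, x_s @ W_s)
```

Fusing `s` into the weights is what keeps the extra smoothing parameters from adding memory or inference overhead.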

Technical Details

We observed significant outliers in both weights and activations of Mamba models. Linear-layer weights contain outliers, especially in the gate projection layers of Mamba-LLM for language tasks. Some linear-layer inputs exhibit large variance across channels, notably in the output projection layer of Vim for vision tasks.

The parallel scan operator (PScan) further amplifies activation outliers. To compute the hidden state at each timestep, the PScan operator (Smith et al., 2022) repeatedly multiplies the running state by a fixed parameter matrix. High-value channels are amplified while low-value channels are suppressed, and these inter-channel differences propagate into the activations.
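A toy recurrence (our illustration with hypothetical decay values, not Mamba's actual kernel) shows the amplification effect: channels whose fixed per-channel parameter is close to 1 accumulate over timesteps, while channels with smaller parameters stay near the input magnitude, stretching the activation distribution across channels.

```python
import numpy as np

# Scan-style recurrence h_t = a * h_{t-1} + x_t with a fixed per-channel
# decay `a`. These values are hypothetical, chosen only to show the spread.
T, C = 64, 4
a = np.array([0.99, 0.9, 0.5, 0.1])
x = np.ones((T, C))

h = np.zeros(C)
states = []
for t in range(T):
    h = a * h + x[t]
    states.append(h.copy())
states = np.stack(states)

# Per-channel maxima diverge sharply even though every input channel
# received identical values -- a long-tailed, uneven distribution.
channel_max = states.max(axis=0)
```

Under these decays the first channel grows to roughly 40x the last, so a single quantization scale shared across channels becomes badly mismatched.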

Hadamard-based methods have recently proven successful for quantizing Transformer-based LLMs because they equalize channel maxima while acting as an equivalence transform that preserves the layer output. For instance, QuaRot preserved 99% of zero-shot performance when quantizing LLaMA2-70B to 4 bits. However, applying these methods directly to Mamba leads to large accuracy drops; even at 8-bit quantization, average accuracy on Vim dropped by more than 12%.
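The "equivalent transform" property is easy to verify numerically (a minimal sketch, not QuaRot's implementation): because a normalized Hadamard matrix H is orthogonal, inserting H and its transpose around a linear layer leaves the output unchanged while spreading each channel's mass across all channels.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of a normalized n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)   # orthonormal: H @ H.T == I

d = 8
H = hadamard(d)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
x = rng.normal(size=(1, d))

# Equivalence: since H is orthogonal, (x H)(H^T W) == x W.
y_ref = x @ W
y_rot = (x @ H) @ (H.T @ W)
```

The rotated operands `x @ H` and `H.T @ W` are what actually get quantized; the rotation flattens per-channel maxima, but, as the paper notes, it does not by itself align per-channel variances.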

MambaQuant: Proposed Solutions

To address the above problems, we present MambaQuant, the first comprehensive PTQ design for the Mamba series that achieves high-accuracy W8A8 and W4A8 quantization. Main contributions include:

  • Offline enhanced rotation based on the Karhunen-Loève transform (KLT). This technique multiplies a Hadamard matrix with a KLT matrix so the rotation matrix can adapt to different channel distributions.
  • Online smoothed fused rotation. This method applies smoothing before the Hadamard transform. The additional smoothing parameters are flexibly fused into Mamba module weights to avoid extra memory and inference costs.
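The KLT-enhanced rotation in the first bullet can be sketched as follows (our own simplified illustration under stated assumptions, not the paper's code): take the eigenvectors of the channel covariance of calibration activations as the KLT basis, then multiply by a Hadamard matrix. The product is still orthogonal, so the usual output-preserving rewrite applies, but the rotation now adapts to the observed channel statistics instead of being fixed.

```python
import numpy as np

def klt_matrix(acts: np.ndarray) -> np.ndarray:
    """KLT basis: eigenvectors of the channel covariance of calibration activations."""
    cov = np.cov(acts, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)      # orthonormal columns for symmetric input
    return eigvecs

def hadamard(n: int) -> np.ndarray:
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d = 8
rng = np.random.default_rng(0)
# Hypothetical calibration activations with one outlier channel.
acts = rng.normal(size=(256, d)) * np.concatenate([np.ones(d - 1), [10.0]])

K = klt_matrix(acts)
R = K @ hadamard(d)   # KLT-enhanced rotation; product of orthogonal matrices
# R is orthogonal, so W x can be rewritten as (W R)(R^T x) with the
# output unchanged, while R reflects the calibration distribution.
```

Because the KLT factor is computed offline from calibration data, the combined matrix can be folded into the weights ahead of time, matching the paper's offline/online split.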

Results

By aligning both the maxima and variances of quantized data across channels, MambaQuant significantly improves quantization quality. Experimental results show that MambaQuant outperforms QuaRot and other quantization schemes on Mamba models. For W8A8, accuracy loss across a range of vision and language benchmarks is under 1%. W4A8 quantization also achieves state-of-the-art results.

Notably, our channel variance alignment yields a clear accuracy improvement, for example when applied together with the KLT-enhanced rotation.

Impact

This work is the first to achieve high-accuracy quantization on Mamba models, enabling more efficient deployment and inference, particularly on edge devices.