# ICML 2024 BiMamba Rebuttals
## Response to All Reviewers
We would like to express our sincere gratitude to the reviewers for their valuable feedback and constructive suggestions.
We are glad that the reviewers found our novel *Matrix Mixer* view for sequence modeling interesting and effective, and that they found the paper to be well-written and easy to follow.
In this shared response, we address some of the common concerns, and provide new empirical results and analyses:
1. **GLUE Performance:** In addressing the concerns raised by reviewers zzad and PoE8 regarding our model's GLUE performance, we gently point out that the results in the submission **exceeded the Transformer baseline without any hyperparameter tuning whatsoever**. We have since substantially improved our results, **now leading BERT by 1.1 and M2 by 3.4 points** through minimal tuning of the finetuning hyperparameters, all the while using less compute than the baselines. We further highlight that our model **excels across domains, achieving an ImageNet top-1 accuracy of 81% vs. ViT's 78.8%.**
2. **Limitations:** The reviewers have rightfully noted the need for a discussion of the limitations of our framework. We agree that this is an important discussion, rich in nuances around the **Representation-Computation tradeoff and hardware efficiency concerns**. We delve into the details of these nuances below.
3. **Reproducibility:** We provide step-by-step instructions to pretrain the Bi-Mamba model and evaluate it on all GLUE tasks. We will open-source the model code and instructions to reproduce all main results from the submission.
### Performance: ImageNet and GLUE
Before we present our improved GLUE results, we deem it important to highlight that the paper's scope extends beyond language tasks alone; **our proposed framework and method are broadly applicable to all sequence modeling tasks**. Notably, Bi-Mamba achieves substantially better results on ImageNet than prior methods using the standard recipe [1], **achieving a top-1 accuracy of 81% vs. ViT's 78.8%**, demonstrating our method's efficacy.
For the rest of this section, we expand on new results on the GLUE benchmark.
We recognize the concerns from reviewers zzad and PoE8 regarding Bi-Mamba's model size and performance on the GLUE benchmark. In light of their feedback, we have made significant improvements to our results - **outperforming BERT by 1.1 points, a considerable jump from our previous 0.4 point lead**. To achieve this, we made the following adjustments:
- **Fine-tuning Strategy**: We want to emphasize that the well-established BERT and M2 models benefit from highly optimized training and finetuning recipes. **The GLUE scores in our submission were produced using the M2 recipe out-of-the-box, without any hyperparameter tuning**, despite which Bi-Mamba outperformed both BERT and M2. To improve our results, **we only perform a short sweep over the learning rate and number of epochs for the finetuning tasks** (see the sketch after this list). For fairness, we ensure that the number of epochs for each task does not surpass its value in the original recipe. Please refer to Table 1 for the finetuning recipes used by BERT, M2, and Bi-Mamba.
- **Parameter Matching BERT and Bi-Mamba**: We reduce the number of layers of Bi-Mamba from $24$ to $23$ (n_params: 116M → 112M) to parameter-match the model to BERT (110M).
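For concreteness, below is a minimal sketch of the sweep described above. The `finetune_glue` helper is a hypothetical placeholder (our runs are launched via `glue.py` with YAML configs); only the learning-rate grid and the per-task epoch caps, taken from Table 1, reflect our actual setup.
```
import itertools

# Illustrative sweep over finetuning hyperparameters only; in practice our runs
# are driven by glue.py with YAML configs rather than by this loop.
TASKS = ["mnli", "qnli", "qqp", "rte", "sst2", "mrpc", "cola", "sts"]
LEARNING_RATES = [1e-5, 3e-5, 5e-5, 8e-5, 1e-4]              # short learning-rate grid
MAX_EPOCHS = {"mnli": 3, "qnli": 10, "qqp": 5, "rte": 3,     # capped at the epoch counts
              "sst2": 3, "mrpc": 10, "cola": 10, "sts": 10}  # of the original BERT recipe

def finetune_glue(task: str, lr: float, epochs: int) -> float:
    """Hypothetical helper: finetune Bi-Mamba on `task` and return the dev-set score."""
    return 0.0  # placeholder; stands in for a real finetuning run

best = {}
for task in TASKS:
    for lr, epochs in itertools.product(LEARNING_RATES, range(1, MAX_EPOCHS[task] + 1)):
        score = finetune_glue(task, lr, epochs)
        if score > best.get(task, (float("-inf"), None))[0]:
            best[task] = (score, {"lr": lr, "epochs": epochs})
```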
**Table 1:** The finetuning hyperparameters for each task. Differences from the BERT finetuning recipe are highlighted in **bold**.
| Model | MNLI | QNLI | QQP | RTE | SST2 | MRPC | COLA | STS |
|---------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| BERT [2] | lr=5e-5, wd=5e-6, epochs=3, seq_len=256 | lr=1e-5, wd=1e-6, epochs=10, seq_len=256 | lr=3e-5, wd=3e-6, epochs=5, seq_len=256 | lr=1e-5, wd=1e-6, epochs=3, seq_len=256 | lr=3e-5, wd=3e-6, epochs=3, seq_len=256 | lr=8e-5, wd=8e-6, epochs=10, seq_len=256 | lr=5e-5, wd=5e-6, epochs=10, seq_len=256 | lr=3e-5, wd=3e-6, epochs=10, seq_len=256 |
| M2 | lr=5e-5, wd=5e-6, epochs=3, seq_len=**128** | lr=**5e-5**, wd=1e-6, epochs=10, seq_len=**128**, **pool_all=True**^1 | lr=3e-5, wd=**1e-2**, epochs=**10**, seq_len=**128** | lr=1e-5, wd=**1e-2**, epochs=**6**, seq_len=**128** | lr=3e-5, wd=3e-6, epochs=3, seq_len=**128** | lr=**5e-5**, wd=**1e-2**, epochs=10, seq_len=**128** | lr=5e-5, wd=5e-6, epochs=10, seq_len=**128** | lr=**7e-5**, wd=**1e-2**, epochs=10, seq_len=**128** |
| Bi-Mamba | lr=**1e-4**, wd=5e-6, epochs=**2**, seq_len=256 | lr=**5e-5**, wd=1e-6, epochs=**7**, seq_len=256 | lr=**5e-5**, wd=3e-6, epochs=**3**, seq_len=256 | lr=1e-5, wd=1e-6, epochs=3, seq_len=256 | lr=**5e-5**, wd=3e-6, epochs=**2**, seq_len=256 | lr=8e-5, wd=8e-6, epochs=10, seq_len=256 | lr=**1e-4**, wd=5e-6, epochs=10, seq_len=256 | lr=3e-5, wd=3e-6, epochs=10, seq_len=256 |
^1: Global average pooling over input tokens, instead of appending a [CLS] token.
Our updated GLUE scores for Bi-Mamba (112M) are listed in Table 2. We note that **Bi-Mamba matches or surpasses the heavily-tuned scores of BERT and M2 on every task; Bi-Mamba achieves a 0.4 point lead in MNLI and an overall lead of 1.1 GLUE points compared to BERT.**
**Table 2:** *The updated GLUE scores for Bi-Mamba. The reported numbers are averages over five runs; Δ values are relative to Bi-Mamba with the M2 recipe.*
| Model | #Params | MNLI | QNLI | QQP | RTE | SST2 | MRPC | COLA | STS | AVG |
|----------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
| BERT | 110M | 84.1 | 89.8 | 91.2 | 77.2 | 91.2 | 87.5 | 54.6 | 88.9 | 83.2 |
| M2 | 116M | 80.5 | 86.0 | 87.0 | 69.3 | 92.3 | 89.2 | 56.0 | 86.9 | 80.9 |
| Bi-Mamba (w/ M2 recipe) | 116M | 83.7 | 89.7 | 89.7 | 77.4 | 92.8 | **91.5** | 54.7 | **90.1** | 83.7 |
| Bi-Mamba | 112M | **84.5** (*Δ=+0.8*) | **90.0** (*Δ=+0.3*) | **91.3** (*Δ=+1.6*) | **77.5** (*Δ=+0.1*) | **93.5** (*Δ=+0.7*) | 91.2 (*Δ=-0.3*) | **57.2** (*Δ=+2.5*) | 88.9 (*Δ=-1.2*) | **84.3** (*Δ=+0.6*) |
[1] *A ConvNet for the 2020s. Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie*
[2] *Mosaicbert: A bidirectional encoder optimized for fast pretraining. Jacob Portes, Alexander Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle*
### Limitations and Discussion of Our Method
We appreciate the reviewers bringing up the absence of a discussion of the limitations of our framework and method. We recognize its importance and we now delve into the nuances of some of the trade-offs and concerns involved:
**1. Representation-Computation Tradeoff:**
While **structured matrix mixers are computationally more efficient** than their dense matrix mixer counterparts like softmax attention, **they are also representationally less expressive**, which may be seen as a limitation of these methods. For instance, concurrent works [3,4,5] have begun investigating the representational power of SSMs by analyzing their performance on memorization-centric tasks. They report that SSMs with a fixed model capacity are eventually outperformed by softmax attention for longer sequences. This can be viewed as a consequence of the matrix being *too structured*, and hence *too inexpressive* for the problem.
On the other hand, we remark that the *degree of structure* of a structured matrix is a knob that can be tuned according to the specific task; that is, we can trade off the **computational efficiency of the method for greater expressivity**. For instance, within the structured matrix class of low-rank matrices, we can tune the rank up to the size of the matrix, which is the sequence length. As the rank of the matrix class increases, so does its expressive power; however, this also diminishes the compute efficiency associated with the matrix being low-rank.
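To make this knob concrete, below is a minimal PyTorch sketch of a low-rank matrix mixer (all names and shapes are illustrative, not taken from our codebase): the rank `r` interpolates between a cheap, highly structured mixer and a dense, fully expressive one.
```
import torch

def low_rank_mix(x: torch.Tensor, r: int) -> torch.Tensor:
    """Apply a rank-r matrix mixer M = U V^T along the sequence dimension.

    x: (batch, seq_len, dim). Cost is O(seq_len * r * dim) instead of the
    O(seq_len^2 * dim) of a dense mixer; r = seq_len recovers full expressivity.
    """
    seq_len = x.shape[1]
    # In a real model U and V would be learned (and possibly input-dependent);
    # here they are random placeholders.
    U = torch.randn(seq_len, r) / r ** 0.5
    V = torch.randn(seq_len, r) / r ** 0.5
    # Compute (U V^T) x as U (V^T x): the seq_len x seq_len mixer is never materialized.
    return torch.einsum("sr,brd->bsd", U, torch.einsum("tr,btd->brd", V, x))

x = torch.randn(2, 1024, 64)
y_cheap = low_rank_mix(x, r=16)    # highly structured: cheap but less expressive
y_dense = low_rank_mix(x, r=1024)  # rank = seq_len: as expressive as a dense mixer
```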
As another example, in response to Reviewer PoE8's second question on the performance of SSMs on retrieval-style tasks, **we demonstrate this tradeoff for SSD**, which is the modern variant of Mamba. Specifically, we show that **SSD is able to recover the accuracy attained by softmax attention once we control for the compute capacity** of the model. In contrast, hardware limitations of the selective scan algorithm make it impractical to match the compute capacity in Mamba, explaining the emerging findings from [3,4,5] that SSMs underperform on memorization-centric tasks. This makes it evident that the development and analysis of SSMs is an active area of research with substantial room for exploration and improvement.
**2. Hardware Efficiency:** Despite the fact that structured matrices have associated sub-quadratic matrix multiplication algorithms, their implementation **may not be *hardware-friendly*,** which can reduce the execution speed in practice.
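As a toy illustration of this point (a sketch only; absolute timings depend heavily on the device, the sizes involved, and kernel fusion), the snippet below contrasts a dense mixer, which maps to a single large matrix multiplication that GPUs execute very efficiently, with a naive sequential scan that has lower asymptotic complexity but poor hardware utilization unless it is fused into a specialized kernel.
```
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
L, d = 2048, 64
x = torch.randn(8, L, d, device=device)
M = torch.randn(L, L, device=device)    # dense mixer: O(L^2 d) work, but one big GEMM
a = torch.rand(8, L, d, device=device)  # per-step decay for a toy linear recurrence

def dense_mix(x):
    return torch.einsum("st,btd->bsd", M, x)

def naive_scan(x, a):
    # O(L d) work, but L sequential steps: little parallelism without a fused kernel.
    h = torch.zeros_like(x[:, 0])
    out = []
    for t in range(x.shape[1]):
        h = a[:, t] * h + x[:, t]
        out.append(h)
    return torch.stack(out, dim=1)

for name, fn in [("dense matmul", lambda: dense_mix(x)), ("naive scan", lambda: naive_scan(x, a))]:
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    fn()
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{name}: {time.time() - start:.4f}s")
```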
In the next revision of the paper, we will include a comprehensive discussion of the limitations associated with structured matrices.
[3] *Zoology: Measuring and Improving Recall in Efficient Language Models. Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré*
[4] *Simple linear attention language models balance the recall-throughput tradeoff. Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré*
[5] *Repeat after me: Transformers are better than state space models at copying. Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach*
-----------------------------------------------------------------------------
### Reproducibility of results
Recognizing the importance of reproducibility, we are pleased to share the source code for our Bi-Mamba through this [URL](https://gist.github.com/anonymous-icml-10127/c45d3f762d2a27299ac2340fde5cc713).
After following the preparation steps described in the [M2 repository](https://github.com/HazyResearch/m2) and placing the provided code appropriately, the results in Table 2 can be reproduced using the following commands.
```
# Pretrain on C4
composer -n 8 main.py BiMamba-24layers-116M-C4.yaml
# Finetune from C4-pretrained weights
python glue.py BiMamba-24layers-116M-GLUE.yaml
# List the seeds of the finetuned MNLI checkpoints
ls ./local-finetune-checkpoints/BiMamba-24layers-116M/task=mnli/ | grep -oP 'seed=\K\d+'
# Finetune from MNLI-pretrained weights
# Replace {SEED} with one of the seed numbers printed by the above command
python glue.py BiMamba-24layers-116M-GLUE.yaml from_mnli=True \
base_run_name="BiMamba-24layers-116M-from-mnli" \
local_pretrain_checkpoint_folder="./local-finetune-checkpoints/BiMamba-24layers-116M/task=mnli/seed={SEED}"
```
We are finalizing preparations for the public release of the code to further facilitate research transparency. Note that the provided code uses pure PyTorch for ease of understanding; in the public release, the 'mamba_chunk_scan_fused' function will be substituted with a Triton-based alternative, which significantly enhances training speed. We welcome any further requests for clarification or additional information regarding our training procedures.
------------------------------------------
## Response to Reviewer 1
We are glad that the reviewer recognized our work's contributions, particularly how the Matrix Mixer view offers valuable insights into the performance of Transformers and the latest SSMs. We now turn to addressing their questions and concerns.
> 1. The reproducibility of results is not clear.
Addressing the reviewer's concerns on reproducibility, we provide step-by-step instructions for replicating the pretraining and finetuning results of Bi-Mamba, as outlined in the common response.
----------------
> 2. the proposing of bidirectional mamba is somehow straightforward
We agree with the reviewer that using bidirectional sequence models is quite common in the literature. We further note that many approaches have been developed for incorporating bidirectionality into recurrent models [1]. However, **these approaches treat the causal sequence mixer as a black box and use heuristics such as addition, concatenation, and the Hadamard product to devise bidirectional encoders** (sketched after the references below). From the various heuristic extensions of Mamba to bidirectional settings developed by academics [2,3,4], and from the GitHub [issues](https://github.com/state-spaces/mamba/issues/99) and [pull requests](https://github.com/state-spaces/mamba/pull/52) raised by practitioners, it is clear **that this is not a settled problem in the machine learning community**.
In our work, we approach this problem **under the more natural paradigm of structured sequence mixers**. This premise allows us to narrow down the architecture search space, which would otherwise be very broad. We also show that various existing, performant sequence mixers can be subsumed under this framework, lending it further credence. Under this framework, we then **arrive, in a principled manner, at Quasiseparable matrices** as our matrix mixer of choice for Bi-Mamba. Through comprehensive ablations (see Table 4 and Figure 3 in the paper), **we demonstrate that the theoretically motivated structured matrix approach outperforms the other, heuristically motivated bidirectional methods.**
[1] *Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. Alex Graves, Santiago Fernández, Jürgen Schmidhuber*.
[2] *Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang*.
[3] *VMamba: Visual State Space Model. Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu*.
[4] *Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling. Yair Schiff, Chia-Hsiang Kao, Aaron Gokaslan, Tri Dao, Albert Gu, Volodymyr Kuleshov*.
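For reference, the black-box heuristics mentioned above can be sketched as follows (a simplified illustration; `causal_mixer` stands in for any causal sequence mixer and is not our actual implementation): each heuristic runs one causal pass forward and one over the reversed sequence, then combines the two outputs elementwise, whereas our approach parameterizes the bidirectional mixing matrix directly as a quasiseparable matrix.
```
import torch

def heuristic_bidirectional(x: torch.Tensor, causal_mixer, combine: str = "add") -> torch.Tensor:
    """Black-box bidirectional heuristics built on top of a causal sequence mixer.

    x: (batch, seq_len, dim); `causal_mixer` is any shape-preserving causal map.
    """
    fwd = causal_mixer(x)
    bwd = causal_mixer(torch.flip(x, dims=[1])).flip(dims=[1])  # reverse, mix causally, reverse back
    if combine == "add":        # elementwise addition of the two passes
        return fwd + bwd
    if combine == "hadamard":   # elementwise (Hadamard) product
        return fwd * bwd
    if combine == "concat":     # concatenation, typically followed by a learned projection (omitted)
        return torch.cat([fwd, bwd], dim=-1)
    raise ValueError(f"unknown combine mode: {combine}")
```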
> 3. Although using more parameters, Bi-Mamba does not show clear improvements over previous models.
We understand the reviewer's concern about the model size and the performance of Bi-Mamba on GLUE tasks. We have now significantly improved our results - **Bi-Mamba achieves a lead of 0.4 points in MNLI and an average lead of 1.1 points across all tasks, compared to the heavily tuned BERT results**. Furthermore, we have **parameter-matched Bi-Mamba** (116M → 112M) to BERT (110M). To achieve these improvements, unlike our initial approach of using the M2 recipe out-of-the-box, we perform **a minimal sweep over learning rates and number of epochs for the finetuning tasks**. For complete details, we invite the reviewer to refer to our shared response.
> 4. Limitations of structured matrices.
We are glad that the reviewer raised this question, and we kindly request the reviewer to view the shared response for a detailed discussion on the **Representation-Computation tradeoff, and the hardware concerns** associated with structured matrices.
------------------------------------------
## Response to Reviewer 2
We sincerely appreciate the reviewer's acknowledgement of the contributions of our work, particularly highlighting the insights gained from the Matrix Mixer perspective and the positive remarks on our development of Bi-Mamba, which demonstrates notable performance. We are pleased to address the reviewer's concern.
> It could not be possible to encompass all algorithms via the Matrix Mixer view.
We acknowledge that the vast landscape of potential sequence mixing algorithms means that some may not fit within the Matrix Mixer framework. Nevertheless, the Matrix Mixer perspective remains a valuable framework for developing innovative algorithms. Our paper introduces new matrix mixers that achieve commendable accuracy, as shown in Table 5. We have conducted more experiments and evaluated additional methods, the details of which will be incorporated in the next revision.
------------------------------------------
## Response to Reviewer 3
We are grateful for the reviewer's recognition of *Matrix Mixers* as an innovative and effective approach to sequence modeling. We are also glad that the reviewer found our validation of Bi-Mamba across multiple tasks and domains extensive. We are eager to address the concerns and questions below.
> 1. It seems hard to fill the 0.4 MNLI gap, even though bi-mamba uses larger model size.
We understand the reviewer's concern about the model size and the performance of Bi-Mamba on GLUE tasks. We have now significantly improved our results - **Bi-Mamba achieves a lead of 0.4 points in MNLI and an average lead of 1.1 points across all tasks, compared to the heavily tuned BERT results**. Furthermore, we have **parameter-matched Bi-Mamba** (116M → 112M) to BERT (110M). To achieve these improvements, unlike our initial approach of using the M2 recipe out-of-the-box, we perform **a minimal sweep over learning rates and number of epochs for the finetuning tasks**. For complete details, we invite the reviewer to refer to our shared response.
----------------
> 2. SSMs have critical weaknesses in retrieval-like tasks. The related tasks can be evaluated and reported in the paper.
We appreciate the reviewer's excellent question on the performance of Bi-Mamba, and more broadly SSMs, on retrieval-like tasks. They have correctly noted that **concurrent works** like Zoology [1] and Repeat After Me [2] report that SSM models like Mamba tend to underperform on **memorization-centric** tasks like Associative Recall (AR). However, we would like to highlight **that these experiments were performed on older variants of SSMs, and their test design does not fully demonstrate the capabilities of SSMs.**
Specifically, prior works **have not controlled for the effective state (memorization) capacity of the models** when comparing these methods. In a nutshell, **we can view SSMs as having a "state size" knob that controls the model capacity**: they become equivalent to attention in capacity and compute cost when the state size equals the sequence length^1. Since older variants of SSMs like Mamba were practically infeasible to run with larger state sizes, previous works did not fully test the representational capabilities of SSMs. We now empirically validate our hypothesis on the **latest iteration of SSMs - SSD, whose algorithmic improvements allow it to run with much larger state sizes**. Furthermore, we note that our model **Bi-Mamba is based on SSD, which allows it to enjoy the same retrieval capabilities as SSD.**
^1: This simplified explanation assumes that the number of layers and the model dimension are the same.
**Experimental Setting:** We test on the Multi-Query Associative Recall (MQAR) synthetic benchmark introduced in [1]. For our experiment, we chose the sequence length and the number of kv-pairs to be $(l,d)=(1024, 256)$; this setting is more difficult than the most challenging scenario $(l,d)=(512,128)$ reported in [1]. Other hyperparameters remain unchanged from those used in [1].
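To give a sense of the task, below is a toy illustration of the MQAR format (simplified; the actual generator of [1] uses its own vocabulary sizes and gap distribution, so this is not the exact benchmark code): the model first sees key-value pairs and must later output the value associated with each repeated key.
```
import random

def toy_mqar_example(num_pairs: int = 4, seq_len: int = 32,
                     key_vocab=range(100, 200), value_vocab=range(200, 300)):
    """Toy multi-query associative recall sequence (simplified illustration only).

    Returns a token sequence "k1 v1 k2 v2 ... q q ..." and per-position targets:
    -100 (ignored) everywhere except at query positions, where the model must
    recall the value bound to the queried key.
    """
    keys = random.sample(list(key_vocab), num_pairs)
    values = random.sample(list(value_vocab), num_pairs)
    kv = dict(zip(keys, values))
    tokens = [t for pair in zip(keys, values) for t in pair]  # the key-value context
    targets = [-100] * len(tokens)
    while len(tokens) < seq_len:
        k = random.choice(keys)
        tokens.append(k)       # a query: the key reappears ...
        targets.append(kv[k])  # ... and the model must recall its value
    return tokens, targets

tokens, targets = toy_mqar_example()
```
The results on this benchmark at the scale described above are summarized in the following table.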
| Model | Model Dimension | State Size | Number of Layers | Capacity w.r.t. Attention | Accuracy |
|-----------|-----------------|------------|------------------|---------------------------------------|------------|
| Attention | 64 | N/A | 2 | 1x | 1.0 |
| Mamba | 64 | 16 | 2 | 1/32x | 0.00 |
| SSD | 64 | 64 | 4 | 1/8x | 0.44 |
| SSD | 64 | 128 | 8 | 1/2x | 0.93 |
| SSD | 64 | 256 | 4 | 1x | 0.99 |
We observe that due to a very low capacity, **Mamba completely fails to learn the task**. On the other hand, **as we match the capacity of the SSD model with softmax attention, we recover its strong performance**, thus validating our hypothesis.
[1] *Zoology: Measuring and Improving Recall in Efficient Language Models. Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, Christopher Ré*
[2] *Repeat after me: Transformers are better than state space models at copying. Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach*
-------------------
> 3. What are the numeric issues of the proposed model? Mamba model often needs fp32 for stable training.
We appreciate the reviewer's concern regarding the stability of training with mixed precision. However, since we use the newer variant of Mamba (SSD), **we were able to successfully train and evaluate all the structured matrix variants, including Bi-Mamba, using bfloat16 floating point numbers**. Throughout the experiments, we encountered no instances of instability that could be attributed to the use of low precision floating point numbers.
--------------------------
> 4. No scaling curves provided.
We understand and value the reviewer's request for scaling curves for a more thorough understanding of how Bi-Mamba's performance varies with compute, dataset size and model size. However, since this requires substantial computational resources, producing these plots is currently beyond the scope of our academic compute budget.
--------------------------
> 5. Limitations of structured matrices.
We are glad that the reviewer raised this question, and we kindly request the reviewer to view the shared response for a detailed discussion on the **Representation-Computation tradeoff, and the hardware concerns** associated with structured matrices.
-------------------------------------------------------------------------------------------