# A Deep Dive into MAMBA and State Space Models for Long-Sequence Modeling
## Authors
Aashamsu Nepal - [@ashamsu95](https://github.com/ashamsu95)
Mirko Schürlein - [@mirkosprojects](https://github.com/mirkosprojects)
## Introduction
This research article provides a comprehensive overview of the MAMBA architecture. We begin by introducing **State Space Models (SSMs)** as the foundation and gradually build up to the full **MAMBA** architecture. Along the way, we explore several important models, including **Structured State Space Models (S4)**, **Selective State Space Models (S6)**, as well as contributions from **DSS** and **S4D**. Additionally, we examine the relationship between SSMs and traditional sequence models such as **Recurrent Neural Networks (RNNs)** and **Long Short-Term Memory networks (LSTMs)**.
> [!TIP] Code
> This research is accompanied by code that closely follows the same structure and concepts presented throughout: [https://github.com/mirkosprojects/mamba-deep-dive](https://github.com/mirkosprojects/mamba-deep-dive)
## Table of Contents
- [1) State Space Model (SSM)](#1\)-State-Space-Model-(SSM))
- [1.1) Derivation from Control Theory](#1.1\)-Derivation-from-Control-Theory)
- [1.2) Discretization](#1.2\)-Discretization)
- [1.3) Convolution](#1.3\)-Convolution)
- [1.4) State Diagram](#1.4\)-State-Diagram)
- [1.5) Summing up](#1.5\)-Summing-up)
- [2) Structured State Space Model (S4)](#2\)-Structured-State-Space-Model-(S4))
- [2.1) High-Order-Polynomial-Projections-(HiPPO)](#2.1\)-High-Order-Polynomial-Projections-(HiPPO))
- [2.2) Diagonalization](#2.2\)-Diagonalization)
- [2.3) State Diagram](#2.3\)-State-Diagram)
- [2.4) Summing up](#2.4\)-Summing-up)
- [3) Selective State Space Model (S6)](#3\)-Selective-State-Space-Model-(S6))
- [3.1) Motivation](#3.1\)-Motivation)
- [3.2) Selectivity](#3.2\)-Selectivity)
- [3.3) Algorithm](#3.3\)-Algorithm)
- [3.4) Parallel Scan](#3.4\)-Parallel-Scan)
- [3.5) Kernel Fusion](#3.5\)-Kernel-Fusion)
- [4) MAMBA](#4\)-MAMBA)
- [4.1) MAMBA Architecture](#4.1\)-MAMBA-Architecture)
- [4.2) Benchmark Results](#4.2\)-Benchmark-Results)
- [5) References](#5\)-References)
# 1) State Space Model (SSM)
## 1.1) Derivation from Control Theory
SSMs evolve from control theory. They are defined with the following two equations:
$$
\begin{aligned}
x'(t) &= Ax(t) + Bu(t) &&\text{(State Equation)}\\
y(t) &= Cx(t) + Du(t) &&\text{(Output Equation)}
\end{aligned}
$$
With:
$$
\begin{aligned}
u(t)&: \text{Input Signal}\\
x(t)&: \text{Latent State}\\
y(t)&: \text{Output Signal}\\
\end{aligned}
$$
Where a 1-dimensional input signal $u(t)$ is mapped into an n-dimensional latent state $x(t)$, which is finally projected into a 1-dimensional output signal $y(t)$.
These two equations can be represented as the block diagram below:

*State diagram of a continuous linear time-invariant (LTI) system.*
With:
- **B** (Input Matrix): how the input $u(t)$ influences the latent state $x(t)$
- **C** (Output Matrix): how the latent state $x(t)$ influences the output $y(t)$
- **A** (State Matrix): how the current latent state $x(t)$ is influenced by itself over time
- **D** (Feedforward Matrix): how the input $u(t)$ influences the output $y(t)$
Such models are frequently used in electrical or mechanical engineering to describe linear time-invariant (LTI) systems, where the matrices $A$, $B$, $C$ and $D$ are calculated from a system of differential equations. The classic example of this is the spring system shown below.
> [!NOTE] Feedforward Matrix D
> We omit $D$ in the following parts of this article because it can be easily replaced with a feedforward/skip connection.
### Example: Spring System

*Animation of a spring system that is initiated with the force $u_k$. Animation adapted from [[7]](#article-7).*
Suppose we want to model the above system with a spring attached to a mass that is sliding on the floor. This system is described by the following differential equation.
$$
my''(t) = u(t) - by'(t) - ky(t)
$$
Where:
$$
\begin{aligned}
y(t)&: \text{position}\\
y'(t)&: \text{velocity}\\
y''(t)&: \text{acceleration}\\
u(t)&: \text{applied force}\\
k&: \text{spring constant}\\
b&: \text{friction coefficient}\\
m&: \text{mass}
\end{aligned}
$$
We choose new variables $x_1$ and $x_2$ as:
$$
\begin{aligned}
x_1(t) &:= y(t)\\
x_2(t) &:= x_1'(t)
\end{aligned}
$$
In the differential equation, we can substitute:
$$
\begin{aligned}
y(t) &= x_1(t)\\
y'(t) &= x_2(t)\\
y''(t) &= x_2'(t)\\
\end{aligned}
$$
Putting $x_1, x_2, x_2'$ in the differential equation and solving for $x_2'$ yields:
$$
\begin{aligned}
&&mx_2'(t) &= u(t) - bx_2(t) - kx_1(t)\\
\Rightarrow &&x_2'(t) &= \frac{1}{m}u(t) - \frac{b}{m}x_2(t) - \frac{k}{m}x_1(t)
\end{aligned}
$$
We now have equations for $x_1'(t)$, $x_2'(t)$ and $y(t)$, which we can arrange in a form from which we can construct the state space model:
$$
\begin{aligned}
\begin{bmatrix}
x_1'(t)\\
x_2'(t)
\end{bmatrix} &=
\begin{bmatrix}
0 & 1\\
-\frac{k}{m} & -\frac{b}{m}
\end{bmatrix}
\begin{bmatrix}
x_1(t)\\
x_2(t)
\end{bmatrix}+
\begin{bmatrix}
0\\
\frac{1}{m}
\end{bmatrix}
u(t)\\
y(t) &=
\begin{bmatrix}
1 & 0
\end{bmatrix}
\begin{bmatrix}
x_1(t)\\
x_2(t)
\end{bmatrix}
\end{aligned}
$$
Where:
$$
\begin{aligned}
A &= \begin{bmatrix}
0 & 1\\
-\frac{k}{m} & -\frac{b}{m}
\end{bmatrix}\\
B &= \begin{bmatrix}
0\\
\frac{1}{m}
\end{bmatrix}\\
C &= \begin{bmatrix}
1 & 0
\end{bmatrix}
\end{aligned}
$$
This model allows us to apply a force $u(t)$ to the system and monitor the position of the mass $y(t)$, as can be seen in the animation above.
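To make the example concrete, here is a minimal NumPy sketch that builds the matrices above and simulates the position of the mass under a short force impulse using a simple forward-Euler step. The parameter values, step size and impulse are our own illustrative assumptions, not taken from the animation.
```python
import numpy as np

# Illustrative physical parameters (assumptions, not from the animation)
m, k, b = 1.0, 2.0, 0.5

A = np.array([[0.0, 1.0],
              [-k / m, -b / m]])
B = np.array([[0.0],
              [1.0 / m]])
C = np.array([[1.0, 0.0]])

dt = 0.01                              # integration step size
x = np.zeros((2, 1))                   # state: [position, velocity]
positions = []

for step in range(2000):
    u = 1.0 if step < 50 else 0.0      # short force impulse
    x = x + dt * (A @ x + B * u)       # forward Euler on x'(t) = Ax(t) + Bu(t)
    positions.append(float(C @ x))     # y(t) = Cx(t): position of the mass
```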
> [!NOTE] Parameter Initialization
> Note that we chose the newly introduced state variables $x_1$ and $x_2$ in a particular way to make the construction of the state space model easy.
> This choice is not the only option, i.e. **State Space Models are not unique**. In fact, many different sets of matrices describe exactly the same differential equation. This fact will come into play later for the initialization of our neural network.
> A detailed explanation of State Space Models with additional examples can be found in [[18]](#article-18).
To use this architecture as a machine learning model, we can simply initialize the matrices randomly and make them trainable through backpropagation. Before we do so, we have to adjust the model a little.
## 1.2) Discretization
So far, our SSM describes a continuous-time system. However, we want to provide it with discrete input, which is where discretization comes into play.
### Sampling

*Sampling of a continuous signal in intervals of $\Delta$.*
We first measure the values of the continuous signal at regular intervals of length $\Delta$, obtaining:
$$
x[k] = x(t_k) \qquad \text{where: } t_k = k\Delta
$$
Now, $x[k]$ is a discrete signal that we can work with computationally.
### Modelling between sample points
After sampling, we must model how the input behaves between sample points. This is necessary to derive the system's discrete-time response. Common choices include:
- **Bilinear Transform**: A numerical method that approximates the system's dynamics using the trapezoidal rule.
- **Zero Order Hold (ZOH)**: Assumes the input remains constant between samples.
#### Bilinear Transform
We can approximate the derivative $x'(t)$ and the function values $x(t)$ at these sample points as follows:
$$
\begin{aligned}
&x'(t_k) \approx \frac{x[k+1] - x[k]}{\Delta}
&x(t_k) \approx \frac{x[k+1] + x[k]}{2}
\end{aligned}
$$

*Approximation of derivative $x'(t)$ and function values $x(t)$ using the bilinear transform.*
Substituting $x(t)$ and $x'(t)$ for $x(t_k)$ and $x'(t_k)$ in the state equation yields:
$$
\begin{aligned}
\frac{x[k+1] - x[k]}{\Delta} = A \cdot \frac{x[k+1] + x[k]}{2} + Bu[k]
\end{aligned}
$$
Multiply both sides by $\Delta$:
$$
\begin{aligned}
x[k+1] - x[k] = \frac{\Delta}{2} A (x[k+1] + x[k]) + \Delta Bu[k]
\end{aligned}
$$
Bring $x[k+1]$ terms to one side:
$$
\begin{aligned}
\left(I - \frac{\Delta}{2} A \right) x[k+1] = \left(I + \frac{\Delta}{2} A \right) x[k] + \Delta Bu[k]
\end{aligned}
$$
Multiply both sides by $\left(I - \frac{\Delta}{2} A\right)^{-1}$ from the left:
$$
\begin{aligned}
x[k+1] = \underbrace{\left(I - \frac{\Delta}{2} A\right)^{-1} \left(I + \frac{\Delta}{2} A\right)}_{\bar{A}} \cdot x[k] + \underbrace{\left(I - \frac{\Delta}{2} A\right)^{-1} \cdot \Delta B}_{\bar{B}} u[k]
\end{aligned}
$$
> [!IMPORTANT] Bilinear Transform
> $$
> \begin{aligned}
> \bar{A} &=(I-\frac{\Delta}{2}A)^{-1}\cdot(I+\frac{\Delta}{2}A)\\
> \bar{B} &=(I-\frac{\Delta}{2}A)^{-1}\cdot \Delta B\\
> \bar{C} &=C
> \end{aligned}
> $$
#### Zero Order Hold
The Zero Order Hold (ZOH) discretization method holds the input constant between time steps. Rather than providing a full derivation, we present the resulting discretized system matrices below.

*Zero Order Hold discretization. Image: Vedant Jumle [[9]](#article-9).*
> [!IMPORTANT] Zero Order Hold
> $$
> \begin{aligned}
> \bar{A} &= e^{\Delta A}\\
> \bar{B} &= (\Delta A)^{-1} \cdot (e^{\Delta A}-I) \cdot \Delta B\\
> \bar{C} &= C
> \end{aligned}
> $$
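Both boxed formulas translate directly into code. The sketch below discretizes given continuous matrices $A$ and $B$ with either method; `scipy.linalg.expm` provides the matrix exponential, and the ZOH formula as written assumes $A$ is invertible.
```python
import numpy as np
from scipy.linalg import expm

def bilinear(A, B, dt):
    """A_bar = (I - dt/2 A)^-1 (I + dt/2 A),  B_bar = (I - dt/2 A)^-1 dt B."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)

def zero_order_hold(A, B, dt):
    """A_bar = exp(dt A),  B_bar = (dt A)^-1 (exp(dt A) - I) dt B (A invertible)."""
    I = np.eye(A.shape[0])
    A_bar = expm(dt * A)
    B_bar = np.linalg.inv(dt * A) @ (A_bar - I) @ (dt * B)
    return A_bar, B_bar
```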
### Discretized SSM
The discretized SSM is now a sequence-to-sequence map $u[k] \mapsto y[k]$ instead of a function-to-function map $u(t) \mapsto y(t)$. This allows us to recurrently calculate the state and output.
> [!WARNING] Notation
> Instead of using $u[k]$, $x[k]$ and $y[k]$, we now switch to notation $u_k$, $x_k$ and $y_k$.
$$
\begin{aligned}
x_k &= \bar{A}x_{k-1} + \bar{B}u_{k} &&\text{(State Equation)}\\
y_k &= \bar{C}x_k &&\text{(Output Equation)}
\end{aligned}
$$
The discretized SSM can be visualized using a discrete state diagram.
We replace the $\int$-block with a $z^{-1}$-block, which can be thought of as "holding" the current signal for one time step before passing it along.

*State diagram of a discrete LTI system.*
> [!NOTE] Takeaway
> We can discretize the continuous LTI-system by defining a **step size $\Delta$** and calculating $\bar{A}$, $\bar{B}$ and $\bar{C}$ using the **Bilinear Transform** or **Zero Order Hold**.
> This allows us to recurrently calculate the state $x_k$ and output $y_k$.
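In code, this recurrence is just a loop over the input sequence. The sketch below assumes `A_bar`, `B_bar` and `C` are NumPy arrays (for instance produced by the discretization helpers above) and `u` is a one-dimensional input sequence.
```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """y_k = C x_k with x_k = A_bar x_{k-1} + B_bar u_k and x_{-1} = 0."""
    x = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k    # state equation
        ys.append(float(C @ x))        # output equation (scalar output)
    return np.array(ys)
```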
## 1.3) Convolution
Using the discretized SSM, let's recurrently calculate a few steps. For simplicity, we assume the initial state $x_{-1}$ to be 0.
Step 1:
$$
\begin{aligned}
x_0 &= 0 + \bar{B}u_0 \\
y_0 &= \bar{C}x_0
\end{aligned}
$$
Step 2:
$$
\begin{aligned}
x_1 &= \bar{A}x_0 + \bar{B}u_1\\
y_1 &= \bar{C}x_1
\end{aligned}
$$
Step k:
$$
\begin{aligned}
x_k &= \bar{A}x_{k-1} + \bar{B}u_k\\
y_k &= \bar{C}x_k
\end{aligned}
$$
This recurrent formulation is very similar to what we use in Recurrent Neural Networks (RNNs), where the output of a neuron at one time step is fed back as an input to the network at the next time step. Concretely, $x_k$ can be viewed as the hidden state of an RNN.
We can unfold the recurrent representation as is shown in the image below.

*Recurrent and unfolded representation of the SSM.*
> [!NOTE] Recurrence during training
> Recurrence is useful during inference, where we want to generate one output $y_k$ at a time. However, during training this method isn't feasible because it cannot be parallelized.
During training, we have a known input sequence $u$ and a known output sequence $y$.
To find a representation that is parallelizable, we unroll the SSM once again and expand the inner terms.
Step 1:
$$
\begin{aligned}
x_0 &= \bar{B}u_0 \\
y_0 &= \bar{C}\bar{B}u_0
\end{aligned}
$$
Step 2:
$$
\begin{aligned}
x_1 &= \bar{A}\bar{B}u_0 + \bar{B}u_1\\
y_1 &= \bar{C}\bar{A}\bar{B}u_0 + \bar{C}\bar{B}u_1
\end{aligned}
$$
Step 3:
$$
\begin{aligned}
x_2 &= \bar{A}^2\bar{B}u_0 + \bar{A}\bar{B}u_1 + \bar{B}u_2\\
y_2 &= \bar{C}\bar{A}^2\bar{B}u_0 + \bar{C}\bar{A}\bar{B}u_1 + \bar{C}\bar{B}u_2
\end{aligned}
$$
Step k:
$$
\begin{aligned}
x_k &= \bar{A}^k\bar{B}u_0 + \bar{A}^{k-1}\bar{B}u_1 + \dots + \bar{A}\bar{B}u_{k-1} + \bar{B}u_k\\
y_k &= \bar{C}\bar{A}^k\bar{B}u_0 + \bar{C}\bar{A}^{k-1}\bar{B}u_1 + \dots + \bar{C}\bar{A}\bar{B}u_{k-1} + \bar{C}\bar{B}u_k
\end{aligned}
$$
As we can see in the formula above, calculating $y_k$ depends on all inputs $u_0$ to $u_k$ and the terms $\bar{C}\bar{A}^0\bar{B}$ to $\bar{C}\bar{A}^k\bar{B}$. If we group these terms into a convolution kernel $\bar{K}$, we can express the calculation of $y$ as a convolution, which is easily parallelizable:
$$
\bar{K} = (\bar{C} \bar{B}, \bar{C} \bar{A}^1 \bar{B}, \dots, \bar{C} \bar{A}^k \bar{B})
$$
$$
y = \bar{K} * u
$$
An animation of this convolution is shown below, where the input $u$ is shown in orange, the kernel $\bar{K}$ in blue and the output $y$ in green.

*Convolutional calculation of $y$ using the convolution kernel $\bar{K}$.*
This is incredibly powerful because it allows us to calculate all outputs $y$ using a single convolution. This is especially helpful during training, where we have the full input sequence $u$ and want to compute the complete output sequence $y$ in parallel before performing backpropagation.
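The equivalence of the two views is easy to check numerically. The sketch below builds the kernel $\bar{K}$ naively by repeatedly multiplying with $\bar{A}$ and convolves it with the input; the result should match `ssm_recurrent` from above. This naive kernel construction is exactly the expensive part discussed next.
```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """Naive kernel (C B, C A B, ..., C A^{L-1} B) of length L."""
    K, x = [], B_bar
    for _ in range(L):
        K.append(float(C @ x))
        x = A_bar @ x                  # repeated multiplication with A_bar
    return np.array(K)

def ssm_convolutional(A_bar, B_bar, C, u):
    K = ssm_kernel(A_bar, B_bar, C, len(u))
    # full convolution, truncated to the causal outputs y_0 ... y_{L-1}
    return np.convolve(np.asarray(u), K)[: len(u)]
```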
> [!CAUTION] Matrix Multiplication
> However, we still have a computational problem: in order to compute $\bar{K}$, we need to raise the matrix $\bar{A}$ to the $k$-th power.
> This is computationally very expensive and not feasible!
This issue is addressed by the authors of the paper *Efficiently Modeling Long Sequences with Structured State Spaces [[1]](#paper-1)*, who design the SSM in such a way that it reduces computational cost while retaining the same functionality. The architecture they come up with is called the **Structured State Space Model** or **S4**, which we will investigate in [Chapter 2](#2\)-Structured-State-Space-Model-(S4)).
## 1.4) State Diagram
To get a better understanding of the discretized SSM, we visualize the matrices.
### State Diagram with one channel

*State diagram of a state space model with one-dimensional inputs.*
In the state diagram above, each element $u[k]$ of the input signal $u$ gets projected into the state $x[k]$ using the matrix $B \in {\mathbb R}^{N \times 1}$. It is then projected down into the element $y[k]$ of the output signal $y$ using the matrix $C \in {\mathbb R}^{1 \times N}$.
>[!NOTE]
> Compared to the visualization in the chapter [Discretized SSM](#Discretized-SSM), we removed the $z^{-1}$-block and show the state $x[k]$ instead. This is done purely for a more visually appealing diagram and probably abhorrent to the eyes of a control systems engineer.
> <details>
> <summary>More mathematically correct state diagram</summary>
>
> </details>
### State Diagram with multiple channels
> [!TIP] Word Embeddings
> In Natural Language Modelling, each token is represented by an embedding vector. To utilize SSMs for Language Modelling, we therefore need to increase the input and output dimensionality.
> This is done by stacking multiple independent SSMs in parallel, introducing a new **"hidden dimension" $H$** to all matrices.

*State diagram of a system with multiple channels.*
In the state diagram above, each element $u[k]$ of the input signal $u$ gets projected into the state $x[k]$ using the matrix $B \in {\mathbb R}^{H \times N}$. It is then projected down into the element $y[k]$ of the output signal $y$ using the matrix $C \in {\mathbb R}^{H \times N}$.
### Matrix Sizes
| Single Channel | Multiple Channels |
| -------- | -------- |
|  |  |
## 1.5) Summing up
We now have a model that maps a sequence of tokenized inputs $u$ to a sequence of tokenized outputs $y$. We can train the model by making the matrices $\bar{A}$, $\bar{B}$, $\bar{C}$ and the step size $\Delta$ trainable parameters and adjusting them through backpropagation.
> [!NOTE] SSM Representations
> The SSM can be thought of as having three different representations; **Continuous**, **Recurrent** and **Convolutional**.
> 
> *Continuous, Recurrent and Convolutional representation of an SSM. Image: Albert Gu et al. [[19]](#article-19).*
>
> Discretizing the model allows for stepwise computation through recurrence, while the convolutional representation enables parallel training.
>[!TIP] Linear State Space Layer (LSSL)
>What we have just introduced is actually not an SSM but what's often referred to as a **Linear State Space Layer (LSSL)**. We can build an SSM by stacking multiple LSSLs. This is the equivalent of stacking convolutional layers to form a Convolutional Neural Network (CNN).
>[!CAUTION] Why is it not used?
> Research shows that the SSM still struggles with modelling long sequences. We will talk more about this in the following chapter.
# 2) Structured State Space Model (S4)
Structured SSMs attempt to solve two major problems of SSMs.
1. Initializing the $\bar{A}$ Matrix in such a way that the SSM becomes better at learning long sequences
2. Reducing the computational cost for calculating $\bar{K}$ by reducing matrix multiplications
## 2.1) High Order Polynomial Projections (HiPPO)
SSMs show strong performance in modeling sequential data; however, they still struggle with long-range dependencies. The authors of *HiPPO: Recurrent Memory with Optimal Polynomial Projections [[3]](#paper-3)* suggest that this is due to a poor initialization of the $A$ matrix.
### Motivation
To the rescue comes HiPPO, an online-compression framework which provides a principled method to compress a continuous signal $f(t)$ into a fixed number of coefficients $c$, from which the original signal can be reconstructed as an approximation $g(t)$.
It does so by using a set of basis functions $P(t)$ (in this case Legendre Polynomials) and weighting each basis $P_i(t)$ with a coefficient $c_i$ before adding them together.
$$
f(t) \approx g(t) = \sum_{i=0}^N c_i P_i(t)
$$
This effectively projects the function $f(t)$ onto a different basis, whose dimensionality we can limit by restricting the number of polynomials we use, therefore compressing the signal.
> [!NOTE] Legendre Polynomials
> Albert Gu et al. use a set of orthogonal polynomials called **Legendre Polynomials** as a basis; the first six are shown in the image below. Note, however, that other bases such as the Fourier basis can be used as well.
>
>*First 6 Legendre Polynomials. Image: [Wikipedia](https://en.wikipedia.org/wiki/Legendre_polynomials).*
### Example
An example of this approximation can be seen in the image below, where the dotted graph is the approximation $g(t)$ of the function $f(t)$ and the colored graphs depict the first 5 Legendre Polynomials $P_0(t), \ldots ,P_4(t)$. We can approximate the original signal $f(t)$ by weighting each Polynomial $P_i(t)$ with a coefficient $c_i$ and adding them together.

*Approximation $g(t)$ of a function $f(t)$ using Legendre Polynomials $P_i(t)$ as a Basis.*
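The same kind of approximation can be reproduced with NumPy's Legendre utilities. The sketch below fits the first $N$ coefficients of an arbitrary test signal on $[-1, 1]$ by least squares (i.e. with a uniform weighting on the interval) and reconstructs $g(t)$; the signal and $N$ are illustrative choices.
```python
import numpy as np
from numpy.polynomial import legendre

N = 6                                   # number of basis polynomials P_0 ... P_{N-1}
t = np.linspace(-1, 1, 500)
f = np.sin(4 * t) + 0.3 * t             # arbitrary signal to compress

c = legendre.legfit(t, f, deg=N - 1)    # coefficients c_0 ... c_{N-1}
g = legendre.legval(t, c)               # reconstruction g(t) = sum_i c_i P_i(t)

print("max reconstruction error:", np.max(np.abs(f - g)))
```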
### Measure
In order to calculate the coefficients $c$ that optimally approximate $f(t)$, we first need to define a measure $\mu$, which we can use to calculate the difference between the approximation $g(t)$ and the original signal $f(t)$.
The authors of HiPPO propose the three different measures shown in the table below.
| Scaled Legendre Measure (HiPPO-LegS) | Translated Legendre Measure (HiPPO-LegT) | Translated Laguerre Measure (HiPPO-LagT) |
| -------- | -------- | -------- |
|  |  |  |
| $$\mu^{(t)}(x)=\frac{1}{t}\mathbb{I}[0,t](x)$$ | $$\mu^{(t)}(x)=\frac{1}{\theta}\mathbb{I}[t-\theta,t](x)$$ | $$\mu^{(t)}(x)=e^{-(t-x)}\mathbb{I}[-\infty,t](x)$$ |
| This measure assigns a uniform weight to all history $[0, t]$. | This measure assigns a uniform weight for a sliding window with the length of $\theta$, focusing on the most recent history $[t-\theta, t]$. | This measure uses an exponentially decaying measure, which assigns more importance to the most recent history. |
> [!NOTE] Indicator Function
> Note that in the formulas above, $\mathbb{I}$ represents an **indicator function**. It is equal to 1 when x is inside the given interval and 0 if x is outside.
### Calculating the Coefficients
Our goal is to do **online function approximation**, meaning that with every new data point in $f(t)$, we re-calculate the coefficients $c$. This means that we can represent the coefficients as a time-dependent function $c(t)$.
We won't go into the full derivation of HiPPO. What is important to know is that we can form an ordinary differential equation (ODE) called the **HiPPO Operator**, where the coefficients $c(t)$ are represented as the latent state and the transition matrices $A(t)$ and $B(t)$ depend on the chosen **measure**.
$$
\begin{aligned}
c'(t) &= A(t)c(t) + B(t)f(t) &\text{(HiPPO Operator)}
\end{aligned}
$$
Assuming the matrices $A(t)$ and $B(t)$ are known, the ODE can be used to solve for the coefficients $c(t)$.
In a discrete case, the coefficients $c_k$ are calculated through recurrence.
$$
\begin{aligned}
c_k &= Ac_{k-1} + Bf_k &\text{(Discrete HiPPO Operator)}
\end{aligned}
$$
### Definition of $A$ and $B$
What's still missing is the definition of the $A$ and $B$ matrices for the different measures $\mu$. The derivation is out of the scope of this research; you can find it in Appendix D of the HiPPO paper [[3]](#paper-3). Find the $A$ and $B$ matrices for the different measures $\mu$ in the table below.
---
| LegS | LegT | LagT |
| --------------------- | --------------------- | --------------------- |
| $$\mu^{(t)}=\frac{1}{t}\mathbb{I}[0,t]$$ | $$\mu^{(t)}=\frac{1}{\theta}\mathbb{I}[t-\theta,t]$$ | $$\mu^{(t)}=e^{-(t-x)}\mathbb{I}[-\infty,t]$$ |
| $$A_{nk} = -\begin{cases}\begin{aligned}&(2n+1)^\frac{1}{2}(2k+1)^\frac{1}{2} &if\ n>k\\&n+1 &if\ n=k\\&0 &if\ n<k\\\end{aligned}\end{cases}$$ | $$A_{nk} = -\frac{1}{\theta}\begin{cases}\begin{aligned}&(-1)^{n-k}(2n+1)^\frac{1}{2} &if\ n\ge k\\&2n+1 &if\ n<k\\\end{aligned}\end{cases}$$ | $$A_{nk} = -\frac{1}{\theta}\begin{cases}\begin{aligned}&1 &if\ n\ge k\\&0 &if\ n<k\\\end{aligned}\end{cases}$$ |
| $$B_n = (2n+1)^\frac{1}{2}$$ | $$B_n = \frac{1}{\theta}(2n+1)(-1)^n$$ | $$B_n=1$$ |
> [!NOTE] Piecewise Definition
> The Matrices above are defined using piecewise functions. For a matrix with N rows and K columns, we can insert the parameters n and k for every element and get the resulting matrix.
> ##### Example: LegS Matrix
> $$
> A = -\begin{bmatrix}
> 1 & 0 & 0 & 0 \\
> \sqrt{3} & 2 & 0 & 0 \\
> \sqrt{5} & \sqrt{15} & 3 & 0 \\
> \sqrt{7} & \sqrt{21} & \sqrt{35} & 4 \\
> \end{bmatrix}
> $$
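The piecewise definitions translate directly into code. The sketch below generates the HiPPO-LegS matrices $A$ and $B$ for a given state size $N$, with the indices $n, k$ starting at 0 as in the example above.
```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS transition matrix A (N x N) and input vector B (N,)."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
            # entries with n < k stay zero
    B = np.sqrt(2 * np.arange(N) + 1.0)
    return A, B
```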
### Connection to RNNs, LSTMs, LMUs
The HiPPO Operator can be integrated into RNNs by simply projecting the state into coefficients $c$ and passing them along.
This is very similar to LSTMs and GRUs, which use a gating-mechanism instead.
HiPPO-LegT is exactly equivalent to the [Legendre Memory Unit (LMU)](https://proceedings.neurips.cc/paper_files/paper/2019/file/952285b9b7e7a1be5aa7849f32ffff05-Paper.pdf).
More information about integrating HiPPO into ML models can be found in the article *HiPPO: Recurrent Memory with Optimal Polynomial Projections [[11]](#article-11)*.
> [!NOTE] Takeaway
> By initializing the $A$ Matrix using the HiPPO Matrix, we can make sure that the SSM is good at modelling long sequences.
> The authors of the HiPPO Framework report that initializing the $\bar{A}$ Matrix with HiPPO improved their model's performance on the pMNIST dataset significantly, increasing its accuracy from 69.9% with a randomly initialized $\bar{A}$ Matrix to 98.3%!
## 2.2) Diagonalization
### Motivation
As discussed in chapter [1.3) Convolution](#1.3\)-Convolution), computing the convolution kernel $\bar{K}$ requires raising $\bar{A}$ to the power of $k$, which is computationally expensive.
$$
\bar{K}=
(\bar{C}\bar{B}, \bar{C}\bar{A}\bar{B}, \bar{C}\bar{A^2}\bar{B}, \dots, \bar{C}\bar{A^k}\bar{B})
$$
However, if $\bar{A}$ were a diagonal matrix, exponentiation would be much simpler: each element could be raised to the power individually, eliminating the need for repeated matrix multiplication. For example:
$$
\bar{A}^2 = \bar{A} \cdot \bar{A} = \begin{bmatrix}
a_1 & 0 & \cdots & 0\\
0 & a_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & a_N
\end{bmatrix}
\cdot
\begin{bmatrix}
a_1 & 0 & \cdots & 0\\
0 & a_2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & a_N
\end{bmatrix}=
\begin{bmatrix}
a_1^2 & 0 & \cdots & 0\\
0 & a_2^2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & a_N^2
\end{bmatrix}
$$
Given a diagonal matrix $\bar{A}$, each term $(\bar{C}\bar{A^k}\bar{B})$ of the convolution kernel $\bar{K}$ can be easily computed.
$$
\begin{aligned}
\bar{C}\bar{A^k}\bar{B} &=
\begin{bmatrix}
c_1 & c_2 & \dots & c_N
\end{bmatrix}
\cdot
\begin{bmatrix}
a_1^k & 0 & \cdots & 0\\
0 & a_2^k & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & a_N^k
\end{bmatrix}
\cdot
\begin{bmatrix}
b_1\\
b_2\\
\vdots\\
b_N
\end{bmatrix}\\
&=
\begin{bmatrix}
c_1 a_1^k & c_2 a_2^k & \dots & c_N a_N^k
\end{bmatrix}
\cdot
\begin{bmatrix}
b_1\\
b_2\\
\vdots\\
b_N
\end{bmatrix}\\
&= \sum_{i=1}^N c_i a_i^k b_i
\end{aligned}
$$
By collecting the successive powers of the elements $a_i$ into a matrix, we can rewrite the computation of $\bar{K}$ as follows.
$$
\bar{K}=
\begin{bmatrix}
c_1 b_1 & c_2 b_2 & \dots & c_N b_N
\end{bmatrix}
\cdot
\underbrace{\begin{bmatrix}
1 & a_1 & a_1^2 & \cdots & a_1^k\\
1 & a_2 & a_2^2 & \cdots & a_2^k\\
1 & a_3 & a_3^2 & \cdots & a_3^k\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & a_N & a_N^2 & \cdots & a_N^k
\end{bmatrix}}_{Vandermonde(\bar{A})}
$$
The final matrix above is called the **Vandermonde matrix** $V(\bar{A})$. It can be efficiently constructed from $\bar{A}$ and only requires exponentiation of the individual elements $a_i$. Putting it all together:
$$
\bar{K} = (\bar{B} \odot \bar{C}) \cdot V(\bar{A})
$$
Where $\bar{B} \odot \bar{C}$ is an element-wise product and $(\bar{B} \odot \bar{C}) \cdot (V(\bar{A}))$ is the only matrix multiplication.
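For a diagonal $\bar{A}$ given by its entries $a_i$, this is only a few lines of NumPy. The names and test values below are illustrative.
```python
import numpy as np

def kernel_vandermonde(a, b, c, L):
    """K_j = sum_i c_i * a_i**j * b_i for j = 0 ... L-1, with diagonal A_bar = diag(a)."""
    V = a[:, None] ** np.arange(L)     # Vandermonde matrix of shape (N, L)
    return (b * c) @ V                 # (B ⊙ C) · V(A_bar)

a = np.array([0.9, 0.5, -0.3])         # diagonal entries of A_bar (illustrative)
b = np.array([1.0, 2.0, 0.5])          # B_bar as a vector
c = np.array([0.2, -1.0, 0.7])         # C_bar as a vector
K = kernel_vandermonde(a, b, c, L=8)
```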
> [!NOTE] Takeaway
> We have shown that computing $\bar{K}$ is less computationally expensive if $\bar{A}$ is a diagonal matrix, however we still need to actually diagonalize $\bar{A}$.
### Simple Diagonalization
The authors of S4 provide a derivation for the diagonalization of $\bar{A}$ which we won't go into the details of. It is based on the following two theorems.
##### Theorem 1
> A matrix $A$ is diagonalizable if it is possible to reconstruct it from a diagonal matrix $\Lambda$ and an invertible matrix $P$.
$$
\begin{aligned}
A = P \Lambda P^{-1} \\
\Lambda = P^{-1} A P
\end{aligned}
$$
##### Theorem 2
> A State Space Transformation is an operation, where we can take one SSM and map it to a different but equivalent SSM. We perform a State Space Transformation by defining a new state vector $\tilde{x}$ based on the state vector $x$ of our existing SSM like so:
> $$
> \tilde{x} = P^{-1} x
> $$
The result is a different but equivalent SSM:
$$
SSM(A,B,C,D) \sim SSM(\Lambda, P^{-1}B, CP, D)
$$
> [!CAUTION] Numerical Instability
> Diagonalizing the HiPPO Matrix this way doesn't work due to numerical instability. When initializing $\bar{A}$ with HiPPO and then diagonalizing it, the diagonalized version will have exponentially large entries.
### DPLR Diagonalization Algorithm
The authors of S4 discard the idea of using a diagonal $\bar{A}$ matrix due to numerical instability and instead come up with a combination of a diagonal matrix $\Lambda$ and two low-rank matrices $P$ and $Q$, which they call **D**iagonal **P**lus **L**ow **R**ank (DPLR), alongside an algorithm that makes the calculation of $\bar{K}$ less computationally expensive.

*Diagonal matrix $\Lambda$ (left) and two low-rank matrices $P$ and $Q$ (right). Image: Sascha Kirch [[15]](#article-15).*
Note that the DPLR matrix is no longer fully diagonal due to the low-rank terms $P$ and $Q$, which necessitates a specialized algorithm.
> [!TIP] Algorithm
> The algorithm they propose is complex and out of the scope of this research. Find it below for completeness.
> 
> *Algorithm for efficient calculation of $\bar{K}$. Image: Albert Gu et al. [[1]](#paper-1).*
### S4D Diagonalization
Following the work of S4, *Gupta, A., Gu, A., & Berant, J. [[5]](#paper-5)* came up with **Diagonal State Spaces (DSS)**, which further simplifies the DPLR matrix by removing the low-rank terms altogether. This eliminates the need for the algorithm mentioned above, since computing $\bar{K}$ for diagonal matrices is computationally inexpensive.
The authors of S4 followed up with the paper *On the Parameterization and Initialization of Diagonal State Space Models [[4]](#paper-4)*, introducing the **S4D** architecture and providing analysis of this approach as well as the additional initialization techniques below.
| S4D-Inv | S4D-Lin | S4D-Inv2 | S4D-Quad | S4D-Real |
| ------------------------------------------------------- | ----------------------------- | ------------------------------------------------------ | -------------------------------- | ---------------- |
| $$A_n = -\frac{1}{2}+i\frac{N}{\pi}(\frac{N}{2n+1}-1)$$ | $$A_n = -\frac{1}{2}+i\pi n$$ | $$A_n = -\frac{1}{2}+i\frac{N}{\pi}(\frac{N}{n+1}-1)$$ | $$A_n = -\frac{1}{\pi}(1+2n)^2$$ | $$A_n = -(n+1)$$ |
Note that since we now use diagonal matrices $A$, we can index them using only one parameter $n$ for both rows and columns, whereas in [HiPPO](#2.1\)-High-Order-Polynomial-Projections-(HiPPO)) we used $k$ and $n$.
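Two of these initializations are particularly easy to write down; the sketch below returns the diagonal entries $A_n$ for S4D-Lin and S4D-Real following the table above (the remaining variants follow the same pattern).
```python
import numpy as np

def s4d_lin(N):
    # S4D-Lin: A_n = -1/2 + i * pi * n
    return -0.5 + 1j * np.pi * np.arange(N)

def s4d_real(N):
    # S4D-Real: A_n = -(n + 1)
    return -(np.arange(N) + 1.0)
```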
> [!NOTE] Takeaway
> The core challenge in computing $\bar{K}$ is that it requires the calculation of $\bar{A}^k$, which is computationally expensive.
> In S4, $\bar{A}$ is initialized using HiPPO, converted into DPLR form and then processed with a complex algorithm to compute $\bar{K}$ efficiently. In contrast, DSS and S4D initialize $\bar{A}$ as a diagonal matrix from the start, eliminating the need for both DPLR conversion and the complex computation of $\bar{K}$.
## 2.3) State Diagram

*State Diagram with a diagonal $\bar{A}$ matrix. The matrix can be represented as a vector.*
The image above shows the state diagram using a diagonal $\bar{A}$ matrix. We reduced its size from $N \times N$ to $1 \times N$ and can visualize it as a vector.
## 2.4) Summing up
The result of the two optimizations to the SSM (HiPPO-Initialization + DPLR-Diagonalization) is the Structured State Space Model (S4).
The term structured refers to the specific form imposed on the state matrix $\bar{A}$.
Building on this work, the two subsequent models DSS and S4D simplify the architecture by initializing $\bar{A}$ as a diagonal matrix from the start.
# 3) Selective State Space Model (S6)
## 3.1) Motivation
The S4, DSS, and S4D architectures compress all past information into a fixed-size state. This allows fast training through convolution and memory-efficient inference using recurrence. However, these models cannot selectively forget or update information in a context-aware manner, unlike attention heads in Transformer models.
Let’s examine some of the tasks where traditional SSMs encounter difficulties.
### Example
#### Copying Tasks

*Illustration of a simple Copying Task with constant spacing between input and output (left) and a Selective Copying Task with random spacing in between inputs (right). Image: Albert Gu et al. [[2]](#paper-2).*
In the **Copying Task** shown above, the model is required to reproduce the input sequence with a fixed time delay. This task relies solely on temporal dependencies and does not require any understanding of the input content or context. As such, it can be handled effectively by vanilla State Space Models (SSMs) or more advanced models like S4.
In contrast, the **Selective Copying Task** requires the model to output only specific tokens (the colored ones) while ignoring others (the white ones). This introduces a need for context-aware reasoning, as the model must differentiate relevant inputs based on their content. Vanilla SSMs struggle with this task because their parameters remain fixed across all inputs, causing them to treat every token equally without regard to context.
#### Induction Head Task

*The model must retrieve the correct output based on contextual patterns. Image: Albert Gu et al. [[2]](#paper-2).*
Another task where regular SSMs do not perform well is the **Induction Head Task**. As shown in the figure, the model needs to recall information from previous inputs to generate the current output.
An example of such a task would be filtering out bad words from a stream of text. A vanilla SSM would struggle because it cannot dynamically decide which words to ignore or keep based on context.
To overcome this problem, the authors of the paper *Mamba: Linear-Time Sequence Modeling with Selective State Spaces [[2]](#paper-2)* propose a new architecture called **Selective State Space Model (S6)**.
### Effectiveness vs. Efficiency
The difference between Transformers and SSMs is best explained by looking at the state size and performance of each model.
Transformers retain the entire input sequence, enabling attention to any token. In contrast, State Space Models compress all information into a hidden state of fixed size. Consequently, the computational and memory requirements of Transformers scale with sequence length, whereas those of SSMs remain constant, offering greater efficiency for long sequences.
What we really want is a model in between that is both powerful and efficient.

*Effectiveness vs. efficiency tradeoff: Mamba selectively compresses data, balancing the power of Transformers with the efficiency of RNNs. Image: Maarten Grootendorst [[10]](#article-10).*
## 3.2) Selectivity
The authors of MAMBA propose a model that **selectively** compresses only relevant information into the state, resulting in a model with a more balanced tradeoff between performance and efficiency.
Selectivity is unfortunately impossible with standard SSMs because of their time-invariance. The $\bar{A}$, $\bar{B}$ and $\bar{C}$ matrices are fixed and independent of the input, which prevents the model from dynamically distinguishing between relevant and irrelevant input.
Selective State Space Models overcome this problem by making the matrices $\bar{B}$ and $\bar{C}$ as well as the step size $\Delta$ time-dependent, effectively turning the system into a **linear time-variant (LTV)** system. In a discrete case, this can be shown by adding the subscript $k$ to the matrices:
$$
\begin{align}
h_k &= \bar{A}_k h_{k-1} + \bar{B}_k x_k \\
y_k &= \bar{C}_k h_k
\end{align}
$$
> [!WARNING] Notation differences
>
> The notation changes in the MAMBA paper. The hidden state, previously denoted as $x$, is now represented by $h$, and the input, previously denoted as $u$, is now represented by $x$.
>
> $$
> \begin{align}
> \text{Hidden State}: \qquad x \rightarrow h\\
> \text{Input}: \qquad u \rightarrow x\\
> \end{align}
> $$
>
> Additionally, the notation for the matrix dimensions mentioned in [Chapter 1.4](#1.4\)-State-Diagram) changes as well. What was previously denoted as the hidden- or channel-dimension $H$ is now $D$.
>
> $$
> \text{Hidden Dimension}: \quad H \rightarrow D
> $$
## 3.3) Algorithm

*Comparison of the algorithms of structured SSM (S4) and selective SSM (S6). Image: Albert Gu et al. [[2]](#paper-2).*
The table above illustrates the differences between S4 and S6. Note that in both models the $A$ matrix itself remains input-independent.
> [!TIP] Reminder
> Matrix $A$ in S4 comes from a special structured matrix called HiPPO.
> The authors of Mamba chose to keep this structure unchanged for stability and efficiency.
The matrices $B$ and $C$, as well as the step size $\Delta$, are no longer direct learnable parameters, because they depend on the input and change over time.
The learnable parameters are instead the linear projections $s_B(x)$, $s_C(x)$, $s_{\Delta}(x)$, which map an input to the corresponding matrix.
$$
\begin{align}
s_B(x) &= \text{Linear}_N(x), \\
s_C(x) &= \text{Linear}_N(x), \\
s_\Delta(x) &= \text{Broadcast}_D(\text{Linear}_1(x)), \\
\tau_\Delta &= \text{softplus} \\
\end{align}
$$
By making the matrices $B$, $C$, and $\Delta$ time-dependent, each input token is associated with a unique set of parameters. This introduces selectivity: the model can now decide which information is saved in the state. Since some inputs may be more relevant than others for a given task, this time dependence helps the model filter out irrelevant information more effectively.

*Matrix sizes of the structured state space model (S4). Image: Maarten Grootendorst [[10]](#article-10).*
The image below illustrates how the sizes of matrices $A$, $B$ and $C$ change compared to S4.

*Matrix sizes of the selective state space model (S6). Image: Maarten Grootendorst [[10]](#article-10).*
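A minimal PyTorch-style sketch of these input-dependent projections and the resulting recurrence is shown below. Shapes follow the paper's $(B, L, D, N)$ convention, $A$ is kept as a learned but input-independent parameter of shape $(D, N)$, $\bar{A}$ uses the ZOH-style $\exp(\Delta A)$ while $\bar{B}$ uses the simplified $\Delta B$, and the scan is written sequentially for clarity; the real implementation uses the parallel scan described below. All module and variable names are our own, not the reference implementation's.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    def __init__(self, d_model, d_state):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)       # s_B(x)
        self.s_C = nn.Linear(d_model, d_state)       # s_C(x)
        self.s_dt = nn.Linear(d_model, 1)            # s_Delta(x), broadcast over D
        self.A = nn.Parameter(-torch.rand(d_model, d_state))  # learned, not input-dependent (illustrative init)

    def forward(self, x):                            # x: (B, L, D)
        B_mat = self.s_B(x)                          # (B, L, N)
        C_mat = self.s_C(x)                          # (B, L, N)
        dt = F.softplus(self.s_dt(x))                # (B, L, 1), tau_Delta = softplus
        A_bar = torch.exp(dt.unsqueeze(-1) * self.A)            # ZOH-style A_bar: (B, L, D, N)
        B_bar = dt.unsqueeze(-1) * B_mat.unsqueeze(2)            # simplified B_bar = Delta * B: (B, L, 1, N)
        h = torch.zeros(x.shape[0], x.shape[2], self.A.shape[1], device=x.device)
        ys = []
        for k in range(x.shape[1]):                  # sequential scan over the sequence
            h = A_bar[:, k] * h + B_bar[:, k] * x[:, k].unsqueeze(-1)
            ys.append((h * C_mat[:, k].unsqueeze(1)).sum(-1))    # y_k = C_k h_k per channel
        return torch.stack(ys, dim=1)                # (B, L, D)
```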
>[!CAUTION] Convolution
> The time variance of the system unfortunately makes convolution during training impossible because the convolution kernel $\bar{K}$ changes depending on the input, meaning we cannot precompute it.
>
> When processing $y_1$, the first element of $\bar{K}$ is $k_1=\bar{C_1}\bar{B_1}$
> 
>*Image: Oleguer Canal [[16]](#article-16)*
>
> On the next convolution step, the first element of $\bar{K}$ is completely different: $k_1=\bar{C_2}\bar{B_2}$
> 
> *Image: Oleguer Canal [[16]](#article-16).*
>
> Because convolution isn't an option, we need to switch back to using the model in its recurrent form.
To overcome this limitation, the authors of Mamba revisit the core computational challenges of SSMs and address them using three main techniques:
* Parallel Scan
* Kernel Fusion
* Recomputation
## 3.4) Parallel Scan
Parallel scan is an algorithm introduced by Blelloch in 1990 in the paper *[Prefix Sums and Their Applications](https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf)*.
> [!NOTE] Prefix Sums Problem
>We have an array with n elements Array=[1,2,3....n]
>then Prefix of nth element will be PrefixSums[n] = Array[0] + Array[1] + … + Array[n]
>For example,
>[3 1 7 0 4 1 6 3]
>would return
>[0 3 4 11 11 15 16 22]
The above problem can be tackled in two ways.
### Sequential
We can compute this sequentially by looping through the array.
```python
a = [3, 1, 7, 0, 4, 1, 6, 3]        # input array from the example above
out = [0] * len(a)                  # out[0] = 0
for j in range(1, len(a)):
    out[j] = out[j - 1] + a[j - 1]  # exclusive prefix sums
```
* Depth: O(n)
### Parallel
Alternatively, you can restructure the computation using a technique called parallel scan, which breaks the dependency chain by splitting the work and merging the results in a tree-like fashion (a runnable sketch of this scan follows below).
```
# Phase 1: apply the function to every element (fully parallel)
forall j in parallel:
    temp[j] = f(in[j])
# Phase 2: run a parallel scan (all-prefix-sums) on temp
all_prefix_sums(out, temp)
```
* Depth: O(log n)
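For intuition, here is a small sequential simulation of the Blelloch up-sweep/down-sweep exclusive scan. On a GPU each inner loop would run in parallel, which is where the O(log n) depth comes from; the array length is assumed to be a power of two for simplicity.
```python
def blelloch_exclusive_scan(a):
    """Exclusive prefix sums via up-sweep / down-sweep (len(a) must be a power of two)."""
    x = list(a)
    n = len(x)
    # Up-sweep (reduce): build partial sums in a binary tree
    step = 1
    while step < n:
        for i in range(step * 2 - 1, n, step * 2):   # would run in parallel on a GPU
            x[i] += x[i - step]
        step *= 2
    # Down-sweep: push the prefix sums back down the tree
    x[n - 1] = 0
    step = n // 2
    while step >= 1:
        for i in range(step * 2 - 1, n, step * 2):   # would run in parallel on a GPU
            x[i - step], x[i] = x[i], x[i] + x[i - step]
        step //= 2
    return x

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```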
> [!CAUTION] In parallel, we still need the previous values to compute the next one
> For simple prefix sums, the benefit of using parallel scan isn’t always huge unless the array is large.
In Mamba, we're not just summing numbers. We're dealing with a time-dependent state-space recurrence of the form:
$$
\begin{align}
h_k &= \bar{A}_k h_{k-1} + \bar{B}_k x_k \\
\end{align}
$$
Each step includes a learned, input-conditioned matrix multiplication and input injection. If you had a sequence of 1000 tokens, computing this recurrence step by step would require 1000 sequential steps. That’s slow and GPU-unfriendly.

*Figure shows how the hidden state is calculated with sequential recurrence, one step at a time. Image: Sascha Kirch [[15]](#article-15).*
This is where parallel scan becomes powerful. First, since the input injection term $\bar{B}_k x_k$ doesn’t depend on the previous state, all threads on the GPU can compute this part in parallel. Then, instead of computing each state one by one, let's represent the state equation:
$$
\begin{align}
h_k &= \bar{A}_k h_{k-1} + \bar{B}_k x_k \\
\end{align}
$$
as a function applied to the previous state, $h_k = f_k(h_{k-1})$, where:
$$
\begin{align}
f_k(h) &= \bar{A}_k h + \bar{B}_k x_k
\end{align}
$$
Function composition is associative, which means we can compose these maps pairwise in parallel:
$$
f_1 \circ f_2,\quad f_3 \circ f_4,\quad\text{....}
$$
This is exactly what the first row in the figure below is doing. The second and third rows merge transformation pairs recursively, building a composition tree that reduces the computation depth from linear to logarithmic.

*Figure shows how the hidden state is calculated with parallel scan. Image: Sascha Kirch [[15]](#article-15).*
Once all the functions are composed, we make a final pass to evaluate them with the initial state and recover the actual hidden states. Because each thread is now doing substantial matrix-based work, not just simple additions, the scan offers real, tangible performance benefits.
Instead of calculating each output one at a time like traditional models (where each step waits for the previous one), Mamba first does all of the simple input processing for every step in the sequence at once. It then determines the update rules for how each step should change the model’s state, but doesn’t apply them yet. Once everything is ready, it walks through the sequence and applies those rules to get the final result. This way, even though the model still follows a logical order (where earlier steps affect later ones), most of the hard work is done ahead of time and in parallel, which makes it much faster and more efficient for handling long sequences.
So, instead of needing 8 sequential steps for 8 tokens, the scan allows the model to compute everything in 3 steps ($\log_2(8) = 3$), as shown in the diagram, with most of the heavy lifting done in parallel.
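The reason the recurrence can be scanned is that each step $h \mapsto \bar{A}_k h + \bar{B}_k x_k$ is an affine map, and two affine maps compose into another affine map: $(a_2, b_2) \circ (a_1, b_1) = (a_2 a_1,\; a_2 b_1 + b_2)$. The sketch below applies this combine rule for a diagonal $\bar{A}$ (so the products are element-wise), written as a simple sequential reduction; a parallel implementation would apply the same associative combine in the tree pattern shown above.
```python
import numpy as np

def combine(f, g):
    """Compose two affine maps h -> a*h + b (element-wise a for a diagonal A_bar)."""
    a1, b1 = f
    a2, b2 = g
    return a2 * a1, a2 * b1 + b2

def selective_scan(A_bar, Bx):
    """Return all hidden states h_k = A_bar_k * h_{k-1} + Bx_k with h_{-1} = 0.

    A_bar and Bx have shape (L, N); Bx_k = B_bar_k x_k can be precomputed in parallel.
    """
    acc = (np.ones_like(A_bar[0]), np.zeros_like(Bx[0]))   # identity map
    hs = []
    for k in range(A_bar.shape[0]):
        acc = combine(acc, (A_bar[k], Bx[k]))   # associative, so a tree reduction also works
        hs.append(acc[1])                       # composed map applied to h_{-1} = 0
    return np.stack(hs)
```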
## 3.5) Kernel Fusion
To understand kernel fusion, let’s first look at how memory works in a GPU.
In the figure below, we can see that GPUs have two main types of memory: High Bandwidth Memory (HBM), often called DRAM, and SRAM. While GPUs also have local memory (specific to each thread), we’ll ignore that here since it’s not directly relevant to our topic.

*Overview of a GPU memory block. Image: Albert Gu et al. [[2]](#paper-2)*
DRAM (or HBM) is what we usually refer to when we talk about GPU memory — the 16GB, 24GB, or 48GB we see in GPU specs. SRAM, on the other hand, is much faster and is used internally by GPU cores during computation.
For example, when a GPU needs to perform matrix multiplication, it first loads the necessary data from DRAM into SRAM. The cores perform the computation using the fast SRAM, and then the result is written back to DRAM.
If we look at the specs of a modern GPU like the RTX 5090 (example below), we notice something important:
It can perform 3352 TOPS (trillions of operations per second), yet its memory bandwidth is “only” 1792 GB/s. This means that although the GPU is capable of extremely fast computation, moving data between DRAM and SRAM is comparatively slow.

*Specification table of RTX 5090 Image: [Nvidia](https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/)*
>[!CAUTION] A common bottleneck: IO-bound
> Sometimes, when we write GPU code (e.g., CUDA kernels), the code runs slowly not because the math is slow, but because we spend too much time copying data between DRAM and SRAM. This kind of slowdown is called being IO-bound (i.e., limited by memory bandwidth, not compute power).
### Mamba’s Feature
The authors of Mamba describe the core idea as follows:
>*"The main idea is to leverage properties of modern accelerators (GPUs) to materialize the state ℎ only in more efficient levels of the memory hierarchy. In particular, most operations (except matrix multiplication) are bounded by memory bandwidth (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009). This includes our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.*
>
>*Concretely, instead of preparing the scan input (𝑨, 𝑩) of size (B, L, D, N) in GPU HBM (high-bandwidth memory), we load the SSM parameters (Δ, 𝑨, 𝑩, 𝑪) directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size (B, L, D) back to HBM."*
The image below shows which operations are performed directly in SRAM and which are copied back to DRAM.
This minimizes memory reads/writes and makes execution much faster.

*Overview Structure of S6 showing each channel of an input $x$ to output $y$ through a higher dimensional latent state $h$. Image: Albert Gu et al. [[2]](#paper-2).*
> [!NOTE] Kernel Fusion Simplified
>
>Normally, each tensor operation (like ReLU, etc.) runs in a separate CUDA kernel loading from memory, processing, then writing back. This back-and-forth to memory (HBM & SRAM) happens for every step, slowing things down.
>
>With kernel fusion, multiple operations are combined into a single kernel:
>
>* Load tensor once
>* Do all ops in fast SRAM
>* Write final result once
>
>This cuts down memory access and speeds up computation significantly.
> [!NOTE] Takeaway
> Mamba keeps scan operations in fast GPU memory (SRAM) and fuses recurrence, discretization, and projection into one custom CUDA kernel. Since it's memory-bound, not compute-bound, reducing memory traffic makes Mamba much faster with just smarter GPU usage.
### Recomputation
Recomputation is a technique used to save memory during training. The idea is pretty simple: instead of storing every intermediate result in memory (like all the hidden states or layer outputs), we recompute some of them on the fly when needed, especially during the backward pass.
Normally, during training, we store everything so we can compute gradients later. But this can use up a lot of memory, especially with long sequences like in Mamba. So instead, we save memory by not storing all intermediate hidden states (h₀, h₁, h₂, ...). When it's time to backpropagate, we recompute just the values we need from earlier steps using the same recurrence rule. This ends up being more memory-efficient and is a common trick to reduce the memory footprint at train time.
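The same pattern is available generically in PyTorch through gradient checkpointing. This is not Mamba's fused kernel, just an illustration of the general recomputation idea: activations inside the wrapped function are not stored and are recomputed during the backward pass.
```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.SiLU(), torch.nn.Linear(64, 64)
)
x = torch.randn(8, 64, requires_grad=True)

# Forward pass without storing the intermediate activations of `layer`;
# they are recomputed on the fly during the backward pass.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```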
# 4) MAMBA
Now that we have the selective state space model (S6) defined, the last remaining step is to encapsulate it in a block that can be stacked to build a deep learning architecture.
> [!NOTE] Notation
> For notation, we follow the code found in [mamba-minimal](https://github.com/johnma2006/mamba-minimal). They put the MAMBA Block inside a Residual Block, which is repeated in the model.
## 4.1) MAMBA Architecture

*MAMBA Block Architecture (left) and MAMBA Model (right). Image adapted from Sascha Kirch [[15]](#article-15).*
##### Parameters
- **d_model** (Parameter): Input and Output dimensionality of the model
- **expand** (Parameter): Expansion factor
- **d_inner** (Variable): The size of $D$ of the selective SSM, `d_inner = expand * d_model`
- **d_state** (Parameter): State size $N$ of the selective SSM
- **d_conv** (Parameter): Convolution kernel size for `conv1d`
- **vocab_size** (Parameter): The size of the vocabulary, i.e. the number of tokens in the embedding table
##### MAMBA Block
Let's start with the innermost MAMBA block. A schematic code sketch of all three levels follows after these lists.
- A **Linear Projection** `in_proj` is used to project from `d_model` to `2*d_inner`
- The output is **split** into two paths of dimension `d_inner`
- Path 1 contains a **convolution** that preserves the previous shape
- Both paths contain a **SiLU** activation function
- Path 1 goes through the **selective SSM**, where $D$ is equal to `d_inner`
- The two paths are **combined** using element-wise multiplication, which makes path 2 act as a gating function that can selectively retain information
- Another **Linear Projection** `out_proj` is used to project from `d_inner` to `d_model`.
##### Residual Block
Surrounding the MAMBA block is a residual block
- The input is **split** into two paths
- Path 1 is normalized using **RMS Layer Normalization** and then fed through the **MAMBA block**, while path 2 simply propagates the input without applying any operation
- The two paths are **added** together element-wise
##### MAMBA Model
Surrounding the stack of residual blocks is the model, which contains the embedding.
- A **Linear Projection** `embedding` is used to project from `vocab_size` to `d_model`
- The stack of residual blocks is applied
- A **Linear Projection** `head` is used to project from `d_model` to `vocab_size`.
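The following schematic PyTorch sketch mirrors the description above. It is our own simplification, not the mamba-minimal code itself: the selective SSM is left as a placeholder (`nn.Identity`), biases, initialization and the state size `d_state` are omitted, and any names beyond the parameters listed above are ours.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Minimal RMS layer normalization."""
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class MambaBlock(nn.Module):
    def __init__(self, d_model, expand=2, d_conv=4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)               # d_model -> 2 * d_inner
        self.conv1d = nn.Conv1d(d_inner, d_inner, kernel_size=d_conv,
                                groups=d_inner, padding=d_conv - 1)  # depthwise, causal
        self.ssm = nn.Identity()                                     # placeholder for the selective SSM (S6)
        self.out_proj = nn.Linear(d_inner, d_model)                  # d_inner -> d_model

    def forward(self, x):                                            # x: (B, L, d_model)
        path1, path2 = self.in_proj(x).chunk(2, dim=-1)              # split into two paths of size d_inner
        path1 = self.conv1d(path1.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        path1 = self.ssm(F.silu(path1))                              # SiLU, then selective SSM
        return self.out_proj(path1 * F.silu(path2))                  # gating + output projection

class ResidualBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.mamba = MambaBlock(d_model)

    def forward(self, x):
        return x + self.mamba(self.norm(x))                          # RMSNorm -> MAMBA block -> skip connection

class MambaModel(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)           # token ids -> d_model vectors
        self.layers = nn.ModuleList([ResidualBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)                   # d_model -> vocabulary logits

    def forward(self, tokens):                                       # tokens: (B, L) integer ids
        x = self.embedding(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.head(x)
```
Replacing the `nn.Identity` placeholder with a selective SSM such as the one sketched in Chapter 3 yields a trainable language-model skeleton.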
## 4.2) Benchmark Results
In the earlier chapters, we introduced two key problems: selective copying and the induction head, which serve as the motivation behind Mamba. Now, let's examine how Mamba performs with these tasks.
### Selective Copying
The Selective Copying task challenges the memorization abilities of recurrent models by randomizing the spacing between tokens. MAMBA's selective SSM is compared against other architectures like H3. Mamba performed exceptionally well in these evaluations. While the selection mechanism of S6 effectively solves this task, it performs even better when combined with more powerful architectures.

*Selective Copying task Accuracy. Image: Albert Gu et al. [[2]](#paper-2).*
### Induction Heads
The Induction Head task is a simple but insightful test that helps us understand how well large language models (LLMs) can learn and retain information within a sequence. From a mechanistic interpretability standpoint, this task provides valuable insight into the model's ability to learn from context. For example, if the model learns that "Germany is hot in July," it should be able to remember this pattern and, whenever it encounters "Germany" and "July" again in the future, it should correctly predict "hot" based on its prior learning.

*Performance of MAMBA on the induction head task compared to baseline models. Image: Albert Gu et al. [[2]](#paper-2).*
Here they trained a 2-layer model on the induction heads task with sequences of length 256 and a vocabulary size of 16. They evaluated the model’s ability to generalize and extrapolate by testing it on sequences ranging from 64 to over a million tokens. They used both multi-head attention and SSM variants, with MAMBA using a model dimension of 64 and the other models using 128. The results showed that MAMBA, specifically its selective SSM layer, solved the task perfectly by remembering relevant tokens while ignoring the rest.
It was able to generalize to sequences 4000 times longer than what it saw during training, outperforming other methods that struggled with sequences beyond twice the training length.
> [!NOTE] Honorable Mentions
>
> **Falcon MAMBA 7B**: This SSLM (State Space Language Model) outperforms traditional transformer models like Meta’s Llama 3.1 8B and Mistral’s 7B, and it has been independently verified by Hugging Face as the top performing open-source model globally.
>
> **Cartesia**: Another exciting platform in the SSM space, Cartesia, is making significant strides by integrating SSM technology into customer products.
# 5) References
## 1. Papers
1. [Efficiently Modeling Long Sequences with Structured State Spaces, Gu, A., Goel, K., & Ré, C.](https://arxiv.org/abs/2111.00396)<a id="paper-1"></a>
2. [Mamba: Linear-Time Sequence Modeling with Selective State Spaces, Gu, A., & Dao, T.](https://arxiv.org/abs/2312.00752)<a id="paper-2"></a>
3. [HiPPO: Recurrent Memory with Optimal Polynomial Projections, Gu, A., Dao, T., Ermon, S., Rudra, A., & Re, C.](https://arxiv.org/abs/2008.07669)<a id="paper-3"></a>
4. [On the Parameterization and Initialization of Diagonal State Space Models, Gu, A., Gupta, A., Goel, K., & Ré, C.](https://arxiv.org/abs/2206.11893)<a id="paper-4"></a>
5. [Diagonal State Spaces are as Effective as Structured State Spaces, Gupta, A., Gu, A., & Berant, J.](https://arxiv.org/abs/2203.14343)<a id="paper-5"></a>
## 2. Articles
6. [Introduction to State Space Models (SSM), Loïck BOURDOIS](https://huggingface.co/blog/lbourdois/get-on-the-ssm-train)<a id="article-6"></a>
7. [The Annotated S4, Albert Gu, Karan Goel, and Christopher Ré](https://srush.github.io/annotated-s4/)<a id="article-7"></a>
8. [How transformers, RNNs and SSMs are more alike than you think, Stanislav Fedotov](https://medium.com/nebius/how-transformers-rnns-and-ssms-are-more-alike-than-you-think-cd0f899893d8)<a id="article-8"></a>
9. [Mamba: SSM, Theory, and Implementation in Keras and TensorFlow, Vedant Jumle](https://towardsdatascience.com/mamba-ssm-theory-and-implementation-in-keras-and-tensorflow-32d6d4b32546/)<a id="article-9"></a>
10. [A Visual Guide to Mamba and State Space Models, Maarten Grootendorst](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state)<a id="article-10"></a>
11. [HiPPO: Recurrent Memory with Optimal Polynomial Projections, Albert Gu*, Tri Dao*, Stefano Ermon, Atri Rudra, and Chris Ré](https://hazyresearch.stanford.edu/blog/2020-12-05-hippo)<a id="article-11"></a>
12. [Understanding State Space Models, George Novack](https://tinkerd.net/blog/machine-learning/state-space-models/)<a id="article-12"></a>
13. [Towards Mamba State Space Models for Images, Videos and Time Series, Sascha Kirch](https://towardsdatascience.com/towards-mamba-state-space-models-for-images-videos-and-time-series-1e0bfdb5933a/) <a id="article-13"></a>
14. [Structured State Space Models Visually Explained, Sascha Kirch](https://towardsdatascience.com/structured-state-space-models-visually-explained-86cfe2757386/)<a id="article-14"></a>
15. [Here Comes Mamba: The Selective State Space Model, Sascha Kirch](https://towardsdatascience.com/here-comes-mamba-the-selective-state-space-model-435e5d17a451/)<a id="article-15"></a>
16. [HiPPOs 🦛, Mambas 🐍, and Other Creatures, Oleguer Canal](https://bocachancla.blog/posts/mamba/)<a id="article-16"></a>
17. [History of State Space Models (SSM) in 2022, Loïck BOURDOIS](https://huggingface.co/blog/lbourdois/ssm-2022)<a id="article-17"></a>
18. [State Space Representations of Linear Physical Systems, Erik Cheever](https://lpsa.swarthmore.edu/Representations/SysRepSS.html)<a id="article-18"></a>
19. [Structured State Spaces: Combining Continuous-Time, Recurrent, and Convolutional Models, Albert Gu et al.](https://hazyresearch.stanford.edu/blog/2022-01-14-s4-3)<a id="article-19"></a>
## 3. Videos
20. [Mamba - a replacement for Transformers?, Samuel Albanie](https://www.youtube.com/watch?v=ouF-H35atOY) <a id="video-20"></a>
21. [MAMBA from Scratch: Neural Nets Better and Faster than Transformers, Algorithmic Simplicity](https://www.youtube.com/watch?v=N6Piou4oYx8)<a id="video-21"></a>
22. [State Space Models (SSMs) and Mamba, Serrano.Academy](https://www.youtube.com/watch?v=g1AqUhP00Do) <a id="video-22"></a>
23. [Mamba and S4 Explained: Architecture, Parallel Scan, Kernel Fusion, Recurrent, Convolution, Math, Umar Jamil](https://www.youtube.com/watch?v=8Q_tqwpTpVU&t)<a id="video-23"></a>
24. [Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained), Yannic Kilcher](https://www.youtube.com/watch?v=9dSkvxS2EB0&t)<a id="video-24"></a>
25. [Intuition behind Mamba and State Space Models | Enhancing LLMs!, Maarten Grootendorst](https://www.youtube.com/watch?v=BDTVVlUU1Ck&t)<a id="video-25"></a>