# Efficient Transformer Survey
Self-Attention:
$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$
where $Q\in\mathbb{R}^{n\times d}$, $K\in\mathbb{R}^{n\times d}$, and $V\in\mathbb{R}^{n\times d}$ are the queries, keys, and values. Many of the methods in this survey try to reduce the cost of $\text{softmax}(QK^T)$, the root cause of attention's quadratic complexity, in particular by reducing the number of rows in $K$ and $V$.
To do that, some of them reduce the number of tokens in $K$ through [fixed patterns](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Fixed--Factorized-Patterns) (e.g., attending only to nearby tokens) or [learnable patterns](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Sparse-Attention-Based-on-Learnable-Patterns) (e.g., finding a few representative tokens through clustering).
In another direction, [memory is used to obtain a smaller $K$](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Having-Memories) (e.g., by compressing the past through an additional neural network, or by adding trainable tokens known as inducing points or global tokens).
Lastly, some of them [approximate the matrix $QK^T \in \mathbb{R}^{n \times n}$ with a low-rank matrix](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Low-Rank--Kernels) (e.g., by projecting $K$ to a smaller matrix in $\mathbb{R}^{k \times d}$).
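As a baseline for the variants below, here is a minimal NumPy sketch of vanilla scaled dot-product attention; the shapes and random inputs are illustrative only. It materializes the full $n \times n$ score matrix, which is exactly the cost the methods in this survey try to avoid.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores is n x n -- the source of the quadratic time/memory cost
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)   # materializes a 1024 x 1024 score matrix
```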

## Fixed / Factorized Patterns
#### Sparse Transformer (Child et al., 2019)
- Motivation: attention patterns observed in a deep (128-layer) image model
    - Early layers learn locally connected patterns
    - Middle layers (19-20) split attention into row and column attention
    - Several layers show global, data-dependent access patterns
    - Later layers (64-128) exhibit high sparsity

- Separates a full attention operation into several faster attention steps (factorized attention)
- Global connectivity is recovered after a few such steps
- More efficient while still providing global context to any given position
- Strided attention works well for data that naturally aligns with the stride (e.g., images or music), but not for text
- Fixed attention patterns work better for text, allowing all future cells to attend to certain cells or groups of cells (e.g., important words or phrases); a rough sketch of the two masks follows this list
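A rough sketch of the two factorized (causal) masks, assuming a hypothetical `stride` parameter equal to the block/stride length; the paper implements these with custom GPU kernels rather than dense masks.

```python
import numpy as np

def strided_masks(n, stride):
    """Causal 'local' and 'strided' masks from the factorized attention idea."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)            # attend to the previous `stride` tokens
    strided = causal & ((i - j) % stride == 0)   # attend to every `stride`-th token
    return local, strided

local, strided = strided_masks(n=16, stride=4)
# a position can reach any earlier position within two attention steps:
# a local hop inside a block, then a strided hop across blocks
```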

#### Blockwise Transformer (Qiu et al., 2019)
- Uses sparse block matrices, which are easier to implement without custom GPU kernels
- Uses different permutations for different heads

#### Axial Transformer (Ho et al., 2019)
- Applies multiple attentions, each along a single axis of the input tensor
- Straightforward to implement and does not require a custom kernel for an efficient implementation
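A minimal sketch of axial attention over a 2-D input, assuming an $(H, W, d)$ tensor and ignoring projections, masking, and residual connections; summing the row and column passes here is only for illustration (the model interleaves them across layers).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    # self-attention within the second-to-last axis: (..., L, d) -> (..., L, d)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial_attention(x):
    """x: (H, W, d). Attend along each row, then along each column."""
    rows = attend(x)                                          # W x W scores per row
    cols = np.swapaxes(attend(np.swapaxes(x, 0, 1)), 0, 1)    # H x H scores per column
    return rows + cols                                        # summed here only for illustration

H, W, d = 32, 32, 16
out = axial_attention(np.random.default_rng(0).standard_normal((H, W, d)))
```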

#### Image Transformer (Parmar et al., 2018)
- Restricts self attention to only local neighborhoods
- Partitions the image into query blocks and memory blocks
- For all queries from a single query block, the model attends to the same memory block
- Two schemes:
    - 1-Dimensional local attention
        - Image is flattened into raster-scan order
        - Partitioned into non-overlapping query blocks Q
        - For each query block, a fixed number of prior pixels are used as the memory block
    - 2-Dimensional local attention
        - Image is partitioned into non-overlapping rectangular query blocks
        - The memory block extends the query block (to the top, left, and right)
- Downside: loses the global receptive field
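A sketch of the 1-dimensional local-attention mask, with hypothetical `q_block` and `mem_len` sizes: every query in a block shares the same memory block of prior pixels.

```python
import numpy as np

def local_block_mask(n, q_block, mem_len):
    """1-D local attention (sketch): queries in a block share the same memory block."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    block_start = (i // q_block) * q_block          # first pixel of each query block
    # memory = the query block itself plus `mem_len` pixels before it (causal)
    return (j <= i) & (j >= block_start - mem_len)

mask = local_block_mask(n=64, q_block=8, mem_len=16)
```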

## Sparse Attention Based on Learnable Patterns
#### Reformer (Kitaev et al., 2020)
- Reduces complexity by focusing on the keys that are closest to each query, found via locality-sensitive hashing (**random projection** into a small number of hash buckets, so only vectors in the same bucket are compared).
- $O(n^2) \rightarrow O(n\log n)$
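A sketch of the bucketing step, using the angular LSH scheme from the paper (project onto random directions and take the argmax over $[xR; -xR]$); chunking, multi-round hashing, and the shared-QK attention itself are omitted.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Random-projection LSH (sketch): nearby vectors tend to land in the same bucket."""
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    h = np.concatenate([x @ R, -(x @ R)], axis=-1)
    return h.argmax(axis=-1)

rng = np.random.default_rng(0)
qk = rng.standard_normal((1024, 64))          # Reformer ties queries and keys
buckets = lsh_buckets(qk, n_buckets=32, rng=rng)
# attention is then restricted to (sorted, chunked) positions in the same bucket,
# so each query compares against roughly n / n_buckets keys instead of all n
```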
#### Routing Transformer (Roy et al., 2020)
- **Clustering-based** attention combining local attention (attending to nearby tokens) and sparse attention (attending based on content-based sparsity).
    - Trains $k$ centroids (online k-means) on data from the batch
    - Assigns queries and keys to the centroids and selects the top-$w$ tokens closest to each centroid
    - Attention is then computed only among the selected queries and keys within each cluster
- $O(n^2) \rightarrow O(n^{1.5})$
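A very rough sketch of the routing step, assuming already-trained centroids and plain dot-product similarity; the paper uses online (mini-batch) k-means on unit-normalized vectors and selects the top-$w$ members per centroid by distance.

```python
import numpy as np

def route(Q, K, centroids, w):
    """Sketch of routing attention: group queries/keys by nearest centroid."""
    q_clusters = np.argmax(Q @ centroids.T, axis=-1)   # assign each query to a centroid
    k_clusters = np.argmax(K @ centroids.T, axis=-1)   # (dot product as a stand-in metric)
    members = {}
    for c in range(len(centroids)):
        q_idx = np.where(q_clusters == c)[0][:w]       # in the paper: top-w by distance
        k_idx = np.where(k_clusters == c)[0][:w]
        members[c] = (q_idx, k_idx)                    # attention runs within each cluster
    return members

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((1024, 64)), rng.standard_normal((1024, 64))
centroids = rng.standard_normal((32, 64))              # trained with online k-means in the paper
groups = route(Q, K, centroids, w=64)
```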

#### Sinkhorn Transformer (Tay et al., 2020b)
- Sparse attention based on differentiable sorting: a sorting network (SortNet) scores blocks of the sequence, and Sinkhorn normalization turns the scores into a (soft) permutation.
- Tokens then attend within blocks of the re-sorted sequence produced by the SortNet.
- $O(n^2) \rightarrow O(B^2+N_B^2)$ where $N_B$ is the number of blocks and $B = N/N_B$ is the block size ($N \gg N_B$; e.g., $N=1024$, $N_B=64$).
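A small sketch of the Sinkhorn normalization that turns block scores into an (approximately) doubly-stochastic soft permutation; the block scores here are random stand-ins for SortNet outputs.

```python
import numpy as np

def sinkhorn(scores, n_iters=8):
    """Iterative row/column normalization in log space -> a soft permutation matrix."""
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=-1, keepdims=True)  # normalize rows
        log_p = log_p - np.logaddexp.reduce(log_p, axis=-2, keepdims=True)  # normalize columns
    return np.exp(log_p)

rng = np.random.default_rng(0)
block_scores = rng.standard_normal((8, 8))   # stand-in SortNet scores between 8 blocks
P = sinkhorn(block_scores)                   # rows/columns each sum to ~1; softly sorts the blocks
```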
## Having Memories
#### Transformer-XL (Dai et al., 2019)
- By attending over a memory of cached layer outputs from previous segments, it extends the effective attention span.
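A toy sketch of segment-level recurrence, ignoring Q/K/V projections, causal masking, and relative positional encodings: the current segment simply attends over the concatenation of cached states and itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h, mem):
    """Current segment attends over [cached memory ; current segment]."""
    kv = np.concatenate([mem, h], axis=0)       # in the paper, no gradients flow into `mem`
    scores = h @ kv.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ kv

rng = np.random.default_rng(0)
h = rng.standard_normal((128, 64))              # hidden states of the current segment
mem = rng.standard_normal((128, 64))            # cached hidden states from previous segments
out = attend_with_memory(h, mem)
# after the forward pass, (a slice of) the current hidden states becomes the memory
# for the next segment, extending the effective context across segments
```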

#### Compressive Transformer (Rae et al., 2019)
- Compresses old memories with a compression function to extend the memory span further.
- Also uses an auxiliary loss to reconstruct old memories from the compressed memories.

#### Set Transformer (Lee et al., 2019)
- Introduces inducing points: trainable memory tokens that let a large set be covered through attention to a much smaller one.
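A sketch of induced set attention (ISAB-style), with randomly initialized inducing points standing in for the trainable ones and the usual projections/feed-forward blocks omitted; the point is that no $n \times n$ score matrix is ever formed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def induced_attention(X, I):
    """Only m x n and n x m score matrices are built, never n x n."""
    H = attend(I, X, X)      # m inducing points summarize the set      (m x n scores)
    return attend(X, H, H)   # the set attends back to the m summaries  (n x m scores)

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 64))   # the input set
I = rng.standard_normal((16, 64))     # stand-in for trainable inducing points
out = induced_attention(X, I)
```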
#### Memory Compressed (Liu et al., 2018)
- Uses memory-compressed attention, where keys and values are compressed with an additional strided convolution.

#### Longformer (Beltagy et al., 2020)
- Handles long sequences through sliding-window (local) attention combined with task-specific global attention.

#### ETC: Encoding Long and Structured Inputs in Transformers (Ainslie et al., 2020)
- Represents hierarchy through global-local attention (cf. HIBERT (Zhang et al., 2019)) with two types of input: a global input and a long (local) input.
- Requires a pre-trained model (e.g., BERT) for initialization.
- Adapts CPC (Contrastive Predictive Coding) for pre-training the ETC model.

#### Big Bird (Zaheer et al., 2020)
- Combines 3 different types of attention (sketched after this list):
    - Global Attention
        - Similar to Longformer/ETC: the global tokens are either specific existing inputs that can attend to all other inputs (and vice versa),
        - Or extra inputs (like CLS tokens)
    - Local Attention
        - Similar to the fixed-pattern approaches: each query attends to $w/2$ tokens to the left and $w/2$ tokens to the right
    - Random Attention
        - Each query attends to a certain number of random keys
- Because of the global tokens, it cannot be used for autoregressive decoding.
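A sketch of the combined Big Bird mask, with hypothetical sizes for the number of global tokens, the window width, and the random keys per query.

```python
import numpy as np

def bigbird_mask(n, n_global, window, n_random, rng):
    """Sketch of the combined mask: global + sliding-window + random attention."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2          # local: w/2 tokens on each side
    mask[:n_global, :] = True                    # global tokens attend everywhere ...
    mask[:, :n_global] = True                    # ... and everything attends to them
    for row in range(n):                         # random: a few extra keys per query
        mask[row, rng.choice(n, size=n_random, replace=False)] = True
    return mask

mask = bigbird_mask(n=512, n_global=2, window=6, n_random=3,
                    rng=np.random.default_rng(0))
```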

## Low Rank / Kernels
#### Linformer (Wang et al., 2020)

- The $N \times N$ (bi-directional) self-attention matrix is approximately low-rank.
- Projects the $N \times d$ keys and values down to $k \times d$, computes the attention matrix from the projected keys and values, and thereby reduces time and space complexity from $O(n^2)$ to $O(n)$ (for fixed $k$).
- The projection along the length dimension mixes tokens at different temporal positions; how this interacts with a causal mask is not explained.
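A minimal sketch of the Linformer-style projection, using random matrices `E` and `F` in place of the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project K and V along the length dimension: scores are n x k, not n x n."""
    K_proj, V_proj = E @ K, F @ V                    # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k)
    return softmax(scores) @ V_proj                  # (n, d)

n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))   # learned in the paper
out = linformer_attention(Q, K, V, E, F)
```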
#### Linear Transformer (Katharopoulos et al., 2020)
- Replaces the softmax with a kernel feature map $\phi(\cdot)$ and exploits associativity of matrix products:

$$ V'_p = \frac{\phi(Q_p)^\top S_p}{\phi(Q_p)^\top Z_p}, \quad S_p = \sum_{j \le p} \phi(K_j) V_j^\top, \quad Z_p = \sum_{j \le p} \phi(K_j) $$

- If $S_p$ and $Z_p$ are kept as running sums once computed, and $V'_p$ is then computed for each query, time and space complexity is $O(n)$.
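A sketch of the causal case with running sums, using the paper's feature map $\phi(x) = \mathrm{elu}(x) + 1$; the Python loop is for clarity only (a real implementation vectorizes or fuses it).

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))       # phi(x) = elu(x) + 1

def causal_linear_attention(Q, K, V):
    """Causal kernel attention with running sums S_p and Z_p (O(n) time and memory)."""
    n, d = Q.shape
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)
    S = np.zeros((d, V.shape[-1]))                   # S_p = sum_{j<=p} phi(K_j) V_j^T
    Z = np.zeros(d)                                  # Z_p = sum_{j<=p} phi(K_j)
    out = np.empty_like(V)
    for p in range(n):
        S += np.outer(phi_K[p], V[p])
        Z += phi_K[p]
        out[p] = (phi_Q[p] @ S) / (phi_Q[p] @ Z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
out = causal_linear_attention(Q, K, V)
```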
#### Performer (Choromanski et al., 2020)

- Starting from generalized kernelizable attention, uses positive Orthogonal Random Features (ORFs), $\phi: \mathbb{R}^d \to \mathbb{R}_+^r$, to approximate the softmax kernel, thus avoiding explicit computation of the attention matrix.
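A sketch of the random-feature approximation, using i.i.d. Gaussian features instead of the orthogonal ones the paper constructs; `r` is a hypothetical number of features.

```python
import numpy as np

def positive_features(X, W):
    """Positive random features so that phi(q) . phi(k) ~ exp(q . k) (softmax kernel)."""
    r = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(r)

def performer_attention(Q, K, V, W):
    phi_Q, phi_K = positive_features(Q, W), positive_features(K, W)   # (n, r) each
    KV = phi_K.T @ V                                  # (r, d_v): no n x n matrix anywhere
    Z = phi_Q @ phi_K.sum(axis=0)                     # (n,): normalizer
    return (phi_Q @ KV) / Z[:, None]

n, d, r = 1024, 64, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) / d**0.25 for _ in range(3))   # folds in the 1/sqrt(d) scaling
W = rng.standard_normal((r, d))                       # i.i.d. here; the paper orthogonalizes (ORFs)
out = performer_attention(Q, K, V, W)
```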
#### Synthesizer (Tay et al., 2020a)

- Approximates the attention matrix with neural nets (each token $x_i$ is projected to a vector of length $N$ using a two-layer non-linear feed-forward network), instead of computing it from queries, keys, and values.
- Space complexity is still $O(n^2)$.
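A sketch of the dense Synthesizer variant, with randomly initialized weights in place of trained ones; note that the synthesized score matrix is tied to a fixed maximum length $N$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, V_proj):
    """Each token predicts its own row of the N x N attention matrix via a 2-layer FFN."""
    scores = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2     # (N, N): no Q K^T anywhere
    return softmax(scores) @ (X @ V_proj)

N, d = 256, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))
W1, b1 = rng.standard_normal((d, d)) / np.sqrt(d), np.zeros(d)
W2, b2 = rng.standard_normal((d, N)) / np.sqrt(d), np.zeros(N)
V_proj = rng.standard_normal((d, d)) / np.sqrt(d)
out = dense_synthesizer(X, W1, b1, W2, b2, V_proj)
```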