# Efficient Transformer Survey
Self-Attention:
$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$
where $Q\in\mathbb{R}^{n\times d}$, $K\in\mathbb{R}^{n\times d}$, and $V\in\mathbb{R}^{n\times d}$ are the queries, keys, and values. Many of the methods in this survey try to reduce the cost of $\text{softmax}(QK^T)$, the root cause of attention's quadratic complexity, in particular by reducing the number of rows in $K$ and $V$.
To do that, some of them reduce the number of tokens in $K$ through [fixed patterns](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Fixed--Factorized-Patterns) (e.g., attending only to nearby tokens) or [learnable patterns](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Sparse-Attention-Based-on-Learnable-Patterns) (e.g., finding a few representative tokens through clustering).
In another direction, [memory is used to obtain a smaller $K$](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Having-Memories) (e.g., by compressing the past through an additional neural network, or by adding trainable tokens known as inducing points or global tokens).
Lastly, some of them [approximate the matrix $QK^T \in \mathbb{R}^{n \times n}$ with a low-rank matrix](https://hackmd.io/vdYVt7SbQIKaK9XIP3Nujg?both#Low-Rank--Kernels) (e.g., by projecting $K$ to a smaller matrix in $\mathbb{R}^{k \times d}$).
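As a baseline for the variants below, here is a minimal NumPy sketch of vanilla scaled dot-product attention; the shapes and random inputs are illustrative only. It materializes the full $n \times n$ score matrix, which is exactly the cost the methods in this survey try to avoid.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores is n x n -- the source of the quadratic time/memory cost
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)   # materializes a 1024 x 1024 score matrix
```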

## Fixed / Factorized Patterns
#### Sparse Transformer (Child et al., 2019)
- Motivation: attention patterns observed in a deep (128-layer) image model
    - Early layers learn locally connected patterns
    - Middle layers (19-20) split attention into row and column attention
    - Several layers show global, data-dependent access patterns
    - Later layers (64-128) exhibit high sparsity

- Separates a full attention operation into several faster attention steps (factorized attention)
- Global connectivity is recovered after a few such steps
- More efficient while still providing global context to any given position
- Strided attention works well for data that naturally aligns with the stride (e.g., images or music), but not for text
- Fixed attention patterns work better for text, allowing all future cells to attend to certain cells or groups of cells (e.g., important words or phrases); a rough sketch of the two masks follows this list
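A rough sketch of the two factorized (causal) masks, assuming a hypothetical `stride` parameter equal to the block/stride length; the paper implements these with custom GPU kernels rather than dense masks.

```python
import numpy as np

def strided_masks(n, stride):
    """Causal 'local' and 'strided' masks from the factorized attention idea."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)            # attend to the previous `stride` tokens
    strided = causal & ((i - j) % stride == 0)   # attend to every `stride`-th token
    return local, strided

local, strided = strided_masks(n=16, stride=4)
# a position can reach any earlier position within two attention steps:
# a local hop inside a block, then a strided hop across blocks
```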

#### Blockwise Transformer (Qiu et al., 2019)
- Uses sparse block matrices, which are easier to implement without custom GPU kernels
- Uses different permutations for different heads

#### Axial Transformer (Ho et al., 2019)
- Applies multiple attentions, each along a single axis of the input tensor
- Straightforward to implement and does not require a custom kernel for an efficient implementation
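A minimal sketch of axial attention over a 2-D input, assuming an $(H, W, d)$ tensor and ignoring projections, masking, and residual connections; summing the row and column passes here is only for illustration (the model interleaves them across layers).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    # self-attention within the second-to-last axis: (..., L, d) -> (..., L, d)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial_attention(x):
    """x: (H, W, d). Attend along each row, then along each column."""
    rows = attend(x)                                          # W x W scores per row
    cols = np.swapaxes(attend(np.swapaxes(x, 0, 1)), 0, 1)    # H x H scores per column
    return rows + cols                                        # summed here only for illustration

H, W, d = 32, 32, 16
out = axial_attention(np.random.default_rng(0).standard_normal((H, W, d)))
```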

#### Image Transformer (Parmar et al., 2018)
- Restricts self attention to only local neighborhoods
- Partitions the image into query blocks and memory blocks
- For all queries from a single query block, the model attends to the same memory block
- Two schemes:
    - 1-Dimensional local attention
        - Image is flattened into raster-scan order
        - Partitioned into non-overlapping query blocks Q
        - For each query block, a fixed number of prior pixels are used as the memory block
    - 2-Dimensional local attention
        - Image is partitioned into non-overlapping rectangular query blocks
        - The memory block extends the query block (to the top, left, and right)
- Downside: loses the global receptive field
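A sketch of the 1-dimensional local-attention mask, with hypothetical `q_block` and `mem_len` sizes: every query in a block shares the same memory block of prior pixels.

```python
import numpy as np

def local_block_mask(n, q_block, mem_len):
    """1-D local attention (sketch): queries in a block share the same memory block."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    block_start = (i // q_block) * q_block          # first pixel of each query block
    # memory = the query block itself plus `mem_len` pixels before it (causal)
    return (j <= i) & (j >= block_start - mem_len)

mask = local_block_mask(n=64, q_block=8, mem_len=16)
```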

## Sparse Attention Based on Learnable Patterns
#### Reformer (Kitaev et al., 2020)
- Reduces complexity by focusing on the keys that are closest to each query, found via locality-sensitive hashing (**random projection** into a small number of hash buckets, so only vectors in the same bucket are compared).
- $O(n^2) \rightarrow O(n\log n)$
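A sketch of the bucketing step, using the angular LSH scheme from the paper (project onto random directions and take the argmax over $[xR; -xR]$); chunking, multi-round hashing, and the shared-QK attention itself are omitted.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Random-projection LSH (sketch): nearby vectors tend to land in the same bucket."""
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    h = np.concatenate([x @ R, -(x @ R)], axis=-1)
    return h.argmax(axis=-1)

rng = np.random.default_rng(0)
qk = rng.standard_normal((1024, 64))          # Reformer ties queries and keys
buckets = lsh_buckets(qk, n_buckets=32, rng=rng)
# attention is then restricted to (sorted, chunked) positions in the same bucket,
# so each query compares against roughly n / n_buckets keys instead of all n
```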
#### Routing Transformer (Roy et al., 2020)
- **Clustering-based** attention combining local attention (attending to nearby tokens) and sparse attention (attending based on content-based sparsity).
    - Trains $k$ centroids (online k-means) on data from the batch
    - Assigns queries and keys to the centroids and selects the top-$w$ tokens closest to each centroid
    - Attention is then computed only among the selected queries and keys within each cluster
- $O(n^2) \rightarrow O(n^{1.5})$
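A very rough sketch of the routing step, assuming already-trained centroids and plain dot-product similarity; the paper uses online (mini-batch) k-means on unit-normalized vectors and selects the top-$w$ members per centroid by distance.

```python
import numpy as np

def route(Q, K, centroids, w):
    """Sketch of routing attention: group queries/keys by nearest centroid."""
    q_clusters = np.argmax(Q @ centroids.T, axis=-1)   # assign each query to a centroid
    k_clusters = np.argmax(K @ centroids.T, axis=-1)   # (dot product as a stand-in metric)
    members = {}
    for c in range(len(centroids)):
        q_idx = np.where(q_clusters == c)[0][:w]       # in the paper: top-w by distance
        k_idx = np.where(k_clusters == c)[0][:w]
        members[c] = (q_idx, k_idx)                    # attention runs within each cluster
    return members

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((1024, 64)), rng.standard_normal((1024, 64))
centroids = rng.standard_normal((32, 64))              # trained with online k-means in the paper
groups = route(Q, K, centroids, w=64)
```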

#### Sinkhorn Transformer (Tay et al., 2020b)
- Sparse attention based on differentiable sorting: a sorting network (SortNet) scores blocks of the sequence, and Sinkhorn normalization turns the scores into a (soft) permutation.
- Tokens then attend within blocks of the re-sorted sequence produced by the SortNet.
- $O(n^2) \rightarrow O(B^2+N_B^2)$ where $N_B$ is the number of blocks and $B = N/N_B$ is the block size ($N \gg N_B$; e.g., $N=1024$, $N_B=64$).
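A small sketch of the Sinkhorn normalization that turns block scores into an (approximately) doubly-stochastic soft permutation; the block scores here are random stand-ins for SortNet outputs.

```python
import numpy as np

def sinkhorn(scores, n_iters=8):
    """Iterative row/column normalization in log space -> a soft permutation matrix."""
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=-1, keepdims=True)  # normalize rows
        log_p = log_p - np.logaddexp.reduce(log_p, axis=-2, keepdims=True)  # normalize columns
    return np.exp(log_p)

rng = np.random.default_rng(0)
block_scores = rng.standard_normal((8, 8))   # stand-in SortNet scores between 8 blocks
P = sinkhorn(block_scores)                   # rows/columns each sum to ~1; softly sorts the blocks
```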
## Having Memories
#### Transformer-XL (Dai et al., 2019)
- By attending over a memory of cached layer outputs from previous segments, it extends the effective attention span.
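A toy sketch of segment-level recurrence, ignoring Q/K/V projections, causal masking, and relative positional encodings: the current segment simply attends over the concatenation of cached states and itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(h, mem):
    """Current segment attends over [cached memory ; current segment]."""
    kv = np.concatenate([mem, h], axis=0)       # in the paper, no gradients flow into `mem`
    scores = h @ kv.T / np.sqrt(h.shape[-1])
    return softmax(scores) @ kv

rng = np.random.default_rng(0)
h = rng.standard_normal((128, 64))              # hidden states of the current segment
mem = rng.standard_normal((128, 64))            # cached hidden states from previous segments
out = attend_with_memory(h, mem)
# after the forward pass, (a slice of) the current hidden states becomes the memory
# for the next segment, extending the effective context across segments
```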

#### Compressive Transformer (Rae et al., 2019)
- Compresses old memories with a compression function to extend the memory span further.
- Also uses an auxiliary loss to reconstruct old memories from the compressed memories.

#### Set Transformer (Lee et al., 2019)
- Introduces inducing points: trainable memory tokens that let a large set be covered through attention to a much smaller one.
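A sketch of induced set attention (ISAB-style), with randomly initialized inducing points standing in for the trainable ones and the usual projections/feed-forward blocks omitted; the point is that no $n \times n$ score matrix is ever formed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def induced_attention(X, I):
    """Only m x n and n x m score matrices are built, never n x n."""
    H = attend(I, X, X)      # m inducing points summarize the set      (m x n scores)
    return attend(X, H, H)   # the set attends back to the m summaries  (n x m scores)

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 64))   # the input set
I = rng.standard_normal((16, 64))     # stand-in for trainable inducing points
out = induced_attention(X, I)
```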
#### Memory Compressed (Liu et al., 2018)
- Uses memory-compressed attention, where keys and values are compressed with an additional strided convolution.

#### Longformer (Beltagy et al., 2020)
- Handles long sequences through sliding-window (local) attention combined with task-specific global attention.

#### ETC: Encoding Long and Structured Inputs in Transformers (Ainslie et al., 2020)
- Represents hierarchy through global-local attention (cf. HIBERT (Zhang et al., 2019)) with two types of input: a global input and a long (local) input.
- Requires a pre-trained model (e.g., BERT) for initialization.
- Adapts CPC (Contrastive Predictive Coding) for pre-training the ETC model.

#### Big Bird (Zaheer et al., 2020)
- Combines 3 different types of attention (sketched after this list):
    - Global Attention
        - Similar to Longformer/ETC: the global tokens are either specific existing inputs that can attend to all other inputs (and vice versa),
        - Or extra inputs (like CLS tokens)
    - Local Attention
        - Similar to the fixed-pattern approaches: each query attends to $w/2$ tokens to the left and $w/2$ tokens to the right
    - Random Attention
        - Each query attends to a certain number of random keys
- Because of the global tokens, it cannot be used for autoregressive decoding.
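A sketch of the combined Big Bird mask, with hypothetical sizes for the number of global tokens, the window width, and the random keys per query.

```python
import numpy as np

def bigbird_mask(n, n_global, window, n_random, rng):
    """Sketch of the combined mask: global + sliding-window + random attention."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2          # local: w/2 tokens on each side
    mask[:n_global, :] = True                    # global tokens attend everywhere ...
    mask[:, :n_global] = True                    # ... and everything attends to them
    for row in range(n):                         # random: a few extra keys per query
        mask[row, rng.choice(n, size=n_random, replace=False)] = True
    return mask

mask = bigbird_mask(n=512, n_global=2, window=6, n_random=3,
                    rng=np.random.default_rng(0))
```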

## Low Rank / Kernels
#### Linformer (Wang et al., 2020)

- The $N \times N$ (bi-directional) self-attention matrix is approximately low-rank.
- Projects the $N \times d$ keys and values down to $k \times d$, computes the attention matrix from the projected keys and values, and thereby reduces time and space complexity from $O(n^2)$ to $O(n)$ (for fixed $k$).
- The projection along the length dimension mixes tokens at different temporal positions; how this interacts with a causal mask is not explained.
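A minimal sketch of the Linformer-style projection, using random matrices `E` and `F` in place of the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project K and V along the length dimension: scores are n x k, not n x n."""
    K_proj, V_proj = E @ K, F @ V                    # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k)
    return softmax(scores) @ V_proj                  # (n, d)

n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))   # learned in the paper
out = linformer_attention(Q, K, V, E, F)
```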
#### Linear Transformer (Katharopoulos et al., 2020)
- Replaces the softmax with a kernel feature map $\phi(\cdot)$ and exploits associativity of matrix products:

$$ V'_p = \frac{\phi(Q_p)^\top S_p}{\phi(Q_p)^\top Z_p}, \quad S_p = \sum_{j \le p} \phi(K_j) V_j^\top, \quad Z_p = \sum_{j \le p} \phi(K_j) $$

- If $S_p$ and $Z_p$ are kept as running sums once computed, and $V'_p$ is then computed for each query, time and space complexity is $O(n)$.
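A sketch of the causal case with running sums, using the paper's feature map $\phi(x) = \mathrm{elu}(x) + 1$; the Python loop is for clarity only (a real implementation vectorizes or fuses it).

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))       # phi(x) = elu(x) + 1

def causal_linear_attention(Q, K, V):
    """Causal kernel attention with running sums S_p and Z_p (O(n) time and memory)."""
    n, d = Q.shape
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)
    S = np.zeros((d, V.shape[-1]))                   # S_p = sum_{j<=p} phi(K_j) V_j^T
    Z = np.zeros(d)                                  # Z_p = sum_{j<=p} phi(K_j)
    out = np.empty_like(V)
    for p in range(n):
        S += np.outer(phi_K[p], V[p])
        Z += phi_K[p]
        out[p] = (phi_Q[p] @ S) / (phi_Q[p] @ Z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
out = causal_linear_attention(Q, K, V)
```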
#### Performer (Choromanski et al., 2020)

- Starting from generalized kernelizable attention, uses positive Orthogonal Random Features (ORFs), $\phi: \mathbb{R}^d \to \mathbb{R}_+^r$, to approximate the softmax kernel, thus avoiding explicit computation of the attention matrix.
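A sketch of the random-feature approximation, using i.i.d. Gaussian features instead of the orthogonal ones the paper constructs; `r` is a hypothetical number of features.

```python
import numpy as np

def positive_features(X, W):
    """Positive random features so that phi(q) . phi(k) ~ exp(q . k) (softmax kernel)."""
    r = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(r)

def performer_attention(Q, K, V, W):
    phi_Q, phi_K = positive_features(Q, W), positive_features(K, W)   # (n, r) each
    KV = phi_K.T @ V                                  # (r, d_v): no n x n matrix anywhere
    Z = phi_Q @ phi_K.sum(axis=0)                     # (n,): normalizer
    return (phi_Q @ KV) / Z[:, None]

n, d, r = 1024, 64, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) / d**0.25 for _ in range(3))   # folds in the 1/sqrt(d) scaling
W = rng.standard_normal((r, d))                       # i.i.d. here; the paper orthogonalizes (ORFs)
out = performer_attention(Q, K, V, W)
```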
#### Synthesizer (Tay et al., 2020a)

- Approximates the attention matrix with neural nets (each token $x_i$ is projected to a vector of length $N$ using a two-layer non-linear feed-forward network), instead of computing it from queries, keys, and values.
- Space complexity is still $O(n^2)$.
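A sketch of the dense Synthesizer variant, with randomly initialized weights in place of trained ones; note that the synthesized score matrix is tied to a fixed maximum length $N$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, V_proj):
    """Each token predicts its own row of the N x N attention matrix via a 2-layer FFN."""
    scores = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2     # (N, N): no Q K^T anywhere
    return softmax(scores) @ (X @ V_proj)

N, d = 256, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))
W1, b1 = rng.standard_normal((d, d)) / np.sqrt(d), np.zeros(d)
W2, b2 = rng.standard_normal((d, N)) / np.sqrt(d), np.zeros(N)
V_proj = rng.standard_normal((d, d)) / np.sqrt(d)
out = dense_synthesizer(X, W1, b1, W2, b2, V_proj)
```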