# Transformers using Named Tensor Notation

written by [@marc_lelarge](https://twitter.com/marc_lelarge)

We are using the [Named Tensor Notation](https://arxiv.org/abs/2102.13196) from David Chiang, Alexander M. Rush and Boaz Barak to describe the basic (i.e. without positional encoding and autoregressive mask) encoder block of Transformers defined in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. For a presentation of the attention mechanism and transformers, see [Module 12 - Attention and Transformers](https://dataflowr.github.io/website/modules/12-attention/).

![](https://i.imgur.com/07LP8sL.png)

## Basics about Named Tensor Notation

This notation should feel natural as it implements the elementwise operations, broadcasting, reductions and contractions familiar from NumPy and PyTorch. Recall that functions from vectors to vectors lift to functions on tensors that operate along one axis but leave the tensor shape unchanged. For example, the softmax function defined by
\begin{align*}
\newcommand{\namedtensorstrut}{\vphantom{fg}}
\newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}}
\newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}}
\newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}}
\ndef{\ax}{ax}
\ndef{\bx}{bx}
\newcommand{\reals}{\mathbb{R}}
\ndef{\batch}{batch}
\ndef{\layer}{layer}
\ndef{\chans}{chans}
\ndef{\key}{key}
\ndef{\seq}{seq}
\ndef{\val}{val}
\ndef{\heads}{heads}
\ndef{\hidden}{hidden}
\ndef{\height}{height}
\ndef{\width}{width}
\newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}}
\newcommand{\ndot}[1]{\nbin{#1}{\odot}}
\nfun{\ax}{softmax} \colon \mathbb{R}^{\ax } &\rightarrow \mathbb{R}^{\ax } \\
\nfun{\ax}{softmax}(X) &= \frac{\exp(X)}{\sum_{\ax} \exp(X)}
\end{align*}
acts on any $X\in \reals^{\ax \times \bx}$, so that $\nfun{\ax}{softmax}(X)=Y\in \reals^{\ax \times \bx}$ satisfies $\sum_{\ax} Y = \mathbf{1}\in \reals^{\bx}$. The function $\nfun{\ax}{softmax}$ is only defined for tensors having an axis named $\ax$; this is the meaning of the line $\nfun{\ax}{softmax} \colon \mathbb{R}^{\ax } \rightarrow \mathbb{R}^{\ax }$. Note in particular that $\reals^\ax$ is NOT the domain of definition of the function; it specifies the minimal set of named axes required to apply the function.

Recall that elementwise multiplication is denoted by $\ndot{}$ and the dot product along axis $\ax$ is denoted by $\ndot{\ax}$. Here is one example taken from the original paper:
\begin{align*}
\newcommand{\nmatrix}[3]{#1\begin{array}[b]{@{}c@{}}#2\\\begin{bmatrix}#3\end{bmatrix}\end{array}}
A = \nmatrix{\height}{\width}{
3 & 1 & 4 \\
1 & 5 & 9 \\
2 & 6 & 5
} \in \reals^{\height \times \width},
\end{align*}
and
\begin{align*}
x &= \nmatrix{\height}{}{
2 \\
7 \\
1
} \in \reals^{\height}.
\end{align*}
Then, we have
\begin{align*}
A \ndot{} x = \nmatrix{\height}{\width}{
3\cdot 2 & 1\cdot 2 & 4\cdot 2 \\
1\cdot 7 & 5\cdot 7 & 9\cdot 7 \\
2\cdot 1 & 6\cdot 1 & 5\cdot 1
} \in \reals^{\height \times \width},
\end{align*}
and
\begin{align*}
A \ndot{\height} x = \sum_{\height} A \ndot{} x = \nmatrix{}{\width}{
6+7+2 & 2+35+6 & 8+49+5
} \in \reals^{\width},
\end{align*}
corresponding to the standard vector-matrix product.
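As a quick sanity check, here is a minimal sketch in plain PyTorch reproducing the example above. The named axes $\height$ and $\width$ are only tracked through comments, and the variable names are ours, chosen for illustration.

```python
import torch

# A in R^{height x width}, x in R^{height}
A = torch.tensor([[3., 1., 4.],
                  [1., 5., 9.],
                  [2., 6., 5.]])   # axes: (height, width)
x = torch.tensor([2., 7., 1.])     # axes: (height,)

# Elementwise product with broadcasting: A ⊙ x, still of shape (height, width);
# x is unsqueezed so that it broadcasts along the width axis.
elementwise = A * x[:, None]       # axes: (height, width)

# Contraction along height: A ⊙_height x = sum_height (A ⊙ x), in R^{width}.
contraction = elementwise.sum(dim=0)
print(contraction)                 # tensor([15., 43., 62.])

# softmax_ax acts along one axis and keeps the shape unchanged.
Y = torch.softmax(A, dim=0)        # softmax over the height axis
print(Y.sum(dim=0))                # approximately ones, shape (width,)
```

PyTorch also offers an experimental named-tensor API, but ordinary tensors with a fixed axis convention are enough for this note.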
## Linear layers

For $W\in \reals^{\ax \times \bx}$ and $b\in \reals^{\bx}$, we define a linear layer by:
\begin{align*}
\mathrm{LN_{W}}\colon \reals^\ax &\rightarrow \reals^{\bx}\\
\mathrm{LN_{W}}(X) &= X \ndot{\ax} W + b
\end{align*}
Note that if $X\in \reals^{\ax\times \seq}$, then $\mathrm{LN_{W}}(X)\in \reals^{\bx \times \seq}$.

## Feedforward neural networks

A feedforward neural network is defined for $W_1\in \reals^{\ax\times \hidden}$, $b_1\in \reals^{\hidden}$ and $W_2\in \reals^{\hidden\times \bx}$, $b_2\in \reals^{\bx}$ as:
\begin{align*}
\mathrm{FFN}: \reals^{\ax} & \rightarrow \reals^{\bx}\\
\mathrm{FFN}(X) &= \mathrm{ReLU}(X \ndot{\ax} W_1 + b_1) \ndot{\hidden} W_2+b_2
\end{align*}
Again, if $X\in \reals^{\ax\times \seq}$, then $\mathrm{FFN}(X)\in \reals^{\bx \times \seq}$.

## Attention and SelfAttention

\begin{align*}
\text{Attention} \colon \mathbb{R}^{\key} \times \mathbb{R}^{\seq \times\key} \times \mathbb{R}^{\seq \times\val} &\rightarrow \mathbb{R}^{\val} \\
\text{Attention}(Q,K,V) &= \left( \nfun{\seq}{softmax} \frac{Q \ndot{\key} K}{\sqrt{|\key|}} \right) \ndot{\seq} V.
\end{align*}
This definition takes a single query vector $Q$ and returns a single result vector (and could actually be reduced further to a scalar value, as the $\val$ axis is not strictly necessary). To apply it to a sequence of queries, we give $Q$ a $\seq'$ axis, and the function computes an output sequence. Providing $Q$, $K$, and $V$ with a $\heads$ axis lifts the function to compute multiple attention heads.

We can now define SelfAttention. Let $W_Q\in \reals^{\chans\times \key}$, $b_Q\in \reals^\key$, $W_K\in \reals^{\chans\times \key}$, $b_K\in \reals^\key$ and $W_V\in \reals^{\chans \times \val}$, $b_V\in \reals^\val$, with $|\val| = |\chans|$:
\begin{align*}
\text{SelfAttention} \colon \mathbb{R}^{\chans \times\seq} &\rightarrow \mathbb{R}^{\val\times \seq} \\
\text{SelfAttention}(X) &= \text{Attention}(\mathrm{LN_Q}(X), \mathrm{LN_K}(X), \mathrm{LN_V}(X)),
\end{align*}
where $\mathrm{LN_Q}$ (resp. $\mathrm{LN_K}$ and $\mathrm{LN_V}$) is the linear layer associated with $W_Q$ (resp. $W_K$ and $W_V$). (Strictly speaking, the $\seq$ axis of the query $\mathrm{LN_Q}(X)$ should first be renamed $\seq \to \seq'$ so that Attention lifts over it, and the $\seq'$ axis of the result renamed back to $\seq$; we keep the notation light here.) Note that the axes of the output are $\seq$ and $\val$; to add it to the input we need to rename $\val \to \chans$, which is possible because $|\val|=|\chans|$. So in the end we can compute
\begin{align*}
X + \mathrm{SelfAttention}(X)_{\val \to \chans} \in \reals^{\chans \times \seq}
\end{align*}

## Normalization Layers

We can define a single generic standardization function as:
\begin{align*}
\nfun{\ax}{standardize} \colon \mathbb{R}^{\ax } &\rightarrow \mathbb{R}^{\ax } \\
\nfun{\ax}{standardize}(X) &= \frac{X - \nfun{\ax}{mean}(X)}{\sqrt{\nfun{\ax}{var}(X) + \epsilon}}
\end{align*}
where $\epsilon > 0$ is a small constant for numerical stability. Note that
\begin{align*}
\nfun{\ax}{mean}(X) = \frac{1}{|\ax|}\sum_{\ax} X,
\end{align*}
so that if $X\in \reals^{\ax \times \bx}$, then $\nfun{\ax}{mean}(X) \in \reals^{\bx}$, and we are using broadcasting when we write $X - \nfun{\ax}{mean}(X)\in \reals^{\ax \times \bx}$ in the numerator above, and similarly for the denominator.
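Before turning to the concrete normalization layers, here is a minimal sketch of $\nfun{\ax}{standardize}$ in plain PyTorch, where the named axis is passed as an ordinary `dim` argument; the function name, the `eps` value and the shapes are ours, chosen for illustration.

```python
import torch

def standardize(X: torch.Tensor, dim: int, eps: float = 1e-5) -> torch.Tensor:
    """Sketch of standardize_ax: subtract the mean and divide by the standard
    deviation along one axis; the output has the same shape as X."""
    mean = X.mean(dim=dim, keepdim=True)                  # mean_ax(X), kept broadcastable
    var = ((X - mean) ** 2).mean(dim=dim, keepdim=True)   # var_ax(X)
    return (X - mean) / torch.sqrt(var + eps)

X = torch.randn(4, 5)        # axes: (ax, bx)
Z = standardize(X, dim=0)    # standardize along ax; shape unchanged
print(Z.shape)               # torch.Size([4, 5])
print(Z.mean(dim=0))         # approximately zero along ax, shape (bx,)
```

Here `keepdim=True` plays the role of the broadcasting used in the named notation when subtracting $\nfun{\ax}{mean}(X)$ from $X$.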
Then, we can define the three kinds of normalization layers, all with type $\reals^{\batch \times \chans \times \layer} \rightarrow \reals^{\batch \times \chans \times \layer}$:
\begin{align*}
\text{BatchNorm}(X; \gamma, \beta) &= \nfun{\batch,\layer}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans} \\
\text{InstanceNorm}(X; \gamma, \beta) &= \nfun{\layer}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans} \\
\text{LayerNorm}(X; \gamma, \beta) &= \nfun{\layer,\chans}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans\times \layer}
\end{align*}
Note that the shape of the output is always the same as the shape of the input for normalization layers.

For transformers, we will use a particular LayerNorm with $|\layer|=1$, so that we can simplify it as follows:
\begin{align*}
\text{LayerNorm}\colon \reals^{\chans} &\rightarrow \reals^{\chans}\\
\text{LayerNorm}(X) &= \nfun{\chans}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans}
\end{align*}

## A simple Transformer block

To simplify, we omit multiple heads here. We also present the pre-LN Transformer (see [On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745) by Xiong et al.), where LayerNorm is applied before the SelfAttention and the feedforward network.

![](https://i.imgur.com/ldpA9AW.png)

\begin{align*}
X &\in \reals^{\chans \times \seq}\\
X_1&=\text{LayerNorm}(X) \in \reals^{\chans\times \seq}\\
X_2&=X+\mathrm{SelfAttention}(X_1)_{\val \to \chans} \in \reals^{\chans \times \seq}\\
Y &=X_2 + \mathrm{FFN}(\text{LayerNorm}(X_2))\in \reals^{\chans\times \seq}
\end{align*}
Note that we take the simple version of LayerNorm with $|\layer|=1$. The values along the $\seq$ axis are mixed only in the SelfAttention layer.

## Summary

[![](https://i.imgur.com/kgtJWBn.png)](https://www.dataflowr.com)

\begin{align*}
\text{LayerNorm}\colon \reals^{\chans} &\rightarrow \reals^{\chans}\\
\text{LayerNorm}(X) &= \nfun{\chans}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans}
\end{align*}

\begin{align*}
\text{Attention} \colon \mathbb{R}^{\key} \times \mathbb{R}^{\seq \times\key} \times \mathbb{R}^{\seq \times\val} &\rightarrow \mathbb{R}^{\val} \\
\text{Attention}(Q,K,V) &= \left( \nfun{\seq}{softmax} \frac{Q \ndot{\key} K}{\sqrt{|\key|}} \right) \ndot{\seq} V.
\end{align*}

Let $W_Q, W_K\in \reals^{\chans\times \key}$, $b_Q,b_K\in \reals^\key$, and $W_V\in \reals^{\chans \times \val}$, $b_V\in \reals^\val$, with $|\val| = |\chans|$:
\begin{align*}
\text{SelfAttention} \colon \mathbb{R}^{\chans \times\seq} &\rightarrow \mathbb{R}^{\val\times \seq} \\
\text{SelfAttention}(X) &= \text{Attention}(X \ndot{\chans} W_Q + b_Q, X \ndot{\chans} W_K + b_K, X \ndot{\chans} W_V + b_V)
\end{align*}

\begin{align*}
\mathrm{FFN}: \reals^{\chans} & \rightarrow \reals^{\chans}\\
\mathrm{FFN}(X) &= \mathrm{ReLU}(X \ndot{\chans} W_1 + b_1) \ndot{\hidden} W_2+b_2
\end{align*}

\begin{align*}
X &\in \reals^{\chans \times \seq}\\
X_1&=\text{LayerNorm}(X) \in \reals^{\chans\times \seq}\\
X_2&=X+\mathrm{SelfAttention}(X_1)_{\val \to \chans} \in \reals^{\chans \times \seq}\\
Y &=X_2 + \mathrm{FFN}(\text{LayerNorm}(X_2))\in \reals^{\chans\times \seq}
\end{align*}

For a presentation of the attention mechanism and transformers, see [Module 12 - Attention and Transformers](https://dataflowr.github.io/website/modules/12-attention/).

###### tags: `public` `dataflowr` `transformers`
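To make the summary concrete, here is a minimal single-head, pre-LN encoder block written in plain PyTorch and following the equations above. It is a sketch, not a reference implementation: tensors are laid out as $(\seq, \chans)$, the class name and the widths are ours, `nn.Linear` plays the role of the linear layers $\mathrm{LN_Q}$, $\mathrm{LN_K}$, $\mathrm{LN_V}$, and choosing $W_V$ to map $\chans$ to $\chans$ enforces $|\val| = |\chans|$, so the renaming $\val \to \chans$ is implicit.

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Single-head, pre-LN encoder block following the equations above.
    Tensors are laid out as (seq, chans); axis names are tracked in comments."""

    def __init__(self, chans: int, key: int, hidden: int):
        super().__init__()
        self.W_Q = nn.Linear(chans, key)    # LN_Q : chans -> key
        self.W_K = nn.Linear(chans, key)    # LN_K : chans -> key
        self.W_V = nn.Linear(chans, chans)  # LN_V : chans -> val, with |val| = |chans|
        self.ffn = nn.Sequential(           # FFN : chans -> hidden -> chans
            nn.Linear(chans, hidden), nn.ReLU(), nn.Linear(hidden, chans))
        self.norm1 = nn.LayerNorm(chans)    # LayerNorm over the chans axis
        self.norm2 = nn.LayerNorm(chans)

    def self_attention(self, X: torch.Tensor) -> torch.Tensor:
        # X: (seq, chans); Q, K: (seq, key); V: (seq, chans)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        # scores = Q ⊙_key K / sqrt(|key|): rows index queries, columns index keys
        scores = Q @ K.transpose(0, 1) / math.sqrt(Q.shape[-1])
        weights = torch.softmax(scores, dim=-1)  # softmax over the seq axis of K
        return weights @ V                       # ⊙_seq V, shape (seq, chans)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (seq, chans)
        X1 = self.norm1(X)                    # X1 = LayerNorm(X)
        X2 = X + self.self_attention(X1)      # X2 = X + SelfAttention(X1)
        return X2 + self.ffn(self.norm2(X2))  # Y = X2 + FFN(LayerNorm(X2))

# Usage sketch (shapes only):
block = PreLNBlock(chans=64, key=32, hidden=256)
X = torch.randn(10, 64)        # axes: (seq, chans)
print(block(X).shape)          # torch.Size([10, 64])
```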