# Transformers using Named Tensor Notation

written by [@marc_lelarge](https://twitter.com/marc_lelarge)

We are using the [Named Tensor Notation](https://arxiv.org/abs/2102.13196) from David Chiang, Alexander M. Rush and Boaz Barak to describe the basic (i.e. without positional encoding and autoregressive mask) encoder block of Transformers defined in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. For a presentation of the attention mechanism and transformers, see [Module 12 - Attention and Transformers](https://dataflowr.github.io/website/modules/12-attention/).

![](https://i.imgur.com/07LP8sL.png)

## Basics about Named Tensor Notation

This notation should feel natural as it implements the elementwise operations, broadcasting, reductions and contractions familiar from NumPy and PyTorch. Recall that functions from vectors to vectors lift to functions on tensors that operate along one axis but leave the tensor shape unchanged. For example, the softmax function defined by
\begin{align*}
\newcommand{\namedtensorstrut}{\vphantom{fg}}
\newcommand{\nfun}[2]{\mathop{\underset{\substack{#1}}{\namedtensorstrut\mathrm{#2}}}}
\newcommand{\name}[1]{\mathsf{\namedtensorstrut #1}}
\newcommand{\ndef}[2]{\newcommand{#1}{\name{#2}}}
\ndef{\ax}{ax}
\ndef{\bx}{bx}
\newcommand{\reals}{\mathbb{R}}
\ndef{\batch}{batch}
\ndef{\layer}{layer}
\ndef{\chans}{chans}
\ndef{\key}{key}
\ndef{\seq}{seq}
\ndef{\val}{val}
\ndef{\heads}{heads}
\ndef{\hidden}{hidden}
\ndef{\height}{height}
\ndef{\width}{width}
\newcommand{\nbin}[2]{\mathbin{\underset{\substack{#1}}{\namedtensorstrut #2}}}
\newcommand{\ndot}[1]{\nbin{#1}{\odot}}
\nfun{\ax}{softmax} \colon \mathbb{R}^{\ax } &\rightarrow \mathbb{R}^{\ax } \\
\nfun{\ax}{softmax}(X) &= \frac{\exp(X)}{\sum_{\ax} \exp(X)}
\end{align*}
acts on any $X\in \reals^{\ax \times \bx}$, so that $\nfun{\ax}{softmax}(X)=Y\in \reals^{\ax \times \bx}$ satisfies $\sum_{\ax} Y = \mathbf{1}\in \reals^{\bx}$. The function $\nfun{\ax}{softmax}$ is only defined for tensors having an axis named $\ax$; this is the meaning of the line $\nfun{\ax}{softmax} \colon \mathbb{R}^{\ax } \rightarrow \mathbb{R}^{\ax }$. Note in particular that $\reals^\ax$ is NOT the domain of definition of the function; it specifies the minimal set of named axes required to apply the function.

Recall that elementwise multiplication is denoted by $\ndot{}$ and the dot product along axis $\ax$ is denoted by $\ndot{\ax}$. Here is one example taken from the original paper:
\begin{align*}
\newcommand{\nmatrix}[3]{#1\begin{array}[b]{@{}c@{}}#2\\\begin{bmatrix}#3\end{bmatrix}\end{array}}
A = \nmatrix{\height}{\width}{
3 & 1 & 4 \\
1 & 5 & 9 \\
2 & 6 & 5
} \in \reals^{\height \times \width},
\end{align*}
and
\begin{align*}
x &= \nmatrix{\height}{}{
2 \\
7 \\
1
} \in \reals^{\height}.
\end{align*}
Then, we have
\begin{align*}
A \ndot{} x = \nmatrix{\height}{\width}{
3\cdot 2 & 1\cdot 2 & 4\cdot 2 \\
1\cdot 7 & 5\cdot 7 & 9\cdot 7 \\
2\cdot 1 & 6\cdot 1 & 5\cdot 1
} \in \reals^{\height \times \width},
\end{align*}
and
\begin{align*}
A \ndot{\height} x = \sum_{\height} A \ndot{} x = \nmatrix{}{\width}{
6+7+2 & 2+35+6 & 8+49+5
} \in \reals^{\width},
\end{align*}
corresponding to the standard vector-matrix product.
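As a quick sanity check, here is a minimal sketch in plain PyTorch reproducing the example above. The named axes $\height$ and $\width$ are only tracked through comments, and the variable names are ours, chosen for illustration.

```python
import torch

# A in R^{height x width}, x in R^{height}
A = torch.tensor([[3., 1., 4.],
                  [1., 5., 9.],
                  [2., 6., 5.]])   # axes: (height, width)
x = torch.tensor([2., 7., 1.])     # axes: (height,)

# Elementwise product with broadcasting: A ⊙ x, still of shape (height, width);
# x is unsqueezed so that it broadcasts along the width axis.
elementwise = A * x[:, None]       # axes: (height, width)

# Contraction along height: A ⊙_height x = sum_height (A ⊙ x), in R^{width}.
contraction = elementwise.sum(dim=0)
print(contraction)                 # tensor([15., 43., 62.])

# softmax_ax acts along one axis and keeps the shape unchanged.
Y = torch.softmax(A, dim=0)        # softmax over the height axis
print(Y.sum(dim=0))                # approximately ones, shape (width,)
```

PyTorch also offers an experimental named-tensor API, but ordinary tensors with a fixed axis convention are enough for this note.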
## Linear layers

For $W\in \reals^{\ax \times \bx}$ and $b\in \reals^{\bx}$, we define a linear layer by:
\begin{align*}
\mathrm{LN_{W}}\colon \reals^\ax &\rightarrow \reals^{\bx}\\
\mathrm{LN_{W}}(X) &= X \ndot{\ax} W + b
\end{align*}
Note that if $X\in \reals^{\ax\times \seq}$, then $\mathrm{LN_{W}}(X)\in \reals^{\bx \times \seq}$.

## Feedforward neural networks

A feedforward neural network is defined for $W_1\in \reals^{\ax\times \hidden}$, $b_1\in \reals^{\hidden}$ and $W_2\in \reals^{\hidden\times \bx}$, $b_2\in \reals^{\bx}$ as:
\begin{align*}
\mathrm{FFN}: \reals^{\ax} & \rightarrow \reals^{\bx}\\
\mathrm{FFN}(X) &= \mathrm{ReLU}(X \ndot{\ax} W_1 + b_1) \ndot{\hidden} W_2+b_2
\end{align*}
Again, if $X\in \reals^{\ax\times \seq}$, then $\mathrm{FFN}(X)\in \reals^{\bx \times \seq}$.

## Attention and SelfAttention

\begin{align*}
\text{Attention} \colon \mathbb{R}^{\key} \times \mathbb{R}^{\seq \times\key} \times \mathbb{R}^{\seq \times\val} &\rightarrow \mathbb{R}^{\val} \\
\text{Attention}(Q,K,V) &= \left( \nfun{\seq}{softmax} \frac{Q \ndot{\key} K}{\sqrt{|\key|}} \right) \ndot{\seq} V.
\end{align*}
This definition takes a single query vector $Q$ and returns a single result vector (and could actually be reduced further to a scalar value, as the $\val$ axis is not strictly necessary). To apply it to a sequence of queries, we give $Q$ a $\seq'$ axis, and the function computes an output sequence. Providing $Q$, $K$, and $V$ with a $\heads$ axis lifts the function to compute multiple attention heads.

We can now define SelfAttention. Let $W_Q\in \reals^{\chans\times \key}$, $b_Q\in \reals^\key$, $W_K\in \reals^{\chans\times \key}$, $b_K\in \reals^\key$ and $W_V\in \reals^{\chans \times \val}$, $b_V\in \reals^\val$, with $|\val| = |\chans|$:
\begin{align*}
\text{SelfAttention} \colon \mathbb{R}^{\chans \times\seq} &\rightarrow \mathbb{R}^{\val\times \seq} \\
\text{SelfAttention}(X) &= \text{Attention}(\mathrm{LN_Q}(X), \mathrm{LN_K}(X), \mathrm{LN_V}(X)),
\end{align*}
where $\mathrm{LN_Q}$ (resp. $\mathrm{LN_K}$ and $\mathrm{LN_V}$) is the linear layer associated with $W_Q$ (resp. $W_K$ and $W_V$). (Strictly speaking, the $\seq$ axis of the query $\mathrm{LN_Q}(X)$ should first be renamed $\seq \to \seq'$ so that Attention lifts over it, and the $\seq'$ axis of the result renamed back to $\seq$; we keep the notation light here.) Note that the axes of the output are $\seq$ and $\val$; to add it to the input we need to rename $\val \to \chans$, which is possible because $|\val|=|\chans|$. So in the end we can compute
\begin{align*}
X + \mathrm{SelfAttention}(X)_{\val \to \chans} \in \reals^{\chans \times \seq}
\end{align*}

## Normalization Layers

We can define a single generic standardization function as:
\begin{align*}
\nfun{\ax}{standardize} \colon \mathbb{R}^{\ax } &\rightarrow \mathbb{R}^{\ax } \\
\nfun{\ax}{standardize}(X) &= \frac{X - \nfun{\ax}{mean}(X)}{\sqrt{\nfun{\ax}{var}(X) + \epsilon}}
\end{align*}
where $\epsilon > 0$ is a small constant for numerical stability. Note that
\begin{align*}
\nfun{\ax}{mean}(X) = \frac{1}{|\ax|}\sum_{\ax} X,
\end{align*}
so that if $X\in \reals^{\ax \times \bx}$, then $\nfun{\ax}{mean}(X) \in \reals^{\bx}$, and we are using broadcasting when we write $X - \nfun{\ax}{mean}(X)\in \reals^{\ax \times \bx}$ in the numerator above, and similarly for the denominator.
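Before turning to the concrete normalization layers, here is a minimal sketch of $\nfun{\ax}{standardize}$ in plain PyTorch, where the named axis is passed as an ordinary `dim` argument; the function name, the `eps` value and the shapes are ours, chosen for illustration.

```python
import torch

def standardize(X: torch.Tensor, dim: int, eps: float = 1e-5) -> torch.Tensor:
    """Sketch of standardize_ax: subtract the mean and divide by the standard
    deviation along one axis; the output has the same shape as X."""
    mean = X.mean(dim=dim, keepdim=True)                  # mean_ax(X), kept broadcastable
    var = ((X - mean) ** 2).mean(dim=dim, keepdim=True)   # var_ax(X)
    return (X - mean) / torch.sqrt(var + eps)

X = torch.randn(4, 5)        # axes: (ax, bx)
Z = standardize(X, dim=0)    # standardize along ax; shape unchanged
print(Z.shape)               # torch.Size([4, 5])
print(Z.mean(dim=0))         # approximately zero along ax, shape (bx,)
```

Here `keepdim=True` plays the role of the broadcasting used in the named notation when subtracting $\nfun{\ax}{mean}(X)$ from $X$.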
Then, we can define the three kinds of normalization layers, all with type $\reals^{\batch \times \chans \times \layer} \rightarrow \reals^{\batch \times \chans \times \layer}$:
\begin{align*}
\text{BatchNorm}(X; \gamma, \beta) &= \nfun{\batch,\layer}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans} \\
\text{InstanceNorm}(X; \gamma, \beta) &= \nfun{\layer}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans} \\
\text{LayerNorm}(X; \gamma, \beta) &= \nfun{\layer,\chans}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans\times \layer}
\end{align*}
Note that the shape of the output is always the same as the shape of the input for normalization layers.

For transformers, we will use a particular LayerNorm with $|\layer|=1$, so that we can simplify it as follows:
\begin{align*}
\text{LayerNorm}\colon \reals^{\chans} &\rightarrow \reals^{\chans}\\
\text{LayerNorm}(X) &= \nfun{\chans}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans}
\end{align*}

## A simple Transformer block

To simplify, we omit multiple heads here. We also present the pre-LN Transformer (see [On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745) by Xiong et al.), where LayerNorm is applied before the SelfAttention and the feedforward network.

![](https://i.imgur.com/ldpA9AW.png)

\begin{align*}
X &\in \reals^{\chans \times \seq}\\
X_1&=\text{LayerNorm}(X) \in \reals^{\chans\times \seq}\\
X_2&=X+\mathrm{SelfAttention}(X_1)_{\val \to \chans} \in \reals^{\chans \times \seq}\\
Y &=X_2 + \mathrm{FFN}(\text{LayerNorm}(X_2))\in \reals^{\chans\times \seq}
\end{align*}
Note that we take the simple version of LayerNorm with $|\layer|=1$. The values along the $\seq$ axis are mixed only in the SelfAttention layer.

## Summary

[![](https://i.imgur.com/kgtJWBn.png)](https://www.dataflowr.com)

\begin{align*}
\text{LayerNorm}\colon \reals^{\chans} &\rightarrow \reals^{\chans}\\
\text{LayerNorm}(X) &= \nfun{\chans}{standardize}(X) \ndot{} \gamma + \beta & \gamma, \beta &\in \reals^{\chans}
\end{align*}

\begin{align*}
\text{Attention} \colon \mathbb{R}^{\key} \times \mathbb{R}^{\seq \times\key} \times \mathbb{R}^{\seq \times\val} &\rightarrow \mathbb{R}^{\val} \\
\text{Attention}(Q,K,V) &= \left( \nfun{\seq}{softmax} \frac{Q \ndot{\key} K}{\sqrt{|\key|}} \right) \ndot{\seq} V.
\end{align*}

Let $W_Q, W_K\in \reals^{\chans\times \key}$, $b_Q,b_K\in \reals^\key$, and $W_V\in \reals^{\chans \times \val}$, $b_V\in \reals^\val$, with $|\val| = |\chans|$:
\begin{align*}
\text{SelfAttention} \colon \mathbb{R}^{\chans \times\seq} &\rightarrow \mathbb{R}^{\val\times \seq} \\
\text{SelfAttention}(X) &= \text{Attention}(X \ndot{\chans} W_Q + b_Q, X \ndot{\chans} W_K + b_K, X \ndot{\chans} W_V + b_V)
\end{align*}

\begin{align*}
\mathrm{FFN}: \reals^{\chans} & \rightarrow \reals^{\chans}\\
\mathrm{FFN}(X) &= \mathrm{ReLU}(X \ndot{\chans} W_1 + b_1) \ndot{\hidden} W_2+b_2
\end{align*}

\begin{align*}
X &\in \reals^{\chans \times \seq}\\
X_1&=\text{LayerNorm}(X) \in \reals^{\chans\times \seq}\\
X_2&=X+\mathrm{SelfAttention}(X_1)_{\val \to \chans} \in \reals^{\chans \times \seq}\\
Y &=X_2 + \mathrm{FFN}(\text{LayerNorm}(X_2))\in \reals^{\chans\times \seq}
\end{align*}

For a presentation of the attention mechanism and transformers, see [Module 12 - Attention and Transformers](https://dataflowr.github.io/website/modules/12-attention/).

###### tags: `public` `dataflowr` `transformers`
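To make the summary concrete, here is a minimal single-head, pre-LN encoder block written in plain PyTorch and following the equations above. It is a sketch, not a reference implementation: tensors are laid out as $(\seq, \chans)$, the class name and the widths are ours, `nn.Linear` plays the role of the linear layers $\mathrm{LN_Q}$, $\mathrm{LN_K}$, $\mathrm{LN_V}$, and choosing $W_V$ to map $\chans$ to $\chans$ enforces $|\val| = |\chans|$, so the renaming $\val \to \chans$ is implicit.

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Single-head, pre-LN encoder block following the equations above.
    Tensors are laid out as (seq, chans); axis names are tracked in comments."""

    def __init__(self, chans: int, key: int, hidden: int):
        super().__init__()
        self.W_Q = nn.Linear(chans, key)    # LN_Q : chans -> key
        self.W_K = nn.Linear(chans, key)    # LN_K : chans -> key
        self.W_V = nn.Linear(chans, chans)  # LN_V : chans -> val, with |val| = |chans|
        self.ffn = nn.Sequential(           # FFN : chans -> hidden -> chans
            nn.Linear(chans, hidden), nn.ReLU(), nn.Linear(hidden, chans))
        self.norm1 = nn.LayerNorm(chans)    # LayerNorm over the chans axis
        self.norm2 = nn.LayerNorm(chans)

    def self_attention(self, X: torch.Tensor) -> torch.Tensor:
        # X: (seq, chans); Q, K: (seq, key); V: (seq, chans)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        # scores = Q ⊙_key K / sqrt(|key|): rows index queries, columns index keys
        scores = Q @ K.transpose(0, 1) / math.sqrt(Q.shape[-1])
        weights = torch.softmax(scores, dim=-1)  # softmax over the seq axis of K
        return weights @ V                       # ⊙_seq V, shape (seq, chans)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (seq, chans)
        X1 = self.norm1(X)                    # X1 = LayerNorm(X)
        X2 = X + self.self_attention(X1)      # X2 = X + SelfAttention(X1)
        return X2 + self.ffn(self.norm2(X2))  # Y = X2 + FFN(LayerNorm(X2))

# Usage sketch (shapes only):
block = PreLNBlock(chans=64, key=32, hidden=256)
X = torch.randn(10, 64)        # axes: (seq, chans)
print(block(X).shape)          # torch.Size([10, 64])
```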