written by @marc_lelarge
We are using the Named Tensor Notation from David Chiang, Alexander M. Rush and Boaz Barak to describe the basic (i.e. without position encoding and autoregressive mask) encoder block of Transformers defined in Attention Is All You Need by Vaswani et al.
For a presentation of the attention mechanism and transformers, see Module 12 - Attention and Transformers.
These notations should feel natural as they implement the elementwise operations, broadcasting, reductions and contractions that are natural in numpy and PyTorch.
Perhaps we should recall that functions from vectors to vectors lift to functions on tensors that operate along one axis but leave the tensor shape unchanged. For example, the softmax function along an axis $\mathsf{ax}$, defined by
$$\operatorname*{softmax}_{\mathsf{ax}} X = \frac{\exp X}{\sum\limits_{\mathsf{ax}} \exp X},$$
will act on any tensor having an axis named $\mathsf{ax}$, whatever its other axes, and returns a tensor with the same shape.
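As a minimal PyTorch sketch of this lifting (using ordinary positional dimensions in place of named axes; the shapes below are illustrative assumptions, not taken from the post):

```python
import torch

def softmax_along(X: torch.Tensor, dim: int) -> torch.Tensor:
    # Lift the vector softmax to tensors: normalize along one axis,
    # leave the shape unchanged.
    return torch.softmax(X, dim=dim)

# The same function acts on tensors of any shape sharing that axis.
x = torch.randn(5)            # a single vector
X = torch.randn(2, 3, 5)      # a stack of matrices with the same last axis
print(softmax_along(x, dim=-1).shape)  # torch.Size([5])
print(softmax_along(X, dim=-1).shape)  # torch.Size([2, 3, 5])
```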
Recall that elementwise multiplication is denoted by $\odot$ and that contraction along a named axis $\mathsf{ax}$ is denoted by $\overset{\mathsf{ax}}{\odot}$, i.e. $A \overset{\mathsf{ax}}{\odot} B = \sum_{\mathsf{ax}} A \odot B$. Then, for $A \in \mathbb{R}^{\mathsf{i} \times \mathsf{j}}$ and $B \in \mathbb{R}^{\mathsf{j} \times \mathsf{k}}$, we have $A \overset{\mathsf{j}}{\odot} B \in \mathbb{R}^{\mathsf{i} \times \mathsf{k}}$ and
$$\left(A \overset{\mathsf{j}}{\odot} B\right)_{\mathsf{i}(i), \mathsf{k}(k)} = \sum_{j} A_{\mathsf{i}(i), \mathsf{j}(j)} \, B_{\mathsf{j}(j), \mathsf{k}(k)},$$
corresponding to the standard matrix multiplication.
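A quick PyTorch sketch of elementwise multiplication and of contraction along a shared axis (written with einsum over positional dimensions, since the axis names are only notational):

```python
import torch

A = torch.randn(4, 3)   # axes (i, j)
B = torch.randn(3, 5)   # axes (j, k)

# Contraction along the shared axis j: multiply elementwise (with broadcasting)
# and sum over j -- this is exactly matrix multiplication.
C = torch.einsum("ij,jk->ik", A, B)
assert torch.allclose(C, A @ B)

# Elementwise multiplication requires tensors with the same (broadcastable) axes.
D = torch.randn(4, 3)
E = A * D               # axes (i, j), nothing is summed over
```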
For $Q \in \mathbb{R}^{\mathsf{key}}$, $K \in \mathbb{R}^{\mathsf{seq} \times \mathsf{key}}$ and $V \in \mathbb{R}^{\mathsf{seq} \times \mathsf{val}}$, attention is defined by
$$\operatorname{Attention}(Q, K, V) = \operatorname*{softmax}_{\mathsf{seq}} \left( \frac{Q \overset{\mathsf{key}}{\odot} K}{\sqrt{|\mathsf{key}|}} \right) \overset{\mathsf{seq}}{\odot} V \in \mathbb{R}^{\mathsf{val}}.$$
Note that if $Q$ has additional axes, the definition still applies thanks to broadcasting, and these extra axes simply carry over to the output.
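A minimal PyTorch sketch of this scaled dot-product attention for a single query, where the last dimension plays the role of $\mathsf{key}$/$\mathsf{val}$ and the first one of $\mathsf{seq}$ (the shapes are my assumptions):

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q: (d_key,)   K: (n_seq, d_key)   V: (n_seq, d_val)
    d_key = Q.shape[-1]
    scores = (K @ Q) / math.sqrt(d_key)      # contract over key -> (n_seq,)
    weights = torch.softmax(scores, dim=-1)  # softmax over seq
    return weights @ V                       # contract over seq -> (d_val,)

Q = torch.randn(8)
K = torch.randn(10, 8)
V = torch.randn(10, 16)
print(attention(Q, K, V).shape)  # torch.Size([16])
```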
Now a feedforward neural network is defined for $X \in \mathbb{R}^{\mathsf{layer}}$, with parameters $W^1 \in \mathbb{R}^{\mathsf{hidden} \times \mathsf{layer}}$, $b^1 \in \mathbb{R}^{\mathsf{hidden}}$, $W^2 \in \mathbb{R}^{\mathsf{layer} \times \mathsf{hidden}}$ and $b^2 \in \mathbb{R}^{\mathsf{layer}}$, by
$$\operatorname{FFN}(X) = W^2 \overset{\mathsf{hidden}}{\odot} \operatorname{relu}\left(W^1 \overset{\mathsf{layer}}{\odot} X + b^1\right) + b^2.$$
Again, if $X$ has additional axes (such as $\mathsf{seq}$), the same formula applies by broadcasting and the output keeps the shape of the input.
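A corresponding PyTorch sketch (the dimension names d_model and d_hidden are illustrative assumptions); since nn.Linear acts on the last axis and broadcasts over leading ones, the same module accepts a single vector or a whole sequence:

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.lin1 = nn.Linear(d_model, d_hidden)  # W^1, b^1
        self.lin2 = nn.Linear(d_hidden, d_model)  # W^2, b^2

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # Acts on the last (layer) axis; any leading axes broadcast.
        return self.lin2(torch.relu(self.lin1(X)))

ffn = FFN()
print(ffn(torch.randn(64)).shape)      # torch.Size([64])
print(ffn(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```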
This definition takes a single query vector $Q$; thanks to broadcasting, it handles a whole tensor of queries at once, which is what we need next. We can now define SelfAttention. Let $X \in \mathbb{R}^{\mathsf{seq} \times \mathsf{layer}}$ be the input sequence and let $W^Q, W^K \in \mathbb{R}^{\mathsf{key} \times \mathsf{layer}}$, $W^V \in \mathbb{R}^{\mathsf{val} \times \mathsf{layer}}$ and $W^O \in \mathbb{R}^{\mathsf{layer} \times \mathsf{val}}$ be the parameters. We define
$$\operatorname{SelfAttention}(X) = W^O \overset{\mathsf{val}}{\odot} \operatorname{Attention}(Q, K, V),$$
where
$$Q = \left(W^Q \overset{\mathsf{layer}}{\odot} X\right)_{\mathsf{seq} \to \mathsf{seq}'}, \qquad K = W^K \overset{\mathsf{layer}}{\odot} X, \qquad V = W^V \overset{\mathsf{layer}}{\odot} X,$$
and the $\mathsf{seq}$ axis of the queries has been renamed to $\mathsf{seq}'$ so that it is not contracted away with the $\mathsf{seq}$ axis of the keys and values. Note that the names of the output are then $\mathsf{seq}' \times \mathsf{layer}$; renaming $\mathsf{seq}'$ back to $\mathsf{seq}$ gives an output with exactly the same axes as the input $X$.
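A single-head PyTorch sketch along these lines (the projection sizes and the output projection Wo are my assumptions; the renaming of axes is implicit since dimensions are positional here):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Single-head self-attention: X (n_seq, d_model) -> (n_seq, d_model).
    def __init__(self, d_model: int = 64, d_key: int = 32, d_val: int = 32):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_key, bias=False)
        self.Wk = nn.Linear(d_model, d_key, bias=False)
        self.Wv = nn.Linear(d_model, d_val, bias=False)
        self.Wo = nn.Linear(d_val, d_model, bias=False)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.Wq(X), self.Wk(X), self.Wv(X)                # (n_seq, d_*)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])   # (n_seq', n_seq)
        weights = torch.softmax(scores, dim=-1)                     # softmax over the keys' seq axis
        return self.Wo(weights @ V)                                 # (n_seq, d_model)

sa = SelfAttention()
print(sa(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```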
We can define a single generic standardization function as:
$$\operatorname*{standardize}_{\mathsf{ax}}(X) = \frac{X - \operatorname*{mean}_{\mathsf{ax}}(X)}{\sqrt{\operatorname*{var}_{\mathsf{ax}}(X) + \epsilon}},$$
where $\operatorname*{mean}_{\mathsf{ax}}$ and $\operatorname*{var}_{\mathsf{ax}}$ are the mean and the variance taken along the axis $\mathsf{ax}$, and $\epsilon > 0$ is a small constant added for numerical stability. Note that $\operatorname*{standardize}_{\mathsf{ax}}$ has type $\mathbb{R}^{\mathsf{ax}} \to \mathbb{R}^{\mathsf{ax}}$, so that if $X$ has other axes, the function acts independently along each of them by broadcasting.
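A small PyTorch sketch of this generic standardization over a chosen dimension (the value of eps is an arbitrary choice):

```python
import torch

def standardize(X: torch.Tensor, dim: int, eps: float = 1e-5) -> torch.Tensor:
    # Subtract the mean and divide by the standard deviation along `dim`;
    # all other axes are left untouched (broadcasting does the rest).
    mean = X.mean(dim=dim, keepdim=True)
    var = X.var(dim=dim, unbiased=False, keepdim=True)
    return (X - mean) / torch.sqrt(var + eps)

X = torch.randn(10, 64)
print(standardize(X, dim=-1).shape)  # torch.Size([10, 64])
```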
Then, we can define the three kinds of normalization layers (BatchNorm, InstanceNorm and LayerNorm), all with the same type: each takes an input tensor together with learnable parameters $\gamma$ and $\beta$, and they differ only in the axes along which the standardization is performed and in the axes carried by $\gamma$ and $\beta$.
Note that the shape of the output is always the same as the shape of the input for normalization layers.
For transformers, we will use a particular LayerNorm where the standardization is performed along the $\mathsf{layer}$ axis (the features of each token):
$$\operatorname{LayerNorm}(X) = \gamma \odot \operatorname*{standardize}_{\mathsf{layer}}(X) + \beta, \qquad \gamma, \beta \in \mathbb{R}^{\mathsf{layer}}.$$
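A PyTorch sketch of this LayerNorm, built like the standardize function above (it is equivalent in spirit to nn.LayerNorm over the last dimension; the sizes are illustrative):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Standardize along the last (layer) axis, then rescale and shift.
    def __init__(self, d_model: int = 64, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        mean = X.mean(dim=-1, keepdim=True)
        var = X.var(dim=-1, unbiased=False, keepdim=True)
        return self.gamma * (X - mean) / torch.sqrt(var + self.eps) + self.beta

X = torch.randn(10, 64)
print(LayerNorm()(X).shape)  # torch.Size([10, 64])
```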
To simplify, we omit multiple heads here. We also present the pre-LN Transformer (see On Layer Normalization in the Transformer Architecture by Xiong et al.), where the LayerNorms are applied before the SelfAttention and the Feed-Forward Network. The encoder block then maps an input $X \in \mathbb{R}^{\mathsf{seq} \times \mathsf{layer}}$ to
$$Z = X + \operatorname{SelfAttention}(\operatorname{LayerNorm}(X)),$$
$$\operatorname{TransformerBlock}(X) = Z + \operatorname{FFN}(\operatorname{LayerNorm}(Z)).$$
Note that we take the simple version of LayerNorm, with $\gamma = 1$ and $\beta = 0$ (no learnable scale or shift).
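Putting the pieces together, here is a compact single-head, pre-LN encoder block in PyTorch; it reuses the SelfAttention and FFN sketches above, and the dimension choices (and the use of elementwise_affine=False to get $\gamma = 1$, $\beta = 0$) are my assumptions:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-LN encoder block: LayerNorm -> SelfAttention -> residual,
    # then LayerNorm -> FFN -> residual.
    def __init__(self, d_model: int = 64, d_hidden: int = 256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model, elementwise_affine=False)  # gamma = 1, beta = 0
        self.ln2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = SelfAttention(d_model)   # single-head sketch from above
        self.ffn = FFN(d_model, d_hidden)    # feedforward sketch from above

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        Z = X + self.attn(self.ln1(X))
        return Z + self.ffn(self.ln2(Z))

block = PreLNBlock()
print(block(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```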