# Self-attention Transformers

## Issues with recurrent models

### Linear Interaction Distance

* RNNs take $O(\text{Sequence Length})$ steps for distant word pairs to interact.
* This makes it hard for the model to learn interactions between distant words, even in cases where such long-distance interactions are exactly what is needed.
* The linear order of words is "baked in", but linear order isn't necessarily the right way to think about sentence structure.

### Lack of parallelizability

* Forward and backward passes have $O(\text{Sequence Length})$ **unparallelizable operations**.
* GPUs can perform many independent computations at once. However, we can't directly parallelize all of the operations here, because future RNN hidden states can't be computed in full before past RNN hidden states have been computed (see the sketch at the end of this section).
* This inhibits training on very large datasets.

## Self-Attention

![](https://i.imgur.com/XZUpm7U.png)
![](https://i.imgur.com/ZiPUXJD.png)

* There are a few things to note before we can use self-attention as an NLP building block (see the self-attention sketch at the end of this section):
    * The order in which words appear in a sentence is not taken into account.
    * The network is purely linear, so stacking more self-attention layers just re-averages vectors.
    * Since every position attends to every other position, the model may be able to "see" values at future time-steps.

![](https://i.imgur.com/nsywa6K.png)
![](https://i.imgur.com/l0nkPFy.png)
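The parallelizability issue is easiest to see in code. Below is a minimal sketch (not from the lecture) of a vanilla RNN forward pass; the dimensions and weight names (`W_xh`, `W_hh`) are illustrative assumptions. The point is that the loop over time steps cannot be parallelized, because each $h_t$ depends on $h_{t-1}$.

```python
# Minimal sketch: why RNN computation is inherently sequential.
import torch

T, d_in, d_h = 8, 16, 32                       # sequence length, input dim, hidden dim (assumed)
x = torch.randn(T, d_in)                       # one input vector per time step
W_xh = torch.randn(d_in, d_h) * 0.1            # input-to-hidden weights (illustrative init)
W_hh = torch.randn(d_h, d_h) * 0.1             # hidden-to-hidden weights

h = torch.zeros(d_h)
hidden_states = []
for t in range(T):                             # O(T) steps that must run one after another
    h = torch.tanh(x[t] @ W_xh + h @ W_hh)     # h_t depends on h_{t-1}
    hidden_states.append(h)
```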
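To make the three caveats above concrete, here is a minimal single-head self-attention sketch that applies the standard fixes: learned position embeddings (so word order matters), a causal mask (so positions can't "see" future values), and a feed-forward nonlinearity (so stacking layers isn't purely linear re-averaging). The module name, dimensions, and hyperparameters are assumptions for illustration, not the lecture's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model=64, max_len=128):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # injects word-order information
        self.W_q = nn.Linear(d_model, d_model)          # query projection
        self.W_k = nn.Linear(d_model, d_model)          # key projection
        self.W_v = nn.Linear(d_model, d_model)          # value projection
        self.ff = nn.Sequential(                        # elementwise nonlinearity
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x: (batch, T, d_model) word embeddings
        B, T, d = x.shape
        positions = torch.arange(T, device=x.device)
        x = x + self.pos_emb(positions)                            # add position embeddings

        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / d ** 0.5                # (B, T, T) attention scores

        # Causal mask: position i may only attend to positions j <= i.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        attn = F.softmax(scores, dim=-1)                           # attention weights
        out = attn @ v                                             # weighted average of values
        return self.ff(out)                                        # nonlinearity after attention

x = torch.randn(2, 10, 64)            # batch of 2 sequences, 10 tokens each, d_model=64
y = SelfAttentionBlock()(x)           # -> shape (2, 10, 64)
```

In this sketch the mask zeroes out (via $-\infty$ scores) every entry above the diagonal of the attention matrix, which is what prevents the model from attending to future time-steps; removing the `ff` network would leave the stack purely linear, and removing `pos_emb` would make the block order-invariant.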