# Self-attention Transformers

## Issues with recurrent models

### Linear Interaction Distance

* RNNs take $O(\text{Sequence Length})$ steps for distant word pairs to interact.
* This makes it hard for the model to learn interactions between distant words, even in cases where such long-distance interactions are exactly what is needed.
* The linear order of words is "baked in", but linear order isn't necessarily the right way to think about sentence structure.

### Lack of parallelizability

* Forward and backward passes have $O(\text{Sequence Length})$ **unparallelizable operations**.
* GPUs can perform many independent computations at once. However, we can't directly parallelize all of the operations here, because future RNN hidden states can't be computed in full before past RNN hidden states have been computed (see the sketch at the end of this section).
* This inhibits training on very large datasets.

## Self-Attention

![](https://i.imgur.com/XZUpm7U.png)
![](https://i.imgur.com/ZiPUXJD.png)

* There are a few things to note before we can use self-attention as an NLP building block (see the self-attention sketch at the end of this section):
    * The order in which words appear in a sentence is not taken into account.
    * The network is purely linear, so stacking more self-attention layers just re-averages vectors.
    * Since every position attends to every other position, the model may be able to "see" values at future time-steps.

![](https://i.imgur.com/nsywa6K.png)
![](https://i.imgur.com/l0nkPFy.png)
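The parallelizability issue is easiest to see in code. Below is a minimal sketch (not from the lecture) of a vanilla RNN forward pass; the dimensions and weight names (`W_xh`, `W_hh`) are illustrative assumptions. The point is that the loop over time steps cannot be parallelized, because each $h_t$ depends on $h_{t-1}$.

```python
# Minimal sketch: why RNN computation is inherently sequential.
import torch

T, d_in, d_h = 8, 16, 32                       # sequence length, input dim, hidden dim (assumed)
x = torch.randn(T, d_in)                       # one input vector per time step
W_xh = torch.randn(d_in, d_h) * 0.1            # input-to-hidden weights (illustrative init)
W_hh = torch.randn(d_h, d_h) * 0.1             # hidden-to-hidden weights

h = torch.zeros(d_h)
hidden_states = []
for t in range(T):                             # O(T) steps that must run one after another
    h = torch.tanh(x[t] @ W_xh + h @ W_hh)     # h_t depends on h_{t-1}
    hidden_states.append(h)
```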
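To make the three caveats above concrete, here is a minimal single-head self-attention sketch that applies the standard fixes: learned position embeddings (so word order matters), a causal mask (so positions can't "see" future values), and a feed-forward nonlinearity (so stacking layers isn't purely linear re-averaging). The module name, dimensions, and hyperparameters are assumptions for illustration, not the lecture's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model=64, max_len=128):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # injects word-order information
        self.W_q = nn.Linear(d_model, d_model)          # query projection
        self.W_k = nn.Linear(d_model, d_model)          # key projection
        self.W_v = nn.Linear(d_model, d_model)          # value projection
        self.ff = nn.Sequential(                        # elementwise nonlinearity
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x: (batch, T, d_model) word embeddings
        B, T, d = x.shape
        positions = torch.arange(T, device=x.device)
        x = x + self.pos_emb(positions)                            # add position embeddings

        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / d ** 0.5                # (B, T, T) attention scores

        # Causal mask: position i may only attend to positions j <= i.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        attn = F.softmax(scores, dim=-1)                           # attention weights
        out = attn @ v                                             # weighted average of values
        return self.ff(out)                                        # nonlinearity after attention

x = torch.randn(2, 10, 64)            # batch of 2 sequences, 10 tokens each, d_model=64
y = SelfAttentionBlock()(x)           # -> shape (2, 10, 64)
```

In this sketch the mask zeroes out (via $-\infty$ scores) every entry above the diagonal of the attention matrix, which is what prevents the model from attending to future time-steps; removing the `ff` network would leave the stack purely linear, and removing `pos_emb` would make the block order-invariant.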