Self-attention Transformers
Issues with recurrent models
Linear Interaction Distance
RNNs take a number of steps linear in the distance between two words for those words to interact.
This makes it hard for distant words to influence each other, even when a long-distance dependency makes that interaction necessary (see the sketch at the end of this subsection).
The linear order of words is "baked in" to the computation, but linear order isn't the right way to think about sentence structure.
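A minimal sketch of the interaction-distance problem, assuming a vanilla RNN cell; the sizes and variable names below are hypothetical, chosen only for illustration. Information from position i must survive j − i sequential hidden-state updates before it can affect position j.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
embed_size, hidden_size, seq_len = 32, 64, 20
cell = nn.RNNCell(embed_size, hidden_size)

tokens = torch.randn(seq_len, 1, embed_size)   # stand-in word embeddings (batch of 1)
h = torch.zeros(1, hidden_size)
for t in range(seq_len):        # strictly sequential: h_t depends on h_{t-1}
    h = cell(tokens[t], h)
# For word i to affect word j, its signal must pass through j - i of these
# updates, so the interaction distance grows linearly with word distance.
```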
Lack of parallelizability
Forward and backward passes contain a number of unparallelizable operations that grows linearly with sequence length.
GPUs can perform many independent computations at once. But we can't fully exploit this here, because a future RNN hidden state can't be computed in full before the past RNN hidden states have been computed (see the sketch below).
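A rough sketch of the contrast, under standard assumptions (scaled dot-product self-attention, random weights, hypothetical shapes), not an implementation from the lecture: the recurrence forces a time-step-by-time-step loop, while self-attention computes every position in one batch of matrix multiplications.

```python
import torch

# Hypothetical shapes and randomly initialized weights, for illustration only.
seq_len, d = 20, 64
X = torch.randn(seq_len, d)            # all token representations at once

# Recurrence: h_t = f(h_{t-1}, x_t). Each step must wait for the previous
# one, so the time loop cannot be parallelized across positions.

# Single-head self-attention: every position attends to every other position
# via matrix multiplications, with no dependence between time steps.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / d ** 0.5            # (seq_len, seq_len) pairwise scores
attn = torch.softmax(scores, dim=-1)   # attention weights for each position
out = attn @ V                         # all positions computed in parallel
```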