# Notes on Universal Transformers

#### Author: [Sharath Chandra](https://sharathraparthy.github.io/)

## [Paper Link](https://arxiv.org/pdf/1807.03819.pdf)

## Introduction

1. RNNs can handle sequential data and have been successful in many sequence modeling tasks such as machine translation, music generation, sentiment analysis, etc.
2. However, these models cannot process the sequence in parallel, since each step depends on the previous hidden state.
3. Recently, Transformer networks proposed a way to remove the RNN inductive bias entirely by leveraging the popular self-attention mechanism. This enables parallelization and faster training.
4. However, there are tasks which RNNs handle with ease but which Transformer-like architectures fail to solve.
5. To tackle these issues, this paper proposes the best of both worlds: parallelization combined with the RNN's recurrent inductive bias.
6. The proposed model, the Universal Transformer (UT), is a generalization of the Transformer; it has greater theoretical capacity and improves results on a wide variety of challenging tasks.

## Architecture

1. One notable difference between the Transformer (Vaswani et al., 2017) and the Universal Transformer is the recurrent inductive bias.
2. This allows the Universal Transformer to have variable depth, as opposed to the constant depth of the standard Transformer.
3. In each recurrent step, the representation is refined by passing it through the self-attention block and then through the transition function, which gives the model its variable depth. More precisely, for an input sequence of length $m$, we construct an embedding matrix $H^0 \in \mathbb{R}^{m \times d}$. We then iteratively pass this embedding matrix through the self-attention block (multi-headed scaled dot-product attention) and then through the recurrent transition function. After $t$ recurrent steps we obtain the refined embedding $H^t$. (A minimal sketch of this recurrence is given at the end of these notes.)
4. These recurrent updates are performed in both the encoder and the decoder, while maintaining the same encoder-decoder communication as in the Transformer (i.e., the decoder cross-attends to the encoder's output). This is shown in the figure below.

   ![](https://i.imgur.com/3srMUAy.png)

5. This paper also proposes a dynamic halting mechanism which allocates computation depending on how ambiguous the input symbols are. More precisely, the paper uses Adaptive Computation Time (ACT) (Graves, 2016), which dynamically modulates the computation time depending on the input sequence. This gives UTs a dynamic, per-symbol variable depth which standard Transformers lack. (A sketch of the halting loop is also given at the end of these notes.)
6. Also note that in the UT, both the self-attention and transition weights are tied across recurrent steps (layers).

## Experiments and results

You can check the paper for the detailed experiments and results. Overall, due to the RNN inductive bias and the dynamic halting that yields variable-depth computation, UTs consistently outperform standard Transformers across the tasks tested.
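
## Appendix: sketches of the two core ideas

To make the recurrent refinement from point 3 of the Architecture section concrete, here is a minimal NumPy sketch of the encoder recurrence. It is a sketch under simplifying assumptions (single-head attention, a feed-forward transition, no position/timestep embeddings, no dropout), not the paper's implementation; the class and function names are my own.

```python
# Minimal NumPy sketch of the UT encoder recurrence: the same (tied)
# attention and transition weights are applied at every recurrent step.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

class UTEncoderSketch:
    def __init__(self, d, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        # Tied parameters: created once, reused at every recurrent step.
        self.Wq, self.Wk, self.Wv, self.Wo = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(4))
        self.W1 = rng.normal(0, d ** -0.5, (d, d_ff))
        self.W2 = rng.normal(0, d_ff ** -0.5, (d_ff, d))

    def self_attention(self, H):
        # Single-head scaled dot-product attention (the paper uses multi-head).
        Q, K, V = H @ self.Wq, H @ self.Wk, H @ self.Wv
        A = softmax(Q @ K.T / np.sqrt(H.shape[-1]))   # (m, m) attention weights
        return (A @ V) @ self.Wo

    def transition(self, H):
        # Position-wise feed-forward transition function.
        return np.maximum(0.0, H @ self.W1) @ self.W2

    def step(self, H):
        # One recurrent refinement: attention, then transition,
        # each wrapped in a residual connection and layer norm.
        H = layer_norm(H + self.self_attention(H))
        return layer_norm(H + self.transition(H))

    def forward(self, H0, T):
        # Refine H^0 for T recurrent steps to obtain H^T.
        H = H0
        for _ in range(T):
            H = self.step(H)
        return H

m, d = 5, 16                                         # sequence length, model dim
H0 = np.random.default_rng(1).normal(size=(m, d))    # embedding matrix H^0
HT = UTEncoderSketch(d, d_ff=64).forward(H0, T=4)    # refined embedding H^t, t = 4
print(HT.shape)                                      # (5, 16)
```

The key design point visible here is weight tying: unlike a standard Transformer, where each layer has its own parameters, the UT reuses one set of attention and transition weights, so the number of refinement steps `T` can be varied without changing the parameter count.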
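The dynamic halting from point 5 of the Architecture section can be sketched as follows. This is a simplified, NumPy-only illustration of ACT-style per-position halting: it follows the probability-weighted ("mean-field") output of Graves (2016) rather than the exact UT implementation, the refinement step is stubbed out, and the halting-unit names and shapes (`w_halt`, `b_halt`) are illustrative assumptions.

```python
# Sketch of per-position dynamic halting in the spirit of ACT (Graves, 2016):
# each position keeps refining until its accumulated halting probability
# crosses 1 - eps, or until max_steps is reached.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_halting(H, refine_step, w_halt, b_halt, max_steps=10, eps=0.01):
    m, d = H.shape
    cum_prob = np.zeros(m)            # accumulated halting probability per position
    remainders = np.zeros(m)
    n_updates = np.zeros(m)           # how many steps each position actually ran
    weighted_H = np.zeros_like(H)     # probability-weighted running state
    still_running = np.ones(m, dtype=bool)

    for _ in range(max_steps):
        p = sigmoid(H @ w_halt + b_halt)                  # halting prob. per position
        new_halted = still_running & (cum_prob + p > 1.0 - eps)
        still_running = still_running & ~new_halted

        # Positions halting now contribute their remainder; the rest add p.
        remainders[new_halted] = 1.0 - cum_prob[new_halted]
        cum_prob[still_running] += p[still_running]
        n_updates[still_running | new_halted] += 1

        update_w = np.where(new_halted, remainders, np.where(still_running, p, 0.0))
        H = refine_step(H)                                # one UT refinement step
        weighted_H += update_w[:, None] * H

        if not still_running.any():
            break

    # Positions that never crossed the threshold get their leftover mass here.
    weighted_H += ((1.0 - cum_prob) * still_running)[:, None] * H
    return weighted_H, n_updates

# Hypothetical usage with an identity refinement step, just to show the shapes;
# in the UT, refine_step would be one attention + transition refinement.
rng = np.random.default_rng(0)
H0 = rng.normal(size=(5, 16))
w_halt, b_halt = rng.normal(size=16), -1.0
H_final, steps = act_halting(H0, refine_step=lambda H: H, w_halt=w_halt, b_halt=b_halt)
print(H_final.shape, steps)
```

The point of the mechanism is that `steps` can differ per position: "easy" symbols halt after a step or two, while more ambiguous ones keep being refined, which is the per-symbol variable depth that standard Transformers lack.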