# Transformer

## Overview

![](https://i.imgur.com/DnvRE1F.png)
![](https://i.imgur.com/FpTbFjk.png)

## Code

* [Awesome Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer)
* [Compressing Large-Scale Transformer-Based Models: A Case Study on BERT, arXiv:2002.11985, 2020](https://arxiv.org/abs/2002.11985)
* [ViT-PyTorch](https://github.com/lucidrains/vit-pytorch)

## Model

* [Hugging Face Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0](https://github.com/huggingface/transformers)

## Others

* [Detailed explanation of every layer of the Transformer, a must for interviews (with code implementation)](http://blog.itpub.net/69942346/viewspace-2658350/)

## Papers

* [A Survey on Visual Transformer](https://arxiv.org/abs/2012.12556)
* [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169)
* [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805](https://arxiv.org/abs/1810.04805) [[Code]](https://github.com/google-research/bert)
* [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) [[Code]](https://github.com/huggingface/transformers)
* [DETR: End-to-End Object Detection with Transformers, ECCV 2020](https://arxiv.org/abs/2005.12872) [[Code]](https://github.com/facebookresearch/detr)
* [End-to-End Object Detection with Adaptive Clustering Transformer, arXiv:2011.09315, 2020](https://arxiv.org/abs/2011.09315)
* [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159)
* [TinyBERT: Distilling BERT for Natural Language Understanding, EMNLP 2020](https://arxiv.org/abs/1909.10351)

## Baseline

* [ConvLSTM-PyTorch](https://github.com/jhhuang96/ConvLSTM-PyTorch)

## Notes on Transformer by Hung-yi Lee

### [Transformer (2019-05-31)](https://www.youtube.com/watch?v=ugWDIIOHtPA)
### [[Machine Learning 2021] Self-attention (Part 1) (2021-03-12)](https://www.youtube.com/watch?v=hYdO9CscNes)
### [[Machine Learning 2021] Self-attention (Part 2) (2021-03-21)](https://www.youtube.com/watch?v=gmsMY5kc-zw&t=1s)
### [[Machine Learning 2021] Transformer (Part 1)](https://www.youtube.com/watch?v=n9TlOhRjYoc)
### [[Machine Learning 2021] Transformer (Part 2)](https://www.youtube.com/watch?v=N6aRv06iv2g)

![](https://i.imgur.com/vsOArUL.png)

---

![reference link](https://i.imgur.com/i5Fuuxi.png)

---

![reference link](https://i.imgur.com/5VEAzPf.png)

---

![reference link](https://i.imgur.com/fPMSZRw.png)

Take the weighted sum.

---

![reference link](https://i.imgur.com/sTHjsKd.png)

Self-attention allows the computation to be parallelized, and it can handle long-range dependencies in the input.

## How is the computation parallelized?

![](https://i.imgur.com/HUOAZtN.png)

* $A = K^{\top}Q$ (the attention matrix is obtained from the stacked keys and queries in a single matrix product)

![](https://i.imgur.com/lHj49Oj.png)

---

![reference link](https://i.imgur.com/gJMdiuC.png)

* $O = V\hat{A}$ (the outputs are the values weighted by the normalized attention)

![](https://i.imgur.com/diVHMzc.png)

* Summary

![](https://i.imgur.com/0JaSHjz.png)
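A minimal PyTorch sketch of the matrix-form computation summarized above, written in the common $QK^{\top}$ row convention; the names `X`, `Wq`, `Wk`, `Wv` and the $\sqrt{d}$ scaling are illustrative assumptions, not taken from the lecture slides:

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    """Matrix-form self-attention: every query attends to every key through a
    few matrix multiplications, which is why the computation parallelizes well."""
    Q = X @ Wq                          # queries, shape (n, d_k)
    K = X @ Wk                          # keys,    shape (n, d_k)
    V = X @ Wv                          # values,  shape (n, d_k)
    A = Q @ K.T / K.shape[-1] ** 0.5    # attention scores, shape (n, n)
    A_hat = F.softmax(A, dim=-1)        # normalize each row over the keys
    return A_hat @ V                    # weighted sum of values, shape (n, d_k)

# Toy usage: 4 input vectors of dimension 8.
X = torch.randn(4, 8)
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)     # shape (4, 8)
```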
* Multi-head Self-Attention (the same query is split into several heads)

![](https://i.imgur.com/YtB7OYJ.png)

Benefit of multi-head: each head can attend to different information.

* Solving the input-order problem: positional encoding (add $e^{i}$)

![](https://i.imgur.com/eHsNVkh.png)

A one-hot vector $p^{i}$ is used to represent the position associated with $e^{i}$.

---

## Replace RNN with self-attention

![reference link](https://i.imgur.com/rQMb3Rs.png)
![](https://i.imgur.com/NqTcK29.png)

---

![reference link](https://i.imgur.com/Ws6T8pm.png)
![](https://i.imgur.com/80QZi7F.png)
![](https://i.imgur.com/LhdsqRS.png)

---

![reference link](https://i.imgur.com/Ba5WwoQ.png)

---

![reference link](https://i.imgur.com/dNj8ZzZ.png)

# Application

![](https://i.imgur.com/ihULXEK.png)

# Self-Attention (stacked self-attention)

![](https://i.imgur.com/1jzUbZI.jpg)
![](https://i.imgur.com/jvfjkRy.png)
![](https://i.imgur.com/HHEhepm.png)
![](https://i.imgur.com/81s0O1L.png)
![](https://i.imgur.com/rjjPfhI.jpg)

So the parameters that actually need to be learned are $W^q$, $W^k$, and $W^v$?

![](https://i.imgur.com/rmJH4WZ.jpg)
![](https://i.imgur.com/1U9JC7k.png)
![](https://i.imgur.com/Jpfz4Kc.png)
![](https://i.imgur.com/qpyPwRY.png)
![](https://i.imgur.com/tp7Gc9A.png)
![](https://i.imgur.com/zeTaQ23.png)

---

# Transformer

![](https://i.imgur.com/TEsF9P8.png)

# DETR

![](https://i.imgur.com/ygKhHia.png)
![](https://i.imgur.com/4gXflCQ.png)

---

## Transformer Basics [[Link]](https://arxiv.org/pdf/2012.12556.pdf)

![](https://i.imgur.com/ikeS1rR.png)

The attention function between different input vectors is calculated as follows:

![](https://i.imgur.com/zq6KQDj.png)

The process can be unified into a single function:

![](https://i.imgur.com/47VugoW.png)

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module, with one exception: the key matrix K and value matrix V are derived from the encoder module, while the query matrix Q is derived from the previous decoder layer.

## Transformer vs. CNN vs. RNN

Compared with CNNs, which focus only on local characteristics, the transformer can capture long-distance characteristics, meaning that it can easily derive global information.

![](https://i.imgur.com/qtLlKGd.png)
![](https://i.imgur.com/T1d7E1c.png)
![](https://i.imgur.com/uBNFp8F.png)
![](https://i.imgur.com/t9gD6ux.png)
![](https://i.imgur.com/ecaYeLY.png)
![](https://i.imgur.com/UqyuTG9.png)

Self-attention is computationally expensive!

![](https://i.imgur.com/qZPgBci.png)

And in contrast to RNNs, whose hidden state must be computed sequentially, the transformer is more efficient because the outputs of the self-attention layer and of the fully connected layers can be computed in parallel and easily accelerated.

---

* [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)
* [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)
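The encoder-decoder attention described above (K and V taken from the encoder, Q from the decoder) can be sketched with PyTorch's built-in `nn.MultiheadAttention`; the tensor names and sizes below are illustrative assumptions, not taken from any of the linked repos:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

# Inputs use the default (seq_len, batch, d_model) layout.
memory = torch.randn(10, 2, d_model)  # encoder output
tgt = torch.randn(7, 2, d_model)      # decoder hidden states from the previous layer

# Q comes from the decoder side; K and V come from the encoder memory.
out, attn_weights = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)           # torch.Size([7, 2, 512])
print(attn_weights.shape)  # torch.Size([2, 7, 10]), averaged over heads
```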