# Transformer
## Overview


## Code
* [Awesome Visual-Transformer](https://github.com/dk-liang/Awesome-Visual-Transformer)
* [Compressing Large-Scale Transformer-Based Models: A Case Study on BERT, arXiv:2002.11985, 2020](https://arxiv.org/abs/2002.11985)
* [ViT-PyTorch](https://github.com/lucidrains/vit-pytorch)
## Model
* [Hugging Face Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0](https://github.com/huggingface/transformers)
## Others
* [Detailed explanation of each layer of the Transformer network (interview essential, with code implementation)](http://blog.itpub.net/69942346/viewspace-2658350/)
## Papers
* [A Survey on Visual Transformer](https://arxiv.org/abs/2012.12556)
* [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169)
* [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv preprint arXiv:1810.04805](https://arxiv.org/abs/1810.04805) [[Code]](https://github.com/google-research/bert)
* [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) [[Code]](https://github.com/huggingface/transformers)
* [DETR: End-to-End Object Detection with Transformers, ECCV 2020](https://arxiv.org/abs/2005.12872) [[Code]](https://github.com/facebookresearch/detr)
* [End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315, 2020](https://arxiv.org/abs/2011.09315)
* [Deformable-DETR: Deformable transformers for end-to-end object detection](https://arxiv.org/abs/2010.04159)
---
* [TinyBERT: Distilling BERT for natural language understanding, EMNLP 2020](https://arxiv.org/abs/1909.10351)
## Baseline
* [ConvLSTM-PyTorch](https://github.com/jhhuang96/ConvLSTM-PyTorch)
## Notes on Transformer by Hung-yi Lee
### [Transformer (2019-05-31)](https://www.youtube.com/watch?v=ugWDIIOHtPA)
### [[Machine Learning 2021] Self-attention (Part 1) (2021-03-12)](https://www.youtube.com/watch?v=hYdO9CscNes)
### [[Machine Learning 2021] Self-attention (Part 2) (2021-03-21)](https://www.youtube.com/watch?v=gmsMY5kc-zw&t=1s)
### [[Machine Learning 2021] Transformer (Part 1)](https://www.youtube.com/watch?v=n9TlOhRjYoc)
### [[Machine Learning 2021] Transformer (Part 2)](https://www.youtube.com/watch?v=N6aRv06iv2g)

---

Compute a weighted sum of the value vectors using the attention weights
---

Self-attention can be computed in parallel, and it can handle long-range dependencies in the data.
## How is the computation parallelized?

* $A = K^{T} Q$ (scores for every query-key pair in a single matrix product)


* $O = V \hat{A}$, where $\hat{A} = \mathrm{softmax}(A)$ (all outputs in one matrix product)
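The two matrix products above can be sketched in NumPy (toy sizes; the weight matrices here are random stand-ins for the learned $W^q$, $W^k$, $W^v$):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 4, 3                        # feature dim and sequence length (toy sizes)
I = rng.standard_normal((d, n))    # input vectors stacked as columns

# Random stand-ins for the learned projection matrices
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

Q, K, V = Wq @ I, Wk @ I, Wv @ I   # every position projected in one matmul each

A = K.T @ Q                        # (n, n): all query-key scores at once
A_exp = np.exp(A - A.max(axis=0, keepdims=True))
A_hat = A_exp / A_exp.sum(axis=0, keepdims=True)   # column-wise softmax
O = V @ A_hat                      # (d, n): all output vectors in one matmul

print(O.shape)                     # (4, 3)
```

Because every step is a dense matrix product rather than a loop over positions, the whole layer can be parallelized on a GPU.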

* Summary

* Multi-head Self-Attention (the same query is split into multiple heads!)

Advantage of multi-head: each head can attend to different information.
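A minimal NumPy sketch of the head-splitting idea (toy sizes, random weights; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, h = 3, 8, 2                  # sequence length, model dim, heads (toy sizes)
X = rng.standard_normal((n, d))

Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

d_head = d // h
def split_heads(M):
    """Split each projected vector into h pieces, one per head."""
    return M.reshape(n, h, d_head).transpose(1, 0, 2)   # (h, n, d_head)

Qh, Kh, Vh = map(split_heads, (Q, K, V))

# Each head attends independently, so each can focus on different information
scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = w / w.sum(axis=-1, keepdims=True)
heads = weights @ Vh                                    # (h, n, d_head)

# Concatenate the heads back into one output vector per position
O = heads.transpose(1, 0, 2).reshape(n, d)
print(O.shape)                                          # (3, 8)
```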
* Solving the problem of input order: positional encoding (add $e^{i}$ to each input)

Use a one-hot vector $p^{i}$ to represent the position of $e^{i}$; multiplying $p^{i}$ by a matrix $W^{P}$ then yields $e^{i}$.
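A small sketch of this construction, assuming the usual form $e^{i} = W^{P} p^{i}$ (the matrix $W^{P}$ below is random; the original Transformer paper replaces it with fixed sinusoids):

```python
import numpy as np

n_pos, d = 5, 4                      # number of positions, embedding dim (toy)

# One-hot position vectors p^i are the columns of an identity matrix
P = np.eye(n_pos)

# Multiplying a one-hot p^i by a matrix W^P picks out one column: e^i = W^P p^i
rng = np.random.default_rng(2)
W_P = rng.standard_normal((d, n_pos))
E = W_P @ P                          # column i is the positional vector e^i

# The original Transformer uses fixed sinusoids instead of a learned W^P
pos = np.arange(n_pos)[:, None]
i = np.arange(d // 2)[None, :]
angle = pos / (10000 ** (2 * i / d))
E_sin = np.zeros((n_pos, d))
E_sin[:, 0::2] = np.sin(angle)       # even dimensions: sine
E_sin[:, 1::2] = np.cos(angle)       # odd dimensions: cosine

print(E.shape, E_sin.shape)          # (4, 5) (5, 4)
```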
---
## Replace RNN with self-attention


---

# Application

# Self-Attention (stacked self-attention)





So the parameters that actually need to be learned are $W^q$, $W^k$, $W^v$?






---
# Transformer

# DETR


---
## Transformer Basics [[Link]](https://arxiv.org/pdf/2012.12556.pdf)

Given inputs packed into a matrix $X$, queries, keys, and values are obtained with learned projections $Q = XW^{Q}$, $K = XW^{K}$, $V = XW^{V}$, and the attention function between different input vectors is calculated as follows:

1. Compute scores between the query and key vectors: $S = Q K^{T}$.
2. Normalize the scores for gradient stability: $S_{n} = S / \sqrt{d_{k}}$.
3. Translate the scores into probabilities with the softmax function: $P = \mathrm{softmax}(S_{n})$.
4. Obtain the weighted value matrix: $Z = P V$.

The process can be unified into a single function:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$$
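A minimal NumPy sketch of this single function (toy shapes, random inputs; names are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(3)
n, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
print(attention(Q, K, V).shape)                     # (4, 8)
```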

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module with the following exceptions: The key matrix K and value matrix V are derived from the encoder module, and the query matrix Q is derived from the previous layer.
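A sketch of that difference (random toy data; `enc_out` stands in for the encoder output and `dec_in` for the previous decoder layer):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
enc_out = rng.standard_normal((6, d))   # 6 encoder positions
dec_in = rng.standard_normal((3, d))    # 3 decoder positions (previous layer)

Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

# Q comes from the decoder; K and V come from the encoder module
Q = dec_in @ Wq
K = enc_out @ Wk
V = enc_out @ Wv

scores = Q @ K.T / np.sqrt(d)           # (3, 6): decoder attends over encoder
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = w / w.sum(axis=-1, keepdims=True)
O = weights @ V                         # one output per decoder position
print(O.shape)                          # (3, 8)
```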
## Transformer vs. CNN vs. RNN
Compared with CNNs, which focus only on local characteristics, transformer can capture long-distance characteristics, meaning that it can easily derive global information.






The self-attention computation is very expensive (the attention matrix grows quadratically with sequence length)!

And in contrast to RNNs, whose hidden state must be computed sequentially, transformer is more efficient because the output of the self-attention layer and the fully connected layers can be computed in parallel and easily accelerated.
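A toy illustration of the contrast (random data; shapes are arbitrary): the RNN loop below must run its steps in order, while the attention output is a couple of batched matrix products, at the cost of an $(n, n)$ score matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 16, 8
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))

# RNN: n sequential steps -- step t cannot start before step t-1 finishes
h = np.zeros(d)
for t in range(n):
    h = np.tanh(X[t] + h @ W)

# Self-attention: all positions at once, but the score matrix is (n, n),
# so compute and memory grow quadratically with sequence length
scores = X @ X.T / np.sqrt(d)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = w / w.sum(axis=-1, keepdims=True)
O = weights @ X

print(h.shape, O.shape)   # (8,) (16, 8)
```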
---
* [Long Range Arena: A Benchmark for Efficient Transformers](https://arxiv.org/abs/2011.04006)
* [Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732)