# YT: Hung-Yi Lee ML2021, L10~L13, L18~L21: Self-Attention, Transformer, and Self-Supervised Learning
# L10 Self-Attention
- Sophisticated Input
- Input is **a vector** $\Rightarrow$ model $\Rightarrow$ Scalar or Class
    - Input is **a set of vectors** (the length may vary) $\Rightarrow$ model $\Rightarrow$ Scalars or Classes
- Vector Set as input
    - One-hot encoding: loses the relationships between words (which is what OHE does by design)
    - Word embedding: captures the relationships between words
        - [[to learn more: Unsupervised Learning - Word Embedding]](https://youtu.be/X7PH3NuYW0Q), or Vivian Chen's course again~~
    - Speech/audio sequence input: one frame (25 ms) = 400 sample points (at 16 kHz); per-frame features such as 39-dim MFCC or 80-dim filter-bank output...
        - Although each frame covers 25 ms of signal, the frame shift is 10 ms $\Rightarrow$ (1 s $\rightarrow$ 100 frames), so the amount of data is large.
    - A graph is also a set of vectors (treat each node as a vector)
        - [[blog: Social Network Analytics]](https://medium.com/analytics-vidhya/social-network-analytics-f082f4e21b16): each person is a node
        - [[Molecule (分子)]](http://www.twword.com/wiki/%E5%88%86%E5%AD%90): each atom is a node
- What is the output?
    - Each vector has a label (one-to-one)
        Examples:
        - POS tagging: I (N) saw (V) a (DET) saw (N)
        - Speech recognition: a phoneme for each frame
    - The whole sequence has one label (many-to-one)
        Examples:
        - Sentiment analysis: "this" "is" "good" $\Rightarrow$ "positive"
        - Speaker identification
        - Graph: e.g. a molecule, is it toxic? is it hydrophilic?
    - Many-to-many (the model decides the number of outputs itself): seq2seq, e.g. translation, speech recognition
- Sequence Labeling (one-to-one)
    - FC (fully connected network)
        - "I saw a saw"... for the input "saw", should the output be a verb or a noun?
        - Of course you can also feed in a window of context around each word.
- **Self-Attention**:
        - [**a$^1$**, **a$^2$**,..., **a$^n$**] $\Rightarrow$ **Self-attention** $\Rightarrow$ [**b$^1$**, **b$^2$**,..., **b$^n$**] $\Rightarrow$ **n FC** (fully connected) networks $\Rightarrow$ **n outputs**
        - Self-attention and FC layers can be alternated and stacked: [**v$^1$**, **v$^2$**,..., **v$^n$**] $\Rightarrow$ **Self-attention** $\Rightarrow$ **n FC** $\Rightarrow$ **Self-attention** $\Rightarrow$ **n FC** $\Rightarrow$ **Self-attention** $\Rightarrow$ **n FC** $\Rightarrow$ **n outputs**
        - **Self-attention** does not look at just a small range or a window of the input; it **considers the whole sequence**
- [[Attention Is All You Need]](https://arxiv.org/abs/1706.03762)
- [**a$^1$**, **a$^2$**,..., **a$^n$**] $\Rightarrow$ **Self-attention** $\Rightarrow$ [**b$^1$**, **b$^2$**,..., **b$^n$**]
            - **b$^i$** takes all of the **a**'s, i.e. [**a$^1$**, **a$^2$**,..., **a$^n$**], into account (the **a**'s can be either the input or a hidden layer)
        - How to generate **b$^1$**:
            - Find the vectors in the sequence that are relevant to $a^1$; the degree of relevance is the attention score $\alpha$
- **dot-product, $\alpha = q\cdot k$**
                - additive: $\alpha = \mathbf{W}\tanh(W^q a^1 + W^k a^2)$. Note that the extra $\mathbf{W}$ means this definition of the attention score requires learning yet another network
- $q^1 = W^qa^1$: **query** :+1:
- $k^2 = W^ka^2$: **key** :+1:
- attention score $\alpha_{1,2}=q^1\cdot k^2$
- if we have 4 inputs, then we have [$\alpha_{1, 1}$, $\alpha_{1, 2}$, $\alpha_{1, 3}$, $\alpha_{1, 4}$] for $q^1(=W^q a^1)$ $\Rightarrow$ **Softmax** $\Rightarrow$ [$\alpha^{'}_{1, 1}$, $\alpha^{'}_{1, 2}$, $\alpha^{'}_{1, 3}$, $\alpha^{'}_{1, 4}$]
                - $\alpha^{'}_{1,i} = \dfrac{e^{\alpha_{1,i}}}{\sum_j e^{\alpha_{1,j}}}$
                - You can also try ReLU in place of softmax
- $v^i = W^va^i$ :**value** :+1:
            - $b^1 = \sum\limits_i \alpha^{'}_{1,i}v^i$
            - **$b^1, b^2, b^3, b^4$ can all be computed in parallel**
        - Let's write it out once more (see the sketch below):
            - Each $a$ passes through three networks to get $q$, $k$, $v$.
            - The dot products of $q$ and $k$ give the attention scores, which can be fed through an activation such as softmax or ReLU.
            - The attention scores are then used to take a weighted sum of the $v$'s.
            - So beyond the three networks $W^q, W^k, W^v$, everything inside self-attention is just fixed computation!!
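A minimal NumPy sketch of this recipe for computing $b^1$ (variable names are mine, not from the lecture); $b^2, b^3, b^4$ follow the same pattern and can be computed in parallel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dk = 8, 6
a = [rng.normal(size=d) for _ in range(4)]                  # a^1 .. a^4
Wq, Wk, Wv = (rng.normal(size=(dk, d)) for _ in range(3))   # the three trainable networks

q1 = Wq @ a[0]                                        # query  q^1 = W^q a^1
k = [Wk @ ai for ai in a]                             # keys   k^j = W^k a^j
v = [Wv @ ai for ai in a]                             # values v^j = W^v a^j

alpha = np.array([q1 @ kj for kj in k])               # attention scores alpha_{1,j} = q^1 . k^j
alpha_prime = np.exp(alpha) / np.exp(alpha).sum()     # softmax -> alpha'_{1,j}
b1 = sum(w * vj for w, vj in zip(alpha_prime, v))     # b^1 = sum_j alpha'_{1,j} v^j
```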
# L11 Self-Attention (II)
- Matrix
- $a^1 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^1, k^1, v^1$
- $a^2 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^2, k^2, v^2$
- $a^3 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^3, k^3, v^3$
- $a^4 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^4, k^4, v^4$
- ($q^i = W^qa^i) \Rightarrow [q^1, q^2, q^3, q^4] = W^q[a^1, a^2, a^3,a^4]$
- #### $Q = W^q I$ :+1:
- $k^i = W^ka^i \Rightarrow [k^1, k^2, k^3, k^4] = W^k[a^1, a^2, a^3,a^4]$
- #### $K = W^k I$ :+1:
    - $v^i = W^va^i \Rightarrow [v^1, v^2, v^3, v^4] = W^v[a^1, a^2, a^3,a^4]$
- #### $V = W^v I$ :+1:
- $\alpha_{1,1} = k^{1^T} \space q^1$, $\alpha_{1,2} = k^{2^T} \space q^1$, ...
    - $\begin{bmatrix} \alpha_{1,1}\\ \alpha_{1,2} \\ \alpha_{1,3}\\ \alpha_{1,4} \end{bmatrix} = \begin{bmatrix} (k^{1})^T\\ (k^{2})^T \\ (k^3)^T\\ (k^4)^T \end{bmatrix} q^1$
    - $\begin{bmatrix} \alpha_{1,1} & \alpha_{2,1} & \alpha_{3,1} & \alpha_{4,1} \\ \alpha_{1,2} & \alpha_{2,2} & \alpha_{3,2} & \alpha_{4,2} \\ \alpha_{1,3} & \alpha_{2,3} & \alpha_{3,3} & \alpha_{4,3} \\ \alpha_{1,4} & \alpha_{2,4} & \alpha_{3,4} & \alpha_{4,4} \end{bmatrix} = \begin{bmatrix} (k^{1})^T\\ (k^{2})^T \\ (k^3)^T\\ (k^4)^T \end{bmatrix} [q^1, q^2, q^3, q^4]$
- #### $A = K^T Q$
    - #### $A^{'} = \operatorname{softmax}(A) = \operatorname{softmax}(K^T Q)$
- $[b^1, b^2, b^3, b^4] = [v^1, v^2, v^3, v^4]A^{'}$
- ### $O = V\space A{'}$
- #### $Q = W^q I$ :+1:
  #### $K = W^k I$ :+1:
  #### $V = W^v I$ :+1:
  ### $A^{'} \Leftarrow A = K^T Q$ :+1:
  ### $O = V A^{'}$ :+1:
***Sure enough, the teacher emphasizes here: only $W^q, W^k, W^v$ need to be trained; everything else is human-specified, fixed computation (see the matrix-form sketch below).*** :100:
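A minimal NumPy sketch of the matrix form above, using the lecture's column-vector convention ($I$ holds $a^1..a^4$ as columns). Only $W^q, W^k, W^v$ would actually be learned; here they are random just for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_k, d_v, n = 8, 6, 6, 4
I = rng.normal(size=(d_in, n))            # columns are a^1..a^4
Wq = rng.normal(size=(d_k, d_in))
Wk = rng.normal(size=(d_k, d_in))
Wv = rng.normal(size=(d_v, d_in))

Q = Wq @ I                                # Q = W^q I
K = Wk @ I                                # K = W^k I
V = Wv @ I                                # V = W^v I
A = K.T @ Q                               # A = K^T Q, entry (j, i) is alpha_{i,j}
A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)   # softmax over each column
O = V @ A_prime                           # O = V A', columns are b^1..b^4
```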
- Multi-head Self-attention (to capture different kinds of relevance $\alpha$)
    - The number of heads is a hyperparameter. :accept:
    - $a^i \xrightarrow{W^q} q^i \space\Rightarrow \space \large q^{i,1}=W^{q, 1} q^{i}, \; q^{i,2}=W^{q, 2} q^{i}$
    - In other words, the original three networks become six (for two heads); more heads means more networks.
    - The same goes for $k$ and $v$.
    - Then head 1 attends using only its own $q, k, v$ to get $O^1$, and head 2 uses its own to get $O^2$ (the head outputs are then combined; see the sketch below).
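A hedged NumPy sketch of two-head self-attention, under the common convention (also used in "Attention Is All You Need") that the per-head outputs are concatenated and mixed by one more matrix, here called `Wo`; all names are my own:

```python
import numpy as np

def softmax_cols(X):
    X = X - X.max(axis=0, keepdims=True)
    E = np.exp(X)
    return E / E.sum(axis=0, keepdims=True)

def multi_head(I, heads, Wo):
    """I: (d_in, n) with a^1..a^n as columns; heads: one (Wq, Wk, Wv) triple per head."""
    outputs = []
    for Wq, Wk, Wv in heads:                         # each head has its own q, k, v networks
        Q, K, V = Wq @ I, Wk @ I, Wv @ I
        outputs.append(V @ softmax_cols(K.T @ Q))    # O^h: that head's self-attention output
    return Wo @ np.concatenate(outputs, axis=0)      # mix the concatenated head outputs

rng = np.random.default_rng(2)
d_in, d_h, n, n_heads = 8, 4, 4, 2
I = rng.normal(size=(d_in, n))
heads = [tuple(rng.normal(size=(d_h, d_in)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(d_in, n_heads * d_h))
B = multi_head(I, heads, Wo)                         # (d_in, n): one output vector per position
```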
- Positional Encoding
- no position information in self-attention
- each position has a unique positional vector $e^i$
- hand-crafted :accept:
    - Add the positional vector $e^i$ to the original input $a^i$ before it goes into $W^q, W^k, W^v$ (sinusoidal sketch below) :+1:
    - It can also be learned from data :+1:
        - [[paper: Learning to Encode Position for Transformer with Continuous Dynamical Model]](https://arxiv.org/abs/2003.09229)
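A small sketch of the hand-crafted sinusoidal positional vectors from "Attention Is All You Need", added to the inputs before the $q/k/v$ projections (the toy dimensions are my own):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """The hand-crafted sin/cos positional vectors e^i from "Attention Is All You Need"."""
    pos = np.arange(n_positions)[:, None]            # position i
    dim = np.arange(0, d_model, 2)[None, :]          # even feature indices 2k
    angle = pos / (10000 ** (dim / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# a^i + e^i before the W^q / W^k / W^v projections (rows are the a^i here)
A = np.random.default_rng(3).normal(size=(4, 8))
A_with_position = A + sinusoidal_pe(4, 8)
```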
- further applications
    - [paper: Attention Is All You Need](https://arxiv.org/abs/1706.03762): the paper that proposed the Transformer :100:
- [Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) :100:
- Self-attention for Speech
- speech is a very long vector sequence.
    - If the input sequence has length L, the attention matrix $A^{'}$ is L × L, which needs a lot of memory.
    - Truncated self-attention: instead of attending over the whole sequence, a human-set window limits the range each position attends to, which also saves memory and speeds things up (see the sketch below)... [[paper: Transformer-Transducer: End-to-End Speech Recognition with Self-Attention]](https://arxiv.org/abs/1910.12977)
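A tiny sketch of the idea (the window size is the human-set hyperparameter); positions outside the window get a score of $-\infty$ before softmax, so their weight becomes 0:

```python
import numpy as np

def truncated_mask(n, window):
    """Allow position i to attend only to positions j with |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

scores = np.random.default_rng(4).normal(size=(6, 6))
scores = np.where(truncated_mask(6, window=1), scores, -np.inf)  # then apply softmax row-wise
```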
- Self-attention for Image
    - Self-attention GAN, [[paper: Self-Attention Generative Adversarial Networks, Ian Goodfellow]](https://arxiv.org/abs/1805.08318) :+1:
- DEtection Transformer (DETR) [[paper: End-to-End Object Detection with Transformers]](https://arxiv.org/abs/2005.12872) :+1:
- Self-attention v.s. CNN [[paper: On the Relationship between Self-Attention and Convolutional Layers]](https://arxiv.org/abs/1911.03584)
    - CNN $\subset$ Self-Attention :boom:
    - CNN is a restricted version of self-attention :boom:
    - Self-attention is a more flexible CNN :boom:
    - In general, a more flexible model needs more data to avoid overfitting.
    - A more restricted model has a better chance of learning well from less data.
    - [[paper: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE]](https://arxiv.org/pdf/2010.11929.pdf) shows this: CNNs are better with less data (~10M images), self-attention is better with more data (100M~300M+).
- Self-Attention v.s. RNN
    - RNN and bi-RNN: essentially **short-term memory** (distant inputs must be carried through every step) :+1:
    - Self-attention can relate positions no matter how far apart they are
    - RNN: cannot be parallelized
    - Self-attention: **parallel** :+1:
    - [[paper: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]](https://arxiv.org/abs/2006.16236)
    - [[YT: RNN, ML Lecture 21-1: Recurrent Neural Network (Part I), ML2017]](https://youtu.be/xCGidAeyS4M)
    - If I recall correctly, RNN (seq2seq) could not be pre-trained the same way :boom:
- Self-Attention for Graph
    - **Nodes** are the input vectors; to take **edges** into account, only compute attention between connected nodes.
    - Nodes that are not connected are simply unrelated, so there is no need to compute attention scores between them.
- this is one type of **Graph Neural Network(GNN)**
        - [[YT: GNN: [TA supplementary lecture] Graph Neural Network (1/2), taught by TA 姜成翰, HLP 2020?]](https://youtu.be/eybCCtNKwzA)
- [[paper: Long Range Arena: A Benchmark for Efficient Transformers]](https://arxiv.org/abs/2011.04006)
- [[paper: Efficient Transformers: A Survey]](https://arxiv.org/abs/2009.06732)
# L12 Transformer (1/2)
- Sequence-to-sequence (Seq2Seq)
- speech: T $\xrightarrow{speech \space recognition}$ sentence: N
    - sentence: N $\xrightarrow{machine \space translation}$ sentence: N$^{'}$
    - speech: T $\xrightarrow{speech \space translation}$ sentence: N
        - Speech translation $\neq$ speech recognition + machine translation, because many languages in the world have no written form
- [[Hokkien: FORMOSA SPEECH RECOGNITION CHALLENGE 2020 - TAIWANESE ASR]](https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge-2020)
        - we might get (labeled) data from YouTube... the teacher plugs this challenge hard :100:
- Text-to-Speech (TTS) Synthesis
        - e.g., the Taiwanese TTS demo (台灣媠聲 2.0)
        - not an end-to-end system
- Seq2seq for chatbot
- input $\xrightarrow{seq2seq}$ response
- "Hi" $\xrightarrow{seq2seq}$ "Hell! How are you today?"
- Most Natural Language Processing applications can be framed as Question Answering (QA), e.g.:
    - article summarization
- sentiment analysis
- QA can be done by seq2seq
- question, context $\Large \xrightarrow{\space \space seq2seq \space \space }$ answer
- refer: [[The Natural Language Decathlon: Multitask Learning as Question Answering]](https://arxiv.org/abs/1806.08730) and [[LAMOL: LAnguage MOdeling for Lifelong Language Learning]](https://arxiv.org/abs/1909.03329)
        - see also HLP 2020
        - the Pixel 4 uses an end-to-end seq2seq model
- Seq2Seq for Syntactic Parsing
    - "deep learning is very powerful" $\Rightarrow$ parse tree
    - Can a parse tree be treated as a sequence? Yes: write the tree as a bracketed string.
- [[Grammar as a Foreign Language]](https://arxiv.org/abs/1412.7449)
- Seq2Seq for Multi-label Classification
- ref: [[Order-free Learning Alleviating Exposure Bias in Multi-label Classification]](https://arxiv.org/abs/1909.03434)
- [[Order-Free RNN with Visual Attention for Multi-Label Classification]](https://arxiv.org/abs/1707.05495)
- Seq2Seq for Object Detection
- [[End-to-End Object Detection with Transformers]](https://arxiv.org/abs/2005.12872)
- ## Seq2Seq
- input sequence $\Large \xrightarrow{\space \space \small Encoder \space \space}\space \xrightarrow{\space \space \small Decoder \space \space}$ output sequence
- [[Sequence to Sequence Learning with Neural Networks]](https://arxiv.org/abs/1409.3215) :100:
- [[Attention Is All You Need]](https://arxiv.org/abs/1706.03762) :100:
- Encoder
    - [$x^1,x^2, x^3, x^4$] $\xrightarrow{\space \space \small block \space \space}$ [$\dots$] $\xrightarrow{\space \space \small block \space \space}$ [$\dots$] $\xrightarrow{\space \space \small block \space \space}$ [$h^1,h^2, h^3, h^4$] (the encoder is a stack of such blocks)
- ### Each block:
- [$v^1,v^2, v^3,v^4$] $\xrightarrow{\space \space \small self-attention \space \space}$ [$o^1,o^2, o^3, o^4$] $\xrightarrow{\space \space \small FC\times 4 \space \space}$[$lb^1,lb^2, lb^3, lb^4$]
- **residual**
        - [$v^1,v^2, v^3,v^4$] $\xrightarrow{\space \space \small self-attention \space \space}$ [$sa^1,sa^2, sa^3, sa^4$] $\xrightarrow[add \space v^i]{\space \space \small {\textbf {Residual}} \space \space}$ [$sa^1+v^1,sa^2+v^2, sa^3+v^3, sa^4+v^4$] $\xrightarrow{\space \space \small {\textbf {layer norm}} \space \space}$ only then is it actually sent to the FC layer; the FC layer also has a residual design, followed by another layer norm (see the block sketch below)
- [[Layer Normalization]](https://arxiv.org/abs/1607.06450)
- $\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\x_4\end{bmatrix} \xrightarrow{\space \space \small {\textbf {layer norm}}\space \space}\begin{bmatrix} x_1^{'} \\ x_2^{'} \\ x_3^{'} \\x_4^{'}\end{bmatrix}$, where $x_i^{'} = \frac{x_i-m}{\sigma}$
        - [[batch normalization]](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/normalization_v4.pdf)
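A minimal sketch of one encoder block as described above (residual around both sub-layers, each followed by layer norm); `self_attn` and `ff` are stand-ins for the self-attention and FC sub-layers, not lecture code:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-vector normalization: x' = (x - m) / sigma (no learned scale/shift in this sketch)."""
    m = x.mean(axis=-1, keepdims=True)
    s = x.std(axis=-1, keepdims=True)
    return (x - m) / (s + eps)

def encoder_block(X, self_attn, ff):
    """X: (n, d). Both sub-layers use residual + layer norm."""
    X = layer_norm(X + self_attn(X))   # residual around self-attention, then layer norm
    X = layer_norm(X + ff(X))          # residual around the FC/feed-forward, then layer norm
    return X

# toy usage with dummy sub-layers
X = np.random.default_rng(5).normal(size=(4, 8))
Y = encoder_block(X, self_attn=lambda Z: Z @ np.eye(8), ff=np.tanh)
```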
- BERT: Encoder of Transformer
- to learn more
- [[On Layer Normalization in the Transformer Architecture]](https://arxiv.org/abs/2002.04745)
- [[PowerNorm: Rethinking Batch Normalization in Transformers]](https://arxiv.org/abs/2003.07845)
# L13 Transformer (2/2)
- Autoregressive (speech recognition as the example)
    - speech signal vectors $\xrightarrow{Encoder}$ encoded vectors
    - encoded vectors + BOS (a one-hot encoding) $\xrightarrow{decoder \rightarrow softmax}$ a distribution over the vocabulary (characters / a~z / subwords), from which we take, e.g., the Chinese character "機"
    - encoded vectors, BOS, "機" $\xrightarrow{decoder \rightarrow softmax}$ "器" (again a distribution over the vocabulary)
    - encoded vectors, BOS, "機", "器" $\xrightarrow{decoder \rightarrow softmax}$ "學"
    - encoded vectors, BOS, "機", "器", "學" $\xrightarrow{decoder \rightarrow softmax}$ "習"
    - **We do not know the correct output length, so we add a special "END/EOS" token.**
    - encoded vectors, BOS, "機", "器", "學", "習" $\xrightarrow{decoder \rightarrow softmax}$ "**EOS**" (see the decoding sketch below)
- Decoder structure
    - more complicated than the encoder
    - [Encoder]
        - "Input + positional encoding" $\rightarrow$ "multi-head attention" $\rightarrow$ "Add & Norm" $\rightarrow$ "Feed Forward" $\rightarrow$ "Add & Norm"
    - **[Decoder]**
        - "Outputs (shifted right) + positional encoding" $\rightarrow$ "$\large Mask$ed multi-head attention" $\rightarrow$ "Add & Norm" $\rightarrow$ $\Large {\textbf [Block]}$ (cross-attention, see below) $\rightarrow$ "Add & Norm" $\rightarrow$ "Feed Forward" $\rightarrow$ "Add & Norm"
    - $\Large Mask$:
        - When decoding, to produce $b^2$, its $q^2$ takes dot products only with its own $k^2$ and with $k^1$ to its left to get attention scores; $k^3, k^4$ to its right are not considered
        - When decoding, to produce $b^3$, its $q^3$ takes dot products only with its own $k^3$ and with $k^1, k^2$ to its left; $k^4$ to its right is not considered
        - Why masked? Because decoding is autoregressive: when $b^2$ is produced, the tokens to its right do not exist yet (mask sketch below)
    - $\Large {\textbf [Block]}$: the cross-attention part, covered below
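A small NumPy sketch of the mask (row convention: `scores[i, j]` is $q^i \cdot k^j$); disallowed positions are set to $-\infty$ so softmax gives them weight 0:

```python
import numpy as np

def causal_mask(n):
    """Query i may only look at keys j <= i (itself and the positions to its left)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores):
    scores = np.where(causal_mask(scores.shape[0]), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

W = masked_attention_weights(np.random.default_rng(6).normal(size=(4, 4)))
# Row i of W is zero for every column j > i, exactly the masking rule above.
```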
- NAT (Non-Autoregressive decoder)
    - Feed BOS at every input position and the whole output comes out at once, so decoding is parallel and fast.
    - How to handle the unknown output length? Think about it yourself first; two options:
        - Feed in enough positions and look at where END appears in the output...
        - Add a separate output-length predictor
    - Advantages: parallel, (controllable output length...)
    - In general, NAT performance is worse than AT. [YT: (multi-modality, TA)](https://youtu.be/jvyKmU4OM3c)
- Cross attention $\Large {\textbf [Block]}$
    - [[paper: Layer-Wise Multi-View Decoding for Natural Language Generation]](https://arxiv.org/abs/2005.08081): this paper proposes cross-attention variants different from the original; with many layers, each decoder layer does not have to attend to the encoder's top layer as in the original paper. The teacher only borrows a figure from it.
    - In cross-attention, $q$ comes from the decoder; the $k$'s (and $v$'s) come from the encoder
    - $q \cdot k$ gives the attention weights; the weighted sum of the $v$'s is sent on to the decoder's feed-forward layer (see the sketch below)
    - Listen, Attend and Spell as an example of cross-attention, [[paper: Listen, attend and spell: A neural network for large vocabulary conversational speech recognition]](https://ieeexplore.ieee.org/document/7472621), **this is not a Transformer (it uses LSTM)**
    - The steps of Listen, Attend and Spell are exactly seq2seq (with attention)
- [[how much wood would a woodchuck chuck]](https://youtu.be/b8nR9iROHDk)
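A minimal NumPy sketch of cross-attention as described in these notes (queries from the decoder side, keys and values from the encoder output); all names are my own:

```python
import numpy as np

def cross_attention(dec_X, enc_H, Wq, Wk, Wv):
    """dec_X: (m, d) decoder-side vectors; enc_H: (n, d) encoder output vectors (rows)."""
    Q = dec_X @ Wq            # queries from the (masked-attention) decoder layer
    K = enc_H @ Wk            # keys from the encoder
    V = enc_H @ Wv            # values from the encoder
    scores = Q @ K.T
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return w @ V              # this weighted sum is what goes on to the decoder's FF layer

rng = np.random.default_rng(7)
dec_X, enc_H = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 6)) for _ in range(3))
out = cross_attention(dec_X, enc_H, Wq, Wk, Wv)   # (3, 6): one vector per decoder position
```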
- Training
    - As before, the input is a sequence of speech signal vectors, but this time we also have the label for this signal ("機器學習") $\xrightarrow{Encoder}$ encoded vectors
    - encoded vectors + BOS (a one-hot encoding) $\xrightarrow{decoder \rightarrow softmax}$ distribution
    - distribution vs. label: minimize cross-entropy
    - During training, the decoder's input is the ground truth (**Teacher Forcing**: using the ground truth as input; see the sketch below)
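A hedged PyTorch sketch of one teacher-forcing training step; `model(src, decoder_in)` returning per-position logits is an assumed interface, not the lecture's code:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, src, tgt, bos_id):
    """src: (B, T_src) source tokens; tgt: (B, T_tgt) ground-truth output tokens (ending in EOS)."""
    bos = torch.full((tgt.size(0), 1), bos_id, dtype=tgt.dtype)
    decoder_in = torch.cat([bos, tgt[:, :-1]], dim=1)   # ground truth shifted right = teacher forcing
    logits = model(src, decoder_in)                     # (B, T_tgt, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```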
- Tips for seq2seq training
- Copy Mechanism
- chat-bot
- User: "你好,我是**庫洛洛**"
- Machine: "**庫洛洛**你好,很高興認識你"
- **庫洛洛**需要被複製。
- User: 小傑**不能使用念能力**了
- Machine: 你所謂的**不能使用念能力** 是什麼意思?
- **不能使用念能力** 需要被複製。
- **Summarization 摘要**
- refer [[YT: Pointer Network]](https://youtu.be/VdOyqNQ9aww)
- further study [[Paper: Incorporating Copying Mechanism in Sequence-to-Sequence Learning]](https://arxiv.org/abs/1603.06393)
- Guided Attention:
- TTS as example
            - Asking the TTS system to say "發財" four times, three times, or twice works fine, with natural intonation
            - But... asked to say "發財" just once, it only says "發"
        - In some tasks, the input and output are monotonically aligned, e.g. speech recognition, TTS, etc.
        - If the machine's attention jumps around out of order, force it to follow a fixed (e.g. monotonic) pattern
    - **Beam search** vs. greedy decoding: please google them; a minimal beam-search sketch follows below
        - but... [[The Curious Case of Neural Text Degeneration]](https://arxiv.org/abs/1904.09751)
        - Beam search tends to help more when the task has one clearly correct answer
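A compact, hedged sketch of beam search over a stand-in `step_fn` (which would wrap the decoder's softmax; it is an assumption, not lecture code); greedy decoding is the `beam_size=1` special case:

```python
import math

def beam_search(step_fn, bos_id, eos_id, beam_size=3, max_len=30):
    """step_fn(tokens) -> list of (token_id, prob) candidate next tokens."""
    beams = [([bos_id], 0.0)]                          # (token sequence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:                   # finished hypotheses are carried over as-is
                candidates.append((tokens, score))
                continue
            for tok, p in step_fn(tokens):
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    return beams[0][0]                                 # best sequence (still includes BOS/EOS)
```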
    - When training the TTS decoder, noise needs to be added... (sentence completion also needs this...)
- Optimizing Evaluation Metrics?
        - In translation we minimize cross-entropy at each output position individually during training, but evaluation may use the BLEU score, which compares the whole output sentence with the reference.
        - So: train with cross-entropy,
        - but for validation / model selection, pick the model with the highest BLEU score.
        - Or train on $-$(BLEU score)? It is not differentiable, so that is hard (a small BLEU demo follows below).
- When you don't know how to optimize, just use reinforcement learning (RL)!
- [[Sequence Level Training with Recurrent Neural Networks]](https://arxiv.org/abs/1511.06732)
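A tiny illustration that BLEU scores whole sentences (so there is nothing to backpropagate through), using NLTK's `sentence_bleu` as one common implementation; the toy sentences and bigram weights are my own choices:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = ["machine", "learning", "is", "so", "simple"]
hypothesis = ["machine", "learning", "is", "very", "simple"]

# weights up to bigrams so this short toy example gives a non-zero score
score = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5))
print(round(score, 3))   # a sentence-level number, not a per-token differentiable loss
```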
    - Exposure bias, error propagation
        - During training, the decoder only ever sees correct inputs,
        - but at test time it may see its own mistakes, which becomes a mismatch between training and deployment!
        - The fix: occasionally feed the decoder wrong inputs during training (see the sketch below)
        - original scheduled sampling [[paper: Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks]](https://arxiv.org/abs/1506.03099)
- [[paper: Scheduled Sampling for Transformers]](https://arxiv.org/abs/1906.07651)
- [[paper: Parallel Scheduled Sampling]](https://arxiv.org/abs/1906.04331)
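A hedged sketch of the scheduled-sampling idea for an RNN-style decoder: with probability `sampling_prob`, the decoder's own (possibly wrong) prediction replaces the ground-truth token as the next input. The `decoder_cell(prev_token, state) -> (logits, state)` interface is an assumption, not from the papers above:

```python
import random

def decode_with_scheduled_sampling(decoder_cell, init_state, tgt_tokens, bos_id, sampling_prob):
    state, prev = init_state, bos_id
    logits_per_step = []
    for gold in tgt_tokens:
        logits, state = decoder_cell(prev, state)
        logits_per_step.append(logits)            # compared against gold with cross-entropy later
        predicted = max(range(len(logits)), key=logits.__getitem__)
        # teacher forcing most of the time; the model's own output sometimes
        prev = predicted if random.random() < sampling_prob else gold
    return logits_per_step
```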