# YT: Hung-yi Lee ML2021, L10~L13, L18~L21: self-attention, Transformer and self-supervised learning

# L10 Self-Attention
- Sophisticated input
    - Input is **a vector** $\Rightarrow$ model $\Rightarrow$ scalar or class
    - Input is **a set of vectors** (length may vary) $\Rightarrow$ model $\Rightarrow$ scalars or classes
- Vector set as input
    - One-hot encoding: loses the relationships between words (that is simply what OHE is)
    - Word embedding: captures the relationships between words
        - [[to learn more: Unsupervised Learning - Word Embedding]](https://youtu.be/X7PH3NuYW0Q) or Vivian Chen again~~
    - Voice/sound sequence input: frame (25 ms), 400 sample points (16,000 Hz), 39-dim MFCC, 80-dim filter bank output...
        - Even though each frame covers 25 ms of signal, the frame shifts by 10 ms $\Rightarrow$ (1 s $\rightarrow$ 100 frames), so the amount of data is large.
    - A graph is also a set of vectors (consider each node as a vector)
        - [[blog: Social Network Analytics]](https://medium.com/analytics-vidhya/social-network-analytics-f082f4e21b16), each person is a node
        - [[Molecules (分子)]](http://www.twword.com/wiki/%E5%88%86%E5%AD%90)
- What is the output?
    - Each vector has a label (one-to-one). Examples:
        - POS tagging: I(N) saw(V) a(DET) saw(N)
        - Speech recognition: phonemes
    - The whole sequence has one label (many-to-one). Examples:
        - Sentiment analysis: "this" "is" "good" $\Rightarrow$ "positive"
        - Speaker identification
        - Graph: for a molecule, is it toxic, is it hydrophilic?
    - Many-to-many, seq2seq: translation, speech recognition
- Sequence labeling (one-to-one)
    - FC (fully connected)
        - "I saw a saw"... for the input "saw", should the output be a verb or a noun?
        - Of course we can feed in a window of surrounding context.
    - **Self-Attention**:
        - [**a$^1$**, **a$^2$**,..., **a$^n$**] $\Rightarrow$ **self-attention** $\Rightarrow$ [**b$^1$**, **b$^2$**,..., **b$^n$**] $\Rightarrow$ **n FC layers** $\Rightarrow$ **n outputs**
        - Layers can be stacked: [**v$_1$**, **v$_2$**,..., **v$_n$**] $\Rightarrow$ **self-attention** $\Rightarrow$ **n FC** $\Rightarrow$ **self-attention** $\Rightarrow$ **n FC** $\Rightarrow$ **self-attention** $\Rightarrow$ **n FC** $\Rightarrow$ **n outputs**
        - **Self-attention** does not only consider a small range or a window of the input; it **considers the whole sequence**.
        - [[Attention Is All You Need]](https://arxiv.org/abs/1706.03762)
    - [**a$^1$**, **a$^2$**,..., **a$^n$**] $\Rightarrow$ **self-attention** $\Rightarrow$ [**b$^1$**, **b$^2$**,..., **b$^n$**]
        - **b$^i$** considers all of the **a**'s, i.e. [**a$^1$**, **a$^2$**,..., **a$^n$**] (which can be either the input or a hidden layer)
        - How to generate **b$^1$**:
            - Find the relevant vectors in the sequence, i.e. compute a relevance score $\alpha$
                - **Dot-product: $\alpha = q\cdot k$**
                - Additive: $\alpha = \textbf W\,\tanh(W^q a^1 + W^k a^2)$. Note that the extra $\textbf W$ means this definition of the attention score requires learning another network.
            - $q^1 = W^q a^1$: **query** :+1:
            - $k^2 = W^k a^2$: **key** :+1:
            - Attention score $\alpha_{1,2}=q^1\cdot k^2$
            - If we have 4 inputs, then for $q^1(=W^q a^1)$ we get [$\alpha_{1, 1}$, $\alpha_{1, 2}$, $\alpha_{1, 3}$, $\alpha_{1, 4}$] $\Rightarrow$ **softmax** $\Rightarrow$ [$\alpha^{'}_{1, 1}$, $\alpha^{'}_{1, 2}$, $\alpha^{'}_{1, 3}$, $\alpha^{'}_{1, 4}$]
                - $\alpha^{'}_{1,i} = \frac{e^{\alpha_{1, i}}}{\sum\limits_j e^{\alpha_{1, j}}}$
                - You can try replacing softmax with ReLU.
            - $v^i = W^v a^i$: **value** :+1:
            - $b^1 = \sum\limits_i \alpha^{'}_{1,i}v^i$
            - **$b^1, b^2, b^3, b^4$ can be computed in parallel**
        - Once more, in short (see the sketch below):
            - Passing $a$ through three networks gives $q, k, v$.
            - The inner product of $q$ and $k$ gives the attention scores, which can be sent through an activation such as softmax or ReLU.
            - The attention scores are then used for a weighted sum over the $v$'s.
            - So after the three networks $W^q, W^k, W^v$, everything inside self-attention is pure computation!!
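A minimal NumPy sketch of the per-vector computation above, assuming toy random inputs and random $W^q, W^k, W^v$ (which would be learned in a real network); it computes $b^1$ for the first input using a row-vector convention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_k, d_v, n = 8, 4, 4, 4          # toy dimensions: 4 input vectors a^1..a^4

A_in = rng.normal(size=(n, d_in))       # rows are a^1 ... a^4
W_q = rng.normal(size=(d_in, d_k))
W_k = rng.normal(size=(d_in, d_k))
W_v = rng.normal(size=(d_in, d_v))

q1 = A_in[0] @ W_q                      # q^1 = W^q a^1 (row-vector convention)
K = A_in @ W_k                          # k^1 ... k^4, one per row
V = A_in @ W_v                          # v^1 ... v^4, one per row

scores = K @ q1                         # alpha_{1,i} = q^1 . k^i
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax -> alpha'_{1,i}
b1 = alpha @ V                          # b^1 = sum_i alpha'_{1,i} v^i
print(b1)                               # a d_v-dimensional output vector
```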
# L11 Self-Attention (II) - Matrix
- $a^1 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^1, k^1, v^1$
- $a^2 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^2, k^2, v^2$
- $a^3 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^3, k^3, v^3$
- $a^4 \xrightarrow{\space\space W^q, W^k, W^v \space\space} q^4, k^4, v^4$
- $q^i = W^qa^i \Rightarrow [q^1, q^2, q^3, q^4] = W^q[a^1, a^2, a^3,a^4]$
    - #### $Q = W^q I$ :+1:
- $k^i = W^ka^i \Rightarrow [k^1, k^2, k^3, k^4] = W^k[a^1, a^2, a^3,a^4]$
    - #### $K = W^k I$ :+1:
- $v^i = W^va^i \Rightarrow [v^1, v^2, v^3, v^4] = W^v[a^1, a^2, a^3,a^4]$
    - #### $V = W^v I$ :+1:
- $\alpha_{1,1} = k^{1^T} \space q^1$, $\alpha_{1,2} = k^{2^T} \space q^1$, ...
- $\begin{bmatrix} \alpha_{1,1}\\ \alpha_{1,2} \\ \alpha_{1,3}\\ \alpha_{1,4} \end{bmatrix} = \begin{bmatrix} k^{1^T}\\ k^{2^T} \\ k^{3^T}\\ k^{4^T} \end{bmatrix} q^1$
- $\begin{bmatrix} \alpha_{1,1} & \alpha_{2,1} & \alpha_{3,1} & \alpha_{4,1} \\ \alpha_{1,2} & \alpha_{2,2} & \alpha_{3,2} & \alpha_{4,2} \\ \alpha_{1,3} & \alpha_{2,3} & \alpha_{3,3} & \alpha_{4,3} \\ \alpha_{1,4} & \alpha_{2,4} & \alpha_{3,4} & \alpha_{4,4} \end{bmatrix} = \begin{bmatrix} k^{1^T}\\ k^{2^T} \\ k^{3^T}\\ k^{4^T} \end{bmatrix} [q^1, q^2, q^3, q^4]$
    - #### $A = K^T Q$
    - #### $A^{'} = \text{softmax}(K^T Q)$
- $[b^1, b^2, b^3, b^4] = [v^1, v^2, v^3, v^4]\,A^{'}$
    - ### $O = V\space A^{'}$
- Summary: $Q = W^q I$ :+1: $K = W^k I$ :+1: $V = W^v I$ :+1: $A^{'} \Leftarrow A = K^T Q$ :+1: $O = V A^{'}$ :+1: ***As the teacher emphasizes here: only $W^q, W^k, W^v$ need to be trained; everything else is fixed, human-defined computation.*** :100: (A matrix-form code sketch appears below, after the positional-encoding notes.)
- Multi-head self-attention (to capture different types of relevance $\alpha$)
    - How many heads to use is a hyperparameter. :accept:
    - $a^i \xrightarrow{W^q} q^i \space\Rightarrow\space q^i \xrightarrow{W^{q, 1}} q^{i,1},\space q^i \xrightarrow{W^{q, 2}} q^{i,2}$
    - In other words, the original three networks become six; more heads means more networks.
    - The same applies to $k$ and $v$.
    - Then head 1 does its own attention, giving $O^1$; head 2 does its own, giving $O^2$.
- Positional Encoding
    - There is no position information in self-attention.
    - Each position gets a unique positional vector $e^i$.
        - hand-crafted :accept:
        - Add the positional vector $e^i$ to the original input $a^i$ before feeding it into $W^q, W^k, W^v$. :+1:
        - It can also be learned by the network. :+1:
        - [[paper: Learning to Encode Position for Transformer with Continuous Dynamical Model]](https://arxiv.org/abs/2003.09229)
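A minimal NumPy sketch of the matrix form above ($Q = W^q I$, $K = W^k I$, $V = W^v I$, $A^{'} = \text{softmax}(K^T Q)$, $O = V A^{'}$), following the lecture's column-vector convention; all matrices are random stand-ins, and only the three $W$'s would be trained in a real layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_k, d_v, n = 8, 4, 4, 4

I_mat = rng.normal(size=(d_in, n))      # columns of I are a^1 ... a^4
W_q = rng.normal(size=(d_k, d_in))      # only W_q, W_k, W_v would be trained
W_k = rng.normal(size=(d_k, d_in))
W_v = rng.normal(size=(d_v, d_in))

Q = W_q @ I_mat                         # Q = W^q I
K = W_k @ I_mat                         # K = W^k I
V = W_v @ I_mat                         # V = W^v I

A = K.T @ Q                             # attention scores; column j holds alpha_{j,1..n}
A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)   # column-wise softmax
O = V @ A_prime                         # columns of O are b^1 ... b^4
print(O.shape)                          # (d_v, n)
```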
- Further reading
    - [[paper: Attention Is All You Need]](https://arxiv.org/abs/1706.03762), the paper that proposed the Transformer :100:
    - [[paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]](https://arxiv.org/abs/1810.04805) :100:
- Self-attention for speech
    - Speech is a very long vector sequence.
    - If the input sequence has length L, the attention matrix $A^{'}$ is L x L, which needs a lot of memory.
    - Truncated self-attention: do not look at the whole sequence; a human-chosen window limits the range each position attends to, which also speeds things up (the mask-based sketch at the end of this section covers this). [[paper: Transformer-Transducer: End-to-End Speech Recognition with Self-Attention]](https://arxiv.org/abs/1910.12977)
- Self-attention for image
    - Self-Attention GAN, [[paper: Self-Attention Generative Adversarial Networks, Ian Goodfellow]](https://arxiv.org/abs/1805.08318) :+1:
    - DEtection TRansformer (DETR), [[paper: End-to-End Object Detection with Transformers]](https://arxiv.org/abs/2005.12872) :+1:
- Self-attention v.s. CNN [[paper: On the Relationship between Self-Attention and Convolutional Layers]](https://arxiv.org/abs/1911.03584)
    - CNN $\subset$ Self-Attention :boom:
    - CNN is a restricted version of self-attention. :boom:
    - Self-attention is a more flexible CNN. :boom:
    - In general, a more flexible model needs more data to avoid overfitting.
    - A more restricted model has a better chance of learning well from less data.
    - [[paper: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE]](https://arxiv.org/pdf/2010.11929.pdf) shows that CNN does better with less data (~10M images), while self-attention does better with more data (100M~300M images).
- Self-attention v.s. RNN
    - RNN and bi-RNN (**short-term memory**) :+1:
    - Self-attention can attend to positions arbitrarily far away.
    - RNN is non-parallel.
    - Self-attention is **parallel**. :+1:
    - [[paper: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention]](https://arxiv.org/abs/2006.16236)
    - [[YT: RNN, ML Lecture 21-1: Recurrent Neural Network (Part I), ML2017]](https://youtu.be/xCGidAeyS4M)
    - As I recall, RNN (seq2seq) cannot be pre-trained. :boom:
- Self-attention for graphs (see the mask-based sketch at the end of this section)
    - The **nodes** are the input vectors; to take the **edges** into account, only attend to connected nodes.
    - Nodes that are not connected are simply not connected, so there is no need to compute attention scores for them.
    - This is one type of **Graph Neural Network (GNN)**.
    - [[YT: GNN: [TA 補充課] Graph Neural Network (1/2) (由助教姜成翰同學講授), 2020 HLP?]](https://youtu.be/eybCCtNKwzA)
- [[paper: Long Range Arena: A Benchmark for Efficient Transformers]](https://arxiv.org/abs/2011.04006)
- [[paper: Efficient Transformers: A Survey]](https://arxiv.org/abs/2009.06732)
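Truncated self-attention and graph self-attention both boil down to restricting which positions each query may attend to. A minimal NumPy sketch of mask-restricted attention, assuming toy random Q/K/V and hand-built masks (a window mask for truncation, an adjacency mask for a graph):

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Q, K: (n, d_k); V: (n, d_v); mask: (n, n) boolean,
    mask[i, j] = True means query i may attend to key j."""
    scores = Q @ K.T                                  # raw attention scores
    scores = np.where(mask, scores, -np.inf)          # disallowed pairs get -inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Truncated self-attention: a window of +/- 1 position around each query.
idx = np.arange(n)
window_mask = np.abs(idx[:, None] - idx[None, :]) <= 1

# Graph self-attention: attend only along edges (plus self-loops).
adj = np.eye(n, dtype=bool)
adj[0, 1] = adj[1, 0] = adj[2, 3] = adj[3, 2] = True

print(masked_attention(Q, K, V, window_mask).shape)   # (6, 4)
print(masked_attention(Q, K, V, adj).shape)           # (6, 4)
```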
# L12 Transformer (1/2)
- Sequence-to-sequence (Seq2Seq)
    - speech: T $\xrightarrow{speech \space recognition}$ sentence: N
    - sentence: N $\xrightarrow{machine \space translation}$ sentence: N$^{'}$
    - speech: T $\xrightarrow{speech \space translation}$ sentence: N
        - Speech translation $\neq$ speech recognition + machine translation, because many languages in the world have no written form.
    - [[Hokkien: FORMOSA SPEECH RECOGNITION CHALLENGE 2020 - TAIWANESE ASR]](https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge-2020)
        - We might get (labeled) data from YouTube... just "train it hard" end to end (硬train一發) :100:
    - Text-to-Speech (TTS) Synthesis
        - e.g., the Taiwanese TTS demo 台灣媠聲 2.0
        - not an end-to-end system
    - Seq2seq for chatbot
        - input $\xrightarrow{seq2seq}$ response
        - "Hi" $\xrightarrow{seq2seq}$ "Hello! How are you today?"
    - Most Natural Language Processing applications can be framed as Question Answering (QA)
        - article summarization
        - sentiment analysis
        - QA can be done by seq2seq: question, context $\Large \xrightarrow{\space \space seq2seq \space \space }$ answer
        - refer: [[The Natural Language Decathlon: Multitask Learning as Question Answering]](https://arxiv.org/abs/1806.08730) and [[LAMOL: LAnguage MOdeling for Lifelong Language Learning]](https://arxiv.org/abs/1909.03329)
        - refer to HLP 2020
        - Pixel 4: end-to-end, seq2seq
    - Seq2Seq for syntactic parsing
        - "deep learning is very powerful" $\Rightarrow$ parsing tree
        - Is a parsing tree a sequence? [[Grammar as a Foreign Language]](https://arxiv.org/abs/1412.7449)
    - Seq2Seq for multi-label classification
        - ref: [[Order-free Learning Alleviating Exposure Bias in Multi-label Classification]](https://arxiv.org/abs/1909.03434)
        - [[Order-Free RNN with Visual Attention for Multi-Label Classification]](https://arxiv.org/abs/1707.05495)
    - Seq2Seq for object detection
        - [[End-to-End Object Detection with Transformers]](https://arxiv.org/abs/2005.12872)
- ## Seq2Seq
    - input sequence $\Large \xrightarrow{\space \space \small Encoder \space \space}\space \xrightarrow{\space \space \small Decoder \space \space}$ output sequence
    - [[Sequence to Sequence Learning with Neural Networks]](https://arxiv.org/abs/1409.3215) :100:
    - [[Attention Is All You Need]](https://arxiv.org/abs/1706.03762) :100:
- Encoder (a code sketch of one block appears at the end of this section)
    - [$x^1,x^2, x^3, x^4$] $\xrightarrow{\space \space \small block \space \space}$ [$lb^1,lb^2, lb^3, lb^4$] $\xrightarrow{\space \space \small block \space \space}$ [$\dots,\dots, \dots, \dots$] $\xrightarrow{\space \space \small block \space \space}$ [$h^1,h^2, h^3, h^4$]
    - ### Each block:
        - [$v^1,v^2, v^3,v^4$] $\xrightarrow{\space \space \small self\text{-}attention \space \space}$ [$o^1,o^2, o^3, o^4$] $\xrightarrow{\space \space \small FC\times 4 \space \space}$ [$lb^1,lb^2, lb^3, lb^4$]
    - **residual**
        - [$v^1,v^2, v^3,v^4$] $\xrightarrow{\space \space \small self\text{-}attention \space \space}$ [$sa^1,sa^2, sa^3, sa^4$] $\xrightarrow[add \space v^i]{\space \space \small {\textbf {Residual}} \space \space}$ [$sa^1+v^1,sa^2+v^2, sa^3+v^3, sa^4+v^4$] $\xrightarrow{\space \space \small {\textbf {layer norm}} \space \space}$ and only then into the FC layer; the FC layer also has a residual design, followed by another layer norm.
        - [[Layer Normalization]](https://arxiv.org/abs/1607.06450)
            - $\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\x_4\end{bmatrix} \xrightarrow{\space \space \small {\textbf {layer norm}}\space \space}\begin{bmatrix} x_1^{'} \\ x_2^{'} \\ x_3^{'} \\x_4^{'}\end{bmatrix}$, where $x_i^{'} = \frac{x_i-m}{\sigma}$
        - [[batch normalization (slides)]](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/normalization_v4.pdf)
    - BERT: the encoder of the Transformer
    - to learn more
        - [[On Layer Normalization in the Transformer Architecture]](https://arxiv.org/abs/2002.04745)
        - [[PowerNorm: Rethinking Batch Normalization in Transformers]](https://arxiv.org/abs/2003.07845)
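A minimal NumPy sketch of one encoder block as described above: self-attention, residual add, layer norm, feed-forward, residual add, layer norm. Single-head attention for brevity and all weights are random stand-ins; a real Transformer block uses multi-head attention and trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector (row) by its own mean and standard deviation.
    m = x.mean(axis=-1, keepdims=True)
    s = x.std(axis=-1, keepdims=True)
    return (x - m) / (s + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # scaled dot-product
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # softmax over keys
    return w @ V

def encoder_block(X, p):
    sa = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    X = layer_norm(X + sa)                             # residual + layer norm
    ff = np.maximum(0, X @ p["W_1"]) @ p["W_2"]        # two-layer FC with ReLU
    return layer_norm(X + ff)                          # residual + layer norm again

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 16
params = {k: rng.normal(size=s) for k, s in
          [("W_q", (d, d)), ("W_k", (d, d)), ("W_v", (d, d)),
           ("W_1", (d, d_ff)), ("W_2", (d_ff, d))]}
X = rng.normal(size=(n, d))
print(encoder_block(X, params).shape)                  # (4, 8)
```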
We design "END/EOS" code** - encoded vectors, BOS, "機", "器", "學", "習" $\xrightarrow{decoder \rightarrow softmax}$ 比如中文的方塊字:“**EOS**” (\a~z\subwords 的 one hot encoding 的 distribution) - Decode stucture - more complicated than Encode - [Encoder] - "Input+Position Encoding" $\rightarrow$ "multi-head Attention" $\rightarrow$ "Add & Norm" $\rightarrow$ "FF" $\rightarrow$ "Add Attention" - **[Decoder]** - "Outputs (shifted Right) +Position Encoding" $\rightarrow \large Mask\space$ "multi-head Attention" $\rightarrow$ "Add & Norm" $\rightarrow \Large {\textbf [Block]}$ $\rightarrow$ "FF" $\rightarrow$ "Add Attention" - $\Large Mask$: - 當 decode 的時候,產生 $b^2$ 時,其 $q^2$ 只跟他自己 $k^2$ 還有左邊的 $k^1$ 作 dot product,產生 attention score... , 不考慮他的右邊的 $k^3, k^4$ - 當 decode 的時候,產生 $b^3$ 時,其 $q^3$ 只跟他自己 $k^3$ 還有左邊的 $k^1, k^2$ 作 dot product,產生 attention score... , 不考慮他的右邊的 $k^4$ ![](https://i.imgur.com/ctcY6Qf.png) - $\Large {\textbf [Block]}$ - - why masked? - NAT (Decode-Non-Autoregressive) - 就是所以 input 都給 BOS, 然後就跑出來了。所以是平行處理,速度快。 - 如何處理不知道的輸出長度,自己可以先思考。 - 全部都輸入啊,然後看 END 出現在 output 的哪一個位置... - 加一個 output length predictor - 好處:parallel, (controllable output length...) - in general, NAT performance is worse than AT. [YT: (multi-modality, TA)](https://youtu.be/jvyKmU4OM3c) - Cross attention $\Large {\textbf [Block]}$![](https://i.imgur.com/PVbtS6W.png) - [[paper: Layer-Wise Multi-View Decoding for Natural Language Generation,這一篇是提出與原始 paper 不一樣的 cross attention:很多層,不一定要每一層都跟原始 paper 一樣都是 attent encoder's $k$ 其實老師只是引這一篇中的圖片]](https://arxiv.org/abs/2005.08081) - cross, $q$ is from decoder, $k$s are from encoder - $q \cdot k \rightarrow v$ sent to decoder's FF - Listen, Attend and Spell example@cross-attention, [[paper: Listen, attend and spell: A neural network for large vocabulary conversational speech recognition]](https://ieeexplore.ieee.org/document/7472621), **this is not transfromer. (LSTM)** - Listen, attend and Spell的步驟, 就是 seq2seq - [[how much wood would a woodchuck chuck]](https://youtu.be/b8nR9iROHDk) - Training - 同樣是輸入一個 voice singal vectors,不過也備好了我們的這個訊號的 label (機器學習)$\xrightarrow{Encoder}$ encoded vectors - encoded vectors + BOS (a one hot encoding) $\xrightarrow{decoder \rightarrow softmax}$ distribution - distribution vs. label: minimize cross-entropy - 訓練的時候,decoder 的輸入需要給予 ground truth (**Teacher Forcing**: using the ground truth as input) - tips for seq2seq trainging - Copy Mechanism - chat-bot - User: "你好,我是**庫洛洛**" - Machine: "**庫洛洛**你好,很高興認識你" - **庫洛洛**需要被複製。 - User: 小傑**不能使用念能力**了 - Machine: 你所謂的**不能使用念能力** 是什麼意思? - **不能使用念能力** 需要被複製。 - **Summarization 摘要** - refer [[YT: Pointer Network]](https://youtu.be/VdOyqNQ9aww) - further study [[Paper: Incorporating Copying Mechanism in Sequence-to-Sequence Learning]](https://arxiv.org/abs/1603.06393) - Guided Attention: - TTS as example - 發財發財發財發財,發財發財發財, 發財發財抑揚頓挫 - 但是...發財 卻只發 “發” - In some tasks, input and output are monotonically aligned. For example, speech recognition, TTS, etc. - 如果 machine attent,顛三倒四,就強迫他有一定的 pattern - **Beam Search**、greedy decoding, please google them. - but... [[The Curious Case of Neural Text Degeneration]](https://arxiv.org/abs/1904.09751) - 任務對錯明確,beam searching 就必較有幫助 - TTS decoder 訓練時,要加 noice... (sentence completion 也需要...) - Optimizing Evaluation Metrics? - @tranlation 我們是個別自輸出時作 mini. cross-entropy,但是衡量的時候可能是 BLUE score: 比較兩個句子的差距。 - 所以 train: cross-entropy - validation pick high BLUE score - or - (BLUE score)? 不可微分啊,很難 - When you don't know how to optimize, just use reinforcement learning (RL)! 
- Exposure bias, error propagation
    - During training, the decoder's inputs are all correct (ground truth),
    - but at test time the decoder may see its own mistakes, which creates a mismatch between training and inference!
    - One fix: during training, occasionally feed the decoder wrong inputs (see the sketch at the end).
    - original scheduled sampling [[paper: Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks]](https://arxiv.org/abs/1506.03099)
    - [[paper: Scheduled Sampling for Transformers]](https://arxiv.org/abs/1906.07651)
    - [[paper: Parallel Scheduled Sampling]](https://arxiv.org/abs/1906.04331)
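A minimal sketch of the scheduled-sampling idea above: with probability p, feed the decoder its own (possibly wrong) previous prediction instead of the ground-truth token. `predict_next` is a hypothetical stand-in for the model; only the input-mixing logic is the point here.

```python
import numpy as np

def build_decoder_inputs(targets, predict_next, p, rng):
    """targets: ground-truth output tokens.  Returns the decoder input
    sequence: BOS followed by a mix of ground truth and model predictions."""
    inputs = ["<BOS>"]
    for t in range(1, len(targets)):
        if rng.random() < p:                     # occasionally feed the model's own output
            inputs.append(predict_next(inputs))  # hypothetical prediction for step t-1
        else:                                    # otherwise teacher forcing
            inputs.append(targets[t - 1])
    return inputs

rng = np.random.default_rng(0)
targets = ["機", "器", "學", "習", "<EOS>"]
predict_next = lambda prefix: "誤"               # toy model that always outputs a wrong token
print(build_decoder_inputs(targets, predict_next, p=0.3, rng=rng))
```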