Notes based on: https://youtu.be/ugWDIIOHtPA
RNN:
- The classic model for processing sequences, e.g. unidirectional or bidirectional RNNs.
- Problem with RNNs: hard to parallelize => this led people to propose replacing the RNN with a CNN.
Replacing RNN with CNN:
- CNN filters: each triangle represents a filter whose input is a small segment of the sequence and whose output is a single value (an inner product); different filters cover different parts of the sequence.
- Each CNN filter can only consider a limited window, while an RNN can consider the whole sentence.
- To handle long sentences, stack many CNN layers: filters in upper layers can consider more information because they take the outputs of lower-layer filters as input.
- Problem: many layers must be stacked before a long sentence can be covered, which motivated the self-attention mechanism.
Self-Attention Layer
Purpose
- Aims to replace what the RNN does: same input/output as an RNN, i.e. it takes a sequence and outputs a sequence.
- Has the same capability as a bidirectional RNN (it sees the whole sequence first), but the special part is that all positions are processed simultaneously; there is no need to finish one step before computing the next.
- Can completely replace the RNN.
Papers previously published with RNNs have all been redone once over with the self-attention mechanism... (Li)
Attention mechanism (see [4])
Core idea:
- The attention mechanism is described by a triple <Query, Key, Value>: compute the similarity between the Query and each Key, then use these similarities to weight the corresponding Values.
Formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
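As a concrete reference, here is a minimal NumPy sketch of this formula; the shapes and variable names are my own assumptions, not from the note:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query with each key
    weights = softmax(scores, axis=-1)  # normalize over the keys
    return weights @ V                  # weighted sum of the values

# Example: 4 positions, d_k = d_v = 8 (arbitrary sizes for illustration)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```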
Self-attention idea (this note)
- Input: $x^1, x^2, \dots, x^n$, a sequence.
- Each input first goes through an embedding: it is multiplied by a weight matrix to get $a^i = W x^i$, and the $a^i$ are fed into the self-attention layer.
- Each input is then multiplied by three different matrices to produce three vectors:
  - $q^i = W^q a^i$: query (used to match against the others)
  - $k^i = W^k a^i$: key (to be matched against)
  - $v^i = W^v a^i$: value (the information to be extracted)
- The weight matrices $W^q, W^k, W^v$ are learned; they are randomly initialized at the start (see the sketch right after this list).
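A minimal sketch of these projections; the matrix sizes are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_k, n = 16, 8, 4            # assumed sizes: input dim, q/k/v dim, sequence length

# Embedded inputs a^1 ... a^n, one per row
a = rng.normal(size=(n, d_in))

# W^q, W^k, W^v are learned; here they are simply randomly initialized
Wq = rng.normal(size=(d_in, d_k))
Wk = rng.normal(size=(d_in, d_k))
Wv = rng.normal(size=(d_in, d_k))

q = a @ Wq   # queries: used to match against the others
k = a @ Wk   # keys: to be matched against
v = a @ Wv   # values: the information to be extracted
print(q.shape, k.shape, v.shape)   # (4, 8) each
```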
Method
- Take each query $q$ and attend to every key $k$ (the attention function takes two vectors and outputs a single score); this is essentially computing the similarity between $q$ and $k$.
- Scaled Dot-Product: $\alpha_{1,1} = \dfrac{q^1 \cdot k^1}{\sqrt{d}}$, $\alpha_{1,2} = \dfrac{q^1 \cdot k^2}{\sqrt{d}}$, and so on (a step-by-step sketch follows this list).
- $d$ is the dimension of $q$ and $k$; dividing by $\sqrt{d}$ is just a small trick used by the paper's authors.
- Then apply softmax normalization to obtain $\hat{\alpha}_{1,i}$.
- Multiply each $\hat{\alpha}_{1,i}$ by the corresponding $v^i$ and sum, $b^1 = \sum_i \hat{\alpha}_{1,i}\, v^i$, which is a weighted sum.
- The resulting $b^1$ is the first vector of the output sequence (for the first word or character).
- Every output vector uses information from the entire sequence.
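Here is a step-by-step sketch of computing the first output vector $b^1$ this way; sizes and variable names are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8                              # assumed: sequence length and q/k dimension
q = rng.normal(size=(n, d))              # q^1 ... q^n
k = rng.normal(size=(n, d))              # k^1 ... k^n
v = rng.normal(size=(n, d))              # v^1 ... v^n

# Scaled dot-product: alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1 = q[0] @ k.T / np.sqrt(d)        # shape (n,)

# Softmax normalization -> alpha_hat_{1,i}
alpha_hat_1 = np.exp(alpha_1 - alpha_1.max())
alpha_hat_1 /= alpha_hat_1.sum()

# Weighted sum of the values -> b^1, which already "sees" the whole sequence
b_1 = alpha_hat_1 @ v
print(b_1.shape)                         # (8,)
```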
===
Parallelizing self-attention
- Stack the inputs $a^1, \dots, a^n$ into a matrix $I$ and multiply it by the weight matrix $W^q$ to get $Q = W^q I$, whose columns are $q^1, \dots, q^n$. The matrices $K = W^k I$ and $V = W^v I$ are obtained in the same way.
- Stacking the queries into $Q$ and the keys into $K$, the matrix collecting all the scores $\alpha_{i,j}$ is $A = K^\top Q$; this is the Attention matrix, and applying softmax turns it into $\hat{A}$.
- At every time step, there is attention between every pair of vectors.
- Computing the weighted sum of $V$ with $\hat{A}$, i.e. $O = V\hat{A}$, gives the vectors $b^i$; the matrix collecting the $b^i$ is the output matrix $O$.
What the self-attention layer does
Once everything is expressed as matrix multiplications, a GPU can be used to accelerate the computation.
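A sketch of the same computation written entirely as matrix multiplications, following the column convention above (columns of $I$ are the $a^i$); all sizes are assumptions:

```python
import numpy as np

def softmax_cols(X):
    # Column-wise softmax (each column sums to 1).
    X = X - X.max(axis=0, keepdims=True)
    E = np.exp(X)
    return E / E.sum(axis=0, keepdims=True)

rng = np.random.default_rng(2)
d_in, d, n = 16, 8, 4                    # assumed sizes
I = rng.normal(size=(d_in, n))           # columns are a^1 ... a^n
Wq = rng.normal(size=(d, d_in))
Wk = rng.normal(size=(d, d_in))
Wv = rng.normal(size=(d, d_in))

Q = Wq @ I                               # columns are q^1 ... q^n
K = Wk @ I                               # columns are k^1 ... k^n
V = Wv @ I                               # columns are v^1 ... v^n

A = K.T @ Q / np.sqrt(d)                 # A[i, j] = k^i . q^j / sqrt(d)
A_hat = softmax_cols(A)                  # normalize over i for each query j
O = V @ A_hat                            # columns are b^1 ... b^n
print(O.shape)                           # (8, 4) -- every step is a matrix product
```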
Multi-head Self-attention
Taking 2 heads as an example
- With 2 heads, each $q^i, k^i, v^i$ is split into two ($q^{i,1}, q^{i,2}$, and so on). $q^{i,1}$ is only multiplied with the keys of the same head ($k^{j,1}$) to get the scores, which are combined with the $v^{j,1}$ to compute $b^{i,1}$ (and likewise $b^{i,2}$ for the second head).
- Finally, concatenate $b^{i,1}$ and $b^{i,2}$ and multiply by a transform matrix that reduces the dimension to obtain the final $b^i$.
Each head attends to different information: some heads only care about local information (their neighbours), while others care about global (longer-range) information. A minimal sketch follows.
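Below is a minimal 2-head sketch; the way the vectors are split into heads and the output transform are written in the simplest possible form, and all sizes are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(3)
n, d, n_heads = 4, 8, 2                  # assumed sizes; d must be divisible by n_heads
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))
Wo = rng.normal(size=(d, d))             # output transform applied after concatenation

# Split q, k, v into 2 heads; head h only attends within head h
heads = []
for h in range(n_heads):
    sl = slice(h * d // n_heads, (h + 1) * d // n_heads)
    heads.append(attention(q[:, sl], k[:, sl], v[:, sl]))   # b^{i,h} for all positions i

b = np.concatenate(heads, axis=-1) @ Wo  # concat the heads, then transform back to size d
print(b.shape)                           # (4, 8)
```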
Positional Encoding
"Distant as the horizon, yet close as a neighbour."
In the attention mechanism, the order of the words in the input sentence makes no difference.
- There is no positional information => so each position $i$ gets a unique positional vector $e^i$ added to $a^i$; it is not learned but set by hand (one hand-set choice, used in the original paper, is sketched after this list).
- Another view: represent each position with a one-hot vector $p^i$; appending it to the input and multiplying by the weight matrix works out to be equivalent to adding a positional vector $e^i$.
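A sketch of the sinusoidal positional encoding used as the hand-set choice in the original Transformer paper; the function and variable names here are my own, and the sizes are arbitrary:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_positions)[:, None]                 # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]                 # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# e^i is simply added to the embedding a^i, position by position
a = np.zeros((50, 64))                                    # assumed: 50 positions, d_model = 64
a = a + sinusoidal_positional_encoding(50, 64)
print(a.shape)                                            # (50, 64)
```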
Seq2seq with Attention
The original seq2seq model consists of two RNNs forming the Encoder and the Decoder, and can be applied to machine translation.
Originally the Encoder is a bidirectional RNN and the Decoder a unidirectional RNN; replacing both with self-attention layers achieves the same goal while allowing parallel computation.
The Encoder
- The input goes through the Input Embedding; to account for position, the hand-set Positional Encoding is added, and the result enters a block that is repeated N times.
- Multi-head Attention: inside the Encoder the attention is multi-head, i.e. there are multiple sets of $q, k, v$. Each $a$ is multiplied by the $q$/$k$/$v$ weight matrices separately, the attention scores $\hat{\alpha}$ are computed, and finally the outputs $b$ are obtained.
- Add & Norm (residual connection [3]): add the input $a$ of the Multi-head Attention to its output $b$ to get $a + b$, then apply Layer Normalization [1].
- The result is then passed to the feed-forward network, followed by another Add & Norm (a sketch of one full encoder block follows this list).
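Putting the encoder pieces together, here is a minimal sketch of one encoder block (self-attention → Add & Norm → feed-forward → Add & Norm). The attention is single-head for brevity, and all sizes, names, and initializations are assumptions rather than the note's specification:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to mean 0, variance 1 (cf. [1]).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def encoder_block(x, params):
    Wq, Wk, Wv, W1, W2 = params
    # Self-attention sub-layer with residual connection [3] and Layer Norm [1]
    attn = attention(x @ Wq, x @ Wk, x @ Wv)
    x = layer_norm(x + attn)                    # Add & Norm
    # Feed-forward sub-layer (two linear maps with ReLU), again followed by Add & Norm
    ff = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ff)

rng = np.random.default_rng(4)
n, d, d_ff = 4, 8, 32                           # assumed sizes
params = (rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)),
          rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
x = rng.normal(size=(n, d))
print(encoder_block(x, params).shape)           # (4, 8)
```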
The Decoder

- The Decoder input is the output from the previous time step; it goes through the Output Embedding, the hand-set Positional Encoding is added to account for position, and it enters a block that is repeated N times.
- Masked Multi-head Attention: performs attention, where Masked means it only attends to the part of the sequence that has already been generated (see the mask sketch after this list); this is followed by an Add & Norm layer.
- Next comes a Multi-head Attention layer that attends to the Encoder's output, followed by another Add & Norm layer.
- The result is passed to the Feed Forward network, then through a Linear layer and a Softmax to produce the final output.
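The "masked" part can be sketched as a lower-triangular mask applied to the score matrix, so position $i$ only attends to positions $\le i$; single head, arbitrary sizes, and the names are my own:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(5)
n, d = 4, 8                                           # assumed sizes
q = rng.normal(size=(n, d)); k = rng.normal(size=(n, d)); v = rng.normal(size=(n, d))

scores = q @ k.T / np.sqrt(d)                         # (n, n) score matrix
causal_mask = np.tril(np.ones((n, n), dtype=bool))    # True where attending is allowed
scores = np.where(causal_mask, scores, -1e9)          # block attention to future positions
weights = softmax(scores, axis=-1)
b = weights @ v

print(np.round(weights, 2))  # upper triangle is ~0: each step only sees what was generated
```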
Attention Visualization
single-head
The pairwise relationships between words: the thicker the line, the stronger the association.
Multi-head
The results of pairing different sets of q and k differ, which means different sets of q, k capture different information (e.g. some attend locally, others globally).
Applications
Summarizer By Google
The input is a set of documents and the output is a single article (a summary).
Horizontally (across time) it is a Transformer; vertically it is an RNN.
Self-attention GAN
Used for image generation.
References

[1] Layer Norm: https://arxiv.org/abs/1607.06450 — does not depend on the batch; normalizes the different feature dimensions of each example to mean 0 and variance 1. Layer Norm is usually used together with RNNs.
[2] Batch Norm: https://www.youtube.com/watch?v=BZh1ltr5Rkg — normalizes the same dimension across the different examples in a batch to mean 0 and variance 1.
[3] Residual Connection: the output is expressed as the linear superposition of the input and a non-linear transformation of the input.