Course link
"Transformer" literally translates into Chinese as 變形金剛 (the robots); its best-known application at the moment is BERT.
The Transformer is a seq2seq model; what makes it special is its heavy use of "self-attention".
Sequence
When we talk about sequences, the RNN architecture immediately comes to mind: it takes a vector sequence \(a^1, a^2, a^3, a^4\) as input and outputs another vector sequence \(b^1, b^2, b^3, b^4\). Its biggest problem is that it is hard to parallelize: to get the result \(b^4\), you must first compute \(b^1, b^2, b^3\), which is what makes parallel computation impossible.
For this reason, many people have proposed replacing the RNN with a CNN: a filter sweeps across the whole sequence to produce an output (dots of different colors represent different filters). The drawback is that the information a CNN considers at each step is limited by the filter size, unlike an RNN, where \(b^4\), say, is produced only after the contents of \(a^1\) through \(a^4\) have been taken into account. This is not insurmountable, though: the more layers you stack, the more information the upper layers see (the idea of a receptive field). Take the output of the second CNN layer on the right of the slide (the blue triangle): it takes the first layer's outputs as its input, and since each of those already covers several input vectors, it effectively covers a much wider span of \(a^1, \dots, a^4\).
The advantage of using a CNN is that it can be parallelized: a filter does not have to wait for the other filters to finish, they can all be computed at the same time. There is still a drawback, however: if you do not stack enough layers, the upper layers still do not see enough information. This motivates another idea: self-attention.
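To make the receptive-field point concrete, here is a tiny sketch (my own illustration, not from the lecture) of how far a stack of stride-1 1-D convolutions can see; \(1 + L\,(k-1)\) is the standard receptive-field size for \(L\) layers of kernel size \(k\):

```python
# Receptive field of stacked stride-1 1-D convolutions:
# each extra layer with kernel size k adds (k - 1) input positions.
def receptive_field(num_layers: int, kernel_size: int) -> int:
    return 1 + num_layers * (kernel_size - 1)

for layers in range(1, 4):
    print(layers, "layer(s) of kernel size 3 ->",
          receptive_field(layers, 3), "input vectors seen")
# 1 layer sees 3 inputs, 2 layers see 5, 3 layers see 7, ...
```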
Self-Attention
What self-attention wants to do is exactly what an RNN does: take a sequence \(a^1, \dots, a^4\) as input and output a sequence \(b^1, \dots, b^4\). Like a bidirectional RNN, every output has seen the entire input sequence. The most surprising part is that it can be parallelized: \(b^1, b^2, b^3, b^4\) can all be computed at the same time.
Paper link: Attention Is All You Need
Self-attention first appeared in a Google paper (linked above), "Attention Is All You Need".
The computation works as follows:
- Suppose the input is a sequence \(x^1, x^2, x^3, x^4\)
- Each \(x^i\) passes through an embedding, giving the output \(a^i = W x^i\)
- The \(a^i\) then enter the self-attention layer
- Each input is multiplied by three transformations (three matrices, each producing one vector), namely:
    - \(q^i = W^q a^i\): the query, used to match against the others
    - \(k^i = W^k a^i\): the key, which the queries are matched against
    - \(v^i = W^v a^i\): the value, the information to be extracted
- Take the query \(q^1\) and do attention against every key \(k^i\), obtaining \(\alpha_{1,i}\)
- Here "attention" means computing how well two vectors match (input: two vectors, output: one scalar)
- There are many possible ways to compute how well two vectors match
- For example, \(q^1\) with \(k^1\) gives \(\alpha_{1,1}\), \(q^1\) with \(k^2\) gives \(\alpha_{1,2}\), and so on
- The method used here is called "Scaled Dot-Product Attention"
- \(\alpha_{1,i} = \dfrac{q^1 \cdot k^i}{\sqrt{d}}\), where \(d\) is the dimension of \(q\) and \(k\)
- In general, the larger the dimension, the more terms get summed in the inner product and the larger the result tends to be, so we divide by \(\sqrt{d}\) to scale it down (see the paper for details)
- Pass the resulting \(\alpha_{1,i}\) through a softmax to get \(\hat{\alpha}_{1,i} = \dfrac{\exp(\alpha_{1,i})}{\sum_j \exp(\alpha_{1,j})}\)
- Multiply each \(\hat{\alpha}_{1,i}\) by its corresponding \(v^i\) and take the weighted sum, i.e. \(b^1 = \sum_i \hat{\alpha}_{1,i} v^i\)
- This \(b^1\) is the first element of the output sequence we need, and it has used the whole input sequence, because it is computed from all of the \(v^i\), and each \(v^i\) is a transformation of \(a^i\)
- With this formulation, even if you only want purely local information, you simply set \(\hat{\alpha}_{1,i}\) to zero for every position other than the one itself; if you only want the information at the head and the tail, you just set the \(\hat{\alpha}_{1,i}\) in the middle to zero
- While \(b^1\) is being computed, \(b^2\) can be computed directly at the same time: compute the attention of \(q^2\) with every \(k^i\) in the same way, apply the softmax to get \(\hat{\alpha}_{2,i}\), and take the weighted sum to get \(b^2\)
In summary, the self-attention layer takes the sequence \(a^1, \dots, a^4\) as input and outputs the sequence \(b^1, \dots, b^4\); it does the same job as an RNN, except that it can be computed in parallel.
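To make the steps above concrete, here is a minimal NumPy sketch of computing \(b^1\) for one query (the sequence length, the dimension \(d = 4\), and the random weights are illustrative choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # dimension of q, k, v (illustrative choice)
a = rng.normal(size=(4, d))            # a^1 ... a^4, one row per input position

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

q = a @ Wq                             # rows are q^1..q^4  (q^i = W^q a^i, row convention)
k = a @ Wk                             # rows are k^1..k^4
v = a @ Wv                             # rows are v^1..v^4

# Scaled dot-product attention for the first query q^1:
alpha_1 = q[0] @ k.T / np.sqrt(d)      # alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1_hat = np.exp(alpha_1) / np.exp(alpha_1).sum()   # softmax
b1 = alpha_1_hat @ v                   # b^1 = sum_i alpha_hat_{1,i} v^i
print(b1)
```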
The following explains how this is parallelized.
- Stack \(a^1, \dots, a^4\) into a matrix \(I\); then \(Q = W^q I\), \(K = W^k I\), \(V = W^v I\), i.e. each set of vectors is obtained with a single matrix multiplication against the corresponding weight matrix.
- The attention values can be computed the same way, all at once as one matrix operation: \(A = K^T Q\)
- Apply softmax to \(A\) to get \(\hat{A}\)
- Finally the weighted sum: multiplying \(V\) by \(\hat{A}\) gives the output, i.e. \(O = V \hat{A}\); this is nothing more than a matrix multiplication
Taken as a whole, the self-attention layer takes the matrix \(I\) as input and produces the matrix \(O\) as output:
- The input \(I\) is multiplied by the three weight matrices to give \(Q = W^q I\), \(K = W^k I\), \(V = W^v I\)
- \(A = K^T Q\), which holds the attention between every pair of positions in the input sequence
- Apply softmax to \(A\) to get \(\hat{A}\); the output is then \(O = V \hat{A}\)
This whole process is just a chain of matrix multiplications, so it can be accelerated very effectively with a GPU.
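Continuing the sketch above, here is the same computation done in one shot with matrix products (written in the row-vector convention, so the transposes sit in different places than in the column-vector formulas \(Q = W^q I\), \(A = K^T Q\), \(O = V\hat{A}\); the two forms are equivalent):

```python
import numpy as np

def self_attention(a, Wq, Wk, Wv):
    """Matrix-form self-attention (row-vector convention).

    a: (seq_len, d) input matrix I, one row per position.
    Equivalent to the column-vector form in the notes:
    Q = W^q I, K = W^k I, V = W^v I, A = K^T Q, O = V softmax(A).
    """
    d = a.shape[1]
    Q, K, V = a @ Wq, a @ Wk, a @ Wv          # all positions at once
    A = Q @ K.T / np.sqrt(d)                  # pairwise attention scores
    A_hat = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # softmax per query
    return A_hat @ V                          # weighted sums -> output matrix O

rng = np.random.default_rng(0)
d = 4
a = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
O = self_attention(a, Wq, Wk, Wv)             # with this seed, O[0] matches b^1 from the loop version
print(O.shape)                                # (4, 4)
```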
Multi-head Self-attention
There is a variant of self-attention called "Multi-head Self-attention"; the lecture uses two heads as the example:
- The procedure starts the same way, first producing \(q^i, k^i, v^i\)
- Each of them then splits into two branches: \(q^{i,1}, q^{i,2}, k^{i,1}, k^{i,2}, v^{i,1}, v^{i,2}\)
- Attention is computed exactly as before, but only between parts with the same head index: \(q^{i,1}\) only attends to the \(k^{j,1}\), giving \(b^{i,1}\), and \(q^{i,2}\) only attends to the \(k^{j,2}\), giving \(b^{i,2}\)
- Stack the resulting \(b^{i,1}\) and \(b^{i,2}\) together; if the dimension feels too high, you can multiply by an additional weight matrix \(W^O\) to project it back down, giving \(b^i\)
The benefit of multiple heads is that different heads may focus on different things.
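A minimal two-head sketch (the way the heads are formed here, by slicing \(q^i, k^i, v^i\) into halves, and the output projection `Wo` are illustrative choices; implementations differ in these details):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d, n_heads = 4, 4, 2
d_head = d // n_heads                          # each head works in a smaller subspace

a = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d, d))                   # projection after concatenating the heads

Q, K, V = a @ Wq, a @ Wk, a @ Wv
heads = []
for h in range(n_heads):
    # q^{i,h}, k^{i,h}, v^{i,h}: take this head's slice of the columns
    sl = slice(h * d_head, (h + 1) * d_head)
    Ah = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_head), axis=1)
    heads.append(Ah @ V[:, sl])                # b^{i,h}: attention only within head h

b = np.concatenate(heads, axis=1) @ Wo         # concatenate the heads, then project with W^O
print(b.shape)                                 # (4, 4)
```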
Positional Encoding
Thinking it over, the order of the inputs does not seem to matter to self-attention, because everything is ultimately taken into account: every input vector does attention with every other vector.
Therefore the original paper adds a positional vector \(e^i\) (with the same dimension as \(a^i\)) to each \(a^i\). \(e^i\) is not a learned parameter but is set by hand, and every position has its own \(e^i\); apart from this, the rest of the procedure is exactly as described above.
Prof. Hung-yi Lee offers a view of this that differs from the paper's presentation:
- Imagine appending to the input \(x^i\) a one-hot vector \(p^i\) (only the \(i\)-th dimension is 1, the rest are 0); this \(p^i\) carries the position information
- Stack \(x^i\) and \(p^i\) together and multiply by a matrix \(W\) to get the embedding. If you think of \(W\) as consisting of two parts, \(W = [\,W^I \;\; W^P\,]\) (this is the matrix-partition idea from linear algebra), then \(W \begin{bmatrix} x^i \\ p^i \end{bmatrix} = W^I x^i + W^P p^i\), where \(W^I x^i\) can be viewed as \(a^i\) and \(W^P p^i\) as \(e^i\)
(Figure: what the positional encoding vectors \(e^i\) look like.)
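A quick numerical check of the matrix-partition argument (the sizes below are arbitrary; this only verifies the algebra and is not how the paper actually constructs \(e^i\)):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, seq_len, d_model = 3, 4, 5

x = rng.normal(size=d_x)                   # one input x^i
i = 2                                      # its position
p = np.zeros(seq_len)
p[i] = 1.0                                 # one-hot position vector p^i

W_I = rng.normal(size=(d_model, d_x))      # embedding part of W
W_P = rng.normal(size=(d_model, seq_len))  # positional part of W
W = np.concatenate([W_I, W_P], axis=1)     # W = [W^I  W^P]

left = W @ np.concatenate([x, p])          # W [x^i ; p^i]
right = W_I @ x + W_P @ p                  # a^i + e^i
print(np.allclose(left, right))            # True: appending p^i == adding e^i
```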
Seq2seq with Attention
We have already seen that self-attention can replace the RNN; now we apply this layer to a seq2seq model.
A seq2seq model has two RNNs, an encoder and a decoder; we simply replace those RNNs with self-attention.
Demo

An animation of this process can be found on the Google blog.
When decoding, the model attends not only within the decoder (to what it has already generated) but also to the encoder's outputs.

The architecture on the slide above is a seq2seq model, split into an encoder (left) and a decoder (right); the lecture uses Chinese-to-English translation as the example.

The overall flow is as follows:
- Encoder:
    - The input first passes through the input embedding layer, which outputs a sequence of vectors
    - Positional encoding is added to the vector sequence
    - It enters the Multi-Head Self-attention layer: a vector sequence in, a vector sequence out
    - It enters Add & Norm: the multi-head layer's input and output are added together, then passed through layer norm
    - It enters the Feed Forward layer
    - Then another Add & Norm
    - Steps 3~6 above are repeated n times
- Decoder:
    - The input is the output produced at the previous time step
    - It passes through the output embedding, which outputs a sequence of vectors
    - Positional encoding is added to the vector sequence
    - It enters Masked Multi-Head Attention, where "Masked" means that when computing attention the decoder only attends to the part of the sequence that has already been generated
    - Add & Norm
    - It enters Multi-Head Attention, which here attends to the output the encoder has produced
    - Add & Norm
    - Feed Forward
    - Add & Norm
    - Linear
    - Softmax
    - Output Probabilities
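As an illustration of the Add & Norm pattern in the encoder, here is a minimal sketch of one encoder block's forward pass (single-head attention, a two-layer ReLU feed-forward network, and a bare layer norm without learned scale and bias are simplifications of my own, not the exact architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to mean 0, variance 1
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    d = x.shape[-1]
    # Self-attention (a single head here, to keep the sketch short)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V
    x = layer_norm(x + attn)                 # Add & Norm (residual + layer norm)
    # Position-wise feed-forward
    ff = np.maximum(0, x @ W1) @ W2          # two linear layers with a ReLU in between
    return layer_norm(x + ff)                # second Add & Norm

rng = np.random.default_rng(0)
seq_len, d, d_ff = 4, 8, 16
x = rng.normal(size=(seq_len, d))            # embeddings + positional encoding
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
print(encoder_block(x, Wq, Wk, Wv, W1, W2).shape)   # (4, 8)
```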
Paper link: Layer Normalization
Video link: Batch Normalization
The difference between Batch Norm and Layer Norm (assume batch_size = 4):
- Batch Norm: normalization is computed over the same dimension of every example in the batch, i.e. each dimension is made to have mean 0 and variance 1 across the batch
- Layer Norm: the batch is not involved; for a single example, that example's own features are made to have mean 0 and variance 1; it is commonly used together with RNNs
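A minimal sketch of which axis each method normalizes over (batch_size = 4 as in the example above; the learned scale and shift parameters are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))                       # batch_size=4 examples, 6 features each

# Batch Norm: normalize each feature (column) across the batch
bn = (x - x.mean(axis=0)) / x.std(axis=0)
print(bn.mean(axis=0).round(6), bn.std(axis=0).round(6))   # ~0 and ~1 per feature

# Layer Norm: normalize each example (row) across its own features
ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
print(ln.mean(axis=1).round(6), ln.std(axis=1).round(6))   # ~0 and ~1 per example
```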
Attention Visualization

The figure above is the attention visualization included in the paper; it shows the attention between every pair of words, where a thicker line means a larger attention weight.

In the sentence above, "The animal didn't cross the street because it was too tired.", remarkably, "it" attends to "animal", because that is exactly what "it" refers to.
But when the sentence is changed to "The animal didn't cross the street because it was too wide.", "it" no longer attends to "animal" but to "street".
Multi-head Attention

The figure above shows how, with Multi-head Attention, each group of heads produces a different result: in the lower figure (red lines) the head is clearly looking for local information, so the heavier attention is on the few neighboring words, while in the upper figure (green lines) the attention reaches much farther.
Example Application

Applications of the Transformer:
Google uses the Transformer to produce document summaries; impressively, after reading a set of documents, the machine writes a Wikipedia article.
Before the Transformer, this kind of task could not succeed: the input documents are extremely long and the output is long as well, and an RNN would simply fall over on it.

Here is a further extended version.
Conceptually, in the original Transformer every layer is different, but the Universal Transformer applies an RNN along the depth direction (the vertical axis), so every layer is the same Transformer, while along the position direction (the horizontal axis) the Transformer replaces the original RNN. In other words, the same Transformer block is reused over and over.
Reference article
Self-Attention GAN

Paper link: Self-Attention Generative Adversarial Networks
The Transformer can also be applied to images.
The idea of the Self-Attention GAN is to let every pixel attend to all the other pixels, so that much more global information is taken into account.
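As a rough sketch of that idea (not SAGAN's actual design, which also uses 1×1 convolutions and a learned gating coefficient on top of this), flatten the feature map so that each pixel becomes one sequence position and apply the same matrix-form self-attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
feat = rng.normal(size=(H, W, C))              # a feature map inside the generator

x = feat.reshape(H * W, C)                     # every pixel becomes one sequence position
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
A_hat = softmax(Q @ K.T / np.sqrt(C), axis=-1) # (H*W, H*W): every pixel attends to every pixel
out = (A_hat @ V).reshape(H, W, C)             # global information gathered per pixel
print(out.shape)                               # (8, 8, 16)
```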