Transformer: 李宏毅 (Hung-yi Lee) Deep Learning

tags: Deep Learning, Transformer, seq2seq, Attention

Notes based on this lecture: https://youtu.be/ugWDIIOHtPA

RNN:

  • The classic model for processing sequences: unidirectional RNN, bidirectional RNN, and so on.
  • The problem with RNNs: they are hard to parallelize => this led people to propose replacing RNNs with CNNs.

Replacing RNN with CNN:

  • CNN filters: each triangle (in the lecture's figure) is a filter that takes a small segment of the sequence as input and outputs a single value (via an inner product); different filters cover different parts of the sequence.
  • Each CNN filter can only see a limited window, whereas an RNN can take the whole sentence into account.
  • To cover a long sentence: stack many CNN layers; filters in upper layers can see more context because they take the outputs of lower-layer filters as their input.
  • Problem: many layers must be stacked before long sentences can be covered, which is what motivated the self-attention mechanism.

Self-Attention Layer

Purpose

  • It aims to replace what an RNN does: the input and output are the same as an RNN's, taking a sequence in and producing a sequence out.
  • It has the same capability as a bidirectional RNN, i.e. it can look at the whole sequence first; the special point is that $b^1, b^2, b^3, b^4$ are computed simultaneously, with no need to finish one before computing the next.
  • It can completely replace RNNs.

Papers that were previously published using RNNs have pretty much all been redone once with the self-attention mechanism.

The Attention mechanism (adapted from [4])

Core idea:

  • Attention is represented by the triple $\langle Q, K, V \rangle$: compute the similarity between a Query and each Key, then use those similarities to weight the corresponding Values.

Formula:

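The figure is missing here; it presumably showed the standard scaled dot-product attention formula from the Transformer paper, reproduced below for reference:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$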

Self-attention: the idea (this note)

  • Input: $x^1, \ldots, x^4$, a sequence.
  • Each input first goes through an embedding, i.e. it is multiplied by a weight matrix to become $a^1, \ldots, a^4$, and $a^1, \ldots, a^4$ are fed into the self-attention layer.
  • Each input is then multiplied by three different matrices to get the vectors $q$, $k$, $v$:
    • $q$: query (used to match against the others): $q^i = W^q a^i$
    • $k$: key (used to be matched): $k^i = W^k a^i$
    • $v$: value (the information to be extracted): $v^i = W^v a^i$
  • The weights $W^q$, $W^k$, $W^v$ are learned during training and are randomly initialized at the start. (A minimal code sketch follows this list.)
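A minimal numpy sketch of these projections (not the lecture's code; the dimensions and random inputs are hypothetical, and rows are used as position vectors, so `a @ W_q` plays the role of $W^q a^i$):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                   # hypothetical sequence length / embedding size

a = rng.normal(size=(seq_len, d_model))   # a^1..a^4, one row per position

# W^q, W^k, W^v are learned in practice; random placeholders here.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

q = a @ W_q   # row i is q^i = W^q a^i
k = a @ W_k   # row i is k^i = W^k a^i
v = a @ W_v   # row i is v^i = W^v a^i
```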

Method

  1. Take each query $q$ and do attention against every key $k$ (a function that takes two vectors and outputs a score); this is just computing the similarity between $q$ and $k$.
    • Scaled dot-product: $S(q^1, k^1)$ gives $\alpha_{1,1}$, $S(q^1, k^2)$ gives $\alpha_{1,2}$, and so on.
    • $\alpha_{1,i} = \frac{q^1 \cdot k^i}{\sqrt{d}}$
    • $d$ is the dimension of $q$ and $k$; dividing by $\sqrt{d}$ is just a trick the paper's authors use (it keeps the dot products from growing with the dimension).
  2. Apply a softmax normalization to get $\hat{\alpha}_{1,i}$.
  3. Multiply the $\hat{\alpha}$ values by the corresponding $v$ vectors and sum them, i.e. take a weighted sum, to get $b$: $b^1 = \sum_i \hat{\alpha}_{1,i} v^i$.
  4. The resulting $b^1$ is the first output vector of the sequence (one word or character).
  5. Every output vector uses information from the whole sequence. (A sketch of steps 1-3 follows this list.)
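A minimal sketch of steps 1-3 for a single query, reusing the hypothetical `q`, `k`, `v` arrays from the sketch above:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attend_one_query(q_i, k, v):
    d = q_i.shape[-1]
    alpha = k @ q_i / np.sqrt(d)         # step 1: alpha_{i,j} = q^i . k^j / sqrt(d)
    alpha_hat = softmax(alpha)           # step 2: softmax normalization
    return alpha_hat @ v                 # step 3: weighted sum of the values -> b^i

# b1 = attend_one_query(q[0], k, v)      # b^1 uses the whole sequence
```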

---

Parallelizing self-attention

$$q^i = W^q a^i \qquad k^i = W^k a^i \qquad v^i = W^v a^i$$

  1. Treat $a^1, \ldots, a^4$ as a matrix $I$; multiplying it by the weight matrix $W^q$ gives $q^1, \ldots, q^4$, collected as a matrix $Q$. The matrices $K$ and $V$ are obtained in the same way, by multiplying $a$ (the matrix $I$) with $W^k$ and $W^v$.
  2. $\alpha_{1,1} = (k^1)^T q^1$, $\alpha_{1,2} = (k^2)^T q^1$, and so on. So stack $k^1, \ldots, k^4$ together into the matrix $K$ and multiply it with $q^1$; likewise stack $q^1, \ldots, q^4$ together into the matrix $Q$. This yields the matrix $A$ formed by all the $\alpha$ values, which is the attention; applying a softmax turns it into $\hat{A}$.
  3. At every time step, every pair of vectors has an attention weight between them.

  4. Compute the weighted sum of $V$ with $\hat{A}$ to get the $b$ vectors; collecting the $b$'s gives the output matrix $O$.

What the self-attention layer does

Once everything is expressed as matrix multiplications, the computation can be accelerated with a GPU.
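Putting the matrix form together, a minimal sketch (row-vector convention, so the final step reads $O = \hat{A}V$ rather than the note's column-convention $V\hat{A}$; names are hypothetical):

```python
import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Whole-sequence self-attention as a few matrix multiplications plus a softmax."""
    Q, K, V = I @ W_q, I @ W_k, I @ W_v              # all q^i, k^i, v^i at once
    d = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d)                         # every pairwise alpha_{i,j}
    A_hat = np.exp(A - A.max(axis=-1, keepdims=True))
    A_hat /= A_hat.sum(axis=-1, keepdims=True)       # row-wise softmax -> A_hat
    return A_hat @ V                                 # output matrix O, one row per b^i
```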

Multi-head Self-attention

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Example with 2 heads

  • With 2 heads, each of $q$, $k$, $v$ is split into two, e.g. $q^{i,1}$ and $q^{i,2}$. The first head's query $q^{i,1}$ is only matched against the first-head keys $k^{j,1}$, giving the attention weights for head 1 and finally $b^{i,1}$.
  • Finally, $b^{i,1}$ and $b^{i,2}$ are concatenated and multiplied by a transformation matrix, which reduces the dimension and gives the final $b^i$.

Different heads attend to different information: some only care about local information (their neighbours), while others only care about global (longer-range) information, and so on.
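A sketch of the 2-head case under the same hypothetical shapes as before: each head attends only within its own slice of $q$, $k$, $v$, and the concatenated heads are multiplied by an output transform (called `W_o` here; the name is this note's choice):

```python
import numpy as np

def multi_head_self_attention(I, W_q, W_k, W_v, W_o, num_heads=2):
    seq_len, d_model = I.shape
    d_head = d_model // num_heads                       # assumes d_model is divisible
    Q, K, V = I @ W_q, I @ W_k, I @ W_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, cols], K[:, cols], V[:, cols]    # head h only sees its own slice
        A = q @ k.T / np.sqrt(d_head)
        A_hat = np.exp(A - A.max(axis=-1, keepdims=True))
        A_hat /= A_hat.sum(axis=-1, keepdims=True)
        heads.append(A_hat @ v)                         # b^{i,h} for every position i
    return np.concatenate(heads, axis=-1) @ W_o         # concat, then transform to b^i
```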

Positional Encoding

"Distant, yet as close as neighbours": to attention, far-away positions look the same as nearby ones.

In the attention mechanism, the order of the words in the input sentence makes no difference.

  • There is no position information => so a unique positional vector $e^i$ is added to each $a^i$; it is not learned but set by hand. (A common hand-crafted choice is sketched after this list.)
  • Another way to look at it: represent the position of $x^i$ with a one-hot vector $p^i$ appended to it, which works out to the same thing as adding $e^i$.
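The note does not say which hand-crafted $e^i$ is used; the original Transformer paper uses sinusoidal functions, sketched below (assuming an even `d_model`):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Hand-set positional vectors e^i: sin on even dimensions, cos on odd ones,
    at geometrically spaced frequencies (as in the original Transformer paper)."""
    pos = np.arange(seq_len)[:, None]                 # position index i
    dim = np.arange(0, d_model, 2)[None, :]           # even feature indices
    angle = pos / np.power(10000.0, dim / d_model)
    e = np.zeros((seq_len, d_model))
    e[:, 0::2] = np.sin(angle)
    e[:, 1::2] = np.cos(angle)
    return e                                          # used as a^i + e^i
```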

Seq2seq with Attention

The original seq2seq model is made of two RNNs, one acting as the Encoder and one as the Decoder, and can be applied to machine translation.

In the figure above, the Encoder was originally a bidirectional RNN and the Decoder a unidirectional RNN; in the figure below, both are replaced by self-attention layers, which achieves the same purpose while allowing parallel computation.

A closer look at the Transformer model

The Encoder

  1. The input passes through the Input Embedding; to take position into account, the hand-set Positional Encoding is added; it then enters a block that is repeated N times.

  2. Multi-head Attention: inside the Encoder, the attention is multi-head, i.e. there are multiple sets of $q$, $k$, $v$. Each of $q$, $k$, $v$ is obtained by multiplying $a$ by its own weight matrix; the $\alpha$ values are computed and finally $b$ is obtained.

  3. Add & Norm (residual connection [3]): add the input $a$ of the multi-head attention to its output $b$ to get $b'$, then apply Layer Normalization [1].
  4. The result is fed into the feed-forward network, followed by another Add & Norm. (A sketch of Add & Norm follows this list.)
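A minimal sketch of the residual-plus-normalization step (the learned gain and bias of Layer Norm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to mean 0 and variance 1 (see [1])."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(a, sublayer_out):
    """Residual connection [3] followed by layer normalization [1]."""
    return layer_norm(a + sublayer_out)

# Encoder block, schematically:
#   b   = multi_head_self_attention(a, ...)   # sub-layer 1
#   x   = add_and_norm(a, b)
#   ff  = feed_forward(x)                     # sub-layer 2 (position-wise FFN)
#   out = add_and_norm(x, ff)
```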

The Decoder

![](https://i.imgur.com/Jy5uxlE.png)

  1. The Decoder's input is the output of the previous time step; it goes through the Output Embedding, the hand-set Positional Encoding is added for position information, and it then enters a block that is repeated N times.
  2. Masked Multi-head Attention: performs attention, where "Masked" means each position can only attend to the part of the sequence that has already been generated; this is followed by an Add & Norm layer.
  3. Another Multi-head Attention layer then attends over the Encoder's outputs, followed by another Add & Norm layer.
  4. Finally, the result goes through the Feed Forward network, then a Linear layer and a Softmax to produce the final output. (A sketch of the masking in step 2 follows this list.)
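A minimal sketch of one way to implement the masking in step 2: scores for not-yet-generated positions are set to negative infinity before the softmax, so each position only attends to itself and earlier positions (shapes and names follow the earlier sketches):

```python
import numpy as np

def masked_self_attention(I, W_q, W_k, W_v):
    Q, K, V = I @ W_q, I @ W_k, I @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # strictly upper triangle
    scores = np.where(future, -np.inf, scores)           # hide future positions
    A_hat = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A_hat /= A_hat.sum(axis=-1, keepdims=True)           # softmax over allowed positions
    return A_hat @ V
```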

Attention Visualization

Single-head

The pairwise relationships between words: the thicker the line, the stronger the association.

Multi-head

Different sets of $q$ and $k$ give different matching results, meaning different sets of $q$ and $k$ carry different information (local in the lower figure, global in the upper one).

Applications

Summarizer By Google

The input is a collection of documents and the output is a single article (a summary).

Universal Transformer

Horizontally (across time) it is a Transformer; vertically it is an RNN.

Self-attention GAN

Used for image generation.

References

[1] Layer Norm. https://arxiv.org/abs/1607.06450 No batch is needed: the features of a single sample are normalized so that their mean is 0 and their variance is 1. Layer Norm is typically used together with RNNs.

[2] Batch Norm. https://www.youtube.com/watch?v=BZh1ltr5Rkg Normalizes the same dimension across the different samples in a batch so that the mean is 0 and the variance is 1.

[3] Residual connection: expresses the output as the sum of the input and a non-linear transformation of the input.