
Attention Is All You Need (Transformer)

tags: paper

Abstract

Before: complex recurrent or convolutional neural networks that include an encoder and a decoder
Proposed architecture: the Transformer, based solely on attention mechanisms

Introduction

Before (sequence models)

problem: sequential computation precludes parallelization within training examples

  • factorization tricks [21]: improve computational efficiency
  • conditional computation [32]: improve computational efficiency and model performance
  • however, the fundamental constraint of sequential computation remains

Attention mechanisms

  • allow modeling of dependencies without regard to their distance in the input or output sequences
  • however, mostly used in conjunction with a recurrent network

Supplementary note (attention)
https://blog.csdn.net/Enjoy_endless/article/details/88679989

  • Human attention: we focus on the parts we think are important
  • In the attention formulation, each output gets its own appropriate context vector \(C_i\), which lets the model attend to the relevant parts of the input

    \(y_1 = f1(C_1)\)
    \(y_2 = f1(C_2, y_1)\)
    \(y_3 = f1(C_3, y_2)\)
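
As a sketch of where each context vector comes from (assuming the standard encoder-decoder attention formulation; the symbols \(h_j\), \(s_{i-1}\), \(e_{ij}\), \(\alpha_{ij}\) are not in the note and come from that formulation): with encoder hidden states \(h_j\) and previous decoder state \(s_{i-1}\),

\[
e_{ij} = score(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
C_i = \sum_j \alpha_{ij} h_j
\]

so each output \(y_i\) gets its own weighted combination \(C_i\) of the inputs.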

Supplementary note (self-attention)
Viewed from the image perspective, the receptive field is wider than a CNN's, since a CNN only looks at one kernel at a time.
After passing through the encoder, we can tell which parts are important.

Transformer

  • eschewing recurrence
  • attention mechanism to draw global dependencies between input and output

Background

  • Before (Extended Neural GPU [16], ByteNet [18] and ConvS2S [9]): the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions
    • disadvantage: it is difficult to learn dependencies between distant positions
  • Transformer: a constant number of operations (at the cost of reduced effective resolution, counteracted by multi-head attention)
  • The Transformer does not use any convolution or recurrence

Model Architecture

  • encoder input: \(x = (x_1,...,x_n)\)
  • encoder output: \(z = (z_1,...,z_n)\)
  • decoder output: \(y = (y_1,...,y_m)\)

Encoder and Decoder Stacks

Supplementary note (residual connection)

source: arXiv:1512.03385

Supplementary note (batch normalization)

Encoder: N = 6 identical layers; all sub-layers produce outputs of dimension \(d_{model} = 512\)
Add: residual connection
Norm: layer normalization
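
A minimal NumPy sketch of how Add & Norm wraps every sub-layer, i.e. \(LayerNorm(x + Sublayer(x))\) (the learned gain and bias of layer normalization are omitted for brevity; function names are my own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position over the feature dimension (d_model)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sub-layer, then layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)        # (sequence length, d_model)
out = add_and_norm(x, lambda t: t)  # placeholder sub-layer; output keeps shape (10, 512)
```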

Others' interpretation:

source: https://jalammar.github.io/illustrated-transformer/
My interpretation:
paper: In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

Decoder: N = 6

Attention

Scaled Dot-Product Attention

  • Two kinds of attention functions
    • Additive attention
    • Dot-product attention: faster and more space-efficient in practice
  • Why we need \(\frac{1}{\sqrt{d_k}}\)
    • for large values of \(d_k\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
  • \(Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\)
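
A minimal NumPy sketch of the formula above (shapes and names are mine, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k), scaled by 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1 over the keys
    return weights @ V                   # (n_q, d_v)

Q, K, V = np.random.randn(5, 64), np.random.randn(7, 64), np.random.randn(7, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 64)
```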

Multi-Head Attention

\(h = 8, d_k = d_v = d_{model}/h = 64\)

  • beneficial to linearly project the queries, keys and values \(h\) times with different, learned linear projections to \(d_k\), \(d_k\) and \(d_v\) dimensions respectively
  • Concat: the outputs of the \(h\) heads are concatenated back into a single \(d_{model}\)-dimensional vector
  • \(MultiHead(Q, K, V) = Concat(head_1,...,head_h)W^O\)
  • where \(head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)\)
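
A sketch of the multi-head computation, reusing the scaled_dot_product_attention sketch above, with \(h = 8\) and \(d_{model} = 512\) as in the paper (the random matrices stand in for the learned projections \(W_i^Q, W_i^K, W_i^V, W^O\)):

```python
import numpy as np

def multi_head_attention(Q, K, V, h=8, d_model=512, seed=0):
    rng = np.random.default_rng(seed)
    d_k = d_v = d_model // h            # 64 when d_model = 512 and h = 8
    heads = []
    for _ in range(h):
        # per-head projections (random placeholders for the learned weights)
        W_q = rng.normal(size=(d_model, d_k))
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_v))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.normal(size=(h * d_v, d_model))
    return np.concatenate(heads, axis=-1) @ W_o  # concat heads, project back to d_model

x = np.random.randn(10, 512)         # (sequence length, d_model)
out = multi_head_attention(x, x, x)  # self-attention: Q = K = V = x
```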

Applications of Attention in our Model

  • In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
  • Each position in the encoder can attend to all positions in the previous layer of the encoder.
  • Self-attention in the decoder allows each position in the decoder to attend to all positions in the decoder up to and including that position (see the mask sketch below)
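
A sketch of the causal mask that enforces this in the decoder (my own helper; the mask is added to \(QK^T/\sqrt{d_k}\) before the softmax, so blocked positions get a weight of effectively zero):

```python
import numpy as np

def causal_mask(n):
    # (n, n) matrix: 0 where attention is allowed, -1e9 where a future position is blocked
    return np.triu(np.full((n, n), -1e9), k=1)

print(causal_mask(4))  # row i only attends to columns 0..i
```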

Position-wise Feed-Forward Networks

Keras equivalent of the first layer: Dense(dff, activation='relu'), followed by Dense(d_model)
fully connected: \(FFN(x) = max(0, xW_1+b_1)W_2+b_2 = ReLU(xW_1+b_1)W_2+b_2\)
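
A NumPy sketch of the position-wise FFN with \(d_{model} = 512\) and \(d_{ff} = 2048\) as in the paper (weights here are random placeholders; the same \(W_1, W_2\) are applied at every position):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # applied identically and independently at each position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU(x W1 + b1) W2 + b2

x = np.random.randn(10, d_model)  # (sequence length, d_model)
out = ffn(x)                      # shape unchanged: (10, d_model)
```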

Embeddings and Softmax

Positional Encoding


Because using binary position values would waste space, we can use their continuous float counterparts: sinusoidal functions.

source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
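
A sketch of the sinusoidal encoding from the paper, \(PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})\) and \(PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})\):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions use cos
    return pe

pe = positional_encoding(50)  # added to the input embeddings before the first layer
```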

Why Self-Attention

  • total computational complexity
  • the amount of computation that can be parallelized
  • long-range dependencies
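
Roughly, per Table 1 of the paper (with n = sequence length, d = representation dimension, k = convolution kernel width):

  • Self-attention: \(O(n^2 \cdot d)\) per layer, \(O(1)\) sequential operations, \(O(1)\) maximum path length
  • Recurrent: \(O(n \cdot d^2)\) per layer, \(O(n)\) sequential operations, \(O(n)\) maximum path length
  • Convolutional: \(O(k \cdot n \cdot d^2)\) per layer, \(O(1)\) sequential operations, \(O(\log_k(n))\) maximum path length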

Training

Results

Question

  • How does the Transformer handle variable-length input?
    The sequence length is independent of the weight dimensions.
    Below we only trace the dimension changes; the softmax does not change dimensions.
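
A sketch of the dimension bookkeeping for one self-attention head (the shape annotations are mine): with input length \(n\),

\[
\underbrace{X}_{n \times d_{model}} \underbrace{W^Q}_{d_{model} \times d_k} = \underbrace{Q}_{n \times d_k}, \qquad
\underbrace{QK^T}_{n \times n}, \qquad
\underbrace{softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V}_{n \times d_v}
\]

Every weight matrix shape involves only \(d_{model}\), \(d_k\) or \(d_v\); the length \(n\) appears only in activation shapes, so the same weights work for any input length.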

  • How are the decoder and the encoder connected?


    source: https://jalammar.github.io/illustrated-transformer/
    paper: In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

    paper: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.

Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

  • we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation
    Why can they be shared? It is related to NLP subwords (the source and target use a shared subword vocabulary).
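
    A minimal sketch of this weight tying (variable names are illustrative; the \(\sqrt{d_{model}}\) scaling of the embedding is from the paper):

```python
import numpy as np

vocab_size, d_model = 37000, 512           # ~37k shared source-target BPE tokens in the paper
E = np.random.randn(vocab_size, d_model)   # one embedding matrix shared by both embedding layers

def embed(token_ids):
    return E[token_ids] * np.sqrt(d_model) # embedding weights are multiplied by sqrt(d_model)

def logits(decoder_output):
    # the pre-softmax linear transformation reuses the same matrix, transposed
    return decoder_output @ E.T            # (n, vocab_size)
```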

  • Why is the FFN needed, when self-attention is already followed by a linear transformation?
    The ReLU adds non-linearity, which improves performance.
