
Attention Is All You Need (Transformer)

tags: paper

Abstract

Before: complex recurrent or convolutional neural networks that include an encoder and a decoder
Proposed architecture: the Transformer, based solely on attention mechanisms

Introduction

Before (sequence models)

problem: sequential computation precludes parallelization within training examples

  • factorization tricks [21]: improve computational efficiency
  • conditional computation [32]: improve computational efficiency and model performance
  • however, the fundamental constraint of sequential computation remains

Attention mechanisms

  • allow modeling of dependencies without regard to their distance in the input or output sequences
  • however, mostly used in conjunction with a recurrent network

Supplementary note (attention)
https://blog.csdn.net/Enjoy_endless/article/details/88679989

  • Human attention: we focus on the parts we think are important
  • In the attention formulation, each output gets its own appropriate context vector \(C_i\), which lets the model attend to the relevant parts of the input

    \(y_1 = f1(C_1)\)
    \(y_2 = f1(C_2, y_1)\)
    \(y_3 = f1(C_3, y_2)\)
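
As a sketch of where each context vector comes from (assuming the standard encoder-decoder attention formulation; the symbols \(h_j\), \(s_{i-1}\), \(e_{ij}\), \(\alpha_{ij}\) are not in the note and come from that formulation): with encoder hidden states \(h_j\) and previous decoder state \(s_{i-1}\),

\[
e_{ij} = score(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
C_i = \sum_j \alpha_{ij} h_j
\]

so each output \(y_i\) gets its own weighted combination \(C_i\) of the inputs.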

Supplementary note (self-attention)
Viewed from the image perspective, the receptive field is wider than a CNN's, since a CNN only looks at one kernel at a time.
After passing through the encoder, we can tell which parts are important.

Transformer

  • eschewing recurrence
  • attention mechanism to draw global dependencies between input and output

Background

  • Before (Extended Neural GPU [16], ByteNet [18] and ConvS2S [9]): the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions
    • disadvantage: it is difficult to learn dependencies between distant positions
  • Transformer: a constant number of operations (at the cost of reduced effective resolution, counteracted by multi-head attention)
  • The Transformer does not use any convolution or recurrence

Model Architecture

  • encoder input: \(x = (x_1,...,x_n)\)
  • encoder output: \(z = (z_1,...,z_n)\)
  • decoder output: \(y = (y_1,...,y_m)\)

Encoder and Decoder Stacks

Supplementary note (residual connection)

source: arXiv:1512.03385

Supplementary note (batch normalization)

Encoder: N = 6 identical layers; all sub-layers produce outputs of dimension \(d_{model} = 512\)
Add: residual connection
Norm: layer normalization
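
A minimal NumPy sketch of how Add & Norm wraps every sub-layer, i.e. \(LayerNorm(x + Sublayer(x))\) (the learned gain and bias of layer normalization are omitted for brevity; function names are my own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position over the feature dimension (d_model)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sub-layer, then layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)        # (sequence length, d_model)
out = add_and_norm(x, lambda t: t)  # placeholder sub-layer; output keeps shape (10, 512)
```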

Others' interpretation:

source: https://jalammar.github.io/illustrated-transformer/
My interpretation:
paper: In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

Decoder: N = 6

Attention

Scaled Dot-Product Attention

  • Two kinds of attention functions
    • Additive attention
    • Dot-product attention: faster and more space-efficient in practice
  • Why we need \(\frac{1}{\sqrt{d_k}}\)
    • for large values of \(d_k\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients
  • \(Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\)
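
A minimal NumPy sketch of the formula above (shapes and names are mine, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k), scaled by 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1 over the keys
    return weights @ V                   # (n_q, d_v)

Q, K, V = np.random.randn(5, 64), np.random.randn(7, 64), np.random.randn(7, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 64)
```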

Multi-Head Attention

\(h = 8, d_k = d_v = d_{model}/h = 64\)

  • beneficial to linearly project the queries, keys and values \(h\) times with different, learned linear projections to \(d_k\), \(d_k\) and \(d_v\) dimensions respectively
  • Concat: the outputs of the \(h\) heads are concatenated back into a single \(d_{model}\)-dimensional vector
  • \(MultiHead(Q, K, V) = Concat(head_1,...,head_h)W^O\)
  • where \(head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)\)
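
A sketch of the multi-head computation, reusing the scaled_dot_product_attention sketch above, with \(h = 8\) and \(d_{model} = 512\) as in the paper (the random matrices stand in for the learned projections \(W_i^Q, W_i^K, W_i^V, W^O\)):

```python
import numpy as np

def multi_head_attention(Q, K, V, h=8, d_model=512, seed=0):
    rng = np.random.default_rng(seed)
    d_k = d_v = d_model // h            # 64 when d_model = 512 and h = 8
    heads = []
    for _ in range(h):
        # per-head projections (random placeholders for the learned weights)
        W_q = rng.normal(size=(d_model, d_k))
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_v))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.normal(size=(h * d_v, d_model))
    return np.concatenate(heads, axis=-1) @ W_o  # concat heads, project back to d_model

x = np.random.randn(10, 512)         # (sequence length, d_model)
out = multi_head_attention(x, x, x)  # self-attention: Q = K = V = x
```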

Applications of Attention in our Model

  • In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
  • Each position in the encoder can attend to all positions in the previous layer of the encoder.
  • Self-attention in the decoder allows each position in the decoder to attend to all positions in the decoder up to and including that position (see the mask sketch below)
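
A sketch of the causal mask that enforces this in the decoder (my own helper; the mask is added to \(QK^T/\sqrt{d_k}\) before the softmax, so blocked positions get a weight of effectively zero):

```python
import numpy as np

def causal_mask(n):
    # (n, n) matrix: 0 where attention is allowed, -1e9 where a future position is blocked
    return np.triu(np.full((n, n), -1e9), k=1)

print(causal_mask(4))  # row i only attends to columns 0..i
```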

Position-wise Feed-Forward Networks

Keras equivalent of the first layer: Dense(dff, activation='relu'), followed by Dense(d_model)
fully connected: \(FFN(x) = max(0, xW_1+b_1)W_2+b_2 = ReLU(xW_1+b_1)W_2+b_2\)
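
A NumPy sketch of the position-wise FFN with \(d_{model} = 512\) and \(d_{ff} = 2048\) as in the paper (weights here are random placeholders; the same \(W_1, W_2\) are applied at every position):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # applied identically and independently at each position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU(x W1 + b1) W2 + b2

x = np.random.randn(10, d_model)  # (sequence length, d_model)
out = ffn(x)                      # shape unchanged: (10, d_model)
```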

Embeddings and Softmax

Positional Encoding


Because using binary position values would waste space, we can use their continuous float counterparts: sinusoidal functions.

source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
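
A sketch of the sinusoidal encoding from the paper, \(PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})\) and \(PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})\):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions use cos
    return pe

pe = positional_encoding(50)  # added to the input embeddings before the first layer
```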

Why Self-Attention

  • total computational complexity
  • the amount of computation that can be parallelized
  • long-range dependencies
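
Roughly, per Table 1 of the paper (with n = sequence length, d = representation dimension, k = convolution kernel width):

  • Self-attention: \(O(n^2 \cdot d)\) per layer, \(O(1)\) sequential operations, \(O(1)\) maximum path length
  • Recurrent: \(O(n \cdot d^2)\) per layer, \(O(n)\) sequential operations, \(O(n)\) maximum path length
  • Convolutional: \(O(k \cdot n \cdot d^2)\) per layer, \(O(1)\) sequential operations, \(O(\log_k(n))\) maximum path length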

Training

Results

Question

  • How does the Transformer handle variable-length input?
    The sequence length is independent of the weight dimensions.
    Below we only trace the dimension changes; the softmax does not change dimensions.
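
A sketch of the dimension bookkeeping for one self-attention head (the shape annotations are mine): with input length \(n\),

\[
\underbrace{X}_{n \times d_{model}} \underbrace{W^Q}_{d_{model} \times d_k} = \underbrace{Q}_{n \times d_k}, \qquad
\underbrace{QK^T}_{n \times n}, \qquad
\underbrace{softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V}_{n \times d_v}
\]

Every weight matrix shape involves only \(d_{model}\), \(d_k\) or \(d_v\); the length \(n\) appears only in activation shapes, so the same weights work for any input length.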

  • How are the decoder and the encoder connected?


    source: https://jalammar.github.io/illustrated-transformer/
    paper: In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

    paper: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.

Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

  • we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation
    Why can they be shared? It is related to NLP subwords (the source and target use a shared subword vocabulary).
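
    A minimal sketch of this weight tying (variable names are illustrative; the \(\sqrt{d_{model}}\) scaling of the embedding is from the paper):

```python
import numpy as np

vocab_size, d_model = 37000, 512           # ~37k shared source-target BPE tokens in the paper
E = np.random.randn(vocab_size, d_model)   # one embedding matrix shared by both embedding layers

def embed(token_ids):
    return E[token_ids] * np.sqrt(d_model) # embedding weights are multiplied by sqrt(d_model)

def logits(decoder_output):
    # the pre-softmax linear transformation reuses the same matrix, transposed
    return decoder_output @ E.T            # (n, vocab_size)
```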

  • Why is the FFN needed, when self-attention is already followed by a linear transformation?
    The ReLU adds non-linearity, which improves performance.
