{%hackmd SybccZ6XD %}
# Attention Is All You Need (Transformer)
###### tags: `paper`
## Abstract
Before: complex recurrent or convolutional neural networks that include an encoder and a decoder
Proposed architecture: Transformer, based on **attention mechanisms**
## Introduction
### Before (sequence model)
- Recurrent neural networks

source: https://gotensor.com/2019/02/28/recurrent-neural-networks-remembering-whats-important/
- Long short-term memory (LSTM)

source: https://towardsdatascience.com/lstm-recurrent-neural-networks-how-to-teach-a-network-to-remember-the-past-55e54c2ff22e
- Gated recurrent units (GRU)

source: https://towardsdatascience.com/gru-recurrent-neural-networks-a-smart-way-to-predict-sequences-in-python-80864e4fe9f6
### Computation problem
- factorization tricks [21]: computational efficiency
- conditional computation [32]: computational efficiency and model performance
### Attention mechanisms
- models dependencies without regard to their distance in the input or output sequences
- mostly used in conjunction with a recurrent network
:::warning
Supplement (attention)
https://blog.csdn.net/Enjoy_endless/article/details/88679989
- Human attention: focus on the parts we think are important

- In a general formulation, each output has its own context vector $C_i$, which lets the model attend to the appropriate parts of the input

$y_1 = f1(C_1)$
$y_2 = f1(C_2, y_1)$
$y_3 = f1(C_3, y_2)$
:::
:::warning
Supplement (self-attention)
Viewed on images, the observation range is wider than a CNN's, since a CNN only looks at one kernel at a time.
After the encoder, we can tell which parts are important.

:::
### Transformer
- eschewing recurrence
- attention mechanism to draw global dependencies between input and output
## Background
- Before (Extended Neural GPU [16], ByteNet [18] and ConvS2S [9]): the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions
- disadvantage: it is difficult to learn dependencies between distant positions
- Transformer: constant number of operations
- The Transformer does not use any convolution or RNN
## Model Architecture
- encoder input: $x = (x_1,...,x_n)$
- encoder output: $z = (z_1,...,z_n)$
- decoder output: $y = (y_1,...,y_m)$
### Encoder and Decoder Stacks
:::warning
Supplement (residual connection)

source: arXiv:1512.03385
:::
:::warning
Supplement (batch normalization)


:::
==Encoder==: N = 6 identical layers; all sub-layers produce outputs of dimension $d_{model} = 512$
Add: residual connection
Norm: layer normalization
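
Putting the two together, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. A minimal NumPy sketch of this Add & Norm wrapper, assuming a generic `sublayer` callable and omitting the learnable gain/bias and dropout:
```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension (d_model).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)               # (sequence_length, d_model)
print(add_and_norm(x, lambda t: t).shape)  # (10, 512)
```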

Others' interpretation:

source: https://jalammar.github.io/illustrated-transformer/
My interpretation:
paper: In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

==Decoder==: N = 6 identical layers, with a third sub-layer that performs multi-head attention over the output of the encoder stack

### Attention
#### Scaled Dot-Product Attention
- Two kinds of attention function
- Additive attention
- Dot-product attention: faster and more space-efficient
- Why we need $\frac{1}{\sqrt{d_k}}$
    - for larger values of $d_k$, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients
- $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
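
A minimal NumPy sketch of scaled dot-product attention; the shapes and variable names are illustrative, not from the paper:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                               # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # softmax over the keys
    return weights @ V                                            # (n_q, d_v)

Q = np.random.randn(5, 64)   # 5 queries, d_k = 64
K = np.random.randn(7, 64)   # 7 keys
V = np.random.randn(7, 64)   # 7 values, d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```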

#### Multi-Head Attention
$h = 8, d_k = d_v = d_{model}/h = 64$
- beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions
- Concat: the outputs of the h heads are concatenated back into a single $hd_v = d_{model}$-dimensional vector
- $MultiHead(Q, K, V) = Concat(head_1,...,head_h)W^O$
- where $head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)$
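
A NumPy sketch of multi-head attention with $h = 8$ and $d_{model} = 512$; the projection matrices are random placeholders standing in for the learned $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$:
```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h   # 64

def attention(Q, K, V):
    # Scaled dot-product attention (softmax over the key axis).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Learned in the real model; random placeholders here.
W_Q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_K = [np.random.randn(d_model, d_k) for _ in range(h)]
W_V = [np.random.randn(d_model, d_v) for _ in range(h)]
W_O = np.random.randn(h * d_v, d_model)

def multi_head_attention(Q, K, V):
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O   # concat, then project back to d_model

x = np.random.randn(10, d_model)            # sequence of length 10
print(multi_head_attention(x, x, x).shape)  # (10, 512)
```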

#### Applications of Attention in our Model
- In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

- Each position in the encoder can attend to all positions in the previous layer of the encoder.

- Each position in the decoder can attend to all positions in the decoder up to and including that position; leftward information flow is prevented by masking, as in the sketch below.
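
A minimal sketch of that mask in decoder self-attention: positions to the right of each query are set to a large negative value before the softmax, so their weights become (close to) zero. The function and variable names are mine, and the Q/K/V projections are omitted for brevity:
```python
import numpy as np

def masked_self_attention(x):
    # x: (n, d_k); each position may only attend to positions <= its own.
    n, d_k = x.shape
    scores = x @ x.T / np.sqrt(d_k)                       # (n, n)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)      # True above the diagonal
    scores = np.where(mask, -1e9, scores)                 # ~ -inf before softmax
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

x = np.random.randn(6, 64)
print(masked_self_attention(x).shape)  # (6, 64)
```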

### Position-wise Feed-Forward Networks
Keras-style: Dense(dff, activation='relu') followed by Dense(d_model), with $d_{ff} = 2048$
fully connected: $\mathrm{FFN}(x) = \max(0, xW_1+b_1)W_2+b_2 = \mathrm{ReLU}(xW_1+b_1)W_2+b_2$
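
A NumPy sketch of the position-wise FFN with $d_{model} = 512$ and $d_{ff} = 2048$; the same (randomly initialized placeholder) weights are applied to every position independently:
```python
import numpy as np

d_model, d_ff = 512, 2048

# Learned in the real model; random placeholders here.
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def ffn(x):
    # x: (sequence_length, d_model); identical transformation at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU(xW1 + b1)W2 + b2

x = np.random.randn(10, d_model)
print(ffn(x).shape)  # (10, 512)
```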
### Embeddings and Softmax
### Positional Encoding

Because using binary position encodings would be a waste of space, we can use their continuous float counterparts: sinusoidal functions.

source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
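
The paper defines $PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})$. A minimal NumPy sketch:
```python
import numpy as np

def positional_encoding(max_len, d_model):
    # pos: position in the sequence, i: index of the (sin, cos) dimension pair.
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

print(positional_encoding(50, 512).shape)  # (50, 512)
```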
## Why Self-Attention
- total computational complexity
- the amount of computation that can be parallelized
- long-range dependencies
## Training
## Results
## Question
- How does the Transformer handle variable-length input?
The sequence length has nothing to do with the dimensions of the weight matrices.
Below we only track how the dimensions change; softmax does not change the dimensions.
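
A shape-tracking sketch: the (placeholder) projection weights depend only on $d_{model}$ and $d_k$, never on the sequence length n, so the same weights work for any n:
```python
import numpy as np

d_model, d_k = 512, 64
W_Q = np.random.randn(d_model, d_k)   # weight shapes depend only on d_model and d_k
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

for n in (3, 17, 100):                           # different sequence lengths
    x = np.random.randn(n, d_model)              # (n, d_model)
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V          # each (n, d_k)
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax: still (n, n)
    print(n, (w @ V).shape)                      # (n, d_k) -- same weights, any n
```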

- How are the decoder and the encoder connected?


source: https://jalammar.github.io/illustrated-transformer/
paper: In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

paper: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.
[Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding](https://arxiv.org/abs/2005.08081)


- we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation
Why can the weights be shared? It is related to subword units in NLP.

- Why is the FFN needed, when self-attention is already followed by a linear transformation?
The ReLU non-linearity improves performance.