# Attention Is All You Need
###### tags: `論文翻譯` `deeplearning` `nlp`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院(現為樂詞網)
:::info
原文
:::
:::success
翻譯
:::
:::warning
任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1706.03762.pdf)
:::
## Abstract
:::info
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
:::
:::success
主流的序列轉換模型都是基於複雜的recurrent或是convolutional neural networks,其中包含一個encoder與一個decoder。效能最好的模型通常還會透過一個注意力機制來連結encoder與decoder。我們提出一個新的簡單網路架構,也就是Transformer,完全基於注意力機制,完全捨棄recurrence與convolutions。在兩個機器翻譯任務上的實驗說明,這些模型在品質上更好,同時更易於平行化,需要的訓練時間也明顯更少。我們的模型在WMT 2014英語-德語翻譯任務上達到28.4 BLEU,比現有的最佳結果(包含ensembles)高出超過2 BLEU。在WMT 2014英語-法語翻譯任務中,經過在八張GPUs上3.5天的訓練之後,我們的模型建立了新的單一模型最佳紀錄:41.8的BLEU score,而這只是目前文獻中最佳模型訓練成本的一小部份。我們透過將Transformer成功地應用到English constituency parsing(無論是大量還是有限的訓練資料)來說明它可以很好地泛化到其它的任務上。
:::
### 1 Introduction
:::info
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
:::
:::success
Recurrent neural networks,特別是long short-term memory [13]與gated recurrent [7] neural networks,已經穩穩地確立為序列建模與[轉導](https://terms.naer.edu.tw/detail/20616afa224a0100be532fb61dbb022a/)問題(像是語言建模與機器翻譯[35, 2, 5])上目前最好的方法。後續眾多努力持續突破遞迴語言模型與編碼器-解碼器架構的天花板[38, 24, 15]。
:::
:::info
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
:::
:::success
遞迴模型通常會沿著輸入與輸出序列的符號位置做因子分解計算(factor computation)。將位置與計算時間中的步驟(step)對齊,它們生成一系列的隱藏狀態$h_t$,而$h_t$是先前的隱藏狀態$h_{t-1}$與位置$t$的輸入的函數。這種固有的順序性質在本質上就阻礙了訓練樣本內的平行化,這在比較長的序列長度情況下變得很關鍵,因為記憶體限制卡住了樣本間的批次處理。近來的研究透過[因子分解](https://terms.naer.edu.tw/detail/1eb27f165a2a57c0f748945614145953/)技巧[21]與條件計算[32]明顯地提高了計算效率,後者同時也提高了模型效能。不過啊,這種順序計算的先天限制還是在的。
:::
:::info
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
:::
:::success
注意力機制儼然成為各種任務中引人注目的序列建模與轉導模型不可或缺的一部份,它允許對其依賴關係建模,而不需要考慮到它們在輸入或是輸出序列中的距離。然而,除了少數情況外,這類的注意力機制都是跟遞迴網路結合使用的。
:::
:::info
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
:::
:::success
在這個研究中,我們提出Transformer,這是一種避免遞迴的模型架構,完全依賴注意力機制來捕捉輸入與輸出之間的全域依賴關係。Transformer可以明顯地更加平行化,而且在翻譯品質上可以達到新的境界(只要你有8張P100,簡單訓練個12小時就可以)。
:::
### 2 Background
:::info
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
:::
:::success
減少順序計算(sequential computation)的目標也構成了Extended Neural GPU、ByteNet、ConvS2S的基礎,它們都是用卷積神經網路來做為基礎的建構區塊,以平行的方式計算所有輸入與輸出位置的隱藏表示(hidden representations)。在這些模型中,關聯兩個任意輸入或輸出位置的信號所需的運算數量會隨著位置之間的距離而增加,ConvS2S是線性增加,ByteNet則是對數增長。這讓學習距離比較遙遠的位置之間的依賴關係變得更加困難。在Transformer中,這被簡化成常數數量的操作,儘管代價是由於平均注意力權重位置(attention-weighted positions)而降低了有效解析度,不過這個問題我們會用Multi-Head Attention來抵消(見Section 3.2說明)。
:::
:::info
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
:::
:::success
Self-attention(自注意力),有時候稱為intra-attention,是一種為了計算序列的表示而將單一序列中不同位置相關聯的注意力機制。Self-attention已經被成功地應用在各種任務中,包括[閱讀理解](https://terms.naer.edu.tw/detail/4f4ca5c561c98006e991be54584e0d85/)、抽象摘要、文字蘊涵以及學習與任務無關的句子表示[4, 27, 28, 22]。
:::
:::info
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
:::
:::success
End-to-end memory networks基於遞迴注意力機制,而非sequence-aligned recurrence,而且已經被證明在簡單語言的問答與語言建模任務上表現得不錯。
:::
:::info
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
:::
:::success
然而,據我們所知,Transformer是第一個完全依賴self-attention來計算其輸入與輸出的表示、而不使用序列對齊的RNNs或卷積的轉導模型。接下來的章節中,我們將會說明Transformer、闡述使用self-attention的動機,並討論其相對於[17, 18]與[9]等模型的優勢。
:::
## 3 Model Architecture
:::info
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations $(x_1,...,x_n)$ to a sequence of continuous representations $\mathbf{z}=(z_1,...,z_n)$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1,...,y_m)$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
:::
:::success
多數競爭性的神經序列轉導模型都有一個encoder-decoder的結構。encoder將符號表示的輸入序列$(x_1,...,x_n)$映射到一個連續表示的序列$\mathbf{z}=(z_1,...,z_n)$。給定$\mathbf{z}$,decoder接著一次一個元素地生成符號的輸出序列$(y_1,...,y_m)$。每個步驟中,模型都是auto-regressive的,在生成下一個符號的時候,會將先前生成的符號當做額外的輸入。
:::
:::info
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
:::
:::success
Transformer依循著這個整體架構,在encoder跟decoder都使用堆疊的self-attention與point-wise、fully connected layers,分別如Figure 1左右半部所示。
:::
:::info
![image](https://hackmd.io/_uploads/HkHGrYDc6.png)
Figure 1: The Transformer - model architecture.
:::
### 3.1 Encoder and Decoder Stacks
:::info
**Encoder:** The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}}=512$.
:::
:::success
**Encoder:** encoder是由$\text{N}=6$個相同層堆疊所組成。每一層都有兩個子層。第一個是多頭自注意力機制(multi-head self-attention mechanism),第二個則是一個簡單的position-wise fully connected feed-forward network。我們在這兩個子層各自的周圍採用residual connection,接著做layer normalization。也就是說,每個子層的輸出會是$\text{LayerNorm}(x + \text{Sublayer}(x))$,其中$\text{Sublayer}(x)$是由子層本身所實現的函數。為了方便這些residual connections,模型中的所有子層以及嵌入層(embedding layers)都會產生維度$d_{\text{model}}=512$的輸出。
:::
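:::warning
下面是「$\text{LayerNorm}(x + \text{Sublayer}(x))$」這個子層結構的一個極簡示意(sketch),其中`layer_norm`與傳入的`sublayer`都是為了說明而假設的簡化函數(省略了可學習的gain/bias與dropout),並非論文的官方實作。
```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # 對最後一個維度做layer normalization(簡化版,省略可學習的gain/bias)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # 論文中每個子層的輸出形式:LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# 用法示意:x 的 shape 為 (seq_len, d_model),d_model = 512
x = np.random.randn(10, 512)
out = residual_sublayer(x, sublayer=lambda h: h @ np.random.randn(512, 512))
```
:::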
:::info
**Decoder:** The decoder is also composed of a stack of $\text{N}=6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
:::
:::success
**Decoder:** decoder也是由$\text{N}=6$個相同層堆疊所組成。除了每個encoder層中的兩個子層之外,decoder還插入第三個子層,這個子層會對encoder stack的輸出做multi-head attention。類似於encoder,我們在每個子層周圍使用residual connection,接著做layer normalization。我們也同時調整decoder stack中的自注意力子層(self-attention sub-layer),以防止某個位置(position)關注到後續的位置。這種遮罩跟「輸出嵌入偏移一個位置」的事實相結合,確保了位置$i$的預測只能依賴於小於$i$的位置的已知輸出。
:::
:::warning
這邊說的就是,decoder在預測第$i$個位置時,只看得到第$i$個位置之前已經產生的輸出,後面的位置全部用遮罩遮掉,確保訓練時不會偷看到之後的答案,維持auto-regressive的性質。
:::
### 3.2 Attention
:::info
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
:::
:::success
注意力函數(attention function)可以被描述為將query與一組key-value pairs映射到輸出的函數,其中query、keys、values與output都是向量。輸出(output)是值(values)的加權和,其中分配給每個值(value)的權重是透過查詢(query)與相對應鍵(key)的相容性函數(compatibility function)計算而得。
:::
#### 3.2.1 Scaled Dot-Product Attention
:::info
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
:::
:::success
我們把我們的特別注意力層稱為"Scaled Dot-Product Attention"(Figure 2)。input包含了維度為$d_k$的查詢(queries)與鍵(keys),以及維度為$d_v$的值(values)。我們計算查詢(query)與所有鍵(keys)的點積,除上$\sqrt{d_k}$,然後做softmax的處理來得到值(values)的權重。
:::
:::info
![image](https://hackmd.io/_uploads/rklklz556.png)
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
:::
:::info
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:
$$
\text{Attention}(Q,K,V)=\text{softmax}(\dfrac{QK^T}{\sqrt{d_k}})V \tag{1}
$$
:::
:::success
實務上,我們會同時計算一組的查詢(queries),會將之打包成一個矩陣$Q$。鍵(keys)與值(values)也會分別打包成矩陣$K$與$V$。我們計算的輸出矩陣如下:
$$
\text{Attention}(Q,K,V)=\text{softmax}(\dfrac{QK^T}{\sqrt{d_k}})V \tag{1}
$$
:::
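:::warning
下面用numpy寫一個Scaled Dot-Product Attention的簡化示意,對應公式(1);`mask`參數為選用,函數與變數名稱都只是示意用的假設,非論文附帶的程式碼。
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # 減掉最大值以維持數值穩定
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # 相容性函數:縮放後的點積
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # 被遮罩的位置近似 -inf
    weights = softmax(scores, axis=-1)         # 對keys做softmax得到權重
    return weights @ V                         # 輸出為values的加權和

Q = np.random.randn(5, 64); K = np.random.randn(7, 64); V = np.random.randn(7, 64)
out = scaled_dot_product_attention(Q, K, V)    # shape: (5, 64)
```
:::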
:::info
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\dfrac{1}{\sqrt{d_k}}$ . Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
:::
:::success
兩個最常用的注意力函數為additive attention [2]與dot-product (multiplicative) attention。除了縮放因子$\dfrac{1}{\sqrt{d_k}}$之外,dot-product attention跟我們的演算法是一樣的。additive attention則是用具有單一隱藏層的前饋網路來計算相容性函數。儘管兩者的理論複雜度相近,實務上dot-product attention還是快得多,也更節省空間,因為它可以用高度最佳化的矩陣乘法程式碼來實現。
:::
:::info
While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\dfrac{1}{\sqrt{d_k}}$.
:::
:::success
雖然對於較小的$d_k$來說,這兩個機制的表現是類似的,不過在$d_k$較大且未做縮放的情況下,additive attention的表現是優於dot product attention的[3]。我們懷疑,較大的$d_k$會使點積的數值變得非常大,把softmax function推向梯度極小的區域。為了抵消這個影響,我們才會用$\dfrac{1}{\sqrt{d_k}}$來縮放點積。
:::
:::info
To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q\cdot k=\sum_{i=1}^{d_k}q_ik_i$, has mean 0 and variance $d_k$.
:::
:::success
這邊說明為什麼點積會變大的原因,假設$q$與$k$的成份(components)是均值為0且方差為1的independent random variables。那它們的點積$q\cdot k=\sum_{i=1}^{d_k}q_ik_i$,均值就是0,方差就會是$d_k$
:::
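:::warning
這裡用一個小小的數值實驗驗證上面的說法:當$d_k$變大時,$q\cdot k$的變異數大約就是$d_k$,點積的量級也跟著變大;程式只是示意用的模擬。
```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.standard_normal((10000, d_k))   # 均值0、方差1的隨機向量
    k = rng.standard_normal((10000, d_k))
    dots = (q * k).sum(axis=1)              # 每一列做一次 q·k
    print(d_k, dots.var())                  # 變異數大約等於 d_k
```
:::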
#### 3.2.2 Multi-Head Attention
:::info
Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k,d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
:::
:::success
與其使用單一個以$d_{\text{model}}$維的keys、values與queries做運算的attention function,我們發現將queries、keys與values分別以$h$組不同的、可學習的線性投影,投影到$d_k$、$d_k$與$d_v$維,會更有幫助。在每一組投影後的queries、keys與values上,我們平行地執行注意力函數,得到$d_v$維的輸出值。把它們串接(concatenate)起來再投影一次,就得到最終的值,如Figure 2所示。
:::
:::info
![image](https://hackmd.io/_uploads/rkX1K7qc6.png)
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
:::
:::info
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
$$
\begin{align}
& \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,...,\text{head}_h)W^O \\
& \text{where head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)
\end{align}
$$
Where the projections are parameter matrices $W_i^Q\in\mathbb{R}^{d_{\text{model}}\times d_k}$, $W_i^K\in\mathbb{R}^{d_{\text{model}}\times d_k}$, $W_i^V\in\mathbb{R}^{d_{\text{model}}\times d_v}$ and $W^O\in\mathbb{R}^{hd_v\times d_{\text{model}}}$.
:::
:::success
Multi-head attention允許模型共同關注不同位置上、來自不同表示子空間的信息。若只用單一個attention head,取平均會抑制這種能力。
$$
\begin{align}
& \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,...,\text{head}_h)W^O \\
& \text{where head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)
\end{align}
$$
其中投影是參數矩陣$W_i^Q\in\mathbb{R}^{d_{\text{model}}\times d_k}$, $W_i^K\in\mathbb{R}^{d_{\text{model}}\times d_k}$, $W_i^V\in\mathbb{R}^{d_{\text{model}}\times d_v}$與$W^O\in\mathbb{R}^{hd_v\times d_{\text{model}}}$。
:::
:::info
In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
:::
:::success
在這個研究中,我們採用$h=8$個平行的注意力層(layers),也就是heads。每一個head我們都使用$d_k=d_v=d_{\text{model}}/h=64$。由於每一個head的維度降低了,總計算成本跟full dimensionality的single-head attention是差不多的。
:::
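:::warning
下面是multi-head self-attention的簡化示意:投影$h$次、平行計算注意力、再串接並投影回$d_{\text{model}}$維。其中的`attention`、`multi_head_attention`與隨機初始化的權重矩陣都是示意用的假設,並非論文的官方實作。
```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention(無遮罩版本)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(x, h=8, d_model=512):
    d_k = d_v = d_model // h                       # 論文設定:512 / 8 = 64
    rng = np.random.default_rng(0)
    W_Q = rng.standard_normal((h, d_model, d_k))   # 每個head各自的投影矩陣
    W_K = rng.standard_normal((h, d_model, d_k))
    W_V = rng.standard_normal((h, d_model, d_v))
    W_O = rng.standard_normal((h * d_v, d_model))
    heads = [attention(x @ W_Q[i], x @ W_K[i], x @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O    # Concat(head_1,...,head_h) W^O

x = np.random.randn(10, 512)                       # (seq_len, d_model)
out = multi_head_attention(x)                      # (10, 512)
```
:::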
#### 3.2.3 Applications of Attention in our Model
:::info
The Transformer uses multi-head attention in three different ways:
* In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
* The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
* Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
:::
:::success
Transformer以三種不同的方法來用著multi-head attention:
* 在"encoer-decoder attention" layers中,queries來自前一個decoder layer,keys與values則是來自於encoder的輸出。這讓decoder中的每個位置都可關注input sequence中的所有位置。這模仿了sequence-to-sequence models中的encoder-decoder attention mechanisms。
* encoder包含self-attention layers。在self-attention layer中,所有的keys、values與queries都是來自相同的位置(place),這種情況下就是encoder前一層的輸出。encoder中的每個位置能夠關注encoder前一層中的所有位置。
* 類似地,decoder中的self-attention layers允許decoder中的每一個位置關注decoder中截至該位置(包含)的所有位置。為了保留auto-regressive的性質,我們需要預防decoder中的向左信息流動。我們通過在 scaled dot-product attention中屏蔽(設定為$\infty$)所有對應非法連接的softmax輸入值來實現這一點。見Figure 2。
:::
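:::warning
下面示意decoder self-attention所用的遮罩:用一個下三角矩陣讓位置$i$只能看到$\le i$的位置,其餘分數在softmax之前被設成一個非常小的數(近似$-\infty$)。數值與寫法僅供說明。
```python
import numpy as np

seq_len = 5
# 下三角為True(看得到),上三角為False(被遮住的未來位置)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.randn(seq_len, seq_len)            # 假設這是 QK^T / sqrt(d_k) 的結果
masked = np.where(causal_mask, scores, -1e9)          # 非法連接 -> 近似 -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # softmax後,上三角的權重幾乎為0
print(np.round(weights, 3))
```
:::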
### 3.3 Position-wise Feed-Forward Networks
:::info
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
$$
\text{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2 \tag{2}
$$
While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}}=512$, and the inner-layer has dimensionality $d_{ff}=2048$.
:::
:::success
除了attention sub-layers之外,我們的encoder與decoder中的每一層都包含一個fully connected feed-forward network,單獨且相同地在每個位置應用。這包含兩個線性轉換,中間則使用ReLU。
$$
\text{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2 \tag{2}
$$
儘管不同位置之間的線性轉換是相同的,層與層之間仍然使用不同的參數。另一種描述這種作法的方式,就是把它看成兩個kernel size為1的卷積。輸入與輸出的維度是$d_{\text{model}}=512$,中間層的維度則是$d_{ff}=2048$。
:::
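:::warning
公式(2)的直接翻譯:兩個線性轉換中間夾一個ReLU,對每個位置獨立套用同一組參數。維度採用論文設定$d_{\text{model}}=512$、$d_{ff}=2048$,權重只是示意用的隨機值。
```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

def position_wise_ffn(x):
    # FFN(x) = max(0, xW1 + b1)W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(10, d_model)   # (seq_len, d_model)
out = position_wise_ffn(x)         # (10, 512)
```
:::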
### 3.4 Embeddings and Softmax
:::info
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.
:::
:::success
類似於其它序列轉導模型,我們使用學習的嵌入(learned embeddings)將輸入與輸出的token轉換為$d_{\text{model}}$維的向量。我們也使用常見的可學習線性轉換與softmax函數,將decoder的輸出轉換成預測next-token的機率。在我們的模型中,兩個embedding layers與pre-softmax linear transformation之間會共享相同的權重矩陣,類似於[30]。在嵌入層(embedding layers)中,我們會把這些權重乘上$\sqrt{d_{\text{model}}}$。
:::
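:::warning
下面簡單示意「共享權重矩陣」與「嵌入乘上$\sqrt{d_{\text{model}}}$」的概念:同一個矩陣同時當作embedding查表與pre-softmax的線性轉換(取轉置)。矩陣為隨機初始化,僅供說明,並非論文的官方實作。
```python
import numpy as np

vocab, d_model = 37000, 512
E = np.random.randn(vocab, d_model) * 0.02   # 共享的權重矩陣

def embed(token_ids):
    # 查表後乘上 sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def pre_softmax_logits(decoder_out):
    # pre-softmax線性轉換使用同一個矩陣(轉置)
    return decoder_out @ E.T

tokens = np.array([3, 17, 256])
logits = pre_softmax_logits(embed(tokens))   # shape: (3, 37000)
```
:::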
### 3.5 Positional Encoding
:::info
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
:::
:::success
因為我們的模型並不包含遞迴與卷積,為了讓模型能夠用上序列的順序,我們必需要注入這些tokens在序列中的絕對或相對位置的信息。為此,我們把"positional encodings"加到encoder與decoder堆疊底部的input embeddings。positional encodings跟embeddings有相同的維度,$d_{\text{model}}$,也因此兩個矩陣可以相加。positional encodings有很多種選擇,有學習來的,也有固定的。
:::
:::info
In this work, we use sine and cosine functions of different frequencies:
$$
\begin{align}
PE_{(pos, 2i)} &= \sin(pos/10000^{2i/d_{\text{model}}}) \\
PE_{(pos, 2i+1)} &= \cos(pos/10000^{2i/d_{\text{model}}})
\end{align}
$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000\cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
:::
:::success
在這個研究中,我們使用不同頻率的sine與cosine函數:
$$
\begin{align}
PE_{(pos, 2i)} &= \sin(pos/10000^{2i/d_{\text{model}}}) \\
PE_{(pos, 2i+1)} &= \cos(pos/10000^{2i/d_{\text{model}}})
\end{align}
$$
其中$pos$指的是位置,$i$指的是維度。也就是說,位置編碼(positional encoding)的每個維度都對應於正弦曲線。波長形成一個從$2\pi$到$10000\cdot 2\pi$的幾何級數。選擇這個函數是因為我們假設它可以讓模型簡單地透過相對位置學習到關注,因為對於任意固定偏移量$k$來說,$PE_{pos+k}$可以表示為$PE_{pos}$的線性函數。
:::
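:::warning
下面是sinusoidal positional encoding的一個常見numpy實作示意:偶數維度用sin、奇數維度用cos。寫法僅供參考,假設$d_{\text{model}}$為偶數。
```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                       # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# 使用方式:直接加到input embeddings上,例如 x = embed(tokens) + pe[:len(tokens)]
```
:::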
:::info
We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
:::
:::success
我們也實驗過改用學習的positional embeddings [9],發現這兩種版本所產生的結果幾乎一樣(見Table 3 row (E))。我們選擇正弦版本(sinusoidal version)是因為它可能讓模型外推(extrapolate)到比訓練期間所遇到的序列還要長的序列長度。
:::
:::info
Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
![image](https://hackmd.io/_uploads/HkmEGM09p.png)
:::
## 4 Why Self-Attention
:::info
In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1,...x_n)$ to another sequence of equal length $(z_1,...,z_n)$, with $x_i,z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.
:::
:::success
這一節中,我們把self-attention layers跟常用於將一個符號表示的variable-length sequence $(x_1,...x_n)$映射到另一個等長序列$(z_1,...,z_n)$(其中$x_i,z_i \in \mathbb{R}^d$)的recurrent與convolutional layers做各方面的比較,後者就像典型的序列轉導編碼器或解碼器中的隱藏層那樣。在說明我們使用self-attention的動機時,我們考慮三個必要條件。
:::
:::info
One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
:::
:::success
一個就是每一層的總計算複雜度。另一個就是可以平行化的計算量,利用所需要的最小序列操作數量來衡量。
:::
:::info
The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
:::
:::success
第三個就是網路中長期相依性(long-range dependencies)之間的路徑長度(path length)。學習長期相依性在很多序列轉導任務中是一個關鍵挑戰。影響學習這種相依性能力的一個關鍵因子,就是前向(forward)與反向(backward)信號在網路中必須經過的路徑長度。輸入與輸出序列中任意位置組合之間的路徑愈短,就愈容易學到長期相依性[12]。所以啊,我們還比較了由不同類型的層所組成的網路中,任意兩個輸入與輸出位置之間的最大路徑長度。
:::
:::info
As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to $O(n/r)$. We plan to investigate this approach further in future work.
:::
:::success
如Table 1所註記,self-attention layer是以常數數量的循序執行操作連接所有位置,而recurrent layer則需要$O(n)$個循序操作。在計算複雜度的部份,當序列長度$n$小於表示維度$d$的時候,self-attention layers就會比recurrent layers來得快,這在機器翻譯中最好的模型所使用的sentence representations(語句表示)中是常見的情況,如word-piece [38]與byte-pair [31]表示。為了提高涉及超長序列的任務的計算效能,self-attention可以限制成只考慮輸入序列中,以相對應輸出位置為中心、大小為$r$的鄰域。這將使最大路徑長度增加為$O(n/r)$。我們計劃在未來的研究中進一步探討這個方法。
:::
:::info
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in restricted self-attention.
![image](https://hackmd.io/_uploads/HJ4ECLJoa.png)
:::
:::info
A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions. Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$. Separable convolutions [6], however, decrease the complexity considerably, to $O(k\cdot n \cdot d + n \cdot d^2)$. Even with $k = n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
:::
:::success
kernel寬度$k<n$的單一卷積層無法連接所有的輸入與輸出成對的位置(兩兩無法全部相接)。在contiguous kernels的情況下要全部接上線的話就必需要堆$O(n/k)$個卷積層,在dilated convolutions的話則是需要$O(log_k(n))$個卷積層,用這樣的方式來增加網路中任意兩個任置之間的最長路徑的長度。卷積層的計算複雜度通常比遞迴層貴個$k$倍吧。不過啊,Separable convolutions通常可以將複雜度降低到$O(k\cdot n \cdot d + n \cdot d^2)$。即使$k=n$,separable convolution的複雜度也就只是等價於self-attention layer與 point-wise feed-forward layer的結合,這也是我們模型中所採用的方法。
:::
:::info
As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
:::
:::success
做為附帶的好處,self-attention可以產生更具可解釋性的模型。我們檢查了我們模型的注意力分佈,並在附錄中呈現與討論一些例子。不僅個別的attention head明顯學會執行不同的任務,許多attention heads似乎還表現出與句子的文法和語義結構相關的行為。
:::
## 5 Training
:::info
This section describes the training regime for our models.
:::
:::success
這一節說明我們模型的訓練方式。
:::
### 5.1 Training Data and Batching
:::info
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
:::
:::success
我們的訓練資料集是標準WMT 2014 English-German dataset,這包含大約450萬對語句(英文到德語的語句)。語句的部份用byte-pair encoding,這個encoding具有37000個tokens的共享來源-目標的詞彙。對於English-French的部份,我們使用更大的WMT 2014 English-French dataset,包含36M對句子,並且將tokens拆分為32000 word-piece vocabulary。差不多長度的語句就會分批放在一起。每個訓練批次包含一組的sentence pairs,大概有25000個source tokens與25000個target tokens。
:::
### 5.2 Hardware and Schedule
:::info
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
:::
:::success
我們在一台裝有8張NVIDIA P100 GPUs的機器上訓練模型。使用本論文所述超參數的基本模型,每個訓練step大概需要0.4秒。基本模型總共訓練了100,000個steps,也就是12小時。大模型的話(Table 3最後一行所述的那個),每個step要1秒。大模型總共訓練300,000個steps(3.5天)。
:::
### 5.3 Optimizer
:::info
We used the Adam optimizer [20] with $\beta_1=0.9, \beta_2=0.98$ and $\epsilon=10^{-9}$. We varied the learning rate over the course of training, according to the formula:
$$
lrate = d_{\text{model}}^{-0.5}\cdot \min(step\_num^{-0.5}, step\_num\cdot warmup\_steps^{-1.5}) \tag{3}
$$
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
:::
:::success
我們使用Adam optimizer [20],參數的部份$\beta_1=0.9, \beta_2=0.98$且$\epsilon=10^{-9}$。我們根據下面的數學式在訓練過程中改變learning rate:
$$
lrate = d_{\text{model}}^{-0.5}\cdot \min(step\_num^{-0.5}, step\_num\cdot warmup\_steps^{-1.5}) \tag{3}
$$
這對應於在前warmup_steps個訓練steps中線性增加learning rate,之後則按step number的平方根倒數成比例地減少。我們使用warmup_steps = 4000。
:::
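:::warning
公式(3)的直接實作示意:前warmup_steps步線性上升,之後隨著步數的平方根倒數下降。函數名稱與防呆處理只是示意上的假設。
```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
    step_num = max(step_num, 1)   # 避免step為0時出現負次方的除零錯誤
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 40000, 100000):
    print(s, round(lrate(s), 6))
```
:::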
### 5.4 Regularization
:::info
We employ three types of regularization during training:
* **Residual Dropout** We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.
* **Label Smoothing** During training, we employed label smoothing of value $\epsilon_{ls}=0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
:::
:::success
我們在訓練期間使用了三種正規化類型:
* **Residual Dropout** 我們在每個sub-layer的輸出加到sub-layer的輸入並做正規化之前,先對該輸出使用dropout [33]。此外,我們還在encoder與decoder堆疊中,對embeddings與positional encodings的總和使用dropout。基本模型的部份,我們使用$P_{drop}=0.1$。
* **Label Smoothing** 訓練期間,我們使用值為$\epsilon_{ls}=0.1$的label smoothing [36]。這會損害困惑度(perplexity),因為模型學得更加不確定,不過卻能提高準確度與BLEU score。
:::
:::warning
perplexity,困惑度,這是度量語言模型的一種指標。
:::
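:::warning
label smoothing的一個簡化示意:把one-hot目標分佈中的$1$改成$1-\epsilon_{ls}$,剩下的$\epsilon_{ls}$平均分給其它類別。這只是常見作法的sketch,細節(例如是否排除padding、分母怎麼取)依實作而定,並非論文附帶的程式碼。
```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    # 每個位置的目標分佈:正確token得到1-eps,其餘token平分eps
    dist = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return dist

smoothed = smooth_labels(np.array([2, 5, 1]), vocab_size=8, eps=0.1)
print(smoothed.sum(axis=-1))   # 每一列總和仍為1
```
:::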
## 6 Results
### 6.1 Machine Translation
:::info
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
:::
:::success
在WMT 2014 English-to-German的翻譯任務上,big transformer model(Table 2中的Transformer (big))優於先前報告過最好的模型(包含ensembles)有2.0 BLEU之多,建立了新的最佳BLEU score:28.4。這個模型的配置列在Table 3的最後一行。訓練在8張P100 GPUs上花了3.5天。就連我們的基本模型也超越所有先前發佈的模型與集成模型(ensembles),而且訓練成本還只是任何一個競爭模型的一小部份而已。
:::
:::info
Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
![image](https://hackmd.io/_uploads/S19vAUyoa.png)
:::
:::info
On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop}=0.1$, instead of $0.3$.
:::
:::success
在WMT 2014 English-to-French任務上,大型模型的部份(big model)得到41.0的BLEU score,也是優於所有先前所發佈的單一模型,而且訓練成本不到他們的1/4。English-to-French的模型使用dropout為$P_{drop}=0.1$,而非$0.3$
:::
:::info
For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty $\alpha=0.6$ [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38].
:::
:::success
基本模型的部份,我們使用最後五個檢查點(checkpoints)做平均得到單一模型(signle model),這五個檢查點是以10分鐘的間隔來寫入。大型模型的部份,我們平均最後20個檢查點。我們使用[定向搜索](https://terms.naer.edu.tw/detail/e80a9c57842994528399bc9186a4b39c/)(beam search),beam size為4且length penalty $\alpha=0.6$。這些超參數是在開發集上做實驗後所選擇的。我們把推論期間的最大輸出長度設置為輸入長度+50,不過可以的話會盡可能的提早終止。
:::
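:::warning
「平均最後幾個檢查點」大致上就是把各檢查點的同名參數逐一取平均,下面是一個以dict表示參數的簡化示意;資料結構與函數名稱都是假設的,僅供說明。
```python
import numpy as np

def average_checkpoints(checkpoints):
    # checkpoints: list of {參數名稱: numpy array},對同名參數逐一取平均
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}

ckpts = [{"W": np.random.randn(4, 4), "b": np.random.randn(4)} for _ in range(5)]
averaged = average_checkpoints(ckpts)   # 得到單一模型的參數
```
:::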
:::info
Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
:::
:::success
Table 2總結我們的研究成果,並跟文獻中的其它模型架構比較翻譯品質與訓練成本。我們透過將訓練時間、所使用的GPU數量、以及每張GPU的持續單精度浮點運算能力(sustained single-precision floating-point capacity)估計值三者相乘,來估測訓練一個模型所用的浮點運算數。
:::
:::info
We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively
:::
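:::success
K80、K40、M40與P100分別使用2.8、3.7、6.0與9.5 TFLOPS的數值。
:::
:::warning
照著這個估算方式簡單算一下(示意用的例子):big model用8張P100訓練3.5天,大約是 $8 \times 9.5\times10^{12} \times 3.5 \times 86400 \approx 2.3\times10^{19}$ FLOPs,量級上與Table 2所列的big model訓練成本相符。
:::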
### 6.2 Model Variations
:::info
To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
:::
:::success
為了評估Transformer不同組件的重要性,我們以不同的方式來改變基本模型,看看在newstest2013開發集上的English-to-German翻譯效能上的改變。我們使用上一節所述的beam search,不過沒有做檢查點的平均。相關結果呈現於Table 3。
:::
:::info
In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
:::
:::success
Table 3中的rows (A),我們改變attention heads的數量以及attention key、value的維度,同時維持計算量不變,如Section 3.2.2所述。雖然single-head attention比最佳設定還要糟0.9 BLEU,不過太多head也會讓品質下降。
:::
:::info
In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
:::
:::success
Table 3中的rows (B),我們觀察到降低attention key的大小$d_k$會傷到模型品質。這說明著確定相容性是不容易的,而且比點積更複雜的相容性函數說不定是有好處的。我們進一步的在(C)、(D)觀察到,一如預期那般,模型愈大愈好,而且dropout對於避免過擬合非常有幫助。(E)的部份,我們用學習的位置嵌入(positional embedding)來取代sinusoidal positional encoding,我們觀察到,所產生的結果與基本模型幾無差異。
:::
### 6.3 English Constituency Parsing
:::info
To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
:::
:::success
為了評估Transformer是否可以泛化到其它任務,我們在English constituency parsing(英語成分句法剖析)上做了實驗。這個任務帶來特定的挑戰:輸出受到強烈的結構約束,而且明顯比輸入長。此外,RNN sequence-to-sequence models在小資料的情況下還沒有辦法得到最好的結果[37]。
:::
:::info
We trained a 4-layer transformer with $d_{model}=1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
:::
:::success
我們在Wall Street Journal (WSJ) portion of the Penn Treebank資料集上訓練一個4-layer $d_{model}=1024$的transformer,大概40K的訓練語句。我們還在一個semi-supervised的環境中訓練它(使用一個更大的高置信度與BerkleyParser語料庫,大約17M的語句)。我們WSJ使用16K tokens的詞彙,semi-supervised的環境中則是使用32K的詞彙。
:::
:::info
We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and $\alpha=0.3$ for both WSJ only and the semi-supervised setting.
:::
:::success
我們只有在Section 22開發集上做少量的實驗來選擇dropout、attention與residual(section 5.4)、learning rates與beam size,其它參數都跟English-to-German base translation model一樣維持不變。推論過程中,我們將最大輸出長度增加為輸入長度+300。對於WSJ only與semi-supervised中使用beam size=21與$\alpha=0.3$的設置。
:::
:::info
Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
:::
:::success
Table 4中的結果說明著,儘管沒有做一些特別的調整,我們的模型效能卻出乎意料之外的好,產生的結果比先前提過的所有模型都要來的好,除了Recurrent Neural Network Grammar。
:::
:::info
Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)
![image](https://hackmd.io/_uploads/rJL2C8ysT.png)
:::
:::info
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.
:::
:::success
跟RNN sequence-to-sequence models相比,儘管只有在40K的WSJ訓練集上訓練,Transformer還是優於BerkeleyParser。
:::
## 7 Conclusion
:::info
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
:::
:::success
在這個研究中,我們提出Transformer,第一個完全基於注意力的序列轉導模型,用multi-headed self-attention取代掉encoder-decoder架構中常見的recurrent layers。
:::
:::info
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
:::
:::success
對於翻譯任務而言,Transformer明顯訓練的比基於recurrent或convolutional layers的架構還要來的快。在WMT 2014 English-to-German與WMT 2014 English-to-French翻譯任務上,我們得到一個新的最佳值。前一個任務中,我們最好的那個模型甚至優於所有先前提過的集成模型。
:::
:::info
We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours.
:::
:::success
我們對於attention-based models的未來感到興奮異常,而且計劃將之應用於其它任務。我們計劃將Transformer擴展到涉及文字以外的輸入與輸出模態的問題,並研究局部的、受限的注意力機制,以便有效處理影像、音訊與視訊等大型輸入與輸出。讓生成過程不那麼依賴順序(less sequential)是我們的另一個研究目標。
:::
:::info
The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
:::
:::success
用來訓練跟評估模型的程式碼在https://github.com/tensorflow/tensor2tensor.
:::
:::info
**Acknowledgements** We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
:::
:::success
**致謝** 我們感謝Nal Kalchbrenner與Stephan Gouws富有成果的意見、指正與啟發。
:::