Paper Translation
deeplearning
nlp
Blocks are laid out as follows: original-text blocks have a blue background and translation blocks a green background. Some technical terms follow the glossary of the National Academy for Educational Research (now 樂詞網).
Original
Translation
If any part of the translation reads awkwardly, please leave a comment and point it out.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
The dominant sequence transduction models are all built on complex recurrent or convolutional neural networks, each wrapping an encoder and a decoder. The best-performing models also connect the encoder and decoder through an attention mechanism. We propose a new, simple network architecture, the Transformer, based entirely on attention mechanisms and doing away with recurrence and convolutions altogether. Experiments on two machine translation tasks show that these models are better in quality while being more parallelizable and taking significantly less time to train. Our model reaches 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, ensembles included, by more than 2 BLEU. On the WMT 2014 English-to-French translation task, after training for 3.5 days on eight GPUs, our model sets a new single-model state-of-the-art BLEU score of 41.8, at a small fraction of the training cost of the best models in the literature. We show that the Transformer generalizes well to other tasks by successfully applying it to English constituency parsing with both large and limited training data.
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent neural networks, and long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as the state-of-the-art approaches for sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since kept pushing the ceiling of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.
Recurrent models usually factor computation along the symbol positions of the input and output sequences. Aligning the positions with the steps in computation time, they generate a sequence of hidden states h_t, each a function of the previous hidden state h_{t-1} and the input at position t.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models for a variety of tasks; they allow dependencies to be modeled without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used together with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
In this work we propose the Transformer, a model architecture that avoids recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality (all it takes is eight P100 GPUs and as little as twelve hours of training).
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as the basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between the positions: linearly for ConvS2S and logarithmically for ByteNet. This makes it harder to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution caused by averaging attention-weighted positions, an effect we counteract with Multi-Head Attention (see Section 3.2).
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
End-to-end memory networks are based on a recurrent attention mechanism rather than sequence-aligned recurrence, and they have been shown to perform well on simple-language question answering and language modeling tasks.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
As far as we know, however, the Transformer is the first transduction model that relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution. In the sections that follow we describe the Transformer, motivate self-attention, and discuss its advantages over other models.
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
Most competitive neural sequence transduction models have an encoder-decoder structure. The encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n); given z, the decoder then generates the output sequence (y_1, ..., y_m) one symbol at a time. The model is auto-regressive at every step, consuming the previously generated symbols as extra input when producing the next one.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder, as shown in the left and right halves of Figure 1 respectively.
Figure 1: The Transformer - model architecture.
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Encoder: the encoder is a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism and the second is a simple, position-wise fully connected feed-forward network. Each of the two sub-layers is wrapped in a residual connection followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To make these residual connections possible, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
Decoder: the decoder is likewise a stack of N = 6 identical layers. On top of the two sub-layers of each encoder layer, the decoder inserts a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, each sub-layer is wrapped in a residual connection followed by layer normalization. The self-attention sub-layer in the decoder stack is also modified so that positions cannot attend to subsequent positions; this masking, together with the output embeddings being offset by one position, ensures that the prediction for position i can depend only on the known outputs at positions before i.
In other words: at first only the first token is visible and everything else is hidden, then the first and second tokens are visible, and so on. It feels a bit like dropout, except the hiding is ordered rather than random, as the sketch below illustrates.
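To make that note concrete, here is a minimal NumPy sketch of such an ordered mask (the helper names and the toy scores are ours, not from the paper): position i is only allowed to attend to positions up to i, which is enforced by setting the disallowed scores to a very large negative value before the softmax.

```python
import numpy as np

def subsequent_mask(size: int) -> np.ndarray:
    """Boolean mask: True where position i may attend to position j (j <= i)."""
    return np.tril(np.ones((size, size), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set disallowed positions to a very negative value so they get ~zero weight."""
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 4 decoder positions with random attention scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = masked_softmax(scores, subsequent_mask(4))
print(np.round(weights, 2))  # row i has non-zero weights only for columns 0..i
```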
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
An attention function can be thought of as mapping a query and a set of key-value pairs to an output, where the query, keys, values and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function between the query and the corresponding key.
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.
We call our particular attention layer "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension d_k and values of dimension d_v. We compute the dot product of the query with all keys, divide each by √d_k, and apply a softmax to obtain the weights on the values.
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as Attention(Q, K, V) = softmax(QK^T / √d_k) V.
In practice, we compute attention on a whole set of queries at once by packing them into a matrix Q; the keys and values are likewise packed into matrices K and V, and the output matrix is Attention(Q, K, V) = softmax(QK^T / √d_k) V.
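As a concrete reference for the matrix form above, here is a minimal NumPy sketch (the function and variable names are ours, not from the paper) of Attention(Q, K, V) = softmax(QK^T / √d_k) V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compatibility of each query with each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 64)), rng.normal(size=(5, 64)), rng.normal(size=(5, 64))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 64)
```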
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k.
The two most commonly used attention functions are additive attention and dot-product (multiplicative) attention. Apart from the scaling factor of 1/√d_k, dot-product attention is identical to our algorithm.
While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot-product attention without scaling for larger values of d_k. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√d_k.
Although the two mechanisms perform similarly for small d_k, additive attention outperforms unscaled dot-product attention when d_k is larger. We suspect that for large d_k the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients, so we scale the dot products by 1/√d_k to counteract this.
To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = Σ q_i k_i, has mean 0 and variance d_k.
Here is why the dot products get large: assume the components of q and k are independent random variables with mean 0 and variance 1; their dot product q · k = Σ q_i k_i then has mean 0 and variance d_k.
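A quick numerical sanity check of that argument (our own toy experiment, not from the paper): with unit-variance components, the sample variance of q · k grows roughly like d_k, and dividing by √d_k brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.normal(size=(20_000, d_k))  # components with mean 0, variance 1
    k = rng.normal(size=(20_000, d_k))
    dots = (q * k).sum(axis=1)
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
    # variance of q.k is about d_k; after scaling by 1/sqrt(d_k) it is about 1
```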
Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
Rather than running a single attention function over d_model-dimensional keys, values and queries, it turns out to be beneficial to linearly project the queries, keys and values h times, with different learned projections, down to d_k, d_k and d_v dimensions respectively. Attention is then performed in parallel on each of these projected versions, yielding d_v-dimensional outputs, which are concatenated and projected once more to give the final values, as shown in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), and where the projections are parameter matrices W_i^Q ∈ R^(d_model×d_k), W_i^K ∈ R^(d_model×d_k), W_i^V ∈ R^(d_model×d_v) and W^O ∈ R^(h·d_v×d_model).
Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this.
Here the projections are the parameter matrices W_i^Q ∈ R^(d_model×d_k), W_i^K ∈ R^(d_model×d_k), W_i^V ∈ R^(d_model×d_v) and W^O ∈ R^(h·d_v×d_model).
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
In this work we use h = 8 parallel attention layers, or heads, each with d_k = d_v = d_model / h = 64. Because each head works in a reduced dimension, the total computational cost is similar to that of single-head attention with the full dimensionality.
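A rough NumPy sketch of multi-head attention under the setting just described (h = 8, d_model = 512, d_k = d_v = 64); the random weights and helper names are ours, and biases are omitted for brevity.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention on already-projected matrices."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x_q, x_kv, params):
    """x_q: (n_q, d_model), x_kv: (n_kv, d_model); per-head and output projections in params."""
    heads = []
    for WQ, WK, WV in params["heads"]:            # each projects d_model -> d_k (or d_v)
        heads.append(attention(x_q @ WQ, x_kv @ WK, x_kv @ WV))
    return np.concatenate(heads, axis=-1) @ params["WO"]  # concat to h*d_v, project to d_model

d_model, h, d_k = 512, 8, 64
rng = np.random.default_rng(0)
params = {
    "heads": [tuple(rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3))
              for _ in range(h)],
    "WO": rng.normal(scale=(h * d_k) ** -0.5, size=(h * d_k, d_model)),
}
x = rng.normal(size=(10, d_model))                 # 10 positions
print(multi_head_attention(x, x, params).shape)    # (10, 512) -- the self-attention case
```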
The Transformer uses multi-head attention in three different ways:
The Transformer applies multi-head attention in three different ways:
In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.
Besides the attention sub-layers, every layer of our encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between.
Although the linear transformations are the same across positions, each layer uses its own parameters. Another way to describe this is as two convolutions with kernel size 1. The input and output dimensionality is d_model = 512 and the inner layer has dimensionality d_ff = 2048.
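A minimal sketch of this position-wise feed-forward sub-layer: two linear transformations with a ReLU in between, applied identically at every position. The sizes d_model = 512 and d_ff = 2048 are the base settings quoted above; the variable names are ours.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (n_positions, d_model). The same W1, b1, W2, b2 are applied at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation followed by ReLU
    return hidden @ W2 + b2                 # second linear transformation back to d_model

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model));    b2 = np.zeros(d_model)
x = rng.normal(size=(10, d_model))
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```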
Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by √d_model.
Like other sequence transduction models, we use learned embeddings to convert the input and output tokens into vectors of dimension d_model, and the usual learned linear transformation plus softmax to turn the decoder output into predicted next-token probabilities. In our model the two embedding layers and the pre-softmax linear transformation share the same weight matrix, and in the embedding layers those weights are multiplied by √d_model.
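A small sketch of that weight sharing (the vocabulary size and the names are made up for illustration): a single matrix serves both as the embedding table, scaled by √d_model at lookup time, and as the pre-softmax linear transformation.

```python
import numpy as np

d_model, vocab = 512, 1000
rng = np.random.default_rng(0)
shared = rng.normal(scale=d_model ** -0.5, size=(vocab, d_model))  # one matrix, two uses

def embed(token_ids):
    return shared[token_ids] * np.sqrt(d_model)   # embedding lookup, scaled by sqrt(d_model)

def output_logits(decoder_states):
    return decoder_states @ shared.T              # pre-softmax linear transformation

tokens = np.array([3, 17, 42])
print(embed(tokens).shape, output_logits(embed(tokens)).shape)  # (3, 512) (3, 1000)
```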
Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so that the two can be summed.
Because our model contains no recurrence and no convolution, we have to inject some information about the relative or absolute positions of the tokens in the sequence so that the model can use its order. To do this, we add "positional encodings" to the input embeddings at the bottom of the encoder and decoder stacks. The positional encodings have the same dimension d_model as the embeddings, so the two can simply be added together.
In this work, we use sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),
where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid, with wavelengths forming a geometric progression from 2π to 10000 · 2π.
In this work we use sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),
where pos is the position and i is the dimension; each dimension of the positional encoding thus corresponds to a sinusoid, with wavelengths forming a geometric progression from 2π to 10000 · 2π.
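A NumPy sketch of the sinusoidal encoding above (names are ours): even dimensions use sine and odd dimensions use cosine, with the wavelengths forming the geometric progression just described.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512); this is added position-wise to the input embeddings
```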
We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
We also experimented with learned positional embeddings instead, and found that the two versions produce nearly identical results (see Table 3, row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those encountered during training.
Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x_1, ..., x_n) to another sequence of equal length (z_1, ..., z_n), with x_i, z_i ∈ R^d, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.
In this section we compare self-attention layers, in several respects, with the recurrent and convolutional layers commonly used to map one variable-length sequence of symbol representations (x_1, ..., x_n) to another sequence of equal length (z_1, ..., z_n), with x_i, z_i ∈ R^d, such as a hidden layer in a typical sequence transduction encoder or decoder. To motivate our use of self-attention, we consider three desiderata.
One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, measured by the minimum number of sequential operations required.
The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks, and one key factor affecting the ability to learn them is the length of the paths that forward and backward signals have to traverse in the network; the shorter the path between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. So we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
As Table 1 notes, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer needs O(n) sequential operations.
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types.
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log_k(n)) in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to O(k · n · d + n · d²).
A single convolutional layer with kernel width k < n does not connect every pair of input and output positions; doing so takes a stack of O(n/k) convolutional layers for contiguous kernels, or O(log_k(n)) for dilated convolutions, which lengthens the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers by a factor of k, although separable convolutions reduce the complexity considerably, to O(k · n · d + n · d²).
As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
As a side benefit, self-attention can yield more interpretable models. We inspect the attention distributions of our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many also appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
This section describes the training regime for our models.
This section walks through the training regime for our models.
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
We trained on the standard WMT 2014 English-German dataset, which contains about 4.5 million sentence pairs. Sentences were encoded with byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French we used the significantly larger WMT 2014 English-French dataset of 36 million sentences, splitting tokens into a 32000 word-piece vocabulary. Sentence pairs of roughly the same length were batched together; each training batch held a set of sentence pairs with about 25000 source tokens and 25000 target tokens.
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of Table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
We trained our models on one machine with 8 NVIDIA P100 GPUs. With the hyperparameters described throughout the paper, each training step of the base model took about 0.4 seconds, and the base models were trained for a total of 100,000 steps, i.e. 12 hours. For the big models (the ones described in the bottom line of Table 3), each step took 1.0 second, and they were trained for 300,000 steps (3.5 days).
We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ε = 10^-9. We varied the learning rate over the course of training according to the formula lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5)).
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000.
We used the Adam optimizer with β1 = 0.9, β2 = 0.98 and ε = 10^-9, varying the learning rate over the course of training according to lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5)).
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps and then decreasing it in proportion to the inverse square root of the step number; we used warmup_steps = 4000.
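A tiny sketch of that schedule, directly implementing lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)):

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup for the first warmup_steps, then decay proportional to 1/sqrt(step)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 40000, 100000):
    print(step, round(transformer_lr(step), 6))
# peaks around step 4000 (roughly 7.0e-4 for d_model = 512), then falls off slowly
```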
We employ three types of regularization during training:
We employed three types of regularization during training:
(Translator's note) Perplexity is a metric for evaluating language models.
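One of the regularizers in the paper is label smoothing with ε_ls = 0.1, which hurts perplexity, since the model learns to be less certain, but improves accuracy and BLEU. A minimal sketch of building smoothed targets follows; spreading the mass uniformly over the other classes is our simplification of the idea, and the names are ours.

```python
import numpy as np

def smoothed_targets(labels, vocab_size, eps=0.1):
    """Replace one-hot targets with (1 - eps) on the true token and eps spread over the rest."""
    targets = np.full((len(labels), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smoothed_targets(np.array([2, 0]), vocab_size=5).round(3))
# each row sums to 1: 0.9 on the correct class, 0.025 on each of the other four
```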
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
On the WMT 2014 English-to-German translation task, the big Transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3; training took 3.5 days on 8 P100 GPUs. Even our base model beats all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.
On the WMT 2014 English-to-French task, our big model achieves a BLEU score of 41.0, again beating all previously published single models, at less than 1/4 of their training cost. The Transformer (big) model trained for English-to-French used a dropout rate of P_drop = 0.1 instead of 0.3.
For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6.
For the base models we used a single model obtained by averaging the last 5 checkpoints, written at 10-minute intervals; for the big models we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and a length penalty of α = 0.6.
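A small, framework-agnostic sketch of the checkpoint-averaging step; it assumes checkpoints are dictionaries of NumPy arrays with matching keys, which is our simplification.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of the parameters from the last few saved checkpoints."""
    keys = checkpoints[0].keys()
    return {k: np.mean([ckpt[k] for ckpt in checkpoints], axis=0) for k in keys}

# Toy example with 5 fake checkpoints of a single 2x2 weight matrix.
rng = np.random.default_rng(0)
ckpts = [{"w": rng.normal(size=(2, 2))} for _ in range(5)]
print(average_checkpoints(ckpts)["w"])
```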
Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
Table 2 summarizes our results and compares our translation quality and training costs with other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of each GPU's sustained single-precision floating-point capacity.
We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively
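As a worked example of that estimate, using the numbers stated in this section and our own arithmetic: 3.5 days on 8 P100s at 9.5 TFLOPS comes out to roughly 2.3 × 10^19 floating point operations.

```python
# Rough training-compute estimate: time * number of GPUs * sustained TFLOPS per GPU.
seconds = 3.5 * 24 * 3600   # 3.5 days of training
gpus = 8                    # P100 GPUs
flops_per_s = 9.5e12        # sustained single-precision throughput assumed for a P100
print(f"{seconds * gpus * flops_per_s:.2e} FLOPs")  # about 2.3e19
```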
To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
To evaluate the importance of the Transformer's different components, we varied our base model in different ways and measured the change in English-to-German translation performance on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. The results are shown in Table 3.
In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
In rows (A) of Table 3 we vary the number of attention heads and the attention key and value dimensions while keeping the amount of computation constant, as described in Section 3.2.2. Single-head attention is 0.9 BLEU worse than the best setting, but quality also drops off with too many heads.
In Table 3 rows (B), we observe that reducing the attention key size d_k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.
In rows (B) of Table 3 we observe that reducing the attention key size d_k hurts model quality, suggesting that determining compatibility is not easy and that a more sophisticated compatibility function than the dot product might be beneficial.
To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
To evaluate whether the Transformer can generalize to other tasks, we ran experiments on English constituency parsing. This task poses specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Moreover, RNN sequence-to-sequence models have not been able to reach state-of-the-art results in small-data regimes.
We trained a 4-layer transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank, about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora of approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
We trained a 4-layer Transformer with d_model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank, about 40K training sentences, and also in a semi-supervised setting using the larger high-confidence and BerkleyParser corpora of roughly 17M sentences. We used a 16K-token vocabulary for the WSJ-only setting and a 32K-token vocabulary for the semi-supervised setting.
We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set; all other parameters remained unchanged from the English-to-German base translation model. During inference, we increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting.
We ran only a small number of experiments on the Section 22 development set to choose the dropout (both attention and residual, section 5.4), learning rate and beam size; all other parameters stayed the same as in the English-to-German base translation model. During inference we increased the maximum output length to input length + 300, and used a beam size of 21 with α = 0.3 for both the WSJ-only and the semi-supervised settings.
Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
The results in Table 4 show that despite the lack of task-specific tuning, our model performs surprisingly well, yielding better results than all previously reported models except the Recurrent Neural Network Grammar.
Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)
In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.
In contrast to RNN sequence-to-sequence models, the Transformer outperforms the BerkeleyParser even when trained only on the 40K-sentence WSJ training set.
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
In this work we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks we achieve a new state of the art, and on the former task our best model outperforms even all previously reported ensembles.
We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.
We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text, and to investigate local, restricted attention mechanisms that can handle large inputs and outputs such as images, audio and video efficiently. Making generation less sequential is another of our research goals.
The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
The code used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.