論文翻譯
deeplearning
排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面
原文
繁體中文
照片或表格
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
神經網路吸收資訊的能力受限於其參數數量。條件計算(conditional computation),即針對每個樣本只啟用網路的一部分,在理論上被提出作為大幅提升模型容量、而不使計算量成比例增加的方法。然而在實務上,存在顯著的演算法與效能挑戰。在本研究中,我們解決了這些挑戰,最終實現了條件計算的承諾:在現代 GPU 叢集上,模型容量提升超過 1000 倍,而計算效率僅有少量損失。
我們引入了稀疏門控混合專家(MoE)層,包含最多數千個前饋子網路。一個可訓練的門控網絡決定每個樣本應使用哪些專家的稀疏組合。我們將 MoE 應用於語言建模和機器翻譯任務,在這些任務中,模型容量對於吸收海量訓練語料中的知識至關重要。
我們提出了模型架構,其中一個包含最多 1370 億參數的 MoE 以卷積方式應用於堆疊的 LSTM 層之間。在大型語言建模和機器翻譯基準測試中,這些模型以更低的計算成本取得了顯著優於目前最先進技術的結果。
Exploiting scale in both training data and model size has been central to the success of deep learning. When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy. This has been shown in domains such as text (Sutskever et al., 2014; Bahdanau et al., 2014; Jozefowicz et al., 2016; Wu et al., 2016), images (Krizhevsky et al., 2012; Le et al., 2012), and audio (Hinton et al., 2012; Amodei et al., 2015). For typical deep learning models, where the entire model is activated for every example, this leads to a roughly quadratic blow-up in training costs, as both the model size and the number of training examples increase. Unfortunately, the advances in computing power and distributed computation fall short of meeting such demand.
深度學習成功的關鍵在於充分利用訓練數據與模型規模。當數據集足夠龐大時,增加神經網絡的容量(參數數量)可以獲得更好的預測準確度。這已在文本(Sutskever 等人,2014;Bahdanau 等人,2014;Jozefowicz 等人,2016;Wu 等人,2016)、圖像(Krizhevsky 等人,2012;Le 等人,2012)和語音(Hinton 等人,2012;Amodei 等人,2015)等領域得到證實。對於典型的深度學習模型,每個樣本都會啟用整個模型;隨著模型規模與訓練樣本數同時增加,訓練成本大致呈二次方增長。不幸的是,計算能力與分散式運算的進步仍不足以滿足這樣的需求。
Various forms of conditional computation have been proposed as a way to increase model capacity without a proportional increase in computational costs (Davis & Arel, 2013; Bengio et al., 2013; Eigen et al., 2013; Ludovic Denoyer, 2014; Cho & Bengio, 2014; Bengio et al., 2015; Almahairi et al., 2015). In these schemes, large parts of a network are active or inactive on a per-example basis. The gating decisions may be binary or sparse and continuous, stochastic or deterministic. Various forms of reinforcement learning and back-propagation are proposed for training the gating decisions.
各種形式的條件計算被提出,作為在不使計算成本成比例增加的情況下提高模型容量的方法(Davis & Arel,2013;Bengio 等人,2013;Eigen 等人,2013;Ludovic Denoyer,2014;Cho & Bengio,2014;Bengio 等人,2015;Almahairi 等人,2015)。在這些方案中,網路的大部分會依每個樣本而啟用或停用。閘控決策可能是二元的,或是稀疏且連續的;可能是隨機的,也可能是確定性的。各種形式的強化學習和反向傳播被提出用於訓練閘控決策。
Figure 1: A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.
圖 1:嵌入於循環語言模型中的混合專家 (MoE) 層。在此情況下,稀疏門控函數選擇兩個專家進行計算,它們的輸出由門控網路的輸出加以調節。
While these ideas are promising in theory, no work to date has yet demonstrated massive improvements in model capacity, training time, or model quality. We blame this on a combination of the following challenges:
雖然這些想法在理論上很有前景,但迄今尚未有研究展示模型容量、訓練時間或模型品質的大幅改進。我們認為這是由以下幾項挑戰共同造成的:
現代計算設備(特別是 GPU)執行算術運算遠比執行分支快得多。上述大多數研究都意識到這一點,並提出在每次門控決策時整塊地開啟/關閉網路的大片區域。
大批次對效能至關重要,因為它們可以攤銷參數傳輸和更新的成本。條件計算會使網路中條件式啟用的部分所獲得的批次變小。
網路頻寬可能成為瓶頸。GPU 叢集的計算能力可能是設備間總網路頻寬的數千倍。為了保持計算效率,演算法的計算需求與網路需求之比必須超過這個比值。嵌入層可視為一種條件計算,正是受此問題所限:由於嵌入通常需要透過網路傳送,(樣本、參數)交互的次數受限於網路頻寬而非計算能力。
依據不同的方案,可能需要額外的損失項才能達到所需的每區塊及/或每樣本稀疏程度。Bengio 等人(2015)使用了三個這樣的損失項。這些問題會影響模型品質和負載平衡。
模型容量對處理極大量數據集來說至關重要。現有關於條件計算的文獻主要關注相對較小的圖像識別數據集,這些數據集最多包含 600,000 張圖像。難以想像僅憑這些圖像標籤就能提供足夠信號來充分訓練具有數百萬甚至數十億參數的模型。
In this work, we for the first time address all of the above challenges and finally realize the promise of conditional computation. We obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets.
在本研究中,我們首次解決上述所有挑戰,最終實現條件計算的承諾。我們的模型容量提升了超過 1000 倍,同時計算效率僅略有下降,在公開語言建模和翻譯數據集上的成果顯著超越目前最先進水平。
Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input (see Figure 1). All parts of the network are trained jointly by back-propagation.
我們提出的條件計算方法是引入一種新型通用神經網絡模組:稀疏門控混合專家層(MoE)。MoE 由多個專家組成,每個專家為一個簡單的前饋神經網絡,以及一個可訓練的門控網絡,該網絡選擇稀疏組合的專家處理每個輸入(見圖 1)。整個網絡的所有部分都通過反向傳播共同訓練。
While the introduced technique is generic, in this paper we focus on language modeling and machine translation tasks, which are known to benefit from very large models. In particular, we apply a MoE convolutionally between stacked LSTM layers (Hochreiter & Schmidhuber, 1997), as in Figure 1. The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position. The different experts tend to become highly specialized based on syntax and semantics (see Appendix E Table 9). On both language modeling and machine translation benchmarks, we improve on best published results at a fraction of the computational cost.
儘管所提出的技術具有通用性,本文仍專注於語言建模和機器翻譯任務,這些任務已知能從極大型模型中獲益。特別是,我們以卷積方式在堆疊的 LSTM 層之間(Hochreiter & Schmidhuber, 1997)套用 MoE,如圖 1 所示。MoE 在文本的每個位置被呼叫一次,並可能在每個位置選擇不同的專家組合。不同的專家傾向於依語法和語義高度專業化(見附錄 E 表 9)。在語言建模和機器翻譯基準測試上,我們以遠低於先前的計算成本改進了已發表的最佳結果。
Since its introduction more than two decades ago (Jacobs et al., 1991; Jordan & Jacobs, 1994), the mixture-of-experts approach has been the subject of much research. Different types of expert architectures have been proposed such as SVMs (Collobert et al., 2002), Gaussian Processes (Tresp, 2001; Theis & Bethge, 2015; Deisenroth & Ng, 2015), Dirichlet Processes (Shahbaba & Neal, 2009), and deep networks. Other work has focused on different expert configurations such as a hierarchical structure (Yao et al., 2009), infinite numbers of experts (Rasmussen & Ghahramani, 2002), and adding experts sequentially (Aljundi et al., 2016). Garmash & Monz (2016) suggest an ensemble model in the format of mixture of experts for machine translation. The gating network is trained on a pre-trained ensemble NMT model.
自二十多年前被提出以來(Jacobs 等人,1991;Jordan & Jacobs,1994),混合專家方法一直是許多研究的主題。研究者提出了不同類型的專家架構,例如支持向量機 (SVM)(Collobert 等人,2002)、高斯過程(Tresp,2001;Theis & Bethge,2015;Deisenroth & Ng,2015)、狄利克雷過程(Shahbaba & Neal,2009)和深度網路。其他研究關注不同的專家配置,例如階層結構(Yao 等人,2009)、無限多個專家(Rasmussen & Ghahramani,2002)以及逐一加入專家(Aljundi 等人,2016)。Garmash & Monz (2016) 提出了一種以混合專家形式呈現的集成模型用於機器翻譯,其門控網路是在預先訓練好的集成 NMT 模型上訓練的。
The works above concern top-level mixtures of experts. The mixture of experts is the whole model. Eigen et al. (2013) introduce the idea of using multiple MoEs with their own gating networks as parts of a deep model. It is intuitive that the latter approach is more powerful, since complex problems may contain many sub-problems each requiring different experts. They also allude in their conclusion to the potential to introduce sparsity, turning MoEs into a vehicle for computational efficiency.
以上研究重點關注頂層混合專家模型。混合專家模型是整個模型。Eigen 等人 (2013) 引入了使用多個具有自身門控網路的混合專家模型作為深度模型一部分的想法。顯然,後者的方法更加有力,因為複雜問題可能包含許多子問題,每個子問題都需要不同的專家來處理。他們在結論中也暗示了引入稀疏性的潛力,將混合專家模型轉化為一種計算效率更高的工具。
Our work builds on this use of MoEs as a general purpose neural network component. While Eigen et al. (2013) uses two stacked MoEs allowing for two sets of gating decisions, our convolutional application of the MoE allows for different gating decisions at each position in the text. We also realize sparse gating and demonstrate its use as a practical way to massively increase model capacity.
我們的工作基於將 MoE 用作通用神經網路組件的這種想法。Eigen 等人(2013)使用兩個堆疊的 MoE,允許兩組門控決策;而我們以卷積方式應用 MoE,允許在文本的每個位置做出不同的門控決策。我們還實現了稀疏門控,並展示其作為大規模提升模型容量的一種實用方法。
The Mixture-of-Experts (MoE) layer consists of a set of n "expert networks" E_1, ..., E_n, and a "gating network" G whose output is a sparse n-dimensional vector. Figure 1 shows an overview of the MoE module. The experts are themselves neural networks, each with their own parameters. Although in principle we only require that the experts accept the same sized inputs and produce the same-sized outputs, in our initial investigations in this paper, we restrict ourselves to the case where the models are feed-forward networks with identical architectures, but with separate parameters.
混合專家 (MoE) 層由一組 n 個“專家網絡” E1、···、En 和一個“門控網絡” G 組成,其輸出為一個稀疏的 n 維向量。圖 1 展示了 MoE 模塊的概述。每個專家都是一個神經網絡,擁有各自的參數。雖然原則上我們只需確保所有專家接受相同大小的輸入並產生相同大小的輸出,但在本研究中,我們將模型限制在具有相同架構但參數獨立的前饋網絡上。
Let us denote by G(x) and E_i(x) the output of the gating network and the output of the i-th expert network for a given input x. The output y of the MoE module can be written as follows:
令 G(x) 與 E_i(x) 分別表示門控網路與第 i 個專家網路對給定輸入 x 的輸出。MoE 模組的輸出 y 可以表示如下:
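依上述定義,MoE 的輸出可寫為(依文字敘述重建的式子):

$$ y = \sum_{i=1}^{n} G(x)_i \, E_i(x) $$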
We save computation based on the sparsity of the output of G(x). Wherever G(x)_i = 0, we need not compute E_i(x). In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them for every example. If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of "experts", each of which is itself a secondary mixture-of-experts with its own gating network. In the following we focus on ordinary MoEs. We provide more details on hierarchical MoEs in Appendix B.
基於 G(x) 輸出的稀疏性,我們可以節省計算量:只要 G(x)_i = 0,就不需要計算 E_i(x)。在我們的實驗中,最多使用了數千個專家,但每個樣本只需要評估其中少數幾個專家。如果專家的數量非常大,我們可以使用兩級分層式 MoE 來減少分支因子。在分層式 MoE 中,一個主門控網路選擇「專家」的稀疏加權組合,而每個「專家」本身又是具有自身門控網路的次級混合專家模型。以下將重點關注普通的 MoE;關於分層式 MoE 的更多詳細資訊請見附錄 B。
Our implementation is related to other models of conditional computation. A MoE whose experts are simple weight matrices is similar to the parameterized weight matrix proposed in (Cho & Bengio, 2014). A MoE whose experts have one hidden layer is similar to the block-wise dropout described in (Bengio et al., 2015), where the dropped-out layer is sandwiched between fully-activated layers.
我們的實作與其他條件計算模型相關。以簡單權重矩陣作為專家的 MoE,類似於 (Cho & Bengio, 2014) 中提出的參數化權重矩陣;而專家具有單一隱藏層的 MoE,則類似於 (Bengio et al., 2015) 中描述的分塊 dropout,其中被丟棄的層夾在完全啟用的層之間。
Softmax Gating: A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix W_g and then apply the Softmax function.
Softmax Gating:一種簡單的非稀疏門控函數選擇(Jordan & Jacobs,1994)是將輸入乘以可訓練權重矩陣 Wg,然後應用 softmax 函數。
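依此描述,非稀疏的 Softmax 門控即為(以 G_σ 表示非稀疏版本的門控):

$$ G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g) $$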
Noisy Top-K Gating: We add two components to the Softmax gating network: sparsity and noise. Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to -∞ (which causes the corresponding gate values to equal 0). The sparsity serves to save computation, as described above. While this form of sparsity creates some theoretically scary discontinuities in the output of the gating function, we have not yet observed this to be a problem in practice. The noise term helps with load balancing, as will be discussed in Appendix A. The amount of noise per component is controlled by a second trainable weight matrix W_noise.
帶噪聲的 Top-K 門控(Noisy Top-K Gating):我們在 Softmax 門控網路中加入兩個成分:稀疏性和噪聲。在計算 Softmax 函數之前,我們加入可調的高斯噪聲,然後只保留前 k 個值,其餘設為 -∞(這使得相應的門控值等於 0)。如上所述,這種稀疏性有助於節省計算量。儘管這種形式的稀疏性理論上會在門控函數的輸出中造成不連續,但我們在實踐中尚未觀察到此問題。噪聲項有助於負載平衡,詳見附錄 A。每個成分的噪聲量由第二個可訓練權重矩陣 W_noise 控制。
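依本段敘述,此門控機制可整理為以下形式(符號沿用上文;此為依文字描述重建的式子):

$$ G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k)) $$

$$ H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x \cdot W_{noise})_i) $$

$$ \mathrm{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{若 } v_i \text{ 位於 } v \text{ 的前 } k \text{ 大值之中} \\ -\infty & \text{否則} \end{cases} $$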
Training the Gating Network We train the gating network by simple back-propagation, along with the rest of the model. If we choose k > 1 , the gate values for the top k experts have nonzero derivatives with respect to the weights of the gating network. This type of occasionally-sensitive behavior is described in (Bengio et al., 2013) with respect to noisy rectifiers. Gradients also backpropagate through the gating network to its inputs. Our method differs here from (Bengio et al., 2015) who use boolean gates and a REINFORCE-style approach to train the gating network.
訓練門控網路:我們透過簡單的反向傳播來訓練門控網路,與模型的其他部分一起訓練。如果我們選擇 k > 1,那麼前 k 個專家的門控值對門控網路權重的導數不為零。(Bengio et al., 2013) 在帶噪聲的整流器 (noisy rectifiers) 的脈絡下描述過這種偶爾敏感的行為。梯度也會經由門控網路反向傳播至其輸入。我們的方法在此與 (Bengio et al., 2015) 不同,後者使用布林閘和 REINFORCE 式方法來訓練門控網路。
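下面是一個依上述描述撰寫的最小 NumPy 示意(並非論文的原始實作;noisy_top_k_gating、W_g、W_noise 等名稱與數值皆為本文假設),用來說明帶噪聲的 Top-K 門控的前向計算:

```python
import numpy as np

def softplus(x):
    # 數值穩定的 softplus
    return np.logaddexp(0.0, x)

def noisy_top_k_gating(x, W_g, W_noise, k, train=True):
    """帶噪聲的 Top-K 門控(示意版)。
    x: [batch, d] 輸入;W_g, W_noise: [d, n] 可訓練矩陣;回傳 [batch, n] 稀疏門控權重。"""
    clean_logits = x @ W_g
    if train:
        noise_std = softplus(x @ W_noise)                 # 每個分量的噪聲幅度
        logits = clean_logits + np.random.randn(*clean_logits.shape) * noise_std
    else:
        logits = clean_logits
    # 只保留每列前 k 大的 logits,其餘設為 -inf(softmax 後即為 0)
    kth = np.sort(logits, axis=-1)[:, -k][:, None]
    topk_logits = np.where(logits >= kth, logits, -np.inf)
    z = topk_logits - topk_logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 使用範例(假設值):4 個樣本、512 維輸入、16 個專家、每個樣本選 4 個專家
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
W_g = np.zeros((512, 16))        # 附錄 A 建議初始化為零以利負載平衡
W_noise = np.zeros((512, 16))
gates = noisy_top_k_gating(x, W_g, W_noise, k=4)
print((gates > 0).sum(axis=-1))  # 每個樣本恰有 4 個非零門控值
```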
On modern CPUs and GPUs, large batch sizes are necessary for computational efficiency, so as to amortize the overhead of parameter loads and updates. If the gating network chooses k out of n experts for each example, then for a batch of b examples, each expert receives a much smaller batch of approximately kb/n ≪ b examples. This causes a naive MoE implementation to become very inefficient as the number of experts increases. The solution to this shrinking batch problem is to make the original batch size as large as possible. However, batch size tends to be limited by the memory necessary to store activations between the forwards and backwards passes. We propose the following techniques for increasing the batch size:
在現代 CPU 和 GPU 上,為了提高計算效率,需要使用較大的批次大小,以便攤銷參數載入和更新的開銷。如果門控網路為每個樣本從 n 個專家中選擇 k 個,那麼對於一個包含 b 個樣本的批次來說,每個專家接收到的批次大小只有約 kb/n ≪ b。這使得 naive 的 MoE 實作隨著專家數量增加而變得非常低效。解決此「批次縮小」問題的方法是使原始批次大小儘可能大。然而,批次大小往往受到儲存正向與反向傳播之間激活值所需記憶體的限制。我們提出以下技術來增加批次大小:
Mixing Data Parallelism and Model Parallelism: In a conventional distributed training setting, multiple copies of the model on different devices asynchronously process distinct batches of data, and parameters are synchronized through a set of parameter servers. In our technique, these different batches run synchronously so that they can be combined for the MoE layer. We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples. Thus, we achieve a factor of d improvement in expert batch size.
混合資料並行與模型並行:在傳統的分散式訓練環境中,不同設備上的多個模型副本非同步地處理不同的資料批次,參數則透過一組參數伺服器同步。在我們的技術中,這些不同的批次同步執行,因此可以在 MoE 層被合併。我們按照傳統的資料並行方案分佈模型的標準層和門控網路,但每個專家只保留一份共享副本。MoE 層中的每個專家接收一個合併批次,由所有資料並行輸入批次中的相關樣本組成。同一組設備既作為資料並行副本(用於標準層和門控網路),也作為模型並行分片(每片託管一部分專家)。如果模型分佈在 d 個設備上,且每個設備處理大小為 b 的批次,那麼每個專家接收的批次大小約為 kbd/n。因此,我們使專家批次大小提升了 d 倍。
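以下用一個簡單的數值例子(所有數值皆為本文假設)說明此技術對每個專家批次大小的影響:

```python
# 假設:每台設備批次 b=1024、每個樣本選 k=4 個專家、共 n=256 個專家
b, k, n = 1024, 4, 256
print(k * b // n)        # 單台設備的 naive 實作:每個專家約 16 個樣本

# 結合資料並行與模型並行:d=16 台設備的批次合併後一起送入各專家
d = 16
print(k * b * d // n)    # 每個專家約 256 個樣本,專家批次放大 d 倍
```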
In the case of a hierarchical MoE (Section B), the primary gating network employs data parallelism, and the secondary MoEs employ model parallelism. Each secondary MoE resides on one device.
在階層式 MoE(附錄 B)的情況下,主門控網路採用資料並行,次級 MoE 則採用模型並行。每個次級 MoE 駐留在一台設備上。
This technique allows us to increase the number of experts (and hence the number of parameters) by proportionally increasing the number of devices in the training cluster. The total batch size increases, keeping the batch size per expert constant. The memory and bandwidth requirements per device also remain constant, as do the step times, as does the amount of time necessary to process a number of training examples equal to the number of parameters in the model. It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware.
這種技術讓我們能夠藉由按比例增加訓練叢集中的設備數量,來增加專家數量(也就是參數數量)。總批次大小隨之增加,而每個專家的批次大小保持不變。每台設備的記憶體和頻寬需求、單步時間,以及處理與模型參數數量相等的訓練樣本所需的時間,也都維持不變。我們的目標是在一個兆詞語料庫上訓練一個兆參數模型。截至撰寫本文時,我們尚未將系統擴展到這個規模,但透過增加更多硬體應該是可行的。
Taking Advantage of Convolutionality: In our language models, we apply the same MoE to each time step of the previous layer. If we wait for the previous layer to finish, we can apply the MoE to all the time steps together as one big batch. Doing so increases the size of the input batch to the MoE layer by a factor of the number of unrolled time steps.
充分利用卷積特性:在我們的語言模型中,我們對前一層的每個時間步都套用相同的 MoE。如果等待前一層全部算完,就可以將所有時間步合併為一個大批次一起送入 MoE。這樣可以將 MoE 層的輸入批次大小放大為原來的未展開時間步數倍。
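此技巧的核心只是把時間維攤平進批次維;以下為一個最小的 NumPy 形狀示意(變數名稱與數值皆為本文假設):

```python
import numpy as np

batch, timesteps, d_model = 32, 20, 512
lstm_out = np.zeros((batch, timesteps, d_model))        # 前一層 LSTM 的全部時間步輸出

# 等前一層算完後,把 (batch, time) 攤平,一次送進 MoE
moe_in = lstm_out.reshape(batch * timesteps, d_model)   # MoE 看到 640 個樣本,而非 32 個
moe_out = moe_in                                        # 佔位:實際上為 MoE 的輸出,形狀相同
restored = moe_out.reshape(batch, timesteps, d_model)   # 還原回序列形狀
```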
Increasing Batch Size for a Recurrent MoE: We suspect that even more powerful models may involve applying a MoE recurrently. For example, the weight matrices of a LSTM or other RNN could be replaced by a MoE. Sadly, such models break the convolutional trick from the last paragraph, since the input to the MoE at one timestep depends on the output of the MoE at the previous timestep. Gruslys et al. (2016) describe a technique for drastically reducing the number of stored activations in an unrolled RNN, at the cost of recomputing forward activations. This would allow for a large increase in batch size.
增大遞迴式 MoE 的批次大小:我們推測,更強大的模型可能涉及遞迴地套用 MoE。例如,可以將 LSTM 或其他 RNN 的權重矩陣替換為 MoE。遺憾的是,此類模型無法使用上一段提到的卷積技巧,因為某個時間步的 MoE 輸入取決於前一個時間步的 MoE 輸出。Gruslys 等人 (2016) 描述了一種技術,能大幅減少展開 RNN 中需要儲存的激活值數量,代價是重新計算正向激活值。這將允許大幅增加批次大小。
Another major performance concern in distributed computing is network bandwidth. Since the experts are stationary (see above) and the number of gating parameters is small, most of the communication involves sending the inputs and outputs of the experts across the network. To maintain computational efficiency, the ratio of an expert's computation to the size of its input and output must exceed the ratio of computational to network capacity of the computing device. For GPUs, this may be thousands to one. In our experiments, we use experts with one hidden layer containing thousands of RELU-activated units. Since the weight matrices in the expert have sizes input_size × hidden_size and hidden_size × output_size, the ratio of computation to input and output is equal to the size of the hidden layer. Conveniently, we can increase computational efficiency simply by using a larger hidden layer, or more hidden layers.
在分散式計算中,另一個重要的效能問題是網路頻寬。由於專家是固定不動的(見上文),且門控參數數量很少,大部分的通訊是跨網路傳送專家的輸入與輸出。為了維持計算效率,專家的計算量與其輸入輸出大小之比,必須超過計算設備的計算能力與網路容量之比;對 GPU 而言,這個比值可能高達數千比一。在我們的實驗中,專家具有一個包含數千個 ReLU 激活單元的隱藏層。由於專家中的權重矩陣大小為 input_size × hidden_size 和 hidden_size × output_size,計算量與輸入輸出大小的比值等於隱藏層的大小。方便的是,我們只需使用更大的隱藏層或更多的隱藏層,就能提高計算效率。
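以本文的單隱藏層專家為例(忽略偏置項,並以乘加次數計算),計算量與輸入輸出大小的比值為:

$$ \frac{d_{in} \cdot h + h \cdot d_{out}}{d_{in} + d_{out}} = h $$

其中 d_in、d_out 為專家的輸入輸出維度,h 為隱藏層大小;因此加大 h 即可提高此比值,與上文結論一致。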
We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network. Eigen et al. (2013) describe the same phenomenon, and use a hard constraint at the beginning of training to avoid this local minimum. Bengio et al. (2015) include a soft constraint on the batch-wise average of each gate. 1
我們觀察到,閘網絡通常收斂至一種狀態,它總是為少數特定專家賦予較大權重。這種失衡是自我加強的:受青睞的專家因為訓練速度更快而被閘網絡更頻繁地選擇,進而獲得更多的訓練資源。Eigen 等人 (2013) 描述了同樣現象,並在訓練初期使用硬約束來避免此局部最優解。Bengio 等人 (2015) 則採用軟約束來限制每個閘值在每批資料上的平均權重。
We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert. We define an additional loss L importance , which is added to the overall loss function for the model. This loss is equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor w importance . This additional loss encourages all experts to have equal importance.
我們採取軟性約束的方法。我們將一個專家相對於一批訓練樣本的重要性 (importance) 定義為該批次中此專家門控值的總和。我們定義一個額外的損失 L_importance,加入模型的整體損失函數中。這個損失等於重要性值集合的變異係數平方,乘以一個手動調整的比例因子 w_importance。這個額外的損失鼓勵所有專家具有相等的重要性。
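依本段定義,重要性與對應的損失可寫為(依文字敘述重建):

$$ \mathrm{Importance}(X) = \sum_{x \in X} G(x) $$

$$ L_{importance}(X) = w_{importance} \cdot CV(\mathrm{Importance}(X))^2 $$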
While this loss function can ensure equal importance, experts may still receive very different numbers of examples. For example, one expert may receive a few examples with large weights, and another may receive many examples with small weights. This can cause memory and performance problems on distributed hardware. To solve this problem, we introduce a second loss function, L load , which ensures balanced loads. Appendix A contains the definition of this function, along with experimental results.
儘管這種損失函數可以確保各專家的重要性相等,但各專家實際接收到的樣本數量仍可能差異很大。例如,一位專家可能收到少數權重很大的樣本,而另一位專家可能收到許多權重很小的樣本。這可能在分散式硬體上造成記憶體與效能問題。為解決此問題,我們引入了第二個損失函數 L_load,以確保負載均衡。附錄 A 包含此函數的定義以及實驗結果。
Dataset: This dataset, introduced by (Chelba et al., 2013) consists of shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words.
資料集:此資料集由 (Chelba et al., 2013) 提出,包含來自新聞文章、經過隨機打亂的不重複句子,總計約 8.29 億個詞,詞彙量為 793,471 個詞。
Previous State-of-the-Art: The best previously published results (Jozefowicz et al., 2016) use models consisting of one or more stacked Long Short-Term Memory (LSTM) layers (Hochreiter & Schmidhuber, 1997; Gers et al., 2000). The number of parameters in the LSTM layers of these models vary from 2 million to 151 million. Quality increases greatly with parameter count, as do computational costs. Results for these models form the top line of Figure 2-right.
先前最先進的結果:先前發表的最佳結果(Jozefowicz 等,2016)採用由一個或多個堆疊的長短期記憶 (LSTM) 層(Hochreiter & Schmidhuber,1997;Gers 等,2000)組成的模型。這些模型中 LSTM 層的參數數量介於 200 萬到 1.51 億之間。模型品質隨參數量的增加而大幅提升,但計算成本也隨之上升。這些模型的結果構成圖 2 右圖的最上方曲線。
MoE Models: Our models consist of two stacked LSTM layers with a MoE layer between them (see Figure 1). We vary the sizes of the layers and the number of experts. For full details on model architecture, training regimen, additional baselines and results, see Appendix C.
MoE 模型:我們的模型由兩個堆疊的 LSTM 層組成,中間夾著一個 MoE 層(見圖 1)。我們改變各層的規模和專家數量。關於模型架構、訓練方案、其他基線以及結果的完整細節,請參閱附錄 C。
Low Computation, Varied Capacity: To investigate the effects of adding capacity, we trained a series of MoE models all with roughly equal computational costs: about 8 million multiply-andadds per training example per timestep in the forwards pass, excluding the softmax layer. We call this metric (ops/timestep). We trained models with flat MoEs containing 4, 32, and 256 experts, and models with hierarchical MoEs containing 256, 1024, and 4096 experts. Each expert had about 1 million parameters. For all the MoE layers, 4 experts were active per input.
低運算量,多種容量:為了探究添加容量的影響,我們訓練了一系列 MoE 模型,這些模型在計算成本上大致相同:每個訓練樣本每步前向傳播約消耗 800 萬次乘法和加法(不包括 softmax 層)。我們稱這個指標為 ops/timestep。我們訓練了包含 4、32 和 256 個專家的扁平 MoE 模型,以及包含 256、1024 和 4096 個專家的分層 MoE 模型。每個專家約含有 100 萬個參數。對於所有 MoE 層,每次輸入都啟用 4 個專家。
The results of these models are shown in Figure 2-left. The model with 4 always-active experts performed (unsurprisingly) similarly to the computationally-matched baseline models, while the largest of the models (4096 experts) achieved an impressive 24% lower perplexity on the test set.
這些模型的結果如圖 2 左圖所示。具有 4 個始終啟用專家的模型(不出所料)與計算量相當的基線模型表現相似,而最大的模型(4096 個專家)在測試集上將困惑度降低了 24%,令人印象深刻。
Figure 2: Model comparison on 1-Billion-Word Language-Modeling Benchmark. On the left, we plot test perplexity as a function of model capacity for models with similar computational budgets of approximately 8-million-ops-per-timestep. On the right, we plot test perplexity as a function of computational budget. The top line represents the LSTM models from (Jozefowicz et al., 2016). The bottom line represents 4-billion parameter MoE models with different computational budgets.
圖 2:在 10 億詞語言建模基準測試上的模型比較。左圖顯示在計算預算相近(約每時間步 800 萬次運算)的情況下,測試困惑度隨模型容量的變化;右圖顯示測試困惑度隨計算預算的變化。最上方曲線代表 (Jozefowicz et al., 2016) 的 LSTM 模型;最下方曲線代表具有不同計算預算的 40 億參數 MoE 模型。
Varied Computation, High Capacity: In addition to the largest model from the previous section, we trained two more MoE models with similarly high capacity (4 billion parameters), but higher computation budgets. These models had larger LSTMs, and fewer but larger experts. Details can be found in Appendix C.2. Results of these three models form the bottom line of Figure 2-right. Table 1 compares the results of these models to the best previously-published result on this dataset. Even the fastest of these models beats the best published result (when controlling for the number of training epochs), despite requiring only 6% of the computation.
多樣化計算、高容量:除了上一節中的最大模型外,我們還訓練了兩個具有相近高容量(40 億參數)但計算預算更高的 MoE 模型。這些模型擁有更大的 LSTM,以及數量更少但規模更大的專家。詳細資訊請見附錄 C.2。這三個模型的結果構成圖 2 右圖的最下方曲線。表 1 將這些模型的結果與該資料集上先前最佳的公開結果進行比較。即使是其中最快的模型,(在控制訓練週期數的情況下)也超越了最佳公開結果,而其所需計算量僅為後者的 6%。
Table 1: Summary of high-capacity MoE-augmented models with varying computational budgets, vs. best previously published results (Jozefowicz et al., 2016). Details in Appendix C.
表 1:總結不同計算預算下的高容量 MoE 加強模型與先前最佳公開結果(Jozefowicz 等,2016 年)。詳情見附錄 C。
Model | Test Perplexity 10 epochs | Test Perplexity 100 epochs | #Parameters excluding embedding and softmax layers | ops/timestep | Training Time 10 epochs | TFLOPS/GPU
---|---|---|---|---|---|---
Best Published Results | 34.7 | 30.6 | 151 million | 151 million | 59 hours, 32 k40s | 1.09
Low-Budget MoE Model | 34.1 | | 4303 million | 8.9 million | 15 hours, 16 k40s | 0.74
Medium-Budget MoE Model | 31.3 | | 4313 million | 33.8 million | 17 hours, 32 k40s | 1.22
High-Budget MoE Model | 28.0 | | 4371 million | 142.7 million | 47 hours, 32 k40s | 1.56
Computational Efficiency: We trained our models using TensorFlow (Abadi et al., 2016) on clusters containing 16-32 Tesla K40 GPUs. For each of our models, we determine computational efficiency in TFLOPS/GPU by dividing the number of floating point operations required to process one training batch by the observed step time and the number of GPUs in the cluster. The operation counts used here are higher than the ones we report in our ops/timestep numbers in that we include the backwards pass, we include the importance-sampling-based training of the softmax layer, and we count a multiply-and-add as two separate operations. For all of our MoE models, the floating point operations involved in the experts represent between 37% and 46% of the total.
計算效率:我們使用 TensorFlow(Abadi 等人,2016)在包含 16 至 32 個 Tesla K40 GPU 的叢集上訓練模型。對於每個模型,我們將處理一個訓練批次所需的浮點運算次數,除以觀察到的單步時間與叢集中的 GPU 數量,得到以 TFLOPS/GPU 表示的計算效率。這裡使用的運算次數高於我們在 ops/timestep 中報告的數字,因為我們包含了反向傳播、基於重要性採樣的 softmax 層訓練,並且將一次乘加計為兩個獨立運算。對於我們所有的 MoE 模型,專家所涉及的浮點運算佔總量的 37% 至 46%。
For our baseline models with no MoE, observed computational efficiency ranged from 1.07-1.29 TFLOPS/GPU. For our low-computation MoE models, computation efficiency ranged from 0.74-0.90 TFLOPS/GPU, except for the 4-expert model which did not make full use of the available parallelism. Our highest-computation MoE model was more efficient at 1.56 TFLOPS/GPU, likely due to the larger matrices. These numbers represent a significant fraction of the theoretical maximum of 4.29 TFLOPS/GPU claimed by NVIDIA. Detailed results are in Appendix C, Table 7.
對於我們基線模型(不使用 MoE),觀察到的計算效率介於 1.07 到 1.29 TFLOPS/GPU 之間。我們的低運算量 MoE 模型,計算效率範圍為 0.74 至 0.90 TFLOPS/GPU,除了 4 個專家模型未充分利用可用的並行性外。我們的最高運算量 MoE 模型更有效率,達到 1.56 TFLOPS/GPU,這很可能是由於矩陣更大。這些數字代表了 NVIDIA 聲稱的理論最大值 4.29 TFLOPS/GPU 的很大一部分。詳細結果見附錄 C,表 7。
Figure 3: Language modeling on a 100 billion word corpus. Models have similar computational budgets (8 million ops/timestep).
圖 3:基於 1000 億個詞語的語料庫進行語言建模。模型具備相似的計算資源(每時間步 800 萬次運算)。
On the 1-billion-word corpus, adding additional capacity seems to produce diminishing returns as the number of parameters in the MoE layer exceeds 1 billion, as can be seen in Figure 2-left. We hypothesized that for a larger training set, even higher capacities would produce significant quality improvements.
在10億字詞語料上,根據圖表2左所示,隨著MoE層參數數量超過10億,增加容量似乎會產生遞減回報。我們假設對於更大的訓練集來說,更高容量能帶來顯著的品質提升。
We constructed a similar training set consisting of shuffled unique sentences from Google's internal news corpus, totalling roughly 100 billion words. Similarly to the previous section, we tested a series of models with similar computational costs of about 8 million ops/timestep. In addition to a baseline LSTM model, we trained models augmented with MoE layers containing 32, 256, 1024, 4096, 16384, 65536, and 131072 experts. This corresponds to up to 137 billion parameters in the MoE layer. Details on architecture, training, and results are given in Appendix D.
我們構建了一個類似的訓練集,由 Google 內部新聞語料庫中經過打亂的不重複句子組成,總計約 1000 億個詞。與前一節類似,我們測試了一系列計算成本約為每時間步 800 萬次運算的模型。除了基線 LSTM 模型外,我們還訓練了加入 MoE 層的模型,其專家數量分別為 32、256、1024、4096、16384、65536 和 131072,對應 MoE 層中最多 1370 億個參數。關於架構、訓練和結果的詳細資訊,請見附錄 D。
Results: Figure 3 shows test perplexity as a function of capacity after training on 10 billion words (top line) and 100 billion words (bottom line). When training over the full 100 billion words, test perplexity improves significantly up to 65536 experts (68 billion parameters), dropping 39% lower than the computationally matched baseline, but degrades at 131072 experts, possibly a result of too much sparsity. The widening gap between the two lines demonstrates (unsurprisingly) that increased model capacity helps more on larger training sets.
結果:圖 3 顯示在訓練 100 億個詞(上方曲線)與 1000 億個詞(下方曲線)之後,測試困惑度隨容量的變化。在完整的 1000 億個詞上訓練時,測試困惑度一路顯著改善,直到 65536 個專家(680 億個參數),比計算量相當的基線低 39%;但在 131072 個專家時變差,可能是稀疏程度過高所致。兩條曲線之間的差距逐漸擴大,(毫不意外地)顯示更大的模型容量對更大的訓練集幫助更明顯。
Even at 65536 experts (99.994% layer sparsity), computational efficiency for the model stays at a respectable 0.72 TFLOPS/GPU.
即使在 65,536 個專家模型(層稀疏度達 99.994%)的情況下,該模型的運算效率仍可達到 0.72 TFLOPS/GPU。
Model Architecture: Our model was a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decreased the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We inserted MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). Each MoE layer contained up to 2048 experts each with about two million parameters, adding a total of about 8 billion parameters to the models. Further details on model architecture, testing procedure and results can be found in Appendix E.
模型架構:我們的模型是 (Wu et al., 2016) 所述 GNMT 模型的修改版本。為減少計算量,我們將編碼器和解碼器的 LSTM 層數分別從 9 和 8 降低至 3 和 2。我們在編碼器(第 2 層與第 3 層之間)及解碼器(第 1 層與第 2 層之間)中插入了 MoE 層。每個 MoE 層最多包含 2048 個專家,每個專家約有 200 萬個參數,共為模型增加約 80 億個參數。更多關於模型架構、測試方法及結果的細節,請見附錄 E。
Datasets: We benchmarked our method on the WMT'14 En → Fr and En → De corpora, whose training sets have 36M sentence pairs and 5M sentence pairs, respectively. The experimental protocols were also similar to those in (Wu et al., 2016): newstest2014 was used as the test set to compare against previous work (Luong et al., 2015a; Zhou et al., 2016; Wu et al., 2016), while the combination of newstest2012 and newstest2013 was used as the development set. We also tested the same model on a Google's Production English to French data.
資料集:我們在 WMT'14 英文至法文和英文至德文語料庫上測試我們的方法,這兩個語料庫的訓練集分別包含 3600 萬個和 500 萬個句對。實驗方案與 (Wu 等人,2016) 相似:使用 newstest2014 作為測試集,與先前的工作進行比較(Luong 等人,2015a;Zhou 等人,2016;Wu 等人,2016),並將 newstest2012 和 newstest2013 合併作為開發集。我們還在 Google 生產環境的英文至法文資料上測試了相同的模型。
Table 2: Results on WMT'14 En → Fr newstest2014 (bold values represent best results). Table 3: Results on WMT'14 En → De newstest2014 (bold values represent best results).
表 2:WMT'14 英文 → 法文 newstest2014 結果(以粗體顯示最佳結果)。
表 3:WMT'14 英文 → 德文 newstest2014 結果(以粗體顯示最佳結果)。
Model | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time
---|---|---|---|---|---
MoE with 2048 Experts | 2.69 | 40.35 | 85M | 8.7B | 3 days/64 k40s
MoE with 2048 Experts (longer training) | 2.63 | 40.56 | 85M | 8.7B | 6 days/64 k40s
GNMT (Wu et al., 2016) | 2.79 | 39.22 | 214M | 278M | 6 days/96 k80s
GNMT+RL (Wu et al., 2016) | 2.96 | 39.92 | 214M | 278M | 6 days/96 k80s
PBMT (Durrani et al., 2014) | | 37.0 | | |
LSTM (6-layer) (Luong et al., 2015b) | | 31.5 | | |
LSTM (6-layer+PosUnk) (Luong et al., 2015b) | | 33.1 | | |
DeepAtt (Zhou et al., 2016) | | 37.7 | | |
DeepAtt+PosUnk (Zhou et al., 2016) | | 39.2 | | |
Model | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time
---|---|---|---|---|---
MoE with 2048 Experts | 4.64 | 26.03 | 85M | 8.7B | 1 day/64 k40s
GNMT (Wu et al., 2016) | 5.25 | 24.91 | 214M | 278M | 1 day/96 k80s
GNMT+RL (Wu et al., 2016) | 8.08 | 24.66 | 214M | 278M | 1 day/96 k80s
PBMT (Durrani et al., 2014) | | 20.7 | | |
DeepAtt (Zhou et al., 2016) | | 20.6 | | |

Table 4: Results on the Google Production En → Fr dataset (bold values represent best results).
表 4:Google Production En → Fr 資料集結果(粗體值表示最佳成果)。

Model | Eval Perplexity | Eval BLEU | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time
---|---|---|---|---|---|---|---
MoE with 2048 Experts | 2.6 | 37.27 | 2.69 | 36.57 | 85M | 8.7B | 1 day/64 k40s
GNMT (Wu et al., 2016) | 2.78 | 35.8 | 2.87 | 35.56 | 214M | 278M | 6 days/96 k80s
Results: Tables 2, 3, and 4 show the results of our largest models, compared with published results. Our approach achieved BLEU scores of 40.56 and 26.03 on the WMT'14 En → Fr and En → De benchmarks. As our models did not use RL refinement, these results constitute significant gains of 1.34 and 1.12 BLEU score on top of the strong baselines in (Wu et al., 2016). The perplexity scores are also better. 2 On the Google Production dataset, our model achieved 1.01 higher test BLEU score even after training for only one sixth of the time.
結果:表 2、3 和 4 顯示了我們最大模型的結果,並與已發表的成果進行比較。我們的方法在 WMT'14 En → Fr 和 En → De 基準測試上分別取得 40.56 和 26.03 的 BLEU 分數。由於我們的模型未使用 RL 微調,這些結果相對於 (Wu et al., 2016) 的強基線分別提升了 1.34 和 1.12 BLEU。困惑度分數也更好。在 Google Production 資料集上,即使訓練時間只有六分之一,我們的模型的測試 BLEU 分數仍高出 1.01。
Dataset: (Johnson et al., 2016) train a single GNMT (Wu et al., 2016) model on a very large combined dataset of twelve language pairs. Results are somewhat worse than those for 12 separately trained single-pair GNMT models. This is not surprising, given that the twelve models have 12 times the capacity and twelve times the aggregate training of the one model. We repeat this experiment with a single MoE-augmented model. See Appendix E for details on model architecture. We train our model on the same dataset as (Johnson et al., 2016) and process the same number of training examples (about 3 billion sentence pairs). Our training time was shorter due to the lower computational budget of our model.
資料集:(Johnson 等人,2016) 在一個由十二個語言對組成的超大型合併資料集上訓練單一的 GNMT (Wu 等人,2016) 模型。其結果略遜於十二個分別訓練的單語言對 GNMT 模型。這並不意外,因為那十二個模型合計擁有 12 倍的容量和 12 倍的總訓練量。我們使用單一的 MoE 增強模型重複此實驗。有關模型架構的詳細資訊,請參閱附錄 E。我們在與 (Johnson 等人,2016) 相同的資料集上訓練模型,並處理相同數量的訓練樣本(約 30 億個句對)。由於我們的模型計算預算較低,訓練時間較短。
Results: Results for the single-pair GNMT models, the multilingual GNMT model and the multilingual MoE model are given in Table 5. The MoE model achieves 19% lower perplexity on the dev set than the multilingual GNMT model. On BLEU score, the MoE model significantly beats the multilingual GNMT model on 11 of the 12 language pairs (by as much as 5.84 points), and even beats the monolingual GNMT models on 8 of 12 language pairs. The poor performance on English → Korean seems to be a result of severe overtraining, as for the rarer language pairs a small number of real examples were highly oversampled in the training corpus.
結果:單語言對 GNMT 模型、多語言 GNMT 模型和多語言 MoE 模型的結果列於表 5。MoE 模型在開發集上的困惑度比多語言 GNMT 模型低 19%。在 BLEU 分數方面,MoE 模型在 12 個語言對中的 11 個顯著優於多語言 GNMT 模型(最多高出 5.84 分),甚至在 12 個語言對中的 8 個上超越單語言 GNMT 模型。英文至韓文的表現欠佳似乎是嚴重過度訓練的結果:對於較罕見的語言對,訓練語料中少數真實樣本被嚴重過度取樣。
Table 5: Multilingual Machine Translation (bold values represent best results).
表 5:多語機器翻譯(粗體值代表最佳結果)。
| GNMT-Mono | GNMT-Multi | MoE-Multi | MoE-Multi vs. GNMT-Multi
---|---|---|---|---
Parameters | 278M / model | 278M | 8.7B |
ops/timestep | 212M | 212M | 102M |
training time, hardware | | 21 days, 96 k20s | 12 days, 64 k40s |
Perplexity (dev) | various | 4.14 | 3.35 | -19% |
French → English Test BLEU | 36.47 | 34.40 | 37.46 | +3.06 |
German → English Test BLEU | 31.77 | 31.17 | 34.80 | +3.63 |
Japanese → English Test BLEU | 23.41 | 21.62 | 25.91 | +4.29 |
Korean → English Test BLEU | 25.42 | 22.87 | 28.71 | +5.84 |
Portuguese → English Test BLEU | 44.40 | 42.53 | 46.13 | +3.60 |
Spanish → English Test BLEU | 38.00 | 36.04 | 39.39 | +3.35 |
English → French Test BLEU | 35.37 | 34.00 | 36.59 | +2.59 |
English → German Test BLEU | 26.43 | 23.15 | 24.53 | +1.38 |
English → Japanese Test BLEU | 23.66 | 21.10 | 22.78 | +1.68 |
English → Korean Test BLEU | 19.75 | 18.41 | 16.62 | -1.79 |
English → Portuguese Test BLEU | 38.40 | 37.35 | 37.90 | +0.55 |
English → Spanish Test BLEU | 34.50 | 34.25 | 36.21 | +1.96 |
This work is the first to demonstrate major wins from conditional computation in deep networks. We carefully identified the design considerations and challenges of conditional computing and addressed them with a combination of algorithmic and engineering solutions. While we focused on text, conditional computation may help in other domains as well, provided sufficiently large training sets. We look forward to seeing many novel implementations and applications of conditional computation in the years to come.
這項研究首次展示了條件計算在深度網路中帶來的重大效益。我們仔細釐清了條件計算的設計考量與挑戰,並以演算法和工程方案相結合的方式加以解決。雖然我們專注於文本,但只要訓練集足夠龐大,條件計算也可能對其他領域有所幫助。我們期待在未來幾年看到條件計算的許多新穎實作與應用。
We would like to thank all of the members of the Google Brain and Google Translate teams who helped us with this project, in particular Zhifeng Chen, Yonghui Wu, and Melvin Johnson. Thanks also to our anonymous ICLR reviewers for the helpful suggestions on making this paper better.
我們謹此感謝 Google Brain 和 Google 翻譯團隊中協助本專案的所有成員,特別是 Zhifeng Chen、Yonghui Wu 和 Melvin Johnson。也感謝匿名 ICLR 審稿人提出的寶貴建議,協助我們改進這篇論文。
As discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples. Unfortunately, the number of examples received by an expert is a discrete quantity, so it can not be used in back-propagation. Instead, we define a smooth estimator Load(X) of the number of examples assigned to each expert for a batch X of inputs. The smoothness allows us to back-propagate gradients through the estimator. This is the purpose of the noise term in the gating function. We define P(x, i) as the probability that G(x)_i is nonzero, given a new random choice of noise on element i, but keeping the already-sampled choices of noise on the other elements. To compute P(x, i), we note that the G(x)_i is nonzero if and only if H(x)_i is greater than the k-th-greatest element of H(x) excluding itself. The probability works out to be:
如第 4 節所述,為了實現負載均衡,我們需要定義一個額外的損失函數,鼓勵專家接收大致相同的訓練樣本數量。然而,每個專家接收的樣本數量是一個離散值,無法直接用於反向傳播。因此,我們定義了一個平滑估計器 Load(X),來估計每個專家在輸入批次 X 中分配到的樣本數量。這種平滑性允許我們透過估計器反向傳播梯度。這就是門控函數中噪聲項的作用。
我們定義 P(x, i) 為 G(x)_i 非零的機率,其中元素 i 的噪聲重新隨機抽取,而其他元素已抽樣的噪聲保持不變。為了計算 P(x, i),我們注意到 G(x)_i 非零,若且唯若 H(x)_i 大於 H(x) 中排除自身後的第 k 大元素。這個機率可寫為:
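(依上文定義重建的式子)

$$ P(x, i) = Pr\Big( (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x \cdot W_{noise})_i) > \mathrm{kth\_excluding}(H(x), k, i) \Big) $$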
Where kth_excluding(v, k, i) means the kth highest component of v, excluding component i. Simplifying, we get:
其中 kth_excluding(v, k, i) 表示 v 中排除第 i 個分量後的第 k 大分量。簡化後得到:
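(依前後文定義重建的簡化式)

$$ P(x, i) = \Phi\left( \frac{(x \cdot W_g)_i - \mathrm{kth\_excluding}(H(x), k, i)}{\mathrm{Softplus}((x \cdot W_{noise})_i)} \right) $$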
Where Φ is the CDF of the standard normal distribution.
其中Φ為標準常態分佈的累積分佈函數。
We can now define the load loss to be the square of the coefficient of variation of the load vector, multiplied by a hand-tuned scaling factor w load .
我們現在可以定義載荷損失為載荷向量的變異係數平方,乘以一個手工調整的比例因子 w_load。
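依此定義,負載估計與負載損失可寫為(依文字敘述重建):

$$ \mathrm{Load}(X)_i = \sum_{x \in X} P(x, i) $$

$$ L_{load}(X) = w_{load} \cdot CV(\mathrm{Load}(X))^2 $$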
Initial Load Imbalance: To avoid out-of-memory errors, we need to initialize the network in a state of approximately equal expert load (since the soft constraints need some time to work). To accomplish this, we initialize the matrices W_g and W_noise to all zeros, which yields no signal and some noise.
初始負載失衡:為了避免記憶體不足的錯誤,我們需要將網路初始化為各專家負載大致均等的狀態(因為軟性約束需要一些時間才能發揮作用)。為此,我們將矩陣 W_g 和 W_noise 初始化為全零,這樣不會產生任何訊號,只會產生一些噪聲。
Experiments: We trained a set of models with identical architecture (the MoE-256 model described in Appendix C), using different values of w importance and w load . We trained each model for 10 epochs, then measured perplexity on the test set. We also measured the coefficients of variation in Importance and Load , as well as ratio of the load on the most overloaded expert to the average load. This last value is significant for load balancing purposes on distributed hardware. All of these metrics were averaged over several training batches.
實驗:我們使用相同的架構(附錄 C 中描述的 MoE-256 模型)訓練了一組模型,分別設定不同的 w_importance 和 w_load 值。每個模型訓練 10 個 epochs 後,在測試集上測量困惑度。我們還測量了 Importance 和 Load 的變異係數,以及負載最高的專家之負載與平均負載的比值。最後這個值對於分散式硬體上的負載平衡相當重要。所有這些指標都在多個訓練批次上取平均。
Table 6: Experiments with different combinations of losses.
表 6:不同損失函數組合實驗結果。
w_importance | w_load | Test Perplexity | CV(Importance(X)) | CV(Load(X)) | max(Load(X)) / mean(Load(X))
---|---|---|---|---|---|
0 | 0 | 39.8 | 3.04 | 3.01 | 17.8 |
0.2 | 0 | 35.6 | 0.06 | 0.17 | 1.47 |
0 | 0.2 | 35.7 | 0.22 | 0.04 | 1.15 |
0.1 | 0.1 | 35.6 | 0.06 | 0.05 | 1.14 |
0.01 | 0.01 | 35.7 | 0.48 | 0.11 | 1.37 |
1 | 1 | 35.7 | 0.03 | 0.02 | 1.07 |
Results: Results are reported in Table 6. All the combinations containing at least one of the two losses led to very similar model quality, where having no loss was much worse. Models with higher values of w_load had lower loads on the most overloaded expert.
結果:結果列於表 6。所有至少包含兩種損失之一的組合,模型品質都非常相似;完全不加損失則差得多。w_load 值較高的模型,其負載最高的專家的負載也較低。
If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of "experts", each of which is itself a secondary mixture-of-experts with its own gating network. 3 If the hierarchical MoE consists of a groups of b experts each, we denote the primary gating network by G_primary, the secondary gating networks by (G_1, G_2, ..., G_a), and the expert networks by (E_{0,0}, E_{0,1}, ..., E_{a,b}). The output of the MoE is given by:
當專家數量非常大時,我們可以使用兩級階層式 MoE 來減少分支因子。在階層式 MoE 中,主門控網路選擇「專家」的稀疏加權組合,而每個「專家」本身又是具有自身門控網路的次級混合專家模型。若此階層式 MoE 由 a 個組構成、每組 b 個專家,我們將主門控網路記為 G_primary,次級門控網路記為 (G_1, G_2, ..., G_a),專家網路記為 (E_{0,0}, E_{0,1}, ..., E_{a,b})。MoE 的輸出由下式給出:
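依上述記號,此輸出可寫為(依文字敘述重建的式子):

$$ y_H = \sum_{i=1}^{a} \sum_{j=1}^{b} G_{primary}(x)_i \cdot G_i(x)_j \cdot E_{i,j}(x) $$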
Our metrics of expert utilization change to the following:
我們調整了專家利用率的衡量標準,如下:
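對應的階層式重要性與負載可寫為如下形式(依下段文字中 Load_primary、Load_i 與 X^(i) 的定義重建,僅供參考):

$$ \mathrm{Importance}_H(X)_{i,j} = \sum_{x \in X} G_{primary}(x)_i \cdot G_i(x)_j $$

$$ \mathrm{Load}_H(X)_{i,j} = \frac{\mathrm{Load}_{primary}(X)_i \cdot \mathrm{Load}_i(X^{(i)})_j}{|X^{(i)}|} $$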
Load_primary and Load_i denote the Load functions for the primary gating network and i-th secondary gating network respectively. X^(i) denotes the subset of X for which G_primary(x)_i > 0.
Load_primary 與 Load_i 分別表示主門控網路與第 i 個次級門控網路的 Load 函數。X^(i) 表示 X 中滿足 G_primary(x)_i > 0 的子集。
It would seem simpler to let Load_H(X)_{i,j} = Load_i(X^(i))_j, but this would not have a gradient with respect to the primary gating network, so we use the formulation above.
看起來直接令 Load_H(X)_{i,j} = Load_i(X^(i))_j 會更簡單,但這樣對主門控網路就沒有梯度,因此我們採用上述公式。
Model Architecture: Our model consists of five layers: a word embedding layer, a recurrent Long Short-Term Memory (LSTM) layer (Hochreiter & Schmidhuber, 1997; Gers et al., 2000), a MoE layer, a second LSTM layer, and a softmax layer. The dimensionality of the embedding layer, the number of units in each LSTM layer, and the input and output dimensionality of the MoE layer are all equal to 512. For every layer other than the softmax, we apply dropout (Zaremba et al., 2014) to the layer output, dropping each activation with probability DropProb, otherwise dividing by (1 - DropProb). After dropout, the output of the previous layer is added to the layer output. This residual connection encourages gradient flow (He et al., 2015).
模型架構:我們的模型包含五層:一個詞嵌入層、一個循環長短期記憶 (LSTM) 層(Hochreiter & Schmidhuber, 1997;Gers 等人,2000)、一個 MoE 層、第二個 LSTM 層以及一個 softmax 層。嵌入層的維度、每個 LSTM 層的單元數,以及 MoE 層的輸入和輸出維度均為 512。對於 softmax 以外的每一層,我們對該層輸出應用 dropout(Zaremba 等人,2014),以 DropProb 的機率捨棄每個激活值,否則將其除以 (1 - DropProb)。dropout 之後,再將前一層的輸出加到該層的輸出上;這種殘差連接有助於梯度流動(He 等人,2015)。
MoE Layer Architecture: Each expert in the MoE layer is a feed forward network with one ReLU-activated hidden layer of size 1024 and an output layer of size 512. Thus, each expert contains [512 ∗ 1024] + [1024 ∗ 512] = 1M parameters. The output of the MoE layer is passed through a sigmoid function before dropout. We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096-h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers. Thus, each example is processed by exactly 4 experts for a total of 4M ops/timestep. The two LSTM layers contribute 2M ops/timestep each for the desired total of 8M.
MoE 層架構:MoE 層中的每個專家都是一個前饋網路,具有一個大小為 1024 的 ReLU 激活隱藏層和一個大小為 512 的輸出層。因此,每個專家包含 [512 * 1024] + [1024 * 512] = 1M 個參數。MoE 層的輸出在 dropout 之前會先經過 sigmoid 函數。我們在不同模型中改變專家數量,使用具有 4、32 和 256 個專家的普通 MoE 層,以及具有 256、1024 和 4096 個專家的階層式 MoE 層,並將所得模型命名為 MoE-4、MoE-32、MoE-256、MoE-256-h、MoE-1024-h 和 MoE-4096-h。對於階層式 MoE 層,第一級分支因子為 16,與我們叢集中的 GPU 數量相符。我們使用 Noisy Top-K 門控(見第 2.1 節),普通 MoE 層使用 k = 4,階層式 MoE 層的每一級使用 k = 2。因此,每個樣本恰好由 4 個專家處理,總計 4M ops/timestep。兩個 LSTM 層各貢獻 2M ops/timestep,達到所需的總計 8M。
Computationally-Matched Baselines: The MoE-4 model does not employ sparsity, since all 4 experts are always used. In addition, we trained four more computationally-matched baseline models with no sparsity:
計算上匹配的基準模型:MoE-4 模型不使用稀疏性,因為所有 4 個專家始終被使用。此外,我們還訓練了四個計算上匹配的基準模型,這些模型不使用稀疏性:
MoE-1-Wide:MoE 層由一個單一「專家」組成,其包含一個 ReLU 啟用的隱藏層,大小為 4096。
MoE-1-Deep:MoE 層由一個單一「專家」組成,包含四個 ReLU 激活的隱藏層,每層大小為 1024。
4xLSTM-512:我們以兩個額外的 512 單元 LSTM 層取代 MoE 層。
Training: The models were trained on a cluster of 16 K40 GPUs using the synchronous method described in Section 3. Each batch consisted of a set of sentences totaling roughly 300,000 words. In the interest of time, we limited training to 10 epochs, (27,000 steps). Training took 12-16 hours for all models, except for MoE-4, which took 18 hours (since all the expert computation was performed on only 4 of 16 GPUs). We used the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 1000 training steps, and decreased after that so as to be proportional to the inverse square root of the step number. The Softmax output layer was trained efficiently using importance sampling similarly to the models in (Jozefowicz et al., 2016). For each model, we performed a hyper-parameter search to find the best dropout probability, in increments of 0.1.
訓練:這些模型使用第 3 節中描述的同步方法,在由 16 個 K40 GPU 組成的集群上進行訓練。每個批次包含約 30 萬字的句子集。為節省時間,我們將訓練限制於 10 個 epochs(27,000 步)。除了 MoE-4,所有模型的訓練歷時 12 到 16 小時(由於專家計算僅在 16 個 GPU 中的 4 個上執行,因此 MoE-4 需要 18 小時)。我們採用 Adam 優化器(Kingma & Ba,2015)。基本學習率在最初 1,000 步訓練過程中線性增加,之後則按照步驟數的倒平方根成比例遞減。我們類似於(Jozefowicz et al.,2016)模型中使用重要採樣的方式有效地訓練了 Softmax 輸出層。對於每個模型,我們進行超參數搜索以找到最佳的丟棄機率,每次增量為 0.1。
To ensure balanced expert utilization we set w_importance = 0.1 and w_load = 0.1, as described in Section 4 and Appendix A.
為確保專家資源均衡運用,我們將權重 w_importance 設定為 0.1,w_load 也設定為 0.1(詳見第 4 節和附錄 A)。
Results: We evaluate our model using perplexity on the holdout dataset, used by (Chelba et al., 2013; Jozefowicz et al., 2016). We follow the standard procedure and sum over all the words including the end of sentence symbol. Results are reported in Table 7. For each model, we report the test perplexity, the computational budget, the parameter counts, the value of DropProb , and the computational efficiency.
結果:我們利用保留集數據集(參考Chelba et al.,2013;Jozefowicz et al.,2016)上的困惑度評估模型表現。我們遵循標準流程,對所有詞彙進行累計計算,包括句結符號。詳細結果如表 7 所示。對於每個模型,我們分別報告測試困惑度、計算預算、參數數量、DropProb 值以及計算效率。
Table 7: Model comparison on 1 Billion Word Language Modeling Benchmark. Models marked with * are from (Jozefowicz et al., 2016).
表 7:在 10 億詞語言建模基準測試上的模型比較。標有 * 的模型來自 (Jozefowicz 等人,2016)。
Model | Test Perplexity 10 epochs | Test Perplexity (final) | ops/timestep (millions) | #Params excluding embed. & softmax (millions) | Total #Params (billions) | Drop-Prob | TFLOPS per GPU (observed)
---|---|---|---|---|---|---|---
Kneser-Ney 5-gram* | 67.6 | | 1e-05 | | 1.8 | |
LSTM-512-512* | 54.1 | | 2.4 | 2.4 | 0.8 | 0.1 |
LSTM-1024-512* | 48.2 | | 4.7 | 4.7 | 0.8 | 0.1 |
LSTM-2048-512* | 45.0 | 43.7 | 9.4 | 9.4 | 0.8 | 0.1 | 0.61
LSTM-2048-512 | 44.7 | | 9.4 | 9.4 | 0.8 | 0.1 | 1.21
4xLSTM-512 | 46.0 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.07
MoE-1-Wide | 46.1 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.29
MoE-1-Deep | 45.7 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.29
MoE-4 | 45.0 | | 8.4 | 8.4 | 0.8 | 0.1 | 0.52
MoE-32 | 39.7 | | 8.4 | 37.8 | 0.9 | 0.1 | 0.87
MoE-256 | 35.7 | | 8.6 | 272.9 | 1.1 | 0.1 | 0.81
MoE-256-h | 36.0 | | 8.4 | 272.9 | 1.1 | 0.1 | 0.89
MoE-1024-h | 34.6 | | 8.5 | 1079.0 | 1.9 | 0.2 | 0.90
MoE-4096-h | 34.1 | | 8.9 | 4303.4 | 5.1 | 0.2 | 0.74
2xLSTM-8192-1024* | 34.7 | 30.6 | 151 | 151.0 | 1.8 | 0.25 | 1.09
MoE-34M | 31.3 | | 33.8 | 4313.9 | 6 | 0.3 | 1.22
MoE-143M | 28.0 | | 142.7 | 4371.1 | 6 | 0.4 | 1.56
We ran two additional models (MoE-34M and MoE-143M) to investigate the effects of adding more computation in the presence of a large MoE layer. These models have computation budgets of 34M and 143M ops/timestep. Similar to the models above, these models use a MoE layer between two LSTM layers. The dimensionality of the embedding layer, and the input and output dimensionality of the MoE layer are set to 1024 instead of 512. For MoE-34M, the LSTM layers have 1024 units. For MoE-143M, the LSTM layers have 4096 units and an output projection of size 1024 (Sak et al., 2014). MoE-34M uses a hierarchical MoE layer with 1024 experts, each with a hidden layer of size 2048. MoE-143M uses a hierarchical MoE layer with 256 experts, each with a hidden layer of size 8192. Both models have 4B parameters in the MoE layers. We searched for the best DropProb for each model, and trained each model for 10 epochs.
我們額外訓練了兩個模型(MoE-34M 和 MoE-143M),以探討在存在大型 MoE 層的情況下增加計算量的影響。這些模型的計算預算分別為每時間步 3400 萬和 1.43 億次運算。與上述模型類似,這些模型在兩個 LSTM 層之間使用一個 MoE 層。嵌入層的維度以及 MoE 層的輸入和輸出維度設為 1024(而不是 512)。對於 MoE-34M,LSTM 層有 1024 個單元;對於 MoE-143M,LSTM 層有 4096 個單元和一個大小為 1024 的輸出投影(Sak 等人,2014)。MoE-34M 使用一個具有 1024 個專家的階層式 MoE 層,每個專家的隱藏層大小為 2048;MoE-143M 使用一個具有 256 個專家的階層式 MoE 層,每個專家的隱藏層大小為 8192。兩個模型的 MoE 層都擁有 40 億個參數。我們為每個模型搜尋最佳的 DropProb,並將每個模型訓練 10 個 epochs。
The two models achieved test perplexity of 31.3 and 28.0 respectively, showing that even in the presence of a large MoE, more computation is still useful. Results are reported at the bottom of Table 7. The larger of the two models has a similar computational budget to the best published model from the literature, and training times are similar. Comparing after 10 epochs, our model has a lower test perplexity by 18%.
這兩個模型的測試困惑度分別為 31.3 和 28.0,表明即使存在大型 MoE 層,增加計算量仍然有用。結果列於表 7 底部。兩者中較大的模型,其計算預算與文獻中最佳的已發表模型相近,訓練時間也相近。在訓練 10 個 epochs 後比較,我們的模型的測試困惑度低了 18%。
Model Architecture: The models are similar in structure to the 8-million-operations-per-timestep models described in the previous section. We vary the number of experts between models, using an ordinary MoE layer with 32 experts and hierarchical MoE layers with 256, 1024, 4096, 16384, 65536 and 131072 experts. For the hierarchical MoE layers, the first level branching factors are 32, 32, 64, 128, 256 and 256, respectively.
模型架構:這些模型的結構與先前描述的每步操作 800 萬次模型類似。我們在不同模型中調整專家數量,採用包含 32 個專家的標準 MoE 層,以及包含 256、1024、4096、16384、65536 和 131072 個專家的分層 MoE 層。對於分層 MoE 層而言,第一層分支因子分別為 32、32、64、128、256 和 256。
Training: Models are trained on a cluster of 32 Tesla K40 GPUs, except for the last two models, which are trained on clusters of 64 and 128 GPUs so as to have enough memory for all the parameters. For all models, training batch sizes are approximately 2.5 million words. Models are trained once-through over about 100 billion words.
訓練:模型主要在由 32 個 Tesla K40 GPU 組成的集群上訓練,最後兩個模型則分別使用包含 64 和 128 個 GPU 的較大型集群進行訓練,以確保具備足夠記憶體儲存所有參數。所有模型的訓練批次大小約為 250 萬個詞彙,一次完整地訓練約 1000 億個詞彙。
We implement several memory optimizations in order to fit up to 1 billion parameters per GPU. First, we do not store the activations of the hidden layers of the experts, but instead recompute them on the backwards pass. Secondly, we modify the optimizer on the expert parameters to require less auxiliary storage:
為使每個 GPU 容納多達 10 億個參數,我們實施了幾項記憶體優化。首先,我們在反向傳播過程中重新計算專家隱藏層的激活值,而非儲存其值。其次,我們修改專家參數的優化器,使其需要的輔助存儲空間更少。
The Adam optimizer (Kingma & Ba, 2015) keeps first and second moment estimates of the per-parameter gradients. This triples the required memory. To avoid keeping a first-moment estimator, we set β_1 = 0. To reduce the size of the second moment estimator, we replace it with a factored approximation. For a matrix of parameters, instead of maintaining a full matrix of second-moment estimators, we maintain vectors of row-wise and column-wise averages of that matrix. At each step, the matrix of estimators is taken to be the outer product of those two vectors divided by the mean of either one. This technique could similarly be applied to Adagrad (Duchi et al., 2010).
Adam 優化器(Kingma & Ba,2015)會為每個參數的梯度保存一階與二階動差 (moment) 的估計值,這使所需的記憶體增加為三倍。為了避免保存一階動差估計器,我們將 β_1 設為 0。為了縮小二階動差估計器,我們用一個分解式的近似來取代它:對於一個參數矩陣,我們不保存完整的二階動差估計矩陣,而是保存該矩陣逐列與逐行平均值的向量。在每一步,估計矩陣取為這兩個向量的外積,再除以其中任一向量的平均值。此技術同樣可應用於 Adagrad(Duchi 等人,2010)。
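以下為此分解近似的一個最小 NumPy 示意(函數與變數名稱皆為本文假設,非論文原始實作;此處以指數移動平均維護列、行平均,僅為一種可能做法),展示如何只儲存兩個向量並在每步以外積重建二階動差估計:

```python
import numpy as np

def factored_second_moment_update(row_avg, col_avg, grad, beta2=0.999):
    """以列/行平均近似二階動差矩陣(示意版)。
    row_avg: [rows], col_avg: [cols] 為持續維護的向量;grad: [rows, cols] 為當步梯度。"""
    sq = grad ** 2
    row_avg = beta2 * row_avg + (1 - beta2) * sq.mean(axis=1)   # 逐列平均(假設用 EMA 維護)
    col_avg = beta2 * col_avg + (1 - beta2) * sq.mean(axis=0)   # 逐行平均
    # 以兩個向量的外積除以其中一個的平均值,重建完整的估計矩陣
    v_hat = np.outer(row_avg, col_avg) / row_avg.mean()
    return row_avg, col_avg, v_hat

# 使用範例(假設值):2048 x 512 的權重矩陣只需儲存 2048 + 512 個統計量,而非 2048*512 個
rows, cols = 2048, 512
row_avg, col_avg = np.zeros(rows), np.zeros(cols)
grad = np.random.randn(rows, cols)
row_avg, col_avg, v_hat = factored_second_moment_update(row_avg, col_avg, grad)
update = grad / (np.sqrt(v_hat) + 1e-8)   # β_1 = 0:不保留一階動差,直接使用當步梯度
```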
Table 8: Model comparison on 100 Billion Word Google News Dataset
表 8:在 1000 億詞 Google News 資料集上的模型比較
Model | Test Perplexity .1 epochs | Test Perplexity 1 epoch | ops/timestep (millions) | #Params excluding embed. & softmax (millions) | Total #Params (billions) | TFLOPS per GPU (observed) |
---|---|---|---|---|---|---|
Kneser-Ney 5-gram | 67.1 | 45.3 | 1e-05 | | 76 | |
4xLSTM-512 | 54.5 | 47 | 8.4 | 8.4 | 0.1 | 1.23 |
MoE-32 | 48.5 | 40.4 | 8.4 | 37.8 | 0.1 | 0.83 |
MoE-256-h | 42.8 | 35.3 | 8.4 | 272.9 | 0.4 | 1.11 |
MoE-1024-h | 40.3 | 32.7 | 8.5 | 1079.0 | 1.2 | 1.14 |
MoE-4096-h | 38.9 | 30.9 | 8.6 | 4303.4 | 4.4 | 1.07 |
MoE-16384-h | 38.2 | 29.7 | 8.8 | 17201.0 | 17.3 | 0.96 |
MoE-65536-h | 38.2 | 28.9 | 9.2 | 68791.0 | 68.9 | 0.72 |
MoE-131072-h | 39.8 | 29.2 | 9.7 | 137577.6 | 137.7 | 0.30 |
Results: We evaluate our model using perplexity on a holdout dataset. Results are reported in Table 8. Perplexity after 100 billion training words is 39% lower for the 68-billion-parameter MoE model than for the baseline model. It is notable that the measured computational efficiency of the largest model (0.30 TFLOPS/GPU) is very low compared to the other models. This is likely a result of the fact that, for purposes of comparison to the other models, we did not increase the training batch size proportionally to the number of GPUs. For comparison, we include results for a computationally matched baseline model consisting of 4 LSTMs, and for an unpruned 5-gram model with Kneser-Ney smoothing (Kneser & Ney, 1995). 4
結果:我們用保留資料集上的困惑度來評估模型。結果列於表 8。在訓練 1000 億個詞之後,680 億參數的 MoE 模型的困惑度比基線模型低 39%。值得注意的是,最大模型的實測計算效率(0.30 TFLOPS/GPU)與其他模型相比非常低,這很可能是因為為了與其他模型比較,我們沒有隨 GPU 數量按比例增加訓練批次大小。為了比較,我們還列出了一個計算量相當的基線模型(由 4 個 LSTM 層組成)的結果,以及一個使用 Kneser-Ney 平滑(Kneser & Ney, 1995)的未剪枝 5-gram 模型的結果。
Model Architecture for Single Language Pair MoE Models: Our model is a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decrease the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and decoder, with the first decoder LSTM receiving output from and providing input for the attention 5 . All of the layers in our model have input and output dimensionality of 512. Our LSTM layers have 2048 hidden units, with a 512-dimensional output projection. We add residual connections around all LSTM and MoE layers to encourage gradient flow (He et al., 2015). Similar to GNMT, to effectively deal with rare words, we used subword units (also known as 'wordpieces") (Schuster & Nakajima, 2012) for inputs and outputs in our system.
單語言對 MoE 模型的模型架構:我們的模型是 (Wu et al., 2016) 所述 GNMT 模型的修改版本。為了減少計算量,我們將編碼器和解碼器的 LSTM 層數分別從 9 和 8 減少到 3 和 2。我們在編碼器(第 2 層與第 3 層之間)和解碼器(第 1 層與第 2 層之間)中各插入一個 MoE 層。我們在編碼器與解碼器之間使用注意力機制,其中第一個解碼器 LSTM 接收注意力的輸出,並為注意力提供輸入5。我們模型中所有層的輸入和輸出維度均為 512。我們的 LSTM 層具有 2048 個隱藏單元,以及 512 維的輸出投影。我們在所有 LSTM 和 MoE 層周圍加入殘差連接,以促進梯度流動 (He et al., 2015)。與 GNMT 類似,為了有效處理罕見詞,我們在系統的輸入和輸出中使用了子詞單元(也稱為「wordpieces」)(Schuster & Nakajima, 2012)。
We use a shared source and target vocabulary of 32K wordpieces. We also used the same beam search technique as proposed in (Wu et al., 2016).
我們使用一個包含 32K 個 wordpieces 的共用源語言和目標語言詞彙表。我們還使用了與 (Wu et al., 2016) 中提出的相同的束搜索技術。
We train models with different numbers of experts in the MoE layers. In addition to a baseline model with no MoE layers, we train models with flat MoE layers containing 32 experts, and models with hierarchical MoE layers containing 512 and 2048 experts. The flat MoE layers use k = 4 and the hierarchical MoE models use k = 2 at each level of the gating network. Thus, each input is processed by exactly 4 experts in each MoE layer. Each expert in the MoE layer is a feed forward network with one hidden layer of size 2048 and ReLU activation. Thus, each expert contains [512 ∗ 2048] + [2048 ∗ 512] = 2 M parameters. The output of the MoE layer is passed through a sigmoid function. We use the strictly-balanced gating function described in Appendix F.
我們訓練 MoE 層中具有不同專家數量的模型。除了沒有 MoE 層的基線模型外,我們還訓練了包含 32 個專家的扁平 MoE 層模型,以及包含 512 和 2048 個專家的階層式 MoE 層模型。扁平 MoE 層使用 k = 4,而階層式 MoE 模型在門控網路的每個層級都使用 k = 2。因此,每個輸入在每個 MoE 層中都恰好由 4 個專家處理。MoE 層中的每個專家都是一個前饋網路,具有一個大小為 2048 的隱藏層和 ReLU 激活函數。因此,每個專家包含 [512 × 2048] + [2048 × 512] = 2M 個參數。MoE 層的輸出會經過 sigmoid 函數。我們使用附錄 F 中描述的嚴格平衡門控函數。
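For concreteness, here is a minimal sketch of a single expert as described above. The class name is hypothetical and biases are omitted, matching the [512 × 2048] + [2048 × 512] parameter count given in the text.

```python
import numpy as np

class Expert:
    """One MoE expert: a feed-forward network with a single ReLU hidden layer."""

    def __init__(self, d_model=512, d_hidden=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(scale=0.02, size=(d_model, d_hidden))   # 512 x 2048
        self.w_out = rng.normal(scale=0.02, size=(d_hidden, d_model))  # 2048 x 512

    def __call__(self, x):
        # x: (batch, d_model) -> (batch, d_model)
        return np.maximum(x @ self.w_in, 0.0) @ self.w_out  # ReLU hidden layer

    def num_params(self):
        return self.w_in.size + self.w_out.size

expert = Expert()
assert expert.num_params() == 512 * 2048 + 2048 * 512  # 2,097,152 ≈ 2M parameters
```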
Model Architecture for Multilingual MoE Model: We used the same model architecture as for the single-language-pair models, with the following exceptions: We used noisy-top-k gating as described in Section 2.1, not the scheme from Appendix F. The MoE layers in the encoder and decoder are non-hierarchical MoEs with n = 512 experts, and k = 2 . Each expert has a larger hidden layer of size 8192 . This doubles the amount of computation in the MoE layers, raising the computational budget of the entire model from 85M to 102M ops/timestep.
多語言 MoE 模型的模型架構:我們使用與單語言對模型相同的模型架構,但有以下例外:我們使用第 2.1 節所述的噪聲 top-k 門控,而非附錄 F 中的方案。編碼器和解碼器中的 MoE 層是非階層式 MoE,具有 n = 512 個專家及 k = 2。每個專家具有較大的隱藏層,大小為 8192。這使 MoE 層中的計算量加倍,將整個模型的計算預算從每時間步 85M 次運算提高到 102M 次運算。
Training: We trained our networks using the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 2000 training steps, held constant for an additional 8000 steps, and decreased after that so as to be proportional to the inverse square root of the step number. For the single-language-pair models, similarly to (Wu et al., 2016), we applied dropout (Zaremba et al., 2014) to the output of all embedding, LSTM and MoE layers, using DropProb = 0.4. Training was done synchronously on a cluster of up to 64 GPUs as described in section 3. Each training batch consisted of a set of sentence pairs containing roughly 16000 words per GPU.
訓練:我們使用 Adam 優化器(Kingma & Ba, 2015)訓練網路。基礎學習率在最初 2000 個訓練步驟中線性遞增,接著維持固定值 8000 步,之後按步數的平方根倒數成比例地衰減。對於單語言對模型,類似於 (Wu et al., 2016),我們對所有嵌入層、LSTM 層和 MoE 層的輸出應用 dropout(Zaremba et al., 2014),DropProb 設為 0.4。訓練在最多 64 個 GPU 的叢集上同步進行,詳見第 3 節。每個訓練批次由一組句子對構成,每個 GPU 約包含 16000 個詞。
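A minimal sketch of the learning-rate schedule described above (linear warm-up for 2000 steps, constant for the next 8000 steps, then inverse-square-root decay). The function is an illustrative reconstruction rather than code from the paper; the continuity scaling at the transition point is an assumption.

```python
def learning_rate(step, base_lr=1.0, warmup_steps=2000, constant_steps=8000):
    """Learning-rate schedule sketched from the description in the text."""
    if step < warmup_steps:
        # Linear increase from 0 to base_lr over the first 2000 steps.
        return base_lr * step / warmup_steps
    if step < warmup_steps + constant_steps:
        # Held constant for an additional 8000 steps.
        return base_lr
    # Afterwards, proportional to the inverse square root of the step number.
    # The scaling below keeps the schedule continuous at the transition point
    # (an assumption; the paper only states the proportionality).
    transition = warmup_steps + constant_steps
    return base_lr * (transition ** 0.5) / (step ** 0.5)
```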
To ensure balanced expert utilization we set w_importance = 0.01 and w_load = 0.01, as described in Section 4 and Appendix A.
為確保專家資源平衡,我們將 w_importance 設定為 0.01,w_load 設定為 0.01,詳情請見第 4 節和附錄 A。
Metrics: We evaluated our models using the perplexity and the standard BLEU score metric. We reported tokenized BLEU score as computed by the multi-bleu.pl script, downloaded from the public implementation of Moses (on Github), which was also used in (Luong et al., 2015a).
指標:我們使用困惑度和標準 BLEU 分數指標來評估模型。我們報告由 multi-bleu.pl 腳本計算的 tokenized BLEU 分數,該腳本下載自 Moses 的公開實作(GitHub 上),(Luong et al., 2015a) 也使用了相同的腳本。
Results: Tables 2, 3 and 4 in Section 5.3 show comparisons of our results to other published methods. Figure 4 shows test perplexity as a function of number of words in the (training data's) source sentences processed for models with different numbers of experts. As can be seen from the Figure, as we increased the number of experts to approach 2048, the test perplexity of our model continued to improve.
結果:第 5.3 節中的表 2、3 和 4 將我們的結果與其他已發表的方法進行了比較。圖 4 顯示了具有不同專家數量的模型,其測試困惑度隨(訓練資料)源句子處理詞數變化的情形。如圖所示,當我們將專家數量增加到接近 2048 時,模型的測試困惑度持續改善。
Figure 4: Perplexity on WMT'14 En → Fr (left) and Google Production En → Fr (right) datasets as a function of number of words processed. The large differences between models at the beginning of training are due to different batch sizes. All models incur the same computational budget (85M ops/timestep) except the one with no experts.
圖 4:WMT'14 En → Fr(左)和 Google 生產環境 En → Fr(右)資料集上的困惑度,作為已處理詞數的函數。訓練初期模型之間的巨大差異是由於批次大小不同。除了沒有專家的模型之外,所有模型的計算預算皆相同(85M 次運算/時間步)。
We found that the experts indeed become highly specialized by syntax and/or semantics, as can be seen in Table 9. For example, one expert is used when the indefinite article "a" introduces the direct object in a verb phrase indicating importance or leadership.
我們發現,正如表 9 所示,專家確實會依語法和/或語義變得高度專業化。例如,當不定冠詞「a」在表示重要性或領導力的動詞短語中引導直接賓語時,就會使用某一個特定的專家。
Table 9: Contexts corresponding to a few of the 2048 experts in the MoE layer in the encoder portion of the WMT'14 En → Fr translation model. For each expert i, we sort the inputs in a training batch in decreasing order of G(x)_i, and show the words surrounding the corresponding positions in the input sentences.
表 9:WMT'14 英文至法文翻譯模型編碼器部分的 MoE 層中,2048 個專家其中幾個所對應的語境。對於每個專家 i,我們將訓練批次中的輸入按 G(x)_i 的遞減順序排序,並顯示輸入句子中相應位置周圍的詞。
Expert 381 | Expert 752 | Expert 2004 |
---|---|---|
… with researchers , … | … plays a core … | … with rapidly growing … |
… to innovation . … | … plays a critical … | … under static conditions … |
… tics researchers . … | … provides a legislative … | … to swift ly … |
… the generation of … | … play a leading … | … to dras tically … |
… technology innovations is … | … assume a leadership … | … the rapid and … |
… technological innovations , … | … plays a central … | … the fast est … |
… support innovation throughout … | … taken a leading … | … the Quick Method … |
… role innovation will … | … established a reconciliation … | … rec urrent ) … |
… research scienti st … | … played a vital … | … provides quick access … |
… promoting innovation where … | … have a central … | … of volatile organic … |
Due to some peculiarities in our infrastructure which have since been fixed, at the time we ran some of the machine translation experiments, our models ran faster if every expert received exactly the same batch size. To accommodate this, we used a different gating function which we describe below.
由於我們基礎設施中一些後來已修復的特殊狀況,在我們進行部分機器翻譯實驗時,如果每個專家接收到完全相同的批次大小,模型會執行得更快。為此,我們使用了不同的門控函數,如下所述。
Recall that we define the softmax gating function to be:
回想一下,我們將 softmax 門控函數定義為:
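The equation itself did not survive in this copy. Based on the softmax gating definition given earlier in the paper, it should read approximately as follows, where W_g is the trainable gating weight matrix:

```latex
G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g)
```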
Sparse Gating (alternate formulation): To obtain a sparse gating vector, we multiply G_σ(x) component-wise with a sparse mask M(G_σ(x)) and normalize the output. The mask itself is a function of G_σ(x) and specifies which experts are assigned to each input example:
稀疏門控(替代公式):為了獲得稀疏的門控向量,我們將 G_σ(x) 逐分量地與稀疏遮罩 M(G_σ(x)) 相乘,並將輸出歸一化。該遮罩本身是 G_σ(x) 的函數,指定每個輸入樣本分配給哪些專家:
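The corresponding formula is missing here; a plausible reconstruction from the description (component-wise product with the mask, then renormalization over the n experts) is:

```latex
G(x)_i = \frac{G_\sigma(x)_i \, M(G_\sigma(x))_i}{\sum_{j=1}^{n} G_\sigma(x)_j \, M(G_\sigma(x))_j}
```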
Top-K Mask: To implement top-k gating in this formulation, we would let M(v) = TopK(v, k), where:
Top-K 遮罩:為了在這個公式中實現 top-k 門控,我們令 M(v) = TopK(v, k),其中:
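The definition of TopK is absent in this copy; consistent with the surrounding text, it can be written as:

```latex
\mathrm{TopK}(v, k)_i =
\begin{cases}
1 & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\
0 & \text{otherwise}
\end{cases}
```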
Batchwise Mask: To force each expert to receive the exact same number of examples, we introduce an alternative mask function, M_batchwise(X, m), which operates over batches of input vectors. Instead of keeping the top k values per example, we keep the top m values per expert across the training batch, where m = k|X|/n, so that each example is sent to an average of k experts.
批次遮罩:為了強制每個專家接收完全相同數量的樣本,我們引入了一個替代遮罩函數 M_batchwise(X, m),它作用於整批輸入向量。我們不再保留每個樣本的前 k 個值,而是保留訓練批次中每個專家的前 m 個值,其中 m = k|X|/n,使得每個樣本平均被送給 k 個專家。
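As an illustration of the batchwise mask, the following is an interpretive NumPy sketch, assuming X is a |X| × n matrix of gate values with one row per example and one column per expert; it is not the authors' implementation.

```python
import numpy as np

def batchwise_mask(X, k):
    """Keep the top m gate values per expert (column) across the batch,
    where m = k * |X| / n, so each example is routed to k experts on average."""
    num_examples, num_experts = X.shape
    m = (k * num_examples) // num_experts          # top-m examples per expert
    mask = np.zeros_like(X)
    # For each expert, mark the m examples with the largest gate values.
    top_rows = np.argsort(-X, axis=0)[:m]          # indices of top-m rows per column
    mask[top_rows, np.arange(num_experts)] = 1.0
    return mask
```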
As our experiments suggest and also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as M_batchwise) requires modifications to the inference when we may not have a large batch of examples. Our solution to this is to train a vector T of per-expert threshold values to approximate the effects of the batchwise mask. We use the following mask at inference time:
正如我們的實驗所顯示,以及 (Ioffe & Szegedy, 2015) 中所觀察到的,在訓練期間使用批次函數(例如 M_batchwise)時,若推理時沒有大批次的樣本,就需要對推理進行修改。我們的解決方法是訓練一個由每個專家的閾值組成的向量 T,以近似批次遮罩的效果。我們在推理時使用以下遮罩:
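The inference-time mask itself is missing from this copy; given the description, it is presumably of the form:

```latex
M_{\text{threshold}}(x, T)_i =
\begin{cases}
1 & \text{if } x_i > T_i \\
0 & \text{otherwise}
\end{cases}
```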
To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical.
為了學習閾值,我們在訓練時加入一個額外的損失,當批次遮罩與閾值遮罩相同時,該損失達到最小。
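The loss itself is not shown in this copy. One form consistent with the description (penalizing disagreement between the two masks, weighted by how far each gate value lies from its threshold) is, as an assumption:

```latex
L_{\text{batchwise}}(X, T, m) = \sum_{j=1}^{|X|} \sum_{i=1}^{n}
\big( M_{\text{threshold}}(x_j, T)_i - M_{\text{batchwise}}(X, m)_{j,i} \big)
\big( X_{j,i} - T_i \big)
```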
The attention mechanism described in GNMT (Wu et al., 2016) involves a learned "Attention Function" A(x_i, y_j) which takes a "source vector" x_i and a "target vector" y_j, and must be computed for every source time step i and target time step j. In GNMT, the attention function is implemented as a feed forward neural network with a hidden layer of size n. It can be expressed as:
GNMT (Wu et al., 2016) 中描述的注意力機制涉及一個學習到的「注意力函數」A(x_i, y_j),它接收一個「源向量」x_i 和一個「目標向量」y_j,並且必須對每個源時間步 i 和目標時間步 j 進行計算。在 GNMT 中,注意力函數被實作為一個具有大小為 n 的隱藏層的前饋神經網路。它可以表示為:
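The formula did not survive extraction here. Based on the description that follows (weight matrices U and W, weight vector V, hidden size n), the GNMT attention function has approximately the form:

```latex
A_{\text{GNMT}}(x_i, y_j) = \sum_{d=1}^{n} V_d \,\tanh\!\big((x_i U)_d + (y_j W)_d\big)
```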
Where U and W are trainable weight matrices and V is a trainable weight vector.
其中 U 和 W 是可訓練的權重矩陣,V 是可訓練的權重向量。
For performance reasons, in our models, we used a slightly different attention function:
為提升效能,我們模型中採用了一個略微不同的注意力機制:
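The modified function is likewise missing in this copy. To the best of recollection of the original paper's appendix, the variant replaces the sum inside the nonlinearity with a product of two tanh terms, which is what allows the batched matrix-multiplication formulation described next:

```latex
A(x_i, y_j) = \sum_{d=1}^{n} V_d \,\tanh\!\big((x_i U)_d\big)\,\tanh\!\big((y_j W)_d\big)
```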
With our attention function, we can simultaneously compute the attention function on multiple source time steps and multiple target time steps using optimized matrix multiplications. We found little difference in quality between the two functions.
藉由我們的注意力函數,我們可以使用優化的矩陣乘法,同時在多個源時間步和多個目標時間步上計算注意力函數。我們發現這兩種函數在品質上差異很小。