論文翻譯
deeplearning
排版的順序為先原文,再繁體中文,並且圖片與表格都會出現在第一次出現的段落下面
原文
繁體中文
照片或表格
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
神經網路吸收資訊的能力受限於其參數數量。條件計算(conditional computation),即針對每個樣本只啟用網路的一部分,在理論上被提出作為大幅提升模型容量、而不使計算量成比例增加的方法。然而在實務上,存在顯著的演算法與效能挑戰。在本研究中,我們解決了這些挑戰,最終實現了條件計算的承諾:在現代 GPU 叢集上,模型容量提升超過 1000 倍,而計算效率僅有少量損失。
我們引入了稀疏門控混合專家(MoE)層,包含最多數千個前饋子網路。一個可訓練的門控網絡決定每個樣本應使用哪些專家的稀疏組合。我們將 MoE 應用於語言建模和機器翻譯任務,在這些任務中,模型容量對於吸收海量訓練語料中的知識至關重要。
我們提出了模型架構,其中一個包含最多 1370 億參數的 MoE 以卷積方式應用於堆疊的 LSTM 層之間。在大型語言建模和機器翻譯基準測試中,這些模型以更低的計算成本取得了顯著優於目前最先進技術的結果。
Exploiting scale in both training data and model size has been central to the success of deep learning. When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy. This has been shown in domains such as text (Sutskever et al., 2014; Bahdanau et al., 2014; Jozefowicz et al., 2016; Wu et al., 2016), images (Krizhevsky et al., 2012; Le et al., 2012), and audio (Hinton et al., 2012; Amodei et al., 2015). For typical deep learning models, where the entire model is activated for every example, this leads to a roughly quadratic blow-up in training costs, as both the model size and the number of training examples increase. Unfortunately, the advances in computing power and distributed computation fall short of meeting such demand.
深度學習成功的關鍵在於充分利用訓練數據與模型規模。當數據集足夠龐大時,增加神經網絡的容量(參數數量)可以獲得更好的預測準確度。這已在文本(Sutskever 等人,2014;Bahdanau 等人,2014;Jozefowicz 等人,2016;Wu 等人,2016)、圖像(Krizhevsky 等人,2012;Le 等人,2012)和語音(Hinton 等人,2012;Amodei 等人,2015)等領域得到證實。對於典型的深度學習模型,每個樣本都會啟用整個模型;隨著模型規模與訓練樣本數同時增加,訓練成本大致呈二次方增長。不幸的是,計算能力與分散式運算的進步仍不足以滿足這樣的需求。
Various forms of conditional computation have been proposed as a way to increase model capacity without a proportional increase in computational costs (Davis & Arel, 2013; Bengio et al., 2013; Eigen et al., 2013; Ludovic Denoyer, 2014; Cho & Bengio, 2014; Bengio et al., 2015; Almahairi et al., 2015). In these schemes, large parts of a network are active or inactive on a per-example basis. The gating decisions may be binary or sparse and continuous, stochastic or deterministic. Various forms of reinforcement learning and back-propagation are proposed for training the gating decisions.
各種形式的條件計算被提出,作為在不使計算成本成比例增加的情況下提高模型容量的方法(Davis & Arel,2013;Bengio 等人,2013;Eigen 等人,2013;Ludovic Denoyer,2014;Cho & Bengio,2014;Bengio 等人,2015;Almahairi 等人,2015)。在這些方案中,網路的大部分會依每個樣本而啟用或停用。閘控決策可能是二元的,或是稀疏且連續的;可能是隨機的,也可能是確定性的。各種形式的強化學習和反向傳播被提出用於訓練閘控決策。
Figure 1: A Mixture of Experts (MoE) layer embedded within a recurrent language model. In this case, the sparse gating function selects two experts to perform computations. Their outputs are modulated by the outputs of the gating network.
圖 1:嵌入於循環語言模型中的混合專家 (MoE) 層。在此情況下,稀疏門控函數選擇兩個專家進行計算,它們的輸出由門控網路的輸出加以調節。
While these ideas are promising in theory, no work to date has yet demonstrated massive improvements in model capacity, training time, or model quality. We blame this on a combination of the following challenges:
雖然這些想法在理論上很有前景,但迄今尚未有研究展示模型容量、訓練時間或模型品質的大幅改進。我們認為這是由以下幾項挑戰共同造成的:
現代計算設備(特別是 GPU)執行算術運算遠比執行分支快得多。上述大多數研究都意識到這一點,並提出在每次門控決策時整塊地開啟/關閉網路的大片區域。
大批次對效能至關重要,因為它們可以攤銷參數傳輸和更新的成本。條件計算會使網路中條件式啟用的部分所獲得的批次變小。
網路頻寬可能成為瓶頸。GPU 叢集的計算能力可能是設備間總網路頻寬的數千倍。為了保持計算效率,演算法的計算需求與網路需求之比必須超過這個比值。嵌入層可視為一種條件計算,正是受此問題所限:由於嵌入通常需要透過網路傳送,(樣本、參數)交互的次數受限於網路頻寬而非計算能力。
依據不同的方案,可能需要額外的損失項才能達到所需的每區塊及/或每樣本稀疏程度。Bengio 等人(2015)使用了三個這樣的損失項。這些問題會影響模型品質和負載平衡。
模型容量對處理極大量數據集來說至關重要。現有關於條件計算的文獻主要關注相對較小的圖像識別數據集,這些數據集最多包含 600,000 張圖像。難以想像僅憑這些圖像標籤就能提供足夠信號來充分訓練具有數百萬甚至數十億參數的模型。
In this work, we for the first time address all of the above challenges and finally realize the promise of conditional computation. We obtain greater than 1000x improvements in model capacity with only minor losses in computational efficiency and significantly advance the state-of-the-art results on public language modeling and translation data sets.
在本研究中,我們首次解決上述所有挑戰,最終實現條件計算的承諾。我們的模型容量提升了超過 1000 倍,同時計算效率僅略有下降,在公開語言建模和翻譯數據集上的成果顯著超越目前最先進水平。
Our approach to conditional computation is to introduce a new type of general purpose neural network component: a Sparsely-Gated Mixture-of-Experts Layer (MoE). The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input (see Figure 1). All parts of the network are trained jointly by back-propagation.
我們提出的條件計算方法是引入一種新型通用神經網絡模組:稀疏門控混合專家層(MoE)。MoE 由多個專家組成,每個專家為一個簡單的前饋神經網絡,以及一個可訓練的門控網絡,該網絡選擇稀疏組合的專家處理每個輸入(見圖 1)。整個網絡的所有部分都通過反向傳播共同訓練。
While the introduced technique is generic, in this paper we focus on language modeling and machine translation tasks, which are known to benefit from very large models. In particular, we apply a MoE convolutionally between stacked LSTM layers (Hochreiter & Schmidhuber, 1997), as in Figure 1. The MoE is called once for each position in the text, selecting a potentially different combination of experts at each position. The different experts tend to become highly specialized based on syntax and semantics (see Appendix E Table 9). On both language modeling and machine translation benchmarks, we improve on best published results at a fraction of the computational cost.
儘管所提出的技術具有通用性,本文仍專注於語言建模和機器翻譯任務,這些任務已知能從極大型模型中獲益。特別是,我們以卷積方式在堆疊的 LSTM 層之間(Hochreiter & Schmidhuber, 1997)套用 MoE,如圖 1 所示。MoE 在文本的每個位置被呼叫一次,並可能在每個位置選擇不同的專家組合。不同的專家傾向於依語法和語義高度專業化(見附錄 E 表 9)。在語言建模和機器翻譯基準測試上,我們以遠低於先前的計算成本改進了已發表的最佳結果。
Since its introduction more than two decades ago (Jacobs et al., 1991; Jordan & Jacobs, 1994), the mixture-of-experts approach has been the subject of much research. Different types of expert architectures have been proposed such as SVMs (Collobert et al., 2002), Gaussian Processes (Tresp, 2001; Theis & Bethge, 2015; Deisenroth & Ng, 2015), Dirichlet Processes (Shahbaba & Neal, 2009), and deep networks. Other work has focused on different expert configurations such as a hierarchical structure (Yao et al., 2009), infinite numbers of experts (Rasmussen & Ghahramani, 2002), and adding experts sequentially (Aljundi et al., 2016). Garmash & Monz (2016) suggest an ensemble model in the format of mixture of experts for machine translation. The gating network is trained on a pre-trained ensemble NMT model.
自二十多年前被提出以來(Jacobs 等人,1991;Jordan & Jacobs,1994),混合專家方法一直是許多研究的主題。研究者提出了不同類型的專家架構,例如支持向量機 (SVM)(Collobert 等人,2002)、高斯過程(Tresp,2001;Theis & Bethge,2015;Deisenroth & Ng,2015)、狄利克雷過程(Shahbaba & Neal,2009)和深度網路。其他研究關注不同的專家配置,例如階層結構(Yao 等人,2009)、無限多個專家(Rasmussen & Ghahramani,2002)以及逐一加入專家(Aljundi 等人,2016)。Garmash & Monz (2016) 提出了一種以混合專家形式呈現的集成模型用於機器翻譯,其門控網路是在預先訓練好的集成 NMT 模型上訓練的。
The works above concern top-level mixtures of experts. The mixture of experts is the whole model. Eigen et al. (2013) introduce the idea of using multiple MoEs with their own gating networks as parts of a deep model. It is intuitive that the latter approach is more powerful, since complex problems may contain many sub-problems each requiring different experts. They also allude in their conclusion to the potential to introduce sparsity, turning MoEs into a vehicle for computational efficiency.
以上研究重點關注頂層混合專家模型。混合專家模型是整個模型。Eigen 等人 (2013) 引入了使用多個具有自身門控網路的混合專家模型作為深度模型一部分的想法。顯然,後者的方法更加有力,因為複雜問題可能包含許多子問題,每個子問題都需要不同的專家來處理。他們在結論中也暗示了引入稀疏性的潛力,將混合專家模型轉化為一種計算效率更高的工具。
Our work builds on this use of MoEs as a general purpose neural network component. While Eigen et al. (2013) uses two stacked MoEs allowing for two sets of gating decisions, our convolutional application of the MoE allows for different gating decisions at each position in the text. We also realize sparse gating and demonstrate its use as a practical way to massively increase model capacity.
我們的工作基於將 MoE 用作通用神經網路組件的這種想法。Eigen 等人(2013)使用兩個堆疊的 MoE,允許兩組門控決策;而我們以卷積方式應用 MoE,允許在文本的每個位置做出不同的門控決策。我們還實現了稀疏門控,並展示其作為大規模提升模型容量的一種實用方法。
The Mixture-of-Experts (MoE) layer consists of a set of n "expert networks" E_1, ..., E_n, and a "gating network" G whose output is a sparse n-dimensional vector. Figure 1 shows an overview of the MoE module. The experts are themselves neural networks, each with their own parameters. Although in principle we only require that the experts accept the same sized inputs and produce the same-sized outputs, in our initial investigations in this paper, we restrict ourselves to the case where the models are feed-forward networks with identical architectures, but with separate parameters.
混合專家 (MoE) 層由一組 n 個“專家網絡” E1、···、En 和一個“門控網絡” G 組成,其輸出為一個稀疏的 n 維向量。圖 1 展示了 MoE 模塊的概述。每個專家都是一個神經網絡,擁有各自的參數。雖然原則上我們只需確保所有專家接受相同大小的輸入並產生相同大小的輸出,但在本研究中,我們將模型限制在具有相同架構但參數獨立的前饋網絡上。
Let us denote by G(x) and E_i(x) the output of the gating network and the output of the i-th expert network for a given input x. The output y of the MoE module can be written as follows:
令 G(x) 與 E_i(x) 分別表示門控網路與第 i 個專家網路對給定輸入 x 的輸出。MoE 模組的輸出 y 可以表示如下:
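依上述定義,MoE 的輸出可寫為(依文字敘述重建的式子):

$$ y = \sum_{i=1}^{n} G(x)_i \, E_i(x) $$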
We save computation based on the sparsity of the output of G(x). Wherever G(x)_i = 0, we need not compute E_i(x). In our experiments, we have up to thousands of experts, but only need to evaluate a handful of them for every example. If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of "experts", each of which is itself a secondary mixture-of-experts with its own gating network. In the following we focus on ordinary MoEs. We provide more details on hierarchical MoEs in Appendix B.
基於 G(x) 輸出的稀疏性,我們可以節省計算量:只要 G(x)_i = 0,就不需要計算 E_i(x)。在我們的實驗中,最多使用了數千個專家,但每個樣本只需要評估其中少數幾個專家。如果專家的數量非常大,我們可以使用兩級分層式 MoE 來減少分支因子。在分層式 MoE 中,一個主門控網路選擇「專家」的稀疏加權組合,而每個「專家」本身又是具有自身門控網路的次級混合專家模型。以下將重點關注普通的 MoE;關於分層式 MoE 的更多詳細資訊請見附錄 B。
Our implementation is related to other models of conditional computation. A MoE whose experts are simple weight matrices is similar to the parameterized weight matrix proposed in (Cho & Bengio, 2014). A MoE whose experts have one hidden layer is similar to the block-wise dropout described in (Bengio et al., 2015), where the dropped-out layer is sandwiched between fully-activated layers.
我們的實作與其他條件計算模型相關。以簡單權重矩陣作為專家的 MoE,類似於 (Cho & Bengio, 2014) 中提出的參數化權重矩陣;而專家具有單一隱藏層的 MoE,則類似於 (Bengio et al., 2015) 中描述的分塊 dropout,其中被丟棄的層夾在完全啟用的層之間。
Softmax Gating: A simple choice of non-sparse gating function (Jordan & Jacobs, 1994) is to multiply the input by a trainable weight matrix W_g and then apply the Softmax function.
Softmax Gating:一種簡單的非稀疏門控函數選擇(Jordan & Jacobs,1994)是將輸入乘以可訓練權重矩陣 Wg,然後應用 softmax 函數。
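依此描述,非稀疏的 Softmax 門控即為(以 G_σ 表示非稀疏版本的門控):

$$ G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g) $$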
Noisy Top-K Gating: We add two components to the Softmax gating network: sparsity and noise. Before taking the softmax function, we add tunable Gaussian noise, then keep only the top k values, setting the rest to -∞ (which causes the corresponding gate values to equal 0). The sparsity serves to save computation, as described above. While this form of sparsity creates some theoretically scary discontinuities in the output of the gating function, we have not yet observed this to be a problem in practice. The noise term helps with load balancing, as will be discussed in Appendix A. The amount of noise per component is controlled by a second trainable weight matrix W_noise.
帶噪聲的 Top-K 門控(Noisy Top-K Gating):我們在 Softmax 門控網路中加入兩個成分:稀疏性和噪聲。在計算 Softmax 函數之前,我們加入可調的高斯噪聲,然後只保留前 k 個值,其餘設為 -∞(這使得相應的門控值等於 0)。如上所述,這種稀疏性有助於節省計算量。儘管這種形式的稀疏性理論上會在門控函數的輸出中造成不連續,但我們在實踐中尚未觀察到此問題。噪聲項有助於負載平衡,詳見附錄 A。每個成分的噪聲量由第二個可訓練權重矩陣 W_noise 控制。
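依本段敘述,此門控機制可整理為以下形式(符號沿用上文;此為依文字描述重建的式子):

$$ G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k)) $$

$$ H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x \cdot W_{noise})_i) $$

$$ \mathrm{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{若 } v_i \text{ 位於 } v \text{ 的前 } k \text{ 大值之中} \\ -\infty & \text{否則} \end{cases} $$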
Training the Gating Network We train the gating network by simple back-propagation, along with the rest of the model. If we choose k > 1 , the gate values for the top k experts have nonzero derivatives with respect to the weights of the gating network. This type of occasionally-sensitive behavior is described in (Bengio et al., 2013) with respect to noisy rectifiers. Gradients also backpropagate through the gating network to its inputs. Our method differs here from (Bengio et al., 2015) who use boolean gates and a REINFORCE-style approach to train the gating network.
訓練門控網路:我們透過簡單的反向傳播來訓練門控網路,與模型的其他部分一起訓練。如果我們選擇 k > 1,那麼前 k 個專家的門控值對門控網路權重的導數不為零。(Bengio et al., 2013) 在帶噪聲的整流器 (noisy rectifiers) 的脈絡下描述過這種偶爾敏感的行為。梯度也會經由門控網路反向傳播至其輸入。我們的方法在此與 (Bengio et al., 2015) 不同,後者使用布林閘和 REINFORCE 式方法來訓練門控網路。
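下面是一個依上述描述撰寫的最小 NumPy 示意(並非論文的原始實作;noisy_top_k_gating、W_g、W_noise 等名稱與數值皆為本文假設),用來說明帶噪聲的 Top-K 門控的前向計算:

```python
import numpy as np

def softplus(x):
    # 數值穩定的 softplus
    return np.logaddexp(0.0, x)

def noisy_top_k_gating(x, W_g, W_noise, k, train=True):
    """帶噪聲的 Top-K 門控(示意版)。
    x: [batch, d] 輸入;W_g, W_noise: [d, n] 可訓練矩陣;回傳 [batch, n] 稀疏門控權重。"""
    clean_logits = x @ W_g
    if train:
        noise_std = softplus(x @ W_noise)                 # 每個分量的噪聲幅度
        logits = clean_logits + np.random.randn(*clean_logits.shape) * noise_std
    else:
        logits = clean_logits
    # 只保留每列前 k 大的 logits,其餘設為 -inf(softmax 後即為 0)
    kth = np.sort(logits, axis=-1)[:, -k][:, None]
    topk_logits = np.where(logits >= kth, logits, -np.inf)
    z = topk_logits - topk_logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# 使用範例(假設值):4 個樣本、512 維輸入、16 個專家、每個樣本選 4 個專家
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
W_g = np.zeros((512, 16))        # 附錄 A 建議初始化為零以利負載平衡
W_noise = np.zeros((512, 16))
gates = noisy_top_k_gating(x, W_g, W_noise, k=4)
print((gates > 0).sum(axis=-1))  # 每個樣本恰有 4 個非零門控值
```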
On modern CPUs and GPUs, large batch sizes are necessary for computational efficiency, so as to amortize the overhead of parameter loads and updates. If the gating network chooses k out of n experts for each example, then for a batch of b examples, each expert receives a much smaller batch of approximately kb/n ≪ b examples. This causes a naive MoE implementation to become very inefficient as the number of experts increases. The solution to this shrinking batch problem is to make the original batch size as large as possible. However, batch size tends to be limited by the memory necessary to store activations between the forwards and backwards passes. We propose the following techniques for increasing the batch size:
在現代 CPU 和 GPU 上,為了提高計算效率,需要使用較大的批次大小,以便攤銷參數載入和更新的開銷。如果門控網路為每個樣本從 n 個專家中選擇 k 個,那麼對於一個包含 b 個樣本的批次來說,每個專家接收到的批次大小只有約 kb/n ≪ b。這使得 naive 的 MoE 實作隨著專家數量增加而變得非常低效。解決此「批次縮小」問題的方法是使原始批次大小儘可能大。然而,批次大小往往受到儲存正向與反向傳播之間激活值所需記憶體的限制。我們提出以下技術來增加批次大小:
Mixing Data Parallelism and Model Parallelism: In a conventional distributed training setting, multiple copies of the model on different devices asynchronously process distinct batches of data, and parameters are synchronized through a set of parameter servers. In our technique, these different batches run synchronously so that they can be combined for the MoE layer. We distribute the standard layers of the model and the gating network according to conventional data-parallel schemes, but keep only one shared copy of each expert. Each expert in the MoE layer receives a combined batch consisting of the relevant examples from all of the data-parallel input batches. The same set of devices function as data-parallel replicas (for the standard layers and the gating networks) and as model-parallel shards (each hosting a subset of the experts). If the model is distributed over d devices, and each device processes a batch of size b, each expert receives a batch of approximately kbd/n examples. Thus, we achieve a factor of d improvement in expert batch size.
混合資料並行與模型並行:在傳統的分散式訓練環境中,不同設備上的多個模型副本非同步地處理不同的資料批次,參數則透過一組參數伺服器同步。在我們的技術中,這些不同的批次同步執行,因此可以在 MoE 層被合併。我們按照傳統的資料並行方案分佈模型的標準層和門控網路,但每個專家只保留一份共享副本。MoE 層中的每個專家接收一個合併批次,由所有資料並行輸入批次中的相關樣本組成。同一組設備既作為資料並行副本(用於標準層和門控網路),也作為模型並行分片(每片託管一部分專家)。如果模型分佈在 d 個設備上,且每個設備處理大小為 b 的批次,那麼每個專家接收的批次大小約為 kbd/n。因此,我們使專家批次大小提升了 d 倍。
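以下用一個簡單的數值例子(所有數值皆為本文假設)說明此技術對每個專家批次大小的影響:

```python
# 假設:每台設備批次 b=1024、每個樣本選 k=4 個專家、共 n=256 個專家
b, k, n = 1024, 4, 256
print(k * b // n)        # 單台設備的 naive 實作:每個專家約 16 個樣本

# 結合資料並行與模型並行:d=16 台設備的批次合併後一起送入各專家
d = 16
print(k * b * d // n)    # 每個專家約 256 個樣本,專家批次放大 d 倍
```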
In the case of a hierarchical MoE (Section B), the primary gating network employs data parallelism, and the secondary MoEs employ model parallelism. Each secondary MoE resides on one device.
在階層式 MoE(附錄 B)的情況下,主門控網路採用資料並行,次級 MoE 則採用模型並行。每個次級 MoE 駐留在一台設備上。
This technique allows us to increase the number of experts (and hence the number of parameters) by proportionally increasing the number of devices in the training cluster. The total batch size increases, keeping the batch size per expert constant. The memory and bandwidth requirements per device also remain constant, as do the step times, as does the amount of time necessary to process a number of training examples equal to the number of parameters in the model. It is our goal to train a trillion-parameter model on a trillion-word corpus. We have not scaled our systems this far as of the writing of this paper, but it should be possible by adding more hardware.
這種技術讓我們能夠藉由按比例增加訓練叢集中的設備數量,來增加專家數量(也就是參數數量)。總批次大小隨之增加,而每個專家的批次大小保持不變。每台設備的記憶體和頻寬需求、單步時間,以及處理與模型參數數量相等的訓練樣本所需的時間,也都維持不變。我們的目標是在一個兆詞語料庫上訓練一個兆參數模型。截至撰寫本文時,我們尚未將系統擴展到這個規模,但透過增加更多硬體應該是可行的。
Taking Advantage of Convolutionality: In our language models, we apply the same MoE to each time step of the previous layer. If we wait for the previous layer to finish, we can apply the MoE to all the time steps together as one big batch. Doing so increases the size of the input batch to the MoE layer by a factor of the number of unrolled time steps.
充分利用卷積特性:在我們的語言模型中,我們對前一層的每個時間步都套用相同的 MoE。如果等待前一層全部算完,就可以將所有時間步合併為一個大批次一起送入 MoE。這樣可以將 MoE 層的輸入批次大小放大為原來的未展開時間步數倍。
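此技巧的核心只是把時間維攤平進批次維;以下為一個最小的 NumPy 形狀示意(變數名稱與數值皆為本文假設):

```python
import numpy as np

batch, timesteps, d_model = 32, 20, 512
lstm_out = np.zeros((batch, timesteps, d_model))        # 前一層 LSTM 的全部時間步輸出

# 等前一層算完後,把 (batch, time) 攤平,一次送進 MoE
moe_in = lstm_out.reshape(batch * timesteps, d_model)   # MoE 看到 640 個樣本,而非 32 個
moe_out = moe_in                                        # 佔位:實際上為 MoE 的輸出,形狀相同
restored = moe_out.reshape(batch, timesteps, d_model)   # 還原回序列形狀
```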
Increasing Batch Size for a Recurrent MoE: We suspect that even more powerful models may involve applying a MoE recurrently. For example, the weight matrices of a LSTM or other RNN could be replaced by a MoE. Sadly, such models break the convolutional trick from the last paragraph, since the input to the MoE at one timestep depends on the output of the MoE at the previous timestep. Gruslys et al. (2016) describe a technique for drastically reducing the number of stored activations in an unrolled RNN, at the cost of recomputing forward activations. This would allow for a large increase in batch size.
增大遞迴式 MoE 的批次大小:我們推測,更強大的模型可能涉及遞迴地套用 MoE。例如,可以將 LSTM 或其他 RNN 的權重矩陣替換為 MoE。遺憾的是,此類模型無法使用上一段提到的卷積技巧,因為某個時間步的 MoE 輸入取決於前一個時間步的 MoE 輸出。Gruslys 等人 (2016) 描述了一種技術,能大幅減少展開 RNN 中需要儲存的激活值數量,代價是重新計算正向激活值。這將允許大幅增加批次大小。
Another major performance concern in distributed computing is network bandwidth. Since the experts are stationary (see above) and the number of gating parameters is small, most of the communication involves sending the inputs and outputs of the experts across the network. To maintain computational efficiency, the ratio of an expert's computation to the size of its input and output must exceed the ratio of computational to network capacity of the computing device. For GPUs, this may be thousands to one. In our experiments, we use experts with one hidden layer containing thousands of RELU-activated units. Since the weight matrices in the expert have sizes input_size × hidden_size and hidden_size × output_size, the ratio of computation to input and output is equal to the size of the hidden layer. Conveniently, we can increase computational efficiency simply by using a larger hidden layer, or more hidden layers.
在分散式計算中,另一個重要的效能問題是網路頻寬。由於專家是固定不動的(見上文),且門控參數數量很少,大部分的通訊是跨網路傳送專家的輸入與輸出。為了維持計算效率,專家的計算量與其輸入輸出大小之比,必須超過計算設備的計算能力與網路容量之比;對 GPU 而言,這個比值可能高達數千比一。在我們的實驗中,專家具有一個包含數千個 ReLU 激活單元的隱藏層。由於專家中的權重矩陣大小為 input_size × hidden_size 和 hidden_size × output_size,計算量與輸入輸出大小的比值等於隱藏層的大小。方便的是,我們只需使用更大的隱藏層或更多的隱藏層,就能提高計算效率。
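以本文的單隱藏層專家為例(忽略偏置項,並以乘加次數計算),計算量與輸入輸出大小的比值為:

$$ \frac{d_{in} \cdot h + h \cdot d_{out}}{d_{in} + d_{out}} = h $$

其中 d_in、d_out 為專家的輸入輸出維度,h 為隱藏層大小;因此加大 h 即可提高此比值,與上文結論一致。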
We have observed that the gating network tends to converge to a state where it always produces large weights for the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus are selected even more by the gating network. Eigen et al. (2013) describe the same phenomenon, and use a hard constraint at the beginning of training to avoid this local minimum. Bengio et al. (2015) include a soft constraint on the batch-wise average of each gate. 1
我們觀察到,閘網絡通常收斂至一種狀態,它總是為少數特定專家賦予較大權重。這種失衡是自我加強的:受青睞的專家因為訓練速度更快而被閘網絡更頻繁地選擇,進而獲得更多的訓練資源。Eigen 等人 (2013) 描述了同樣現象,並在訓練初期使用硬約束來避免此局部最優解。Bengio 等人 (2015) 則採用軟約束來限制每個閘值在每批資料上的平均權重。
We take a soft constraint approach. We define the importance of an expert relative to a batch of training examples to be the batchwise sum of the gate values for that expert. We define an additional loss L importance , which is added to the overall loss function for the model. This loss is equal to the square of the coefficient of variation of the set of importance values, multiplied by a hand-tuned scaling factor w importance . This additional loss encourages all experts to have equal importance.
我們採取軟性約束的方法。我們將一個專家相對於一批訓練樣本的重要性 (importance) 定義為該批次中此專家門控值的總和。我們定義一個額外的損失 L_importance,加入模型的整體損失函數中。這個損失等於重要性值集合的變異係數平方,乘以一個手動調整的比例因子 w_importance。這個額外的損失鼓勵所有專家具有相等的重要性。
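依本段定義,重要性與對應的損失可寫為(依文字敘述重建):

$$ \mathrm{Importance}(X) = \sum_{x \in X} G(x) $$

$$ L_{importance}(X) = w_{importance} \cdot CV(\mathrm{Importance}(X))^2 $$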
While this loss function can ensure equal importance, experts may still receive very different numbers of examples. For example, one expert may receive a few examples with large weights, and another may receive many examples with small weights. This can cause memory and performance problems on distributed hardware. To solve this problem, we introduce a second loss function, L load , which ensures balanced loads. Appendix A contains the definition of this function, along with experimental results.
儘管這種損失函數可以確保各專家的重要性相等,但各專家實際接收到的樣本數量仍可能差異很大。例如,一位專家可能收到少數權重很大的樣本,而另一位專家可能收到許多權重很小的樣本。這可能在分散式硬體上造成記憶體與效能問題。為解決此問題,我們引入了第二個損失函數 L_load,以確保負載均衡。附錄 A 包含此函數的定義以及實驗結果。
Dataset: This dataset, introduced by (Chelba et al., 2013) consists of shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words.
資料集:此資料集由 (Chelba et al., 2013) 提出,包含來自新聞文章、經過隨機打亂的不重複句子,總計約 8.29 億個詞,詞彙量為 793,471 個詞。
Previous State-of-the-Art: The best previously published results (Jozefowicz et al., 2016) use models consisting of one or more stacked Long Short-Term Memory (LSTM) layers (Hochreiter & Schmidhuber, 1997; Gers et al., 2000). The number of parameters in the LSTM layers of these models vary from 2 million to 151 million. Quality increases greatly with parameter count, as do computational costs. Results for these models form the top line of Figure 2-right.
先前最先進的結果:先前發表的最佳結果(Jozefowicz 等,2016)採用由一個或多個堆疊的長短期記憶 (LSTM) 層(Hochreiter & Schmidhuber,1997;Gers 等,2000)組成的模型。這些模型中 LSTM 層的參數數量介於 200 萬到 1.51 億之間。模型品質隨參數量的增加而大幅提升,但計算成本也隨之上升。這些模型的結果構成圖 2 右圖的最上方曲線。
MoE Models: Our models consist of two stacked LSTM layers with a MoE layer between them (see Figure 1). We vary the sizes of the layers and the number of experts. For full details on model architecture, training regimen, additional baselines and results, see Appendix C.
MoE 模型:我們的模型由兩個堆疊的 LSTM 層組成,中間夾著一個 MoE 層(見圖 1)。我們改變各層的規模和專家數量。關於模型架構、訓練方案、其他基線以及結果的完整細節,請參閱附錄 C。
Low Computation, Varied Capacity: To investigate the effects of adding capacity, we trained a series of MoE models all with roughly equal computational costs: about 8 million multiply-andadds per training example per timestep in the forwards pass, excluding the softmax layer. We call this metric (ops/timestep). We trained models with flat MoEs containing 4, 32, and 256 experts, and models with hierarchical MoEs containing 256, 1024, and 4096 experts. Each expert had about 1 million parameters. For all the MoE layers, 4 experts were active per input.
低運算量,多種容量:為了探究添加容量的影響,我們訓練了一系列 MoE 模型,這些模型在計算成本上大致相同:每個訓練樣本每步前向傳播約消耗 800 萬次乘法和加法(不包括 softmax 層)。我們稱這個指標為 ops/timestep。我們訓練了包含 4、32 和 256 個專家的扁平 MoE 模型,以及包含 256、1024 和 4096 個專家的分層 MoE 模型。每個專家約含有 100 萬個參數。對於所有 MoE 層,每次輸入都啟用 4 個專家。
The results of these models are shown in Figure 2-left. The model with 4 always-active experts performed (unsurprisingly) similarly to the computationally-matched baseline models, while the largest of the models (4096 experts) achieved an impressive 24% lower perplexity on the test set.
這些模型的結果如圖 2 左圖所示。具有 4 個始終啟用專家的模型(不出所料)與計算量相當的基線模型表現相似,而最大的模型(4096 個專家)在測試集上將困惑度降低了 24%,令人印象深刻。
Figure 2: Model comparison on 1-Billion-Word Language-Modeling Benchmark. On the left, we plot test perplexity as a function of model capacity for models with similar computational budgets of approximately 8-million-ops-per-timestep. On the right, we plot test perplexity as a function of computational budget. The top line represents the LSTM models from (Jozefowicz et al., 2016). The bottom line represents 4-billion parameter MoE models with different computational budgets.
圖 2:在 10 億詞語言建模基準測試上的模型比較。左圖顯示在計算預算相近(約每時間步 800 萬次運算)的情況下,測試困惑度隨模型容量的變化;右圖顯示測試困惑度隨計算預算的變化。最上方曲線代表 (Jozefowicz et al., 2016) 的 LSTM 模型;最下方曲線代表具有不同計算預算的 40 億參數 MoE 模型。
Varied Computation, High Capacity: In addition to the largest model from the previous section, we trained two more MoE models with similarly high capacity (4 billion parameters), but higher computation budgets. These models had larger LSTMs, and fewer but larger experts. Details can be found in Appendix C.2. Results of these three models form the bottom line of Figure 2-right. Table 1 compares the results of these models to the best previously-published result on this dataset. Even the fastest of these models beats the best published result (when controlling for the number of training epochs), despite requiring only 6% of the computation.
多樣化計算、高容量:除了上一節中的最大模型外,我們還訓練了兩個具有相近高容量(40 億參數)但計算預算更高的 MoE 模型。這些模型擁有更大的 LSTM,以及數量更少但規模更大的專家。詳細資訊請見附錄 C.2。這三個模型的結果構成圖 2 右圖的最下方曲線。表 1 將這些模型的結果與該資料集上先前最佳的公開結果進行比較。即使是其中最快的模型,(在控制訓練週期數的情況下)也超越了最佳公開結果,而其所需計算量僅為後者的 6%。
Table 1: Summary of high-capacity MoE-augmented models with varying computational budgets, vs. best previously published results (Jozefowicz et al., 2016). Details in Appendix C.
表 1:總結不同計算預算下的高容量 MoE 加強模型與先前最佳公開結果(Jozefowicz 等,2016 年)。詳情見附錄 C。
Model | Test Perplexity 10 epochs | Test Perplexity 100 epochs | #Parameters excluding embedding and softmax layers | ops/timestep | Training Time 10 epochs | TFLOPS/GPU
---|---|---|---|---|---|---
Best Published Results | 34.7 | 30.6 | 151 million | 151 million | 59 hours, 32 k40s | 1.09
Low-Budget MoE Model | 34.1 | | 4303 million | 8.9 million | 15 hours, 16 k40s | 0.74
Medium-Budget MoE Model | 31.3 | | 4313 million | 33.8 million | 17 hours, 32 k40s | 1.22
High-Budget MoE Model | 28.0 | | 4371 million | 142.7 million | 47 hours, 32 k40s | 1.56
Computational Efficiency: We trained our models using TensorFlow (Abadi et al., 2016) on clusters containing 16-32 Tesla K40 GPUs. For each of our models, we determine computational efficiency in TFLOPS/GPU by dividing the number of floating point operations required to process one training batch by the observed step time and the number of GPUs in the cluster. The operation counts used here are higher than the ones we report in our ops/timestep numbers in that we include the backwards pass, we include the importance-sampling-based training of the softmax layer, and we count a multiply-and-add as two separate operations. For all of our MoE models, the floating point operations involved in the experts represent between 37% and 46% of the total.
計算效率:我們使用 TensorFlow(Abadi 等人,2016)在包含 16 至 32 個 Tesla K40 GPU 的叢集上訓練模型。對於每個模型,我們將處理一個訓練批次所需的浮點運算次數,除以觀察到的單步時間與叢集中的 GPU 數量,得到以 TFLOPS/GPU 表示的計算效率。這裡使用的運算次數高於我們在 ops/timestep 中報告的數字,因為我們包含了反向傳播、基於重要性採樣的 softmax 層訓練,並且將一次乘加計為兩個獨立運算。對於我們所有的 MoE 模型,專家所涉及的浮點運算佔總量的 37% 至 46%。
For our baseline models with no MoE, observed computational efficiency ranged from 1.07-1.29 TFLOPS/GPU. For our low-computation MoE models, computation efficiency ranged from 0.74-0.90 TFLOPS/GPU, except for the 4-expert model which did not make full use of the available parallelism. Our highest-computation MoE model was more efficient at 1.56 TFLOPS/GPU, likely due to the larger matrices. These numbers represent a significant fraction of the theoretical maximum of 4.29 TFLOPS/GPU claimed by NVIDIA. Detailed results are in Appendix C, Table 7.
對於我們基線模型(不使用 MoE),觀察到的計算效率介於 1.07 到 1.29 TFLOPS/GPU 之間。我們的低運算量 MoE 模型,計算效率範圍為 0.74 至 0.90 TFLOPS/GPU,除了 4 個專家模型未充分利用可用的並行性外。我們的最高運算量 MoE 模型更有效率,達到 1.56 TFLOPS/GPU,這很可能是由於矩陣更大。這些數字代表了 NVIDIA 聲稱的理論最大值 4.29 TFLOPS/GPU 的很大一部分。詳細結果見附錄 C,表 7。
Figure 3: Language modeling on a 100 billion word corpus. Models have similar computational budgets (8 million ops/timestep).
圖 3:基於 1000 億個詞語的語料庫進行語言建模。模型具備相似的計算資源(每時間步 800 萬次運算)。
On the 1-billion-word corpus, adding additional capacity seems to produce diminishing returns as the number of parameters in the MoE layer exceeds 1 billion, as can be seen in Figure 2-left. We hypothesized that for a larger training set, even higher capacities would produce significant quality improvements.
在10億字詞語料上,根據圖表2左所示,隨著MoE層參數數量超過10億,增加容量似乎會產生遞減回報。我們假設對於更大的訓練集來說,更高容量能帶來顯著的品質提升。
We constructed a similar training set consisting of shuffled unique sentences from Google's internal news corpus, totalling roughly 100 billion words. Similarly to the previous section, we tested a series of models with similar computational costs of about 8 million ops/timestep. In addition to a baseline LSTM model, we trained models augmented with MoE layers containing 32, 256, 1024, 4096, 16384, 65536, and 131072 experts. This corresponds to up to 137 billion parameters in the MoE layer. Details on architecture, training, and results are given in Appendix D.
我們構建了一個類似的訓練集,由 Google 內部新聞語料庫中經過打亂的不重複句子組成,總計約 1000 億個詞。與前一節類似,我們測試了一系列計算成本約為每時間步 800 萬次運算的模型。除了基線 LSTM 模型外,我們還訓練了加入 MoE 層的模型,其專家數量分別為 32、256、1024、4096、16384、65536 和 131072,對應 MoE 層中最多 1370 億個參數。關於架構、訓練和結果的詳細資訊,請見附錄 D。
Results: Figure 3 shows test perplexity as a function of capacity after training on 10 billion words (top line) and 100 billion words (bottom line). When training over the full 100 billion words, test perplexity improves significantly up to 65536 experts (68 billion parameters), dropping 39% lower than the computationally matched baseline, but degrades at 131072 experts, possibly a result of too much sparsity. The widening gap between the two lines demonstrates (unsurprisingly) that increased model capacity helps more on larger training sets.
結果:圖 3 顯示在訓練 100 億個詞(上方曲線)與 1000 億個詞(下方曲線)之後,測試困惑度隨容量的變化。在完整的 1000 億個詞上訓練時,測試困惑度一路顯著改善,直到 65536 個專家(680 億個參數),比計算量相當的基線低 39%;但在 131072 個專家時變差,可能是稀疏程度過高所致。兩條曲線之間的差距逐漸擴大,(毫不意外地)顯示更大的模型容量對更大的訓練集幫助更明顯。
Even at 65536 experts (99.994% layer sparsity), computational efficiency for the model stays at a respectable 0.72 TFLOPS/GPU.
即使在 65,536 個專家模型(層稀疏度達 99.994%)的情況下,該模型的運算效率仍可達到 0.72 TFLOPS/GPU。
Model Architecture: Our model was a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decreased the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We inserted MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). Each MoE layer contained up to 2048 experts each with about two million parameters, adding a total of about 8 billion parameters to the models. Further details on model architecture, testing procedure and results can be found in Appendix E.
模型架構:我們的模型是 (Wu et al., 2016) 所述 GNMT 模型的修改版本。為減少計算量,我們將編碼器和解碼器的 LSTM 層數分別從 9 和 8 降低至 3 和 2。我們在編碼器(第 2 層與第 3 層之間)及解碼器(第 1 層與第 2 層之間)中插入了 MoE 層。每個 MoE 層最多包含 2048 個專家,每個專家約有 200 萬個參數,共為模型增加約 80 億個參數。更多關於模型架構、測試方法及結果的細節,請見附錄 E。
Datasets: We benchmarked our method on the WMT'14 En → Fr and En → De corpora, whose training sets have 36M sentence pairs and 5M sentence pairs, respectively. The experimental protocols were also similar to those in (Wu et al., 2016): newstest2014 was used as the test set to compare against previous work (Luong et al., 2015a; Zhou et al., 2016; Wu et al., 2016), while the combination of newstest2012 and newstest2013 was used as the development set. We also tested the same model on a Google's Production English to French data.
資料集:我們在 WMT'14 英文至法文和英文至德文語料庫上測試我們的方法,這兩個語料庫的訓練集分別包含 3600 萬個和 500 萬個句對。實驗方案與 (Wu 等人,2016) 相似:使用 newstest2014 作為測試集,與先前的工作進行比較(Luong 等人,2015a;Zhou 等人,2016;Wu 等人,2016),並將 newstest2012 和 newstest2013 合併作為開發集。我們還在 Google 生產環境的英文至法文資料上測試了相同的模型。
Table 2: Results on WMT'14 En → Fr newstest2014 (bold values represent best results). Table 3: Results on WMT'14 En → De newstest2014 (bold values represent best results).
表 2:WMT'14 英文 → 法文 newstest2014 結果(以粗體顯示最佳結果)。
表 3:WMT'14 英文 → 德文 newstest2014 結果(以粗體顯示最佳結果)。
Model | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time
---|---|---|---|---|---
MoE with 2048 Experts | 2.69 | 40.35 | 85M | 8.7B | 3 days/64 k40s
MoE with 2048 Experts (longer training) | 2.63 | 40.56 | 85M | 8.7B | 6 days/64 k40s
GNMT (Wu et al., 2016) | 2.79 | 39.22 | 214M | 278M | 6 days/96 k80s
GNMT+RL (Wu et al., 2016) | 2.96 | 39.92 | 214M | 278M | 6 days/96 k80s
PBMT (Durrani et al., 2014) | | 37.0 | | |
LSTM (6-layer) (Luong et al., 2015b) | | 31.5 | | |
LSTM (6-layer+PosUnk) (Luong et al., 2015b) | | 33.1 | | |
DeepAtt (Zhou et al., 2016) | | 37.7 | | |
DeepAtt+PosUnk (Zhou et al., 2016) | | 39.2 | | |
Model | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time
---|---|---|---|---|---
MoE with 2048 Experts | 4.64 | 26.03 | 85M | 8.7B | 1 day/64 k40s
GNMT (Wu et al., 2016) | 5.25 | 24.91 | 214M | 278M | 1 day/96 k80s
GNMT+RL (Wu et al., 2016) | 8.08 | 24.66 | 214M | 278M | 1 day/96 k80s
PBMT (Durrani et al., 2014) | | 20.7 | | |
DeepAtt (Zhou et al., 2016) | | 20.6 | | |

Table 4: Results on the Google Production En → Fr dataset (bold values represent best results).
表 4:Google Production En → Fr 資料集結果(粗體值表示最佳成果)。

Model | Eval Perplexity | Eval BLEU | Test Perplexity | Test BLEU | ops/timestep | Total #Parameters | Training Time
---|---|---|---|---|---|---|---
MoE with 2048 Experts | 2.6 | 37.27 | 2.69 | 36.57 | 85M | 8.7B | 1 day/64 k40s
GNMT (Wu et al., 2016) | 2.78 | 35.8 | 2.87 | 35.56 | 214M | 278M | 6 days/96 k80s
Results: Tables 2, 3, and 4 show the results of our largest models, compared with published results. Our approach achieved BLEU scores of 40.56 and 26.03 on the WMT'14 En → Fr and En → De benchmarks. As our models did not use RL refinement, these results constitute significant gains of 1.34 and 1.12 BLEU score on top of the strong baselines in (Wu et al., 2016). The perplexity scores are also better. 2 On the Google Production dataset, our model achieved 1.01 higher test BLEU score even after training for only one sixth of the time.
結果:表 2、3 和 4 顯示了我們最大模型的結果,並與已發表的成果進行比較。我們的方法在 WMT'14 En → Fr 和 En → De 基準測試上分別取得 40.56 和 26.03 的 BLEU 分數。由於我們的模型未使用 RL 微調,這些結果相對於 (Wu et al., 2016) 的強基線分別提升了 1.34 和 1.12 BLEU。困惑度分數也更好。在 Google Production 資料集上,即使訓練時間只有六分之一,我們的模型的測試 BLEU 分數仍高出 1.01。
Dataset: (Johnson et al., 2016) train a single GNMT (Wu et al., 2016) model on a very large combined dataset of twelve language pairs. Results are somewhat worse than those for 12 separately trained single-pair GNMT models. This is not surprising, given that the twelve models have 12 times the capacity and twelve times the aggregate training of the one model. We repeat this experiment with a single MoE-augmented model. See Appendix E for details on model architecture. We train our model on the same dataset as (Johnson et al., 2016) and process the same number of training examples (about 3 billion sentence pairs). Our training time was shorter due to the lower computational budget of our model.
資料集:(Johnson 等人,2016) 在一個由十二個語言對組成的超大型合併資料集上訓練單一的 GNMT (Wu 等人,2016) 模型。其結果略遜於十二個分別訓練的單語言對 GNMT 模型。這並不意外,因為那十二個模型合計擁有 12 倍的容量和 12 倍的總訓練量。我們使用單一的 MoE 增強模型重複此實驗。有關模型架構的詳細資訊,請參閱附錄 E。我們在與 (Johnson 等人,2016) 相同的資料集上訓練模型,並處理相同數量的訓練樣本(約 30 億個句對)。由於我們的模型計算預算較低,訓練時間較短。
Results: Results for the single-pair GNMT models, the multilingual GNMT model and the multilingual MoE model are given in Table 5. The MoE model achieves 19% lower perplexity on the dev set than the multilingual GNMT model. On BLEU score, the MoE model significantly beats the multilingual GNMT model on 11 of the 12 language pairs (by as much as 5.84 points), and even beats the monolingual GNMT models on 8 of 12 language pairs. The poor performance on English → Korean seems to be a result of severe overtraining, as for the rarer language pairs a small number of real examples were highly oversampled in the training corpus.
結果:單語言對 GNMT 模型、多語言 GNMT 模型和多語言 MoE 模型的結果列於表 5。MoE 模型在開發集上的困惑度比多語言 GNMT 模型低 19%。在 BLEU 分數方面,MoE 模型在 12 個語言對中的 11 個顯著優於多語言 GNMT 模型(最多高出 5.84 分),甚至在 12 個語言對中的 8 個上超越單語言 GNMT 模型。英文至韓文的表現欠佳似乎是嚴重過度訓練的結果:對於較罕見的語言對,訓練語料中少數真實樣本被嚴重過度取樣。
Table 5: Multilingual Machine Translation (bold values represent best results).
表 5:多語機器翻譯(粗體值代表最佳結果)。
| GNMT-Mono | GNMT-Multi | MoE-Multi | MoE-Multi vs. GNMT-Multi
---|---|---|---|---
Parameters | 278M / model | 278M | 8.7B |
ops/timestep | 212M | 212M | 102M |
training time, hardware | | 21 days, 96 k20s | 12 days, 64 k40s |
Perplexity (dev) | various | 4.14 | 3.35 | -19% |
French → English Test BLEU | 36.47 | 34.40 | 37.46 | +3.06 |
German → English Test BLEU | 31.77 | 31.17 | 34.80 | +3.63 |
Japanese → English Test BLEU | 23.41 | 21.62 | 25.91 | +4.29 |
Korean → English Test BLEU | 25.42 | 22.87 | 28.71 | +5.84 |
Portuguese → English Test BLEU | 44.40 | 42.53 | 46.13 | +3.60 |
Spanish → English Test BLEU | 38.00 | 36.04 | 39.39 | +3.35 |
English → French Test BLEU | 35.37 | 34.00 | 36.59 | +2.59 |
English → German Test BLEU | 26.43 | 23.15 | 24.53 | +1.38 |
English → Japanese Test BLEU | 23.66 | 21.10 | 22.78 | +1.68 |
English → Korean Test BLEU | 19.75 | 18.41 | 16.62 | -1.79 |
English → Portuguese Test BLEU | 38.40 | 37.35 | 37.90 | +0.55 |
English → Spanish Test BLEU | 34.50 | 34.25 | 36.21 | +1.96 |
This work is the first to demonstrate major wins from conditional computation in deep networks. We carefully identified the design considerations and challenges of conditional computing and addressed them with a combination of algorithmic and engineering solutions. While we focused on text, conditional computation may help in other domains as well, provided sufficiently large training sets. We look forward to seeing many novel implementations and applications of conditional computation in the years to come.
這項研究首次展示了條件計算在深度網路中帶來的重大效益。我們仔細釐清了條件計算的設計考量與挑戰,並以演算法和工程方案相結合的方式加以解決。雖然我們專注於文本,但只要訓練集足夠龐大,條件計算也可能對其他領域有所幫助。我們期待在未來幾年看到條件計算的許多新穎實作與應用。
We would like to thank all of the members of the Google Brain and Google Translate teams who helped us with this project, in particular Zhifeng Chen, Yonghui Wu, and Melvin Johnson. Thanks also to our anonymous ICLR reviewers for the helpful suggestions on making this paper better.
我們謹此感謝 Google Brain 和 Google 翻譯團隊中協助本專案的所有成員,特別是 Zhifeng Chen、Yonghui Wu 和 Melvin Johnson。也感謝匿名 ICLR 審稿人提出的寶貴建議,協助我們改進這篇論文。
As discussed in section 4, for load-balancing purposes, we want to define an additional loss function to encourage experts to receive roughly equal numbers of training examples. Unfortunately, the number of examples received by an expert is a discrete quantity, so it can not be used in back-propagation. Instead, we define a smooth estimator Load(X) of the number of examples assigned to each expert for a batch X of inputs. The smoothness allows us to back-propagate gradients through the estimator. This is the purpose of the noise term in the gating function. We define P(x, i) as the probability that G(x)_i is nonzero, given a new random choice of noise on element i, but keeping the already-sampled choices of noise on the other elements. To compute P(x, i), we note that the G(x)_i is nonzero if and only if H(x)_i is greater than the k-th-greatest element of H(x) excluding itself. The probability works out to be:
如第 4 節所述,為了實現負載均衡,我們需要定義一個額外的損失函數,鼓勵專家接收大致相同的訓練樣本數量。然而,每個專家接收的樣本數量是一個離散值,無法直接用於反向傳播。因此,我們定義了一個平滑估計器 Load(X),來估計每個專家在輸入批次 X 中分配到的樣本數量。這種平滑性允許我們透過估計器反向傳播梯度。這就是門控函數中噪聲項的作用。
我們定義 P(x, i) 為 G(x)_i 非零的機率,其中元素 i 的噪聲重新隨機抽取,而其他元素已抽樣的噪聲保持不變。為了計算 P(x, i),我們注意到 G(x)_i 非零,若且唯若 H(x)_i 大於 H(x) 中排除自身後的第 k 大元素。這個機率可寫為:
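(依上文定義重建的式子)

$$ P(x, i) = Pr\Big( (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x \cdot W_{noise})_i) > \mathrm{kth\_excluding}(H(x), k, i) \Big) $$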
Where kth_excluding(v, k, i) means the kth highest component of v, excluding component i. Simplifying, we get:
其中 kth_excluding(v, k, i) 表示 v 中排除第 i 個分量後的第 k 大分量。簡化後得到:
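(依前後文定義重建的簡化式)

$$ P(x, i) = \Phi\left( \frac{(x \cdot W_g)_i - \mathrm{kth\_excluding}(H(x), k, i)}{\mathrm{Softplus}((x \cdot W_{noise})_i)} \right) $$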
Where Φ is the CDF of the standard normal distribution.
其中Φ為標準常態分佈的累積分佈函數。
We can now define the load loss to be the square of the coefficient of variation of the load vector, multiplied by a hand-tuned scaling factor w load .
我們現在可以定義載荷損失為載荷向量的變異係數平方,乘以一個手工調整的比例因子 w_load。
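依此定義,負載估計與負載損失可寫為(依文字敘述重建):

$$ \mathrm{Load}(X)_i = \sum_{x \in X} P(x, i) $$

$$ L_{load}(X) = w_{load} \cdot CV(\mathrm{Load}(X))^2 $$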
Initial Load Imbalance: To avoid out-of-memory errors, we need to initialize the network in a state of approximately equal expert load (since the soft constraints need some time to work). To accomplish this, we initialize the matrices W_g and W_noise to all zeros, which yields no signal and some noise.
初始負載失衡:為了避免記憶體不足的錯誤,我們需要將網路初始化為各專家負載大致均等的狀態(因為軟性約束需要一些時間才能發揮作用)。為此,我們將矩陣 W_g 和 W_noise 初始化為全零,這樣不會產生任何訊號,只會產生一些噪聲。
Experiments: We trained a set of models with identical architecture (the MoE-256 model described in Appendix C), using different values of w importance and w load . We trained each model for 10 epochs, then measured perplexity on the test set. We also measured the coefficients of variation in Importance and Load , as well as ratio of the load on the most overloaded expert to the average load. This last value is significant for load balancing purposes on distributed hardware. All of these metrics were averaged over several training batches.
實驗:我們使用相同的架構(附錄 C 中描述的 MoE-256 模型)訓練了一組模型,分別設定不同的 w_importance 和 w_load 值。每個模型訓練 10 個 epochs 後,在測試集上測量困惑度。我們還測量了 Importance 和 Load 的變異係數,以及負載最高的專家之負載與平均負載的比值。最後這個值對於分散式硬體上的負載平衡相當重要。所有這些指標都在多個訓練批次上取平均。
Table 6: Experiments with different combinations of losses.
表 6:不同損失函數組合實驗結果。
w_importance | w_load | Test Perplexity | CV(Importance(X)) | CV(Load(X)) | max(Load(X)) / mean(Load(X))
---|---|---|---|---|---|
0 | 0 | 39.8 | 3.04 | 3.01 | 17.8 |
0.2 | 0 | 35.6 | 0.06 | 0.17 | 1.47 |
0 | 0.2 | 35.7 | 0.22 | 0.04 | 1.15 |
0.1 | 0.1 | 35.6 | 0.06 | 0.05 | 1.14 |
0.01 | 0.01 | 35.7 | 0.48 | 0.11 | 1.37 |
1 | 1 | 35.7 | 0.03 | 0.02 | 1.07 |
Results: Results are reported in Table 6. All the combinations containing at least one of the two losses led to very similar model quality, where having no loss was much worse. Models with higher values of w_load had lower loads on the most overloaded expert.
結果:結果列於表 6。所有至少包含兩種損失之一的組合,模型品質都非常相似;完全不加損失則差得多。w_load 值較高的模型,其負載最高的專家的負載也較低。
If the number of experts is very large, we can reduce the branching factor by using a two-level hierarchical MoE. In a hierarchical MoE, a primary gating network chooses a sparse weighted combination of "experts", each of which is itself a secondary mixture-of-experts with its own gating network. 3 If the hierarchical MoE consists of a groups of b experts each, we denote the primary gating network by G_primary, the secondary gating networks by (G_1, G_2, ..., G_a), and the expert networks by (E_{0,0}, E_{0,1}, ..., E_{a,b}). The output of the MoE is given by:
當專家數量非常大時,我們可以使用兩級階層式 MoE 來減少分支因子。在階層式 MoE 中,主門控網路選擇「專家」的稀疏加權組合,而每個「專家」本身又是具有自身門控網路的次級混合專家模型。若此階層式 MoE 由 a 個組構成、每組 b 個專家,我們將主門控網路記為 G_primary,次級門控網路記為 (G_1, G_2, ..., G_a),專家網路記為 (E_{0,0}, E_{0,1}, ..., E_{a,b})。MoE 的輸出由下式給出:
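依上述記號,此輸出可寫為(依文字敘述重建的式子):

$$ y_H = \sum_{i=1}^{a} \sum_{j=1}^{b} G_{primary}(x)_i \cdot G_i(x)_j \cdot E_{i,j}(x) $$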
Our metrics of expert utilization change to the following:
我們調整了專家利用率的衡量標準,如下:
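對應的階層式重要性與負載可寫為如下形式(依下段文字中 Load_primary、Load_i 與 X^(i) 的定義重建,僅供參考):

$$ \mathrm{Importance}_H(X)_{i,j} = \sum_{x \in X} G_{primary}(x)_i \cdot G_i(x)_j $$

$$ \mathrm{Load}_H(X)_{i,j} = \frac{\mathrm{Load}_{primary}(X)_i \cdot \mathrm{Load}_i(X^{(i)})_j}{|X^{(i)}|} $$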
Load_primary and Load_i denote the Load functions for the primary gating network and i-th secondary gating network respectively. X^(i) denotes the subset of X for which G_primary(x)_i > 0.
Load_primary 與 Load_i 分別表示主門控網路與第 i 個次級門控網路的 Load 函數。X^(i) 表示 X 中滿足 G_primary(x)_i > 0 的子集。
It would seem simpler to let Load_H(X)_{i,j} = Load_i(X^(i))_j, but this would not have a gradient with respect to the primary gating network, so we use the formulation above.
看起來直接令 Load_H(X)_{i,j} = Load_i(X^(i))_j 會更簡單,但這樣對主門控網路就沒有梯度,因此我們採用上述公式。
Model Architecture: Our model consists of five layers: a word embedding layer, a recurrent Long Short-Term Memory (LSTM) layer (Hochreiter & Schmidhuber, 1997; Gers et al., 2000), a MoE layer, a second LSTM layer, and a softmax layer. The dimensionality of the embedding layer, the number of units in each LSTM layer, and the input and output dimensionality of the MoE layer are all equal to 512. For every layer other than the softmax, we apply dropout (Zaremba et al., 2014) to the layer output, dropping each activation with probability DropProb, otherwise dividing by (1 - DropProb). After dropout, the output of the previous layer is added to the layer output. This residual connection encourages gradient flow (He et al., 2015).
模型架構:我們的模型包含五層:一個詞嵌入層、一個循環長短期記憶 (LSTM) 層(Hochreiter & Schmidhuber, 1997;Gers 等人,2000)、一個 MoE 層、第二個 LSTM 層以及一個 softmax 層。嵌入層的維度、每個 LSTM 層的單元數,以及 MoE 層的輸入和輸出維度均為 512。對於 softmax 以外的每一層,我們對該層輸出應用 dropout(Zaremba 等人,2014),以 DropProb 的機率捨棄每個激活值,否則將其除以 (1 - DropProb)。dropout 之後,再將前一層的輸出加到該層的輸出上;這種殘差連接有助於梯度流動(He 等人,2015)。
MoE Layer Architecture: Each expert in the MoE layer is a feed forward network with one ReLU-activated hidden layer of size 1024 and an output layer of size 512. Thus, each expert contains [512 ∗ 1024] + [1024 ∗ 512] = 1M parameters. The output of the MoE layer is passed through a sigmoid function before dropout. We varied the number of experts between models, using ordinary MoE layers with 4, 32 and 256 experts and hierarchical MoE layers with 256, 1024 and 4096 experts. We call the resulting models MoE-4, MoE-32, MoE-256, MoE-256-h, MoE-1024-h and MoE-4096-h. For the hierarchical MoE layers, the first level branching factor was 16, corresponding to the number of GPUs in our cluster. We use Noisy-Top-K Gating (see Section 2.1) with k = 4 for the ordinary MoE layers and k = 2 at each level of the hierarchical MoE layers. Thus, each example is processed by exactly 4 experts for a total of 4M ops/timestep. The two LSTM layers contribute 2M ops/timestep each for the desired total of 8M.
MoE 層架構:MoE 層中的每個專家都是一個前饋網路,具有一個大小為 1024 的 ReLU 激活隱藏層和一個大小為 512 的輸出層。因此,每個專家包含 [512 * 1024] + [1024 * 512] = 1M 個參數。MoE 層的輸出在 dropout 之前會先經過 sigmoid 函數。我們在不同模型中改變專家數量,使用具有 4、32 和 256 個專家的普通 MoE 層,以及具有 256、1024 和 4096 個專家的階層式 MoE 層,並將所得模型命名為 MoE-4、MoE-32、MoE-256、MoE-256-h、MoE-1024-h 和 MoE-4096-h。對於階層式 MoE 層,第一級分支因子為 16,與我們叢集中的 GPU 數量相符。我們使用 Noisy Top-K 門控(見第 2.1 節),普通 MoE 層使用 k = 4,階層式 MoE 層的每一級使用 k = 2。因此,每個樣本恰好由 4 個專家處理,總計 4M ops/timestep。兩個 LSTM 層各貢獻 2M ops/timestep,達到所需的總計 8M。
Computationally-Matched Baselines: The MoE-4 model does not employ sparsity, since all 4 experts are always used. In addition, we trained four more computationally-matched baseline models with no sparsity:
計算上匹配的基準模型:MoE-4 模型不使用稀疏性,因為所有 4 個專家始終被使用。此外,我們還訓練了四個計算上匹配的基準模型,這些模型不使用稀疏性:
MoE-1-Wide:MoE 層由一個單一「專家」組成,其包含一個 ReLU 啟用的隱藏層,大小為 4096。
MoE-1-Deep:MoE 層由一個單一「專家」組成,包含四個 ReLU 激活的隱藏層,每層大小為 1024。
4xLSTM-512:我們以兩個額外的 512 單元 LSTM 層取代 MoE 層。
Training: The models were trained on a cluster of 16 K40 GPUs using the synchronous method described in Section 3. Each batch consisted of a set of sentences totaling roughly 300,000 words. In the interest of time, we limited training to 10 epochs, (27,000 steps). Training took 12-16 hours for all models, except for MoE-4, which took 18 hours (since all the expert computation was performed on only 4 of 16 GPUs). We used the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 1000 training steps, and decreased after that so as to be proportional to the inverse square root of the step number. The Softmax output layer was trained efficiently using importance sampling similarly to the models in (Jozefowicz et al., 2016). For each model, we performed a hyper-parameter search to find the best dropout probability, in increments of 0.1.
訓練:這些模型使用第 3 節中描述的同步方法,在由 16 個 K40 GPU 組成的集群上進行訓練。每個批次包含約 30 萬字的句子集。為節省時間,我們將訓練限制於 10 個 epochs(27,000 步)。除了 MoE-4,所有模型的訓練歷時 12 到 16 小時(由於專家計算僅在 16 個 GPU 中的 4 個上執行,因此 MoE-4 需要 18 小時)。我們採用 Adam 優化器(Kingma & Ba,2015)。基本學習率在最初 1,000 步訓練過程中線性增加,之後則按照步驟數的倒平方根成比例遞減。我們類似於(Jozefowicz et al.,2016)模型中使用重要採樣的方式有效地訓練了 Softmax 輸出層。對於每個模型,我們進行超參數搜索以找到最佳的丟棄機率,每次增量為 0.1。
To ensure balanced expert utilization we set w_importance = 0.1 and w_load = 0.1, as described in Section 4 and Appendix A.
為確保專家資源均衡運用,我們將權重 w_importance 設定為 0.1,w_load 也設定為 0.1(詳見第 4 節和附錄 A)。
Results: We evaluate our model using perplexity on the holdout dataset, used by (Chelba et al., 2013; Jozefowicz et al., 2016). We follow the standard procedure and sum over all the words including the end of sentence symbol. Results are reported in Table 7. For each model, we report the test perplexity, the computational budget, the parameter counts, the value of DropProb , and the computational efficiency.
結果:我們利用保留集數據集(參考Chelba et al.,2013;Jozefowicz et al.,2016)上的困惑度評估模型表現。我們遵循標準流程,對所有詞彙進行累計計算,包括句結符號。詳細結果如表 7 所示。對於每個模型,我們分別報告測試困惑度、計算預算、參數數量、DropProb 值以及計算效率。
Table 7: Model comparison on 1 Billion Word Language Modeling Benchmark. Models marked with * are from (Jozefowicz et al., 2016).
表 7:在 10 億詞語言建模基準測試上的模型比較。標有 * 的模型來自 (Jozefowicz 等人,2016)。
Model | Test Perplexity 10 epochs | Test Perplexity (final) | ops/timestep (millions) | #Params excluding embed. & softmax (millions) | Total #Params (billions) | Drop-Prob | TFLOPS per GPU (observed)
---|---|---|---|---|---|---|---
Kneser-Ney 5-gram* | 67.6 | | 1e-05 | | 1.8 | |
LSTM-512-512* | 54.1 | | 2.4 | 2.4 | 0.8 | 0.1 |
LSTM-1024-512* | 48.2 | | 4.7 | 4.7 | 0.8 | 0.1 |
LSTM-2048-512* | 45.0 | 43.7 | 9.4 | 9.4 | 0.8 | 0.1 | 0.61
LSTM-2048-512 | 44.7 | | 9.4 | 9.4 | 0.8 | 0.1 | 1.21
4xLSTM-512 | 46.0 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.07
MoE-1-Wide | 46.1 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.29
MoE-1-Deep | 45.7 | | 8.4 | 8.4 | 0.8 | 0.1 | 1.29
MoE-4 | 45.0 | | 8.4 | 8.4 | 0.8 | 0.1 | 0.52
MoE-32 | 39.7 | | 8.4 | 37.8 | 0.9 | 0.1 | 0.87
MoE-256 | 35.7 | | 8.6 | 272.9 | 1.1 | 0.1 | 0.81
MoE-256-h | 36.0 | | 8.4 | 272.9 | 1.1 | 0.1 | 0.89
MoE-1024-h | 34.6 | | 8.5 | 1079.0 | 1.9 | 0.2 | 0.90
MoE-4096-h | 34.1 | | 8.9 | 4303.4 | 5.1 | 0.2 | 0.74
2xLSTM-8192-1024* | 34.7 | 30.6 | 151 | 151.0 | 1.8 | 0.25 | 1.09
MoE-34M | 31.3 | | 33.8 | 4313.9 | 6 | 0.3 | 1.22
MoE-143M | 28.0 | | 142.7 | 4371.1 | 6 | 0.4 | 1.56
We ran two additional models (MoE-34M and MoE-143M) to investigate the effects of adding more computation in the presence of a large MoE layer. These models have computation budgets of 34M and 143M ops/timestep. Similar to the models above, these models use a MoE layer between two LSTM layers. The dimensionality of the embedding layer, and the input and output dimensionality of the MoE layer are set to 1024 instead of 512. For MoE-34M, the LSTM layers have 1024 units. For MoE-143M, the LSTM layers have 4096 units and an output projection of size 1024 (Sak et al., 2014). MoE-34M uses a hierarchical MoE layer with 1024 experts, each with a hidden layer of size 2048. MoE-143M uses a hierarchical MoE layer with 256 experts, each with a hidden layer of size 8192. Both models have 4B parameters in the MoE layers. We searched for the best DropProb for each model, and trained each model for 10 epochs.
我們額外訓練了兩個模型(MoE-34M 和 MoE-143M),以探討在存在大型 MoE 層的情況下增加計算量的影響。這些模型的計算預算分別為每時間步 3400 萬和 1.43 億次運算。與上述模型類似,這些模型在兩個 LSTM 層之間使用一個 MoE 層。嵌入層的維度以及 MoE 層的輸入和輸出維度設為 1024(而不是 512)。對於 MoE-34M,LSTM 層有 1024 個單元;對於 MoE-143M,LSTM 層有 4096 個單元和一個大小為 1024 的輸出投影(Sak 等人,2014)。MoE-34M 使用一個具有 1024 個專家的階層式 MoE 層,每個專家的隱藏層大小為 2048;MoE-143M 使用一個具有 256 個專家的階層式 MoE 層,每個專家的隱藏層大小為 8192。兩個模型的 MoE 層都擁有 40 億個參數。我們為每個模型搜尋最佳的 DropProb,並將每個模型訓練 10 個 epochs。
The two models achieved test perplexity of 31.3 and 28.0 respectively, showing that even in the presence of a large MoE, more computation is still useful. Results are reported at the bottom of Table 7. The larger of the two models has a similar computational budget to the best published model from the literature, and training times are similar. Comparing after 10 epochs, our model has a lower test perplexity by 18%.
這兩個模型的測試困惑度分別為 31.3 和 28.0,表明即使存在大型 MoE 層,增加計算量仍然有用。結果列於表 7 底部。兩者中較大的模型,其計算預算與文獻中最佳的已發表模型相近,訓練時間也相近。在訓練 10 個 epochs 後比較,我們的模型的測試困惑度低了 18%。
Model Architecture: The models are similar in structure to the 8-million-operations-per-timestep models described in the previous section. We vary the number of experts between models, using an ordinary MoE layer with 32 experts and hierarchical MoE layers with 256, 1024, 4096, 16384, 65536 and 131072 experts. For the hierarchical MoE layers, the first level branching factors are 32, 32, 64, 128, 256 and 256, respectively.
模型架構:這些模型的結構與先前描述的每步操作 800 萬次模型類似。我們在不同模型中調整專家數量,採用包含 32 個專家的標準 MoE 層,以及包含 256、1024、4096、16384、65536 和 131072 個專家的分層 MoE 層。對於分層 MoE 層而言,第一層分支因子分別為 32、32、64、128、256 和 256。
Training: Models are trained on a cluster of 32 Tesla K40 GPUs, except for the last two models, which are trained on clusters of 64 and 128 GPUs so as to have enough memory for all the parameters. For all models, training batch sizes are approximately 2.5 million words. Models are trained once-through over about 100 billion words.
訓練:模型主要在由 32 個 Tesla K40 GPU 組成的集群上訓練,最後兩個模型則分別使用包含 64 和 128 個 GPU 的較大型集群進行訓練,以確保具備足夠記憶體儲存所有參數。所有模型的訓練批次大小約為 250 萬個詞彙,一次完整地訓練約 1000 億個詞彙。
We implement several memory optimizations in order to fit up to 1 billion parameters per GPU. First, we do not store the activations of the hidden layers of the experts, but instead recompute them on the backwards pass. Secondly, we modify the optimizer on the expert parameters to require less auxiliary storage:
為使每個 GPU 容納多達 10 億個參數,我們實施了幾項記憶體優化。首先,我們在反向傳播過程中重新計算專家隱藏層的激活值,而非儲存其值。其次,我們修改專家參數的優化器,使其需要的輔助存儲空間更少。
The Adam optimizer (Kingma & Ba, 2015) keeps first and second moment estimates of the per-parameter gradients. This triples the required memory. To avoid keeping a first-moment estimator, we set β_1 = 0. To reduce the size of the second moment estimator, we replace it with a factored approximation. For a matrix of parameters, instead of maintaining a full matrix of second-moment estimators, we maintain vectors of row-wise and column-wise averages of that matrix. At each step, the matrix of estimators is taken to be the outer product of those two vectors divided by the mean of either one. This technique could similarly be applied to Adagrad (Duchi et al., 2010).
Adam 優化器(Kingma & Ba,2015)會為每個參數的梯度保存一階與二階動差 (moment) 的估計值,這使所需的記憶體增加為三倍。為了避免保存一階動差估計器,我們將 β_1 設為 0。為了縮小二階動差估計器,我們用一個分解式的近似來取代它:對於一個參數矩陣,我們不保存完整的二階動差估計矩陣,而是保存該矩陣逐列與逐行平均值的向量。在每一步,估計矩陣取為這兩個向量的外積,再除以其中任一向量的平均值。此技術同樣可應用於 Adagrad(Duchi 等人,2010)。
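以下為此分解近似的一個最小 NumPy 示意(函數與變數名稱皆為本文假設,非論文原始實作;此處以指數移動平均維護列、行平均,僅為一種可能做法),展示如何只儲存兩個向量並在每步以外積重建二階動差估計:

```python
import numpy as np

def factored_second_moment_update(row_avg, col_avg, grad, beta2=0.999):
    """以列/行平均近似二階動差矩陣(示意版)。
    row_avg: [rows], col_avg: [cols] 為持續維護的向量;grad: [rows, cols] 為當步梯度。"""
    sq = grad ** 2
    row_avg = beta2 * row_avg + (1 - beta2) * sq.mean(axis=1)   # 逐列平均(假設用 EMA 維護)
    col_avg = beta2 * col_avg + (1 - beta2) * sq.mean(axis=0)   # 逐行平均
    # 以兩個向量的外積除以其中一個的平均值,重建完整的估計矩陣
    v_hat = np.outer(row_avg, col_avg) / row_avg.mean()
    return row_avg, col_avg, v_hat

# 使用範例(假設值):2048 x 512 的權重矩陣只需儲存 2048 + 512 個統計量,而非 2048*512 個
rows, cols = 2048, 512
row_avg, col_avg = np.zeros(rows), np.zeros(cols)
grad = np.random.randn(rows, cols)
row_avg, col_avg, v_hat = factored_second_moment_update(row_avg, col_avg, grad)
update = grad / (np.sqrt(v_hat) + 1e-8)   # β_1 = 0:不保留一階動差,直接使用當步梯度
```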
Table 8: Model comparison on 100 Billion Word Google News Dataset
表 8:在 1000 億詞 Google News 資料集上的模型比較
Model | Test Perplexity .1 epochs | Test Perplexity 1 epoch | ops/timestep (millions) | #Params excluding embed. & softmax (millions) | Total #Params (billions) | TFLOPS per GPU (observed) |
---|---|---|---|---|---|---|
Kneser-Ney 5-gram | 67.1 | 45.3 | 1e-05 | | 76 | |
4xLSTM-512 | 54.5 | 47 | 8.4 | 8.4 | 0.1 | 1.23 |
MoE-32 | 48.5 | 40.4 | 8.4 | 37.8 | 0.1 | 0.83 |
MoE-256-h | 42.8 | 35.3 | 8.4 | 272.9 | 0.4 | 1.11 |
MoE-1024-h | 40.3 | 32.7 | 8.5 | 1079.0 | 1.2 | 1.14 |
MoE-4096-h | 38.9 | 30.9 | 8.6 | 4303.4 | 4.4 | 1.07 |
MoE-16384-h | 38.2 | 29.7 | 8.8 | 17201.0 | 17.3 | 0.96 |
MoE-65536-h | 38.2 | 28.9 | 9.2 | 68791.0 | 68.9 | 0.72 |
MoE-131072-h | 39.8 | 29.2 | 9.7 | 137577.6 | 137.7 | 0.30 |
Results: We evaluate our model using perplexity on a holdout dataset. Results are reported in Table 8. Perplexity after 100 billion training words is 39% lower for the 68-billion-parameter MoE model than for the baseline model. It is notable that the measured computational efficiency of the largest model (0.30 TFLOPS/GPU) is very low compared to the other models. This is likely a result of the fact that, for purposes of comparison to the other models, we did not increase the training batch size proportionally to the number of GPUs. For comparison, we include results for a computationally matched baseline model consisting of 4 LSTMs, and for an unpruned 5-gram model with Kneser-Ney smoothing (Kneser & Ney, 1995). 4
結果:我們用保留資料集上的困惑度來評估模型。結果列於表 8。在訓練 1000 億個詞之後,680 億參數的 MoE 模型的困惑度比基線模型低 39%。值得注意的是,最大模型的實測計算效率(0.30 TFLOPS/GPU)與其他模型相比非常低,這很可能是因為為了與其他模型比較,我們沒有隨 GPU 數量按比例增加訓練批次大小。為了比較,我們還列出了一個計算量相當的基線模型(由 4 個 LSTM 層組成)的結果,以及一個使用 Kneser-Ney 平滑(Kneser & Ney, 1995)的未剪枝 5-gram 模型的結果。
Model Architecture for Single Language Pair MoE Models: Our model is a modified version of the GNMT model described in (Wu et al., 2016). To reduce computation, we decrease the number of LSTM layers in the encoder and decoder from 9 and 8 to 3 and 2 respectively. We insert MoE layers in both the encoder (between layers 2 and 3) and the decoder (between layers 1 and 2). We use an attention mechanism between the encoder and decoder, with the first decoder LSTM receiving output from and providing input for the attention 5 . All of the layers in our model have input and output dimensionality of 512. Our LSTM layers have 2048 hidden units, with a 512-dimensional output projection. We add residual connections around all LSTM and MoE layers to encourage gradient flow (He et al., 2015). Similar to GNMT, to effectively deal with rare words, we used subword units (also known as 'wordpieces") (Schuster & Nakajima, 2012) for inputs and outputs in our system.
單語言對 MoE 模型的模型架構:我們的模型是 (Wu et al., 2016) 所述 GNMT 模型的修改版本。為了減少計算量,我們將編碼器和解碼器的 LSTM 層數分別從 9 和 8 減少到 3 和 2。我們在編碼器(第 2 層與第 3 層之間)和解碼器(第 1 層與第 2 層之間)中各插入一個 MoE 層。我們在編碼器與解碼器之間使用注意力機制,其中第一個解碼器 LSTM 接收注意力的輸出,並為注意力提供輸入5。我們模型中所有層的輸入和輸出維度均為 512。我們的 LSTM 層具有 2048 個隱藏單元,以及 512 維的輸出投影。我們在所有 LSTM 和 MoE 層周圍加入殘差連接,以促進梯度流動 (He et al., 2015)。與 GNMT 類似,為了有效處理罕見詞,我們在系統的輸入和輸出中使用了子詞單元(也稱為「wordpieces」)(Schuster & Nakajima, 2012)。
We use a shared source and target vocabulary of 32K wordpieces. We also used the same beam search technique as proposed in (Wu et al., 2016).
我們使用一個包含 32K 個 wordpieces 的共用源語言和目標語言詞彙表。我們還使用了與 (Wu et al., 2016) 中提出的相同的束搜索技術。
We train models with different numbers of experts in the MoE layers. In addition to a baseline model with no MoE layers, we train models with flat MoE layers containing 32 experts, and models with hierarchical MoE layers containing 512 and 2048 experts. The flat MoE layers use k = 4 and the hierarchical MoE models use k = 2 at each level of the gating network. Thus, each input is processed by exactly 4 experts in each MoE layer. Each expert in the MoE layer is a feed forward network with one hidden layer of size 2048 and ReLU activation. Thus, each expert contains [512 ∗ 2048] + [2048 ∗ 512] = 2 M parameters. The output of the MoE layer is passed through a sigmoid function. We use the strictly-balanced gating function described in Appendix F.
我們訓練 MoE 層中具有不同專家數量的模型。除了沒有 MoE 層的基線模型外,我們還訓練了包含 32 個專家的扁平 MoE 層模型,以及包含 512 和 2048 個專家的階層式 MoE 層模型。扁平 MoE 層使用 k = 4,而階層式 MoE 模型在門控網路的每個層級都使用 k = 2。因此,每個輸入在每個 MoE 層中都恰好由 4 個專家處理。MoE 層中的每個專家都是一個前饋網路,具有一個大小為 2048 的隱藏層和 ReLU 激活函數。因此,每個專家包含 [512 × 2048] + [2048 × 512] = 2M 個參數。MoE 層的輸出會經過 sigmoid 函數。我們使用附錄 F 中描述的嚴格平衡門控函數。
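For concreteness, here is a minimal sketch of a single expert as described above. The class name is hypothetical and biases are omitted, matching the [512 × 2048] + [2048 × 512] parameter count given in the text.

```python
import numpy as np

class Expert:
    """One MoE expert: a feed-forward network with a single ReLU hidden layer."""

    def __init__(self, d_model=512, d_hidden=2048, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(scale=0.02, size=(d_model, d_hidden))   # 512 x 2048
        self.w_out = rng.normal(scale=0.02, size=(d_hidden, d_model))  # 2048 x 512

    def __call__(self, x):
        # x: (batch, d_model) -> (batch, d_model)
        return np.maximum(x @ self.w_in, 0.0) @ self.w_out  # ReLU hidden layer

    def num_params(self):
        return self.w_in.size + self.w_out.size

expert = Expert()
assert expert.num_params() == 512 * 2048 + 2048 * 512  # 2,097,152 ≈ 2M parameters
```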
Model Architecture for Multilingual MoE Model: We used the same model architecture as for the single-language-pair models, with the following exceptions: We used noisy-top-k gating as described in Section 2.1, not the scheme from Appendix F. The MoE layers in the encoder and decoder are non-hierarchical MoEs with n = 512 experts, and k = 2 . Each expert has a larger hidden layer of size 8192 . This doubles the amount of computation in the MoE layers, raising the computational budget of the entire model from 85M to 102M ops/timestep.
多語言 MoE 模型的模型架構:我們使用與單語言對模型相同的模型架構,但有以下例外:我們使用第 2.1 節所述的噪聲 top-k 門控,而非附錄 F 中的方案。編碼器和解碼器中的 MoE 層是非階層式 MoE,具有 n = 512 個專家及 k = 2。每個專家具有較大的隱藏層,大小為 8192。這使 MoE 層中的計算量加倍,將整個模型的計算預算從每時間步 85M 次運算提高到 102M 次運算。
Training: We trained our networks using the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 2000 training steps, held constant for an additional 8000 steps, and decreased after that so as to be proportional to the inverse square root of the step number. For the single-language-pair models, similarly to (Wu et al., 2016), we applied dropout (Zaremba et al., 2014) to the output of all embedding, LSTM and MoE layers, using DropProb = 0.4. Training was done synchronously on a cluster of up to 64 GPUs as described in section 3. Each training batch consisted of a set of sentence pairs containing roughly 16000 words per GPU.
訓練:我們使用 Adam 優化器(Kingma & Ba, 2015)訓練網路。基礎學習率在最初 2000 個訓練步驟中線性遞增,接著維持固定值 8000 步,之後按步數的平方根倒數成比例地衰減。對於單語言對模型,類似於 (Wu et al., 2016),我們對所有嵌入層、LSTM 層和 MoE 層的輸出應用 dropout(Zaremba et al., 2014),DropProb 設為 0.4。訓練在最多 64 個 GPU 的叢集上同步進行,詳見第 3 節。每個訓練批次由一組句子對構成,每個 GPU 約包含 16000 個詞。
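A minimal sketch of the learning-rate schedule described above (linear warm-up for 2000 steps, constant for the next 8000 steps, then inverse-square-root decay). The function is an illustrative reconstruction rather than code from the paper; the continuity scaling at the transition point is an assumption.

```python
def learning_rate(step, base_lr=1.0, warmup_steps=2000, constant_steps=8000):
    """Learning-rate schedule sketched from the description in the text."""
    if step < warmup_steps:
        # Linear increase from 0 to base_lr over the first 2000 steps.
        return base_lr * step / warmup_steps
    if step < warmup_steps + constant_steps:
        # Held constant for an additional 8000 steps.
        return base_lr
    # Afterwards, proportional to the inverse square root of the step number.
    # The scaling below keeps the schedule continuous at the transition point
    # (an assumption; the paper only states the proportionality).
    transition = warmup_steps + constant_steps
    return base_lr * (transition ** 0.5) / (step ** 0.5)
```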
To ensure balanced expert utilization we set w_importance = 0.01 and w_load = 0.01, as described in Section 4 and Appendix A.
為確保專家資源平衡,我們將 w_importance 設定為 0.01,w_load 設定為 0.01,詳情請見第 4 節和附錄 A。
Metrics: We evaluated our models using the perplexity and the standard BLEU score metric. We reported tokenized BLEU score as computed by the multi-bleu.pl script, downloaded from the public implementation of Moses (on Github), which was also used in (Luong et al., 2015a).
指標:我們使用困惑度和標準 BLEU 分數指標來評估模型。我們報告由 multi-bleu.pl 腳本計算的 tokenized BLEU 分數,該腳本下載自 Moses 的公開實作(GitHub 上),(Luong et al., 2015a) 也使用了相同的腳本。
Results: Tables 2, 3 and 4 in Section 5.3 show comparisons of our results to other published methods. Figure 4 shows test perplexity as a function of number of words in the (training data's) source sentences processed for models with different numbers of experts. As can be seen from the Figure, as we increased the number of experts to approach 2048, the test perplexity of our model continued to improve.
結果:第 5.3 節中的表 2、3 和 4 將我們的結果與其他已發表的方法進行了比較。圖 4 顯示了具有不同專家數量的模型,其測試困惑度隨(訓練資料)源句子處理詞數變化的情形。如圖所示,當我們將專家數量增加到接近 2048 時,模型的測試困惑度持續改善。
Figure 4: Perplexity on WMT'14 En → Fr (left) and Google Production En → Fr (right) datasets as a function of number of words processed. The large differences between models at the beginning of training are due to different batch sizes. All models incur the same computational budget (85M ops/timestep) except the one with no experts.
圖 4:WMT'14 En → Fr(左)和 Google 生產環境 En → Fr(右)資料集上的困惑度,作為已處理詞數的函數。訓練初期模型之間的巨大差異是由於批次大小不同。除了沒有專家的模型之外,所有模型的計算預算皆相同(85M 次運算/時間步)。
We found that the experts indeed become highly specialized by syntax and/or semantics, as can be seen in Table 9. For example, one expert is used when the indefinite article "a" introduces the direct object in a verb phrase indicating importance or leadership.
我們發現,正如表 9 所示,專家確實會依語法和/或語義變得高度專業化。例如,當不定冠詞「a」在表示重要性或領導力的動詞短語中引導直接賓語時,就會使用某一個特定的專家。
Table 9: Contexts corresponding to a few of the 2048 experts in the MoE layer in the encoder portion of the WMT'14 En → Fr translation model. For each expert i, we sort the inputs in a training batch in decreasing order of G(x)_i, and show the words surrounding the corresponding positions in the input sentences.
表 9:WMT'14 英文至法文翻譯模型編碼器部分的 MoE 層中,2048 個專家其中幾個所對應的語境。對於每個專家 i,我們將訓練批次中的輸入按 G(x)_i 的遞減順序排序,並顯示輸入句子中相應位置周圍的詞。
Expert 381 | Expert 752 | Expert 2004 |
---|---|---|
… with researchers , … | … plays a core … | … with rapidly growing … |
… to innovation . … | … plays a critical … | … under static conditions … |
… tics researchers . … | … provides a legislative … | … to swift ly … |
… the generation of … | … play a leading … | … to dras tically … |
… technology innovations is … | … assume a leadership … | … the rapid and … |
… technological innovations , … | … plays a central … | … the fast est … |
… support innovation throughout … | … taken a leading … | … the Quick Method … |
… role innovation will … | … established a reconciliation … | … rec urrent ) … |
… research scienti st … | … played a vital … | … provides quick access … |
… promoting innovation where … | … have a central … | … of volatile organic … |
Due to some peculiarities in our infrastructure which have since been fixed, at the time we ran some of the machine translation experiments, our models ran faster if every expert received exactly the same batch size. To accommodate this, we used a different gating function which we describe below.
由於我們基礎設施中一些後來已修復的特殊狀況,在我們進行部分機器翻譯實驗時,如果每個專家接收到完全相同的批次大小,模型會執行得更快。為此,我們使用了不同的門控函數,如下所述。
Recall that we define the softmax gating function to be:
回想一下,我們將 softmax 門控函數定義為:
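The equation itself did not survive in this copy. Based on the softmax gating definition given earlier in the paper, it should read approximately as follows, where W_g is the trainable gating weight matrix:

```latex
G_\sigma(x) = \mathrm{Softmax}(x \cdot W_g)
```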
Sparse Gating (alternate formulation): To obtain a sparse gating vector, we multiply G_σ(x) component-wise with a sparse mask M(G_σ(x)) and normalize the output. The mask itself is a function of G_σ(x) and specifies which experts are assigned to each input example:
稀疏門控(替代公式):為了獲得稀疏的門控向量,我們將 G_σ(x) 逐分量地與稀疏遮罩 M(G_σ(x)) 相乘,並將輸出歸一化。該遮罩本身是 G_σ(x) 的函數,指定每個輸入樣本分配給哪些專家:
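The corresponding formula is missing here; a plausible reconstruction from the description (component-wise product with the mask, then renormalization over the n experts) is:

```latex
G(x)_i = \frac{G_\sigma(x)_i \, M(G_\sigma(x))_i}{\sum_{j=1}^{n} G_\sigma(x)_j \, M(G_\sigma(x))_j}
```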
Top-K Mask: To implement top-k gating in this formulation, we would let M(v) = TopK(v, k), where:
Top-K 遮罩:為了在這個公式中實現 top-k 門控,我們令 M(v) = TopK(v, k),其中:
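The definition of TopK is absent in this copy; consistent with the surrounding text, it can be written as:

```latex
\mathrm{TopK}(v, k)_i =
\begin{cases}
1 & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\
0 & \text{otherwise}
\end{cases}
```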
Batchwise Mask: To force each expert to receive the exact same number of examples, we introduce an alternative mask function, M_batchwise(X, m), which operates over batches of input vectors. Instead of keeping the top k values per example, we keep the top m values per expert across the training batch, where m = k|X|/n, so that each example is sent to an average of k experts.
批次遮罩:為了強制每個專家接收完全相同數量的樣本,我們引入了一個替代遮罩函數 M_batchwise(X, m),它作用於整批輸入向量。我們不再保留每個樣本的前 k 個值,而是保留訓練批次中每個專家的前 m 個值,其中 m = k|X|/n,使得每個樣本平均被送給 k 個專家。
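As an illustration of the batchwise mask, the following is an interpretive NumPy sketch, assuming X is a |X| × n matrix of gate values with one row per example and one column per expert; it is not the authors' implementation.

```python
import numpy as np

def batchwise_mask(X, k):
    """Keep the top m gate values per expert (column) across the batch,
    where m = k * |X| / n, so each example is routed to k experts on average."""
    num_examples, num_experts = X.shape
    m = (k * num_examples) // num_experts          # top-m examples per expert
    mask = np.zeros_like(X)
    # For each expert, mark the m examples with the largest gate values.
    top_rows = np.argsort(-X, axis=0)[:m]          # indices of top-m rows per column
    mask[top_rows, np.arange(num_experts)] = 1.0
    return mask
```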
As our experiments suggest and also observed in (Ioffe & Szegedy, 2015), using a batchwise function during training (such as M_batchwise) requires modifications to the inference when we may not have a large batch of examples. Our solution to this is to train a vector T of per-expert threshold values to approximate the effects of the batchwise mask. We use the following mask at inference time:
正如我們的實驗所顯示,以及 (Ioffe & Szegedy, 2015) 中所觀察到的,在訓練期間使用批次函數(例如 M_batchwise)時,若推理時沒有大批次的樣本,就需要對推理進行修改。我們的解決方法是訓練一個由每個專家的閾值組成的向量 T,以近似批次遮罩的效果。我們在推理時使用以下遮罩:
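The inference-time mask itself is missing from this copy; given the description, it is presumably of the form:

```latex
M_{\text{threshold}}(x, T)_i =
\begin{cases}
1 & \text{if } x_i > T_i \\
0 & \text{otherwise}
\end{cases}
```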
To learn the threshold values, we apply an additional loss at training time which is minimized when the batchwise mask and the threshold mask are identical.
為了學習閾值,我們在訓練時加入一個額外的損失,當批次遮罩與閾值遮罩相同時,該損失達到最小。
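The loss itself is not shown in this copy. One form consistent with the description (penalizing disagreement between the two masks, weighted by how far each gate value lies from its threshold) is, as an assumption:

```latex
L_{\text{batchwise}}(X, T, m) = \sum_{j=1}^{|X|} \sum_{i=1}^{n}
\big( M_{\text{threshold}}(x_j, T)_i - M_{\text{batchwise}}(X, m)_{j,i} \big)
\big( X_{j,i} - T_i \big)
```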
The attention mechanism described in GNMT (Wu et al., 2016) involves a learned "Attention Function" A(x_i, y_j) which takes a "source vector" x_i and a "target vector" y_j, and must be computed for every source time step i and target time step j. In GNMT, the attention function is implemented as a feed forward neural network with a hidden layer of size n. It can be expressed as:
GNMT (Wu et al., 2016) 中描述的注意力機制涉及一個學習到的「注意力函數」A(x_i, y_j),它接收一個「源向量」x_i 和一個「目標向量」y_j,並且必須對每個源時間步 i 和目標時間步 j 進行計算。在 GNMT 中,注意力函數被實作為一個具有大小為 n 的隱藏層的前饋神經網路。它可以表示為:
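The formula did not survive extraction here. Based on the description that follows (weight matrices U and W, weight vector V, hidden size n), the GNMT attention function has approximately the form:

```latex
A_{\text{GNMT}}(x_i, y_j) = \sum_{d=1}^{n} V_d \,\tanh\!\big((x_i U)_d + (y_j W)_d\big)
```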
Where U and W are trainable weight matrices and V is a trainable weight vector.
其中 U 和 W 是可訓練的權重矩陣,V 是可訓練的權重向量。
For performance reasons, in our models, we used a slightly different attention function:
為提升效能,我們模型中採用了一個略微不同的注意力機制:
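The modified function is likewise missing in this copy. To the best of recollection of the original paper's appendix, the variant replaces the sum inside the nonlinearity with a product of two tanh terms, which is what allows the batched matrix-multiplication formulation described next:

```latex
A(x_i, y_j) = \sum_{d=1}^{n} V_d \,\tanh\!\big((x_i U)_d\big)\,\tanh\!\big((y_j W)_d\big)
```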
With our attention function, we can simultaneously compute the attention function on multiple source time steps and multiple target time steps using optimized matrix multiplications. We found little difference in quality between the two functions.
藉由我們的注意力函數,我們可以使用優化的矩陣乘法,同時在多個源時間步和多個目標時間步上計算注意力函數。我們發現這兩種函數在品質上差異很小。