# Distilling the knowledge in a neural network
###### tags:`論文翻譯` `deeplearning` `Distilling Knowledge`
[TOC]
## 說明
區塊如下分類,原文區塊為藍底,翻譯區塊為綠底,部份專業用語翻譯參考國家教育研究院
:::info
原文
:::
:::success
翻譯
:::
:::warning
個人註解,任何的翻譯不通暢部份都請留言指導
:::
:::danger
* [paper hyperlink](https://arxiv.org/pdf/1503.02531.pdf)
:::
## Abstract
:::info
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions [3]. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators [1] have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
:::
:::success
一個可以提升幾乎所有機器學習演算法效能的非常簡單的方法,就是在相同資料上訓練許多不同的模型,然後平均它們的預測。不幸的是,使用整個集成模型來做預測非常麻煩,而且計算成本可能太高,以致於無法佈署到大量用戶的情境,尤其是如果其中的個別模型是大型神經網路的話。Caruana跟他的合作者已經證明,是可以將集成模型中的知識壓縮到單一模型的,這種單一模型更容易佈署,而我們使用不同的壓縮技術進一步發展這種方法。我們在MNIST上得到一些讓人驚訝的成果,並且我們證明,透過將集成模型中的知識蒸餾到單一模型,可以顯著提升一個頻繁使用的商業系統的聲學模型(acoustic model)。我們還引入一種由一個或多個完整模型與許多專家模型(specialist model)所組成的新集成類型,其中專家模型學習區分完整模型會混淆的細粒度類別(fine-grained classes)。不同於混合專家模型(mixture of experts),這些專家模型可以快速地以平行化的方式來訓練。
:::
## 1 Introduction
:::info
Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of traveling and reproduction. In large-scale machine learning, we typically use very similar models for the training stage and the deployment stage despite their very different requirements: For tasks like speech and object recognition, training must extract structure from very large, highly redundant datasets but it does not need to operate in real time and it can use a huge amount of computation. Deployment to a large number of users, however, has much more stringent requirements on latency and computational resources. The analogy with insects suggests that we should be willing to train very cumbersome models if that makes it easier to extract structure from the data. The cumbersome model could be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout [9]. Once the cumbersome model has been trained, we can then use a different kind of training, which we call “distillation” to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment. A version of this strategy has already been pioneered by Rich Caruana and his collaborators [1]. In their important paper they demonstrate convincingly that the knowledge acquired by a large ensemble of models can be transferred to a single small model.
:::
:::success
很多昆蟲都有一個幼蟲型態,針對從環境中提取能量與養分做了最佳化,以及一個截然不同的成蟲型態,針對移動與繁殖這些非常不同的需求做了最佳化。在大型機器學習中,儘管訓練階段與佈署階段的需求非常不一樣,我們通常還是使用非常相似的模型:對於像是語音與物體辨識的任務,訓練必須從非常大型且高度冗餘的資料集中提取結構,但它不需要即時運作,而且可以使用大量的計算。然而,佈署到大量使用者的環境,對於延遲與計算資源就有更嚴格的要求。這個與昆蟲的類比說明著,如果能夠讓我們更輕鬆地從資料中提取結構,那我們應該願意訓練非常笨重的模型。這種笨重的模型可以是各別訓練模型的集成,或是一個用著非常強力的regularizer(如dropout,老爺子幫自己打廣告?)訓練的單一超大型模型。一旦笨重的模型訓練好了,我們就可以使用不同類型的訓練,稱之為「蒸餾(distillation)」,將知識從笨重的模型轉移到一個更適合佈署的小型模型。Rich Caruana跟他的合作者已經率先提出這種策略的一個版本。在他們的重要論文中提出了有力的證明,大型模型集成所獲取的知識是可以轉移到單一的小型模型的。
:::
:::info
A conceptual block that may have prevented more investigation of this very promising approach is that we tend to identify the knowledge in a trained model with the learned parameter values and this makes it hard to see how we can change the form of the model but keep the same knowledge. A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors. For cumbersome models that learn to discriminate between a large number of classes, the normal training objective is to maximize the average log probability of the correct answer, but a side-effect of the learning is that the trained model assigns probabilities to all of the incorrect answers and even when these probabilities are very small, some of them are much larger than others. The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize. An image of a BMW, for example, may only have a very small chance of being mistaken for a garbage truck, but that mistake is still many times more probable than mistaking it for a carrot.
:::
:::success
一個可能阻礙大家進一步研究這種非常有前景的方法的概念上的障礙,就是我們傾向於將訓練好的模型中的知識等同於學習到的參數值,這讓我們很難理解要如何改變模型的形式卻保留相同的知識。關於知識的一個更抽象的觀點(將知識從任何特定的實例中解放出來)是:知識是一個學習到的、從輸入向量到輸出向量的映射。對於學習區分大量類別的笨重模型來說,正常的訓練目標是最大化正確答案的平均對數機率,不過學習的一個副作用是,訓練好的模型也會給所有不正確的答案分配機率,而且即使這些機率非常小,其中某些仍然比其它的大得多。不正確答案之間的相對機率,透露了很多關於這個笨重模型傾向於如何泛化的信息。舉例來說,一張BMW的影像也許只有非常小的機率被誤認為垃圾車,但這種錯誤的機率仍然比被誤認為胡蘿蔔高出許多倍。
:::
:::info
It is generally accepted that the objective function used for training should reflect the true objective of the user as closely as possible. Despite this, models are usually trained to optimize performance on the training data when the real objective is to generalize well to new data. It would clearly be better to train models to generalize well, but this requires information about the correct way to generalize and this information is not normally available. When we are distilling the knowledge from a large model into a small one, however, we can train the small model to generalize in the same way as the large model. If the cumbersome model generalizes well because, for example, it is the average of a large ensemble of different models, a small model trained to generalize in the same way will typically do much better on test data than a small model that is trained in the normal way on the same training set as was used to train the ensemble.
:::
:::success
普遍認為,用於訓練的目標函數應該盡可能地反映使用者的實際目標。儘管如此,當實際目標是要能很好地泛化到新資料時,模型通常仍是被訓練來最佳化其在訓練資料上的效能。很明顯地,把模型訓練成能良好泛化會更好,不過這需要關於正確泛化方式的信息,而這些信息通常無法取得。然而,當我們從大型模型蒸餾知識到小型模型的時候,我們可以訓練小型模型用著跟大型模型相同的方式來泛化。如果笨重的模型能夠很好地泛化,舉例來說,因為它是許多不同模型的大型集成的平均,那麼,被訓練成以相同方式泛化的小模型,通常會比以正常方式在同一個訓練集(也就是用來訓練集成的那個訓練集)上訓練的小模型,在測試資料上表現得更好。
:::
:::info
An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. For this transfer stage, we could use the same training set or a separate “transfer” set. When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.
:::
:::success
一個明顯的、將笨重模型的泛化能力轉移到小型模型的方式,就是將笨重模型所產生的類別機率做為訓練小型模型的"soft targets"。對於這個轉移階段來說,我們可以使用相同的訓練集或是單獨的"transfer" set。當笨重模型是許多較簡單模型的大型集成的時候,我們可以使用它們各自預測分佈的[算術](https://terms.naer.edu.tw/detail/d257d765a0fde48772a6e6b57b648787/)或是幾何平均值來做為soft targets。當soft targets有高的熵(entropy)的時候,它們在每個訓練案例中所提供的信息會比hard targets來的多,而且訓練案例之間的梯度的變異數也會低很多,所以啊,小型模型通常可以在比訓練原始笨重模型少得多的資料上訓練,而且還可以使用高得多的學習率(learning rate)。
:::
:::info
For tasks like MNIST in which the cumbersome model almost always produces the correct answer with very high confidence, much of the information about the learned function resides in the ratios of very small probabilities in the soft targets. For example, one version of a 2 may be given a probability of $10^{-6}$ of being a 3 and $10^{-9}$ of being a 7 whereas for another version it may be the other way around. This is valuable information that defines a rich similarity structure over the data (i. e. it says which 2’s look like 3’s and which look like 7’s) but it has very little influence on the cross-entropy cost function during the transfer stage because the probabilities are so close to zero. Caruana and his collaborators circumvent this problem by using the logits (the inputs to the final softmax) rather than the probabilities produced by the softmax as the targets for learning the small model and they minimize the squared difference between the logits produced by the cumbersome model and the logits produced by the small model. Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.
:::
:::success
對於MNIST這類的任務,笨重模型幾乎總是能以非常高的置信度產生正確答案,而關於學習到的函數的大部份信息,都存在於soft targets中那些非常小的機率的比率裡。舉例來說,某一個2的版本也許被給予$10^{-6}$的機率是3、$10^{-9}$的機率是7,而另一個版本則可能剛好相反。這是很有價值的信息,其定義了資料上豐富的相似性結構(即,它告訴你哪些2看起來像3、哪些看起來像7),不過它在轉移階段對cross-entropy cost function的影響非常小,因為這些機率都非常接近0。Caruana跟他的合作者利用logits(final softmax的輸入)而不是softmax所產生的機率做為學習小型模型的目標(targets)來繞過這個問題,他們最小化笨重模型所產生的logits與小型模型所產生的logits之間的平方差。我們更為通用的解決方案,稱為"distillation",是提高final softmax的溫度,直到笨重模型產生出一組適當地soft的targets。然後,我們在訓練小型模型的時候使用相同的高溫來匹配這些soft targets。後面會說明,匹配笨重模型的logits實際上是distillation的一種特殊情況。
:::
:::info
The transfer set that is used to train the small model could consist entirely of unlabeled data [1] or we could use the original training set. We have found that using the original training set works well, especially if we add a small term to the objective function that encourages the small model to predict the true targets as well as matching the soft targets provided by the cumbersome model. Typically, the small model cannot exactly match the soft targets and erring in the direction of the correct answer turns out to be helpful.
:::
:::success
用來訓練小型模型的轉移集(transfer set)可以完全由未標記的資料組成,或者可以使用原始的訓練集(training set)。我們發現到,使用原始訓練集的效果不錯,特別是如果我們在目標函數中加入一個小項,鼓勵小型模型除了匹配笨重模型所提供的soft targets之外,也要預測實際目標(true targets)。通常小型模型無法完全地匹配soft targets,而往正確答案的方向犯錯反而是有幫助的。
:::
## 2 Distillation
:::info
Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit,$z_i$ , computed for each class into a probability, $q_i$, by comparing $z_i$ with the other logits.
$$
q_i=\dfrac{exp(z_i/T)}{\sum_j exp(z_j/T)} \tag{1}
$$
where $T$ is a temperature that is normally set to 1. Using a higher value for $T$ produces a softer probability distribution over classes.
:::
:::success
神經網路通常利用"softmax"輸出層來產生類別機率,這個輸出層會將為每個類別計算出的logit $z_i$,透過與其它logits比較,轉換成機率$q_i$。
$$
q_i=\dfrac{exp(z_i/T)}{\sum_j exp(z_j/T)} \tag{1}
$$
其中$T$是溫度,通常設置為1。$T$使用較高的值,就會在類別上產生較soft(較平滑)的機率分佈。
:::
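:::warning
下面用一小段numpy程式示意Eq. 1的計算(這不是論文的程式碼,logits的數值只是假設性的例子),可以看到溫度$T$越高,輸出的機率分佈就越平滑:
```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Eq. 1:q_i = exp(z_i/T) / sum_j exp(z_j/T),T 越大,分佈越 soft。"""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()                      # 數值穩定:先減去最大值,結果不變
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 2.0, 1.0, 0.5, 0.1]       # 假設性的 5 類 logits
for T in (1, 2, 4, 20):
    print(T, np.round(softmax_with_temperature(logits, T), 2))
```
:::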
:::info
In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
:::
:::success
在最簡單的distillation(蒸餾)形式中,知識會透過在transfer set上的訓練而轉移到蒸餾過的模型中,並且會在transfer set中的每一個案例都使用soft target distribution,這個soft target distribution是由笨重模型在其softmax中使用較高的溫度所生成的。在訓練distilled model的時候會使用相同的高溫,不過訓練好之後就會改用1了。
:::
:::info
When the correct labels are known for all or some of the transfer set, this method can be significantly improved by also training the distilled model to produce the correct labels. One way to do this is to use the correct labels to modify the soft targets, but we found that a better way is to simply use a weighted average of two different objective functions. The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels. This is computed using exactly the same logits in softmax of the distilled model but at a temperature of 1. We found that the best results were generally obtained by using a condiderably lower weight on the second objective function. Since the magnitudes of the gradients produced by the soft targets scale as $1/T^2$ it is important to multiply them by $T^2$ when using both hard and soft targets. This ensures that the relative contributions of the hard and soft targets remain roughly unchanged if the temperature used for distillation is changed while experimenting with meta-parameters.
:::
:::success
當transfer set全部或部份的正確標記是已知的時候,也可以同時訓練distilled model去產生正確標記,來顯著改進這個方法。一個做法是使用正確標記來修改soft targets,不過我們發現更好的做法是直接使用兩個不同目標函數的加權平均。第一個目標函數是與soft targets的cross entropy,計算這個cross entropy時,distilled model的softmax所使用的高溫,與從笨重模型生成soft targets時所用的高溫相同。第二個目標函數是與正確標記的cross entropy。這是使用distilled model的softmax中完全相同的logits計算的,只是溫度為1。我們發現,最好的結果通常是在第二個目標函數使用相當低的權重時得到的。因為soft targets所產生的梯度,其幅度會以$1/T^2$的比例縮放,所以在同時使用hard與soft targets的時候,把soft targets那一項乘上$T^2$是很重要的。這確保了在實驗meta-parameters時,如果用於蒸餾的溫度改變了,hard與soft targets的相對貢獻仍大致維持不變。
:::
:::warning
溫度$T$愈大,經過softmax得到的機率分佈就愈平滑;在$T=1$時機率很小的類別,會因為$T$的放大而分到較大的機率,某種程度上保留了這些類別的信息,理解上這對反向傳播應該很有幫助。下面給出一個數值範例。
一個例子是一個5分類問題,logits為[5.0, 2.0, 1.0, 0.5, 0.1]:
* 當$T=1$時,soft target約為[0.92, 0.05, 0.02, 0.01, 0.01]
* 當$T=2$時,soft target約為[0.65, 0.14, 0.09, 0.07, 0.06]
* 當$T=4$時,soft target約為[0.41, 0.19, 0.15, 0.13, 0.12]
:::
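:::warning
下面是依照本節描述寫的一個最小示意(不是論文的原始程式碼):與soft targets的cross entropy以高溫$T$計算並乘上$T^2$,與正確標記的cross entropy以$T=1$計算並給較低的權重;其中`T`、`hard_weight`與各個logits數值都是假設性的。
```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=20.0, hard_weight=0.1):
    """兩個目標函數的加權平均:
    1) 與 soft targets 的 cross entropy(student 與 teacher 都用相同的高溫 T),乘上 T^2
    2) 與正確標記的 cross entropy(同一組 student logits,但溫度為 1),權重較低"""
    p = softmax(teacher_logits, T)                 # 笨重模型在高溫下產生的 soft targets
    q = softmax(student_logits, T)                 # 小模型在相同高溫下的預測
    soft_ce = -np.sum(p * np.log(q))
    hard_ce = -np.log(softmax(student_logits, 1.0)[hard_label])
    # soft targets 的梯度幅度與 1/T^2 成正比,乘上 T^2 可讓兩項的相對貢獻不隨 T 改變
    return (1.0 - hard_weight) * (T ** 2) * soft_ce + hard_weight * hard_ce

teacher = np.array([5.0, 2.0, 1.0, 0.5, 0.1])      # 假設性的數值,僅供示意
student = np.array([3.0, 1.5, 0.8, 0.3, 0.2])
print(distillation_loss(student, teacher, hard_label=0))
```
:::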
### 2.1 Matching logits is a special case of distillation
:::info
Each case in the transfer set contributes a cross-entropy gradient, $dC/dz_i$ , with respect to each logit, $z_i$ of the distilled model. If the cumbersome model has logits $v_i$ which produce soft target probabilities $p_i$ and the transfer training is done at a temperature of $T$ , this gradient is given by:
$$
\dfrac{\partial{C}}{\partial{z_i}}=\dfrac{1}{T}(q_i-p_i)=\dfrac{1}{T}\left(\dfrac{e^{z_i/T}}{\sum_je^{z_j/T}}-\dfrac{e^{v_i/T}}{\sum_je^{v_j/T}}\right) \tag{2}
$$
:::
:::success
transfer set中的每一個案例,對於蒸餾模型的每一個logit $z_i$,都貢獻了一個交叉熵梯度$dC/dz_i$。如果笨重模型有logits $v_i$,其產生soft target機率$p_i$,且轉移訓練是在溫度$T$下完成的,則這個梯度為:
$$
\dfrac{\partial{C}}{\partial{z_i}}=\dfrac{1}{T}(q_i-p_i)=\dfrac{1}{T}\left(\dfrac{e^{z_i/T}}{\sum_je^{z_j/T}}-\dfrac{e^{v_i/T}}{\sum_je^{v_j/T}}\right) \tag{2}
$$
:::
:::info
If the temperature is high compared with the magnitude of the logits, we can approximate:
$$
\dfrac{\partial{C}}{\partial{z_i}} \approx \dfrac{1}{T}\left(\dfrac{1+z_i/T}{N+\sum_jz_j/T}-\dfrac{1+v_i/T}{N+\sum_jv_j/T}\right)\tag{3}
$$
:::
:::success
如果溫度相比於logits的幅度較高的話,那我們可近似:
$$
\dfrac{\partial{C}}{\partial{z_i}} \approx \dfrac{1}{T}\left(\dfrac{1+z_i/T}{N+\sum_jz_j/T}-\dfrac{1+v_i/T}{N+\sum_jv_j/T}\right)\tag{3}
$$
:::
:::info
If we now assume that the logits have been zero-meaned separately for each transfer case so that $\sum_jz_j=\sum_jv_j=0$ Eq. 3 simplifies to:
$$
\dfrac{\partial{C}}{\partial{z_i}} \approx \dfrac{1}{NT^2}(z_i-v_i)\tag{4}
$$
:::
:::success
如果我們現在假設對每一個transfer case的logits都各自做了零均值處理,使得$\sum_jz_j=\sum_jv_j=0$,那Eq. 3就可以簡化成:
$$
\dfrac{\partial{C}}{\partial{z_i}} \approx \dfrac{1}{NT^2}(z_i-v_i)\tag{4}
$$
:::
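:::warning
可以用下面這段小程式(僅為示意)做數值驗證:當$T$遠大於logits的幅度、且logits各自做了零均值處理時,Eq. 2的梯度會趨近Eq. 4的$(z_i-v_i)/NT^2$。
```python
import numpy as np

def soft_target_grad(z, v, T):
    """Eq. 2:dC/dz_i = (q_i - p_i) / T"""
    q = np.exp(z / T); q /= q.sum()
    p = np.exp(v / T); p /= p.sum()
    return (q - p) / T

rng = np.random.default_rng(0)
N = 10
z = rng.normal(size=N); z -= z.mean()    # 蒸餾模型的 logits(零均值)
v = rng.normal(size=N); v -= v.mean()    # 笨重模型的 logits(零均值)

for T in (1.0, 10.0, 100.0):
    exact = soft_target_grad(z, v, T)
    approx = (z - v) / (N * T ** 2)      # Eq. 4
    print(T, np.abs(exact - approx).max())   # T 越大,差距越小
```
:::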
:::info
So in the high temperature limit, distillation is equivalent to minimizing $1/2(z_i-v_i)^2$, provided the logits are zero-meaned separately for each transfer case. At lower temperatures, distillation pays much less attention to matching logits that are much more negative than the average. This is potentially advantageous because these logits are almost completely unconstrained by the cost function used for training the cumbersome model so they could be very noisy. On the other hand, the very negative logits may convey useful information about the knowledge acquired by the cumbersome model. Which of these effects dominates is an empirical question. We show that when the distilled model is much too small to capture all of the knowledege in the cumbersome model, intermediate temperatures work best which strongly suggests that ignoring the large negative logits can be helpful.
:::
:::success
所以啊,在高溫極限下,蒸餾等同於最小化$1/2(z_i-v_i)^2$,前提是對每一個transfer case的logits都各自做了零均值處理。在較低的溫度下,蒸餾對於匹配那些比平均值負得多的logits所付出的注意力會少很多。這可能是一種優勢,因為這些logits幾乎完全不受用來訓練笨重模型的成本函數約束,所以它們可能充滿雜訊。另一方面,這些非常負的logits也許傳達了關於笨重模型所獲得知識的有用信息。哪一個效果佔主導地位是一個經驗問題。我們會說明,當蒸餾模型小到無法擷取笨重模型中的所有知識時,中間範圍的溫度效果最好,這強烈地說明了,忽略掉那些很負的logits可能是有幫助的。
:::
## 3 Preliminary experiments on MNIST
:::info
To see how well distillation works, we trained a single large neural net with two hidden layers of 1200 rectified linear hidden units on all 60,000 training cases. The net was strongly regularized using dropout and weight-constraints as described in [5]. Dropout can be viewed as a way of training an exponentially large ensemble of models that share weights. In addition, the input images were jittered by up to two pixels in any direction. This net achieved 67 test errors whereas a smaller net with two hidden layers of 800 rectified linear hidden units and no regularization achieved 146 errors. But if the smaller net was regularized solely by adding the additional task of matching the soft targets produced by the large net at a temperature of 20, it achieved 74 test errors. This shows that soft targets can transfer a great deal of knowledge to the distilled model, including the knowledge about how to generalize that is learned from translated training data even though the transfer set does not contain any translations.
:::
:::success
為了看看蒸餾的效果如何,我們訓練了一個單一的大型神經網路,有2個各1200個rectified linear hidden units的隱藏層,在全部60000個訓練案例上訓練。這個網路如[5]所述,使用dropout與weight-constraints做了強烈的正規化。dropout可以視為是一種訓練指數級數量、共享權重的模型集成的方式。此外,輸入影像會在任意方向上[抖動(jittered)](https://terms.naer.edu.tw/detail/84054360391723105ebdc19fcff2b0ca/)最多兩個pixels。這個網路達到67個測試錯誤,而另一個有兩個各800個rectified linear hidden units的隱藏層、且沒有正規化的較小網路則有146個錯誤。不過,如果這個較小的網路僅透過增加一個額外任務來做正規化,也就是匹配大型網路在溫度20下產生的soft targets,那它可以達到74個測試錯誤。這說明著,soft targets可以將大量的知識轉移到蒸餾模型中,其中包括從平移過(translated)的訓練資料中學到的如何泛化的知識,儘管transfer set並不包含任何平移。
:::
:::info
When the distilled net had 300 or more units in each of its two hidden layers, all temperatures above 8 gave fairly similar results. But when this was radically reduced to 30 units per layer, temperatures in the range 2.5 to 4 worked significantly better than higher or lower temperatures.
:::
:::success
當蒸餾模型的兩個隱藏層各有300個或更多的神經元時,所有高於8的溫度都會得到相當類似的結果。不過當每一層的神經元大幅降到30個的時候,溫度設置在2.5到4這個區間,會比更高或更低的溫度明顯來得好。
:::
:::info
We then tried omitting all examples of the digit 3 from the transfer set. So from the perspective of the distilled model, 3 is a mythical digit that it has never seen. Despite this, the distilled model only makes 206 test errors of which 133 are on the 1010 threes in the test set. Most of the errors are caused by the fact that the learned bias for the 3 class is much too low. If this bias is increased by 3.5 (which optimizes overall performance on the test set), the distilled model makes 109 errors of which 14 are on 3s. So with the right bias, the distilled model gets 98.6% of the test 3s correct despite never having seen a 3 during training. If the transfer set contains only the 7s and 8s from the training set, the distilled model makes 47.3% test errors, but when the biases for 7 and 8 are reduced by 7.6 to optimize test performance, this falls to 13.2% test errors.
:::
:::success
然後,我們試著把transfer set中數字3的所有樣本都移除。所以從蒸餾模型的角度來看,3是一個它從未見過、彷彿神話般的數字。儘管如此,蒸餾模型也只犯了206個測試錯誤,其中133個是發生在測試集中1010個3上。多數的錯誤是因為3這個類別學到的bias太低所造成的。如果把這個bias增加3.5(這最佳化了測試集的整體效能),那蒸餾模型會犯109個錯誤,其中14個是在3上。因此,在正確的bias下,即使訓練過程中從未見過3,蒸餾模型對測試集中的3仍有98.6%的正確率。如果transfer set只包含訓練集中的7與8,那蒸餾模型會有47.3%的測試錯誤率,不過,當7與8的bias降低7.6來最佳化測試效能時,測試錯誤率就會降到13.2%。
:::
## 4 Experiments on speech recognition
:::info
In this section, we investigate the effects of ensembling Deep Neural Network (DNN) acoustic models that are used in Automatic Speech Recognition (ASR). We show that the distillation strategy that we propose in this paper achieves the desired effect of distilling an ensemble of models into a single model that works significantly better than a model of the same size that is learned directly from the same training data.
:::
:::success
這一節我們就來研究用於Automatic Speech Recognition(自動語音辨識,ASR)的深度神經網路(DNN)聲學模型做集成的效果。我們會說明,我們在論文中所提出的蒸餾策略實現了將模型集成蒸餾成單一模型的預期效果,而且明顯優於以相同訓練資料直接學習的相同大小的模型。
:::
:::info
State-of-the-art ASR systems currently use DNNs to map a (short) temporal context of features derived from the waveform to a probability distribution over the discrete states of a Hidden Markov Model (HMM) [4]. More specifically, the DNN produces a probability distribution over clusters of tri-phone states at each time and a decoder then finds a path through the HMM states that is the best compromise between using high probability states and producing a transcription that is probable under the language model.
:::
:::success
當下最好的ASR系統使用DNNs將從[波形](https://terms.naer.edu.tw/detail/efa9120d1f67420273286200b91533ec/)導出的(短)時間上下文特徵,映射到Hidden Markov Model (HMM)的離散狀態上的機率分佈。更具體而言,DNN在每個時間點會生成tri-phone states的群集(clusters)上的機率分佈,然後解碼器找出一條通過HMM states的路徑,這條路徑是在使用高機率狀態與產生在語言模型下合理的[轉錄](https://terms.naer.edu.tw/detail/6abad59b5e6fc4a8b41e0dac1c2ca11f/)之間的最佳折衷。
:::
:::info
Although it is possible (and desirable) to train the DNN in such a way that the decoder (and, thus, the language model) is taken into account by marginalizing over all possible paths, it is common to train the DNN to perform frame-by-frame classification by (locally) minimizing the cross entropy between the predictions made by the net and the labels given by a forced alignment with the ground truth sequence of states for each observation:
$$
\theta=\arg\underset{\theta'}{\max}P(h_t\vert s_t;\theta')
$$
where $\theta$ are the parameters of our acoustic model $P$ which maps acoustic observations at time $t$, $s_t$, to a probability, $P(h_t\vert s_t;\theta')$, of the “correct” HMM state $h_t$, which is determined by a forced alignment with the correct sequence of words. The model is trained with a distributed stochastic gradient descent approach.
:::
:::success
儘管有可能(而且是我們想要的)以把解碼器(以及語言模型)納入考慮的方式來訓練DNN,也就是對所有可能的路徑做邊緣化(marginalizing),但常見的做法是透過(局部地)最小化網路的預測與標記之間的交叉熵,來訓練DNN做frame-by-frame的分類,其中標記是由每個觀測與ground truth狀態序列做強制對齊(forced alignment)所給出的:
$$
\theta=\arg\underset{\theta'}{\max}P(h_t\vert s_t;\theta')
$$
其中$\theta$是聲學模型$P$的參數,模型會將在時間$t$的聲學觀測,$s_t$,映射到機率,$P(h_t\vert s_t;\theta')$,也就是正確的HMM state $h_t$的機率,這個HMM state是由正確詞序做強制對齊所確定的。這模型以分布式隨機梯度下降法訓練而得。
:::
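:::warning
以下是frame-by-frame交叉熵訓練目標的一個極簡示意(僅為說明,並非該系統實際的訓練程式;`num_states`等名稱與數值都是假設的):對每個frame,取forced alignment給出的HMM state當作標記,最小化其負對數機率。
```python
import numpy as np

def frame_cross_entropy(frame_logits, aligned_states):
    """frame_logits: (frame 數, HMM state 數) 的 logits
    aligned_states: forced alignment 給出的每個 frame 的正確 HMM state 索引
    回傳平均的 -log P(h_t | s_t)"""
    z = frame_logits - frame_logits.max(axis=1, keepdims=True)
    log_q = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log softmax
    return -log_q[np.arange(len(aligned_states)), aligned_states].mean()

rng = np.random.default_rng(0)
num_frames, num_states = 5, 4                  # 假設性的小例子
logits = rng.normal(size=(num_frames, num_states))
states = np.array([0, 2, 2, 1, 3])
print(frame_cross_entropy(logits, states))
```
:::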
:::info
We use an architecture with 8 hidden layers each containing 2560 rectified linear units and a final softmax layer with 14,000 labels (HMM targets $h_t$). The input is 26 frames of 40 Mel-scaled filterbank coefficients with a 10ms advance per frame and we predict the HMM state of 21st frame. The total number of parameters is about 85M. This is a slightly outdated version of the acoustic model used by Android voice search, and should be considered as a very strong baseline. To train the DNN acoustic model we use about 2000 hours of spoken English data, which yields about 700M training examples. This system achieves a frame accuracy of 58.9%, and a Word Error Rate (WER) of 10.9% on our development set.
:::
:::success
我們使用的架構是8個隱藏層,每一層都有2560個rectified linear units,最後是一個有14000個標記的softmax layer(HMM targets $h_t$)。輸入是26個frames、每個frame為40個Mel-scaled filterbank係數,每個frame前進10ms,而我們預測的是第21個frame的HMM state。總參數量約為85M。這是Android voice search所使用的聲學模型的一個稍微過時的版本,應該被視為是一個非常強的baseline。為了訓練DNN聲學模型,我們使用大約2000小時的英語語音資料,這產生大約700M個訓練樣本。這個系統在我們的開發集上達到58.9%的frame accuracy與10.9%的Word Error Rate (WER)。
:::
:::warning
這邊的用語多為音學部份,因為在下沒有學習過,實在無法翻譯。下面給出百度的翻譯參考:
輸入是26幀的40 Mel縮放的濾波器組係數,每幀具有10ms的超前,並且我們預測第21幀的HMM狀態。
:::
### 4.1 Results
:::info
We trained 10 separate models to predict $P(h_t\vert s_t;\theta)$, using exactly the same architecture and training procedure as the baseline. The models are randomly initialized with different initial parameter values and we find that this creates sufficient diversity in the trained models to allow the averaged predictions of the ensemble to significantly outperform the individual models. We have explored adding diversity to the models by varying the sets of data that each model sees, but we found this to not significantly change our results, so we opted for the simpler approach. For the distillation we tried temperatures of [1, 2, 5, 10] and used a relative weight of 0.5 on the cross-entropy for the hard targets, where bold font indicates the best value that was used for table 1 .
:::
:::success
我們訓練了10個各別的模型來預測$P(h_t\vert s_t;\theta)$,使用跟baseline完全一樣的架構與訓練程序。模型以不同的初始參數值隨機初始化,我們發現這樣就能在訓練出的模型中創造出足夠的多樣性,讓集成的平均預測明顯優於個別模型。我們也探索過藉由改變每個模型所看到的資料集合來增加模型的多樣性,但發現這並不會明顯改變我們的結果,因此我們選擇較簡單的方法。蒸餾的部份我們嘗試了[1, 2, 5, 10]這幾個溫度,並對hard targets的cross-entropy使用0.5的相對權重,其中粗體字表示Table 1所使用的最佳值。
:::
:::info
Table 1 shows that, indeed, our distillation approach is able to extract more useful information from the training set than simply using the hard labels to train a single model. More than 80% of the improvement in frame classification accuracy achieved by using an ensemble of 10 models is transferred to the distilled model which is similar to the improvement we observed in our preliminary experiments on MNIST. The ensemble gives a smaller improvement on the ultimate objective of WER (on a 23K-word test set) due to the mismatch in the objective function, but again, the improvement in WER achieved by the ensemble is transferred to the distilled model.
:::
:::success
Table 1說明著,我們的蒸餾方法確實能夠比單純使用hard labels訓練單一模型,從訓練集中提取出更多有用的信息。使用10個模型的集成所達到的frame classification accuracy提升中,有超過80%被轉移到了蒸餾模型,這與我們在MNIST的初步實驗中觀察到的提升相近。由於目標函數的不匹配,集成在最終目標WER(在23K詞的測試集上)上的提升較小,不過同樣地,集成所達到的WER提升也被轉移到了蒸餾模型。
:::
:::info
![image](https://hackmd.io/_uploads/rk4llPVpp.png)
Table 1: Frame classification accuracy and WER showing that the distilled single model performs about as well as the averaged predictions of 10 models that were used to create the soft targets.
:::
:::info
We have recently become aware of related work on learning a small acoustic model by matching the class probabilities of an already trained larger model [8]. However, they do the distillation at a temperature of 1 using a large unlabeled dataset and their best distilled model only reduces the error rate of the small model by 28% of the gap between the error rates of the large and small models when they are both trained with hard labels.
:::
:::success
我們最近注意到一個相關研究,其透過匹配已訓練好的較大型模型的類別機率來學習一個小型聲學模型。然而,他們是在溫度為1的情況下使用大型未標記資料集來做蒸餾,而且他們最好的蒸餾模型也只將小模型的錯誤率縮小了28%的差距(指大、小模型都用hard labels訓練時,兩者錯誤率之間的差距)。
:::
## 5 Training ensembles of specialists on very big datasets
:::info
Training an ensemble of models is a very simple way to take advantage of parallel computation and the usual objection that an ensemble requires too much computation at test time can be dealt with by using distillation. There is, however, another important objection to ensembles: If the individual models are large neural networks and the dataset is very large, the amount of computation required at training time is excessive, even though it is easy to parallelize.
:::
:::success
訓練模型的集成是利用平行計算的一種非常簡單的方法,而通常對集成的反對意見是它在測試的時候需要太多計算,這一點可以用蒸餾來處理。然而,對於集成還有另一個重要的反對意見:如果各別的模型是大型神經網路,且資料集非常大,那麼即使很容易平行化,訓練時所需要的計算量還是太大了。
:::
:::info
In this section we give an example of such a dataset and we show how learning specialist models that each focus on a different confusable subset of the classes can reduce the total amount of computation required to learn an ensemble. The main problem with specialists that focus on making fine-grained distinctions is that they overfit very easily and we describe how this overfitting may be prevented by using soft targets.
:::
:::success
這一節我們就來給出一個這類資料集的範例,然後說明如何學習專家模型,其中每個模型都專注在不同的容易混淆的類別的子集,以此降低學習集成所需的總計算量。專注在處理細粒度區分的專家模型的主要問題在於它們很容易過擬合,我們會說明如何透過使用soft targets來預防這種過擬合的情形發生。
:::
### 5.1 The JFT dataset
:::info
JFT is an internal Google dataset that has 100 million labeled images with 15,000 labels. When we did this work, Google’s baseline model for JFT was a deep convolutional neural network [7] that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores. This training used two types of parallelism [2]. First, there were many replicas of the neural net running on different sets of cores and processing different mini-batches from the training set. Each replica computes the average gradient on its current mini-batch and sends this gradient to a sharded parameter server which sends back new values for the parameters. These new values reflect all of the gradients received by the parameter server since the last time it sent parameters to the replica. Second, each replica is spread over multiple cores by putting different subsets of the neurons on each core. Ensemble training is yet a third type of parallelism that can be wrapped around the other two types, but only if a lot more cores are available. Waiting for several years to train an ensemble of models was not an option, so we needed a much faster way to improve the baseline model.
:::
:::success
JFT是一個Google內部資料集,有1億張標記好的影像,共15000個類別。當我們做這項研究的時候,Google對JFT的baseline model是一個深度卷積神經網路,使用非同步隨機梯度下降在大量核心上訓練了大約六個月。這個訓練使用了兩種平行化。首先,神經網路有很多副本(replicas)分別跑在不同的核心集合上,處理來自訓練集的不同mini-batches。每個副本計算它當下這個mini-batch的平均梯度,然後把這個梯度送到分片的(sharded)參數伺服器,伺服器會回送參數的新值。這些新值反映了參數伺服器從上次發送參數給該副本以來所接收到的所有梯度。第二,每個副本透過把不同的神經元子集放在不同核心上,而分散在多個核心上。集成訓練是第三種平行化,可以包在前兩種之外,但前提是要有多得多的核心可用。等個幾年來訓練一個模型集成是不可能的事情,所以我們需要一種更快的方法來改進baseline model。
:::
### 5.2 Specialist Models
:::info
When the number of classes is very large, it makes sense for the cumbersome model to be an ensemble that contains one generalist model trained on all the data and many “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom). The softmax of this type of specialist can be made much smaller by combining all of the classes it does not care about into a single dustbin class.
:::
:::success
當類別數量非常大的時候,讓笨重模型是一個集成是合理的:這個集成包含一個用所有資料訓練的通用模型(generalist model),以及許多個「專家」模型(specialist models),每個專家模型都在一份富含來自某個非常容易混淆之類別子集(像是不同種類的蘑菇)樣本的資料上訓練。透過把它不在意的所有類別合併成單一個垃圾(dustbin)類別,可以讓這種專家模型的softmax變得小很多。
:::
:::info
To reduce overfitting and share the work of learning lower level feature detectors, each specialist model is initialized with the weights of the generalist model. These weights are then slightly modified by training the specialist with half its examples coming from its special subset and half sampled at random from the remainder of the training set. After training, we can correct for the biased training set by incrementing the logit of the dustbin class by the log of the proportion by which the specialist class is oversampled.
:::
:::success
為了降低過擬合(overfitting),並分擔學習較低階特徵偵測器的工作,每個專家模型都用通用模型的權重來初始化。接著透過訓練專家模型來稍微調整這些權重,其訓練樣本一半來自它的特殊子集,另一半則從訓練集的其餘部份隨機採樣。訓練之後,我們可以校正訓練集有偏差(biased)的問題:把垃圾類別的logit加上專家類別被過度採樣(oversampled)之比例的對數。
:::
:::warning
好多by,所以不太確定翻譯得是不是正確。
:::
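:::warning
關於「訓練後校正有偏差的訓練集」這一句,下面是一個假設性的小示意:把垃圾類別的logit加上專家類別被過度採樣之比例的對數(`oversample_ratio`是假設的名稱,表示專家子集在訓練資料中被放大的倍率)。
```python
import numpy as np

def correct_dustbin_logit(logits, dustbin_index, oversample_ratio):
    """把垃圾類別的 logit 加上 log(過度採樣比例),以校正有偏差的訓練集。"""
    corrected = np.array(logits, dtype=np.float64)
    corrected[dustbin_index] += np.log(oversample_ratio)
    return corrected

logits = np.array([2.0, 0.5, -1.0])            # 假設最後一個是垃圾類別
print(correct_dustbin_logit(logits, dustbin_index=-1, oversample_ratio=50.0))
```
:::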
### 5.3 Assigning classes to specialists
:::info
In order to derive groupings of object categories for the specialists, we decided to focus on categories that our full network often confuses. Even though we could have computed the confusion matrix and used it as a way to find such clusters, we opted for a simpler approach that does not require the true labels to construct the clusters.
:::
:::success
為了為專家模型推導出物件類別的分組,我們決定專注在我們的完整網路經常混淆的類別上。儘管我們可以計算出混淆矩陣並把它用來尋找這類群集,但我們選擇了一種更簡單、不需要實際標記就能建構群集的方法。
:::
:::info
In particular, we apply a clustering algorithm to the covariance matrix of the predictions of our generalist model, so that a set of classes $S^m$ that are often predicted together will be used as targets for one of our specialist models, $m$. We applied an on-line version of the K-means algorithm to the columns of the covariance matrix, and obtained reasonable clusters (shown in Table 2). We tried several clustering algorithms which produced similar results.
:::
:::success
具體來說,我們將聚類演算法套用在通用模型預測的[共變異方陣](https://terms.naer.edu.tw/detail/48af98fe9afc14409a4dd25ad6b61f6d/)上,這樣,經常被一起預測的類別集合$S^m$就會被用來做為其中一個專家模型$m$的目標。我們將on-line版本的K-means演算法套用在covariance matrix的columns上,獲得了合理的clusters(如Table 2所示)。我們試了幾種clustering algorithms,得到的結果都差不多。
:::
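:::warning
下面用scikit-learn的`MiniBatchKMeans`來示意這個做法(以mini-batch的K-means近似論文所說的on-line K-means;`preds`等名稱與數值都是假設性的,僅供示意)。
```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans   # 以 mini-batch K-means 近似 on-line K-means

rng = np.random.default_rng(0)
preds = rng.random((1000, 50))                # 假設:通用模型對 1000 筆資料、50 個類別的預測
cov = np.cov(preds, rowvar=False)             # 類別之間的共變異方陣,shape = (50, 50)

# 共變異方陣是對稱的,對它的每一個 column(= row)做聚類
labels = MiniBatchKMeans(n_clusters=5, random_state=0).fit_predict(cov)

for c in range(5):
    print(c, np.where(labels == c)[0])        # 每個 cluster:經常被一起預測的類別編號
```
:::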
:::info
![image](https://hackmd.io/_uploads/S1rqO4dpa.png)
Table 2: Example classes from clusters computed by our covariance matrix clustering algorithm
:::
### 5.4 Performing inference with ensembles of specialists
:::info
Before investigating what happens when specialist models are distilled, we wanted to see how well ensembles containing specialists performed. In addition to the specialist models, we always have a generalist model so that we can deal with classes for which we have no specialists and so that we can decide which specialists to use. Given an input image $\mathbf{x}$, we do top-one classification in two steps:
Step 1: For each test case, we find the $n$ most probable classes according to the generalist model. Call this set of classes $k$. In our experiments, we used $n=1$.
Step 2: We then take all the specialist models, $m$, whose special subset of confusable classes, $S^m$, has a non-empty intersection with $k$ and call this the active set of specialists $A_k$ (note that this set may be empty). We then find the full probability distribution $q$ over all the classes that minimizes:
$$
KL(\mathbf{p}^g,\mathbf{q}) + \sum_{m \in A_k} KL(\mathbf{p}^m,\mathbf{q}) \tag{5}
$$
where $KL$ denotes the KL divergence, and $\mathbf{p}^m$ and $\mathbf{p}^g$ denote the probability distribution of a specialist model or the generalist full model. The distribution $\mathbf{p}^m$ is a distribution over all the specialist classes of $m$ plus a single dustbin class, so when computing its KL divergence from the full $q$ distribution we sum all of the probabilities that the full $q$ distribution assigns to all the classes in $m$’s dustbin.
:::
:::success
在研究蒸餾專家模型會發生什麼事情之前,我們想先看看包含專家模型的集成表現如何。除了專家模型之外,我們總是還有一個通用模型,這樣我們就能處理沒有對應專家的類別,也能用它來決定要使用哪些專家。給定一個輸入影像$\mathbf{x}$,我們用兩個步驟來做top-one分類:
步驟1:對於每個測試案例,我們根據通用模型找出$n$個最有可能的類別,並將這個類別集合稱為$k$。在我們的實驗中,我們使用$n=1$。
步驟2:接著我們取出所有的專家模型$m$,其容易混淆類別的特殊子集$S^m$與$k$有非空的交集,並將之稱為專家的active set $A_k$(注意,這個集合可能是空的)。然後我們尋找所有類別上的完整機率分佈$q$,使其最小化:
$$
KL(\mathbf{p}^g,\mathbf{q}) + \sum_{m \in A_k} KL(\mathbf{p}^m,\mathbf{q}) \tag{5}
$$
其中$KL$表示KL divergence,而$\mathbf{p}^m$與$\mathbf{p}^g$分別表示一個專家模型或通用完整模型的機率分佈。分佈$\mathbf{p}^m$是一個涵蓋$m$的所有專家類別加上單一個垃圾類別的分佈,所以在計算它與完整的$q$分佈之間的KL divergence時,我們會把完整的$q$分佈分配給$m$的垃圾類別中所有類別的機率加總起來。
:::
:::info
Eq. 5 does not have a general closed form solution, though when all the models produce a single probability for each class the solution is either the arithmetic or geometric mean, depending on whether we use$KL(\mathbf{p},\mathbf{q})$ or $KL(\mathbf{q},\mathbf{p})$. We parameterize $\mathbf{q}=softmax(\mathbf{z})(\text{with }T=1)$ and we use gradient descent to optimize the logits $\mathbf{z}$ w.r.t. eq. 5. Note that this optimization must be carried out for each image.
:::
:::success
方程式5並沒有一般的[閉合形式解](https://terms.naer.edu.tw/detail/ab9add12aeafb5d5b5a299cd2cac7eda/),儘管當所有模型都為每個類別各產生單一機率時,其解就是算術平均或幾何平均,取決於我們使用的是$KL(\mathbf{p},\mathbf{q})$還是$KL(\mathbf{q},\mathbf{p})$。我們將$\mathbf{q}$參數化為$softmax(\mathbf{z})$(其中$T=1$),並使用梯度下降,針對eq. 5來最佳化logits $\mathbf{z}$。注意,這個最佳化必須對每一張影像各別執行。
:::
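:::warning
下面是Eq. 5的一個簡化示意(不是論文的實作):把$\mathbf{q}$參數化為$softmax(\mathbf{z})$,再對$\mathbf{z}$做最佳化;這裡為了簡潔用scipy的L-BFGS取代論文所說的gradient descent,類別數與各個分佈的數值都是假設的。
```python
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL(p, q),加上小常數避免 log(0)"""
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))

def objective(z, p_g, specialists):
    """Eq. 5:KL(p^g, q) + sum_m KL(p^m, q_m),其中 q = softmax(z)(T=1)。
    specialists 是 (該專家涵蓋的類別索引, 該專家的分佈 p^m) 的 list,
    p^m 的最後一個元素是垃圾類別;對應的 q_m 垃圾類別機率 = q 在其餘類別上的總和。"""
    q = softmax(z)
    total = kl(p_g, q)
    for idx, p_m in specialists:
        q_m = np.append(q[idx], 1.0 - q[idx].sum())
        total += kl(p_m, q_m)
    return total

# 假設性的小例子:6 個類別,一個涵蓋類別 {0, 1, 2} 的專家
p_g = softmax(np.array([3.0, 2.5, 0.5, 0.0, -1.0, -1.0]))   # 通用模型的分佈 p^g
p_m = softmax(np.array([1.0, 2.0, 0.0, -0.5]))              # 專家:3 個專家類別 + 垃圾類別
specialists = [(np.array([0, 1, 2]), p_m)]

res = minimize(objective, np.zeros(6), args=(p_g, specialists), method="L-BFGS-B")
print(softmax(res.x))   # 所有類別上的完整分佈 q
```
:::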
### 5.5 Results
:::info
Starting from the trained baseline full network, the specialists train extremely fast (a few days instead of many weeks for JFT). Also, all the specialists are trained completely independently. Table 3 shows the absolute test accuracy for the baseline system and the baseline system combined with the specialist models. With 61 specialist models, there is a 4.4% relative improvement in test accuracy overall. We also report conditional test accuracy, which is the accuracy by only considering examples belonging to the specialist classes, and restricting our predictions to that subset of classes.
:::
:::success
從訓練好的baseline full network開始,專家模型的訓練速度非常快(對JFT來說只需要幾天,而不是好幾個星期)。另外,所有的專家模型都是完全獨立地訓練。Table 3呈現出baseline system,以及baseline system結合專家模型的絕對測試準確度。使用61個專家模型,整體測試準確度有4.4%的相對提升。我們還給出條件測試準確度(conditional test accuracy),也就是只考慮屬於專家類別的樣本、並將預測限制在該類別子集上時的準確度。
:::
:::info
![image](https://hackmd.io/_uploads/Skzh_r3p6.png)
Table 3: Classification accuracy (top 1) on the JFT development set.
:::
:::info
For our JFT specialist experiments, we trained 61 specialist models, each with 300 classes (plus the dustbin class). Because the sets of classes for the specialists are not disjoint, we often had multiple specialists covering a particular image class. Table 4 shows the number of test set examples, the change in the number of examples correct at position 1 when using the specialist(s), and the relative percentage improvement in top1 accuracy for the JFT dataset broken down by the number of specialists covering the class. We are encouraged by the general trend that accuracy improvements are larger when we have more specialists covering a particular class, since training independent specialist models is very easy to parallelize.
:::
:::success
針對我們的JFT專家實驗,我們訓練了61個專家模型,每一個都有300個類別(加上垃圾類別)。因為各專家的類別集合並非互不相交,所以常常會有多個專家涵蓋同一個特定影像類別的情形。Table 4依涵蓋該類別的專家數量劃分,給出測試集樣本的數量、使用專家模型時在位置1(position 1)答對的樣本數量的變化,以及JFT資料集top 1準確度的相對百分比提升。讓我們感到鼓舞的是整體的趨勢:當涵蓋某個特定類別的專家越多,準確度的提升就越大,而訓練獨立的專家模型是很容易平行化的。
:::
:::info
![image](https://hackmd.io/_uploads/rJeyN5Bhpp.png)
Table 4: Top 1 accuracy improvement by # of specialist models covering correct class on the JFT test set.
:::
## 6 Soft Targets as Regularizers
:::info
One of our main claims about using soft targets instead of hard targets is that a lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target. In this section we demonstrate that this is a very large effect by using far less data to fit the 85M parameters of the baseline speech model described earlier. Table 5 shows that with only 3% of the data (about 20M examples), training the baseline model with hard targets leads to severe overfitting (we did early stopping, as the accuracy drops sharply after reaching 44.5%), whereas the same model trained with soft targets is able to recover almost all the information in the full training set (about 2% shy). It is even more remarkable to note that we did not have to do early stopping: the system with soft targets simply “converged” to 57%. This shows that soft targets are a very effective way of communicating the regularities discovered by a model trained on all of the data to another model.
:::
:::success
我們主張使用soft targets而不是hard targets的主要理由之一,就是soft targets可以攜帶許多有用的信息,而這些信息是單一hard target不可能編碼的。這一節我們透過用少得多的資料來擬合前面所述基線語音模型的85M個參數,來證明這個效果非常大。Table 5顯示,只用3%的資料(大約20M筆樣本),用hard targets訓練基線模型會導致嚴重的過擬合(我們有做early stopping,因為準確度在達到44.5%之後就急速下降),而相同的模型用soft targets訓練,則能夠還原完整訓練集中幾乎所有的信息(大約只差2%)。更值得注意的是,我們不需要做early stopping:使用soft targets的系統就這樣「收斂」到57%。這說明著,soft targets是一種非常有效的方式,可以把用全部資料訓練的模型所發現的[規則性](https://terms.naer.edu.tw/detail/b6f379eaf53e0fe2c810e5911067bb6c/)傳達給另一個模型。
:::
:::info
![image](https://hackmd.io/_uploads/Bytlk9T66.png)
Table 5: Soft targets allow a new model to generalize well from only 3% of the training set. The soft targets are obtained by training on the full training set.
:::
### 6.1 Using soft targets to prevent specialists from overfitting
:::info
The specialists that we used in our experiments on the JFT dataset collapsed all of their non-specialist classes into a single dustbin class. If we allow specialists to have a full softmax over all classes, there may be a much better way to prevent them overfitting than using early stopping. A specialist is trained on data that is highly enriched in its special classes. This means that the effective size of its training set is much smaller and it has a strong tendency to overfit on its special classes. This problem cannot be solved by making the specialist a lot smaller because then we lose the very helpful transfer effects we get from modeling all of the non-specialist classes.
:::
:::success
我們在JFT資料集實驗中使用的專家模型,把它們所有的非專家類別合併成單一個垃圾類別。如果我們讓專家模型擁有涵蓋所有類別的完整softmax,那也許有比early stopping更好的方法可以避免它們過擬合。專家模型是在富含其專家類別樣本的資料上訓練的。這意味著它的訓練集的有效大小會小很多,因此它有很強的傾向會在其專家類別上過擬合。這個問題無法靠把專家模型做得小很多來解決,因為那樣我們就會失去從建模所有非專家類別所獲得的、非常有幫助的轉移效果(transfer effects)。
:::
:::info
Our experiment using 3% of the speech data strongly suggests that if a specialist is initialized with the weights of the generalist, we can make it retain nearly all of its knowledge about the non-special classes by training it with soft targets for the non-special classes in addition to training it with hard targets. The soft targets can be provided by the generalist. We are currently exploring this approach.
:::
:::success
我們使用3%語音資料的實驗強烈顯示,如果專家模型是用通用模型的權重來初始化,那麼除了用hard targets訓練之外,再對non-special classes使用soft targets來訓練它,就可以讓它保留幾乎所有關於non-special classes的知識。這些soft targets可以由通用模型提供。我們目前正在探索這個方法。
:::
## 7 Relationship to Mixtures of Experts
:::info
The use of specialists that are trained on subsets of the data has some resemblance to mixtures of experts [6] which use a gating network to compute the probability of assigning each example to each expert. At the same time as the experts are learning to deal with the examples assigned to them, the gating network is learning to choose which experts to assign each example to based on the relative discriminative performance of the experts for that example. Using the discriminative performance of the experts to determine the learned assignments is much better than simply clustering the input vectors and assigning an expert to each cluster, but it makes the training hard to parallelize: First, the weighted training set for each expert keeps changing in a way that depends on all the other experts and second, the gating network needs to compare the performance of different experts on the same example to know how to revise its assignment probabilities. These difficulties have meant that mixtures of experts are rarely used in the regime where they might be most beneficial: tasks with huge datasets that contain distinctly different subsets.
:::
:::success
在資料子集上訓練專家模型的做法,跟混合專家模型(mixtures of experts)有些相似之處,後者使用一個gating network來計算把每個樣本分配給每個專家的機率。在專家們學習處理分配給它們的樣本的同時,gating network也在學習根據各專家對該樣本的相對判別效能,來選擇要把每個樣本分配給哪個專家。使用專家的判別效能來決定學習到的分配,會比單純對輸入向量做聚類再把一個專家分配給每個聚類來得好,不過這讓訓練難以平行化:首先,每個專家的加權訓練集會以取決於所有其它專家的方式不斷變化;其次,gating network需要在相同的樣本上比較不同專家的效能,才能知道如何修正它的分配機率。這些困難意味著,混合專家模型很少被用在它們可能最有利的領域:擁有包含明顯不同子集之巨大資料集的任務。
:::
:::info
It is much easier to parallelize the training of multiple specialists. We first train a generalist model and then use the confusion matrix to define the subsets that the specialists are trained on. Once these subsets have been defined the specialists can be trained entirely independently. At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.
:::
:::success
平行化訓練多個專家模型就簡單多了。我們首先訓練一個通用模型,然後用混淆矩陣來定義各個專家模型所要訓練的子集。一旦這些子集定義好了,專家模型就可以完全獨立地訓練。測試的時候,我們可以使用通用模型的預測來決定哪些專家模型是相關的,然後只需要執行這些專家模型就好了。
:::
## 8 Discussion
:::info
We have shown that distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model. On MNIST distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more of the classes. For a deep acoustic model that is version of the one used by Android voice search, we have shown that nearly all of the improvement that is achieved by training an ensemble of deep neural nets can be distilled into a single neural net of the same size which is far easier to deploy.
:::
:::success
我們已經證明,蒸餾對於將知識從集成模型或是大型高度正規化的模型轉移到較小的蒸餾模型中非常有效。在MNIST上,即使用來訓練蒸餾模型的transfer set缺少一個或多個類別的任何樣本,蒸餾的效果仍然非常好。對於Android voice search所使用之深度聲學模型的一個版本,我們也說明了,透過訓練深度神經網路的集成所達到的提升,幾乎全部都可以蒸餾到一個相同大小、而且更容易佈署的單一神經網路中。
:::
:::info
For really big neural networks, it can be infeasible even to train a full ensemble, but we have shown that the performance of a single really big net that has been trained for a very long time can be significantly improved by learning a large number of specialist nets, each of which learns to discriminate between the classes in a highly confusable cluster. We have not yet shown that we can distill the knowledge in the specialists back into the single large net.
:::
:::success
對於真正大型的神經網路,即使訓練一個完整的集成都可能不可行,不過我們說明了,透過學習大量的專家網路(specialist nets),可以明顯提升一個已經訓練了非常長時間的超大型網路的效能,其中每個專家網路都學習判別某個高度容易混淆的群集中的類別。不過我們還沒有證明可以把專家模型中的知識再蒸餾回單一個大型網路就是。
:::