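# G. Hinton et al. [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531), 2015

### Abstract

A simple way to improve the performance of a machine learning algorithm is to ensemble many different models. Such an ensemble, however, is cumbersome and too computationally expensive, so one goal of this paper is to have a simple model learn the knowledge of the ensemble.

### 1. Introduction

- When training the small model, use the softened labels produced by the ensemble as the targets instead of the true labels. This can reach performance close to that of the ensemble.
- When the soft targets have high entropy (the probability mass is not concentrated), each training case provides more information than a hard target and the variance of the gradient between training cases is smaller. The small model can then often be trained on less data and with a higher learning rate.
- The transfer set used to train the small model can consist entirely of unlabeled data, or it can be the original training set (this works well), as long as the objective function encourages the small model to predict the same output as the cumbersome model.
- The relative probabilities of incorrect answers tell us a lot about how the cumbersome model tends to generalize.
- A small model trained to generalize in the same way will typically do much better on test data than a small model that is trained in the normal way on the same training set as was used to train the ensemble.

### 2. Distillation

Also known as **dark knowledge**.

- At prediction time, the teacher model's output softmax is replaced with a softened softmax: $q_i=\dfrac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$
    - $T$ is set to a value greater than 1 to spread out the output probabilities ($T = 1$ gives the ordinary softmax).
- The student model is trained with the same temperature $T$ as the teacher model.
- After training, the student model sets $T$ back to 1.

When the transfer set has labels, a weighted average of 2 objective functions can be used:

1. cross entropy with the soft targets
2. cross entropy with the correct labels
    - The best results are usually obtained by giving this second objective function a very low weight.
    - The gradients produced by the soft targets are scaled down by a factor of $T^2$, so the soft-target term (that is, the soft loss) has to be multiplied by $T^2$ when both objectives are used together.

#### 2.1 Matching logits is a special case of distillation

Earlier in the paper it is mentioned that other researchers had the student model learn the teacher model's logits (i.e. $z$). The derivation shows that this method is an approximate special case of distillation: in the high-temperature limit, and assuming zero-mean logits, the soft-target gradient with respect to $z_i$ is approximately $(z_i - v_i)/(NT^2)$, where $v_i$ are the teacher's logits and $N$ is the number of classes, which is the gradient of a squared difference between the logits.

### 3. Preliminary experiments on MNIST

Hyperparameter tuning experience on MNIST.

### 4. Experiments on speech recognition

Omitted.

### 5. Training ensembles of specialists on very big datasets

Ensembling many models consumes too much computation at test time, which distillation can solve. Another important problem with ensembles, however, is that when every model in the ensemble is a large NN and the dataset is also very large, even training takes too long (even with parallel computation). This section shows that on such a dataset, training specialist models that each focus on a different confusable subset of the classes can reduce the total computation needed to train the ensemble.

- The main problem with specialists that make fine-grained distinctions is that they overfit very easily. The paper describes how soft targets can be used to avoid this overfitting.

#### 5.2 Specialist Models

When the number of classes is very large, train one generalist model on all of the data, and train specialist models on data from a confusable subset of the classes (for example, different kinds of mushrooms). To reduce overfitting and to share lower-level feature detectors, each specialist model is initialized with the weights of the generalist model.

- A specialist model lumps all of the classes it does not care about into a single dustbin class.

#### Section 5 also contains some training details for the specialists

### 6. Soft Targets as Regularizers

![](https://i.imgur.com/gdnkUro.png)

When training on only 3% of the data, hard labels lead to severe overfitting, whereas soft targets recover nearly all of the information and generalization ability obtained from 100% of the training data, as shown in the figure above.

#### 6.1 Using soft targets to prevent specialists from overfitting

Specialist models are trained on data that is heavily enriched in their special classes, so their effective training set size is very small and they overfit to those special classes very easily. This problem cannot be solved by using a smaller model, because that would lose the very helpful transfer effects that come from modeling the non-specialist classes.

- My guess is that this means: if a small network is used as the specialist, it cannot be transferred from the generalist model?

### 7. Relationship to Mixtures of Experts

###### tags: `model compression` `knowledge distillation`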
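
As a rough illustration of the objective described in Section 2, the following NumPy sketch computes the softened softmax and the weighted average of the soft-target and hard-label cross entropies, with the soft term multiplied by $T^2$. The function names `softmax_T`, `log_softmax_T`, and `distillation_loss`, the weight `alpha`, and the default values of `T` and `alpha` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def log_softmax_T(logits, T=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max to avoid overflow
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def softmax_T(logits, T=1.0):
    """Softened softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    return np.exp(log_softmax_T(logits, T))


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Weighted average of the two objectives for a labeled transfer set.

    alpha weights the soft-target term and (1 - alpha) the hard-label term.
    Both alpha and the default T are illustrative values (assumptions).
    """
    labels = np.asarray(labels)

    # 1. Cross entropy with the teacher's soft targets; both models use T.
    soft_targets = softmax_T(teacher_logits, T)
    soft_ce = -(soft_targets * log_softmax_T(student_logits, T)).sum(axis=-1).mean()

    # 2. Cross entropy with the correct labels, using the ordinary softmax (T = 1).
    log_q = log_softmax_T(student_logits, 1.0)
    hard_ce = -log_q[np.arange(len(labels)), labels].mean()

    # Soft-target gradients shrink by 1 / T^2, so rescale that term by T^2.
    return alpha * (T ** 2) * soft_ce + (1.0 - alpha) * hard_ce


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 10)) * 5.0  # made-up teacher logits
    student = rng.normal(size=(8, 10))        # made-up student logits
    y = rng.integers(0, 10, size=8)           # made-up labels
    print(distillation_loss(student, teacher, y))
```

At inference time the student would simply call `softmax_T(student_logits, T=1.0)`, matching the note above that $T$ is set back to 1 after training.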