---
tags: Model Compression
---

# Knowledge Distillation Notes

References

* https://towardsdatascience.com/distilling-knowledge-in-neural-network-d8991faa2cdc
* https://zhuanlan.zhihu.com/p/292797265

> Using the class probabilities as a target class provides much more information than simply using just the raw target.

- The student net can even learn from the teacher net's ensemble output.
- temperature
  - A temperature is added so that the softmax output does not get too close to one-hot, which would be about the same as giving the label directly. It pulls the scores of the different classes closer together, in the hope that the model learns the correlations between classes (e.g. a handwritten 1 and 7 actually look quite similar).
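To make the temperature effect concrete, here is a minimal sketch in plain Python. It assumes the usual formulation in which the logits are divided by a temperature `T` before the softmax. The function name `softmax_with_temperature` and the logit values are made up for illustration and do not come from the referenced articles.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for three digit classes: "1", "7", "3".
teacher_logits = [8.0, 5.0, 1.0]

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
```

With these made-up logits, the top class gets roughly 0.95 of the probability mass at `T=1` but only about 0.61 at `T=4`, while the second class ("7") rises from about 0.05 to about 0.29. The ranking of the classes is unchanged. Only the gaps shrink, which is what lets the student see that "1" and "7" are related.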