# Net2Net: Accelerating Learning via Knowledge Transfer, ICLR 2016

## [GitXiv](http://www.gitxiv.com/posts/GgBNprwm8xZ3Tzfwh/net2net-accelerating-learning-via-knowledge-transfer)
## [GitHub](https://github.com/keras-team/keras/blob/master/examples/mnist_net2net.py) (Keras implementation)

## 1. Introduction
ML practitioners typically train many models on the same dataset, and training every one of them from scratch is inefficient. Net2Net therefore initializes a wider or deeper net so that it computes exactly the same function as the previously trained net:
- Net2WiderNet
- Net2DeeperNet

## 2. Methodology
### 2.1. Feature Prediction
- Imitates FitNet, but the results are unremarkable, presumably because the batch-normalized baseline is already very strong

### 2.2. Function-Preserving Initializations
Two variants (wider and deeper), both built on the same idea:
- teacher network: $y = f(x;\theta)$
- student network: $g(x;\theta')$ such that $\forall x,\ f(x;\theta) = g(x;\theta')$
- ***(???)*** Any change made to the network after initialization is guaranteed to be an improvement, so long as each local step is an improvement. Previous methods could fail to improve over the baseline even if each local step was guaranteed to be an improvement, because the initial change to the larger model size could worsen performance.
- It is always "safe" to optimize all parameters in the network

### 2.3. Net2WiderNet
![](https://i.imgur.com/rIlpcu7.png)

Taking a fully connected layer as an example: to widen layer $i$, we replace both $W^{(i)}$ and $W^{(i+1)}$
- the incoming weights $W^{(i)}$ are copied
- the outgoing weights $W^{(i+1)}$ are copied and divided by the number of copies (averaged over the replicas), so the next layer receives the same total input

(A NumPy sketch of this widening rule appears at the end of this note.)

#### Algorithm
![](https://i.imgur.com/UQelqx1.png)

_**???**_ $g(j)$: a random mapping from $\{1, \dots, q\}$ to $\{1, \dots, n\}$ (identity for $j \le n$, a random sample from $\{1, \dots, n\}$ for $j > n$)

![](https://i.imgur.com/RyP9QsO.png)

The new weights $U$ are then given by

![](https://i.imgur.com/B0AvAcF.png)

- **_(???) To make Net2WiderNet a fully general algorithm, we would need a remapping inference algorithm_**

### 2.4. Net2DeeperNet
Replace an original layer $h^{(i)} = \phi\big(h^{(i-1)T} W^{(i)}\big)$ (**_is this notation really right? @@_**) with two layers $h^{(i)} = \phi\big(U^{(i)T} \phi\big(W^{(i)T} h^{(i-1)}\big)\big)$
- $U$ is initialized to the identity matrix (**_Q: doesn't the identity only exist for square matrices? Here $U$ is square, because the inserted layer has the same width as the layer below it_**)
- Conv layers can use identity filters (1 at the spatial center, 0 elsewhere)
- This method only applies when the activation function $\phi$ satisfies $\phi(I\phi(v)) = \phi(v)$
  - e.g. ReLU, Maxout

(A sketch of the identity initialization also appears at the end of this note.)

###### tags: `model expansion`
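
Below is a minimal NumPy sketch of the Net2WiderNet rule from Section 2.3 for a fully connected layer. The function and variable names (`net2wider_fc`, `W1`, `b1`, `W2`) are my own, not the paper's or the Keras example's; the random mapping `g` and the division of outgoing weights by the replication count follow the description above.

```python
import numpy as np

def net2wider_fc(W1, b1, W2, new_width, noise_std=0.0, rng=None):
    """Widen a fully connected layer from n to new_width (> n) units.

    W1: (d, n) incoming weights, b1: (n,) biases, W2: (n, m) outgoing weights.
    Returns (U1, c1, U2) so that the widened block computes the same function
    as the original one (exactly, when noise_std == 0).
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = W1.shape
    assert new_width > n, "the student layer must be wider than the teacher layer"

    # Random mapping g: identity on the first n units, random copies after that.
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=new_width - n)])
    counts = np.bincount(g, minlength=n)        # how often each original unit is copied

    U1 = W1[:, g]                               # incoming weights: copied
    c1 = b1[g]                                  # biases: copied
    U2 = W2[g, :] / counts[g][:, None]          # outgoing weights: copied / #copies
    if noise_std > 0:                           # optional symmetry-breaking noise
        U2 = U2 + rng.normal(0.0, noise_std, size=U2.shape)
    return U1, c1, U2

# Quick function-preservation check on a toy teacher network.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
x = rng.normal(size=(5, 8))
W1, b1, W2 = rng.normal(size=(8, 4)), rng.normal(size=4), rng.normal(size=(4, 3))
U1, c1, U2 = net2wider_fc(W1, b1, W2, new_width=7, rng=rng)
assert np.allclose(relu(x @ W1 + b1) @ W2, relu(x @ U1 + c1) @ U2)
```

Dividing the outgoing weights by the replication count is what keeps the next layer's pre-activations identical; the optional noise only breaks the symmetry between replicated units so they can learn different features during fine-tuning.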
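
Likewise, a minimal sketch of the Net2DeeperNet identity initialization from Section 2.4, again with hypothetical names (`net2deeper_fc`, `identity_conv_kernel`). It only preserves the function when the activation satisfies $\phi(I\phi(v)) = \phi(v)$, e.g. ReLU.

```python
import numpy as np

def net2deeper_fc(width, noise_std=0.0, rng=None):
    """Identity-initialized fully connected layer, inserted after an existing
    layer of the same width.  With ReLU the network's function is unchanged,
    since relu(relu(v) @ I) == relu(v)."""
    rng = np.random.default_rng() if rng is None else rng
    U = np.eye(width)
    b = np.zeros(width)
    if noise_std > 0:                  # optional noise; breaks exact preservation
        U = U + rng.normal(0.0, noise_std, size=U.shape)
    return U, b

def identity_conv_kernel(channels, kernel_size=3):
    """Identity convolution kernel: 1 at the spatial center connecting each
    channel to itself, 0 elsewhere.  Layout assumed: (k, k, in_ch, out_ch)."""
    k = kernel_size
    w = np.zeros((k, k, channels, channels))
    for c in range(channels):
        w[k // 2, k // 2, c, c] = 1.0
    return w

# Function-preservation check for the fully connected case.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
h = relu(rng.normal(size=(5, 6)))       # output of some existing ReLU layer
U, b = net2deeper_fc(width=6)
assert np.allclose(relu(h @ U + b), h)  # the inserted layer is a no-op at init
```

With "same" padding, convolving a feature map with this identity kernel returns it unchanged, which mirrors the identity-matrix trick for fully connected layers.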