# Notes on "[Distilling the Knowledge in a Neural Network](https://arxiv.org/pdf/1503.02531.pdf)" ###### tags: `knowledge-distillation` `research paper notes` `distillation` #### Author [Raj Ghugare](https://github.com/RajGhugare19) ## A brief introduction: 1) Often predictions problems are tackled using the average outputs of an ensemble of large neural networks. This provides high accuracy but also has high computational requirements for being deployment friendly. 2) This paper further develops an approach to store the "knowledge" of such larger models in a small model which can be deployed for a large number of users. ## About the simplest approach: The first obvious thing that comes up when thinking about the knowledge of a Neural network are its parameters, which makes it difficult to think of ways to transfer knowledge to a smaller model because the smaller model can have only so many parameters.But the real objective of a neural network is to generalize well on previously unseen data. If a smaller neural network is taught how to generalize in the same way as these large learned ensembles of models, it would perform better than being directly trained on the same data set as the larger model. The probabilities assigned by the larger trained network to all classes for a data set can be used as soft-targets for the smaller model. At this point they talk about high entropy soft-targets being better as they carry more information. And have much less variance on the gradients between training examples. They provide a technique in which the temperature in the final layer softmax of the cumbersome model is increased so that it gives a soft-target. These Soft-targets are used to train the smaller network using the same temperature. ## Distillation: Temperature is a parameter of the softmax function used to convert the output logits into a probability distribution. $$q_{i} = \frac{exp(z_{i}/T)}{\Sigma_{i}exp(z_{j}/T)}$$ $T \rightarrow$ Temperature factor If T would be relatively higher, it would be as if all the logits are same... i.e. their probabilities would come closer.Using higher values of temperature provides the smaller network with the knowledge about which classes the larger network finds similar. ### Simplest form of distillation: 1) Form a transfer set. 2) Find a high value of temperature that produces suitable soft targets 3) Train the smaller network on these soft-targets using the same temperature in its output softmax layer. 4) After training use temperature = 1, prediction. ### Improvements: This method as proposed can be significantly improved if the labels for transfer data set are known.Then the small network can also be simultaneously trained to predict the correct label.Empirically it is found out that it is better to use a weighted average of the two objective functions. The first objective is the cross entropy with the soft-targets (using high value of temperature). The second objective function is the cross entropy with the correct labels(with temperature = 1). they found that the best results were generally obtained by using a considerably lower weight on the second objective function I will denote the smaller network as the student network and the larger one as the teacher network from now on. Let $z_{i}$ be the logits of the student network and $q_{i}$ be the respective probabilities generated from softmax. Similarly for the teacher network let $v_{i}$ be the logits and $p_{i}$ be the soft-targets generated after softmax. 
$C$ is the cross-entropy loss between $q$ and $p$.

#### Derivation of gradient:

$$\frac{\partial C}{\partial z_{i}} = \frac{q_{i}-p_{i}}{T}$$

We know that

$$C = -\sum_{j} p_{j} \log(q_{j})$$

and that

$$q_{j} = \frac{\exp(z_{j}/T)}{\sum_{k}\exp(z_{k}/T)}$$

Differentiating the softmax function gives, if $i = j$,

$$\frac{\partial q_{j}}{\partial z_{i}} = \frac{q_{j}(1-q_{j})}{T}$$

and if $i \neq j$,

$$\frac{\partial q_{j}}{\partial z_{i}} = -\frac{q_{i}q_{j}}{T}$$

Therefore

$$\frac{\partial C}{\partial z_{i}} = -\sum_{j} p_{j} \frac{\partial \log(q_{j})}{\partial z_{i}} = -\sum_{j} \frac{p_{j}}{q_{j}}\frac{\partial q_{j}}{\partial z_{i}}$$

$$= -\frac{p_{i}(1-q_{i})}{T} + \sum_{j \neq i}\frac{p_{j} q_{i}}{T}$$

$$= -\frac{1}{T}\left(p_{i} - p_{i} q_{i} - \sum_{j \neq i} p_{j}q_{i}\right)$$

$$= -\frac{1}{T}\left(p_{i} - q_{i}\sum_{j} p_{j}\right) = -\frac{1}{T}(p_{i} - q_{i})$$

since $\sum_{j} p_{j} = 1$. Hence

$$\frac{\partial C}{\partial z_{i}} = \frac{q_{i}-p_{i}}{T}$$

The next step in the paper (showing that distillation in the high-temperature limit reduces to matching logits) relies on the approximation $\exp(x) \approx 1 + x$ when $x$ is very small.

### Key implementation details:

#### On MNIST:

* A temperature of 20 was used to produce the soft targets when the student network had 300 or more nodes in each of its two hidden layers.
* When the number of nodes was drastically reduced to 30, temperatures between 2 and 4.5 gave the best results.
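As a quick illustration of why the choice of temperature matters, the sketch below shows how raising $T$ softens a teacher's output distribution. The logits are made up for the example, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for a single example (illustrative values only).
v = torch.tensor([6.0, 2.0, -1.0, -3.0])

for T in (1.0, 4.0, 20.0):
    p = F.softmax(v / T, dim=0)
    print(f"T = {T:>4}: {[round(x, 3) for x in p.tolist()]}")

# At T = 1 nearly all the mass sits on the top class; at T = 20 the distribution
# is much softer, exposing which wrong classes the teacher finds relatively likely.
```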