# Notes on "[FSP Knowledge Distillation](https://ieeexplore.ieee.org/document/8100237)"
###### tags: `notes` `knowledge-distillation` `attention`
Author: [Akshay Kulkarni](https://akshayk07.weebly.com/)
Note: Actual title of paper is "**A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning**" and it was published as a CVPR '17 paper ([pdf link](http://openaccess.thecvf.com/content_cvpr_2017/papers/Yim_A_Gift_From_CVPR_2017_paper.pdf)).
## Brief Outline
The distilled knowledge to be transferred is defined in terms of the flow between layers, computed as the inner product between feature maps from 2 layers. The student is first trained to match this flow, and then trained on the original task.
## Introduction
* This is effectively the third paper on Knowledge Distillation, after [Hinton et al. 2015](https://arxiv.org/abs/1503.02531) and [Romero et al. 2015](https://arxiv.org/abs/1412.6550).
* Knowledge transfer performance is very sensitive to how the distilled knowledge is defined, and it can be defined by extracting various kinds of features from the pretrained network.
* Considering that a real teacher teaches the flow of how to solve a problem, they define the distilled knowledge as this flow of the solution procedure.
* Since a DNN has sequential layers mapping from input to output, the flow of solving a problem can be defined as the relationship between feature maps from 2 layers.
* [Gatys et al. 2015](https://arxiv.org/abs/1508.06576) used the Gramian matrix (inner product of feature vectors from the same layer) to represent texture information of the input image.
* Similarly, this paper also uses Gramian matrix (but between feature maps of different layers) to represent the flow of solving a problem.
* The extracted feature maps are used to generate the flow of solution procedure (FSP) matrix. The student DNN is trained to make its FSP matrix similar to that of the teacher DNN.
* They show the usefulness of the technique on 3 tasks:
* Fast Optimization - A student DNN trained using the proposed technique learns the main task faster than an untrained DNN, so the flow of the solution procedure serves as a good initialization for the weights of a DNN.
* Knowledge Distillation - A smaller network trained with a teacher performs better than the same smaller network trained alone.
* Transfer Learning - Because the proposed method can transfer the distilled knowledge to a small DNN, the small DNN can perform similarly to a large DNN trained with a normal transfer learning method.
![FSP Training Method](https://i.imgur.com/iXduALy.png)
## Methodology
* Other techniques mostly mimic the teacher's intermediate feature maps for KD. However, the student does not necessarily have to reproduce the intermediate outputs; it can instead learn the solution method to apply when it encounters a specific type of problem.
* Thus, they believe that demonstrating the flow of solution procedure (FSP) is better for generalization than teaching the intermediate results.
### Mathematical Expression for FSP
* For a DNN, the FSP can be thought of as the direction from the features of one layer to those of another, i.e. the relationship between features of two layers. Let $G \in \mathbb{R}^{m \times n}$ be the FSP matrix, and let one of the feature maps be $F^1 \in \mathbb{R}^{h \times w \times m}$, with $h, w, m$ representing the height, width and number of channels respectively.
* Let the other feature map be $F^2 \in \mathbb{R}^{h \times w \times n}$. Then, the FSP matrix $G$ is calculated as
$$
G_{i, j}(x; W) = \sum_{s=1}^h \sum_{t=1}^w \frac{F^1_{s, t, i}(x; W) \times F^2_{s, t, j}(x; W)}{h \times w}
\tag{1}
$$
* Here, $x$ and $W$ represent the input image and weights of the DNN respectively.
* In practice, this is a matrix multiplication: reshape the 2 feature maps into $hw \times m$ and $hw \times n$ matrices, transpose the first, and multiply them so that the $hw$ dimension is contracted away, leaving an $m \times n$ matrix. This is not stated explicitly in the paper (see the sketch below).
* Further, this requires both feature maps to have the same spatial size $(h, w)$. If they do not, the larger feature map is downsampled using max pooling (some open-source implementations use adaptive average pooling instead; there is no official implementation available).
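Below is a minimal sketch of this computation for a batch of feature maps, written in PyTorch (the paper does not provide code); the helper name `fsp_matrix` and the use of adaptive max pooling to equalize spatial sizes are assumptions for illustration, not the authors' implementation.
```python
import torch
import torch.nn.functional as F

def fsp_matrix(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """Compute FSP matrices per Eq. (1).

    f1: (B, m, h1, w1), f2: (B, n, h2, w2) -> (B, m, n)
    """
    # If the spatial sizes differ, pool both maps down to the smaller size.
    # The paper mentions max pooling; some re-implementations use average pooling.
    if f1.shape[2:] != f2.shape[2:]:
        target = (min(f1.shape[2], f2.shape[2]), min(f1.shape[3], f2.shape[3]))
        f1 = F.adaptive_max_pool2d(f1, target)
        f2 = F.adaptive_max_pool2d(f2, target)

    b, m, h, w = f1.shape
    n = f2.shape[1]
    f1 = f1.reshape(b, m, h * w)                           # (B, m, hw)
    f2 = f2.reshape(b, n, h * w)                           # (B, n, hw)
    # Contract the spatial dimension and normalize by h*w, as in Eq. (1).
    return torch.bmm(f1, f2.transpose(1, 2)) / (h * w)     # (B, m, n)
```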
### Loss for FSP Matrix
* It can be assumed that there are $k$ FSP matrices $G_i^T$ and $G_i^S$ for $i=1, 2, \dots, k$ generated from the teacher and student respectively.
* They only consider pairs of FSP matrices generated at points where the teacher and student feature maps have the same spatial size, and take the squared L2 norm (MSE) as the cost function for each pair. The cost of the distilled-knowledge transfer task is defined as
$$
L_{FSP}(W_t, W_s)=\frac{1}{N}\sum_x \sum_{i=1}^k \lambda_i || G_i^T(x; W_t) - G_i^S(x; W_s) ||_2^2
\tag{2}
$$
* Here, $\lambda_i$ and $N$ represent the weight for each loss term and the number of samples respectively. In the experiments, they used the same $\lambda_i$ (effectively all $\lambda_i = 1$).
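A minimal sketch of Eq. (2), reusing the `fsp_matrix` helper above; the teacher matrices are detached because only the student is updated, and all $\lambda_i$ default to 1 as in the paper's experiments. The helper name `fsp_loss` is an assumption for illustration.
```python
import torch

def fsp_loss(teacher_fsp, student_fsp, lambdas=None):
    """Eq. (2): weighted squared L2 distance between paired FSP matrices,
    averaged over the batch. Each list element is a (B, m, n) tensor."""
    if lambdas is None:
        lambdas = [1.0] * len(teacher_fsp)      # same lambda_i for every pair
    loss = 0.0
    for g_t, g_s, lam in zip(teacher_fsp, student_fsp, lambdas):
        # Per-sample squared Frobenius norm, then mean over the batch (the 1/N term).
        loss = loss + lam * (g_t.detach() - g_s).pow(2).sum(dim=(1, 2)).mean()
    return loss
```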
### Training Procedure
![FSP Training Procedure](https://i.imgur.com/6Tuh7ru.png)
* The learning procedure has two stages (also shown in Algorithm 1 above):
* First, $L_{FSP}$ is minimized to make the FSP matrices of the student similar to those of the teacher.
* In the second stage, the student network is trained with the main task loss. For the classification task, they use the softmax cross-entropy loss $L_{ori}$ as the main task loss.
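A minimal sketch of this two-stage loop, reusing `fsp_matrix` and `fsp_loss` from above. Here `teacher`, `student`, `loader`, and `extract_feats` (which would return the feature maps at the chosen section boundaries) are hypothetical names standing in for a concrete setup, and pairing consecutive extracted features is a simplification of the paper's choice of layer pairs.
```python
import torch
import torch.nn.functional as F

# Stage 1: minimize L_FSP so the student's FSP matrices match the teacher's.
opt1 = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
for x, _ in loader:                              # labels are not used in stage 1
    with torch.no_grad():
        t_feats = extract_feats(teacher, x)      # hypothetical feature extractor
    s_feats = extract_feats(student, x)
    t_fsp = [fsp_matrix(a, b) for a, b in zip(t_feats[:-1], t_feats[1:])]
    s_fsp = [fsp_matrix(a, b) for a, b in zip(s_feats[:-1], s_feats[1:])]
    loss = fsp_loss(t_fsp, s_fsp)
    opt1.zero_grad()
    loss.backward()
    opt1.step()

# Stage 2: train the (now FSP-initialized) student on the main task alone.
opt2 = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
for x, y in loader:
    loss = F.cross_entropy(student(x), y)        # L_ori for classification
    opt2.zero_grad()
    loss.backward()
    opt2.step()
```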
## Conclusion
* They present a novel method for distilling knowledge using the flow of the solution procedure, captured by the proposed FSP matrix.
* They perform experiments to show the effectiveness in the 3 aspects mentioned in the Introduction; see the paper for the results of those experiments.