# ICIP rebuttal
## Review summary
| | <span style="color:blue">050E</span> | <span style="color:red">0C97</span> | <span style="color:green">02A3</span> | <span style="color:yellow">1358</span> |
| ------------------------ | ------------------------------------ | ----------------------------------- | ------------------------------------- | -------------------------------------- |
| Importance/Relevance | sufficient interest | sufficient interest | sufficient interest | sufficient interest |
| Novelty | Moderately original | **Very original** | Moderately original | Moderately original |
| Technical Correctness | Probably correct | Probably correct | Probably correct | Probably correct |
| Experimental Validation | Limited but convincing | **Lacking in some respect** | Limited but convincing | Limited but convincing |
| Clarity of Presentation | Clear enough | Clear enough | Clear enough | Clear enough |
| References to Prior Work | References adequate | References adequate | References adequate | References adequate |
## Per-review details
### Review <span style="color:blue">050E</span>
_Justification of Novelty_: The reviewer correctly understood the goal of the contribution.
_Justification of Experimental Validation Score_: The reviewer refers to **VGG16** whereas we tested on **VGG19**.
_Justification of Clarity of Presentation Score_: The reviewer criticizes a lack of clarity about how the MP results are computed.
_Justification of Reference to Prior Work_: The reviewer notes the absence of comparisons with other state-of-the-art methods and suggests a comparison with the paper "A Deep Neural Network Pruning Method Based on Gradient L1-norm", IEEE ICCC 2020.
### Review <span style="color:red">0C97</span>
_Justification of Novelty/Originality/Contribution Score_: The reviewer understood the objective of the contribution and finds that the reparametrization formulation is "really worthwhile".
_Justification of Experimental Validation Score_: The reviewer would like results on a **larger dataset** with **larger images**.
_Additional comments to author(s)_: The reviewer suggests providing **use cases** where it is interesting not to fine-tune the network.
### Review <span style="color:green">02A3</span>
_Justification of Novelty/Originality/Contribution Score_: The reviewer understood the goal of the contribution.
### Review <span style="color:yellow">1358</span>
_Justification of Novelty/Originality/Contribution Score_: The reviewer understood the goal of the contribution, but points out that **the performance of our method lags behind mag+FT**. In their view, this weakens the justification of our method since there is a **tradeoff**.
## Points to develop in the rebuttal
1. <span style="color:blue">@050E</span> Clarify how the scores for magnitude pruning are computed
2. <span style="color:blue">@050E</span> Compare with, or explain the methodological differences from, the paper they mention
3. <span style="color:red">@0C97</span> Provide some numbers on larger datasets
4. <span style="color:red">@0C97</span> Provide use cases where it is interesting to prune but not to fine-tune.
5. <span style="color:yellow">@1358</span> Discuss the performance vs. mag+FT, which is lower?
6. <span style="color:yellow">@1358</span> Answer the tradeoff point?
## Elements of response
1. The fine-tuned magnitude pruning scores are obtained by evaluating, on the CIFAR-10 test set, a network that went through 3 steps: initial training, pruning at the target rate, and fine-tuning. As stated in the paper, the fine-tuning step is performed in the same way as the training step, except that the initial network is the trained and pruned network. The performance reported in the table is the best test accuracy over 5 runs (see the pipeline sketch after this list).
2. **We need to get hold of the paper**
3. CONV4 results on Tiny ImageNet: Tiny ImageNet addresses the criticism: larger dataset and larger images.
| method \ pruning rate | 90% | 93% | 95% | 97% | 99% |
| --------------------- | ------- | ------- | ------- | ------- | ------- |
| MP | 26% | 18% | 10% | 3% | 0.5% |
| MP + FT | **45%** | **45%** | **45%** | **43%** | 21% |
| Ours | 39% | 39% | 39% | 39% | **35%** |
For a Conv4-type network trained on the Tiny ImageNet dataset under the same conditions as those described in the experiments section of the paper, we obtain, at a 99% pruning rate, a 14-point gain in test accuracy compared to magnitude pruning **with** fine-tuning.
4. Our method has two advantages over fine-tuning-based approaches. The first is that magnitude pruning is performed a posteriori and does not take the network topology into account. Weight distributions differ from one layer to another, some layers having distributions far more concentrated around zero than others. A pruning rate that is too high can therefore remove all the weights of a single layer (this usually happens in the last layers of the network, those close to the logits, especially when these layers are fully connected). The network then becomes disconnected: during training, the weights located before the disconnection are no longer updated by backpropagation, and the disconnection even prevents information from propagating from the first layers to the logits, so the network always produces the same output. In contrast, since our method is end-to-end, it jointly and progressively optimizes the weights and the topology, which avoids the disconnections potentially created by a posteriori pruning (a toy illustration is sketched after this list). The second advantage of our method, as detailed in the paper, is that the pruned network does not need to be fine-tuned. This is particularly useful in applied settings, for instance when deploying face recognition models on connected smart cameras. Training is an extremely long process that takes several months, and because the model is embedded, it is relevant to obtain the lightest possible model, notably via pruning. The time savings can therefore be substantial.
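For item 1, here is a minimal PyTorch-style sketch of the MP + FT evaluation pipeline. `build_model`, `train` and `evaluate` are hypothetical placeholders standing in for the setup of section 3.1 of the paper; this is an illustration of the protocol, not the actual code.

```python
import torch

def magnitude_prune(model: torch.nn.Module, rate: float) -> list[torch.Tensor]:
    """Zero out the `rate` fraction of weights with smallest magnitude (global, unstructured).
    Returns the binary masks so the sparsity pattern can be kept fixed during fine-tuning."""
    weights = [p for p in model.parameters() if p.dim() > 1]        # skip biases / BN params
    all_w, _ = torch.cat([p.detach().abs().flatten() for p in weights]).sort()
    threshold = all_w[int(rate * all_w.numel())]                    # global magnitude threshold
    masks = [(p.detach().abs() >= threshold).float() for p in weights]
    with torch.no_grad():
        for p, m in zip(weights, masks):
            p.mul_(m)                                               # apply the mask in place
    return masks

def mp_ft_score(rate: float, runs: int = 5) -> float:
    """Best CIFAR-10 test accuracy over `runs` of: train -> prune -> fine-tune -> evaluate."""
    scores = []
    for _ in range(runs):
        model = build_model()             # placeholder: e.g. VGG19 / Conv4 as in the paper
        train(model)                      # initial training
        masks = magnitude_prune(model, rate)
        train(model, masks=masks)         # fine-tuning: same setup as training; the masks
                                          # must keep the pruned weights frozen at zero
        scores.append(evaluate(model))    # accuracy on the CIFAR-10 test set
    return max(scores)                    # the tables report the best of the 5 runs
```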
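For item 4, a self-contained toy example (synthetic Gaussian weights, not taken from the paper's networks) illustrating how a single global magnitude threshold can disconnect a layer whose weight distribution is tightly centered around zero:

```python
import torch

torch.manual_seed(0)

# Toy "network": three weight tensors with very different spreads,
# mimicking the per-layer distributions discussed in item 4.
layers = {
    "conv1": torch.randn(64, 27) * 0.30,    # wide distribution
    "conv2": torch.randn(128, 576) * 0.10,
    "fc":    torch.randn(200, 512) * 0.01,  # last layer, tightly centered around zero
}

rate = 0.95                                 # global pruning rate
all_w, _ = torch.cat([w.abs().flatten() for w in layers.values()]).sort()
threshold = all_w[int(rate * all_w.numel())]

for name, w in layers.items():
    kept = (w.abs() >= threshold).float().mean().item()
    print(f"{name}: {kept:.1%} of weights kept")
    # With these spreads, "fc" keeps 0.0% of its weights: the layer is fully
    # disconnected, gradients no longer reach the layers before it, and the
    # output no longer depends on the input.
```

With the spreads chosen here, the fully connected layer loses all of its weights at a 95% global rate, which is exactly the disconnection scenario described above and what joint weight/topology optimization is meant to avoid.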
## Rebuttal
After carefully reading the reviewers' comments, our rebuttal provides additional information on 4 points: the methodology used to obtain the fine-tuned magnitude pruning results, a comparison with [1], insights into the behaviour of our method on larger datasets and, finally, further details on the main advantages of our method compared to magnitude pruning or to methods that require fine-tuning.
The fine-tuned magnitude pruning scores are obtained by evaluating, on the CIFAR-10 test set, the accuracy of a network that underwent the following steps: initial training, pruning at the targeted pruning rate (the percentages indicated in the column headers of Tables 1, 2 and 3 are the percentages of removed weights) and then fine-tuning. As stated in section 3.1 of the paper, the fine-tuning procedure uses the same setup as the training procedure (same hyperparameters and learning rate schedule), but the initial network is the trained and pruned network. The indicated performance is the best of 5 runs.
Reviewer 050E asks what our contribution is compared to [1]. First, we kindly remind 050E that we did not test our method on VGG16 but on VGG19. Second, [1] is a structured pruning method, which prunes filters based on the L1-norm of their associated gradients. As mentioned in section 1 of our paper, structured pruning introduces a strong prior on the network topology to avoid sparse networks. In contrast, we chose unstructured pruning to ensure our method is not constrained when searching for a new topology. Moreover, we target higher pruning rates (90% and more) than Liu et al., who report pruning rates up to 71%. Very high pruning rates (90%+) ensure that we take full advantage of sparse computation libraries.
Reviewer 0C97 pointed out that tests on larger datasets with larger images would be more informative. We therefore chose the Tiny ImageNet dataset [2], which consists of 100k 64x64 color images split into 200 classes. At pruning rates of 90%, 95% and 97%, our method performs much better than magnitude pruning, with gains in test accuracy ranging from 13 to 36 points, while staying only 6 points below fine-tuned magnitude pruning. At the very high pruning rate of 99%, our method outperforms both magnitude pruning and fine-tuned magnitude pruning, with gains of 34.5 and 14 points respectively, at a test accuracy of 35%.
Our method exhibits two main advantages. The first is related to disconnections in the network. Magnitude pruning prunes the weights a posteriori, after training, while only considering the absolute values of the weights as a saliency criterion. However, the weight distributions differ from one layer to another: some layers have weight distributions that are far more centered around zero than others. Hence, too high a pruning rate might prune all the weights of a layer (which generally happens in the last layers of a network, particularly if these layers are fully connected). As a consequence, we observe a disconnection in the network, leading to 1) the impossibility of fine-tuning the preceding layers (the backpropagation signal is lost at the disconnection) and 2) outputs that no longer depend on the input. In contrast, our method being end-to-end, it jointly optimizes both the weights and the topology in a progressive manner, which prevents the potential disconnections introduced by a posteriori pruning. The second advantage of our method is that it does not need fine-tuning. This is particularly useful in applied cases such as embedded neural networks. On the one hand, pruning yields a lightweight model that can be shipped on low-power devices; on the other hand, it avoids a fine-tuning step that can last for months in some cases, allowing for faster development cycles and faster model shipping to the final customer.
[1] Liu et al., "A Deep Neural Network Pruning Method Based on Gradient L1-norm", IEEE ICCC 2020
[2] https://tiny-imagenet.herokuapp.com/
-> 653 words
## Shortened version
050E: The fine-tuned magnitude pruning (FMP) scores are obtained by evaluating the accuracy, on the CIFAR-10 test set, of a network that underwent full training, pruning to the targeted rate, and then fine-tuning (with the same hyperparameters as training). The indicated performance is the best of 5 runs.
050E asks what our contribution is compared to [1]. First, we kindly remind 050E that we did not test our method on VGG16 but on VGG19. Second, [1] is a structured pruning method, which prunes filters. As mentioned in section 1 of our paper, structured pruning introduces a strong prior on the network topology to avoid sparse networks. In contrast, we chose unstructured pruning to ensure our method is not constrained in the topology search. Moreover, we target higher pruning rates (90%+) than Liu et al., who report pruning rates up to 71%. Very high pruning rates (90%+) ensure that we take full advantage of sparse computation libraries on embedded devices.
0C97 asked for tests on larger datasets with larger images. We chose the Tiny ImageNet dataset [2], which consists of 100k 64x64 color images split into 200 classes. At pruning rates of 90%, 95% and 97%, our method performs much better than magnitude pruning (MP), with gains in test accuracy of 13 to 36 points, and only 6 points below the FMP method. At 99%, our method outperforms both MP and FMP, with gains of 34.5 and 14 points respectively, at a test accuracy of 35%.
0C97 & 1358: The first advantage of our method is related to disconnections in the network. The MP method prunes the weights after training, only considering the magnitude of the weights as a saliency criterion. However, the weight distributions differ from one layer to another: some layers have weight distributions that are far more centered around zero than others. Hence, too high a pruning rate might prune all the weights of a layer (which generally happens in the last layers of a network, particularly if these layers are fully connected). As a consequence, we observe a disconnection in the network, leading to 1) the impossibility of fine-tuning the preceding layers and 2) outputs that no longer depend on the input. In contrast, our method being end-to-end, it jointly optimizes both the weights and the topology in a progressive manner, which prevents the potential disconnections introduced by an a posteriori pruning. The second advantage of our method is that it does not need fine-tuning. This is particularly useful in applied cases such as embedded neural networks. On the one hand, pruning yields a lightweight model that can be shipped on low-power devices. On the other hand, it avoids a fine-tuning step that can last for months in some cases, allowing for faster development cycles and model shipping to the final customer.
[1] Liu et al., "A Deep Neural Network Pruning Method Based on Gradient L1-norm", IEEE ICCC 2020
[2] https://tiny-imagenet.herokuapp.com/
-> 486 words
## Revised version
050E: The fine-tuned magnitude pruning (FMP) scores are obtained by evaluating the accuracy, on the CIFAR-10 test set, of a network that underwent full training, pruning to the targeted rate, and then fine-tuning (with the same hyperparameters as training). The indicated performance is the best of 5 runs.
050E asks what our contribution is compared to [1] (2020) (which, incidentally, is extremely similar to [3]: the same saliency criterion, applied to gradients instead of weights). First, we kindly remind 050E that we did not test our method on VGG16 but on VGG19. Second, [1] is a structured pruning method, which prunes filters. As mentioned in section 1 of our paper, structured pruning introduces a strong prior on the network topology to avoid sparse networks. In contrast, we chose unstructured pruning to ensure our method is not constrained in the topology search. Moreover, we target higher pruning rates (90%+) than Liu et al., who report pruning rates (PR) up to 71%. Very high PR (90%+) ensures that we take full advantage of sparse computation libraries on embedded devices.
0C97 asked for tests on larger datasets with larger images. We chose the Tiny ImageNet dataset [2], which consists of 100k 64x64 color images split into 200 classes. At PR of 90%, 95% and 97%, our method performs much better than the magnitude pruning (MP) method, with gains in test accuracy of 13 to 36 points, and only 6 points below the FMP method. At 99%, our method outperforms both MP and FMP, with gains of 34.5 and 14 points respectively, at a test accuracy of 35%. We would be able to include these results in an updated version of the paper.
0C97 & 1358: The first advantage of our method is related to disconnections in the network. The MP method prunes the weights after training, only considering the magnitude of the weights as a saliency criterion. However, the weight distributions differ from one layer to another: some layers have weight distributions that are far more centered around zero than others. Hence, too high a PR might prune all the weights of a layer (which generally happens in the last layers of a network, particularly if these layers are fully connected). As a consequence, we observe a disconnection in the network, leading to 1) the impossibility of fine-tuning the preceding layers and 2) outputs that no longer depend on the input. In contrast, our method being end-to-end, it jointly optimizes both the weights and the topology in a progressive manner, which prevents the potential disconnections introduced by an a posteriori pruning. The second advantage of our method is that it does not need fine-tuning. This is particularly useful in applied cases such as embedded neural networks (including face recognition). On the one hand, pruning yields a lightweight model that can be shipped on low-power devices. On the other hand, it avoids a fine-tuning step that can last for months in some cases, allowing for faster development cycles and model shipping to the final customer.
[1] Reference suggested by 050E.
[2] https://github.com/ksachdeva/tiny-imagenet-tfds
[3] Reference [16] in our paper
## Revised Version 2
Review 050E: The fine-tuned magnitude pruning scores correspond to the accuracy (on CIFAR-10) of lightweight networks obtained by (i) training primary networks (VGG19, etc.), (ii) pruning their weights to the targeted rate, and (iii) fine-tuning these weights using the same hyper-parameters. We kindly remind the reviewer that this training process is carried out on VGG19, not VGG16.
Regarding the suggested reference [Liu-et-al-2020] (closely related to a group of methods, such as [3], already cited in our paper): this reference uses the same saliency criterion as [3] but applied to gradients instead of weights. Moreover, [Liu-et-al-2020] is based on structured pruning: as mentioned in section 1 of our paper, structured pruning relies on strong priors on the topology of the trained networks, which makes balancing high pruning rates and accuracy very hard to achieve. In contrast, our proposed unstructured pruning avoids these rigid priors by training the weights of the networks while also adapting their topology. In practice, our method achieves very high pruning rates (90%+) compared to those reported in [Liu-et-al-2020], which do not exceed 71%, and this allows taking full advantage of sparse computation libraries on embedded devices.
Review 0C97: the reviewer suggested extra experiments on larger datasets and images. We recently applied our method to Tiny ImageNet (https://github.com/ksachdeva/tiny-imagenet-tfds), which consists of 100,000 images of 64x64 pixels belonging to 200 classes. At high pruning rates, for instance 99%, our method outperforms not only magnitude pruning (MP), but also fine-tuned MP, by a significant margin (34.5 and 14 points respectively). These recent extra results can be added to our tables for different pruning rates.
Reviews 0C97 & 1358: in contrast to MP, one of the major advantages of our method is its ability to prevent disconnections. Indeed, as MP decouples weight training from pruning, one may end up with pruned networks which are topologically inconsistent. In other words, networks with heterogeneous weight distributions may result in completely disconnected layers (especially fully connected ones) when their incoming or outgoing connections are completely pruned. This behavior, observed at high pruning rates, makes fine-tuning powerless to restore high accuracy, and results in random classification performance. In contrast, our method jointly optimizes network weights and topology, and thereby prevents disconnections. As a second advantage, our method bypasses the fine-tuning step, which may last for months in some applications. This allows faster development cycles and yields lightweight embedded networks that can easily be shipped and deployed on low-power devices.