# Response to Reviewer *u43X*
We thank reviewer *u43X* for the valuable feedback about our work.
> However, it feels like a mere extension of the evaluation provided by Sehwag et al. (NeurIPS 2020), detailed in Section 4.2 of their paper. "Credit" replaces the pre-training method (ensembles) and uses recent work to adjust HYDRA's loss $L_{pruning}$ that has been designed to be generalizable all along. Thus, the technical novelty of the approach is limited.
Thanks for the comment. However, we respectfully disagree that the technical novelty of our proposed approach is limited. We kindly note that this is the first work to use *sparse ensembles* to achieve high certified robustness, and we empirically found that prediction consistency under Gaussian augmentation, model gradient diversity, and a large confidence margin are the key components for pruning a diverse yet sparse ensemble.
> Please **clearly state what variant of HYDRA's loss the authors compare to**.
Thanks for the suggestion. In this work, our focus is to provide tighter and faster certification and inference under randomized smoothing. We compare against the `randomized smoothing (vra-s)` variant of HYDRA's loss. We will revise our paper accordingly for better clarity.
> Please **better explain what the "Consistency" baseline is all about.**
Thanks for the clarification question. In all experiments involving consistency regularization, we follow the standard *pre-training + pruning + finetuning* pipeline for model pruning, using the consistency regularizer as the certifiably robust training / pruning objective. On ImageNet, we mainly compare with consistency regularization because it is the state-of-the-art robust training regularizer for the certified robustness of *dense* models under randomized smoothing on ImageNet. Moreover, HYDRA did not report any evaluation on ImageNet and fails to achieve non-trivial certified accuracy there.
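For concreteness, below is a minimal PyTorch sketch of a prediction-consistency regularizer over Gaussian-augmented copies of an input. The exact objective we use follows the published consistency-regularization loss; the function name, the noise level `sigma`, and the number of noisy copies here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, sigma=0.5, n_noise=4):
    """Illustrative sketch (not the exact baseline objective): encourage the model
    to make the same prediction on several Gaussian-augmented copies of each input
    by penalizing the KL divergence of each noisy prediction from their mean."""
    log_probs = torch.stack([
        F.log_softmax(model(x + sigma * torch.randn_like(x)), dim=1)
        for _ in range(n_noise)
    ])                                           # (n_noise, batch, num_classes)
    mean_probs = log_probs.exp().mean(dim=0)     # averaged prediction over the noisy copies
    # KL(mean || p_k) for each noisy copy k, averaged over copies and the batch
    kl = (mean_probs * (mean_probs.clamp_min(1e-12).log() - log_probs)).sum(dim=-1)
    return kl.mean()
```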
> I think the authors should **elaborate on the runtime improvements over dense ensembles more strongly.**
Thanks for the suggestion. To elaborate on the runtime improvements over dense ensembles more strongly, we performed additional runtime evaluations following the same procedure as in Section IV-C with different *model architectures, datasets, and sparsity levels*, shown in the table below (a sketch of the measurement procedure is given after the table). Overall, both HYDRA and Credit achieve significant speed-ups in inference and certification compared with their dense counterparts. We will revise our paper accordingly to put stronger emphasis on the runtime improvements over dense ensembles.
| Architecture | Training Dataset | Sparsity Level | Mean Latency in ms/batch ± std | Throughput (items/sec) |
| --------------- | ---------------- | -------------- | ------------------------------ | ---------------------- |
| WideResNet-28-4 | CIFAR-10 | Dense | 6374.3137 ± 2692.4499 | 2040.1573 |
| | | 90% | 2045.7099 ± 863.2472 | 6310.4437 |
| | | 95% | 1550.6562 ± 832.4383 | 8104.02 |
| | | 99% | 1273.4309 ± 676.0076 | 10153.7556 |
| WideResNet-28-4 | CIFAR-10 | Dense | 6240.4341 ± 2474.6815 | 2015.0741 |
| | | 90% | 732.3455 ± 387.9789 | 19891.4304 |
| | | 95% | 709.9382 ± 374.7545 | 20219.5839 |
| | | 99% | 762.9356 ± 421.3151 | 18867.2881 |
| WideResNet-28-4 | SVHN | Dense | 6329.9919 ± 2307.9137 | 2038.5784 |
| | | 90% | 2109.6255 ± 812.8043 | 6227.043 |
| | | 95% | 2012.7913 ± 525.8822 | 7360.1363 |
| | | 99% | 2084.7073 ± 490.8666 | 7127.1265 |
| ResNet-50 | ImageNet | Dense | 31811.1827 ± 15909.3763 | 304.5999 |
| | | 90% | 13097.1140 ± 5899.1730 | 858.7413 |
| | | 95% | 12094.0817 ± 5460.7789 | 926.4405 |
| | | 99% | 10433.9441 ± 4780.2449 | 1155.4011 |
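As referenced above, the latency and throughput numbers are obtained with a measurement loop along the lines of the sketch below; the function and argument names are illustrative, and the exact procedure is the one described in Section IV-C of our paper.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, batch, n_warmup=10, n_runs=100):
    """Illustrative sketch of the latency/throughput measurement: time repeated
    forward passes on a fixed batch and report mean latency (ms/batch) and items/sec."""
    model.eval()
    for _ in range(n_warmup):                 # warm-up to exclude one-time setup costs
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()          # wait for the GPU before reading the clock
        latencies.append((time.perf_counter() - start) * 1000.0)   # ms per batch
    lat = torch.tensor(latencies)
    throughput = batch.shape[0] / (lat.mean().item() / 1000.0)     # items per second
    return lat.mean().item(), lat.std().item(), throughput
```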
> I want to encourage the authors to **show the runtime difference for HYDRA** to get a better feeling for the yield improvement of sparse ensembles.
Thanks for the comment. We also benchmarked HYDRA's inference runtime with WideResNet-28-4 on the CIFAR-10 dataset to compare against Credit. Indeed, because HYDRA and Credit target the same sparsity, their relative improvements over dense ensembles are comparable. However, we note that Credit provides much better certified accuracy than HYDRA, as demonstrated in the main results of our paper.
| Pruning Method | Sparsity Level | Mean Latency in ms/batch ± std | Throughput (items/sec) |
| -------------- | -------------- | ------------------------------ | ---------------------- |
| Credit | Dense | 6374.3137 ± 2692.4499 | 2040.1573 |
| | 90% | 2045.7099 ± 863.2472 | 6310.4437 |
| | 95% | 1550.6562 ± 832.4383 | 8104.02 |
| | 99% | 1273.4309 ± 676.0076 | 10153.7556 |
| HYDRA | Dense | 6240.4341 ± 2474.6815 | 2015.0741 |
| | 90% | 2204.4622 ± 924.2661 | 6241.8039 |
| | 95% | 1737.7222 ± 996.6428 | 7835.195 |
| | 99% | 1406.2721 ± 740.8576 | 10374.8085 |
> Moreover, I think the paper would also benefit from experiments on structural pruning, which can further strengthen the focus on runtime improvements.
Thanks for the suggestion. We performed additional experiments with structured pruning. Concretely, we take the weight importance scores learned during Credit's pruning step for the WideResNet-28-4 architecture on CIFAR-10 and perform one-shot $\ell_1$ structured pruning to remove the convolution filters with low absolute importance scores (a minimal sketch of this step is given after the latency table below). We then use the resulting sparsity mask to finetune the model following Credit's finetuning procedure. Finally, we use the same benchmarking procedure as in Section IV-C of our paper to measure the mean latency and throughput of the model pruned with Credit under structured sparsity. The results are shown in the table below.
At roughly the same sparsity level, structured sparsity improves inference and certification speed by more than two-fold over unstructured sparsity.
| Sparsity Type | Sparsity Level | Mean Latency in ms/batch ± std | Throughput (items/sec) |
| ------------- | -------------- | ------------------------------ | ---------------------- |
| Unstructured | Dense | 6374.3137 ± 2692.4499 | 2040.1573 |
| | 90% | 2045.7099 ± 863.2472 | 6310.4437 |
| | 95% | 1550.6562 ± 832.4383 | 8104.02 |
| | 99% | 1273.4309 ± 676.0076 | 10153.7556 |
| Structured | Dense | 6374.3137 ± 2692.4499 | 2040.1573 |
| | 90% | 732.3455 ± 387.9789 | 19891.4304 |
| | 95% | 709.9382 ± 374.7545 | 20219.5839 |
| | 99% | 762.9356 ± 421.3151 | 18867.2881 |
However, we note that structured pruning can lead to a significant loss of model utility and certified robustness. We compare the certified robustness of Credit's structured pruning variant against the unstructured variant in the table below. At 99% sparsity, structured pruning incurs an over 20% drop in certified robustness. Therefore, we believe that the current unstructured pruning version of Credit achieves a better trade-off between certified robustness and inference speed than structured pruning.
| Sparsity Level | Pruning Method | $r = 0$ | $r = 0.25$ | $r = 110 / 255$ | $r = 0.5$ | $r = 0.75$ |
| :------------: | :-------------------: | :-----: | :--------: | :-------------: | :-------: | :--------: |
| 0.9 | Credit (Unstructured) | 0.786 | 0.701 | 0.629 | 0.602 | 0.496 |
| | Credit (Structured) | 0.649 | 0.577 | 0.519 | 0.499 | 0.402 |
| 0.95 | Credit (Unstructured) | 0.775 | 0.687 | 0.623 | 0.598 | 0.497 |
| | Credit (Structured) | 0.594 | 0.52 | 0.463 | 0.446 | 0.365 |
| 0.99 | Credit (Unstructured) | 0.701 | 0.625 | 0.569 | 0.545 | 0.451 |
| | Credit (Structured) | 0.442 | 0.393 | 0.35 | 0.333 | 0.283 |
# Response to Reviewer *uXpP*
We thank reviewer *uXpP* for the valuable feedback about our work.
> Is it fair to compare hydra to this new method (or any other one for the matter) since it basically uses 3 times the number of parameters? Or is sparsity computed over the whole ensemble?
Thanks for the clarification question. We note that the sparsity is applied over the whole ensemble, and we control Credit to have the same sparsity as HYDRA. To further clarify this point, we performed additional evaluations in which the number of parameters is also controlled. As shown in the table below, our method still outperforms HYDRA by a large margin when the number of parameters is controlled.
| Pruning Ratio | Method | $r = 0$ | $r = 0.25$ | $r = 110 / 255$ | $r = 0.5$ | $r = 0.75$ |
| :-----------: | :----: | :-------: | :--------: | :-------------: | :-------: | :--------: |
| 0.9 | HYDRA | 0.78 | 0.687 | 0.599 | 0.562 | 0.466 |
| | Credit | **0.786** | **0.701** | **0.629** | **0.602** | **0.496** |
| 0.95 | HYDRA | 0.759 | 0.678 | 0.598 | 0.572 | 0.45 |
| | Credit | **0.775** | **0.687** | **0.623** | **0.598** | **0.497** |
| 0.99 | HYDRA | 0.697 | 0.616 | 0.548 | 0.527 | 0.409 |
| | Credit | **0.701** | **0.625** | **0.569** | **0.545** | **0.451** |
> Since publishing, HYDRA is no longer state of the art in performance. What do auto-attack results look like for it?
Thanks for the suggestion. We note that auto-attack only evaluates the *empirical robustness* of models, while our work focuses on pruning a *diversified sparse ensemble* to achieve *high certified robustness*, which guarantees that an input's prediction remains unchanged as long as the perturbation is contained in an $\ell_2$ ball of the certified radius around it. Additionally, because auto-attack is designed for non-smoothed models, it cannot directly generate adversarial examples against the smoothed classifier, whose prediction relies on a Monte Carlo estimation procedure. Nevertheless, we performed an additional evaluation of the empirical robustness of HYDRA and Credit with the following procedure (a short sketch of this attack is given after the results below): (1) we take the mean of the output confidence of the base ensemble evaluated on 30 Gaussian-augmented views of the same image; (2) we run projected gradient descent (PGD) with maximum perturbation norm $\varepsilon = 0.5$ to minimize this mean confidence and obtain the corresponding adversarial image; (3) we evaluate our smoothed model on the generated adversarial images. We present the results below.
| Method \ Sparsity | 0.9 | 0.95 | 0.99 |
| ----------------- | ----- | ----- | ----- |
| HYDRA | 0.721 | 0.65 | 0.647 |
| Credit | 0.729 | 0.712 | 0.641 |
We would like to emphasize again that our focus is on *certified robustness* rather than *empirical robustness*.
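As referenced above, a minimal PyTorch sketch of the attack procedure is shown below. The function name, the noise level `sigma`, the number of PGD steps, the step-size heuristic, and the $\ell_2$ projection are illustrative assumptions, and `ensemble(x)` is assumed to return the averaged class probabilities of the base models.

```python
import torch

def pgd_attack_smoothed(ensemble, x, y, sigma=0.5, eps=0.5, steps=20, n_noise=30):
    """Illustrative sketch: PGD on the mean true-class confidence of the base
    ensemble, estimated over n_noise Gaussian-augmented copies of the input."""
    alpha = 2.5 * eps / steps                    # common PGD step-size heuristic (assumption)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # (1) mean true-class confidence over n_noise Gaussian-augmented views
        conf = torch.stack([
            ensemble(x_adv + sigma * torch.randn_like(x_adv))
            .gather(1, y[:, None]).squeeze(1)
            for _ in range(n_noise)
        ]).mean()
        # (2) PGD step that decreases the confidence, projected onto the l2 ball of radius eps
        grad, = torch.autograd.grad(conf, x_adv)
        x_adv = x_adv.detach() - alpha * grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
        delta = x_adv - x
        scale = torch.clamp(eps / (delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12), max=1.0)
        x_adv = (x + delta * scale).clamp(0, 1)  # keep valid pixel range (assumed [0, 1])
    # (3) the smoothed model is then evaluated on the returned adversarial images
    return x_adv.detach()
```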
> Have you made sure that gradient diversity is enforced across the whole ensemble? How does one pick when to stop?
Thanks for the question. We note that gradient diversity in Credit is enforced *between every pair of base models* and *throughout the entire training process*. For each base model pair $(F_i, F_j)$ with $i, j \in [N]$ and $i \neq j$, our gradient diversity loss promotes the diversity of the gradients of $F_i$ and $F_j$, where each gradient is taken with respect to the difference of the model's outputs between labels (the ground-truth label $y_0$ and the runner-up label $y_i^{(2)}$):
$$
\mathcal{L}_{GD}(x_0)_{ij} = \left\|\nabla_x f_i^{y_0/ y_i^{(2)}} + \nabla_x f_j^{y_0 / y_i^{(2)} }\right\|.
$$
This term jointly reflects the magnitude and the directional alignment of the base models' gradients: the norm of the summed gradients is small only when the individual gradients are small in magnitude or point in sufficiently diverse directions.
As for the stopping criterion, we use a fixed one: training stops once a preset number of epochs is reached.
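For concreteness, a minimal PyTorch sketch of this pairwise term is given below; the function names are illustrative, and the choice of runner-up label per model is an assumption made for illustration (the exact definition follows our paper).

```python
import torch

def gradient_diversity_loss(model_i, model_j, x, y_true):
    """Illustrative sketch of the pairwise gradient-diversity term L_GD:
    the l2 norm of the sum of the two base models' margin gradients, where each
    margin is (logit of the ground-truth label) minus (logit of the runner-up label)."""
    x = x.clone().requires_grad_(True)

    def margin_grad(model):
        logits = model(x)                                    # (batch, num_classes)
        true_logit = logits.gather(1, y_true[:, None]).squeeze(1)
        # runner-up label: highest logit excluding the ground-truth label
        runner_up = logits.scatter(1, y_true[:, None], float("-inf")).max(dim=1).values
        margin = (true_logit - runner_up).sum()
        grad, = torch.autograd.grad(margin, x, create_graph=True)
        return grad.flatten(1)                               # (batch, input_dim)

    g_i, g_j = margin_grad(model_i), margin_grad(model_j)
    return (g_i + g_j).norm(dim=1).mean()                    # || grad_i + grad_j ||
```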
# Response to Reviewer *sMk7*
We thank the reviewer for the appreciation of our work and address the questions below.
> Can this method be applied to other types of robustness threat models?
Thanks for the question. We note that our proposed approach can be readily extended to other $\ell_p$-bounded perturbations by leveraging existing techniques. For example, [1] and [2] provide tools for certifying $\ell_1$ and $\ell_\infty$ robustness, and [3] can be applied to provide certified robustness for tasks with structured outputs such as image segmentation, object detection, and generative models.
[1] Yang, Greg, Tony Duan, J. Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. "Randomized smoothing of all shapes and sizes." In *International Conference on Machine Learning*, pp. 10693-10705. PMLR, 2020.
[2] Levine, Alexander J., and Soheil Feizi. "Improved, Deterministic Smoothing for L_1 Certified Robustness." In *International Conference on Machine Learning*, pp. 6254-6264. PMLR, 2021.
[3] Kumar, Aounon, and Tom Goldstein. "Center Smoothing: Certified Robustness for Networks with Structured Outputs." *Advances in Neural Information Processing Systems* 34 (2021): 5560-5575.
> Instead of the weight-subspace ensembling, what about the standard ensembling method of ensembling a few checkpoints during model training?
Thanks for the suggestion. We performed three additional experiments in which we ensemble three checkpoints taken along the training trajectory of a single model and then prune and finetune this ensemble with Credit (a short sketch of how the checkpoint ensemble is formed is given after the table). As shown in the table below, simply ensembling checkpoints from the same training trajectory as the starting point for pruning yields inferior certified robustness, since it does not explicitly enforce weight diversity or gradient diversity.
| Pruning Ratio | Method | $r = 0$ | $r = 0.25$ | $r = 110 / 255$ | $r = 0.5$ | $r = 0.75$ |
| :-----------: | :---------------------------: | :-------: | :--------: | :-------------: | :-------: | :--------: |
| 0.9 | Credit | **0.786** | **0.701** | **0.629** | **0.602** | **0.496** |
| | Credit (Ensemble Checkpoints) | 0.714 | 0.641 | 0.569 | 0.55 | 0.465 |
| 0.95 | Credit | **0.775** | **0.687** | **0.623** | **0.598** | **0.497** |
| | Credit (Ensemble Checkpoints) | 0.706 | 0.629 | 0.567 | 0.547 | 0.462 |
| 0.99 | Credit | **0.701** | **0.625** | **0.569** | **0.545** | **0.451** |
| | Credit (Ensemble Checkpoints) | 0.671 | 0.591 | 0.532 | 0.508 | 0.432 |
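As referenced above, a short sketch of how the checkpoint ensemble is constructed before Credit's pruning and finetuning steps; the function names and loading details are illustrative.

```python
import torch

def checkpoint_ensemble(arch_fn, ckpt_paths):
    """Illustrative sketch: load checkpoints saved along a single training trajectory
    into separate copies of the architecture and average their predictions; this
    ensemble is then pruned and finetuned with Credit as usual."""
    models = []
    for path in ckpt_paths:                       # e.g. three checkpoints from one training run
        model = arch_fn()
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        models.append(model)

    def predict(x):
        # average the softmax outputs of the checkpoint members
        return torch.stack([torch.softmax(m(x), dim=1) for m in models]).mean(dim=0)

    return predict
```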
> Also, how does the number of models in the ensemble impact results?
Thanks for the suggestion. We performed additional evaluations with an increasing number of ensemble members in Credit, shown in the table below. However, we observed only marginal or no improvement in the final performance, while the inference time becomes significantly longer.
| Pruning Ratio | Method | $r = 0$ | $r = 0.25$ | $r = 110 / 255$ | $r = 0.5$ | $r = 0.75$ |
| :-----------: | :------------: | :-------: | :--------: | :-------------: | :-------: | :--------: |
| 0.9 | Credit (N = 3) | **0.786** | **0.701** | 0.629 | 0.602 | 0.496 |
| | Credit (N = 4) | 0.771 | 0.692 | 0.629 | **0.605** | 0.488 |
| | Credit (N = 5) | 0.777 | 0.697 | **0.631** | 0.601 | **0.504** |
| 0.95 | Credit (N = 3) | **0.775** | 0.687 | **0.623** | **0.598** | 0.497 |
| | Credit (N = 4) | 0.752 | 0.678 | 0.612 | 0.588 | **0.5** |
| | Credit (N = 5) | 0.766 | **0.684** | 0.622 | 0.596 | **0.5** |
| 0.99 | Credit (N = 3) | 0.701 | 0.625 | 0.569 | 0.545 | 0.451 |
| | Credit (N = 4) | 0.699 | 0.625 | 0.571 | **0.557** | 0.461 |
| | Credit (N = 5) | **0.715** | **0.637** | **0.572** | 0.549 | **0.468** |
> does that mean each model is independently at 99% sparsity level, but since there are 3 models, the sparsity level is effectively around 97%
Thanks for the clarification question. We note that the sparsity is applied over the whole ensemble, and we control Credit to have the same sparsity as HYDRA. To further clarify this point, we performed additional evaluations in which the number of parameters is also controlled. As shown in the table below, our method still outperforms HYDRA by a large margin when the number of parameters is controlled.
| Pruning Ratio | Method | $r = 0$ | $r = 0.25$ | $r = 110 / 255$ | $r = 0.5$ | $r = 0.75$ |
| :-----------: | :----: | :-------: | :--------: | :-------------: | :-------: | :--------: |
| 0.9 | HYDRA | 0.78 | 0.687 | 0.599 | 0.562 | 0.466 |
| | Credit | **0.786** | **0.701** | **0.629** | **0.602** | **0.496** |
| 0.95 | HYDRA | 0.759 | 0.678 | 0.598 | 0.572 | 0.45 |
| | Credit | **0.775** | **0.687** | **0.623** | **0.598** | **0.497** |
| 0.99 | HYDRA | 0.697 | 0.616 | 0.548 | 0.527 | 0.409 |
| | Credit | **0.701** | **0.625** | **0.569** | **0.545** | **0.451** |