## General Response

We want to thank all reviewers for taking the time to read our manuscript carefully and for providing constructive and insightful feedback. We are very encouraged by the positive comments of the reviewers on **tackling a well-motivated and important problem** (Reviewers E5Wk, PQzZ), the **simplicity, generality, and novelty of SLaM** (Reviewers E5Wk, PQzZ, Un32, 7BQA), **its extensive experimental evaluation and performance gains** (Reviewers E5Wk, PQzZ, Un32), **its strong theoretical guarantees** (Reviewers PQzZ, Un32, 7BQA), and **the writing quality and the clarity of the presentation of the ideas** (Reviewers PQzZ, 7BQA).

Here we give an update about our theoretical contribution and provide an overview of the additional experiments included in the rebuttal. We provide detailed responses to each reviewer separately. We look forward to engaging in further discussion with the reviewers, answering questions, and discussing improvements.

### Additional Experiments

1. An ablation investigating the two different ways to estimate $k(x)$; see the Response to Reviewer E5Wk.

   CIFAR100: Fixed-Value vs Data-Dependent k(x)

   | Labeled Examples | 10% | 15% | 20% |
   | ---------------- | ------------ | ------------| ------------|
   | Vanilla | $37.94 \pm 0.10$ | $46.42 \pm 0.24$ | $52.17 \pm 0.21$ |
   | k=2 | $41.25 \pm 0.36$ | $49.18 \pm 0.19$ | $54.48 \pm 0.25$ |
   | k=5 | $40.71 \pm 0.29$ | $49.41 \pm 0.23$ | $54.41 \pm 0.2$ |
   | k=10 | $41.2 \pm 0.8$ | $49.31 \pm 0.12$ | $54.42 \pm 0.19$ |
   | Data-Dependent k(x) ($t=0.9$) | $\mathbf{42.7 \pm 0.30}$ | $\mathbf{49.89 \pm 0.23}$ | $\mathbf{54.73 \pm 0.27}$ |

2. An ablation investigating the effect of the validation size; see the Response to Reviewer E5Wk.

   | CIFAR100 | 5000 | 7500 | 10000 | 12500 | 15000 | 17500 |
   |-------------------------------|--------|--------|--------|--------|--------|--------|
   | Validation 256 | $40.97 \pm 0.12$ | $49.21 \pm 0.17$ | $54.18 \pm 0.23$ | $58.54 \pm 0.13$ | $61.18 \pm 0.25$ | $63.19 \pm 0.07$ |
   | Validation 512 | $41.06 \pm 0.30$ | $49.27 \pm 0.19$ | $54.36 \pm 0.12$ | $58.57 \pm 0.26$ | $61.25 \pm 0.31$ | $63.38 \pm 0.11$ |
   | Validation 1024 | $41.83 \pm 0.32$ | $49.35 \pm 0.25$ | $54.71 \pm 0.18$ | $58.95 \pm 0.32$ | $61.28 \pm 0.46$ | $63.62 \pm 0.28$ |

3. We investigated the robustness of SLaM to inaccurate predictions of $\alpha(x)$ and $k(x)$, as proposed by Reviewer PQzZ.

   | CIFAR100 10000 Labeled Examples | $\ell$ = 0 | $\ell$ = 5 | $\ell$ = 10 | $\ell$ = 50 | $\ell$ = 90 |
   |--------------------------------------|-------------|-------------|--------------|--------------|--------------|
   | $\sigma$ = 0 | $61.69 \pm 0.3$ | $59.71 \pm 0.49$ | $59.7 \pm 0.55$ | $59.63 \pm 0.44$ | $59.4 \pm 0.65$ |
   | $\sigma$ = 0.1 | $60.04 \pm 0.29$ | $59.55 \pm 0.33$ | $59.66 \pm 0.35$ | $59.79 \pm 0.2$ | $60.03 \pm 0.41$ |
   | $\sigma$ = 0.2 | $59.39 \pm 0.1$ | $59.23 \pm 0.25$ | $59.5 \pm 0.39$ | $59.16 \pm 0.29$ | $59.31 \pm 0.31$ |
   | $\sigma$ = 0.5 | $57.71 \pm 0.15$ | $57.6 \pm 0.23$ | $57.56 \pm 0.28$ | $57.46 \pm 0.25$ | $57.29 \pm 0.21$ |

4. We investigated using different regression methods (kNN, logistic regression) for estimating $\alpha(x)$ and $k(x)$ with SLaM, as proposed by Reviewer PQzZ.

5. We performed additional experiments using fewer labeled examples; see the response to Reviewer Un32 for more details.
   | CIFAR10, other dataset sizes | 1% | 5% |
   |------------------------------|----------------|----------------|
   | Teacher | $10.07$ | $51.67$ |
   | Vanilla | $11.75 \pm 0.1$ | $54.2 \pm 0.16$ |
   | Taylor-CE | $10.00 \pm 0.01$ | $55.14 \pm 0.28$ |
   | UPS | $12.74 \pm 0.94$ | $56.21 \pm 0.23$ |
   | VID | $13.25 \pm 0.26$ | $54.32 \pm 0.05$ |
   | Weighted | $12.67 \pm 0.01$ | $54.58 \pm 0.1$ |
   | SLaM | $\mathbf{26.73 \pm 0.02}$ | $\mathbf{57.40 \pm 0.05}$ |

6. We included a table reporting the final-epoch accuracy of all methods on CIFAR100, as proposed by Reviewer 7BQA.

   | Final-epoch experiments CIFAR-100 | 5000 | 7500 | 10000 | 12500 | 15000 | 17500 |
   |-----------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
   | Vanilla | $37.97 \pm 0.1$ | $46.37 \pm 0.12$ | $51.9 \pm 0.13$ | $57.85 \pm 0.17$ | $60.86 \pm 0.18$ | $63.56 \pm 0.35$ |
   | Taylor-CE | $40.12 \pm 0.03$ | $47.99 \pm 0.21$ | $54.18 \pm 0.21$ | $57.76 \pm 0.06$ | $61.22 \pm 0.11$ | $63.50 \pm 0.20$ |
   | UPS | $39.47 \pm 0.18$ | $48.29 \pm 0.20$ | $53.27 \pm 0.26$ | $57.83 \pm 0.29$ | $61.16 \pm 0.13$ | $62.58 \pm 0.28$ |
   | VID | $37.76 \pm 0.13$ | $46.3 \pm 0.18$ | $52.07 \pm 0.12$ | $57.91 \pm 0.31$ | $60.57 \pm 0.21$ | $63.43 \pm 0.18$ |
   | Weighted | $38.39 \pm 0.07$ | $46.91 \pm 0.08$ | $52.48 \pm 0.13$ | $57.7 \pm 0.15$ | $\mathbf{61.32 \pm 0.31}$ | $63.66 \pm 0.46$ |
   | SLaM | $\mathbf{41.16 \pm 0.16}$ | $\mathbf{49.11 \pm 0.17}$ | $\mathbf{54.31 \pm 0.19}$ | $\mathbf{58.59 \pm 0.24}$ | $\mathbf{61.30 \pm 0.26}$ | $\mathbf{63.72 \pm 0.43}$ |

### Theoretical Contribution

We would like to share with the reviewers and ACs an exciting recent result (which appeared in COLT 2023) presenting strong evidence for the optimality of our theoretical result for SLaM. Our results include a novel theoretical result for learning halfspaces with random classification noise (RCN) (see Related Work). Our contribution **improves the theoretical SOTA** and shows that SLaM can learn halfspaces with $O(1/(\gamma^2 \epsilon^2))$ examples, while the previous best-known result was that of Diakonikolas et al. [DGT19].

**A Recent Update.** The paper [DKK+23], which appeared in COLT 2023, gives a Statistical Query (SQ) lower bound of $\Omega(1/(\gamma^{1/2} \epsilon^2))$. Therefore, there is now strong evidence (the class of SQ algorithms is very broad and contains gradient-based optimization methods) that **our theoretical result for learning noisy halfspaces is, in fact, optimal in its dependence on the accuracy parameter $\epsilon$.** Using different techniques, the authors of [DKK+23] also provide an $O(1/(\gamma^2\epsilon^2))$ upper bound for the problem. We will update our conclusion section to include this discussion.

[DKK+23]: Diakonikolas I., Diakonikolas J., Kane D., Wang P., Zarifis N. Information-Computation Tradeoffs for Learning Margin Halfspaces with Random Classification Noise. COLT 2023.

[DGT19]: Diakonikolas I., Gouleakis T., Tzamos C. Distribution-Independent PAC Learning of Halfspaces with Massart Noise. NeurIPS 2019 **(Outstanding Paper Award)**.

## Response to Reviewer E5Wk

> The paper lacks the explanation on the intuition of the mixing solution on formula (1), w ... *What is the intuition of the second term on formula (1)?*

**Short Answer**

* The high-level intuition of equation (1) appears in Lines 212-226 of the manuscript.
* In Appendix C of the supplementary material we have provided a **formal proof** that the minimizer of the SLaM objective using equation (1) is the ground-truth labeling (given noisy teacher predictions).
* The second term is needed to ensure that the minimizer of the SLaM objective is the ground-truth labeling; see Appendix C.

**More Details**

Perhaps it is easier to understand the mixing operation in the binary setting: the expected noisy teacher label of an example $x$ is "mixed", i.e., $\alpha(x) g(x) + (1-\alpha(x)) (1-g(x))$. This is because the teacher is correct (equal to the ground truth $g(x)$) with probability $\alpha(x)$ and incorrect (equal to $1-g(x)$) with probability $1-\alpha(x)$. Performing the same mixing operation on the student means that for a point $x$ we "mix" the student's prediction $s(x)$ and output $\alpha(x) s(x) + (1-\alpha(x)) (1-s(x))$. Minimizing the cross-entropy loss with respect to the noisy teacher label yields a model $s(x)$ that satisfies $\alpha(x) g(x) + (1-\alpha(x)) (1-g(x)) = \alpha(x) s(x) + (1-\alpha(x)) (1-s(x))$ (by Gibbs' inequality). This means that it must be the case that $s(x) = g(x)$, i.e., the student prediction $s(x)$ equals the ground-truth label $g(x)$. We refer to Appendix C for more details and the multi-class setting.

> Even though ... not robust enough to learn $\alpha(x)$ and $k(x)$. A... it is easy to make $\alpha$ overfit to those data. -- How important is the design $\alpha(x)$ and $k(x)$ is?

* As we mention in our manuscript, the parameter $\alpha(x)$ that we estimate is only a function of the teacher's margin, so it can be learned from very little data via a **simple one-dimensional** regression task.
* SLaM is robust to noise in the estimates of $\alpha(x)$ and $k(x)$; only rough estimates are needed for SLaM to provide improvements. In fact, in many cases, **even using a single value of $k(x)$ for all examples (say $k(x)=5$) is enough for SLaM to achieve good performance** (see our ImageNet experiments in Table 4 and also our response below). We also test the robustness of SLaM to noise in the predictions of $\alpha(x)$ and $k(x)$ in the ablation provided to Reviewer PQzZ.

> Also it has another strong assumption that \alpha is isotonic regression.

**Using isotonic regression is not an assumption about $\alpha(x)$**: our regression algorithm works regardless of whether $\alpha(x)$ is monotone as a function of the margin. Using isotonic regression is simply a way to enforce monotonicity on the learned $\hat{\alpha}(x)$, which, as we empirically observe, enhances the robustness of SLaM and leads to more stable estimates of $\alpha(x)$ (*harder to overfit*) -- see also our response to Reviewer PQzZ.

> It lacks the result details about \alpha and k setup for each experiment. Also it lacks the related ablation study to demonstrate how \alpha and k affect the model performance.

**Short Answer**

Due to space limitations, the details for $\alpha(x)$ and $k(x)$ were included in Section D of the supplementary material. See Lines 677-680 for Celeb-A and CIFAR10/100, Lines 689-693 for ImageNet, and Lines 726-727 for the Large Movies Reviews Dataset.

**More Details**

For the convenience of the reviewer, we provide here an explanation of and intuition for the hyperparameters used.

**Binary classification.** Our method only requires tuning a single hyperparameter: the accuracy lower bound for $\alpha(x)$ in isotonic regression; see the value lb in the first paragraph of Section B.1. In the Celeb-A experiment, we always used lb=0.5 (see the first paragraph of Section D.1).
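To make the binary pipeline concrete, the following is a minimal illustrative sketch (Python with numpy/scikit-learn) of the two ingredients discussed above: the one-dimensional isotonic estimate of $\alpha(x)$ from the teacher's margin with the lower bound lb, and the mixing of the student's prediction before the cross-entropy. All function/variable names and the toy data are illustrative, and several details of the actual procedure in Section B.1 (e.g., batching and the exact objective) are omitted.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_alpha_hat(val_margins, val_correct, lb=0.5):
    """Estimate alpha(margin) = P[teacher is correct | margin] by a 1-D
    isotonic regression of the 0/1 correctness indicator on the teacher's
    margin; the accuracy lower bound lb is enforced here via y_min."""
    iso = IsotonicRegression(y_min=lb, y_max=1.0, increasing=True,
                             out_of_bounds="clip")
    iso.fit(np.asarray(val_margins), np.asarray(val_correct, dtype=float))
    return iso  # use iso.predict(margins) to get alpha_hat on new examples

def slam_binary_loss(student_prob, teacher_label, alpha):
    """Binary mixing + cross-entropy (sketch): mix the student's probability
    s(x) -> alpha*s(x) + (1-alpha)*(1-s(x)) and compare it, via cross-entropy,
    with the (noisy) teacher label."""
    mixed = alpha * student_prob + (1.0 - alpha) * (1.0 - student_prob)
    mixed = np.clip(mixed, 1e-7, 1.0 - 1e-7)
    return -np.mean(teacher_label * np.log(mixed)
                    + (1.0 - teacher_label) * np.log(1.0 - mixed))

# Toy usage: margins and correctness indicators come from the labeled validation set.
rng = np.random.default_rng(0)
val_margins = rng.uniform(0.0, 1.0, size=200)
val_correct = rng.binomial(1, 0.5 + 0.5 * val_margins)   # larger margin -> more accurate teacher
alpha_hat = fit_alpha_hat(val_margins, val_correct, lb=0.5)

margins = rng.uniform(0.0, 1.0, size=32)                 # teacher margins on an unlabeled batch
alpha = alpha_hat.predict(margins)
student_prob = rng.uniform(0.0, 1.0, size=32)            # student's predicted probabilities s(x)
teacher_label = rng.binomial(1, 0.5, size=32).astype(float)
print(slam_binary_loss(student_prob, teacher_label, alpha))
```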
In the Large Movies Reviews Dataset (see Section D.3) we performed a hyperparameter search over the values {0.5, 0.6, 0.7, 0.8, 0.9} for lb. The effect of the parameter lb is clear from the mixing operation (see the mixing operation above Equation 2 in Section 3). As the accuracy lower bound lb converges to 1, SLaM converges to "vanilla" distillation, since we do not mix the student's predictions. When the accuracy lower bound is closer to 0, SLaM mixes the student's predictions more aggressively.

**Multiclass classification.** Similarly to the binary setting, our method requires knowing the accuracy of the teacher $\alpha(x)$. For CIFAR-10 and CIFAR-100, we used lb=0.5. We found that SLaM is robust to the setting of lb on those datasets, and the values {0.5, 0.6, 0.7} usually yield good performance.

In multiclass settings, we also need an estimate of $k(x)$. We give two ways of obtaining $k(x)$. The first uses the same value $k(x) = k$ for all examples. Even though this method is simple, we found that it suffices to achieve good performance gains. We used this method in our ImageNet experiments with the value k=5, as the top-5 accuracy of the teacher model was satisfactory (much higher than its top-1 accuracy) on the validation dataset. The second way uses a (data-dependent) value of $k(x)$ for each example (see Sections B.1 and B.2). This requires tuning a hyperparameter t (see the top-k accuracy threshold in Algorithm 2). We used this method on CIFAR10/100 with t=0.9. This means that for each example $x$, we find (an estimate of) the value of $k(x)$ such that the teacher's top-$k(x)$ accuracy is at least 0.9. We found that larger values of the threshold t (e.g., 0.9 or 0.95) performed better, which is consistent with item (i) of our noise model in Definition 3.2, i.e., that the ground-truth label should belong to the teacher's top-$k(x)$ predictions (when the top-1 teacher prediction is incorrect).

In the following, we present our results for CIFAR100 using a fixed value for $k(x)$. We observe that SLaM is rather robust to the value of k used, since it outperforms vanilla distillation for a range of reasonable values of k. Overall, we found that using a fixed value of $k(x)$ (after some hyperparameter search for $k$) and using the data-dependent method give comparable results. The advantage of the fixed-value method is that it is easier to implement (and slightly more efficient); the advantage of the data-dependent method is that its hyperparameter (the threshold $t$ in Algorithm 2) is easier to tune (in all our experiments, using $t=0.9$ was enough to achieve good performance gains).

CIFAR100: Fixed-Value vs Data-Dependent k(x)

| Labeled Examples | 10% | 15% | 20% |
| ---------------- | ------------ | ------------| ------------|
| Vanilla | $37.94 \pm 0.10$ | $46.42 \pm 0.24$ | $52.17 \pm 0.21$ |
| k=2 | $41.25 \pm 0.36$ | $49.18 \pm 0.19$ | $54.48 \pm 0.25$ |
| k=5 | $40.71 \pm 0.29$ | $49.41 \pm 0.23$ | $54.41 \pm 0.2$ |
| k=10 | $41.2 \pm 0.8$ | $49.31 \pm 0.12$ | $54.42 \pm 0.19$ |
| Data-Dependent k(x) ($t=0.9$) | $\mathbf{42.7 \pm 0.30}$ | $\mathbf{49.89 \pm 0.23}$ | $\mathbf{54.73 \pm 0.27}$ |

> no comparison with a baseline where the validation dataset is used for vanilla distillation, ... a fair baseline would be to incorporate the validation set for vanilla distillation ... or other state of the art unlabeled distillation?
As we mention in Lines 274-278 of the main body of the paper, to be fair to methods not using validation data, we have included the validation data in the training dataset of all compared methods.

> How does the size of the validation set affect the overall performance?

As we already discussed, SLaM requires only rough estimates of $\alpha(x)$ and $k(x)$, and thus even very small validation datasets suffice. In the following table we use different validation sizes for the CIFAR-100 experiment described in our manuscript and show that the performance of SLaM improves when the validation dataset is larger, but the gaps are not very significant, especially for larger sizes of the labeled dataset.

| CIFAR100 | 5000 | 7500 | 10000 | 12500 | 15000 | 17500 |
|-------------------------------|--------|--------|--------|--------|--------|--------|
| Validation 256 | $40.97 \pm 0.12$ | $49.21 \pm 0.17$ | $54.18 \pm 0.23$ | $58.54 \pm 0.13$ | $61.18 \pm 0.25$ | $63.19 \pm 0.07$ |
| Validation 512 | $41.06 \pm 0.30$ | $49.27 \pm 0.19$ | $54.36 \pm 0.12$ | $58.57 \pm 0.26$ | $61.25 \pm 0.31$ | $63.38 \pm 0.11$ |
| Validation 1024 | $41.83 \pm 0.32$ | $49.35 \pm 0.25$ | $54.71 \pm 0.18$ | $58.95 \pm 0.32$ | $61.28 \pm 0.46$ | $63.62 \pm 0.28$ |

## Response to Reviewer PQzZ

> .. a more rigorous ablation study of the influence of the uncertainty quantification would be insightful. ... obtaining $\alpha(x)$ and $k(x)$ from an oracle with "perfect" estimations, and gradually contaminating these estimates.

We agree with the reviewer that a "controlled" experiment that adds noise to test the robustness of SLaM to inaccurate predictions of $\alpha(x)$ and $k(x)$ is a valuable addition to our experimental evaluation. To do this, as the reviewer suggested, we start from the oracle values for $\alpha(x)$ and $k(x)$: $\alpha^*(x) = 1$ if the teacher prediction is correct on $x$ and $0$ if it is incorrect, and $k^*(x)$ is the smallest integer $\ell$ such that the ground-truth label is contained in the top-$\ell$ predictions of the teacher. We then introduce random noise to the oracle predictions. For $\alpha(x)$ we perform the random perturbation
$$\alpha(x) = \alpha^*(x) + (1-2\alpha^*(x)) \, \xi,$$
where $\xi$ is uniformly distributed in $[0,\sigma]$. Hence, when $\alpha^*(x) = 0$ we increase it by adding a random variable $\xi \in [0,\sigma]$, and when $\alpha^*(x) = 1$ we decrease it by subtracting the same random variable $\xi \in [0,\sigma]$. We clip the resulting value to the interval $[0,1]$. To create noisy values for $k(x)$ we simply add a random integer to the optimal value $k^*(x)$, i.e., $k(x) = k^*(x) + Z$, where $Z$ is a random integer in $\{-\ell,\ldots,\ell\}$. We clip the resulting value of $k(x)$ to the range $\{0,\ldots, \mathrm{numClasses}\}$.

We present our ablation for the noisy values of $\alpha(x)$ for binary distillation on Celeb-A in the table below (after a short sketch of the perturbation). We observe that in the binary setting SLaM is rather robust to the quality of the provided predictions of $\alpha(x)$ and obtains comparable performance for $\sigma$ up to $0.5$. For the extreme setting of $\sigma=1$, the provided predictions are essentially arbitrary and SLaM diverges.
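For concreteness, here is a short numpy sketch of the perturbation just described; the Celeb-A ablation table follows the sketch. The function and variable names are illustrative rather than the exact code we used.

```python
import numpy as np

def perturb_oracle(alpha_star, k_star, sigma, ell, num_classes, rng):
    """Contaminate the oracle estimates as described above:
    alpha(x) = clip(alpha*(x) + (1 - 2*alpha*(x)) * xi, 0, 1),  xi ~ U[0, sigma]
    k(x)     = clip(k*(x) + Z, 0, num_classes),  Z uniform in {-ell, ..., ell}."""
    xi = rng.uniform(0.0, sigma, size=np.shape(alpha_star))
    alpha_noisy = np.clip(alpha_star + (1.0 - 2.0 * alpha_star) * xi, 0.0, 1.0)
    z = rng.integers(-ell, ell + 1, size=np.shape(k_star))
    k_noisy = np.clip(k_star + z, 0, num_classes)
    return alpha_noisy, k_noisy

# Toy usage with CIFAR100-style oracle values for a batch of 8 examples.
rng = np.random.default_rng(0)
alpha_star = rng.integers(0, 2, size=8).astype(float)  # 1 iff the teacher's top-1 prediction is correct
k_star = rng.integers(1, 11, size=8)                   # smallest k whose top-k contains the true label
alpha_noisy, k_noisy = perturb_oracle(alpha_star, k_star, sigma=0.2, ell=10,
                                      num_classes=100, rng=rng)
print(alpha_noisy)
print(k_noisy)
```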
| Celeb-A (noise added to oracle $\alpha$) | 10000 | 12500 | 15000 | 17500 | 25000 | 30000 |
|------------------------|----------|----------|----------|----------|----------|----------|
| $\sigma$ = 0 | $96.85 \pm 0.04$ | $96.83 \pm 0.04$ | $96.97 \pm 0.03$ | $96.97 \pm 0.08$ | $96.94 \pm 0.02$ | $97.02 \pm 0.11$ |
| $\sigma$ = 0.1 | $96.91 \pm 0.03$ | $96.86 \pm 0.09$ | $96.94 \pm 0.03$ | $96.94 \pm 0.1$ | $96.98 \pm 0.06$ | $97.00 \pm 0.01$ |
| $\sigma$ = 0.3 | $96.84 \pm 0.07$ | $96.77 \pm 0.04$ | $96.94 \pm 0.08$ | $96.92 \pm 0.1$ | $96.87 \pm 0.05$ | $96.93 \pm 0.07$ |
| $\sigma$ = 0.5 | $96.66 \pm 0.1$ | $96.77 \pm 0.11$ | $96.78 \pm 0.06$ | $96.75 \pm 0.06$ | $96.77 \pm 0.04$ | $96.83 \pm 0.1$ |
| $\sigma$ = 1 | Diverges | Diverges | Diverges | Diverges | Diverges | Diverges |

We also tested the robustness of SLaM on CIFAR100 with noise added to both $\alpha(x)$ and $k(x)$. We observe that the accuracy of SLaM again decays rather gracefully as the predictions for $\alpha(x)$ and $k(x)$ become noisier (the bottom-right corner is the noisiest setting, $\sigma = 0.5, \ell=90$, and the top-left is the noiseless setting, $\sigma=0, \ell=0$). We will include these ablations in our manuscript.

| CIFAR100 10000 Labeled Examples | $\ell$ = 0 | $\ell$ = 5 | $\ell$ = 10 | $\ell$ = 50 | $\ell$ = 90 |
|--------------------------------------|-------------|-------------|--------------|--------------|--------------|
| $\sigma$ = 0 | $61.69 \pm 0.3$ | $59.71 \pm 0.49$ | $59.7 \pm 0.55$ | $59.63 \pm 0.44$ | $59.4 \pm 0.65$ |
| $\sigma$ = 0.1 | $60.04 \pm 0.29$ | $59.55 \pm 0.33$ | $59.66 \pm 0.35$ | $59.79 \pm 0.2$ | $60.03 \pm 0.41$ |
| $\sigma$ = 0.2 | $59.39 \pm 0.1$ | $59.23 \pm 0.25$ | $59.5 \pm 0.39$ | $59.16 \pm 0.29$ | $59.31 \pm 0.31$ |
| $\sigma$ = 0.5 | $57.71 \pm 0.15$ | $57.6 \pm 0.23$ | $57.56 \pm 0.28$ | $57.46 \pm 0.25$ | $57.29 \pm 0.21$ |

> ... why exactly isotonic regression is chosen ... An ablation study as described before could be used to support this decision.

We chose isotonic regression to enforce monotonicity in the learned accuracy estimates $\hat{\alpha}(x)$, based on the empirical observation that $\alpha(x)$ is often approximately monotone as a function of the teacher's margin. Moreover, the lower threshold (denoted by lb in Section B.1 of the Supplementary Material) in isotonic regression gives us a way to control how "aggressive" the mixing operation is going to be. We agree with the reviewer that SLaM does not hinge on any particular regression method, and other methods can be used. We ran SLaM with k-nearest-neighbors (kNN) regression and with logistic regression and found that isotonic regression typically outperforms logistic regression, and that its hyperparameter is easier to tune than the number of neighbors in kNN. That said, SLaM is agnostic to the method used for estimating $\alpha(x)$ and $k(x)$, and more sophisticated methods (such as conformal prediction) could also be used. We will include this discussion and the ablations below in our updated manuscript.
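To illustrate that the estimator behind $\hat{\alpha}(x)$ is a plug-in component (the isotonic variant was sketched in our response to Reviewer E5Wk above), here is a minimal illustrative sketch of the kNN and logistic drop-in alternatives; the ablation tables comparing the three choices follow the sketch. The interface, names, and defaults shown are illustrative, not the exact pipeline used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor

def fit_alpha_regressor(val_margins, val_correct, method="knn", **kwargs):
    """Fit alpha_hat(margin) with an interchangeable 1-D regressor; the rest of
    SLaM only needs a callable mapping margins -> estimated accuracies in [0, 1]."""
    x = np.asarray(val_margins, dtype=float).reshape(-1, 1)
    y = np.asarray(val_correct, dtype=float)
    if method == "knn":
        reg = KNeighborsRegressor(n_neighbors=kwargs.get("n_neighbors", 20))
        reg.fit(x, y)              # averages the 0/1 correctness of nearby margins
        return lambda m: np.clip(reg.predict(np.asarray(m, dtype=float).reshape(-1, 1)), 0.0, 1.0)
    if method == "logistic":
        reg = LogisticRegression()
        reg.fit(x, y.astype(int))  # P[teacher correct | margin] via predict_proba
        return lambda m: reg.predict_proba(np.asarray(m, dtype=float).reshape(-1, 1))[:, 1]
    raise ValueError(f"unknown method: {method}")

# Toy usage: swap the estimator without touching the rest of the training loop.
rng = np.random.default_rng(0)
val_margins = rng.uniform(0.0, 1.0, size=300)
val_correct = rng.binomial(1, 0.5 + 0.5 * val_margins)
alpha_knn = fit_alpha_regressor(val_margins, val_correct, method="knn", n_neighbors=20)
alpha_logistic = fit_alpha_regressor(val_margins, val_correct, method="logistic")
print(alpha_knn([0.1, 0.5, 0.9]), alpha_logistic([0.1, 0.5, 0.9]))
```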
| CIFAR-10 | 10000 | 15000 | 20000 | 25000 | 30000 | 35000 |
|---------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| kNN k=10 | $67.1 \pm 0.15$ | $70.56 \pm 0.21$ | $74.6 \pm 0.11$ | $76.68 \pm 0.17$ | $78 \pm 0.16$ | $79.26 \pm 0.12$ |
| kNN k=20 | $67.5 \pm 0.15$ | $71.09 \pm 0.19$ | $74.47 \pm 0.15$ | $77.03 \pm 0.1$ | $78.03 \pm 0.17$ | $79.21 \pm 0.13$ |
| kNN k=30 | $67.51 \pm 0.11$ | $71.27 \pm 0.13$ | $74.66 \pm 0.12$ | $77.03 \pm 0.13$ | $78.07 \pm 0.2$ | $79.2 \pm 0.18$ |
| kNN k=40 | $67.64 \pm 0.21$ | $71.08 \pm 0.22$ | $74.5 \pm 0.09$ | $76.64 \pm 0.12$ | $77.92 \pm 0.11$ | $79.41 \pm 0.22$ |
| Logistic | $65.26 \pm 0.05$ | $68.85 \pm 0.08$ | $73.35 \pm 0.12$ | $76.17 \pm 0.15$ | $76.87 \pm 0.25$ | $78.76 \pm 0.35$ |
| Isotonic lb=0.5 | $66.82 \pm 0.61$ | $72.61 \pm 0.30$ | $75.01 \pm 0.25$ | $75.72 \pm 0.17$ | $78.04 \pm 0.16$ | $79.22 \pm 0.11$ |

| CIFAR-100 | 10000 | 15000 | 20000 | 25000 | 30000 | 35000 |
|---------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| kNN k=10 | $40.75 \pm 0.02$ | $49.07 \pm 0.15$ | $54.86 \pm 0.11$ | $57.87 \pm 0.17$ | $61.9 \pm 0.2$ | $63.06 \pm 0.22$ |
| kNN k=20 | $41.03 \pm 0.05$ | $49.19 \pm 0.12$ | $54.9 \pm 0.1$ | $57.85 \pm 0.18$ | $61.76 \pm 0.2$ | $63.45 \pm 0.16$ |
| kNN k=30 | $41.04 \pm 0.07$ | $49.55 \pm 0.13$ | $55.14 \pm 0.15$ | $57.96 \pm 0.21$ | $61.9 \pm 0.19$ | $63.2 \pm 0.23$ |
| kNN k=40 | $41.23 \pm 0.03$ | $49.76 \pm 0.15$ | $54.78 \pm 0.1$ | $58.15 \pm 0.17$ | $61.98 \pm 0.21$ | $63.47 \pm 0.2$ |
| Logistic | $39.68 \pm 0.03$ | $48.17 \pm 0.1$ | $53.56 \pm 0.11$ | $57.45 \pm 0.09$ | $61.77 \pm 0.19$ | $63.24 \pm 0.18$ |
| Isotonic lb=0.5 | $42.72 \pm 0.30$ | $49.89 \pm 0.23$ | $54.73 \pm 0.27$ | $58.78 \pm 0.15$ | $61.30 \pm 0.09$ | $63.98 \pm 0.19$ |

## Response to Reviewer Un32

> ... but it also seems hard to guarantee the proposed method can get the pseudo-labels without noisy. Can the proposed method ensure that the pseudo-labels generated by SLaM are noise-free, thereby addressing the limitation discussed in the paper? How and why?

**Response**

SLaM effectively improves the quality (reduces the noise) of the pseudo-labels produced by the student. We remark that we "add" noise to the output of the student **only during training**; during testing, no noise is added to the student's pseudo-labels. While we cannot guarantee that the pseudo-labels generated by a student model trained with SLaM will be perfect (no method can guarantee that), we argue that student models trained with SLaM produce better predictions than the available baselines.

To show that SLaM effectively reduces the noise of the produced pseudo-labels, we perform a self-distillation experiment. We first train a ResNet56 (the teacher) on the small labeled Dataset A (its size is shown in the first row of the table below) and then use SLaM to train another ResNet56. Observe that SLaM produces a student ResNet56 that is more accurate than the teacher ResNet56 for all dataset sizes. Therefore, in this self-distillation experiment, the pseudo-labels produced by the student trained with SLaM are **less noisy** than the labels provided initially by the teacher.
| CIFAR100 (ResNet56 -> ResNet56) | 1% | 5% | 10% | 15% | 20% | 25% | 30% | 35% | 50% | 60% |
|--------------------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| Teacher | 8.78 | 21.06 | 34.15 | 42.72 | 47.9 | 53.47 | 56.41 | 58.73 | 64.4 | 66.35 | 69.18 |
| Vanilla | $9.36 \pm 0.03$ | $22.42 \pm 0.11$ | $36.34 \pm 0.30$ | $44.93 \pm 0.34$ | $50.74 \pm 0.36$ | $55.8 \pm 0.11$ | $58.72 \pm 0.33$ | $61.24 \pm 0.15$ | $66.43 \pm 0.04$ | $68.4 \pm 0.02$ |
| Taylor-CE | $9.36 \pm 0.05$ | $23.58 \pm 0.15$ | $38.41 \pm 0.19$ | $46.45 \pm 0.09$ | $52.53 \pm 0.11$ | $56.67 \pm 0.31$ | $59.79 \pm 0.23$ | $61.98 \pm 0.31$ | $66.53 \pm 0.35$ | $68.43 \pm 0.30$ |
| SLaM | $\mathbf{10.52 \pm 0.4}$ | $\mathbf{25.66 \pm 0.2}$ | $\mathbf{39.72 \pm 0.3}$ | $\mathbf{48.46 \pm 0.4}$ | $\mathbf{53.53 \pm 0.15}$ | $\mathbf{57.38 \pm 0.31}$ | $\mathbf{60.64 \pm 0.13}$ | $\mathbf{62.27 \pm 0.32}$ | $\mathbf{66.60 \pm 0.15}$ | $\mathbf{68.47 \pm 0.24}$ |

**Reviewer:** *Although not specifically designed for knowledge distillation, the baseline method Taylor-CE shows competitive or even superior performance on celebA even with only 2% labeled data. Can you elaborate on the reasoning behind this, especially highlighting the advantages of the proposed method in dealing with sparsely labeled data, which Taylor-CE lacks?*

* Taylor-CE outperforms SLaM only in the $2\%$ case of the Celeb-A dataset. In contrast, SLaM outperforms Taylor-CE in all other experiments, and in some cases by a large margin (see, e.g., the 5% case of **ImageNet in Table 4, where SLaM outperforms Taylor-CE by ~6.5%**, and the CIFAR100 ablation of the previous answer).
<!-- * It is not surprising that Taylor-CE has good performance for distillation with unlabeled examples because the teacher's predictions are noisy (i.e., often incorrect) and Taylor-CE was designed to learn from noisy labels. -->
* SLaM almost always outperforms Taylor-CE because the noise in the teacher's labels is structured (neither random nor adversarial), and SLaM exploits this structure while Taylor-CE does not (it was designed for generic noisy labels).
* SLaM can be combined with different loss functions (we used the cross-entropy in our experiments because it is the most commonly used one). In fact, in Section D.6 and Figure 7 of the supplementary material we show that SLaM can be used with the Taylor-CE loss function and improve its performance.

> It is better to plot a figure to demonstrate the performance of the comparison method in different percents of the labeled training data with a large span. Eg: 5%, 10%, 30%, 50%, 70%.

* In the following experiments we included smaller labeled training sets (1%, 5%) for CIFAR10/100. We observe that SLaM consistently outperforms the baselines. For CIFAR10 with 1% labeled data, SLaM achieves an improvement of over 10% over the other baselines.
* For larger dataset sizes, as seen in the plots of our manuscript, all methods converge to roughly the same performance (the setting becomes more similar to standard knowledge distillation, where the teacher model has access to the full labeled dataset). See also our ablation in the previous question.
* Experiments in which the number of available labeled examples is small are of greater importance, since this is the typical scenario where "distillation with unlabeled examples" applies.
  (See also the "Performance Gains of SLaM as a Function of the Number of Labeled Examples" section -- bottom of page 8 of our manuscript -- for more details, as well as the additional experiments we performed to answer the remark of Reviewer 7BQA.)

| CIFAR10, other dataset sizes | 1% | 5% |
|------------------------------|----------------|----------------|
| Teacher | $10.07$ | $51.67$ |
| Vanilla | $11.75 \pm 0.1$ | $54.2 \pm 0.16$ |
| Taylor-CE | $10.00 \pm 0.01$ | $55.14 \pm 0.28$ |
| UPS | $12.74 \pm 0.94$ | $56.21 \pm 0.23$ |
| VID | $13.25 \pm 0.26$ | $54.32 \pm 0.05$ |
| Weighted | $12.67 \pm 0.01$ | $54.58 \pm 0.1$ |
| SLaM | $\mathbf{26.73 \pm 0.02}$ | $\mathbf{57.40 \pm 0.05}$ |

| CIFAR100, other dataset sizes | 1% | 5% |
|-------------------------------|----------------|----------------|
| Teacher | $9.45$ | $23.61$ |
| Vanilla | $10.08 \pm 0.06$ | $25.15 \pm 0.11$ |
| Taylor-CE | $9.79 \pm 0.13$ | $26.14 \pm 0.39$ |
| UPS | $10.40 \pm 0.05$ | $26.41 \pm 0.13$ |
| VID | $10.02 \pm 0.13$ | $24.93 \pm 0.19$ |
| Weighted | $10.07 \pm 0.06$ | $25.36 \pm 0.14$ |
| SLaM | $\mathbf{10.87 \pm 0.07}$ | $\mathbf{27.76 \pm 0.19}$ |

> I wonder if the proposed method can work in noise label learning that also need reliable pseudo-labels? For example, on clothing1m dataset.

SLaM is related to forward methods for dealing with label noise (see our Related Work section for more details). However, the fundamental difference between our setting and the general learning-under-label-noise setting is that the noise introduced by the teacher is structured, and this is a crucial observation we utilize in our design. Specifically, our approach is tailored to the structure of the distillation-with-unlabeled-examples setting: we exploit (i) that we have access to confidence metrics for the teacher's predictions, and (ii) that, often, when the teacher's top-$1$ prediction is inaccurate, the true label is within its top-$k$ predictions for some appropriate $k$, in order to design and estimate a much more refined model of the teacher's noise, which in turn informs the design of the student's loss function. Given the above, SLaM cannot be directly applied to general learning-from-noisy-labels tasks, but it is interesting to investigate whether adaptations of it could be helpful in those settings.

## Response to Reviewer 7BQA

> All empirical results presented are based on "[...] the best test accuracy over all epochs... I would strongly suggest reporting something like the test performance at the last epoch to avoid overfitting the test set. ... completely independent test sets would be included.

We agree with the reviewer that reporting the last-epoch accuracy is a better metric and are willing to change our tables to report it. We used the best test accuracy over all epochs to be consistent with the way the baselines of prior works were evaluated. In the following table we report the final-epoch accuracy of all methods on CIFAR100. We observe that SLaM again consistently outperforms the baselines.
| Final-epoch experiments CIFAR-100 | 5000 | 7500 | 10000 | 12500 | 15000 | 17500 |
|-----------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Vanilla | $37.97 \pm 0.1$ | $46.37 \pm 0.12$ | $51.9 \pm 0.13$ | $57.85 \pm 0.17$ | $60.86 \pm 0.18$ | $63.56 \pm 0.35$ |
| Taylor-CE | $40.12 \pm 0.03$ | $47.99 \pm 0.21$ | $54.18 \pm 0.21$ | $57.76 \pm 0.06$ | $61.22 \pm 0.11$ | $63.50 \pm 0.20$ |
| UPS | $39.47 \pm 0.18$ | $48.29 \pm 0.20$ | $53.27 \pm 0.26$ | $57.83 \pm 0.29$ | $61.16 \pm 0.13$ | $62.58 \pm 0.28$ |
| VID | $37.76 \pm 0.13$ | $46.3 \pm 0.18$ | $52.07 \pm 0.12$ | $57.91 \pm 0.31$ | $60.57 \pm 0.21$ | $63.43 \pm 0.18$ |
| Weighted | $38.39 \pm 0.07$ | $46.91 \pm 0.08$ | $52.48 \pm 0.13$ | $57.7 \pm 0.15$ | $\mathbf{61.32 \pm 0.31}$ | $63.66 \pm 0.46$ |
| SLaM | $\mathbf{41.16 \pm 0.16}$ | $\mathbf{49.11 \pm 0.17}$ | $\mathbf{54.31 \pm 0.19}$ | $\mathbf{58.59 \pm 0.24}$ | $\mathbf{61.30 \pm 0.26}$ | $\mathbf{63.72 \pm 0.43}$ |

> In e.g. L230 SLaM is reported to "[...] consistently outperform the baselines, often by a large margin". I believe this is too bold a claim. In particular, only 7 of 24 experiments yield improvements over 1 compared to the second-best baseline, while 8 other of 24 have improvements < 0.2 of which 3 have improvements < 0.02. Furthermore, there are 2 of 24 experiments where the method does not improve over the baselines. All of these observations are without consideration of the variance, which likely makes some additional improvements statistically indistinguishable. I would suggest reducing the claims slightly, especially given the evaluation protocol.

We are certainly open to changing the exact phrasing. However, it is worth mentioning that not all experiments are of the same "importance":

* Experiments in which the number of available labeled examples is small are of greater importance, since this is the typical scenario where "distillation with unlabeled examples" applies. In these cases, our method indeed often outperforms the other methods by a large margin. (See also the "Performance Gains of SLaM as a Function of the Number of Labeled Examples" section for more details -- bottom of page 8 of our manuscript -- and the additional experiments we performed to answer the remark of Reviewer Un32.)
* Our method significantly outperforms the other baselines on ImageNet, which is arguably the most "difficult" dataset. As prominent examples: when given labels for only 5% of the ImageNet examples, our method is better by **more than 6%** than every other baseline; when given labels for only 10% of ImageNet, our method is better by **more than 3%** than every other baseline.

Given the above, we are willing to change the phrasing to "[...] often by a large margin in the important cases where (i) only few labeled examples are available, and (ii) one deals with large-scale problems with many classes, like the ImageNet dataset". We are also happy to adopt any specific suggestion by the reviewers.

> It is unclear why this method shouldn't be considered a semi-supervised learning method, as both labeled and unlabeled samples are provided and used for the distillation procedure. In fact, the availability of a labeled validations set, V appears to be a necessity for this method, and I believe it should be presented as a semi-supervised setting more explicitly.
* We definitely agree that distillation with unlabeled examples and semi-supervised learning have a lot in common -- see the "Semi-Supervised Learning" paragraph of the Related Work section, where we explain the similarities and differences. It is also true that our method could potentially be applied in a purely semi-supervised learning setting and, moreover, potentially be combined with other semi-supervised learning techniques -- especially given the performance of our method on ImageNet (a typically difficult dataset for SOTA semi-supervised learning techniques).
* However, we believe that distillation with unlabeled examples is an important problem in its own right (given the popularity of the approach in practical applications), and our goal in this paper is to focus on it and, in particular, to study ways to deal with the "teacher's noise" both practically and theoretically.
* We are open to any suggestions by the reviewers for highlighting the connection to semi-supervised learning more explicitly.

> A lot of results and details are deferred to the appendix. While I fully understand the limitations a 9-page restriction can put on a paper, I believe some of the space in the paper could be better utilized to present at least some key results from the appendix.

This is certainly true -- and we indeed had a hard time deciding which results to defer to the appendix. We promise to make an effort to move more results from the appendix into the main body of the paper, and we are open to any suggestions from the reviewers about what they think we should include.
