# NeurIPS Rebuttal
---
## Reviewer1
WA(6), Confidence(4)
We deeply appreciate the reviewer’s valuable comments and thoughtful concerns. We hope that these concerns can be resolved through our clarifications in this rebuttal.
`Q1.Is it necessary to train the MQ-Net in each round? What if we train it in the first round (or first several rounds) and fix it in the remaining round?`
The goal of MQ-Net is to adaptively find the best trade-off between purity and informativeness throughout the entire AL period, since the optimal balance varies with the learning stage of the target classifier. Thus, if we fix MQ-Net after the first round (or the first several rounds), it no longer adjusts this trade-off, leading to a suboptimal result. For example, if MQ-Net emphasizes purity over informativeness in the first round and sticks to that policy, many informative examples would be ignored in query selection at later AL rounds.
`Q2. The performance comparison between MQ-Net and other simple alternatives is not discussed. How do the MQ-Net's performances differ from those simple alternatives?`
This is a very good point. As shown in Figure 4, the best trade-off between purity and informativeness differs by the learning stage and the open-set noise ratio. If we were to use heuristic rules as simple alternatives, we would have to search for the best rules every time we train a new classifier on a new dataset. In contrast, as shown in Figure 3, MQ-Net successfully finds the best trade-off throughout the learning stages with varying OOD ratios by leveraging our meta-learning framework. This flexibility of MQ-Net is a clear advantage over the simple alternatives.
`Q3. The architecture of MQ-Net is not clearly stated. The activation functions are not reported, and the layer number is only reported in the appendix.`
We thank the reviewer for pointing out this important issue. We used a shallow MLP architecture with two layers, a hidden dimension size of 64, and the ReLU activation function. We clarified these details of the architecture in Section 5.1 with the R1Q3 mark.
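For concreteness, below is a minimal PyTorch sketch of the described architecture. The class name `MQNet` and the 2-dimensional meta-input layout $\langle P(x), I(x) \rangle$ are illustrative and may differ from our released code, and the sketch omits any mechanism used to satisfy the skyline constraint (Equation 3).

```python
import torch.nn as nn

class MQNet(nn.Module):
    """Shallow two-layer MLP mapping the meta-input <P(x), I(x)> to a scalar meta-score."""
    def __init__(self, input_dim: int = 2, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),  # layer 1
            nn.ReLU(),                         # ReLU activation
            nn.Linear(hidden_dim, 1),          # layer 2 -> meta-score
        )

    def forward(self, z):
        # z: (batch, 2) tensor of normalized [purity, informativeness] scores
        return self.net(z)
```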
`Q4. The multi-score version is intuitive by adding more input dimensions to MQ-Net. Furthermore, when more scores are used, there would be a score selection problem in MQ-Net. The solution to that problem will increase the quality of the paper.`
Thank you very much for helping us improve our paper. Though using multiple (more than two) scores is a very interesting topic, it seems to be beyond the scope of this paper. We leave this topic for potential future work.
---
## Reviewer2
R(3), Confidence(4)
We deeply appreciate the reviewer’s valuable comments and reasonable concerns. We hope that they can be resolved through our clarifications in this rebuttal.
#### Major Concerns (Q1~3).
`Q1. About problem definition: There is no need to create a new concept and call it an open-set noise problem. This concept is wider. For instance, some instances belong to ID but contain noise in x. It is not OOD data but it still contains noise.`
Yes, we deal with the IN/OOD setting, the same as in CCAL and SIMILAR. In fact, “open-set noise” is frequently used as a **synonym** for “out-of-distribution (OOD)” data in the machine learning literature on open-set recognition [Salehi et al., 2021], OOD detection [Yang et al., 2021], open-set noisy-label handling [Wei et al., 2021, Wang et al., 2018], and open-set semi-supervised learning [Saito et al., 2021, Yu et al., 2020, Huang et al., 2021]. In particular, [Saito et al., 2021, Yu et al., 2020, Huang et al., 2021], in which OOD examples are mixed with **clean** IN examples, use the term “**open-set** semi-supervised learning” just as our paper does. When noise in IN examples is addressed as well, it is common to specify closed-set noise and open-set noise together (e.g., see [Sachdeva et al., 2021]), which is beyond the scope of this paper. Overall, we have not introduced a new, wider concept or setting. Following your advice, we will clarify that closed-set noise is not considered in our method.
[Salehi et al., 2021] "A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges.", arXiv preprint arXiv:2110.14051, 2021.
[Yang et al., 2021] "Generalized out-of-distribution detection: A survey," arXiv preprint arXiv:2110.11334, 2021.
[Wei et al., 2021] "Open-set label noise can improve robustness against inherent label noise," In NeurIPS, 2021.
[Wang et al., 2018] "Iterative learning with open-set noisy labels," In CVPR, 2018.
[Saito et al., 2021] "Openmatch: Open-set semi-supervised learning with open-set consistency regularization," In NeurIPS, 2021.
[Yu et al., 2020] "Multi-task curriculum framework for open-set semi-supervised learning," In ECCV, 2020.
[Huang et al., 2021] "Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning," In ICCV, 2021.
[Sachdeva et al., 2021] "EvidentialMix: Learning with Combined Open-set and Closed-set Noisy Labels," In WACV, 2021.
`Q2-1. About motivation of the purity-informativeness dilemma: Figure1 is not convincing enough since LL4AL and CCAL are both non-typical methods and they are not comparable. Also, the example only shows the first 10 rounds and shows low-noise (10% and 20 % OOD rate) cases.`
LL is indeed a representative HI-focused method in that it uses only the predicted loss as the informativeness score, with no purity score combined. CCAL is likewise a representative HP-focused method since it carefully incorporates the purity score by using CSI. Thus, they are comparable from the perspective of showing the dominance between HI-focused and HP-focused methods throughout the learning stages (i.e., AL rounds). Our 10-round experiment is a fairly typical setting in the AL literature [Yoo et al., Moon et al.]. For a higher OOD rate (e.g., 30%), a similar trend was observed, with the crossover point appearing at a later round. Following your suggestion, we replaced Figure 1(c) with the plot for a 30% OOD rate (see the revised paper).
[Yoo et al.] "Learning loss for active learning.", In CVPR, 2019.
[Moon et al.] "Confidence-aware learning for deep neural networks.", In ICML, 2020.
`Q2-2. About motivation of the purity-informativeness dilemma: Why can’t we maintain purity all the time and at the same time acquire high informativeness? If there is an ideal method, the proportion of OOD samples in the unlabeled data pool will naturally be higher and higher, and more attention should be attached to purity.`
Usually, the purity score favors examples for which the model exhibits high confidence (i.e., certain in the model's prediction), while the informativeness score favors examples for which the model exhibits low confidence (i.e., uncertain in the model's prediction). That is, an opposite trend between the two scores is natural, e.g., if an example shows a high purity score, then its informativeness score is likely to be low. Therefore, it is very difficult to achieve high purity and high informativeness all the time.
Favoring high purity over the AL rounds is not challenging, because we can select a query set with high purity by including only easy examples from the unlabeled set. However, at later rounds, this strategy will not yield a significant gain in model performance due to the low informativeness of the selected set. We empirically observed that, as the model performance increases, ‘fewer but highly informative’ examples are more impactful than ‘more but less informative’ examples in terms of improving the model performance. Therefore, it is necessary to emphasize informativeness in later AL rounds, even at the risk of choosing OOD examples.
Regarding the second question, the size of the unlabeled set is assumed to be considerably larger than those of the query set and the labeled set, e.g., vast amounts of unlabeled images collected by web crawling. Then, even with an ideal method, the proportion of OOD examples in the unlabeled pool would increase only slightly throughout the AL rounds.
`Q3-1. About experiment results: There is no error bar. The author didn't provide the code.`
We added error bars and standard deviations in Figure 3 and Table 8 in the revised version. We apologize for omitting our source code; it is now available at [the link](https://anonymous.4open.science/r/MQNet-43E6/) (updated in the revised paper with the R2Q3 mark). For implementing SIMILAR and CCAL, we used the source code available at their official GitHub links (SIMILAR: https://github.com/decile-team/distil and CCAL: https://github.com/RUC-DWBI-ML/CCAL).
`Q3-2. About experiment results: The experimental results on CCAL and SIMILAR are strange on low noise situations (10% and 20% OOD rate), which are even worse than typical uncertainty-based sampling strategy (e.g., CONF). SIMILAR authors said "If there is low-noise then it should only be less challenging and the performance should at least be consistent and better than MARGIN." CCAL author showed me their new experiments on low-noise data scenarios, also better than typical uncertainty-based measures. Is it a fair comparison?`
We appreciate the reviewer’s careful comments and answer the questions from two perspectives.
(1) We would like to clarify that the low performance of CCAL is also reported in its original paper [Du et al., 2021]: see the left-most plot (20% OOD rate) of Figure 4 for CIFAR-10, where the accuracy of CCAL is lower than that of several baselines. For your convenience, here is [the link](https://bit.ly/3cXQ3Cm) to Figure 4. For SIMILAR, a low OOD rate was not considered in its original paper [Kothawade et al., 2021]. We are not aware of the new experimental results that the reviewer received from the CCAL authors, and we are not sure whether such unpublished, private communication can be considered in the review process.
[Kothawade et al., 2021] "Similar: Submodular information measures based active learning in realistic scenarios," In NeurIPS, 2021.
[Du et al., 2021] "Contrastive coding for active learning under class distribution mismatch," In ICCV, 2021.
(2) Moreover, in the initial phase of our work, we had thought that a low OOD rate would be less challenging, just as the SIMILAR authors thought. However, it turned out that our initial thought was wrong for the following reason.
In the low-noise case, standard AL methods such as CONF and MARGIN can query many IN examples **even without careful consideration of purity**. As shown in Table R1, with 10% noise, the ratio of IN examples in the query set reaches 75.2% at the last AL round for CONF. This number is fairly similar to 88.4% and 90.2% for CCAL and SIMILAR, respectively. In contrast, the difference between CONF and CCAL or SIMILAR becomes much larger (i.e., from 16.2% to 41.8% or 67.8%) in the high-noise case (60% noise in Table R2). Therefore, especially in the low-noise case, the two purity-based methods, SIMILAR and CCAL, have the potential risk of **overly selecting less-informative IN examples** for which the model already shows high confidence, leading to lower generalization performance than the standard AL methods.
Overall, putting these two facts together, we believe that the low performance of CCAL and SIMILAR at a low OOD rate is not strange, and we hope that our analysis is persuasive.
Table R1: Accuracy and ratio of IN examples in the query set for our split-dataset experiment on CIFAR10 with open-set noise of **10%**, where $\% Q_{in}$ means % of IN examples in a query set.
| Method | Round |1|2|3|4|5|6|7|8|9|10|
|:-----:|:----------:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| CONF | ACC |62.3|74.8|80.8|84.5|86.8|89.0|90.6|91.5|92.4|92.8|
| | $\% Q_{in}$|87.6|82.2|80.8|79.0|75.2|76.2|74.0|74.6|74.0|**75.2**|
| CCAL | ACC |61.2|71.8|78.2|82.3|85.0|87.0|88.2|89.2|89.8|90.6|
| | $\% Q_{in}$|89.0|88.4|89.2|88.6|89.6|88.8|90.4|88.0|88.6|**88.4**|
|SIMILAR| ACC |63.5|73.5|77.9|81.5|84.0|86.3|87.6|88.5|89.2|89.9|
| | $\% Q_{in}$|91.4|91.0|91.6|92.6|92.6|91.4|92.2|90.6|90.8|**90.2**|
Table R2: Accuracy and ratio of IN examples in the query set for our split-dataset experiment on CIFAR10 with open-set noise of **60%**, where $\% Q_{in}$ means % of IN examples in a query set.
| Method | Round |1|2|3|4|5|6|7|8|9|10|
|:-----:|:----------:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| CONF | ACC |56.1|65.2|69.6|73.6|76.3|80.3|81.6|83.7|84.9|85.4|
| | $\% Q_{in}$|37.4|32.2|28.2|25.4|25.6|20.0|20.8|17.0|18.0|**16.2**|
| CCAL | ACC |56.5|67.0|72.2|76.3|80.2|82.9|84.6|85.7|86.6|87.5|
| | $\% Q_{in}$|42.0|38.6|39.8|41.2|38.6|42.2|42.2|40.4|42.2|**41.8**|
|SIMILAR| ACC |57.6|67.6|72.0|75.7|79.7|82.2|84.2|85.9|86.8|87.4|
| | $\% Q_{in}$|56.0|61.0|67.2|66.6|67.4|67.2|68.0|67.1|68.2|**67.8**|
#### Minor Questions (Q4~11).
`Q4. In lines 63-64, learning is for updating the META model to better output Φ(<P(x),I(x)>; w), instead of using it to get better P(x) and I(x). It feels like it is just to train a classifier, but in deep learning tasks, the feature representation and classifier are jointly trained.`
Because $P(x)$ and $I(x)$ are obtained from the classifier, they improve as the training progresses (with more labeled examples). The meta-model $\Phi(\langle P(x),I(x) \rangle; w)$ is, of course, also improved to produce better prioritization. Thus, the score functions and the meta-model are improved together, exactly as you expect.
`Q5. In line 95, why choose a classifier-dependent approach to get a meta-model, what is the motivation?`
In fact, we didn’t choose a classifier-dependent approach. Line 95 is just the introduction of OSR methods in the related work section.
`Q6. In line 277-288, why did you use CSI and LL for purity and informativeness scores, respectively?`
As shown in Section 5.4, we used CONF and LL as the informativeness scores, and CSI and ReAct as the purity scores. We chose the combination of LL and CSI as the default setting of MQ-Net, since it shows good overall accuracy, as reported in Table 2. The other combinations also showed better accuracy than the baselines.
`Q7. Is MQ-Net jointly trained with the backbone classifier? or not like LL?`
MQ-Net is not trained jointly with the backbone classifier. Instead, the training procedure alternates between the classifier and MQ-Net; the details can be found in the algorithm pseudocode in Appendix B.
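For illustration only, here is a tiny, runnable toy of the alternating schedule; the data, losses, and model sizes are placeholders (a plain regression loss stands in for the paper's ranking-style meta-objective with the skyline constraint; the exact algorithm is in Appendix B).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a small classifier and a 2-layer MQ-Net-style MLP.
classifier = nn.Linear(32, 10)
mq_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
clf_opt = torch.optim.SGD(classifier.parameters(), lr=0.1)
mq_opt = torch.optim.SGD(mq_net.parameters(), lr=0.01)

for al_round in range(3):  # a few toy AL rounds
    # (1) Update the backbone classifier on the current (toy) labeled set.
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    clf_opt.zero_grad()
    F.cross_entropy(classifier(x), y).backward()
    clf_opt.step()

    # (2) With the classifier fixed, meta-train MQ-Net on a small meta-training set of
    #     <purity, informativeness> pairs (random here; the true targets come from the
    #     newly labeled queries, and the true objective is Equation 3).
    z, target = torch.randn(16, 2), torch.randn(16, 1)
    mq_opt.zero_grad()
    F.mse_loss(mq_net(z), target).backward()
    mq_opt.step()
```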
`Q8. Equation 3 looks similar to find pareto fronts. Could the author provide some discussions about the situation that the size of a pareto front set is less than the batch size in the active learning process?`
This is a good point. Equation 3 is similar to finding Pareto fronts, but it is slightly different. The Pareto front is the set of examples that are not dominated in both purity and informativeness by any other example (see [Liu et al., 2015] for details). In contrast, the skyline constraint in Equation 3 only ensures that the output score of MQ-Net satisfies $\Phi(z_{x_i}) > \Phi(z_{x_j})$ if $P(x_i)>P(x_j)$ **AND** $I(x_i)>I(x_j)$ for all $i$, $j$. That is, the Pareto front does not necessarily coincide with the set of examples with the highest MQ-Net scores. Thus, regardless of the size of the Pareto front, MQ-Net simply queries examples in the order of their meta-scores within the budget (i.e., batch size); a small illustrative sketch follows the reference below.
[Liu et al., 2015] "Finding pareto optimal groups: Group-based skyline." In VLDB, 2015.
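To make the distinction concrete, here is a small self-contained sketch (names and numbers are ours, purely for illustration): it checks the skyline condition on a toy set of score pairs and then forms the query set by ranking meta-scores within the budget, independently of how large the Pareto front is.

```python
import numpy as np

def satisfies_skyline(P, I, phi):
    """Check that phi[i] > phi[j] whenever P[i] > P[j] AND I[i] > I[j] (skyline constraint)."""
    for i in range(len(P)):
        for j in range(len(P)):
            if P[i] > P[j] and I[i] > I[j] and not (phi[i] > phi[j]):
                return False
    return True

def select_query(phi, budget):
    """Query the top-`budget` examples by meta-score, regardless of the Pareto-front size."""
    return np.argsort(-phi)[:budget]

# Toy example: 5 unlabeled examples with (purity, informativeness) scores.
P   = np.array([0.9, 0.8, 0.2, 0.6, 0.4])
I   = np.array([0.1, 0.7, 0.9, 0.5, 0.3])
phi = P + I          # a trivially skyline-consistent meta-score (illustrative only)

print(satisfies_skyline(P, I, phi))   # True
print(select_query(phi, budget=3))    # indices of the 3 highest meta-scores: [1 2 3]
```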
`Q9. Is ResNet18 pre-trained?`
No, we did not use any pre-trained networks.
`Q10. Did the author conduct repeat trials per experiment?`
Yes, of course. We clarified it in Lines 294-295. We also added error bars in the revised paper.
`Q11. In line 284-285, the author already defined the cost of querying OOD data samples. Why is it not an evaluation metric in later experimental result analysis?`
Thank you very much for helping us improve our paper. According to the reviewer’s suggestion, we conducted additional experiments and analyzed the effect of different costs for querying OOD examples in Appendix F in the revised version. Table 7 summarizes the performance change with four different labeling costs (i.e., 0.5, 1, 2, and 4) for the split-dataset setup on CIFAR10 with an open-set ratio of 40%. Overall, MQ-Net consistently outperformed the four baselines regardless of the labeling costs. Meanwhile, CCAL and SIMILAR were more robust to the higher labeling cost than CONF and CORESET. This is because CCAL and SIMILAR, which favor high purity examples, query more in-distribution examples than CONF and CORESET, so they are less affected by labeling costs, especially when cost $\tilde{c}$ is high.
---
## Reviewer3
A(7), Confidence(2)
We deeply appreciate the reviewer’s constructive comments and positive feedback on our manuscript.
`Q1. Computational efficiency of MQ-Net?`
This is a good point. MQ-Net requires one additional meta-training phase at every AL round. However, it is not very expensive, because MQ-Net uses a very light MLP architecture and the meta-training set is small. For example, the size of the meta-training set is only 1% of the labeled+unlabeled set in our split CIFAR10/100 experiments.
`Q2. A running example of what the OOD data and informative versus non-informative data looks like would be helpful.`
Thank you very much for your careful comment. As you mentioned, Figure 1(a) is intended to explain the purity-informativeness dilemma, but we couldn’t include the details due to lack of space. Let us explain the intention behind Figure 1(a). The task is to classify dogs and cats in a given image dataset. (1) The HP-LI subset includes trivial (easy) cases of dogs and cats. (2) The HP-HI subset includes moderate and hard cases of dogs and cats, e.g., properly labeled dog-like cats and cat-like dogs. (3) The LP-HI subset includes other similar animals (e.g., wolves and jaguars) that may share some features with dogs and cats. (4) The LP-LI subset includes other dissimilar animals. Overall, it is clear that HP-HI is the most preferable; however, it is NOT clear which of HP-LI and LP-HI is more preferable. This issue is what we define as the purity-informativeness dilemma. We will add this explanation in the supplementary material or on an external web page (e.g., a GitHub repository).
`Q3. Equation 1 is defined as the optimal query set approach, but is not mentioned otherwise in the paper. Also, the cost constraint in Equation 1 is used in MQ-Net but is not mentioned in that section.`
Equation 1 formalizes the open-set AL problem. Here, MQ-Net is used to derive a query set $S_Q$ in each round; more specifically, the examples with the highest meta-scores $\Phi(x; w)$ within the budget $b$ form the query set. We expect this query set to be very close to $S_Q^*$ in Equation 1. Section 4.1 focuses on the training of MQ-Net itself, so the overall procedure involving the budget was not included there. The overall procedure is clearly described in Appendix B (see the AL procedure pseudocode). We will improve the presentation so that Equation 1 and Section 4.1 are better connected.
`Q4. The intuitive interpretation of L(S_Q) was not clear. The main idea from previous sections was that we want a loss that emphasizes informativeness more in later rounds. How does this loss function do that?`
This is a very good question. $L(S_Q)$ is designed to favor high-loss (i.e., uncertain) examples, which tend to be highly informative, in the meta-model $\Phi(\cdot; w)$. Also, the HP-HI subset is the most preferred owing to the skyline constraint. Because the backbone classifier is not mature at early rounds, the loss value may not precisely represent the informativeness; as a side effect, in-distribution (IN) examples tend to be selected more often at early rounds than at later rounds. As the classifier matures, the loss value becomes able to precisely capture the informativeness. Consequently, informativeness is properly emphasized at later rounds by Equation (3) **with the varying capability of the classifier**.
Figure 4(a) empirically confirms that informativeness is emphasized more as the training progresses. Also, in Figure 3, the gap between MQ-Net and the purity-based methods (CCAL and SIMILAR) becomes larger at later rounds. Additional evidence is provided below: we measured the proportion of IN examples in the query set at each round of MQ-Net. As shown in Table R3, this proportion decreases as the training progresses, because informative rather than pure examples are favored at later rounds.
Table R3: Accuracy and ratio of IN examples in the query set for our split-dataset experiment on CIFAR10 with open-set noise of 40%, where $\% Q_{in}$ means % of IN examples in a query set.
| Method | Round |1|2|3|4|5|6|7|8|9|10|
|:-----:|:----------:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|MQ-Net | ACC |59.6|73.1|79.5|82.9|85.7|88.2|89.3|90.1|90.9|91.5|
| | $\% Q_{in}$|88.4|81.1|76.6|72.8|66.2|63.4|57.6|61.4|56.7|57.6|
`Q5. Typos.`
Thank you very much for helping us improve our paper. The second appearance of HP should be changed to HI, and we fixed the typo.
---
## Reviewer4
R(3), Confidence(3)
We are very glad that you have acknowledged the main contribution of this paper. At the same time, we deeply appreciate your valuable comments and reasonable concerns. During the rebuttal process, we have already addressed all of your concerns in the revised paper. Therefore, we look forward to hearing your positive feedback.
#### Major Concerns (Q1~3).
`Q1-1. No code is provided and no report about the resource usage.`
Thank you very much for helping us improve our paper. We provide our code at [the link](https://anonymous.4open.science/r/MQNet-43E6/). All methods are implemented with PyTorch 1.8.0 and executed using a single NVIDIA Tesla V100 GPU. The experiments for ImageNet could be run smoothly using this resource. We updated this information about the code and resource usage in the revised paper with the R4Q1 mark.
`Q1-2. It is also unclear how hyperparameters of the MQ-Net were chosen.`
Since MQ-Net is trained on a low-dimensional meta-input, we decided to use a shallow MLP architecture with two layers, a hidden dimension size of 64, and the ReLU activation function. The hyperparameters of MQ-Net, including its architecture and the optimization configurations, are specified in Appendix C.
`Q2. No standard deviation and random baseline.`
Thanks again for pointing out these issues. We added error bars and the result of a random baseline in Figure 3 and Table 8 in the revised paper, and also added the random baseline to Table 1. Evidently, the standard deviations are very small, and the significance of the empirical results is sufficiently high.
`Q3. How do you do the z-score normalization, i.e., over which parts do you compute mean and standard deviation?`
We conduct z-score normalization separately for each scalar score, $O(x)$ and $Q(x)$. That is, at every AL round, we compute the mean and standard deviation over the unlabeled examples; they are computed before the meta-training and used for the z-score normalization at that round. We further clarified this procedure in Lines 237-240 of the revised paper with the R4Q3 mark.
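A minimal sketch of this per-round normalization, with toy numbers and illustrative variable names:

```python
import numpy as np

def zscore(scores, eps=1e-8):
    """Normalize one scalar score over the unlabeled pool of the current AL round."""
    return (scores - scores.mean()) / (scores.std() + eps)

# Toy scores over the unlabeled pool at one round; each score is normalized
# independently, using the mean/std computed at that round (before meta-training).
purity = np.array([2.1, 0.5, 1.3, 3.0])
informativeness = np.array([0.2, 1.7, 0.9, 0.4])
purity_z, informativeness_z = zscore(purity), zscore(informativeness)
```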
#### Minor Questions (Q4~8).
`Q4. Figure 4: it appears that all red OOD samples received a high purity score - shouldn't it be vice-versa?`
In Figure 4, most red OOD examples received purity scores of around 0.7–0.8, which are regarded as low for CSI-based purity scores. Some OOD examples may receive a high purity score over 0.9, since the open-set recognition performance of CSI is not very accurate in AL due to the lack of a sufficient amount of clean labeled examples.
`Q5. In Theorem 4.1, the constraint that the activation function needs to be monotonically non-decreasing is not mentioned (it is however in the appendix).`
Thank you for pointing out this important issue. According to your suggestion, we fixed Theorem 4.1 by adding the constraint of an activation function. See the updated Theorem 4.1 in the revised paper.
`Q6. In the problem statement, different annotation costs for annotation of IN and OOD examples are introduced. But in the experiments, the costs for OOD and IN labeling are set as the same.`
We agree and appreciate this comment. According to the reviewer’s suggestion, we conducted additional experiments and analyzed the effect of different costs for querying OOD examples in Appendix F in the revised version. Table 7 summarizes the performance change with four different labeling costs (i.e., 0.5, 1, 2, and 4) for the split-dataset setup on CIFAR10 with an open-set ratio of 40%. Overall, MQ-Net consistently outperformed the four baselines regardless of the labeling costs. Meanwhile, CCAL and SIMILAR were more robust to the higher labeling cost than CONF and CORESET. This is because CCAL and SIMILAR, which favor high purity examples, query more in-distribution examples than CONF and CORESET, so they are less affected by labeling costs, especially when cost $\tilde{c}$ is high.
`Q7. Some parts of the paper could be improved in readability.`
Per your suggestion, we removed an unimportant notation, $T_{in}$, in the problem statement.
`Q8. Typos.`
Thank you very much for helping us improve our paper. We fixed all the typos in the revised paper.