# NeurIPS 2022 Rebuttal

## To all

We thank all reviewers for their positive feedback (the proposed GCD method is effective, solid and well-motivated [GP8j, dTQi, JCAw]; the class number estimation is effective [dTQi, JCAw]; the experimental results are extensive, impressive and convincing [GP8j, dTQi, JCAw]; and the paper is well-written and easy to follow [GP8j, dTQi]; etc.) and constructive comments.

<!--
* the proposed S-FINCH effectively exploits reliable cross-instance positive relations for better representation learning, which considers labelled and unlabelled data under a unified schema. [GP8j, dTQi, JCAw]
* the proposed S-FINCH provides an effective way to estimate the class number without repeated runs. [dTQi, JCAw]
* the paper is well-motivated: our method can work efficiently without knowing the number of clusters, which is useful in real-world applications. [dTQi]
* the solution to the problem of clustering on data with unknown classes is solid. [dTQi]
* the experimental results are comprehensive, extensive, impressive and convincing, showing our method outperforms other baselines by a significant margin on different datasets with different granularity. [GP8j, dTQi, JCAw]
* the ablation study supports the effectiveness of our method. [GP8j, JCAw]
* the paper is well organized, well-written, and easy to follow. [GP8j, dTQi]
-->

We have addressed individual concerns carefully in the response to each reviewer. In the updated version, we have revised the paper following the suggestions and highlighted the revisions in blue. Additional results, details and discussions will be included in the final version.

## Reviewer GP8j

**Novelty**

We study the practical and challenging problem of generalized category discovery (GCD), a relatively new task recently introduced in [32].
* First, to tackle this challenge, we propose a joint representation learning and category discovery framework that effectively explores the cross-instance positive relations from both labelled and unlabelled data, which is substantially different from [32], which simply uses two augmented versions of each instance to form the positive pairs.
* Second, it is non-trivial to obtain reliable positive relations from the unlabelled data, which contains instances from both seen and unseen classes. To this end, we propose S-FINCH as part of our framework to dynamically explore positive relations during training for the GCD task. Meanwhile, S-FINCH also serves as a transfer clustering algorithm for GCD to produce the class discovery results.
* Third, we propose a much more efficient class number estimation approach with a one-by-one merging strategy (see Table 6 in Appendix D).

Hence, we believe our method is not simply a modified version of FINCH.

**Details on GCD and the test stage**

We strictly follow the protocol of Vaze et al. [32] for fair comparison. GCD considers the situation where we have a collection of images of which part are labelled and the rest are not. The objective is to automatically group the unlabelled images based on their semantics. Hence, the model takes all images as input and predicts a label assignment for each unlabelled instance. The same unlabelled data are used during training and testing, except that random augmentation is applied during training. For evaluation, we also follow [32] in adopting the Hungarian algorithm to compare the predicted label assignment with the ground-truth labels of the unlabelled data. More in-depth details can be found in Appendix E of [32]. We have clarified the process in the updated paper (lines 277-279).

**Questions**

> Q1: What is the difference between GCD and the open-set data clustering?

A1: We are not sure what open-set data clustering means here.
If the reviewer refers to open-set recognition (OSR): OSR aims at detecting the unlabelled instances from new categories without distinguishing between unseen classes, i.e., OSR can be considered a K+1 classification problem with no discovery process for the new classes, while GCD not only finds the new classes but also groups unlabelled instances based on their semantics. If the reviewer refers to clustering: unsupervised clustering is inherently ambiguous (i.e., objects can be partitioned into different groups based on different but equally valid criteria), while GCD considers a partially supervised setting with prior knowledge of some known classes, which aims at removing the ambiguity problem in clustering. Hence, GCD is a more practical and realistic setting.

> Q2: As the method is designed for the GCD task that could discover new classes from the unlabeled data whatever the classes they are from, I want to know the results of the scenarios where the unlabeled data are only from seen classes and the unlabeled data are only from the unseen classes.

A2: Thanks for this helpful suggestion. We have experimented by directly testing on the suggested cases, with the following results. We can see that our method performs well on both cases: we outperform Vaze et al. [32] on all unseen classes and some seen classes. We have also included the results in Appendix G of the updated supplementary material.
| All unseen | CIFAR10 | CIFAR100 | ImgNet100 | CUB | SCars | Herbarium |
| ---- | :----: | :----: | :----: | :----: | :----: | :----: |
| Vaze et al. [32] | 87.5 | 57.7 | 69.0 | 53.2 | 32.6 | 16.1 |
| Ours | **97.6** | **82.7** | **74.3** | **56.5** | **39.3** | **37.9** |

| All seen | CIFAR10 | CIFAR100 | ImgNet100 | CUB | SCars | Herbarium |
| ---- | :----: | :----: | :----: | :----: | :----: | :----: |
| Vaze et al. [32] | **99.1** | **93.1** | 77.6 | **92.5** | **85.3** | 30.7 |
| Ours | 98.5 | 84.4 | **83.3** | 79.1 | 72.0 | **55.4** |

> Q3: The data splitting in the experiments are too ideal where the numbers of data from both seen and novel classes are balance. I wonder the experimental results when the unlabeled data from the novel classes are small.

A3: We agree that real-world data can exhibit complicated distributions, and long-tailed distributions are common. In fact, Herbarium19 is a long-tailed dataset in which different classes contain an unbalanced number of instances, varying from 10 to 500, so the experimental results on Herbarium19 already demonstrate the feasibility of this challenging setting. In addition, to mimic the case where there are only a few instances from novel classes, we experiment by including only 10% of the instances in each unseen class. The results are shown below. Under this challenging scenario, the performance of both methods drops, while the results remain reasonably good for our method. <mark>When the number of unlabelled instances from unseen classes greatly decreases, our method maintains reasonably good performance on 'unseen' classes, while Vaze et al. [32] gives much worse performance, though it achieves better performance on 'seen' classes on some datasets. Overall, our method consistently achieves the best trade-off.</mark>

| CIFAR-100 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| Vaze et al. [32] | **87.7** | **91.6** | 8.9 |
| Ours | 79.9 | 81.0 | **58.1** |

| ImgNet-100 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| Vaze et al. [32] | 64.3 | 69.8 | 37.1 |
| Ours | **78.4** | **81.3** | **63.7** |

| CUB-200 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| Vaze et al. [32] | **73.9** | **86.5** | 10.7 |
| Ours | 52.6 | 58.2 | **24.7** |

| Herbarium19 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| Vaze et al. [32] | 25.0 | 28.3 | 10.0 |
| Ours | **42.9** | **47.5** | **22.2** |

> Q4: As the results are only evaluated on the small or medium scale datasets, I wonder if the proposed approach still be superior on the large-scale datasets or with a more powerful pretrained model.

A4: We agree with the reviewer's hypothesis and observation. Indeed, we have included experiments on a wide spectrum of datasets (see Table 8 in Appendix F), from small to large scale (though medium-large compared with the very largest), and from coarse- to fine-grained, and our method consistently performs well on all of them. We believe the same conclusion will hold for other larger datasets; however, we lack the computing resources to carry out more such experiments. To validate the effect of a more powerful pretrained model, we experiment on CIFAR-100 by replacing the ViT-B-16 with the more powerful ViT-B-8 pretrained by DINO. The results are as follows. <mark>With a more powerful pretrained backbone, the performance is further improved in all cases, indicating that our framework generalizes to different backbones.</mark>

| Model | all | seen | unseen |
| :---- | :----: | :----: | :----: |
| ViT-B-16 | 81.5 | 82.4 | 79.7 |
| ViT-B-8 | **83.1** | **82.6** | **84.2** |

**Typos**

Thanks. We have fixed them in the revised paper.
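For reference, the evaluation protocol mentioned above (matching predicted cluster IDs to ground-truth labels before computing accuracy) can be sketched as follows. This is an illustrative pure-Python toy with a hypothetical helper name (`clustering_accuracy`) that brute-forces the optimal one-to-one matching on a small example; at realistic scale one would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings of cluster IDs to class labels.

    Toy brute-force stand-in for the Hungarian algorithm; fine for a handful
    of clusters, use scipy.optimize.linear_sum_assignment at scale.
    """
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    # Pad so every cluster can map to some label (padding labels never match).
    while len(classes) < len(clusters):
        classes.append(f"_pad{len(classes)}")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for p, t in zip(y_pred, y_true))
        best = max(best, hits)
    return best / len(y_true)

# Example: clusters 0/1/2 align with labels "cat"/"dog"/"bird" up to renaming.
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))  # → 1.0
```

Since accuracy is invariant to how clusters are numbered, the optimal matching is what makes the metric well-defined for unlabelled data.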
## Reviewer dTQi

We thank the reviewer for the constructive suggestions and questions.

> Q1: In Sec. 3.3, it seems the chain length is an important hyper-parameter. It would be great to investigate how different chain lengths impact the clustering.

A1: We agree that the chain length is an important factor. Intuitively, the chain length $\lambda$ should be positively correlated with the number of labelled instances in each class $n_\ell$; meanwhile, $\lambda$ should be smaller than $n_\ell$ but not too small when $n_\ell$ is small (an extremely small $\lambda$ would lead to slow convergence and less useful chains). The square root is the simplest option that satisfies this intuition, so we simply apply the square root. This might not be the best option, but we did not investigate further given the strong performance of the simple square root. We compare our dynamic length with fixed chain lengths in Table 2, Appendix A; our dynamic length overall works better than the best possible fixed chain length. In addition, we also experiment with other possible dynamic chain length formulations, namely $\lambda=\lceil n_\ell/2 \rceil$ and $\lambda=\lceil\sqrt[3]{n_\ell}\ \rceil$. Results are shown below. <mark>We observe that neither the larger $\lambda=\lceil n_\ell/2 \rceil$ nor the smaller $\lambda=\lceil\sqrt[3]{n_\ell}\ \rceil$ works well in general. We hypothesize that different formulations lead to different cluster numbers at each level, thus producing more wrong or fewer right positive relations. There might be other options, but since the square root works considerably well, we use it as our default choice.</mark>

| CIFAR-100 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| $\lceil n_\ell/2 \rceil$ | 81.4 | **84.5** | 75.2 |
| $\lceil\sqrt[3]{n_\ell}\ \rceil$ | 72.5 | 77.1 | 63.2 |
| $\lceil\sqrt{n_\ell}\ \rceil$ (Ours) | **81.5** | 82.4 | **79.7** |

| CUB-200 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| $\lceil n_\ell/2 \rceil$ | 45.5 | 45.5 | 45.5 |
| $\lceil\sqrt[3]{n_\ell}\ \rceil$ | 42.4 | 45.0 | 41.1 |
| $\lceil\sqrt{n_\ell}\ \rceil$ (Ours) | **57.1** | **58.7** | **55.6** |

> Q2: In Sec. 3.3, although the proposed SFN is interesting and effective, It is difficult to understand why these constraints are adopted. It will be helpful to provide some insights into why the simple way fails and why SFN uses these constraints.

A2: We found that a simple FN or a long chain often leads to a single large cluster containing all instances, which is fatal to hierarchical clustering. To avoid this, we apply the following constraints for SFN:

* We use short chains with a dynamically determined length, to reduce the chance of forming a single cluster containing all instances.
* A labelled instance is not allowed to be the SFN of another labelled instance more than once, to further avoid multiple chains being connected together. This also inherently ensures that chains of different labelled classes will not be merged.
* Neither of the above two constraints is applied to the unlabelled instances, so that an unlabelled instance can join a labelled chain or an unlabelled cluster based on its semantic similarities.

> Q3: In Sec. 3.4, the idea of estimating the clustering quality by joint reference score is great. But I have several questions: a) What partition of D^{l} and D^{v} is used in experiments? How does this partition impact the cluster number estimation?
> b) It is helpful to give some insights into why such a joint reference score is adopted.

A3: (a) We simply set $|D^{l}|:|D^{v}|$ to 8:2, a common ratio widely used for validation, without any tuning. We have followed the suggestion to add this detail in the updated paper (line 289). We further experimented with other ratios, namely 9:1 and 7:3. The results change only slightly (as follows); overall, our method is not sensitive to the ratio.

| Ratio | CIFAR10 | CIFAR100 | ImgNet100 | CUB | SCars | Herbarium |
| ---- | :----: | :----: | :----: | :----: | :----: | :----: |
| GT | 10 | 100 | 100 | 200 | 196 | 683 |
| 9:1 | 13 | 97 | 107 | 152 | 195 | 446 |
| 8:2 | 12 | 103 | 100 | 155 | 182 | 490 |
| 7:3 | 12 | 102 | 107 | 151 | 183 | 423 |

(b) The intuition is that we want the overall measurement on the labelled and unlabelled subsets to be the best, so that we obtain a good overall score. Thus, we combine the labelled clustering accuracy and the silhouette score into the joint reference score. (1) The silhouette score is an intrinsic clustering quality index: we want to maximize intra-cluster compactness and inter-cluster discrepancy on the unlabelled data (without access to GT labels). (2) The labelled clustering accuracy is an extrinsic index: we hope the labelled data can be accurately clustered to the greatest extent (with access to GT labels).

> Q4: In L220, the author simply picks the third level. Please give a detailed analysis of this parameter.

A4: We expect the cluster number to be neither too large nor too small at the picked level. If it is too large (at a lower level), we will have fewer pairs of positive relations in each mini-batch. If it is too small (at a higher level), excessive wrong positive relations will be generated. So a good choice is a level that over-clusters the labelled instances from known classes to some extent. We empirically find the third level to be a good level that slightly over-clusters the labelled instances from the known classes.
We further experiment on other levels and find that over-clustering levels 3 and 4 are similarly good, while level 2 is worse because fewer positive relations are explored in each mini-batch. Even at level 2, our method still performs on par with Vaze et al. [32]. Details are added in Appendix A (lines 30-37).

| CIFAR-100 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| Vaze et al. [32] | 70.8 | 77.6 | 57.0 |
| Ours w/ level 2 | 72.4 | 79.6 | 58.0 |
| Ours w/ level 3 | 81.5 | **82.4** | 79.7 |
| Ours w/ level 4 | **81.6** | 81.9 | **80.8** |

| CUB-200 | all | seen | unseen |
| ---- | :----: | :----: | :----: |
| Vaze et al. [32] | 51.3 | 56.6 | 48.7 |
| Ours w/ level 2 | 50.9 | 55.8 | 48.5 |
| Ours w/ level 3 | **57.1** | **58.7** | **55.6** |
| Ours w/ level 4 | 52.9 | 53.1 | 52.8 |

> Q5: Can you list the latency of each method?

A5: We have reported the clustering latency of our method and Vaze et al. [32] in Appendix D (Table 6) of our submission. The latency mainly consists of two parts: feature extraction and clustering. For all methods, feature extraction takes less than 0.02s per image, <mark>as shown below</mark>, while our clustering method is 6-30 times faster than that of Vaze et al. [32]. We have measured the latency more comprehensively and report the more detailed numbers in Appendix D.

| Methods | Time cost |
| -------- | -------- |
| RankStats+ [7] | 0.015s±0.001 |
| UNO+ [6] | 0.017s±0.001 |
| ORCA [2] | 0.015s±0.001 |
| Vaze et al. [32] | 0.014s±0.001 |
| Ours | 0.014s±0.001 |

> Q6: How many times do you run for each dataset? Can you provide the mean and variance of the main tables?

A6: In our original paper, we reported the results of a single run. We have followed the suggestion to repeat our experiments 5 times (3 times on Herbarium19 due to the time limit) and added the mean and std to our main tables (Tables 1 & 2). The std is generally quite small, indicating the stability of our method.
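To make the two ingredients of the joint reference score from A3(b) concrete (an extrinsic clustering accuracy on the labelled split and an intrinsic silhouette score on the unlabelled split), here is a minimal from-scratch sketch. The equal weighting `w=0.5` and the helper names are our illustrative assumptions, not the paper's exact formulation.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient from pairwise Euclidean distances."""
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, lj) in enumerate(zip(points, labels))
                if lj == li and j != i]
        if not same:                  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)     # mean intra-cluster distance
        b = min(                      # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lj in zip(points, labels) if lj == c)
            / sum(1 for lj in labels if lj == c)
            for c in set(labels) if c != li
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def joint_reference_score(labelled_acc, unlab_points, unlab_labels, w=0.5):
    """Hypothetical combination of the extrinsic measure (clustering accuracy
    on labelled data) and the intrinsic one (silhouette on unlabelled data)."""
    return w * labelled_acc + (1 - w) * silhouette(unlab_points, unlab_labels)

# Two well-separated clusters score close to 1 on silhouette:
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(round(silhouette(pts, [0, 0, 1, 1]), 2))
```

Sweeping the candidate cluster number and keeping the one that maximizes this joint score is the search loop the estimation in Sec. 3.4 implies; only the combination rule here is a sketch.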
**Limitations**

Indeed, we have broadly discussed the limitations of our approach in Appendix H. For the additional questions:

> (1) From Table 3, it seems the proposed method prefers to infer a small cluster number. Will it bring some potential negative impact?

(1) Class number estimation remains an open challenge, especially in GCD. So far, only Vaze et al. [32] provide a solution for class number estimation in GCD. We notice that the underestimation of the class number mainly happens on the challenging fine-grained datasets, where the difference between classes is subtle and multiple similar classes may thus be grouped as one. Hence, in scenarios where an accurate cluster number estimation is required, this can have a negative impact. Though our method achieves comparably good performance to Vaze et al. [32] much more efficiently, more effort is still needed to achieve accurate novel class number estimation.

> (2) It seems that the method can’t be SOTA on seen classes. Will it be limited to data with all known classes?

<!--
(2) We agree that our method is not SOTA on seen (known) classes of some datasets, though the gap between our method and other methods is not big. For those with SOTA performance on seen classes, we observe a sharp drop on the unseen classes, indicating a notable bias towards the seen classes. Our method shows superior performance on all unseen classes and some seen classes, thus achieving overall SOTA performance across the board. We further evaluated the case where the unlabelled instances are only from seen classes. <mark>Experimental results are shown below. Our method can still achieve reasonably good performance, although Vaze et al. [32] may reach higher accuracy on some datasets because it overfits to the seen classes.</mark> More details are in Appendix G of the updated supplementary material.
-->

(2) Please refer to Q2 & A2 in the response to reviewer GP8j.
We agree that our method is not SOTA on seen classes of some datasets, though the gap between our method and other methods is not big. For the other methods outperforming ours on seen classes, we observe a sharp drop in their unseen class performance, indicating a notable bias towards the seen classes. Our method shows superior performance on all unseen classes and some seen classes, thus achieving overall SOTA performance across the board. We further evaluated the case where the unlabelled instances are only from seen classes; our method still achieves reasonably good performance (more details in Appendix G of the updated supplementary material).

**Typos**

Thank you. We have fixed them in the updated paper.

## Reviewer JCAw

We thank the reviewer for the detailed investigation and valuable suggestions. We carefully answer each question as follows:

> Q1: In Sec. 3.3, the reason why the chain length λ (Line209) is set to the square root of the number of labelled instances in each class (i.e., λ=⌈√nℓ⌉) is not clear. How about the sensitivity of the results w.r.t. the different values of λ? It seems that there were no discussions or quantitative results about this claim.

A1: Please refer to the response to reviewer dTQi (Q1 & A1). We provide additional discussion and results on other choices in Appendix A. The chain length should be positively correlated with (but smaller than) the number of labelled instances, while not being too small. The concrete formulation satisfying these constraints is not unique, and the square root is the simplest one we could think of. <mark>We experimented with other possible choices and found our simple square root to be an effective option, though there might be better (possibly more complex) alternatives.</mark>

> Q2: In Sec. 3.3, in order to obtain a high purity of each cluster, the authors chose the third level from the bottom of hierarchy to generate the pseudo labels. Why did the authors use this level?
> What will the results change if we choose the higher or lower level? Or how do the different levels affect the quality of pseudo labels?

A2: <mark>Please refer to the response to reviewer dTQi (Q4 & A4).</mark> The consideration behind this choice is that the level should slightly over-cluster the labelled instances so that chains can be generated to produce useful positive relations. We generally found that level three satisfies this and adopt it for all datasets. We further show a comparison with other levels in Appendix A (lines 30-37, Table 3).

> Q3: In [3], a similar strategy, named Neighborhood Contrastive Learning (NCL), was also used to construct the pseudo-positive pairs for contrastive learning. Could the author compare the difference and connection between NCL and the proposed method in this paper? How about the performance gain of the pseudo-positive pair generation in this method compared to NCL [3] if we fix the other parts?

A3: **Difference**: (1) NCL uses a memory bank to store samples, while ours uses batch-wise samples, which is more memory-efficient. (2) NCL directly uses the k nearest neighbours in the memory bank to generate positive samples, while ours uses connected components in the graph of each mini-batch to generate global positive relations, subject to the constraints of SFN, bringing more thoughtful and reliable positive relations than kNN. (3) NCL also relies on a hard negative generation process for training, which leverages the assumption that the unlabelled data are all from unseen classes; this holds for NCD but not for the GCD setting. **Connection**: the idea of pseudo-positive generation is similar and shown to be effective. For the experimental comparison, according to NCL's formulation of the number of positive samples (i.e., $|M|/C^u/2$), this number is 1 in our setting on CIFAR-100 and CUB-200.
Thus, it corresponds exactly to the experiment replacing S-FINCH with the nearest neighbour for positive relation generation, shown in the first row of Table 4 in the main paper. Under this same setting, S-FINCH achieves a significant improvement over the nearest neighbour (i.e., the NCL positive relation extraction strategy).

> Q4: The biggest concern: I noticed that some results in Tab. 1 in this paper are highly different from the results reported in Tab. 2 of [1], even for the same method on CIFAR10, CIFAR100, ImageNet-100. For example, [1] claimed that their method achieved 76.9%, 84.5%, and 61.7% respectively on "All", "Old", and "New" classes on CIFAR100 dataset; however, in the Tab. 1 of this paper, the performances of the method in [1] on CIFAR-100 dataset were just reported as 70.8%, 77.6%, 57.0%, which were significantly lower than the paper [1]. This is a common observation if we compare Tab. 1 of this paper with Tab. 2 of [1], or Tab. 2 of this paper with Tab. 3 of [1]. We can conclude that the performances of different baselines methods (RankStats+, UNO+, method of [1]) reported in this paper are mostly lower than the results in [1]. If we adopt the results in [1], the proposed method in this paper did not outperform other baselines. Could the author explain more about these results? It is so weird because some results are still consistent across these two papers. For example, the method in [1] always achieved the same performance (91.5%/97.9%/88.2% for "All"/"Seen"/"Unseen" classes) on CIFAR-100 in Tab.1 of this paper and Tab.2 of [1].

A4: We double-checked this and can confirm that we compare with the latest published and arXiv version of [1] ([paper link](https://openaccess.thecvf.com/content/CVPR2022/papers/Vaze_Generalized_Category_Discovery_CVPR_2022_paper.pdf)), in which the authors adopted more rigorous and challenging data splits. They also improved the evaluation protocol to better reflect the challenge of the GCD task (see Appendix E of [1]).
Our reported numbers for [1] are from the up-to-date version of [1], and we strictly follow [1] in adopting the new data splits and evaluation protocol for fair comparison. The numbers that the reviewer referred to are from an earlier version of [1].

> Q5: In Tab. 5, if we compare the Rows (0)-(2), we can conclude that of we adopt "k-means & u-u" or "FINCH & u-u", the performances on "Seen" classes will significantly decrease, compared to Row (0). Furthermore, if we compare Rows (0)(4), we can observe that the combination of "S-FINCH & u-ℓ can drop the performances on "Unseen classes" (on CIFAR-100) or on "Seen classes" (on CUB-200). Could the author provide more detailed discussion about these observations?

A5: (1) "k-means & u-u" in Row (1) and "FINCH & u-u" in Row (2) apply k-means and FINCH to all instances, whether seen or unseen, to group instances together. Labelled instances from different classes are not prohibited from being put in the same cluster, so wrong positive relations connecting labelled instances from different seen classes can be used for training, which is harmful to the performance on <mark>seen classes</mark>, causing the slight performance drop on seen classes in Row (0) vs. Rows (1, 2). Overall, however, more useful positive relations are introduced to improve the training, boosting the performance significantly on the unseen classes and leading to a notable overall performance boost. (2) For "S-FINCH & u-ℓ", positive pair relations are only introduced for the seen classes, biasing the training towards the seen classes and thus bringing negative effects on the unseen classes, as well as on the seen classes of the more challenging fine-grained CUB-200 dataset. After introducing the "u-u" pairs in Row (5) to remove this bias, both the seen and unseen performance are consistently boosted, further demonstrating the effectiveness of our positive relation generation method for GCD.
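To illustrate the connected-component step mentioned above (A3 and A5 to this reviewer): unlike raw kNN positives, every pair of instances sharing a component of the within-batch graph becomes a positive. A minimal union-find sketch with hypothetical helper names, omitting the graph construction and the SFN constraints themselves:

```python
def connected_components(n, edges):
    """Union-find over n nodes; returns a component ID per node."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(i) for i in range(n)]

def positive_pairs(n, edges):
    """All instance pairs that share a component count as positives."""
    comp = connected_components(n, edges)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if comp[i] == comp[j]]

# A 6-instance mini-batch where edges 0-1, 1-2 form a chain and 4-5 pair off;
# node 3 is isolated, so it contributes no positive pairs.
print(positive_pairs(6, [(0, 1), (1, 2), (4, 5)]))
# → [(0, 1), (0, 2), (1, 2), (4, 5)]
```

Note how (0, 2) is a positive despite having no direct edge, which is the transitive effect the component-based formulation provides over kNN.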
> Q6: In Appendix C, I observed the peaks of three curves almost coincide at similar values. However, as for SCars dataset, the peaks of the labelled accuracy and silhouette score are located at highly different x-axis values. Is it a common phenomenon in the three fine-grained datasets? Could the author provide more discussion about this?

A6: We further plotted the curves on the other datasets in Appendix C. We observe that this is a common phenomenon on CUB-200 and S-Cars, where inter-class differences are subtle, but not on Herbarium19, due to its more challenging long-tailed distribution. For fine-grained datasets, the inter-cluster distance does not change much with the cluster number, while the intra-cluster distance decreases as the cluster number increases; thus the silhouette score favours over-clustering in such cases. For the clustering accuracy on labelled data, labelled instances from the same class may be mis-clustered into multiple clusters, and under-clustering may reduce this error; hence its trend is the opposite of the silhouette score. Our class number estimation takes both measures into consideration, resulting in a reasonably good estimation.

> Q7: In Appendix D, the time efficiencies between [1] and the proposed method in this paper. How about the time efficiency of the estimator in [2] compared to the above methods?

A7: The method of [1] improves on DTC [2] with Brent's optimization algorithm for GCD. This optimization leads to a significant efficiency improvement, from O(n) to O(log(n)). As our method is already notably more efficient than [1], we did not compare with [2].

[1] Generalized category discovery. CVPR 2022.
[2] Learning to discover novel visual categories via deep transfer clustering. ICCV 2019.
[3] Neighborhood contrastive learning for novel class discovery. CVPR 2021.


    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully