```
Andrew: Revised version on Aug. 8
```
```
Andrew: changed the format a little bit to adapt to OpenReview
```
$\textbf{Dear Reviewers}$
We want to thank the reviewers for their constructive suggestions!
Aside from the individual responses, we conduct an additional analysis in the rebuttal PDF to quantify the annotation mismatches in the nuScenes $\rightarrow$ 16k-nuImages scenario. Our objective is to determine the specific types of annotation mismatches present in this setting. We hope this additional analysis complements the current submission.
Thanks!
ICLR 2022 Conference Paper1032 Authors
Link: [Rebuttal](https://www.overleaf.com/5773341458tndhmkkmwgwm)
$\textbf{Dear Reviewer 89LB}$
Thank you for your positive comments regarding the immediate value of our work and the summarization of this new problem setting.
**Assumption on the similarity between the source and target datasets**
> the proposed approach would require the source and target datasets to be already very similar to each other, which limits its use, e.g. it cannot merge together COCO and Cityscapes.
We agree that label translation makes sense when the source and target datasets are 'similar', but we **only assume that they share the same class label space**. This means that COCO and Cityscapes could indeed be merged, for example by selectively pruning COCO of non-driving labels/images or by targeting only the overlapping classes in the label translation step.
Finally, we note that having similar datasets is a standard assumption in the context of other source-target joint-training methods such as domain adaptation. We will better clarify this limitation in our revision.
**Baseline: target-trained RPN and label translator**
> the authors should compare to the baseline where both the RPN and Label Translator are trained on the target datasets
Thank you for the suggestion. We want to clarify that the suggested baseline is exactly the pseudo-labeling (PL) baseline in Table 1 of our paper! Furthermore, our approach (LGPL) consistently outperforms this baseline across datasets and downstream detector architectures, improving by 2.68 points on average. We will expand our description of the baselines and their implementations in the paper to avoid this confusion.
**Misc**
> The Eq 2 and 3 are confusing as many notations are not explained (e.g. b_{h,j}), in fact I feel they are not necessary and the target construction process can be well explained in texts.
Thanks for the feedback. We will revise the methods section to explain the notation more clearly, especially the class-conditional IoU-based assignment in Eqs. 2 and 3.
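For reference, and abstracting away the exact notation of Eqs. 2 and 3 (the symbols below are illustrative rather than the paper's), the class-conditional assignment matches each box only against boxes of the same class:
$$
\hat{j}(i) \;=\; \arg\max_{\,j\,:\,c_j = c_i} \mathrm{IoU}\!\left(b_i, b_j\right), \qquad b_i \text{ is assigned to } b_{\hat{j}(i)} \text{ only if } \mathrm{IoU}\!\left(b_i, b_{\hat{j}(i)}\right) \ge \tau,
$$
where $b_i$ and $c_i$ denote a box and its class label, and $\tau$ is an IoU threshold.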
$\textbf{Dear Reviewer AfT8}$
Thank you for your positive comments regarding the significant practical importance of our problem, analysis of the problem, and our simple yet effective solution approach.
**Comparison with pseudo-labeling**
> In my opinion, the key idea of the label translator is actually the idea of pseudo labels, and simply determine the category it belongs to through confidence level, the only difference from the previous methods of pseudo labeling is that the source of the dataset is different. I think this contribution is not enough.
To clarify our contributions:
1. We are the first to introduce the formal problem of label translation, and we appreciate the recognition of its practical significance. We also rigorously analyze the problem by building a taxonomy of annotation mismatches with concrete examples.
2. We propose an intuitive translation algorithm which, although motivated by pseudo-labeling, expands upon the naive baseline. LGPL leverages source dataset bounding boxes and class labels at test time. During training, we separate the training sources of the RPN and the label translator to simulate the bounding box distribution at test time. Additionally, since the label translator is class-conditional, we perform class-conditional ground truth assignment during training (see the sketch after this list).
3. We rigorously validate our approach on seven datasets and four source-target combinations, while also analyzing the effect at different thresholds. Our approach outperforms all baselines in all settings. Importantly, we compare against conventional strategies like domain adaptation, demonstrating that our data-centric approach is a new alternative to the standard method for addressing image/label mismatch.
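For concreteness, here is a simplified sketch of the training recipe in point 2 (one plausible reading of the description above; `train_rpn`, `train_translator`, and the dataset fields are illustrative placeholders, not our actual implementation):

```python
# Simplified sketch of LGPL training with separated training sources.
# All helper functions and dataset fields below are illustrative placeholders.
def train_lgpl(source_ds, target_ds):
    # Train the RPN on the source dataset so that its proposals mimic the
    # source-style boxes the translator will receive at test time.
    rpn = train_rpn(source_ds.images, source_ds.boxes)

    # Train the class-conditional label translator against target annotations,
    # assigning each proposal only to target ground truth of the same class
    # (the class-conditional IoU assignment of Eqs. 2 and 3).
    proposals = rpn.propose(target_ds.images)
    translator = train_translator(proposals, target_ds.labels,
                                  assignment="class_conditional_iou")
    return translator

def translate_labels(translator, source_ds):
    # At test time, source boxes and class labels are refined into
    # target-style labels.
    return translator.refine(source_ds.images, source_ds.boxes,
                             source_ds.class_labels)
```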
**Significance compared to the baselines**
> and it can also be seen in the experiment that using pseudo labels and noise filtering methods can achieve competitive results.
We emphasize that LGPL is the **only** approach that consistently improves over 'No translation' across all scenarios and detectors. From Table 1, the pseudo-label (PL) and noise-filtered (PL & NF) baselines are (1) consistently weaker than our approach, by **2.68 and 1.11 points on average**, respectively, and (2) **occasionally worse than 'No translation'**. Below, we repeat the relevant parts of Table 1 in terms of the difference with respect to 'No translation' (higher is better).
| nuScenes $\rightarrow$ 16k nuImages | Faster-RCNN-R50 | YOLOv3 | Def-DETR |
| ----------------------------------- | --------------- | ------ | -------- |
| PL | -0.76 | -2.57 | -0.53 |
| PL & NF | -0.57 | +2.02 | +1.32 |
| LGPL (Ours) | +1.35 | +3.56 | +1.87 |
| MVD-cyclist + nuImages-cyclist $\rightarrow$ Waymo-cyclist | Faster-RCNN-R50 | YOLOv3 | Def-DETR |
|-----------------------------------------------------------|-----------------|--------|----------|
| PL | -1.56 | +0.01 | -3.07 |
| PL & NF | -0.45 | +1.57 | +1.57 |
| LGPL (Ours) | +1.61 | +2.88 | +1.74 |
**LGPL is a general-purpose label translator**
> Since the author lists the possible reasons for label mismatch problem in Figure 2, the following method introduction should provide targeted explanations on how the proposed method solves these problems, which I did not see in the manuscript.
To clarify, our method is a **general-purpose label translator** capable of handling any annotation mismatch. Our goal in presenting the detailed taxonomy in Section 3 is to underscore the importance of the label translation problem, which has not yet been formally studied in the object detection literature. Importantly, with new datasets emerging, we foresee new types of mismatches. We believe it is crucial to develop a learnable data-centric strategy that remains adaptable to unforeseen dataset combinations.
**Experimental analysis**
> The experimental part of the article is slightly weak and insufficient to support the conclusions elaborated in the article
Our primary conclusion is that LGPL is an effective general-purpose label translator, surpassing intuitive baselines and even outperforming off-the-shelf domain adaptation. To support this, we evaluate LGPL alongside the baselines on **7 datasets with 4 source-target combinations and 3 architectures** (Table 1). Every baseline underwent rigorous hyperparameter optimization (see Appendix B), and our approach outperforms all baselines in all settings. We also show that LGPL outperforms object detection domain adaptation techniques such as DANN and CycConf (Fig. 4b), positioning it as a novel approach to label mismatch.
We acknowledge that there's always room for improvement and welcome specific suggestions for additional experiments that would better support our conclusions.
> Similarly, in the experimental results, I also hope to see a comparison of the alignment before and after using this module.
Please note that we provide some qualitative comparisons in the paper (Fig 5) and the Appendix (Fig 4-7). We are ready to include further specific comparisons if requested.
**Misc**
> How to solve the open-set problem among annotations in training and inference process?
We appreciate the thoughtful question, which we interpret as an inquiry into label translation with arbitrary class labels available during both training and testing. This is an excellent and important question, but it falls beyond the scope of the current work. We emphasize that this paper's focus is introducing and addressing the standard label translation problem, where the class label sets are pre-defined and known. The idea of an open-set/multi-task label translator is intriguing but demands significant technical effort, which we hope to address in future work.
$\textbf{Dear Reviewer 1wmh}$
Thank you for your positive feedback on the simplicity and effectiveness of our method.
**Comparison with a target-trained RPN and label translator**
> The proposed method separately trains the RPN and RoI on the source and target datasets. What is the performance that uses the detector trained on the target to refine the gt bboxes of the source dataset?
Thanks for identifying this intuitive baseline. In fact, it is exactly the pseudo-labeling (PL) baseline we compare with in Table 1! Our approach (LGPL) outperforms this baseline in every setting tested, by **2.68** points on average.
Our revision will clarify the description of this baseline's implementation to avoid confusion.
**The class label spaces of the source and the target dataset need to be similar**
> The proposed method only works for the case where label spaces between source and target are similar.
While we agree that this is a core assumption, we argue it is mild and effectively required by any application that combines multiple datasets for a specific target task (e.g., domain adaptation); combining entirely unrelated datasets would not be practical. Nonetheless, our approach works even when the label spaces are only partially aligned, for example by applying our method to the overlapping classes while preserving the integrity of the remaining source labels.
**Experimental analysis**
> The performance gain compared with 'No Translation' baseline is limited.
LGPL yields a moderate yet **consistent** improvement (**1.87 points on average**) over 'No Translation' across all datasets, source-target scenarios, and detector backbones. Notably, our approach is the **only** strategy that consistently outperforms 'No Translation'; other baselines like PL and PL & NF occasionally underperform it. We emphasize that all baselines were carefully tuned to achieve their best performance.
> experiments are only conducted on Streetview datasets. Can this method be applied to other common detection datasets, like COCO and Object365?
We expect our strategy to work on other common detection datasets like COCO and Object365, since neither our method nor the problem is restricted to driving data. We focus on driving datasets because they are a concrete motivating application in which combining datasets from different vendors [1] or mixing in synthetic data [2] is common practice. We also emphasize that we have so far considered four scenarios spanning seven datasets and arrived at consistent results.
We are happy to add new experiments applying this method to COCO in a final revision.
[1] Train in Germany, Test in The USA: Making 3D Object Detectors Generalize (CVPR 2020)
[2] Bridging the Sim2Real gap with CARE: Supervised Detection Adaptation with Conditional Alignment and Reweighting
**Compared to multi-dataset detection [1,2,3]**
> Other works [1,2,3] are also for annotation inconsistency. It would be better to give a further comparison.
Multi-dataset detection addresses taxonomy disparities across datasets, whereas our work focuses on resolving annotation mismatches within shared classes across distinct datasets.
Applying multi-dataset detection therefore has minimal impact on our problem. For example, applying existing techniques [3] to our setting yields multiple identical dataset-specific language embeddings, since the source and target datasets share the same class label space.
[3] Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding, 2022
$\textbf{Dear Reviewer VrRz}$
Thank you for your positive comments regarding the significant practical importance of our problem and our simple yet effective solution approach.
**Analysis of annotation mismatches**
> I think the paper is lacking statistics showing the percentages of the mismatches across many datasets representing an underlying distribution
Thank you for the valuable suggestion. To illustrate this, we have chosen the "nuScenes $\rightarrow$ 16k-nuImages" scenario. In this scenario, we randomly select 91 images from nuScenes and re-annotate them according to the nuImages annotation specifications, treating these as "gold translated labels." We regard nuScenes annotations as predictions and the manually curated gold labels as the ground truth. Below, we present the TIDE analysis [1]:
|Type|Cls|Loc|Both|Dupe|Bkg|Miss|
|-|-|-|-|-|-|-|
|dAP|0|11.19|0|0|0.83|0|
The TIDE analysis dissects the sources of error in object detection, quantifying their contributions to the AP. Notably, we observe that **localization errors (Loc: 11.19) are substantial and far higher than background errors (Bkg: 0.83)**.
*Please note that we provide qualitative comparisons in the rebuttal PDF.*
[1] TIDE: A General Toolbox for Identifying Object Detection Errors
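For reproducibility, the breakdown above can be produced with the public `tidecv` toolbox along the following lines (a minimal sketch; the file paths are placeholders and the constructor arguments may need adjusting to the exact annotation format):

```python
# Minimal sketch of the TIDE error breakdown reported above; paths are placeholders.
from tidecv import TIDE, datasets

# Gold translated labels (the 91 re-annotated nuScenes images) serve as ground
# truth; the original nuScenes annotations are treated as the "predictions".
gt = datasets.COCO(path="gold_translated_labels.json")
preds = datasets.COCOResult(path="original_nuscenes_labels.json")

tide = TIDE()
tide.evaluate(gt, preds, mode=TIDE.BOX)
tide.summarize()  # prints the dAP contributions of Cls / Loc / Both / Dupe / Bkg / Miss
```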
**Other simple baselines**
> I can think of a solution using GPT or any other LLM solving only the label semantic problem. This can be done by creating prompts that can generate synonyms of classes, and just do a simple label "match". Another solution is to encode the classname using a text-encoder (e.g., CLIP or FLORENCE) and just do a NN assignment. Clearly, this will solve the mismatching problem w/o box refinement.
We appreciate the insightful suggestion. We agree that the suggested approach could be effective for the class-semantics issue in label translation, as supported by prior work on classification [2, 3]. However, as we note in Section 3, class semantics is only one of four major issues; the other three necessitate bounding box refinement. The TIDE analysis above further reinforces that annotation mismatches extend beyond class semantics.
[2] Confident Learning: Estimating Uncertainty in Dataset Labels
[3] Detecting Corrupted Labels Without Training a Model to Predict
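For context, a minimal sketch of the class-name matching the reviewer describes (using the Hugging Face CLIP text encoder; the class names below are illustrative) makes clear that it only resolves class semantics and never touches bounding boxes:

```python
# Class-name matching via CLIP text embeddings: maps each source class name to
# its nearest target class name, without any bounding box refinement.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(names):
    inputs = tokenizer(names, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

source_classes = ["cyclist", "motorbike"]           # illustrative names
target_classes = ["bicycle rider", "motorcycle"]
sim = embed(source_classes) @ embed(target_classes).T
mapping = {s: target_classes[i]
           for s, i in zip(source_classes, sim.argmax(dim=1).tolist())}
# `mapping` aligns class semantics only; localization, occlusion, and
# class-boundary mismatches (Sec. 3) are left untouched.
```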
**Justification of a learning-based approach**
> The drawbacks of the proposed network is thatd it needs to be trained and many other training tricks (e.g., learning-rate scheduling, augmentations, etc.) may be needed to make it work. I think the paper would've been stronger if it included a discussion justifying that a network is really needed.
Because our approach, LGPL, is built upon widely studied two-stage object detectors, we can leverage existing training techniques such as cascade detection heads. Additionally, as shown in Sec. 3, annotation mismatches vary across scenarios; a learning-based approach enables us to capture these mismatches adaptively.
Finally, we note that simple (non-learning-based) methods (e.g., statistical normalization) are not effective for addressing all of the different challenges as noted in Table 1 of our paper.
> Lastly, if the proportion of mismatches is not that significant, I can also see humans doing the mapping in a few hours.
The challenge of asking humans to solve the annotation mismatches is that we do not know which images suffer from the annotation mismatches in advance. Therefore, humans still need to go through all the images even though the proportion of mismatches is small.
Even with advance knowledge of which images suffer from the annotation mismatches, a 1% error rate on a dataset exceeding one million images remains massive.
**Concerns of using downstream detector as the main evaluation metrics**
> The performance improvement could be due to different factors including training tricks, learning scheduling, etc.
While we understand the reviewer's concerns about training tricks potentially confounding our analysis, we have taken great efforts to minimize these confounders by rigorously optimizing the hyperparameters for every baseline and architecture (see Appendix B for details) and reporting the multi-seed results.
Specifically, we (1) adopt each detector's standard training tricks (e.g., gradient clipping and weight decay) in MMDetection, (2) sweep the learning rate and the loss weights, (3) report the performance of the best set of hyperparameters, and finally (4) report the average of three random seeds.
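A simplified sketch of this protocol follows (the helper `train_and_eval` is hypothetical, standing in for MMDetection's standard train/test entry points; the grids are illustrative and not the exact ranges of Appendix B):

```python
# Hyperparameter sweep protocol: grid over learning rate and loss weight,
# average over three seeds, report the best configuration.
import itertools
import numpy as np

def sweep(translated_labels, detector_cfg,
          lrs=(1e-3, 2e-3, 4e-3), loss_weights=(0.5, 1.0, 2.0), seeds=(0, 1, 2)):
    results = {}
    for lr, w in itertools.product(lrs, loss_weights):
        # train_and_eval(...) -> validation mAP on the target dataset;
        # hypothetical helper wrapping the detector's training/testing scripts.
        maps = [train_and_eval(detector_cfg, translated_labels,
                               lr=lr, loss_weight=w, seed=s) for s in seeds]
        results[(lr, w)] = float(np.mean(maps))  # average over random seeds
    best = max(results, key=results.get)         # best hyperparameter setting
    return best, results[best]
```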
> I think the experiments need to directly evaluate the label correspondences between source and target datasets
This is a great suggestion. In the following table, we measure the direct correspondences on nuScenes $\rightarrow$ 16k-nuImages:
|Method|mAP (direct correspondence)|mAP (downstream performance)|
|-|-|-|
|No translation|24.4|37.38|
|SN|6.5|37.44|
|PL|37.6|36.09|
|PL & NF|42.8|38.3|
|LGPL (Ours)|44.4|39.64|
Our analysis reveals that direct-correspondence metrics, such as the mAP between translated labels and our gold translated labels, **do not correlate strongly** with downstream predictive performance; for example, PL outperforms 'No translation' and 'SN' in direct correspondence but results in poorer downstream performance.
<!--Our hypothesis the disagreement is that DC does not reflect how a translator fails. Different types of noise can impact the downstream detector differently, which is validated in image classification [4].-->
Since the translated labels are used to train the downstream detectors, we consider the downstream predictive performance as a more "straightforward" metric.
<!--[4] Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations-->
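As a quick quantitative check on the claim above, the rank correlation between the two columns of the table can be computed directly (a small script assuming `scipy`; the numbers are copied from the table):

```python
# Rank correlation between direct-correspondence mAP and downstream mAP
# for the five methods listed in the table above.
from scipy.stats import spearmanr

dc = [24.4, 6.5, 37.6, 42.8, 44.4]        # mAP vs. gold translated labels
dp = [37.38, 37.44, 36.09, 38.30, 39.64]  # downstream detector mAP (Table 1 mean)

rho, _ = spearmanr(dc, dp)
print(f"Spearman rho = {rho:.2f}")  # 0.60 here: a moderate, not strong, rank correlation
```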
**Suggestion: Bi-directional translation**
> Why does the network need to operate from source -> target only? Would it help if the solution of the mismatching problem operates in a bi-directional way?
Thanks for the great suggestion. This was also an idea we considered in our preliminary work. However, we found that a cycle-consistent bi-directional translator requires many additional custom tricks to train stably and consistently.
<!--Thanks for the great suggestion. This was actually also an idea we had considered in our preliminary work. However, we found that a cycle-consistent bi-directional translator to require many additional custom tricks to train with stability and consistency, whereas a pseudolabeling-based approach is easier to train and already competitive. In the spirit of proposing a simple and intuitive strategy, we opted for the current method. We envision future work on a consistent bi-directional translator. -->
### Reviewer's response
>Analysis of annotation mismatches: Thanks for conducting this experiment. I have two concerns: 1) Although informative, the analysis is still only one dataset. It would've been more informative if this was done at least on two more datasets on various domains to confirm the motivation. 2) According to the abstract (lines 3 - 5), the proposed method is translating labels. Nevertheless, the justification for this answer is using bounding box locations. I think I may be missing something. Either, labels in this paper mean "object class name" and "bounding box" or only "class names". If the focus is to show that the bounding boxes are the main problem, then the entire narrative is confusing and not clear. Please use directly bounding boxes for clarity. Note that I understood "label" as class names.
>Concerns of using downstream detector as the main evaluation metrics. I am still concerned about the shown "indirect" evaluation. At the end of the day, the proposed method is fixing labels (assuming a broader meaning of it, i.e., b-boxes, class labels, etc.). Why couldn't the evaluation include human "verification" and/or "revision"? If "revision", measure the number of times humans had to "fix" new labels. I think that is a very direct way of measuring the performance of the proposed approach.
### Author response
$\textbf{Dear Reviewer VrRz}$
Thank you for the detailed response and the additional clarification.
> It would've been more informative if this was done at least on two more datasets on various domains to confirm the motivation.
We appreciate your recognition of our additional analysis. We agree that performing similar analyses across more datasets and domains would further strengthen the motivation.
> I think I may be missing something. Either, labels in this paper mean "object class name" and "bounding box" or only "class names".
Thank you for highlighting your confusion. In this paper, the term "labels" pertains to both bounding boxes and class names. We will make this explicit, starting from the abstract.
> Why couldn't the evaluation include human "verification" and/or "revision"?
We acknowledge the potential value of human verification/revision. Nonetheless, we note that human verification conducted by crowd workers is itself imperfect [1]. Additionally, human verification could be subject to more confounding variables (e.g., verification instructions) than the evaluation of downstream ML models. Finally, our motivation lies in label translation's ability to enhance downstream performance on the target dataset, which we consider the most significant outcome in real-world applications.
[1] The False Promise of Imitating Proprietary LLMs, 2023
---
```
RRM: I don't think the reviewer is saying our method is bad, just that we need a
stronger justification why our problem is a real problem. The reviewer has also
given some suggestions at how to do this, e.g.,
> I think the paper is lacking statistics showing the percentages of the
> mismatches across many datasets representing an underlying distribution.
Instead, we should target this point and respond with:
"
great idea, we will do this for the final version. also these errors can
be a huge issue in practice where data sets are > 1 million images (and
a 1% error rate is massive --- we can't solve it in a couple hours).
The issue is also prevalent when using synthetic data because synthetic
data generators can typically produce infinite amounts of training data,
making this a persistent problem. Finally, we hope to expand the
justification and importance of this problem in revision.
"
The response below is answering the wrong point.
```
```
RRM: We should also have another sentence saying "Finally, we note
that simple (non-learning-based) methods (e.g., statistical normalization)
are not effective for addressing all of the different challenges as noted
in Table 1 of our paper."
```
We respectfully disagree with the reviewer that a learning-based method is a drawback. To clarify, our goal is to propose a **general-purpose label translator** that can handle any annotation mismatch. A learning-based approach can capture the hidden annotation mismatches based on the provided source and target datasets.
### Other simple baselines
> I can think of a solution using GPT or any other LLM solving only the label semantic problem. This can be done by creating prompts that can generate synonyms of classes, and just do a simple label "match". Another solution is to encode the classname using a text-encoder (e.g., CLIP or FLORENCE) and just do a NN assignment. Clearly, this will solve the mismatching problem w/o box refinement.
Thanks for the great suggestion, which we agree may be effective in handling the class-semantics problem in label translation. However, as we note in Section 3, class semantics is only one of four major issues; the other three specifically require bounding box refinement.
```
RRM: See the above my version of the response.
Reading the review, I feel that the reviewer is confused about the importance
of the problem or they think that class semantics is the only issue, that
there are not that many examples of the issue. I think this response would
also be strengthened by some numerical statement that bounding box problems
is the **majority** of translation errors. Can we provide a sentence?
For example, XX% of cars in Dataset are occluded, or there are 5000 constr.
vehicles in Dataset, all of which have bounding boxes that need correction.
```
The suggested approach only mitigates the **classification part** of the detection problem. Prior works [1,2] on detecting label errors in classification datasets can leverage strong features or predictions from foundation models (e.g., CLIP, FLORENCE, GLIP). However, the annotation mismatches in object detection go beyond the label-semantics problem, as pinpointed in Sec. 3. The proposed approach is a **general-purpose label translator** that aims to capture both localization and semantic mismatches.
[1] Confident Learning: Estimating Uncertainty in Dataset Labels, Curtis G. Northcutt et al., 2019.
[2] Detecting Corrupted Labels Without Training a Model to Predict, Zhaowei Zhu et al., 2022
> Lastly, if the proportion of mismatches is not that significant, I can also see humans doing the mapping in a few hours.
We emphasize that in a common scenario, we do not know (1) what annotation mismatches are present and (2) which labels contribute to them. Therefore, even if the proportion of mismatches is not significant, **humans still need to go through all labels**.
### Concerns of using downstream detector as the main evaluation metrics
> The performance improvement could be due to different factors including training tricks, learning scheduling, etc.
While we understand the reviewer's concerns about training tricks potentially confounding our analysis, we have taken great efforts to minimize these confounders by rigorously optimizing the hyperparameters for every baseline and architecture (see Appendix B for details) and reporting the multi-seed results. Specifically, we (1) adopt each detector's standard training tricks (e.g., gradient clipping and weight decay) in mmdetection [3], (2) sweep the learning rate and the loss weights, (3) report the performance of the best set of hyperparameters, and finally (4) report the average of three random seeds.
[3] MMDetection: Open MMLab Detection Toolbox and Benchmark, 2019
> I think the experiments need to directly evaluate the label correspondences between source and target datasets
This is a great suggestion and one that we actually tried early on in this work. However, our preliminary analysis revealed that direct correspondence (DC) metrics, such as the mAP between translated labels and manually corrected labels, **do not correlate strongly** with downstream predictive performance. In the following table, we measure the direct correspondences on nuScenes $\rightarrow$ 16k-nuImages:
```
RRM: The table is a bit confusing because it uses mAP making it seem like we are just
measuring downstream performance. Can we revise this table to mAP (DC) &
mAP (downstream performance), just 2 columns, where the second column is Table 1
repeated? Then we can directly show the correlation issue. Right now, a reader has
to look at this table then cross-ref with the paper to get the insight.
Once we have a cleaner table, we just need 1 sentence to show how this doesn't
correlate and then we can tell the reviewer we will put this in our revision.
```
| nuScenes $\rightarrow$ 16k-nuImages | AP@50 | AP@75 | mAP |
|-------------------------------------|-------|-------|------|
| No translation | 56.2 | 18.2 | 24.4 |
| SN | 24.3 | 1.6 | 6.5 |
| PL | 63.2 | 36 | 37.6 |
| PL & NF | 71.9 | 39.6 | 42.8 |
| LGPL (Ours) | **75** | **44.8** | **44.4** |
Comparing the above table with Table 1, we find that DC disagrees with the downstream detector performance. For example, 'PL' outperforms 'No translation' by 13.2 on DC, yet 'PL' leads to a downstream detector that is 1.29 points worse than 'No translation'.
Our hypothesis for this disagreement is that DC does not reflect how a translator fails. Different types of noise can impact the downstream detector differently, which is validated in image classification [4].
Since the translated labels are used to train the downstream detectors, we consider the performance of the downstream detectors as a more "straightforward" metric.
[4] Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations, Jiaheng Wei et al., 2022
Note: to measure the direct correspondences, we randomly sub-sample 91 images from nuScenes and re-annotate them based on the nuImages annotation specification. We show a comparison of the original and re-annotated nuScenes annotations in the rebuttal PDF.
### Suggestion: Bi-directional translation
> Why does the network need to operate from source -> target only? Would it help if the solution of the mismatching problem operates in a bi-directional way?
Thanks for the great suggestion. This was actually also an idea we had considered in our preliminary work. However, we found that a cycle-consistent bi-directional translator to require many additional custom tricks to train with stability and consistency, whereas a pseudolabeling-based approach is easier to train and already competitive. In the spirit of proposing a simple and intuitive strategy, we opted for the current method. We envision future work on a consistent bi-directional translator.
```
RRM: below is very detailed, I think a simple response is okay --- see above.
```
We thank the reviewer for the suggestion. While introducing cycle-consistency loss sounds attractive, there are several limitations:
The integration of a cycle-consistency loss necessitates that the employed detector be end-to-end differentiable, limiting the design space to transformer-based object detectors, e.g., DETR or Deformable DETR. Furthermore, in our preliminary experiments, we found that a CycleGAN-like framework leads to high inconsistency and instability in its predictions, resulting in poor performance of the downstream detectors.
---
```
Version before Aug. 7
```
# Dear Reviewer 89LB,
We sincerely appreciate your time in reviewing our work. We address your concerns below.
## The similarity requirement between the source and the target dataset
> the proposed approach would require the source and target datasets to be already very similar to each other, which limits its use, e.g. it cannot merge together COCO and Cityscapes.
1. **Leveraging similar source datasets is a widely-accepted strategy in many ML research areas.**
- Domain adaptation leverages similar datasets, such as Cityscapes and Synscapes.
2. **This work assumes a shared class label space but does not assume a shared image distribution.**
- Having a shared class label space is a mild assumption.
- By selectively pruning the labels based on class, it is easy to get a source dataset that has the same class label space.
3. **Adaptation to partially aligned class label spaces:**
- Even in scenarios where class label spaces are only partially aligned, our proposed approach remains applicable. Specifically, the method can effectively target the overlapping classes while preserving the integrity of the remaining source labels.
## Lack of ablation/comparison with a target-trained RPN and label translator
> the authors should compare to the baseline where both the RPN and Label Translator are trained on the target datasets
- The target-trained RPN and label translator baseline is included in the submission manuscript, named pseudo-labeling (PL) in Table 1.
- The following tables from Table 1 highlight that detectors trained on LGPL's labels outperform those trained on PL's labels.
| nuScenes $\rightarrow$ 16k nuImages | Faster-RCNN-RN50 | YOLOv3 | Def-DETR | Mean |
|--------------------------|------------------|--------|----------|-------|
| PL | 40.49 | 28.67 | 39.12 | 36.09 |
| LGPL (ours) | **42.6** | **34.8** | **41.52** | **39.64** |
| Synscapes $\rightarrow$ Cityscapes | Faster-RCNN-RN50 | YOLOv3 | Def-DETR | Mean |
|-------------------------|------------------|--------|----------|-------|
| PL | 37.88 | 28.86 | 30.67 | 32.47 |
| LGPL (ours) | **39.71** | **29.29** | **34.45** | **34.48** |
## Unclear explanation on Eq. 2 and 3
We appreciate the reviewer's feedback and will explain more clearly in the method section, especially the class-conditional IoU-based assignment in Eq. 2 and 3.
# Dear Reviewer AfT805,
We sincerely appreciate your time in reviewing our work. We address your concerns below.
## The contributions are not enough
> I think this contribution is not enough
We respectfully disagree with the reviewer. We would like to emphasize the substantial contributions that our paper brings to the field:
1. **New insights and observations**
- Annotation mismatches are pervasive but often overlooked
- Introduce the taxonomy of the annotation mismatches
- Mitigating annotation mismatches can be as important as aligning image features (shown in Fig. 4b)
2. Technical contributions
- **LGPL is a general-purpose label translator** that tackles all types of annotation mismatches found in this work.
- **LGPL is data-centric:** LGPL can be considered as a preprocessing step that improves across many downstream detectors.
- LGPL's efficacy is validated on **three downstream detector architectures and four scenarios encompassing seven datasets**.
3. Broader impacts
- **LGPL is simple and easily extendible:** LGPL is built upon the widely-investigated two-stage object detector, making it easy to extend.
We believe that these contributions collectively demonstrate the significance and novelty of our work, offering a valuable addition to the existing body of research in the domain.
## Baselines achieve competitive results?
> and it can also be seen in the experiment that using pseudo labels and noise filtering methods can achieve competitive results.
- We emphasize that the proposed approach, LGPL, is the **only** approach that consistently improves over `No translation` across every scenario and detector.
- The improvements of PL and PL & NF baselines are inconsistent. The following tables from Table 1 show the *performance differences w.r.t.* `No translation`.
- **Broader impact:** This implies that LGPL can potentially be a go-to option when combining multiple detection datasets with a shared label space.
| nuScenes $\rightarrow$ 16k nuImages | Faster-RCNN-R50 | YOLOv3 | Def-DETR |
|-------------------------------------|-----------------|--------|----------|
| PL | -0.76 | -2.57 | -0.53 |
| PL & NF | -0.57 | +2.02 | +1.32 |
| LGPL (Ours) | **+1.35** | **+3.56** | **+1.87** |
| MVD-cyclist + nuImages-cyclist $\rightarrow$ Waymo-cyclist | Faster-RCNN-R50 | YOLOv3 | Def-DETR |
|-----------------------------------------------------------|-----------------|--------|----------|
| PL | -1.56 | +0.01 | -3.07 |
| PL & NF | -0.45 | +1.57 | +1.57 |
| LGPL (Ours) | **+1.61** | **+2.88** | **+1.74** |
## Experimental parts are not sufficient to support the conclusions
> The experimental part of the article is slightly weak and insufficient to support the conclusions elaborated in the article
We make the following conclusions in the manuscript and expand on them in detail here:
1. LGPL can serve as a **pre-processing step** in existing training workflows
- This is achieved by design.
- All experiments presented in this work consistently treat LGPL, alongside the various baselines, as a pre-processing stage.
2. LGPL improves downstream detectors **with different architectures and training objectives**.
- In Table 1, we evaluate LGPL and the baselines with Faster-RCNN, YOLOv3, and Def-DETR, and each has its own loss function and training strategies.
- Table 1 shows that LGPL is the only approach that consistently improves over `No translation.`
3. **LGPL's superiority over off-the-shelf domain adaptation techniques:**
- As shown in Fig. 4b, we find that aligning annotation mismatches via LGPL is competitive or better than aligning image features via DANN [1] and CycConf [2].
We do acknowledge that there is room for improvement and would be more than happy for the reviewer to point out specific concerns.
[1] Domain-adversarial training of neural networks, Yaroslav Ganin et al., 2016
[2] Robust object detection via instance-level temporal cycle confusion, Xin Wang et al., 2021
## Open-set problem
> How to solve the open-set problem among annotations in training and inference process?
We appreciate the reviewer's thoughtful question, which we interpret as inquiring about the broader scope of label translation with any class labels available during training and testing. While this perspective is indeed valuable, we would like to clarify the specific focus of our work.
- This work primarily centers on the task of translating labels between the provided source and target datasets, where the class label sets are predefined and known. The central goal of this study is to address the challenges associated with label mismatches within this context.
While the idea of a multi-task label translator that accommodates a wider range of class labels is intriguing, it extends beyond the primary scope of this work. As such, we concentrated on the specific challenge of translating labels between the established class label sets of the source and target datasets.
## Four annotations mismatches in Sec.3 vs. the proposed LGPL
> In response to the four questions raised in the Sec.3, what are the designs for the methods proposed in this paper?
The proposed method serves as a **general-purpose label translator** and is not tailored to certain types of annotation mismatch.
It is important to have a general-purpose label translator since the types of annotation mismatch are often unknown.
# Dear Reviewer 1wmh,
We sincerely appreciate your time in reviewing our work. We address your concerns below.
## Lack of ablation/comparison with a target-trained RPN and label translator
> The proposed method separately trains the RPN and RoI on the source and target datasets. What is the performance that uses the detector trained on the target to refine the gt bboxes of the source dataset?
- The target-trained RPN and label translator baseline is included in the submission manuscript, named pseudo-labeling (PL) in Table 1.
- The following tables from Table 1 highlight that detectors trained on LGPL's labels outperform those trained on PL's labels.
| nuScenes $\rightarrow$ 16k nuImages | Faster-RCNN-RN50 | YOLOv3 | Def-DETR | Mean |
|--------------------------|------------------|--------|----------|-------|
| PL | 40.49 | 28.67 | 39.12 | 36.09 |
| LGPL (ours) | **42.6** | **34.8** | **41.52** | **39.64** |
| Synscapes $\rightarrow$ Cityscapes | Faster-RCNN-RN50 | YOLOv3 | Def-DETR | Mean |
|-------------------------|------------------|--------|----------|-------|
| PL | 37.88 | 28.86 | 30.67 | 32.47 |
| LGPL (ours) | **39.71** | **29.29** | **34.45** | **34.48** |
## The class label spaces of the source and the target dataset need to be the same
> The proposed method only works for the case where label spaces between source and target are similar.
1. **This work assumes a shared class label space but does not assume a shared image distribution.**
- Having a shared class label space is a mild assumption.
- By selectively pruning the labels based on class, it is easy to get a source dataset that has the same class label space.
2. **Adaptation to partially aligned class label spaces:**
- Even in scenarios where class label spaces are only partially aligned, our proposed approach remains applicable. Specifically, the method can effectively target the overlapping classes while preserving the integrity of the remaining source labels.
## Other works [1,2,3] that discuss annotation inconsistency
> Other works [1,2,3] are also for annotation inconsistency. It would be better to give a further comparison.
## Misc
> The performance gain compared with `No Translation` baseline is limited.
- We emphasize that the proposed approach, LGPL, is the **only** approach that consistently improves over `No translation` across every scenario and detector.
- The improvements of PL and PL & NF baselines are inconsistent. The following tables from Table 1 show the *performance differences w.r.t.* `No translation`.
- **Broader impact:** This implies that LGPL can potentially be a go-to option when combining multiple detection datasets with a shared label space.
| nuScenes $\rightarrow$ 16k nuImages | Faster-RCNN-R50 | YOLOv3 | Def-DETR |
|-------------------------------------|-----------------|--------|----------|
| PL | -0.76 | -2.57 | -0.53 |
| PL & NF | -0.57 | +2.02 | +1.32 |
| LGPL (Ours) | **+1.35** | **+3.56** | **+1.87** |
| MVD-cyclist + nuImages-cyclist $\rightarrow$ Waymo-cyclist | Faster-RCNN-R50 | YOLOv3 | Def-DETR |
|-----------------------------------------------------------|-----------------|--------|----------|
| PL | -1.56 | +0.01 | -3.07 |
| PL & NF | -0.45 | +1.57 | +1.57 |
| LGPL (Ours) | **+1.61** | **+2.88** | **+1.74** |
> experiments are only conducted on Streetview datasets. Can this method be applied to other common detection datasets, like COCO and Object365?
- To thoroughly validate the effectiveness of the proposed approach, we conduct experiments on **four scenarios from seven detection datasets**.
- The proposed approach does not make any assumption on the domain of the dataset.
- We expect it to work on common detection datasets as well.
# Dear Reviewer VrRz02,
## Justification of a learning-based approach
> The drawbacks of the proposed network is thatd it needs to be trained and many other training tricks (e.g., learning-rate scheduling, augmentations, etc.) may be needed to make it work. I think the paper would've been stronger if it included a discussion justifying that a network is really needed.
While we respectfully disagree with the reviewer that a learning-based method is a drawback, we understand the lack of clarity and will emphasize this point in the manuscript.
- Our goal is to capture the **unknown annotation mismatches** between the source and the target datasets. We emphasize the benefits of a learning-based approach in this context:
- **Data-driven**: A learning-based approach allows us to capture the hidden annotation mismatches from the data with little human effort.
- **Scalability**: While a machine learning engineer might discover annotation mismatches by eyeballing a subset of the dataset, this does not scale to large and diverse datasets.
## Other simple baselines
> I can think of a solution using GPT or any other LLM solving only the label semantic problem. This can be done by creating prompts that can generate synonyms of classes, and just do a simple label "match". Another solution is to encode the classname using a text-encoder (e.g., CLIP or FLORENCE) and just do a NN assignment. Clearly, this will solve the mismatching problem w/o box refinement.
- The suggested approach only mitigates the **classification part** of the detection problem.
- Leveraging strong features or predictions from the foundation models (e.g., CLIP, FLORENCE, GLIP) can mitigate the label semantic problem, which works well in the classification problem [1, 2].
- However, the annotation mismatches in object detection go beyond the label semantic problem, as pinpointed in Sec 3.
- The proposed approach is a **general-purpose label translator** that aims to capture both localization and semantic mismatches.
[1] Confident Learning: Estimating Uncertainty in Dataset Labels, Curtis G. Northcutt et al., 2019.
[2] Detecting Corrupted Labels Without Training a Model to Predict, Zhaowei Zhu et al., 2022
> Lastly, if the proportion of mismatches is not that significant, I can also see humans doing the mapping in a few hours.
- Even if the proportion of the mismatches is not significant, **humans still need to go through all labels** since we do not know which labels suffer from the annotation mismatches beforehand.
- Furthermore, if the proportion of the mismatches is not significant, humans tend to suffer from confirmation biases [3].
[3] Confirmation Bias: A Ubiquitous Phenomenon in Many Guises, Raymond S. Nickerson, 1998
## Concerns of using downstream detector as the main evaluation metrics
> The performance improvement could be due to different factors including training tricks, learning scheduling, etc.
We understand the reviewer's concerns; this is an issue for many ML projects, especially as ML systems become increasingly complex.
We minimize the risk of confounding factors by performing an extensive hyper-parameter sweep:
- For each set of translated labels, we train the downstream detectors with the following steps:
- Adopt each detector's standard training tricks (e.g., gradient clipping and weight decay) in mmdetection [4]
- Sweep the learning rate and the loss weights. Please refer to Table 3 in the appendix for the search range.
- Report the performance of the best set of hyper-parameters
- Average over three random seeds
[4] MMDetection: Open MMLab Detection Toolbox and Benchmark, 2019
> I think the experiments need to directly evaluate the label correspondences between source and target datasets
- Measuring direct correspondences might be **misleading**.
- Given the direct correspondences, i.e., perfectly translated source labels, we can measure the standard mAP for object detection. However, **mAP does not reflect "how" the translation fails**, which might impact the downstream detectors differently.
- Prior work [5] validates this point in image classification by comparing human and synthetic label noise.
- Since the translated labels are used to train the downstream detectors, we consider the performance of the downstream detectors as a more "straightforward" metric.
We understand the concerns the reviewer raised. While we argue that downstream performance is a better evaluation metric, we measure the direct correspondences on nuScenes $\rightarrow$ 16k-nuImages:
- Evaluating **direct correspondences (DC)**
| nuScenes $\rightarrow$ 16k-nuImages | AP@50 | AP@75 | mAP |
|-------------------------------------|-------|-------|------|
| No translation | 53.4 | 20.2 | 25.9 |
| SN | 20.2 | 7 | 5.2 |
| PL | 62.5 | 30.9 | 34.3 |
| PL & NF | 70 | 36.4 | 40 |
| LGPL (Ours) | **73.4** | **48.9** | **44** |
- Evaluating **downstream performances (DP)** (shown in Table 1)
| nuScenes $\rightarrow$ 16k-nuImages | Mean |
|-------------------------------------|-------|
| No translation | 37.38 |
| SN | 37.44 |
| PL | 36.09 |
| PL & NF | 38.3 |
| LGPL (Ours) | **39.64** |
- **Direct correspondences (DC) and downstream performances (DP) disagree**
- From the two tables above, we find that the DC metric places `PL` above `No translation` and `SN`, while the DP metric positions `PL` below `No translation` and `SN`.
- We are motivated to enhance downstream detector performance through label translation; as such, DP more truthfully reflects the impact on the targeted problem.
- Direct correspondence annotation details
- We randomly sub-sample 45 images from nuScenes and re-annotate them based on the nuImages annotation specification. We show a comparison of the original and re-annotated nuScenes annotations in the rebuttal pdf file.
[5] Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations, Jiaheng Wei et al., 2022
## Suggestion: Bi-directional translation
> Why does the network need to operate from source -> target only? Would it help if the solution of the mismatching problem operates in a bi-directional way?
We thank the reviewer for the suggestion. While introducing cycle-consistency loss sounds attractive, there are several limitations:
1. The requirement for end-to-end differentiability:
- The integration of cycle-consistency loss necessitates that the employed detector be end-to-end differentiable.
- Consequently, our design options are constrained to transformer-based object detectors, e.g., DETR or Deformable DETR.
2. Practical limitations based on preliminary experiments:
- In our preliminary experiments, we adopted a CycleGAN-like framework and found high inconsistency and instability in its predictions, leading to poor performance of the downstream detectors.