## Common Response
We thank all reviewers for their comments and constructive suggestions. We would like to remark on our technical novelty as a common response.
In this study, we have three modules integrated together to address the inconsistency issues in SSOD.
**Adaptive Sample Assignment** addresses the label-shifting phenomenon caused by the extra static label assignment in SSOD. Static label assignment breaks an important property of semi-supervised classification, where the instance-level pseudo-label satisfies
$$\hat{\mathbf y}^u = \mathop{\mathrm{argmin}}_{\mathbf y}\mathcal L(f_t(\mathbf x^u), \mathbf y)$$
meaning that the one-hot pseudo-label $\hat{\mathbf y}^u$ can be reapplied to the teacher model and aligns with the teacher's own prediction. This property is critical to semi-supervised learning, as it induces no bias or label noise even in the simplest scenario where the student model is identical to the teacher.
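In the classification case, this minimizer over one-hot labels is simply the argmax of the teacher's predicted distribution. A minimal NumPy sketch, purely for illustration:

```python
import numpy as np

def pseudo_label(teacher_probs):
    """One-hot pseudo-label that minimizes cross-entropy against the
    teacher's own prediction, i.e. the argmax of the distribution."""
    y = np.zeros_like(teacher_probs)
    y[np.argmax(teacher_probs)] = 1.0
    return y

def cross_entropy(probs, one_hot):
    """Cross-entropy of a one-hot label against predicted probabilities."""
    return -(one_hot * np.log(probs)).sum()
```

Feeding this label back to the same model leaves the loss at its minimum, which is exactly the alignment property that static assignment in SSOD breaks.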
However, this rule is easily broken in most previous SSOD methods, which adopt heuristic and static anchor assignments: the labels assigned to anchors differ greatly from the model's own predictions, which is the root of the pseudo-label drifting phenomenon in `Fig. 1`.
Therefore, we propose to assign anchors that minimize the loss on unlabeled images
\begin{align}
\min_{a_1, \cdots, a_N} \sum_{n=1}^N \Big[\mathcal{L}_{cls}\big(f_s(\mathbf x^u)_n, \hat{\mathbf y}_{a_n}^u\big) + \mathcal{L}_{reg}\big(f_s(\mathbf x^u)_n, \hat{\mathbf y}_{a_n}^u\big)\Big]
\end{align}
where $n$ is the anchor index, $a_n \in \{1, 2, \cdots, L+1\}$ is the index of the assigned pseudo-bbox among the $L$ predicted bboxes, and the index $L+1$ represents the background label.
In this way, when the pseudo-bboxes are applied back to a student model identical to the teacher, the newly assigned labels align with the model's own predictions.
This dynamic assignment strategy, dubbed adaptive sample assignment (ASA), is devised to reduce the noise in the pseudo targets of anchors for semi-supervised object detection. We find that ASA takes a formulation similar to that proposed for supervised models such as [1], and thus apply ASA to both labeled and unlabeled images to build a unified detector for SSOD.
In summary, although the ASA module is similar in form to [1], which is used in supervised object detection, it is motivated differently to solve the unique pseudo-label drifting problem in SSOD. Moreover, we have compared the performance of ASA under both supervised and semi-supervised settings in `Tab. 4`, and found that the performance gain of ASA in SSOD is almost twice that in the supervised setting. The extra improvement comes from suppressing the label noise induced by static anchor assignment.
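For intuition, the assignment objective above can be sketched as a per-anchor cost minimization. This is a simplification: the actual ASA may use a global matching rather than an independent argmin per anchor, and the array shapes below are assumptions of this sketch.

```python
import numpy as np

def adaptive_sample_assign(cls_loss, reg_loss, bg_cls_loss):
    """Per-anchor loss-minimizing assignment (simplified ASA sketch).

    cls_loss, reg_loss: (N, L) losses of each of N anchors against each
    of L pseudo-bboxes; bg_cls_loss: (N,) loss of predicting background.
    Returns a_n in {0..L-1} for a pseudo-bbox, or L for background
    (the paper's index L+1, zero-based here).
    """
    cost = np.concatenate([cls_loss + reg_loss,
                           bg_cls_loss[:, None]], axis=1)  # (N, L+1)
    return cost.argmin(axis=1)
```

An anchor whose background loss is smaller than its best foreground cost is assigned background, so the resulting targets always agree with the model's own lowest-loss predictions.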
**Novelty in Feature alignment**. The feature alignment between classification and regression also addresses a problematic pseudo-bbox evaluation protocol that is unique to SSOD and absent from semi-supervised classification. Relying purely on the confidence score to output pseudo-bboxes yields unsatisfactory bboxes that are sensitive to input augmentations and model weight changes, and tend to oscillate during training. Better alignment of the cls-reg tasks calibrates the classification score with the bbox quality. No prior work in SSOD applies a feature alignment module to reduce the gap between high classification confidence and accurate bounding box prediction. Besides, [2] restricts its feature re-sampling to each single scale, whereas alignment across multiple feature scales (FAM-3D) has not been introduced before, even in the general object detection task. Our 3D feature alignment is indeed new. As the first attempt to address the cls-reg inconsistency, the 1.0% mAP improvement (including the 0.4% mAP extra gain of FAM-3D) on MS-COCO should be enough to highlight its significance.
We have uploaded a new demo video in the `Supplementary Material` to illustrate the superiority of the FAM-3D module. It shows that, without FAM-3D, the detector presents high-score but low-quality noisy predictions.
**Novelty in GMM-based thresholding**. We make the first attempt to introduce a GMM to dynamically adjust the pseudo-label threshold in SSOD. It reduces the inconsistency over the long training process in which some labels are first treated as background but later as foreground, which incurs both inefficient learning and confirmation bias in SSOD. This GMM module is significant as it frees us from tedious finetuning of the threshold hyperparameter and also brings a performance boost.
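As an illustration of the idea (not our exact implementation), one can fit a two-component 1-D Gaussian mixture to the confidence scores with a few EM steps and derive the threshold from the fitted components. The EM details and the final threshold rule below are assumptions of this sketch:

```python
import numpy as np

def gmm_threshold(scores, iters=50):
    """Fit a two-component 1-D Gaussian mixture to confidence scores
    via EM, then take the midpoint of the two component means as a
    dynamic pseudo-label threshold (an illustrative rule only)."""
    scores = np.asarray(scores, dtype=float)
    mu = np.array([scores.min(), scores.max()])      # init: extreme scores
    sigma = np.array([scores.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        dens = pi * np.exp(-0.5 * ((scores[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means, and std deviations
        nk = resp.sum(axis=0)
        mu = (resp * scores[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (scores[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(scores)
    return mu.mean()  # midpoint between the noise and reliable clusters
```

Because the mixture is refitted as training proceeds, the threshold rises together with the model's confidence instead of staying at a hand-tuned constant.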
Therefore, the three modules are closely tied to the SSOD setting and are integrated to solve the inconsistency issues. We achieve truly compelling improvements over past SSOD methods, with 40.0 mAP with 10% COCO labels and 47.2 mAP with COCO additional data. We believe this work provides another perspective for analyzing the unique inconsistency problems in SSOD and thus contributes to the community.
[1] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. arXiv 2021.
[2] Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. Tood: Task-aligned one-stage object detection. ICCV 2021.
# Reviewer tBiG
We thank the reviewer for acknowledging our problem setup, and for considering our experiments comprehensive and convincing.
```>>> Q1``` The ASA algorithm is not applicable to anchor-free object detections.
```>>> A1``` **Our ASA generalizes to both anchor-based and anchor-free detectors**.
Actually, anchor-free object detectors differ from anchor-based methods in the definition of `anchors`. Anchor-based methods predefine a set of anchor boxes on each feature map, while anchor-free methods regress bboxes from the center points of feature maps.
In this study, we unify the notation of both **anchor points** in anchor-free detectors and **anchor boxes** in anchor-based detectors as our `anchor` (`Footnote on page 4`). Our ASA algorithm is designed for adaptive anchor assignment and can be applied in both anchor-based and anchor-free methods by simply replacing the assignment modules.
```>>> Q2.1``` How does FAM-3D module reduce the pseudo-label inconsistency?
```>>> A2.1``` We thank the reviewer for the question. In SSOD, bboxes are evaluated only according to their classification scores. Unfortunately, the classification score does not necessarily reflect the quality of the bounding box, resulting in false positives whose scores exceed the threshold but whose bboxes are unsatisfactory.
In FAM-3D, the regression branch is enhanced, preventing high-confidence predictions with poor bounding box regression. Suppose we have a box with a confidence score above the threshold: a plain detection head may produce erroneous regression results, while the FAM-3D module provides a more accurate bounding box through flexible feature selection. In return, such high-score, high-quality pseudo boxes further refine the student with calibrated predictions. We have uploaded a new demo video in the `Supplementary Material` for your reference.
```>>> Q2.2``` How do we know that the improvement shown in Table 5 is solely due to the better consistency brought by FAM-3D?
```>>> A2.2``` Thanks for your question. As illustrated in `Figure 6`, FAM-3D reduces the mis-calibration between classification score and bbox IoU, such that the classification score serves as a better indicator for pseudo box filtering.
`Table 5` also compares the relative improvement of FAM-3D under both semi-supervised and fully-supervised settings. The improvement under the semi-supervised setting is almost twice that under the fully-supervised setting, implying the extra benefit of FAM-3D in SSOD.
```>>> Q2.3``` Is the process in Eq. 4 differentiable?
```>>> A2.3``` Yes, both Eq. 3 and Eq. 4 are fully differentiable. For $0 \leq d_2\leq 1$, the features $\mathbf P(:, :, l+1)$ at level $l+1$ are rescaled to the same size as $\mathbf P(:, :, l)$, and then a weighted average of the resized $\mathbf P(:, :, l+1)$ and $\mathbf P(:, :, l)$ is computed according to $d_2$ to interpolate at non-integer offsets. For example, when $d_2=0.5$, we re-sample as
$$\mathbf P'(i, j, l) = 0.5 \times \text{resize}(\mathbf P)(i, j, l+1) + 0.5 \times \mathbf P(i, j, l)$$
By the way, the notation $\mathbf P$ is adopted to represent a pyramid of featuremaps, with $\mathbf P(i, j, l)$ standing for the feature at planar coordinate $(i, j)$ ($1\le i\le H_l$ and $1 \le j \le W_l$) in the $l^{\mathrm{th}}$ pyramid level.
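A minimal numerical sketch of this cross-scale re-sampling step, using nearest-neighbor upsampling as the `resize` operator (an assumption of this sketch; the actual resize in the paper may differ, and every operation here is differentiable with respect to the features and $d_2$):

```python
import numpy as np

def resample_across_levels(P_l, P_lp1, d2):
    """Interpolate between two adjacent pyramid levels.

    P_l: (H, W) feature map at level l; P_lp1: (H/2, W/2) map at level
    l+1; 0 <= d2 <= 1 is the fractional level offset. Implements
    P'(i, j, l) = d2 * resize(P)(i, j, l+1) + (1 - d2) * P(i, j, l).
    """
    # Nearest-neighbor upsample level l+1 to the size of level l
    resized = np.repeat(np.repeat(P_lp1, 2, axis=0), 2, axis=1)
    return d2 * resized + (1.0 - d2) * P_l
```

With $d_2=0.5$ this reproduces the equal-weight average given in the example above.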
```>>> Q3``` A principled method to address the pseudo-label drifting issue.
```>>> A3``` Thanks for the nice question. All designs are bound to a single core problem: "**How can we reduce the pseudo-label inconsistency in SSOD?**", as depicted in the common response. In summary, ASA reduces the target shifting with consistent assignment; FAM-3D calibrates the classification score with the bbox accuracy; GMM alleviates the fluctuation of the confidence score, thus stabilizing the number of pseudo targets during SSOD training. We have also elaborated our motivation for the adaptive anchor assignment method to solve this issue.
# Reviewer eai7
We thank the reviewer for the constructive comments and would like to address them as follows.
`>>> Q1` Technical novelty
`>>> A1` We thank Reviewer eai7 for the question. Our technical contributions are described in the common response; we summarize them again:
1. **Novelty in the problem definition**. We believe that our biggest contribution is the formal introduction of three inconsistency problems in SSOD. We define them, point out their existence and bring up new solutions.
2. **Novelty in adaptive sample assignment**. Our adaptive sample assignment is specially designed for SSOD. The pseudo-label drifting issue is explained in the general reply (also see `Figure 1` in the paper), and we have elaborated our motivation for the adaptive anchor assignment method to solve this issue. We will revise our submission to highlight this point.
Coincidentally, our approach shares a similar form with the standard adaptive assignment in the fully supervised scenario, but the two methods are designed with different problem setups and motivations.
3. **Novelty in Feature alignment**. No prior work in SSOD applies a feature alignment module to reduce the gap between classification confidence and bounding box accuracy. Besides, feature alignment across multiple feature scales (FAM-3D) has not been introduced before, even in the general object detection task. Our 3D feature alignment is indeed new.
4. **Novelty in GMM-based thresholding**. We make the first attempt to introduce a GMM to dynamically adjust the pseudo-label threshold in SSOD.
In sum, all of our problem setups and techniques are introduced for the first time in the SSOD task. Our proposed methods bring about a ~3 mAP improvement on the MS-COCO dataset, which should also be considered a strong and practical contribution. We kindly hope the reviewer will take our novelty into account.
`>>> Q2` Relations between the introduced modules
`>>> A2` Thanks for the question. All designs are bound to a single core problem: "**How can we reduce the inconsistency in SSOD?**" ASA reduces the target shifting with consistent assignment; FAM-3D calibrates the classification score with the bbox accuracy; GMM alleviates the fluctuation of the confidence score, thus stabilizing the number of pseudo targets during SSOD training. The three modules address three different aspects of a single problem, with sufficient evidence (`Fig. 3-6`) and significant improvements (`Table 4`, `Fig. 7-8`).
`>>> Q3` Writings improvement
`>>> A3` We sincerely thank the reviewer for the suggestion. We will soon revise our manuscript for better presentation and language quality. Stay tuned.
# Reviewer 6uS6
We sincerely thank the reviewer for the insightful and constructive comments. We are glad that the reviewer finds our work well written and clearly presented. The concerns are fully addressed as follows.
```>>> Q1``` Highlight the novelty and contributions
```>>> A1``` We thank the reviewer for the question. Our methodological novelty and contributions are summarized below:
1. **Novelty in the problem definition**. We believe that our biggest contribution is the formal introduction of three inconsistency problems in SSOD. We define them, point out their existence and bring up new solutions. A good question is worth a million good answers.
2. **New solution to improper pseudo-label assignment**. Our adaptive sample assignment is specially designed for SSOD. The pseudo-label drifting issue is explained in the general reply (also see `Figure 1` in the paper), and we have elaborated our motivation for the adaptive anchor assignment method to solve this issue. We have revised our submission to highlight this point.
Coincidentally, our approach shares a similar form with the standard adaptive assignment [1][2][3] in the fully supervised scenario, but our method is designed with a completely different problem setup and motivation. We are also surprised to find that our ASA, though simple, sufficiently addresses the assignment shifting in SSOD.
3. **Novelty in Feature alignment**. No prior work in SSOD applies a feature alignment module to reduce the gap between high classification confidence and accurate bounding box prediction. Besides, [4] restricts its re-sampling to each single scale, while alignment across multiple feature scales (FAM-3D) has not been introduced before, even in the general object detection task. Our 3D feature alignment is indeed new. As the first attempt to address the cls-reg inconsistency, the 0.3%-0.4% mAP improvement on MS-COCO should be enough to highlight its significance.
4. **Novelty in GMM-based thresholding**. We take the first attempt to introduce the GMM model to dynamically adjust the pseudo-label threshold in SSL.
In sum, all of our problem setups and techniques are introduced for the first time in the SSOD task. Our proposed methods bring about a ~3 mAP improvement on the MS-COCO dataset, which should also be considered a strong contribution. We kindly hope the reviewer will take our contribution into account.
[1] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. CVPR 2021.
[2] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. ECCV 2020.
[4] Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. Tood: Task-aligned one-stage object detection. ICCV 2021.
# Reviewer 4adR
```>>> Q1``` Definition of j in equation 2
```>>> A1``` Sorry for the typo. $C_{ij}$ in Eq. 2 should be written as $C_{il}$, which represents the matching cost between the $i$-th anchor/feature point and the $l$-th gt bbox. We have revised Eq. 2 accordingly.
```>>> Q2``` Effect of the $\lambda_{dist}$.
```>>> A2``` We thank R4adR for the question. Through our experiments, $\lambda_{dist}=0.001$ serves to stabilize training as a weak centerness prior. We provide the results for $\lambda_{dist}\in \{0.001, 0.002, 0.01\}$.
For $\lambda_{dist}<0.001$, the assignment is highly unstable, especially at the beginning of training when the matching cost is high and inaccurate, resulting in a very low mAP.
When $\lambda_{dist}$ is large, the centerness prior cancels out the performance benefit of our proposed ASA.
| $\lambda_{dist}$ |<0.001 | 0.001 |0.002 |0.01|
|--|--|--|--|--|
| mAP | 35.0$\pm$ 5 | 40.0 |39.8| 39.2|
```>>> Q3``` Name for adaptive thresholding
```>>> A3``` Thanks for the suggestion. We have revised the name from `temporal consistency` to `thresholding consistency`.
```>>> Q4``` Connection between figure 1 and the three contributions
```>>> A4``` We thank the reviewer for the question. We believe that `Figure 1` highlights our three motivations clearly.
1. **Assignment Inconsistency**. In the right panels, the IoU-based strategy may make erroneous assignments, with a bbox even assigned to an anchor far away from the real object and carrying a high loss value (red dots far from the polar bear). Our ASA, on the other hand, consistently assigns labels to the features close to the object center with low matching cost.
2. **Cls-Reg inconsistency**. In the left panels, the classification loss quickly decreases and overfits, while the regression objective cannot be properly optimized, showing the imbalanced optimization of the two tasks. Our feature alignment re-samples the features for the regression branch, thus balancing the two tasks with faster convergence.
3. **Temporal Inconsistency**. In the right panels, more and more bounding boxes are predicted above a fixed score threshold as training goes on, so false positives are inevitable to some extent. For example, the bounding boxes of the polar bear undergo constant updates throughout training, even with multiple pseudo labels for the same object.
Our GMM approach (Row 2), on the other hand, reduces the abnormal pseudo targets, as the increasing adaptive threshold (shown in `Fig. 5`) suppresses false predictions and thus yields superior performance.
`Figure 1` thus clearly states all our motivations. If anything is still unclear, please leave messages here and we will try our best to address them.