# R1 (3)

> **Q1: Class-wise prototype generation could lead to overfitting issues.**

(Korean draft) A1: Class-wise prototypes contain discriminative information, so they actually mitigate overfitting to non-discriminative information. Moreover, because the prototypes are refined, they are not fixed to a specific point, which also makes the method robust to overfitting. In fact, extensive experimental results show that ProtoReg is far more robust to overfitting than CE:
* Table 1: notable performance gains in the limited data regime, which is vulnerable to overfitting (e.g., using only 15% of the training data)
* Table 5: test accuracy keeps improving without saturating as training epochs increase
* Figure 13: training accuracy saturates much more slowly, and validation accuracy keeps improving

The scenario we address is basically: 1) a pretrained encoder is given, 2) and it is transferred to a target task. 3) If the transferred model were then reused for yet another task, focusing on fine-grained information could indeed be disadvantageous, as the reviewer says, 4) but here the goal is to optimize for the task itself, so this cannot be called overfitting. The essence of transfer learning is tuning the model to be optimal for the downstream task. Since the model only needs to work well on that task, focusing on the discriminative information needed to solve it cannot be regarded as leading to overfitting. Rather, our method should be understood as mitigating overfitting to non-discriminative information by concentrating on discriminative information. Also, considering that the prototypes progressively evolve to better capture discriminative information, the method is robust to overfitting to fine-grained information or to a specific prototype vector. The core of our method is not focusing on fine-grained information per se but on discriminative information; in fine-grained data, this discriminative information simply happens to exist in a fine-grained form. Hence the method also yields general performance gains on general data by concentrating on discriminative information. (To-do: add at the end that we also have strengths with respect to overfitting from the generalization perspective.)

(English) A1: The essence of transfer learning lies in tuning the model to optimize for the downstream task. Therefore, focusing on discriminative information for a specific downstream task cannot be considered as leading to overfitting.
On the contrary, our method is better understood as preventing the downstream model from being biased towards non-discriminative information by focusing on discriminative information. Moreover, considering the progressive refinement of prototypes to better encompass discriminative information, our method demonstrates robustness against overfitting to fine-grained information or specific prototypes. Numerous experimental results demonstrate that ProtoReg is much more robust against overfitting compared to CE:
* Table 1: Significant performance improvements in the limited data regime, which is more vulnerable to overfitting (e.g., using only 15% of the training data).
* Table 5: Test accuracy continues to improve steadily without saturating as training epochs increase.
* Figure 13: Training accuracy saturates much more slowly, and validation accuracy consistently improves.

We clearly emphasize the following: our goal is to prioritize discriminative information. Prioritizing discriminative information can also lead to performance improvements in general classification tasks (Table 7). In fine-grained downstream tasks, this discriminative information tends to have low granularity (which poses challenges for vanilla fine-tuning due to the granularity gap), so explicitly focusing on it results in more significant performance improvements.

---

> **Q2: Why and how exactly the equation (4) is formed?**

(Korean draft) A2: Equation (4) is formed to make prototypes adaptively evolve to incorporate class-discriminative information. During fine-tuning, as the downstream data are used for training, the features learn to better encode the discriminative information of the downstream data; without prototype refinement, the discriminative information contained in the initially set prototypes becomes outdated. The refinement process using Equation (4) is as follows:
* For a downstream task with N classes, we prepare N different memory banks (implemented as queues) to hold the features of each class.
* After each iteration, we enqueue each feature into the corresponding memory bank (determined by the class label of the feature).
* After an epoch of training, each memory bank holds $N_i^{train}$ features, where $N_i^{train}$ is the number of training samples in class $i$.
* After each epoch, class prototypes are updated based on the sample mean of their corresponding memory bank features.
* The memory banks are flushed to hold the features of the next epoch.

We use the form of Equation (4) for training stability and computational efficiency. If prototypes are updated at every iteration, they are computed from a small number of features (and a class may even have no features in the batch), which is unstable and can overfit to specific samples. Updating once per epoch gives a momentum effect that prevents abrupt changes in the prototypes, mitigating overfitting issues (handling the concern raised in Q1). It is also computationally efficient, since the features computed at each iteration are stored directly, rather than recomputing all features at the end of every epoch. Appendix B.7 covers these points, and Table 10 shows that our scheme is more robust to overfitting and more computationally efficient than the alternatives.

(Notes: The gist of the question seems to be why a sample-mean-style prototype is used -> find works that use sample means as prototypes? cite? -> borrow their phrasing? -> the most intuitive way to obtain a representative per-class representation is the sample mean; it is widely used in few-shot learning and its effectiveness has been validated (cite). -> then condense the earlier draft -> when using downstream data for learning, features are trained to better encode discriminative information from the downstream data => the feature space progressively shifts to become task-discriminative, so averaging features over these shifted feature spaces helps stabilize training (something along these lines).)

(English) A2: Equation (4) is designed to adaptively evolve prototypes to incorporate class-discriminative information. During fine-tuning, when using downstream data for learning, features are trained to better encode discriminative information from the downstream data.
Without prototype refinement, the initially set prototypes become outdated. The detailed process of refinement using Equation (4) is as follows:
* For a downstream task with N classes, N different memory banks (queues) are prepared to hold features of each class.
* After each iteration, each feature is enqueued into the corresponding memory bank based on its class label.
* After an epoch of training, each memory bank contains $N_i^{train}$ features, where $N_i^{train}$ is the number of training samples in class $i$.
* After each epoch, class prototypes are updated based on the sample mean of their corresponding memory bank features.
* The memory banks are cleared to hold features for the next epoch.

The form of Equation (4) is chosen for stability and computational efficiency. Updating prototypes at each iteration can lead to instability and overfitting to specific samples, since prototypes would be computed from a small number of features (and certain classes may have no features at all). Updating prototypes after each epoch prevents rapid changes and mitigates overfitting through a momentum effect (addressing the concern raised in Q1). Additionally, it is computationally efficient, as it only needs to store features after each iteration rather than recomputing all features at the end of each epoch. Appendix B.7 addresses these points, and Table 10 demonstrates the robustness against overfitting and computational efficiency of our approach compared to alternatives.

---

> **Q3: The comparing methods are not very up-to-date and basically not focusing on fine-grained problem.**

(Korean draft) A3: The comparison methods BSS [1], SN [2], and Co-tuning [3] all conducted experiments on Aircraft, Cars, and CUB200, which are all included in the fine-grained datasets we experimented with, so we believe the comparison is fair. Moreover, although the granularity gap is larger on fine-grained datasets, where our method is most effective, we are not limited to fine-grained problems: we also show strong performance gains on the general datasets Caltech101 and CIFAR100
(Table 7 in the Appendix). We also added comparisons with the latest works published in 2023, Robust_FT ([4], CVPR 2023) and DR_Tune ([5], ICCV 2023); our method clearly outperforms both.

(English) A3: The comparison methods BSS [1], SN [2], and Co-tuning [3] conducted experiments on Aircraft, Cars, and CUB200, all of which are fine-grained datasets we also experimented with. This ensures a fair comparison. Moreover, extending beyond fine-grained problems, we show substantial performance improvements on general datasets such as Caltech101 and CIFAR100 (refer to Table 7 in the Appendix). Additionally, we added comparisons with the two latest works published in 2023, Robust_FT ([4], CVPR 2023) and DR_Tune ([5], ICCV 2023), where our method consistently outperforms them. The tables below summarize the experimental results, which are highlighted in red in Table 1 of the revised manuscript.

Aircraft

| Methods | 15% | 30% | 50% | 100% |
| --- | --- | --- | --- | --- |
| Robust_FT | 23.91 | 37.84 | 49.09 | 65.08 |
| DR_Tune | 24.42 | 40.89 | 50.65 | 68.80 |
| ProtoReg (self) | 33.66 | 46.83 | 59.95 | 75.25 |
| ProtoReg (LP) | **34.35** | **50.41** | **61.45** | **77.89** |

Cars

| Methods | 15% | 30% | 50% | 100% |
| --- | --- | --- | --- | --- |
| Robust_FT | 25.07 | 52.63 | 70.10 | 84.29 |
| DR_Tune | 26.80 | 56.04 | 73.55 | 84.80 |
| ProtoReg (self) | 39.95 | 65.91 | 78.36 | 87.74 |
| ProtoReg (LP) | **42.52** | **69.32** | **81.62** | **89.60** |

CUB200

| Methods | 15% | 30% | 50% | 100% |
| --- | --- | --- | --- | --- |
| Robust_FT | 43.34 | 57.66 | 68.35 | 78.53 |
| DR_Tune | 44.84 | 58.49 | 69.62 | 79.10 |
| ProtoReg (self) | 57.09 | 67.48 | 74.53 | 81.22 |
| ProtoReg (LP) | **59.08** | **69.49** | **76.30** | **82.38** |

NABirds

| Methods | 15% | 30% | 50% | 100% |
| --- | --- | --- | --- | --- |
| Robust_FT | 43.34 | 57.66 | 68.35 | 78.53 |
| DR_Tune | 41.37 | 57.07 | | |
| ProtoReg (self) | 50.75 | 63.68 | 70.84 | 78.40 |
| ProtoReg (LP) | **52.42** | **64.83** | **72.33** | **79.51** |

References
[1] Chen et al., "Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning.", NeurIPS 2019
[2] Kou et al., "Stochastic normalization.", NeurIPS 2020
[3] You et al., "Co-tuning for transfer learning.", NeurIPS 2020
[4] Xiao et al., "Masked images are counterfactual samples for robust fine-tuning.", CVPR 2023
[5] Zhou et al., "DR-Tune: Improving fine-tuning of pretrained visual models by distribution regularization with semantic calibration.", ICCV 2023

---

> **Q4: How to ensure the generality of the proposed method? To be specific, the discriminative is a relative concept. While the representative prototypes could preserve the main fine-grained information over each class, it does not necessarily preserve marginal fine-grained characteristics of the data, e.g., according to the theory of distributional clustering.**

(Notes) -> mention the scenario -> discriminativity is relative to the task -> we understand why one might raise this concern, but -> this is not representation learning, whose goal is a general encoder transferable to many downstream tasks; it is transfer learning, whose goal is optimizing for the target task. Among fine-grained information, we consider class-common information to be what matters.

---

> **Q5: Computational overhead should be also reported. Although the proposed method is simple to implement, its clustering nature (repeating to compute the 'similarity' between two data points) can make the complexity very high.**

(Korean draft) A5: The similarity computation is done via matrix multiplication within a mini-batch, so the computational overhead is negligible. Specifically, for the l2-normalized feature matrix $F \in \mathbb{R}^{B \times d}$ and the l2-normalized prototype matrix $C \in \mathbb{R}^{K \times d}$ (B: batch size, d: feature dimension, K: number of classes), the cosine similarities between all samples and prototypes are computed at once via $FC^T \in \mathbb{R}^{B \times K}$. The table below shows the relative time compared to CE when run on the same device.
| Method | Relative Time |
| --------------- |:-------------:|
| CE | 1 |
| ProtoReg (self) | 1.03 |
| ProtoReg (LP) | 1.12 |

ProtoReg (self) incurs only a negligible 3% computational overhead, while ProtoReg (LP) incurs a 12% overhead due to the linear probing step for prototype initialization; in both cases the overhead is small enough given the substantial performance gains.

(English) A5: Similarity computations are performed within mini-batches through matrix multiplication, resulting in minimal computational overhead. Specifically, using the l2-normalized feature matrix $F \in \mathbb{R}^{B \times d}$ and the l2-normalized prototype matrix $C \in \mathbb{R}^{K \times d}$ (where B is the batch size, d is the feature dimension, and K is the number of classes), the cosine similarity between each sample and prototype is computed in a single step via $FC^T \in \mathbb{R}^{B \times K}$. The table below shows the relative time compared to CE when experiments are conducted on the same device.

| Method | Relative Time |
| --------------- |:-------------:|
| CE | 1 |
| ProtoReg (self) | 1.03 |
| ProtoReg (LP) | 1.12 |

ProtoReg (self) has a negligible computational overhead of only 3%, while ProtoReg (LP) has a 12% overhead due to the linear probing process for prototype initialization. In both cases, the overhead is sufficiently low considering the significant performance improvement they offer.

(Note: additionally, find a well-known contrastive learning paper that reports computational overhead and cite it to support the point that this step is not computationally heavy in practice.)

---

# R2 (5)

> **Q1: The novelty of this paper is not so significant. The idea of exploiting prototypes to represent and adjust the feature distribution according to label information from samples has been studied in a lot of works. Although the idea of utilizing such mechanism to fine-tune a pre-trained model for downstream fine-grained tasks is new, I do not recognize significant difference between the key idea of this work and the previous ones.
It does has some novelty, but I am not sure whether it is sufficient to meet the criterion of ICLR.**

(Korean draft) A1: Our method is novel in the following respects:
* **Notable performance gains in transfer learning.** -> Prior works using prototypes have mostly appeared in research areas such as few-shot learning, domain adaptation, and continual learning, and their effectiveness in transfer learning has been underexplored. We identify a motivational observation caused by the granularity gap in fine-grained transfer learning and show that prototype-based regularization can be a highly effective remedy.
* **A new prototype initialization scheme.** -> Most prototype-based works adopt the class-wise sample mean. To maximize the class-discriminative information contained in the prototypes, we additionally propose initializing them with learned linear classifier weights, and show that this substantially improves fine-grained transfer performance. (Note: we propose two schemes; written this way it sounds as if linear-probe initialization were the default.)
* **Disentangling the loss and studying each component.** -> Existing works formulate prototype losses as a log-likelihood or as distance minimization in a metric space. We instead disentangle the formulation into aggregation toward and separation from prototypes, and are the first to closely study the importance of assigning a different strength to each component. Indeed, the table below shows that assigning different strengths significantly outperforms the case $\lambda_{aggr} = \lambda_{sep} = \alpha$ (which is equivalent to the negative log-likelihood multiplied by $\alpha$). (Note: prior work on this may well exist..)
(a) FGVC Aircraft

| $\alpha$ | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| 1.0 | 24.57 | 39.96 | 51.25 | 69.31 |
| 5.0 | 27.39 | 42.60 | 56.02 | 72.22 |
| 10.0 | 30.69 | 45.37 | 56.77 | 73.03 |
| 15.0 | 31.62 | 45.64 | 57.82 | 74.14 |
| 20.0 | 31.92 | 45.49 | 56.74 | 73.00 |
| 25.0 | 31.53 | 45.31 | 56.71 | 71.89 |
| Ours | **33.66** | **46.83** | **59.95** | **75.25** |

(b) Stanford Cars

| $\alpha$ | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| 1.0 | 26.68 | 55.01 | 72.64 | 85.70 |
| 5.0 | 29.74 | 60.44 | 75.41 | 85.57 |
| 10.0 | 32.96 | 60.35 | 74.10 | 85.60 |
| 15.0 | 33.16 | 59.57 | 74.23 | 85.55 |
| 20.0 | 32.67 | 59.00 | 73.46 | 84.83 |
| 25.0 | 30.95 | 56.25 | 70.95 | 84.42 |
| Ours | **39.95** | **65.91** | **78.36** | **87.74** |

(c) CUB-200-2011

| $\alpha$ | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| 1.0 | 46.60 | 61.00 | 70.47 | 79.57 |
| 5.0 | 49.78 | 63.36 | 70.95 | 78.34 |
| 10.0 | 50.54 | 61.37 | 68.30 | 77.13 |
| 15.0 | 49.26 | 59.11 | 66.67 | 76.04 |
| 20.0 | 47.14 | 58.10 | 65.93 | 75.77 |
| 25.0 | 46.41 | 56.85 | 64.67 | 74.80 |
| Ours | **57.09** | **67.48** | **74.53** | **81.22** |

* **Refinement scheme.** -> Existing works either do not refine prototypes or, when they do, suffer computational inefficiency such as having to recompute all features. We propose a computationally efficient scheme that adaptively evolves the prototypes to incorporate the discriminative information learned during fine-tuning.

(English) A1: Our methodology introduces novelty in the following aspects:
* **Novel prototype initialization method**: Most works utilizing prototypes have adopted the approach of using the class-wise sample mean. However, we additionally propose initializing prototypes with learned linear classifier weights to maximize class-discriminative information, demonstrating a substantial improvement in fine-grained transfer performance.
* **Exploration of separated loss components**: Previous works often formulate prototype-based methods using log-likelihood or distance minimization in a metric space. We meticulously explore this disentanglement and, for the first time, the importance of assigning different strengths to each component.
The table below illustrates that our approach, which assigns different strengths to $\lambda_{aggr}$ and $\lambda_{sep}$, significantly enhances performance compared to cases where both are set equal to $\alpha$.

(a) FGVC Aircraft

| $\alpha$ | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| 1.0 | 24.57 | 39.96 | 51.25 | 69.31 |
| 5.0 | 27.39 | 42.60 | 56.02 | 72.22 |
| 10.0 | 30.69 | 45.37 | 56.77 | 73.03 |
| 15.0 | 31.62 | 45.64 | 57.82 | 74.14 |
| 20.0 | 31.92 | 45.49 | 56.74 | 73.00 |
| 25.0 | 31.53 | 45.31 | 56.71 | 71.89 |
| Ours | **33.66** | **46.83** | **59.95** | **75.25** |

(b) Stanford Cars

| $\alpha$ | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| 1.0 | 26.68 | 55.01 | 72.64 | 85.70 |
| 5.0 | 29.74 | 60.44 | 75.41 | 85.57 |
| 10.0 | 32.96 | 60.35 | 74.10 | 85.60 |
| 15.0 | 33.16 | 59.57 | 74.23 | 85.55 |
| 20.0 | 32.67 | 59.00 | 73.46 | 84.83 |
| 25.0 | 30.95 | 56.25 | 70.95 | 84.42 |
| Ours | **39.95** | **65.91** | **78.36** | **87.74** |

(c) CUB-200-2011

| $\alpha$ | 15% | 30% | 50% | 100% |
|---|---|---|---|---|
| 1.0 | 46.60 | 61.00 | 70.47 | 79.57 |
| 5.0 | 49.78 | 63.36 | 70.95 | 78.34 |
| 10.0 | 50.54 | 61.37 | 68.30 | 77.13 |
| 15.0 | 49.26 | 59.11 | 66.67 | 76.04 |
| 20.0 | 47.14 | 58.10 | 65.93 | 75.77 |
| 25.0 | 46.41 | 56.85 | 64.67 | 74.80 |
| Ours | **57.09** | **67.48** | **74.53** | **81.22** |

* **Refinement strategy**: Prior works either did not refine prototypes or required recomputation of the entire feature set, leading to computational inefficiency. We propose a computationally efficient approach that adaptively evolves prototypes to incorporate the discriminative information learned during fine-tuning.
* **Notable performance enhancement in transfer learning**: Prior works leveraging prototypes have primarily focused on few-shot learning, domain adaptation, and continual learning, leaving their effectiveness in transfer learning underexplored. We identify the motivational observation of the granularity gap in fine-grained transfer learning and highlight that prototype-based regularization can be a highly effective solution.
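When $\lambda_{aggr} = \lambda_{sep} = \alpha$, the objective reduces to the negative log-likelihood scaled by $\alpha$; the minimal scalar sketch below illustrates this equivalence. It is an illustration only, not the exact batched form in the paper (the function name `protoreg_loss` and the omission of temperature scaling are simplifications):

```python
import math

def protoreg_loss(sims, label, lam_aggr, lam_sep):
    """Prototype loss split into aggregation and separation terms.

    sims: cosine similarities between one feature and the K class
    prototypes (any temperature scaling assumed already applied).
    """
    aggr = -sims[label]                              # pull toward own prototype
    sep = math.log(sum(math.exp(s) for s in sims))   # push apart from prototypes
    return lam_aggr * aggr + lam_sep * sep

# With equal strengths, the loss reduces to alpha * cross-entropy:
sims, alpha = [0.9, 0.2, -0.1], 5.0
ce = -math.log(math.exp(sims[0]) / sum(math.exp(s) for s in sims))
assert abs(protoreg_loss(sims, 0, alpha, alpha) - alpha * ce) < 1e-9
```

Setting $\lambda_{aggr} \neq \lambda_{sep}$ breaks this equivalence, which is precisely the extra degree of freedom the tables above show to be beneficial.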
---

> **Q2: Although the motivation of solving the problem of granularity gap is interesting, I do not quite understand how the proposed mechanism with prototypes helps to eliminate such "gap" under the perspective of transfer learning. In more detail, it seems that this method can also be used to train a randomly initialized model for the fine-grained task, because I do not see a mechanism to "preserve" useful information from the original pre-trained model but only recognize the mechanism to make change to the model.**

(Korean draft) A2: First, the goal of our method is not to preserve the pre-trained information but to focus more on the downstream task-specific information contained within it. When the granularity gap is large, the low-granularity, downstream task-specific information is subtle compared to the high-granularity pre-trained information, and the latter can actually hinder the transfer of task-specific information. With a randomly initialized model, the initialized prototypes contain no class-discriminative information at all. If ProtoReg were applied to such meaningless prototypes without prototype refinement, it would rather hinder the learning of meaningful information. However, since ProtoReg includes prototype refinement, the prototypes are updated using the discriminative features learned during fine-tuning, so they come to contain some discriminative information; still, they cannot exploit the discriminative information of the pre-trained model. The experimental results below support our claim by showing the performance of applying ProtoReg to a randomly initialized model (with and without prototype refinement). In some cases ProtoReg still brings some improvement, but the drastic performance drop without refinement shows the importance of prototypes that contain the discriminative information of the pre-trained model.

[Additional explanation of the mechanism] In ProtoReg (LP), the classifier weights are trained while the pre-trained encoder is kept frozen. The learned classifier weights represent decision boundaries for the downstream features obtained from the pre-trained model. These decision boundaries are drawn based on the information that can distinguish the different classes of the downstream data (information of low granularity in the fine-grained case).
As a result, by explicitly setting prototypes that contain this discriminative information and aggregating features toward them, the model is made to focus more on low-granularity discriminative information. Likewise, ProtoReg (self) sets a representative vector for each class via the sample mean so that the prototypes contain discriminative information.

(English) A2: Our approach aims not to preserve pre-trained information but to emphasize downstream task-specific details within the pre-trained knowledge. This focus becomes crucial when a significant granularity gap exists, as low-granularity task-specific details may be subtle compared to high-granularity pre-trained information, potentially hindering the transfer of task-specific knowledge. In the case of a randomly initialized model, the initialized prototypes do not contain any class-discriminative information. Applying ProtoReg to such meaningless prototypes without refinement could hinder the learning of discriminative information. However, since ProtoReg involves prototype refinement during fine-tuning, the prototypes get updated using discriminative features learned in the process. While this ensures that the prototypes contain some discriminative information, it still falls short of fully utilizing the discriminative information within the pre-trained model. The experimental results below support our argument by demonstrating the performance of applying ProtoReg to a randomly initialized model (with and without prototype refinement). In some cases, ProtoReg brings about a certain degree of performance improvement. However, the significant drop in performance when refinement is not used indicates the importance of prototypes incorporating discriminative information from the pre-trained model.
| Methods | Aircraft | Cars | CUB200 | NABirds |
| --- | --- | --- | --- | --- |
| CE | 34.44 | 43.70 | 33.03 | 47.84 |
| ProtoReg (self) | 40.23 | 39.06 | 39.56 | 50.55 |
| ProtoReg (self), w/o refine | 4.14 | 9.77 | 30.24 | 20.44 |
| ProtoReg (LP) | 11.01 | 0.87 | 0.50 | 0.22 |
| ProtoReg (LP), w/o refine | 9.18 | 1.53 | 2.09 | 21.64 |

[Additional explanation of the mechanism] In ProtoReg (LP), the pre-trained encoder is frozen while the classifier weights are trained. The learned classifier weights represent decision boundaries for the downstream features obtained from the pre-trained model. These boundaries are drawn based on the information distinguishing different classes in the downstream data (information of low granularity, particularly when the data is fine-grained). Consequently, explicitly setting prototypes containing discriminative information and aggregating features toward them enables the model to focus more on low-granularity discriminative information. Similarly, ProtoReg (self) uses the sample mean to represent the discriminative information of each class.

---

> **Q3: A minor problem: the last sentence of the comments of Figure 1 is confusing. Maybe the word "suffer from" is not correctly used.**

(Korean draft) A3: We replaced "suffer from" with the more appropriate expression "faces challenges in" in the revised manuscript, with the change highlighted in red. Thank you for pointing this out.

(English) A3: We have made the revision in the manuscript, replacing "suffer from" with the more appropriate phrase "faces challenges in." The changes have been highlighted in red. Thank you for pointing that out.

---

> **Q4: Considering the 2nd drawback listed above, please explain whether there is a mechanism to discover and preserve useful information from the pre-trained model.**

(English) A4: The question has been answered in the response to Q2. Please kindly refer to that for further details. (-> same as Q2)

---

> **Q5: It is a common scene for real data that each class may have multiple prototypes. How do the authors deal with such a case?**

(Korean draft) A5: (multiple-prototype experiment?)
(Notes: it is hard to know in practice how many prototypes would be optimal, and multiple prototypes may interfere with aggregation?) For ProtoReg (self), the multiple-prototype case can be handled by running K-means clustering on the features of each class and using the cluster centroids as prototypes, instead of the sample mean. In practice, however, it is hard to know what number of prototypes is optimal. We experimentally confirmed that the current approach of using a single prototype per class is sufficient and performs best; this may be because multiple prototypes per class can interfere with aggregation? For ProtoReg (LP), since the prototypes are based on the weights of a linear classifier corresponding to the decision boundary, the various pieces of class-distinguishing information can be seen as integrated and encoded into a single vector.

(English) A5: For ProtoReg (self), instead of using the sample mean, the multiple-prototype case can be handled by employing K-means clustering on class-specific features and using the centroids of each cluster as prototypes. However, determining the optimal number of prototypes in practice is challenging. We experimentally confirmed that the current approach, using a single prototype per class, is sufficient and achieves the best performance. This is because using multiple prototypes per class may interfere with aggregation. For ProtoReg (LP), as it relies on the weights of a linear classifier corresponding to the decision boundary, various pieces of class-discriminative information are integrated and encoded into a single vector.

| K | Test Accuracy (%) |
| --- | ----------------- |
| 1 | |
| 2 | |
| 4 | |
| 8 | |

---

> **Q6: Please explain why two different prototype updating strategies are used according to their initialization methods? What is the motivation or consideration of their designs, respectively?**

(Korean draft) A6: (Say we tried all four combinations and that was the best, and that the initialization scheme and the refinement can conflict.) We re-ran all four possible combinations (2 initialization methods x 2 updating strategies), and the best combination was the one we use. This result matches our motivation and considerations, which we illustrate with the following toy example. Assume a 2D feature space with two classes and two samples per class, four samples in total.
Class 1: (-0.1, 0.9), (0.1, 1.1)
Class 2: (1.1, 0.1), (0.9, -0.1)

For ProtoReg (self), the prototypes obtained via the sample mean are prototype 1: (0, 1.0) and prototype 2: (1.0, 0). For ProtoReg (LP), the margin-maximizing decision boundary is y = x, and since the classifier weight vectors are perpendicular to the decision boundary, prototype 1: $(-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$ and prototype 2: $(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}})$ (prototypes are l2-normalized for simplicity). That is, there is a discrepancy between the prototypes produced by the two initialization methods, so updating prototypes initialized with linear classifier weights via the sample mean would distort them. To prevent this, when initializing with linear classifier weights we make the prototypes learnable and update them via backpropagation; when initializing with the sample mean, the prototypes are not learnable and are updated as the sample mean of the updated features. The sample-mean prototypes are not made learnable so that features obtained across different iterations are used comprehensively, stabilizing training (a single mini-batch may contain few or no samples of a given class).

(English) We re-experimented with the four possible combinations (2 initialization methods x 2 updating strategies), and the best combination remained the same. This result aligns with our motivation and considerations, which we illustrate through the following toy example: Consider a 2D feature space with two classes, each having two samples, four samples in total.

Class 1: (-0.1, 0.9), (0.1, 1.1)
Class 2: (1.1, 0.1), (0.9, -0.1)

For ProtoReg (self), the prototypes obtained through the sample mean are:
Prototype 1: (0, 1.0)
Prototype 2: (1.0, 0)

For ProtoReg (LP), the decision boundary maximizing the margin is y = x, and the classifier weight vectors are perpendicular to the decision boundary. Therefore, the prototypes are:
Prototype 1: $(-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})$
Prototype 2: $(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}})$
(prototypes are l2-normalized for simplicity)

In summary, there is a discrepancy between prototypes based on the two initialization methods.
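This discrepancy can be checked numerically with a small self-contained sketch (helper names are illustrative only): the cosine similarity between the sample-mean prototype of class 1 and the corresponding classifier-weight prototype is about 0.707 (a 45° angle), not 1.

```python
import math

def mean(points):
    # coordinate-wise mean of a list of 2D points
    return [sum(c) / len(points) for c in zip(*points)]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

class1 = [(-0.1, 0.9), (0.1, 1.1)]
proto_self = mean(class1)                          # sample-mean prototype: (0, 1.0)
proto_lp = (-1 / math.sqrt(2), 1 / math.sqrt(2))   # unit normal of boundary y = x

# the two initializations disagree: cosine similarity is cos(45°) ≈ 0.707, not 1
assert abs(cos(proto_self, proto_lp) - math.cos(math.pi / 4)) < 1e-9
```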
Thus, when updating prototypes initialized with linear classifier weights using the sample mean, distortion occurs. To prevent this, when initializing with linear classifier weights, the prototypes are set to be learnable and updated through backpropagation. On the other hand, when initializing with the sample mean, the prototypes are not set to be learnable; they are updated using the sample mean of the updated features. The reason for not making the sample-mean prototypes learnable is to comprehensively utilize features obtained across different iterations, ensuring stable training (as a mini-batch may have few or no samples of a specific class).

---

# R3 (5)

> **Q1: The idea of prototype-based loss is not novel for dealing with the limited data scenario. Such an idea could be tracked back to the early few-shot research [1]. The proposed aggregation loss and the separation loss are same as the Eq. (2) in [1].**

(Korean draft) A1: For novelty, refer to the common response. A direct comparison with [1] is as follows. [1] has no prototype refinement and relies on episodic training for few-shot learning, which does not fit the usual transfer learning setting where the entire data of all classes is used at once. Also, for prototype initialization, [1] uses only the sample mean, whereas we additionally use a novel scheme that leverages linear classifier weights to maximize the discriminative information contained in the prototypes. Furthermore, [1] optimizes a maximum log-likelihood objective, while we show that disentangling it into aggregation and separation and assigning a different strength to each term yields large performance gains. (-> State how they differ and how the effects differ.)

(English) A1: In direct comparison with [1], the differences are evident. [1] lacks prototype refinement and relies on episodic training for few-shot learning, which does not align with the transfer learning setting where the entire dataset of all classes is typically utilized at once. Additionally, in prototype initialization, while [1] uses the sample mean, we introduce a novel approach using linear classifier weights to maximize the inclusion of discriminative information.
Moreover, while [1] employs maximum log-likelihood optimization, we demonstrate significant performance improvement by disentangling the objective into aggregation and separation and assigning different strengths to each term.

---

> **Q2: The challenge of fine-grained classification I think lies in its limited data for each class, tens of samples for each class in most benchmark datasets (Table 6 A.2). For CIFAR100, hundreds of samples for each class, the proposed method achieves insignificant perform gains compared to CE (Table 7). Therefore, I think the few-shot works are related and should be included in the literature review.**

(Korean draft) A2: Thank you for the valuable comment. However, the challenge in fine-grained classification is not simply due to limited data per class. To demonstrate this, we equalized the number of samples per class on the FGVC Aircraft and CIFAR100 benchmark datasets (both have 100 classes in total). Specifically, to match FGVC Aircraft, which has less data (33 samples per class on average), we set the sampling rate for CIFAR100 so that it also contains 33 samples per class on average. The results show that, even when the number of classes and the number of samples per class are identical, fine-tuning with cross-entropy struggles more and is sub-optimal on the fine-grained dataset, and the performance gain from ProtoReg is much larger there. Moreover, few-shot learning works typically target extremely limited data settings such as 5-way 1-shot or 5-way 5-shot, whereas our transfer learning setting, cast in few-shot terms, uses far more classes and samples at once, e.g., 100-way 33-shot (Aircraft) or 555-way 35-shot (NABirds). We therefore believe a comparison with few-shot learning works would not be fair. Even the settings with low sampling rates such as 15% reduce the number of shots, but the number of ways, still ranging from 100 to 555, remains far larger than in few-shot settings. (Note: for ProtoNet, meta-training is not used, to match the transfer learning setting.)

(English) A2: Thank you for your valuable feedback. However, the challenge in fine-grained classification we address goes beyond simply having limited data for each class. To demonstrate this, we equalized the number of samples per class on the FGVC Aircraft and CIFAR100 benchmark datasets (both datasets have a total of 100 classes).
Specifically, to align with the smaller FGVC Aircraft dataset (averaging 33 samples per class), we adjusted the sampling rate for CIFAR100 to include an average of 33 samples per class as well.

| Methods | Aircraft | CIFAR100 |
| --------------- | -------- | -------- |
| CE | 66.01 | 66.87 |
| Snell et al. [1] | | |
| ProtoReg (self) | 75.25 | 68.12 |
| ProtoReg (LP) | 77.89 | 68.39 |

The experimental results reveal that, even in a setting where the number of classes and samples per class are equal, fine-tuning with cross-entropy faces more difficulty and is sub-optimal on the fine-grained dataset. While there is a notable performance improvement from ProtoReg on CIFAR100 as well, the impact is more pronounced on the fine-grained dataset. Furthermore, few-shot learning works typically experiment with highly limited data scenarios, such as 5-way 1-shot or 5-way 5-shot. In contrast, our transfer learning scenario involves a much larger number of classes and samples at once, like 100-way 33-shot (Aircraft) or 555-way 35-shot (NABirds). Therefore, we believe comparisons with few-shot learning works are not fair. Even in settings with a low sampling rate, such as 15%, the number of classes is still considerably larger than in few-shot settings, ranging from 100 to 555.

---

> Q3. Some experimental results seem unfair. For the results in Figure 1 and Figure 3, the case where ProtoReg succeeds while others fails is picked up. A question arises here: whether the samples that ProtoReg mis-classify are mis-classified by CE? If so, the extra mis-classification by CE would indicate its deficiency. Otherwise, ProtoReg and CE possibly have different shortcuts. The results presented should be more statistical.

(한글) A3: Reflecting your comment, we added experiments that provide a more statistical analysis (related to Figure 1 and Figure 3). [statistical analysis regarding Figure 1] While Figure 1 included visual results to highlight our motivation, below we present quantitative metrics for image retrieval, mAP @K and Recall @K.
mAP calculates the average precision of search results, with mAP @K focusing on the average precision within the top K results. A higher mAP @K indicates increased accuracy and consistency in retrieval. On the other hand, Recall @K measures the ability of the system to retrieve actual positive instances within the top K results, with higher values indicating a better capability to find relevant items. The experimental results show that using ProtoReg significantly improves both metrics compared to CE.

- mAP @K

|K|CE|ProtoReg (self)|ProtoReg (LP)|
| --- | --- | --- | --- |
|1|0.640|0.699|**0.738**|
|3|0.701|0.754|**0.787**|
|5|0.690|0.742|**0.781**|

- Recall @K

|K|CE|ProtoReg (self)|ProtoReg (LP)|
|---|---|---|---|
|1|0.640|0.699|**0.738**|
|3|0.548|0.631|**0.689**|
|5|0.461|0.553|**0.612**|

[statistical analysis regarding Figure 3] For the OOD test data (Waterbirds), to compare two methods A and B we computed the ratios of four cases: 1) A correct and B incorrect, 2) A incorrect and B correct, 3) both incorrect, 4) both correct. We report results for the top-K predictions for each value of K. Compared to CE, the winning ratios of ProtoReg (self) and ProtoReg (LP) are much higher, at 11.65% and 16.88%, respectively. Notably, when ProtoReg is wrong, CE is almost always wrong as well. Of the 31.12% of samples that ProtoReg (self) misclassifies, only 2.71% are answered correctly by CE; for the remaining 28.41%, CE is also wrong. We believe this figure is low enough to demonstrate the deficiency of CE.

(영어) A3: In response to your comment, we have added additional experiments (related to Figure 1 and Figure 3) that provide more statistical analysis.

[statistical analysis for Figure 1] In Figure 1, we included visual results to highlight our motivation. Below, we present quantitative metrics related to image retrieval: mAP @K and Recall @K. mAP calculates the average precision of search results, with mAP @K focusing on the average precision within the top K results. A higher mAP @K indicates increased accuracy and consistency in retrieval. Conversely, Recall @K measures the ability to retrieve actual positive instances within the top K results, with higher values indicating better capability in finding relevant items.
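For clarity, both metrics can be computed as in the following minimal sketch, where each query's ranked retrieval list has been reduced to a boolean relevance vector. The helper names (`average_precision_at_k`, `map_at_k`, `recall_at_k`) are ours for illustration, not the evaluation code used for the tables:

```python
from typing import List, Sequence

def average_precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """AP@K for one query: average of precision@i over the ranks i <= K
    at which a relevant item was retrieved."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevant[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # precision at this rank
    return score / hits if hits else 0.0

def map_at_k(all_relevant: List[Sequence[bool]], k: int) -> float:
    """mAP@K: AP@K averaged over all queries."""
    return sum(average_precision_at_k(r, k) for r in all_relevant) / len(all_relevant)

def recall_at_k(all_relevant: List[Sequence[bool]],
                num_relevant: List[int], k: int) -> float:
    """Recall@K: fraction of a query's relevant items retrieved in the
    top K, averaged over queries (num_relevant[i] = total relevant items
    for query i in the gallery)."""
    return sum(
        sum(r[:k]) / n for r, n in zip(all_relevant, num_relevant) if n
    ) / len(all_relevant)
```

For example, a query whose top-3 results are relevant, irrelevant, relevant has AP@3 = (1/1 + 2/3)/2.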
Experimental results demonstrate that using ProtoReg leads to a significant improvement in both metrics compared to CE.

- mAP @K

|K|CE|ProtoReg (self)|ProtoReg (LP)|
| --- | --- | --- | --- |
|1|0.640|0.699|**0.738**|
|3|0.701|0.754|**0.787**|
|5|0.690|0.742|**0.781**|

- Recall @K

|K|CE|ProtoReg (self)|ProtoReg (LP)|
|---|---|---|---|
|1|0.640|0.699|**0.738**|
|3|0.548|0.631|**0.689**|
|5|0.461|0.553|**0.612**|

[statistical analysis for Figure 3] For OOD test data (Waterbirds), we compared methodologies A and B by calculating ratios for four cases: 1) A correct and B incorrect, 2) A incorrect and B correct, 3) both incorrect, 4) both correct. The results for top-K predictions are presented for each value of K. ProtoReg (self) and ProtoReg (LP) significantly outperform CE, winning by 11.65% and 16.88%, respectively. A noteworthy observation is that when ProtoReg makes a mistake, CE tends to make the same mistake in almost all cases. Of the 31.12% of errors made by ProtoReg (self), only 2.71% are correctly predicted by CE, while the remaining 28.41% are also errors for CE. We believe this percentage is sufficiently low to demonstrate CE's deficiency.

- ProtoReg (self) vs CE

| K | ProtoReg (self) win | CE win | Both wrong | Both correct |
| --- | --- | --- | --- | --- |
| 1 |**11.65**|2.71|28.41|57.23|
| 3 |**9.04**|1.42|14.34|75.20|
| 5 |**7.77**|0.86|10.32|81.05|

- ProtoReg (LP) vs CE

|K|ProtoReg (LP) win|CE win|Both wrong|Both correct|
|---|---|---|---|---|
|1|**16.88**|2.76|23.18|57.18|
|3|**11.75**|1.40|11.63|75.22|
|5|**10.22**|1.16|7.87|80.76|

---

> Q4. The pseudo-code did not include the refinement for the prototypes based on linear classifier weights.

(한글) A4: When prototypes initialized from linear classifier weights are refined, they are set to be learnable and updated through backpropagation. Therefore, without any additional code, the refinement happens in optimizer.step() after loss.backward().
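To make this concrete, here is a minimal PyTorch sketch of the backprop-based refinement path. The toy sizes and the simplified cosine-similarity pull toward class prototypes are stand-ins of our own for illustration, not the paper's exact $\mathcal{L}_{aggr}$ or training code:

```python
import torch

# Toy setup (hypothetical sizes): 3 classes, 5-dim features.
torch.manual_seed(0)
num_classes, dim = 3, 5

# Prototypes initialized from linear-classifier weights are registered as
# learnable parameters, so optimizer.step() refines them via their gradients.
prototypes = torch.nn.Parameter(torch.randn(num_classes, dim))
optimizer = torch.optim.SGD([prototypes], lr=0.1)

features = torch.randn(8, dim)              # a mini-batch of encoded samples
labels = torch.randint(0, num_classes, (8,))

# Simplified aggregation-style loss: pull each normalized feature toward
# its class prototype by maximizing cosine similarity.
f = torch.nn.functional.normalize(features, dim=1)
p = torch.nn.functional.normalize(prototypes, dim=1)
loss = -(f * p[labels]).sum(dim=1).mean()

before = prototypes.detach().clone()
optimizer.zero_grad()
loss.backward()
optimizer.step()                            # <- prototype refinement happens here

assert not torch.equal(before, prototypes.detach())  # prototypes were updated
```

The point is simply that no extra refinement routine is needed: once the prototypes are an `nn.Parameter` passed to the optimizer, the standard `loss.backward()` / `optimizer.step()` pair updates them.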
(영어) A4: When refining prototypes initialized based on linear classifier weights, the prototypes are made learnable and updated through backpropagation. Refinement therefore takes place in optimizer.step() following loss.backward(), without the need for additional pseudo-code.

---

# R4 (6)

> Q1: More details of ablation studies should be provided to further clarify the effectiveness of the proposed method. For example, the experiments with configurations “L_ce + L_aggr + L_step”, “L_ce + L_step + Refine” should be conducted to demonstrate the effectiveness of L_step when applied under different conditions.

(한글) A1: Thank you for the valuable advice. Reflecting your comments, we conducted additional ablation studies. The corresponding results have been added to Table 4 of the revised manuscript, colored in red.

(영어) A1: Thank you for your valuable feedback. We conducted additional ablation studies based on your comments and incorporated the corresponding results into Table 4, highlighted in red in the revised manuscript.

|$\mathcal{L}_{ce}$|$\mathcal{L}_{aggr}$|$\mathcal{L}_{sep}$|Refine|Accuracy (%)|
| --- | --- | --- | --- | --- |
|O|O|O||71.23|
|O||O|O|63.79|

---

> Q2: Some details of the proposed method should be stated more clearly. For example, in Equ (2), how \phi is defined and how the “argmax” value is computed with respect to the classification accuracy should be explicitly declared.

(한글) A2: Thank you for the advice. Reflecting your comments, we revised the manuscript so that the description of the method is clearer, marking the modified parts in red. In Equ (2), $\phi$ is randomly initialized. Note that the prototype initialization in Equ (2) is independent of $\phi$, since only the pre-trained feature extractor $\mathcal{F}_{\theta}$ is used. Equ (3), which contains the "argmax", is used when initializing prototypes with linear classifier weights. As Equ (3) shows, the "argmax" corresponds to the $\phi$ that maximizes the classification accuracy on the validation data. Specifically, we train a linear classifier on the downstream training data with a frozen feature extractor and measure validation accuracy at regular intervals (every 5 epochs in our case, for computational efficiency).
Validation accuracy typically increases and then decreases (as the classifier starts to overfit the training data), so we apply early stopping at the point where validation accuracy begins to drop and use the $\phi$ at that point as the argmax. This way, the prototypes can be initialized to contain the discriminative information encoded by the pre-trained feature extractor, without overfitting to the downstream training data.

(영어) A2: Thank you for your comments. We have revised the manuscript to clarify the methodology based on your comments, and the modified sections are indicated in red. In Equation (2), $\phi$ is randomly initialized. It is essential to note that the prototype initialization in Equation (2) is independent of $\phi$, since only the pre-trained feature extractor $\mathcal{F}_{\theta}$ is used. Equation (3), with its "argmax", is employed when initializing prototypes using linear classifier weights. Examining Equation (3), the "argmax" corresponds to the $\phi$ that maximizes the classification accuracy on the validation data. Specifically, we train a linear classifier on the downstream training data with a frozen feature extractor, measuring validation accuracy at regular intervals (every 5 epochs in our case, for computational efficiency). As validation accuracy typically decreases after increasing (suggesting the onset of overfitting to the training data), we apply early stopping when validation accuracy starts to decrease and use the corresponding $\phi$ at that point as the argmax. This approach initializes the prototypes with the discriminative information encoded by the pre-trained feature extractor, without overfitting to the downstream training data.

---

> Q3. Can ProtoReg outperform SOTA methods on other downstream tasks of transfer learning, such as semantic segmentation?

(한글) A3: In our work, we consider a classification task as the downstream task for transfer learning. A semantic segmentation task involves multiple classes within a single image sample, so it does not fit our setting.
Instead, we show that ProtoReg is not limited to fine-grained classification and also achieves substantial performance gains on more general datasets such as Caltech101 and CIFAR100 (Table 7 in the Appendix).

(영어) A3: In our research, we consider a classification task as the downstream task for transfer learning. Given that semantic segmentation involves multiple classes within a single image sample, it is not suitable for our setting. Instead, ProtoReg demonstrates notable performance improvements not only on fine-grained classification but also on more general datasets like Caltech101 and CIFAR100 (see Table 7 in the Appendix).

---
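As a supplement to our response to R4's Q2, the validation-based selection of $\phi$ can be sketched as a simple early-stopping loop. `train_one_epoch` and `validation_accuracy` are hypothetical placeholders for the actual training and evaluation routines:

```python
# Hypothetical early-stopping loop for selecting the linear-classifier
# weights phi used to initialize prototypes (cf. Equation (3)): train on
# frozen features, evaluate every few epochs, and keep the phi observed
# just before validation accuracy starts to drop.

def select_phi(phi, train_one_epoch, validation_accuracy,
               max_epochs=100, eval_every=5):
    """Return the phi with the last non-decreasing validation accuracy."""
    best_phi, best_acc = phi, float("-inf")
    for epoch in range(1, max_epochs + 1):
        phi = train_one_epoch(phi)
        if epoch % eval_every == 0:       # periodic evaluation for efficiency
            acc = validation_accuracy(phi)
            if acc < best_acc:            # accuracy began to decrease:
                return best_phi           # early-stop, keep the previous phi
            best_phi, best_acc = phi, acc
    return best_phi
```

With stub routines where accuracy peaks mid-training, the loop returns the pre-peak-drop classifier rather than the fully trained (overfit) one.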