# Rebuttal
### Summary of Result Tables
| CIFAR-10 | Batch Size | Epoch | kNN Accuracy |
|:-----------: |:----------: |:-----: |:------------: |
| BYOL* | 128 | 200 | __85\%__ |
| SimSiam* | 128 | 200 | 73\% |
| BarlowTwins* | 128 | 200 | 84\% |
| DCL | 128 | 200 | 84\% |
| BYOL* | 512 | 200 | __84\%__ |
| SimSiam* | 512 | 200 | 81\% |
| BarlowTwins* | 512 | 200 | 78\% |
| DCL | 512 | 200 | __84\%__ |
| STL-10 | fc7+Linear| fc7+5-NN | Output+ Linear | Output+5-NN |
| :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
| [1] | 83.2\% | 76.2\% | 80.1\% | 79.2\% |
| DCL | 84.4\% (+1.2\%) |77.3\% (+1.1\%) |81.5\% (+1.4\%) |80.5\% (+1.3\%)
| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| [1] | 200 | 16384 | 75.6\% |
| DCL | 200 | 16384 | 76.8\% (+1.2\%) |
| ImageNet-1k | Epoch | Batch Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| [1] | 200 | 256 (Memory queue = 16384) | 67.69\% |
| DCL | 200 | 256 | 68.29\% (+0.6\%) |
|Batch Size | 32 | 64 | 128 | 256 | 768 |
| :-----------:| :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
| [1] | 78.9\% | 81.0\% | 81.9\% | 82.6\%| 83.2\%|
| DCL | 81.0\% (+2.1\%) | 82.9\% (+1.9\%) | 83.7\% (+1.8\%) | 84.2\% (+1.6\%) | 84.4\% (+1.2\%) |
| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| [1] | 200 | 16384 | 77.66\% |
| DCL | 200 | 8192 | 80.52\% (+2.86\%) |
| ImageNet-1K (batch size = 256; epoch = 200) | Linear Top-1 Accuracy |
| :-----------:| :-----------: |
| DCL | 65.9\% |
| + optimal ($\tau$, $lr$) = (0.2, 0.07) | 67.8\% (+2\%) |
| + stronger augmentation | 68.2\% (+0.4\%) |
| ImageNet-1k | Epoch | Batch Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| DCL | 200 | 256 | 67.8\% |
| DCL | 400 | 256 | 69.5\% (+1.7\%) |
| DCL | 400 | 1024 | |
| Batch Size | 32 | 128 | 512 |
| -------- | -------- | -------- | -------- |
| [2] | 82.2\% | 88.5\% | 89.1\% |
| DCL | 86.1\% (+3.9\%) | 89.9\% (+1.4\%) | 90.3\% (+1.2\%) |
| Batch Size | 32 | 128 | 512 |
| -------- | -------- | -------- | -------- |
| [2] | 49.8\% | 59.9\% | 61.1\% |
| DCL | 54.1\% (+4.3\%) | 61.6\% (+1.7\%) | 62.2\% (+1.1\%) |
|$\tau$ | 0.07 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
|Baseline | 83.6\% | 87.5\% | 89.5\% | 89.2\% | 88.7\% | 89.1\% |
|DCL | 88.3\% | 89.4\% | 90.8\% | 89.9\% | 89.6\% | 90.3\% |
--------------------------------------------------------------
### Official Review of Paper5271 by Reviewer K21n
#### __Q2-1:__
However, despite those merits, I also have concerns .. The objective derived is similar in form to the alignment and uniformity loss [1]. In [1] the order of the expectation and exponential is swapped, but the key feature of removing the positive term in the denominator is there. If you computed the gradients of the alignment and uniformity loss you would also find that it avoids the cancellation effect that this work posits as its main goal to remove. It is critically important in my mind that [1] is discussed in the related work, and added as a baseline in experiments (especially for Fig. 3 on sensitivity to batch size and Fig. 4 on convergence speed). I will not feel confident enough to recommend acceptance without that comparison.
#### __A2-1:__
This is a very inspiring question, and we thank you for the insightful comments! They motivated us to think more deeply about the connection and the difference between DCL and [1]. There is indeed a critical difference between DCL and [1], and it arises exactly because the order of the expectation and the exponential is swapped. For analytical convenience, let us also assume the latent embedding vectors $z$ are normalized. When $z_i, z_j$ are normalized, $\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)$ and $\exp\left(-\lVert z_i^{(k)}- z_j^{(l)}\rVert^2/\tau\right)$ differ only by a constant factor and a rescaling of $\tau$, since $\lVert z_i^{(k)}- z_j^{(l)}\rVert^2 = 2 - 2\langle z_i^{(k)}, z_j^{(l)}\rangle$. Thus we can write $L_{DCL}$ and $L_{[1]}$ in a similar fashion:
$$L_{DCL} = L_{DCL,pos} + L_{DCL,neg},\qquad L_{DCL,neg} = \sum_i{\log\left(\sum_{j\neq i}{\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)}\right)}$$
$$L_{[1]} = L_{align} + L_{uniform},\qquad L_{uniform} = \log\left(\sum_i \sum_{j\neq i}{\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)}\right)$$
With the right weight factor, $L_{align}$ can be made exactly the same as $L_{DCL,pos}$. So let’s focus on $L_{DCL, neg}$ and $L_{uniform}$:
$$L_{DCL,neg} = \sum_i{\log\left(\sum_{j\neq i}{\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)}\right)}$$
$$L_{uniform} = \log\left(\sum_i \sum_{j\neq i}{\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)}\right)$$
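For concreteness, their gradients with respect to $z_i^{(k)}$ take the following form (a sketch that treats the normalized $z$'s as free variables, i.e., ignoring the Jacobian of the normalization):
$$\frac{\partial L_{uniform}}{\partial z_i^{(k)}} = \frac{1}{\tau}\cdot\frac{\sum_{j\neq i}\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)\, z_j^{(l)}}{\sum_{m}\sum_{n\neq m}\exp\left(\langle z_m^{(k)}, z_n^{(l)}\rangle/\tau\right)},\qquad \frac{\partial L_{DCL,neg}}{\partial z_i^{(k)}} = \frac{1}{\tau}\cdot\frac{\sum_{j\neq i}\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)\, z_j^{(l)}}{\sum_{j\neq i}\exp\left(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\right)}$$
The numerators are identical; the two terms differ only in their normalizers.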
Similar to the analysis in our manuscript, the latter $L_{uniform}$ introduces a negative-negative coupling between the negative samples of different positive samples: its normalizer sums over all pairs in the batch, so if two negative samples of $z_i$ are close to each other, the gradient for $z_i$ is attenuated, much like under the negative-positive coupling. In other words, while [1] indeed does not have a negative-positive coupling, it has a similarly problematic negative-negative coupling. We will revise our manuscript to include this analysis and the comparison to [1]. As requested by the reviewer, we will also provide a comprehensive empirical comparison showing that DCL outperforms [1], especially in small-batch settings, and that the experiments match our analytical prediction: DCL outperforms [1] by a larger margin as the batch size decreases.
#### __Q2-2:__
I strongly suspect that the objective of [1], if tuned correctly, would yield similar results to those in this paper (if I have missed some fundamental difference between the objective given here and that in [1] then please point it out - I will listen carefully and update my review on this point if needed). As a result of this, I would argue that the main contribution of this paper is not the proposed decoupled loss itself, but the observation that the loss can be motivated from the point of view of removing this gradient multiplier. While I think this is an interesting observation, on its own I do not feel it is sufficient contribution by the standards of NeurIPS.
#### __A2-2:__
In the following experiments, we compare DCL to [1] on STL-10, ImageNet-100, and ImageNet-1K under various settings.
For STL-10, we implemented and ran our proposed DCL on top of the official code of [1] (https://github.com/SsnL/align_uniform/tree/master/examples/stl10). Both the backbone encoder and the hyperparameters are exactly the same as in the implementation of [1]. DCL reaches 84.4\% (fc7+Linear) compared to the 83.2\% (fc7+Linear) reported in [1]. Note that we did not tune the hyperparameters or try other settings to further improve performance.
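For reference, below is a minimal NumPy sketch of how the decoupled objective and the [1]-style negative term can be computed for a batch of two augmented views, under the assumptions above (normalized embeddings, cross-view negatives, and the usual $-\langle z_i^{(k)}, z_i^{(l)}\rangle/\tau$ positive term). It is an illustrative sketch rather than the exact training code used for these results, and the function and variable names are ours.

```python
import numpy as np

def l2_normalize(z):
    """Row-wise L2 normalization of an (N, d) embedding matrix."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def dcl_loss(z1, z2, tau=0.1):
    """Illustrative decoupled contrastive loss for two views of a batch.

    The positive term aligns the two views of each sample; the negative term
    is a per-sample log-sum-exp over the *other* samples only, so the positive
    pair never enters the denominator (no negative-positive coupling).
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    sim = (z1 @ z2.T) / tau                    # (N, N) cross-view similarities
    n = sim.shape[0]
    pos = -np.diag(sim)                        # L_DCL,pos: alignment term per sample
    neg_mask = ~np.eye(n, dtype=bool)          # exclude the positive pair
    neg_exp = np.where(neg_mask, np.exp(sim), 0.0)
    neg = np.log(neg_exp.sum(axis=1))          # L_DCL,neg: per-sample log-sum-exp
    return np.mean(pos + neg)

def uniformity_style_negative(z1, z2, tau=0.1):
    """Negative term in the style of [1]: one global log-sum-exp over all
    cross-view pairs with i != j, which couples every sample's gradient to
    every other pair in the batch."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    sim = (z1 @ z2.T) / tau
    neg_mask = ~np.eye(sim.shape[0], dtype=bool)
    return np.log(np.exp(sim[neg_mask]).sum())

# Toy usage with random embeddings
rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = rng.normal(size=(8, 16))
print(dcl_loss(z1, z2), uniformity_style_negative(z1, z2))
```

Relative to standard InfoNCE, `dcl_loss` simply drops the positive pair from each sample's denominator; relative to the [1]-style term, it keeps the log-sum-exp per sample instead of pooling all pairs into a single global normalizer.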
STL-10 comparisons of [1] and DCL under the same experiment setting.
| STL-10 | fc7 + Linear| fc7 + 5-NN | Output + Linear | Output + 5-NN |
| :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
| [1] | 83.2\% | 76.2\% | 80.1\% | 79.2\% |
| DCL | 84.4\% (+1.2\%) |77.3\% (+1.1\%) |81.5\% (+1.4\%) |80.5\% (+1.3\%)
ImageNet-100 comparisons of [1] and DCL under the same setting (MoCo).
| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| [1] | 240 | 16384 | 75.6\% |
| DCL | 240 | 16384 | 76.8\% (+1.2\%) |
ImageNet-1K comparisons of [1] and DCL under the best setting.
| ImageNet-1k | Epoch | Batch Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| [1] | 200 | 256 (Memory queue = 16384) | 67.69\% |
| DCL | 200 | 256 | 68.2\% (+0.51\%) |
STL-10 comparisons of [1] and DCL under different batch sizes.
|Batch Size | 32 | 64 | 128 | 256 | 768 |
| :-----------:| :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
| [1] | 78.9\% | 81.0\% | 81.9\% | 82.6\%| 83.2\%|
| DCL | 81.0\% (+2.1\%) | 82.9\% (+1.9\%) | 83.7\% (+1.8\%) | 84.2\% (+1.6\%) | 84.4\% (+1.2\%) |
#### __Q2-3: Minor criticisms__
* __Q:__ Somewhat slapdash reporting of results - e.g. Figure 4 doesn’t even tell the reader which dataset corresponds to which figure.
__A:__ Thank you for pointing this out. We will revise the figure captions accordingly.
* __Q:__ Figure 1(a) talks about something called the ``coefficient of variation'' - I still don’t know what this is since I could not find an explanation anywhere in the text. Even if the explanation is hidden away somewhere, it shouldn't be so hard for a reader to find it (do you just mean variance/standard dev.?). Please do a review of the pedagogical basics and make sure all figures are properly labeled and explained.
__A:__ We thank the reviewer for the suggestion. The coefficient of variation, $c_v$, also known as relative standard deviation, is a standardized measure of dispersion of a probability distribution. We will provide the definition in the text and legend of the figure, which is the ratio of the standard deviation $\sigma$ to the mean $\mu$, $c_v = \frac{\sigma}{\mu}$.
* __Q:__ In Figure 3 why use MoCo-v1 when you can just as easily run MoCo-v2 experiments and obtain more competitive baseline comparisons?
__A:__ Thank you for pointing this out. We implemented the proposed method on top of MoCo-v2. Compared with the official code of [1], the improvement in the table below shows that DCL remains an effective solution to the problem. It is worth mentioning that DCL achieves competitive performance with greater efficiency (a smaller memory queue). We will provide more comprehensive results in the revised version.
ImageNet-100 comparisons of [1] and DCL under the same setting (MoCo-v2), except for the memory queue size.
| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy |
| :-----------:| :-----------: | :-----------: | :-----------: |
| [1] | 200 | 16384 | 77.66\% |
| DCL | 200 | 8192 | 80.52\% (+2.86\%) |
--------------------------------------------------------------
#### __Q2-4: Clarity__
__Q:__ Clarity is fairly poor in general. Ideas that should be easy and intuitive are written in a convoluted way and end up harder than necessary to understand. I would especially suggest revising section 3, which contains the main idea of the paper.
* l.92 “Simsam” —> SimSiam
* Fig1 is on page 2, but relates to formulae \& discussions on Page 4. I would suggest moving the figure to a more logical position.
* prop 2 is stated in a slightly odd (inverted) way. I would suggest a more linear sequence of logic: 1) define the new loss, 2) state that the new loss has gradients that are the same as the original loss, but with the “NPC multipliers” removed.
__A:__ Thank you for pointing out these typos and issues. We will fix them and improve the clarity of Section 3 in the revised version.
--------------------------------------------------------------
* __Q:__ Table 2: baseline results are somewhat weak. ...
__A:__ Thanks for making an important point! This is a slight misunderstanding, and we will revise the paper to avoid this confusion. For a fair comparison with CLD [1] on the CIFAR datasets, we used ResNet-18 as the backbone instead of the ResNet-50 used in other work (e.g., the SimCLR paper). This choice makes the SimCLR baselines look relatively weaker than those based on ResNet-50. While we stated this on line 157, we realize it could have been made clearer.
To address the reviewer's concern more thoroughly, we report DCL's ResNet-50 performance on CIFAR-10 and CIFAR-100, following the same experimental settings as the open-source implementation [2]. In these comparisons, we vary the batch size to show the effectiveness of DCL.
CIFAR-10 comparisons of the SimCLR baseline and DCL with ResNet-50 (kNN evaluation).
| Batch Size | 32 | 128 | 512 |
| -------- | -------- | -------- | -------- |
| SimCLR baseline | 82.2\% | 88.5\% | 89.1\% |
| DCL | 86.1\% (+3.9\%) | 89.9\% (+1.4\%) | 90.3\% (+1.2\%) |
CIFAR-100 comparisons of the SimCLR baseline and DCL with ResNet-50 (kNN evaluation).
| Batch Size | 32 | 128 | 512 |
| -------- | -------- | -------- | -------- |
| SimCLR baseline | 49.8\% | 59.9\% | 61.1\% |
| DCL | 54.1\% (+4.3\%) | 61.6\% (+1.7\%) | 62.2\% (+1.1\%) |