We are happy to provide further responses to address the reviewer's comments, and please let us know if there are more questions.
(Discussion with Related Work [1]) We agree with the reviewer that, objective-wise, we use a different contrasting quantity: [1] computes $\langle \phi(x), \mu_{X|z} \rangle$ while ours computes $\langle \phi(x), \mu_{Y|z} \rangle$. This is the statement we made in our prior response when discussing the difference between the objective functions. To be clear, we do not claim that "the use of a different contrasting quantity makes our work completely novel". Thanks to the reviewer, we have added a discussion of this previously missing reference [1] to our updated manuscript. Lastly, we hope the contribution of our paper can be recognized: we aim to resolve a practical research challenge in conditional contrastive representation learning. Our application is very different from density estimation, which is the major application of NCE-related approaches.
($\mathcal{H}$ and $\phi$) Thanks for the more detailed questions; please see our point-by-point answers below. We hope our responses resolve any remaining confusion about our method and notation.
1. ($\mathcal{H}$ is the RKHS with $\phi$ as feature map?) Yes.
2. ($\phi\big(g_{\theta_X}(x)\big)$ and the CME live in the same RKHS?) Our method computes $\bigg\langle \phi\Big(g_{\theta_X}(x)\Big),\mu_{Y|z} \bigg\rangle_{\mathcal{H}} = \bigg\langle \phi\Big(g_{\theta_X}(x)\Big), \Phi_Y^\top (K_Z + \lambda {\bf I})^{-1} \Gamma_Z \gamma(z) \bigg\rangle_{\mathcal{H}}$. Here, $\phi\Big(g_{\theta_X}(x)\Big)$ and $\Phi_Y^\top (K_Z + \lambda {\bf I})^{-1} \Gamma_Z \gamma(z)$ live in the same RKHS. In particular, $\Phi_Y = \big[\phi\big(g_{\theta_Y}(y_1)\big), \cdots, \phi\big(g_{\theta_Y}(y_b)\big)\big]^\top$, so $\Phi_Y^\top (K_Z + \lambda {\bf I})^{-1} \Gamma_Z \gamma(z)$ is a linear combination of features under the same map $\phi$ (a batched sketch of this computation follows this list).
3. (Evaluating the CME or not?) We are not certain what "evaluating the CME" refers to exactly, but under the reviewer's definition (evaluating the CME means computing $\bigg\langle \phi\Big(g_{\theta_Y}(y)\Big),\mu_{Y|z} \bigg\rangle_{\mathcal{H}} = \bigg\langle \phi\Big(g_{\theta_Y}(y)\Big), \Phi_Y^\top (K_Z + \lambda {\bf I})^{-1} \Gamma_Z \gamma(z) \bigg\rangle_{\mathcal{H}}$), we do not compute this quantity in our work. We instead compute $\bigg\langle \phi\Big(g_{\theta_X}(x)\Big),\mu_{Y|z} \bigg\rangle_{\mathcal{H}}$. The difference lies in using $\phi\Big(g_{\theta_Y}(y)\Big)$ versus $\phi\Big(g_{\theta_X}(x)\Big)$.
4. (What are $\phi\Big(g_{\theta_X}(x)\Big)$ and $\phi\Big(g_{\theta_Y}(y)\Big)$?) $x$ is the input data, and $g_{\theta_X}(x)$ denotes the representation obtained by feeding $x$ into the mapping $g_{\theta_X}(\cdot)$, which in our work is a deep neural network. $\phi(\cdot)$ then projects the deep network representation $g_{\theta_X}(x)$ into the RKHS. In general, $Y/y$ can represent a different set of inputs; in our work, however, $X$ and $Y$ come from the same set of inputs.
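To make the computation in points 2-4 concrete, below is a minimal batched sketch. It assumes, purely for illustration, an RBF kernel on both the encoder outputs and the conditioning variable (the actual kernel choices follow our paper), and the helper names `rbf_gram` and `cclk_scores` are hypothetical.

```python
import torch

def rbf_gram(a, b, sigma=1.0):
    # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)); an assumed kernel choice
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))

def cclk_scores(hx, hy, z, lam=1e-3):
    """S[i, j] = <phi(g_{theta_X}(x_i)), mu_{Y|z_j}>_H for a batch of size b.

    hx, hy: (b, d) encoder outputs g_{theta_X}(x) and g_{theta_Y}(y).
    z:      (b, d_z) vector-valued conditioning variables.
    """
    K_xy = rbf_gram(hx, hy)   # K_xy[i, k] = <phi(hx_i), phi(hy_k)>_H
    K_z = rbf_gram(z, z)      # Gram matrix of the conditioning variable
    b = K_z.shape[0]
    # Column j of A is (K_Z + lam I)^{-1} Gamma_Z gamma(z_j)
    A = torch.linalg.solve(K_z + lam * torch.eye(b), K_z)
    # mu_{Y|z_j} = Phi_Y^T A[:, j]; its inner product with phi(hx_i) reduces to K_xy @ A
    return K_xy @ A
```

Note that only Gram matrices appear: the feature map $\phi$ is never evaluated explicitly, which is exactly why $\phi\big(g_{\theta_X}(x)\big)$ and $\mu_{Y|z}$ can be compared within the same RKHS.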
(InfoNCE/SimCLR) SimCLR is a framework and InfoNCE is an objective function; SimCLR uses InfoNCE as its objective. Our work adopts SimCLR as the base framework. We suspect the confusion arises because [2] first proposed the InfoNCE loss within the contrastive predictive coding (CPC) framework; we do not use the CPC framework but the SimCLR framework.
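For concreteness, here is a minimal one-directional sketch of the InfoNCE objective as it appears in a SimCLR-style setup (the function name and temperature value are illustrative, not from our paper):

```python
import torch
import torch.nn.functional as F

def info_nce(hx, hy, temperature=0.1):
    """hx[i] and hy[i] are representations of two views of the same input;
    every other pair in the batch serves as a negative."""
    hx = F.normalize(hx, dim=1)
    hy = F.normalize(hy, dim=1)
    logits = hx @ hy.t() / temperature     # (b, b) cosine similarities
    labels = torch.arange(hx.shape[0])     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```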
(The number of conditioning.) We believe we misinterpreted the wording "the number of conditioning"; we now take it to mean "the number of clusters". First, the conditional variable $Z$ in the plotted experiment consists of annotative attributes, which come as vector values rather than clusters. Our method (WeakSup_CCLK) works directly with vector-valued conditional variables, whereas the prior method (WeakSup_InfoNCE) can only handle conditional variables in cluster form; hence, clustering must be performed on top of the vector-valued conditional variable to make it suitable for WeakSup_InfoNCE (see the sketch below). Our plot shows that the number of clusters affects the performance of WeakSup_InfoNCE, while our method, which operates directly on the vector values, is unaffected. Please let us know if you are expecting other kinds of experiments, and we will try to include them before the rebuttal period ends (which is today).
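To illustrate the extra step that WeakSup_InfoNCE requires and our method avoids, a hypothetical preprocessing sketch using scikit-learn's KMeans (the function name is ours):

```python
from sklearn.cluster import KMeans

def discretize_conditioning(z, n_clusters):
    """z: (n, d_z) annotative-attribute vectors -> (n,) cluster ids.
    WeakSup_InfoNCE needs these discrete ids, and the choice of n_clusters
    is what the plot varies; WeakSup_CCLK consumes z directly and skips this."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z)
```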
(Statistical Difference) We apologize for the bolding and will unbold the numbers as the reviewer suggests.
[1] Noise Contrastive Meta-Learning for Conditional Density Estimation using Kernel Mean Embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1099-1107. PMLR.
[2] Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748.