# REBUTTAL UCOOT
# Reviewer RgZV
> ## Strengths And Weaknesses:
>
> **Cons:**
>
> 1. The work has limited novelty as a) COOT setup has been well studied in [40], and b) advantages of unbalanced setups in OT is well known. The work primarily extends COOT in unbalanced setup.
We respectfully disagree with the first comment for the following reasons. While COOT has been studied in [49] and [19], extending it to the continuous and unbalanced setting as in the paper requires a more rigorous definition and different proof techniques. Also, the general formulation in Def. 2, which generalizes OT, UOT, GW, UGW, COOT and UCOOT, has never been published before.
Secondly, we agree that the advantage of the unbalanced setup in OT is ***intuitive*** and ***empirically*** well known. However, our theoretical justification of robustness to outliers is the first such provable result for OT across incomparable spaces; previously, such guarantees were only available for UOT.
>
> 2. As mentioned in the paper, the main technical contribution (Theorem 1) is inspired from [25, Lemma 1]. The proof style is similar and the key challenges in proof steps is unclear.
We credited the unbalanced robustness result of [25, Lemma 1] to highlight the role played by the constructed contaminated plan in the proof. However, [25, Lemma 1] is specific to unbalanced OT with a single Dirac sample outlier. Adapting the proof to UCOOT is definitely not straightforward:
1. With UCOOT, we have to consider both sample and feature contaminations *simultaneously*. This is technically more challenging than regular OT, especially with the non-separable KL penalties.
2. We generalize the noise contamination to be any kind of additional distribution. While the exponential bound is straightforward for Dirac outliers, this is not the case in the general setting, which requires a careful derivation to avoid a very loose upper bound.
To clarify the different steps of the proof, we have added a proof sketch and structured the proof into separate paragraphs in the supplementary material.
> 3. The quantitative results are (Table 3) has very high standard-deviation, making the significance of improvement weak.
Unsupervised HDA is notoriously hard, and recent studies (see [67] and [49]) have also reported similarly high-variance results. One should note that this is unavoidable to some extent, as in unsupervised HDA the classes of the two domains can easily be mixed up. However, we believe that the difference of 6 performance points on average indicates that our results are significantly better on average despite the high variance. Also, taking $\varepsilon = 0$ (meaning that the NNPR solver is used) leads to competitive performance with lower variance. For convenience, we report the results here, and we will also add them to the revised version.
| Domains | COOT | UCOOT ($\varepsilon > 0$) | UCOOT ($\varepsilon = 0$) |
|:-------:|:-----------------:|:-------------------------:|:-------------------------:|
| C -> C | 36.40 (+/- 12.94) | **44.05 (+/- 19.33)** | 38.60 (+/- 9.16) |
| C -> A | 28.30 (+/- 11.78) | **31.90 (+/- 7.43)** | 29.45 (+/- 9.94) |
| C -> W | 19.55 (+/- 14.51) | 28.55 (+/- 6.60) | **40.85 (+/- 12.53)** |
| A -> C | **41.80 (+/- 14.81)** | 39.15 (+/- 17.98) | 18.00 (+/- 9.22) |
| A -> A | **57.90 (+/- 16.84)** | 42.45 (+/- 15.47) | 40.40 (+/- 8.40) |
| A -> W | 42.10 (+/- 7.80) | 48.55 (+/- 13.06) | **49.15 (+/- 6.64)** |
| W -> C | 8.60 (+/- 6.56) | **69.80 (+/- 14.91)** | 19.70 (+/- 5.79) |
| W -> A | 16.65 (+/- 10.01) | **30.55 (+/- 10.09)** | 25.90 (+/- 5.48) |
| W -> W | **75.30 (+/- 3.26)** | 51.50 (+/- 20.51) | 49.55 (+/- 6.02) |
| Average | 36.29 (+/- 10.95) | **42.94 (+/- 13.93)** | 34.62 (+/- 11.17) |
>
> 4. The paper does some specific pre-processing for UCOOT (lines 746-748, supplementary) to improve the performance of the UCOOT. Interestingly, no such exploration was done for other baselines like GW, UGW, and COOT. It should be noted that other works have explored other pre-processing steps for GW (see for example [1]). The experimental setup seems to favor UCOOT as some performance enhancing pre-processing was explored for UCOOT but not for others.
> 5. From the above, it also seems that UCOOT is sensitive to the domain of the input features which should be constrained to [-1,1]. The paper should discuss this limitation of UCOOT in the main paper.
In fact, there is a typo in lines 746-748, where UCOOT should be replaced by UGW. We note that both UCOOT and COOT can directly use the raw data (i.e., without any pre-processing step), which is exactly what we did in the HDA experiments. By contrast, UGW and GW require computing the pair-wise distance matrix beforehand, for which a normalization to [-1, 1] leads to more informative pair-wise similarities. With UCOOT replaced by UGW, lines 746-748 now make sense. We also find that this pre-processing step significantly improves the performance of both UGW and GW. As for UCOOT, the magnitude of the input data has an impact on it, as for any OT-based metric/divergence.
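For concreteness, here is a minimal sketch of this normalization, assuming Euclidean pairwise distances; the function name is ours and the exact recipe used in the supplementary may differ:

```python
import numpy as np
from scipy.spatial.distance import cdist

def normalized_distance_matrix(X):
    """Pairwise Euclidean distances rescaled to [-1, 1].

    Illustrative sketch of the pre-processing applied before GW/UGW
    (lines 746-748 of the supplementary), not the paper's exact code.
    """
    D = cdist(X, X)                          # (n, n) pairwise distances
    D = (D - D.min()) / (D.max() - D.min())  # rescale to [0, 1]
    return 2.0 * D - 1.0                     # shift to [-1, 1]
```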
>
> ## Questions:
> 1. In Section D.2, the range of hyper-parameters on which lambda_1, lambda_2 were tuned included 0. This does not adhere to the definition of UCOOT given in the main paper.
We agree with the reviewer that $\lambda_1, \lambda_2 = 0$ are not covered by the definition of UCOOT. In fact, none of the alignments reported in the paper use $\lambda_1 = 0$ or $\lambda_2 = 0$. We added a paragraph reporting the hyperparameter combinations used in these experiments in Appendix D.2 under 'Hyperparameter Tuning' and removed $0$ from the grids of $\lambda_1$ and $\lambda_2$.
>
> 2. COOT paper performs Heterogenous domain adaptation (HDA) task on Office-Caltech dataset with Decaf and GoogleNet features. Any specific reason why the same setup was not considered and instead Caffenet and GoogleNet features were selected for the HDA task.
In fact, Decaf and CaffeNet refer to the same thing (Decaf features are the embeddings produced by the CaffeNet architecture).
>
> 3. In Table 3, COOT for W->C task performed quite poorly compared to other tasks with accuracy of 8.60 (± 6.56). Could some preprocessing done for UCOOT (mentioned in the Section D.1) be helpful for COOT? If not the same, perhaps something similar?
As explained above, in the HDA experiments both UCOOT and COOT directly use the raw data (i.e., without any pre-processing step). We agree that the performance of COOT for W->C is poor. However, as UCOOT and COOT are of the same nature (i.e., closely related formulations that simultaneously learn feature and sample couplings), we need to use the same input data for both methods to make the comparison fair. There may exist a pre-processing step that boosts the performance of COOT, but in that case we would also need to apply the same step to the input data of UCOOT.
We believe that the ability of UCOOT/COOT to handle raw data is extremely useful in practice, as shown in multiple evaluation studies (in both our paper and the original COOT paper) where it brings a clear advantage. For this reason, we do not pre-process the data in the experiments.
> ## Evaluation
>
> Ethics Flag: No
> Soundness: 2 fair
> Presentation: 3 good
> Contribution: 1 poor
> Rating: 4: Borderline reject: Technically solid paper where reasons to reject, e.g., limited evaluation, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.
> Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
>
# Reviewer pwU4
> ## Strengths And Weaknesses:
>
> The authors start from a very high ground of proposing a general OT framework that can converge to all common OT formulation including OT, UOT, Gromov-W, UGW, COOT. The story then narrows down to the robustness of UCOOT to outliers, compared with COOT. The theoretical derivation looks good, but I have to say it's also not surprising at all. In general, the paper is well written.
Thank you for finding our paper well written. We believe that formally proving a result that is "not surprising" is an important contribution to the community, and the implications of robustness are much larger for UCOOT than for COOT, GW or even UGW. More precisely, we have shown robustness **w.r.t. both sample outliers and feature outliers** while allowing for simultaneous sample and feature selection and alignment.
>
> The authors mainly promote UCOOT from the perspective of its robustness to outlier samples, namely Theorem 1. However, 5.1 looks like a domain shift, not an outlier problem. I also don't see outliers in 5.2. It still looks like a domain shift problem. I also wonder if the connections to certain part of the domain will be wrongly weakened due to mass growth of shrink, resulting in a collapse of the transport map to only part of the domain that leads to lower loss. A visualization of the embeddings and the transport map would be helpful to clarify in the next version.
As discussed above, Theorem 1 is much more general than sample outliers alone. As a matter of fact, in Experiment 5.1 we design the data so that it has outlier features (pixels) instead of samples, and we show that UCOOT is robust to them. In Experiment 5.2, we transport between neural network embeddings of different sizes and architectures, meaning that some neurons on both sides can be seen as outliers and will not be transported/aligned. Again, UCOOT handles them better than COOT. We agree that we did not investigate outlier samples in the experiments, as this configuration has been extensively explored in the OT literature. Here, we wanted to emphasize a unique strength of UCOOT: the ability to handle outlier features.
Nevertheless, following a suggestion from reviewer tDQL, we have changed Experiment 5.1 by adding random outlier images to the target distribution, so that the target data contains both outlier features and samples. The performance of UCOOT shown in Figure 4 remains good, while COOT suffers a drastic loss in performance in the presence of both feature and sample outliers.
> In 5.3, I do see outliers as "antibodies that lack matches in the gene expression domain". Perhaps the authors could expand the discussion there just a bit on using UCOOT to discover abnormal antibodies or something like that?
We thank the reviewer for this suggestion. The outliers, defined here as features (or samples) without matches in the other domain, could in single-cell experiments be driven by the underlying biology or by the experimental design. For example, an experimental artifact could result in a rare feature (or sample) being missed by sequencing in one measurement and not in the other. We present this scenario in Figure 6(b), where five antibodies do not have their relevant gene mappings among the features selected for the experiment. We find that while COOT tends to incorrectly match some of these features, UCOOT shrinks their mass in transport and tends to avoid incorrect mappings. Therefore, by relaxing the mass conservation constraint, UCOOT helps avoid incorrect downstream analysis of the results due to outliers. We additionally present similar scenarios for samples (cells with missing correspondences due to subsampling experiments) in Appendix Figure S2(b&c), where UCOOT yields better sample alignments.
UCOOT makes it possible to detect these outliers by investigating the local mass relaxation (how much mass is shrunk per sample/feature). This can help discover rare cell types or genes, or prune features (or samples) that result from experimental artifacts, making UCOOT a valuable method for single-cell alignment experiments. Per the reviewer's suggestion, we added this discussion to Section 5.3.
>
> ## Questions:
> "By taking $\varepsilon$ sufficiently small, we can recover solutions close to the non-entropic case." Does the optimization become unstable when $\varepsilon$ is very close to $0$ like what we could expect from Sinkhorn OT?
When $\varepsilon$ is very close to zero, the optimization of the UCOOT loss does **not** become unstable, but convergence can be very slow.
We recall that the instability of the Sinkhorn algorithm for very small $\varepsilon$ comes from the exponential operation, which causes numerical overflow during the Sinkhorn update. If the algorithm is implemented in the log domain (see Section 4.4 in [A]), it is much less prone to this error and thus remains stable even for very small $\varepsilon$; in that case, however, convergence is very slow regardless of the implementation.
UCOOT is estimated using a block coordinate descent scheme, where each iteration consists in alternately solving two entropic unbalanced OT problems, both solved by the scaling algorithm [17] (a generalization of the Sinkhorn algorithm). As for Sinkhorn, we implement this algorithm in the log domain, so the optimization does not suffer from numerical errors even for very small $\varepsilon$, but convergence is slow.
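For illustration, a minimal log-domain update for the balanced entropic OT sub-problem, in the spirit of Section 4.4 of [A] (the unbalanced scaling algorithm of [17] additionally damps the potentials, which we omit here for brevity):

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, a, b, eps, n_iter=500):
    # Balanced entropic OT solved entirely in the log domain: every
    # update goes through logsumexp, so no exp() overflow occurs even
    # for very small eps -- convergence just becomes slower.
    f, g = np.zeros(len(a)), np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iter):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    # Transport plan recovered from the dual potentials.
    return np.exp((f[:, None] + g[None, :] - C) / eps)
```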
References:
[A] Gabriel Peyré and Marco Cuturi (2019), "Computational Optimal Transport: With Applications to Data Science", Foundations and Trends in Machine Learning: Vol. 11: No. 5-6, pp 355-607.
>
> Figure 2 & last paragraph in 4 are not clear.
>
> 1. IIUC, each sample actually shows that uniform initialization leads to lower loss than what initialization with UCOOT gives. Weren't the axes swapped?
Indeed, the axes were swapped. Thank you very much for noticing. We will correct it in the revised version.
>
> 2. Even if UCOOT leads to better initialization than uniform, that itself is not convincing because UCOOT is quite an expensive option compared to a uniform initialization. The trade-off is not discussed and there's no further experiment down the paper in this direction. Plus, it seems too early to show that UCOOT is a good way to initialize. Don't we have other simpler options to initialize? Like OT or GW on separate spaces, or other non-OT methods.
This is a good point. We only wanted to illustrate that UCOOT initialization can lead to a better solution, i.e., initialization from a different "valley" can yield a better solution for the non-convex COOT problem. This is in no way a definitive suggestion to do so in practice since, as the reviewer remarked, it is not a cheap initialization. Still, note that GW is as complex to solve as COOT, so it is not an efficient initialization either (but the Third Lower Bound of GW [43] is a more feasible choice). We will also move the figure and discussion to the Appendix to make more space for this discussion.
>
> 3. In Line 202, the last sentence seems unfinished.
Indeed, we will reformulate the sentence.
>
> 4. Assuming UCOOT leads to better or even the best initialization for COOT, then, can we combine these two processes: increase $\lambda_1, \lambda_2$ during iterations so UCOOT gradually converges to COOT.
Yes, this is a very interesting suggestion. In fact, we can show that when $\lambda_1$ and $\lambda_2$ tend to infinity, UCOOT converges to COOT. So, we can approach COOT by increasing $\lambda_1, \lambda_2$ at each UCOOT iteration, similarly to the epsilon-scaling approach for solving entropic OT [A].
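A minimal sketch of such a continuation scheme; `solve_ucoot` is a hypothetical solver that warm-starts from the previous couplings, and the schedule, bounds, and data matrices `X`, `Y` are illustrative assumptions:

```python
# Hypothetical continuation: UCOOT -> COOT as lambda -> infinity,
# analogous to epsilon scaling for entropic OT [A].
lam, pi_s, pi_f = 1e-1, None, None
while lam <= 1e4:
    pi_s, pi_f = solve_ucoot(X, Y, lam1=lam, lam2=lam, eps=eps,
                             init=(pi_s, pi_f))  # warm start
    lam *= 2.0  # geometric schedule, as in epsilon scaling
```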
References
[A] B. Schmitzer, Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems, SIAM Journal on Scientific Computing, 41 (2019), pp. 1443–1481.
> Since the authors keep separating $\lambda_1$ and $\lambda_2$, I wonder if they tried different values for them so that we lean more on the marginalization of either source or target.
As explained in Section D of the Appendix, we tried various combinations of $(\lambda_1, \lambda_2)$ and chose the one with the highest accuracy. In general, there is no reason to force $\lambda_1 = \lambda_2$.
>
> The authors treat $\lambda_1$ and $\lambda_2$ purely as hyper-parameters in the optimization. To solve real problems, it's not always the goal to find the lowest loss
We do treat $\lambda_1, \lambda_2$ purely as hyperparameters, but we do not choose them based on the optimization loss. For example, in the HDA experiments, given a set of hyperparameters, we first train multiple models (sample and feature couplings) on multiple sub-samples of the W->W dataset by minimizing the UCOOT loss. Then, we calculate the accuracy of each model and choose the hyperparameter tuple with the highest average accuracy on this dataset (see Appendix for details). A sketch of this protocol is given below.
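In this sketch, the helpers `fit_ucoot` and `hda_accuracy`, the grids, and the list `ww_subsamples` of labeled W->W sub-sample pairs are all hypothetical stand-ins for our actual pipeline:

```python
from itertools import product

best_score, best_params = -1.0, None
for lam1, lam2, eps in product(lam1_grid, lam2_grid, eps_grid):
    # Downstream accuracy, not the UCOOT loss value, drives selection.
    accs = [hda_accuracy(fit_ucoot(Xs, Xt, lam1, lam2, eps), ys, yt)
            for (Xs, ys), (Xt, yt) in ww_subsamples]
    score = sum(accs) / len(accs)
    if score > best_score:
        best_score, best_params = score, (lam1, lam2, eps)
```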
>
> For 5.3, I'm curious of the range of the loss from different $\lambda_{1,2}$. When it comes to biological or medical data, it's not always the goal to get the lowest loss. If $\lambda_{1,2}$ can dramatically change the loss which then influences the transport map (alignment) thus the conclusion from the data, it's better that $\lambda_{1,2}$ also have biological or medical meaning?
We agree with the reviewer that, among the hyperparameter combinations tested, it is possible for a $\lambda_{1,2}$ combination to yield the lowest loss upon optimization but not the most biologically relevant alignment. As a result, in our experiments we picked the optimal $\lambda_{1,2}$ values based on the alignment score (evaluation metric) instead of the lowest loss value. The alignment scores capture how well the samples and features match according to the underlying biology (e.g., mapping cell types or genes to their respective antibodies). We added a description of the hyperparameter combinations used for the reported alignments to the 'Hyperparameter Tuning' section of Appendix D.2. As one might expect, we observe that lower lambda values are preferred when there are bigger discrepancies in the number of features that have a match (for example, $\lambda_2 = 1$ for the balanced scenario of matching 10 antibodies with the 10 genes that code for them, but $\lambda_2 = 10^{-2}$ for the unbalanced scenario of matching 10 antibodies with 5 genes). If a user has prior knowledge that a dataset is likely to contain samples or features lacking correspondences (e.g., rare cell types, unprofiled genomic features, etc.), we would advise them to start with a lower range of lambda values (e.g., $10^{-5}$ to $10^{-2}$) for the alignment of samples/features. However, such prior knowledge requires domain expertise.
> ## Limitations:
>
> The discussion on the complexity and time in real world is not sufficient.
We did not discuss this aspect and we thank the reviewer for the suggestion. The complexity of (entropic) UCOOT is the same as that of entropic COOT, which was investigated in the supplementary of [49]: each iteration solves two inner entropic OT problems, implying a roughly quadratic complexity, which is also the complexity of solving entropic GW. However, we note that UCOOT can benefit from speedups of the Sinkhorn algorithm, for example [A], [B]. We will add a discussion reflecting this to the paper.
References
[A] B. Schmitzer, Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems, SIAM Journal on Scientific Computing, 41 (2019), pp. 1443–1481.
[B] Meyer Scetbon, Marco Cuturi and Gabriel Peyré, Low-Rank Sinkhorn Factorization, International Conference on Machine Learning (ICML 2021).
>
> There're multiple hyperparameters and it doesn't seem the authors have a conclusion of a typical combination of them for all datasets.
Compared to COOT, we only introduce two additional hyperparameters corresponding to the relaxation of the two marginal distributions. The choice of these hyperparameters depends on the dataset and on domain knowledge. Intuitively, $\lambda$ controls the relaxation of the marginal-preservation constraint, so if we know *a priori* that there are many outliers, we should choose a smaller $\lambda$. In our experiments, we validate these parameters with the help of domain knowledge to narrow down the search range, as is usually done in many ML applications.
>
> I doubt how practical it is to use UCOOT in real datasets, especially in biological or medical datasets. From my understanding, we could always get a zero loss by setting $\lambda_{1,2}$ to zero, meaning we don't transport anything...
It is true that we can always get a zero OT loss by setting $\lambda_{1,2} = 0$, but that is not what we do in practice. As explained above, we do not choose the optimal hyperparameters based on the minimized UCOOT loss. Depending on the application, other criteria are used, for example accuracy in the HDA experiment, or the alignment score in the multi-omics experiment.
In any case, these comments apply not only to UCOOT, but also to UOT, which has been shown to be of strong practical interest in many applications despite the apparent difficulty of selecting the values of the hyperparameters.
>
> ## Evaluation
>
> Ethics Flag: No
> Soundness: 3 good
> Presentation: 3 good
> Contribution: 3 good
> Rating: 5: Borderline accept: Technically solid paper where reasons to accept outweigh reasons to reject, e.g., limited evaluation. Please use sparingly.
> Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
# Reviewer L4eM
>
> ## Strengths And Weaknesses:
>
> **Weaknesses:**
>
> (1) I would like to see a runtime comparison for the proposed method and the baselines. Additionally, it would be nice if the authors can show the convergence curve of the proposed method.
Below, we report the runtime of UCOOT, COOT, UGW, GW. We use the optimal hyperparameters tuned from Experiment 5.2:
- For UCOOT, $(\lambda_1, \lambda_2, \varepsilon) = (50,1,0.1)$.
- For UGW, $(\lambda_1, \lambda_2, \varepsilon) = (5, 5, 0.05)$.
- For GW, $\varepsilon = 0.05$.
- For COOT, $(\varepsilon_1, \varepsilon_2) = (0.1, 0.5)$.
We measure the runtime (in seconds) of the 4 competing methods on the task C->W. We repeat this measurement 10 times and report the average and standard deviation.
| | GW | UGW | COOT | UCOOT |
|----------|-------------------|-------------------|-------------------|-------------------|
| Time (s) | 1.75 ($\pm$ 2.41) | 4.45 ($\pm$ 1.22) | 1.24 ($\pm$ 0.26) | 7.25 ($\pm$ 0.54) |
In this experiment, we observe that unbalanced methods clearly require more time to converge than the balanced ones. However, we also remark that:
- UGW, UCOOT and COOT share the same optimization procedure, i.e., block coordinate descent, whose iterations consist in alternately solving two entropic (unbalanced or balanced) OT problems. For entropic GW, each iteration also requires solving one entropic balanced OT problem.
- We use the scaling algorithm [17] (a generalization of the Sinkhorn algorithm) to solve the entropic OT problems. Its convergence depends crucially on the choice of $(\lambda_1, \lambda_2, \varepsilon)$ (for the unbalanced problem) or of $\varepsilon$ (for the balanced problem).
So, the reported runtimes partly reflect these specific hyperparameter choices rather than an intrinsic gap between the methods.
Regarding the loss curve, we report the training loss for the optimal choice of hyperparameters; the loss decreases only marginally.
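For reference, a minimal sketch of the shared block coordinate descent structure described above; `sample_cost`, `feature_cost` and `uot_scaling` are hypothetical helpers standing in for the linearized cost construction and the entropic unbalanced OT solver of [17]:

```python
import numpy as np

def ucoot_bcd(X, Y, ws, wt, vs, vt, lam1, lam2, eps, n_iter=50):
    # Alternately solve two entropic unbalanced OT problems,
    # one per coupling (samples, then features).
    pi_s = np.outer(ws, wt)   # sample coupling init
    pi_f = np.outer(vs, vt)   # feature coupling init
    for _ in range(n_iter):
        Cs = sample_cost(X, Y, pi_f)    # cost induced by feature coupling
        pi_s = uot_scaling(Cs, ws, wt, lam1, lam2, eps)
        Cf = feature_cost(X, Y, pi_s)   # cost induced by sample coupling
        pi_f = uot_scaling(Cf, vs, vt, lam1, lam2, eps)
    return pi_s, pi_f
```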
> (2) To my knowledge, the weight of the KL term is critical for the performance of OT algorithms. The sensitivity of the method to the hyperparameters should be analyzed in detail.
We agree that the regularization hyperparameter $\varepsilon$ is important for the performance. In fact, the parameters $\lambda_1$ and $\lambda_2$ are also important, and we have added a sensitivity analysis of all hyperparameters to the revised version.
Here, we report the sensitivity of UCOOT's performance to all hyperparameters for two tasks, C$\to$W and A$\to$A. In general, the performance depends significantly on the choice of hyperparameters.
- Given fixed values of $\lambda_1$ and $\lambda_2$, UCOOT performs poorly for either too small or too large values of $\varepsilon$. So, regularization is necessary, but should not be too strong.
- Large values of $\lambda_1$ also degrade the performance, meaning that the marginal constraints on the source space should not be too tight.
- It seems that large $\lambda_2$ is preferable, so the marginal distributions on the target space should not be too relaxed.
Sensitivity analysis of $\varepsilon$: we fix $\lambda_1 = 50$ and $\lambda_2 = 1$ and show the accuracy for various values of $\varepsilon$.
| Domains | $\varepsilon= 0.03$ | 0.05 | 0.07 | 0.1 | 0.2 | 0.3 | 0.4 |
|:---------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|:------------------:|
| C $\to$ W | 27.65 ($\pm$ 11.34) | 37.20 ($\pm$ 9.35) | 34.50 ($\pm$ 11.07) | 34.75 ($\pm$ 13.04) | 17.00 ($\pm$ 5.92) | 18.45 ($\pm$ 1.11) | 11.25 ($\pm$ 1.66) |
| A $\to$ A | 21.95 ($\pm$ 9.46) | 35.30 ($\pm$ 15.11) | 35.65 ($\pm$ 15.05) | 41.15 ($\pm$ 19.16) | 58.45 ($\pm$ 15.54) | 22.30 ($\pm$ 3.74) | 8.90 ($\pm$ 1.34) |
Sensitivity analysis of $\lambda_1$: we fix $\lambda_2 = 1$ and $\varepsilon = 0.1$ and show the accuracy for various values of $\lambda_1$.
| Domains | $\lambda_1= 20$ | 30 | 40 | 50 | 60 | 70 | 80 |
|:---------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|
| C $\to$ W | 35.80 ($\pm$ 9.33) | 34.15 ($\pm$ 12.98) | 37.35 ($\pm$ 13.82) | 27.45 ($\pm$ 8.33) | 32.45 ($\pm$ 11.62) | 30.00 ($\pm$ 8.04) | 30.15 ($\pm$ 12.89) |
| A $\to$ A | 55.20 ($\pm$ 18.44) | 53.35 ($\pm$ 18.74) | 44.15 ($\pm$ 21.54) | 24.30 ($\pm$ 15.58) | 36.10 ($\pm$ 23.97) | 32.35 ($\pm$ 14.88) | 24.80 ($\pm$ 15.08) |
Sensitivity analysis of $\lambda_2$: we fix $\lambda_1 = 50$ and $\varepsilon = 0.1$ and show the accuracy for various values of $\lambda_2$.
| Domains | $\lambda_2= 0.3$ | 0.5 | 0.7 | 1 | 2 | 3 | 4 |
|:---------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:-------------------:|
| C $\to$ W | 34.20 ($\pm$ 9.83) | 34.45 ($\pm$ 10.80) | 34.20 ($\pm$ 10.50) | 34.75 ($\pm$ 13.04) | 29.70 ($\pm$ 10.55) | 37.70 ($\pm$ 17.96) | 32.30 ($\pm$ 18.81) |
| A $\to$ A | 20.75 ($\pm$ 10.11) | 29.00 ($\pm$ 15.79) | 29.25 ($\pm$ 20.66) | 41.15 ($\pm$ 19.16) | 32.65 ($\pm$ 8.80) | 42.10 ($\pm$ 20.71) | 49.95 ($\pm$ 15.75) |
> (3) The motivation of the KL regularizer in Eq.(1) should be further explained because there are other options constructing such regularizer, e.g. $KL(\pi_1^s | \mu_1^s) + KL(\pi_2^s | \mu_2^s) + KL(\pi_1^f | \mu_1^f) + KL(\pi_2^f | \mu_2^f)$, or $KL(\pi_1^s \otimes \pi_2^s | \mu_1^s \otimes \mu_2^s) + KL(\pi_1^f \otimes \pi_2^f | \mu_1^f \otimes \mu_2^f)$. What is the advantage of the proposed method compared to these options?
This is a good suggestion; we agree that the KL term deserves more explanation. We chose the proposed formulation mainly for theoretical reasons, since it can be recast as a variation of the unbalanced OT problem. This is advantageous because we can apply known techniques from the unbalanced OT literature to obtain our main results, namely the existence of a solution and the robustness property.
> If my understanding is correct, like the original COOT, the UCOOT still requires learning feature-and sample-level coupling for the whole dataset. This setting makes the scalability of the method questionable when dealing with big data.
> (1) Can the authors discuss the potential scalability issue of COOT and UCOOT?
It is true that both UCOOT and COOT involve simultaneously learning feature and sample couplings on the whole dataset. Generally speaking, none of the usual OT-based distances across spaces (GW, UGW, COOT, UCOOT) are adequate for big-data applications, because they require storing (or recomputing at each iteration) the sample coupling, a matrix with $O(n^2)$ entries. Our main goal in this paper is to tackle the robustness issue and show that it is important in practice.
Finally, from our experience, UCOOT scales to sizes up to $O(10^4)$ samples, which is arguably sufficient in many applications. There exist strategies to handle large-scale data in OT (for instance, low-rank entropic OT), but adapting them to UCOOT is out of scope for this paper.
> (2) How to select the hyperparameters, e.g., the weights of the KL regularizers?
>
On real data, the hyperparameters were selected using classical validation strategies (maximizing accuracy on one pair in HDA, and maximizing the alignment score in the multi-omics application).
The sets of choices are described in Section D of the Appendix.
> ## Limitations:
>
> (1) The rationality of the proposed regularizers should be further highlighted.
> (2) The sensitivity of the method to its hyperparameters should be discussed.
> (3) Whether the UCOOT can be used to deal with big data should be discussed.
These points have been answered above.
> ## Evaluation
>
> Ethics Flag: No
> Soundness: 3 good
> Presentation: 3 good
> Contribution: 3 good
> Rating: 6: Weak Accept: Technically solid, moderate-to-high impact paper, with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations.
> Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
# Reviewer tDQL
> ## Strengths And Weaknesses:
>
> **Weaknesses**
>
> 1. As noted in Discussion section by the authors, it would be beneficial for the readers to see if the theoretical nature of the proposed algorithm is clarified.
We use classical block coordinate descent, which has been shown to converge to a stationary point for non-convex problems. The main contributions of the paper are not the algorithm but the new formulation, the theoretical robustness result, and the empirical results showing that UCOOT is strictly better than COOT in practice.
>
> 2. In the MNIST demonstration, the noise on the feature side was included, but would be more convincing if the noise on the sample side was added (e.g., inserting non-numeric images in Y only).
This is a very interesting suggestion. We have added a figure to the corresponding experiment testing the addition of 50 random outliers (images with random entries). The performance of UCOOT shown in Figure 4 remains good, while COOT suffers a drastic loss of performance in the presence of sample outliers. Moreover, notice how UCOOT gives 0 mass to all the outlier columns.
>
> ## Questions:
> I would like to see more detailed descriptions of some parts in the updated version. I am specifically concerned with the following points:
>
> 1. §4, UCOOT helps finding better minima
We conducted this small experiment to illustrate that UCOOT alignments are significantly richer due to the relaxed constraints: the obtained alignments have lower alignment (transportation) costs. However, we could not develop this section enough due to the page limit. For the sake of clarity, we have decided to move it to the Appendix, where it is now discussed in detail.
>
> 2. §5.1, Figure 4 (b) barycentric mapping
The barycentric mapping is a well-known method for transporting discrete distributions using an OT plan [A]. The barycentric mapping of a source sample $x^s_i$ is equal to $$\frac{\sum_{j} T_{i,j} x^t_j}{\sum_{j} T_{i,j}},$$ i.e., a weighted average of target samples, where $T$ is the optimal sample coupling and $\{x^t_j\}_j$ denotes the target data. We performed this operation in color space to see where each area of a source image goes on a target image in Figure 4(b).
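A minimal NumPy sketch of this formula (the function name and the zero-mass handling are ours; with unbalanced plans, rows of near-zero mass correspond to discarded outliers and are left unmapped):

```python
import numpy as np

def barycentric_map(T, Xt, tol=1e-12):
    # Map each source sample to the weighted average of target samples,
    # with weights given by its (normalized) row of the coupling T.
    row_mass = T.sum(axis=1, keepdims=True)          # (n_s, 1)
    mapped = np.full((T.shape[0], Xt.shape[1]), np.nan)
    keep = row_mass[:, 0] > tol                      # skip zero-mass rows
    mapped[keep] = (T[keep] @ Xt) / row_mass[keep]
    return mapped
```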
References:
[A] Ferradans, S., Papadakis, N., Rabin, J., Peyré, G., Aujol, JF. (2013). Regularized Discrete Optimal Transport. In: Kuijper, A., Bredies, K., Pock, T., Bischof, H. (eds) Scale Space and Variational Methods in Computer Vision. SSVM 2013. Lecture Notes in Computer Science, vol 7893. Springer, Berlin, Heidelberg.
>
> 3. §5.2, Information loss when using Euclidean distance in high dimensional space
We claim that using the Euclidean distance in a high-dimensional space not only causes information loss, but is also not relevant. First, by construction, the Euclidean distance only contains aggregated information and thus naturally incurs information loss. For example, the Euclidean distance between the two points $(0,-1)$ and $(0,1)$ is $2$, and we also know that the two points lie on the vertical axis; however, given the distance $2$ alone, it is impossible to recover the original positions. In other words, when calculating the Euclidean distance, we lose the information about the positions of the points.
Second, it is known that the Euclidean distance is not a good metric in high-dimensional spaces [A], [B]; we will add the relevant references to this claim in the revised version. We also emphasize that, while the Euclidean distance is used in GW, it is not required in COOT/UCOOT, where one can directly use the entry-wise similarity between the two input matrices.
References:
[A] Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press
[B] Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. In Database Theory — ICDT 2001, pages 420–434. Springer Berlin Heidelberg.
>
> ## Limitations:
>
> In Discussion section, the authors describe the limitations of the proposed approach and some interesting future work including its use as a loss function.
>
>
> ## Evaluation
>
> Ethics Flag: No
> Soundness: 4 excellent
> Presentation: 4 excellent
> Contribution: 3 good
> Rating: 7: Accept: Technically solid paper, with high impact on at least one sub-area, or moderate-to-high impact on more than one areas, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
> Confidence: 2: You are willing to defend your assessment, but it is quite likely that you did not understand the central parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
# Reviewer 2UWg
> ## Strengths And Weaknesses:
>
> **Weaknesses:**
>
> Only toy data sets are used in the experiment, and there are not many kinds of disturbances, which does not reflect the adaptability to various disturbances,
We respectfully and strongly disagree with the reviewer. The toy example is only used for illustration purposes (in Section 5.1). We also use real-world datasets in our experiments, namely Caltech-Office (in Section 5.2) and the CITE-seq dataset (in Section 5.3):
- Caltech-Office contains real images of objects in an office environment, taken from Amazon, a digital SLR camera and a webcam. All details on this dataset can be found in [51].
- CITE-seq dataset is a real-world experimental dataset that simultaneously measures gene expression and antibody (or surface protein) abundance in single cells. Single-cell measurements of different biological phenomena allow scientists to study how the cellular processes are regulated during development and disease. The dataset in this study is measured from 1000 human peripheral blood cells for ten antibodies and 17,014 genes. CITE-seq experiments have been used widely in the single-cell community to evaluate the performance of alignment algorithms (e.g., [A], [B]). This is because we know the ground-truth correspondences on both the samples (i.e., cells, thanks to the paired measurements) and the features (i.e., genes and their encoded antibodies), thus allowing us to quantify and compare the alignment performances.
References:
[A] Hao, Yuhan, et al. "Integrated analysis of multimodal single-cell data." Cell 184.13 (2021): 3573-3587.
[B] Dou, Jinzhuang, et al. "Bi-order multimodal integration of single-cell data." Genome biology 23.1 (2022): 1-25.
> and there is no corresponding experiment for the large number of disturbances mentioned above.
Could the reviewer be more precise about the kinds of perturbations they are expecting? Note that we updated Experiment 5.1 so that the target data now contains both feature and sample outliers. We also show that UCOOT is the only method that remains relevant in this configuration.
>
> ## Questions:
> For UCOOT method, what percentage of noise will affect the result?
To the best of our knowledge, there is no exact (or even approximate) threshold, as it depends on the nature of the experiments.
>
> How to automatically identifying the outliers and enhence the performance of COOT?
The objective of UCOOT is to do exactly that automatically: simultaneously find correspondences and identify outliers by not transporting them (see the modified rightmost column of Figure 3). To improve the performance of COOT, the outliers would need to be detected and removed first. This leads to a two-step approach that is hard to use and tune in practice, especially compared to solving one unique well-posed problem such as UCOOT.
>
> ## Evaluation
>
> Ethics Flag: No
> Soundness: 2 fair
> Presentation: 3 good
> Contribution: 2 fair
> Rating: 3: Reject: For instance, a paper with technical flaws, weak evaluation, inadequate reproducibility and incompletely addressed ethical considerations.
> Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
# SECOND ROUND OF DISCUSSION
# Reviewer RgZV
>
> Regarding "In fact, Decaf and CaffeNet indicate the same thing (Decaf are embeddings provided by the CaffeNet architecture)."
>
> In case Decaf and CaffeNet indicate the same thing, it will be great if the authors can explain the difference in the results reported in COOT paper (https://proceedings.neurips.cc/paper/2020/file/cc384c68ad503482fb24e6d1e3b512ae-Paper.pdf, Table 2) and Table 3 of the submitted work. The difference between the COOT results on unsupervised HDS (reported by its authors) and that reported in Table 3 of the submitted work is big, especially for, say W->C task (35.5 vs 8.6).
From the paper and the code provided by the authors, COOT was not tuned in the COOT paper and a fixed parameter was used (with no information about how it was selected). In our paper, we provide a more realistic validation of the parameters on one pair of datasets, which explains the difference in performance. We also note that unsupervised HDA is a difficult task, so the choice of hyperparameters is critical to the performance.
> Regarding additional NNPR solver results for UCOOT.
>
> It seems that with NNPR solver, COOT performs better than UCOOT on unsupervised HDA experiment. It will be great if the authors can explain why UCOOT is unable to retain its advantage over COOT in this setting (of $\varepsilon = 0$). My understanding is that the given Definition 3 and robustness results (Theorem 1) are applicable to $\varepsilon = 0$ setting.
It is true that, on average, COOT performs better than UCOOT with the NNPR solver (UCOOT-NNPR for short). However, looking at the individual performances, UCOOT-NNPR outperforms COOT in 6 out of 9 tasks, similarly to UCOOT with the scaling algorithm. Note that the average performance is not a good proxy due to the very different accuracies across DA tasks.
It is true that the robustness result applies to $\varepsilon = 0$, but we believe that the average gain of UCOOT ($\varepsilon > 0$) might be due to the better ability of entropic OT to avoid overfitting in high dimension (see existing statistical results).
### Second reply
We greatly appreciate that you take the time to discuss with us.
However, we strongly disagree with your comment about the performances of COOT. We include the performance of COOT from the original COOT paper in the table below. As can be seen, the corresponding average accuracy is 33.28%, whereas with our validation strategy it is 36.29%, a gain of 3 percentage points (there is an important loss on one pair but important gains on the others), meaning that our validation procedure is not unfair toward COOT. Also, as stated in our reply, the authors of COOT fixed the parameter and did not discuss how it was chosen (perhaps by selecting the one that works best on average, hence using the test data for validation). By contrast, we provide experiments with a fair and realistic parameter validation procedure, since the same protocol is used for all methods and can be applied to real data when only two labeled domains are available. The fact that COOT does not work as well on some dataset pairs is interesting, especially from a reproducible-research perspective, and shows that it might be sensitive to parameters in practical applications. Selecting a parameter without proper validation is common in HDA papers, due to the difficulty of the problem, but we believe it is important for the community to provide more rigorous and realistic experiments in the future, as we did in this paper. In other words, HDA is a difficult problem and the question of the validation procedure is still open, so it is unfair to penalize us for promoting more rigorous experiments that highlight reproducibility issues.
| Domains | COOT (paper) | COOT (valid) | UCOOT ($\varepsilon > 0$) | UCOOT ($\varepsilon = 0$) |
|---------|-------------------|-----------------------|---------------------------|---------------------------|
| C -> C | 42.85±18.44 | 36.40 (+/- 12.94) | **44.05 (+/- 19.33)** | 38.60 (+/- 9.16) |
| C -> A | **33.25±15.93** | 28.30 (+/- 11.78) | 31.90 (+/- 7.43) | 29.45 (+/- 9.94) |
| C -> W | 25.50±11.76 | 19.55 (+/- 14.51) | 28.55 (+/- 6.60) | **40.85 (+/- 12.53)** |
| A -> C | 17.40±8.86 | **41.80 (+/- 14.81)** | 39.15 (+/- 17.98) | 18.00 (+/- 9.22) |
| A -> A | 42.85±17.65 | **57.90 (+/- 16.84)** | 42.45 (+/- 15.47) | 40.40 (+/- 8.40) |
| A -> W | 30.95±18.19 | 42.10 (+/- 7.80) | 48.55 (+/- 13.06) | **49.15 (+/- 6.64)** |
| W -> C | 35.40±14.61 | 8.60 (+/- 6.56) | **69.80 (+/- 14.91)** | 19.70 (+/- 5.79) |
| W -> A | **34.25±13.03** | 16.65 (+/- 10.01) | 30.55 (+/- 10.09) | 25.90 (+/- 5.48) |
| W -> W | 37.10±14.57 | **75.30 (+/- 3.26)** | 51.50 (+/- 20.51) | 49.55 (+/- 6.02) |
| Average | 33.28±7.61 | 36.29 (+/- 10.95) | **42.94 (+/- 13.93)** | 34.62 (+/- 11.17) |
We agree that the UCOOT ($\varepsilon = 0$) result is interesting and would deserve investigation, but we already have numerous results in the Supplementary and believe this should be done in future work.
The value of $\varepsilon$ does not affect the robustness property of UCOOT: the relaxation of the marginal constraints is controlled by the $\lambda_i$ hyperparameters and not by $\varepsilon$. The reason why $\varepsilon = 0$ may lead to worse results in practice is that positive values bring the same improvement (unrelated to robustness) that entropic regularization brings to traditional OT: it reduces the sample complexity of estimating the OT distance, thus avoiding, to some extent, the curse of dimensionality. Given that our theoretical guarantees regarding the robustness of UCOOT are asymptotic, in practice (and for small-sample-size, high-dimensional datasets like Office/Caltech) we obtain better results with $\varepsilon > 0$ because we benefit from these properties.
The robustness properties are actually illustrated in the MNIST experiment, where we used both COOT and UCOOT with $\varepsilon = 0$. UCOOT performs very well: (1) of the 50 sample outliers, none are transported; (2) irrelevant features are discarded. This experiment also sheds some light on why COOT can perform poorly in the W->C task: it is very sensitive to extremely small perturbations; notice how a small amount of noise can lead to a totally different alignment. Whether COOT can be optimized further for this specific W->C task is not relevant: UCOOT is less sensitive and more robust.
# Reviewer tDQL
Thank you for the suggestion.
In the revised version, we will add a detailed description of the barycentric mapping to the Appendix and work on lines 211-218 for more clarity, as suggested.
# Message to AC
Dear Senior Area Chairs and Area Chairs,
We are writing to discuss the review provided by reviewer 2UWg. In contrast to its confidence score of 5, the review is very short and lacks adequate discussion, including an incorrect claim about the paper (e.g., that only toy datasets are used in the experiments) and a concern that is out of the scope of the paper (how to automatically identify the outliers to enhance the performance of COOT).
We also tried to answer some of the vague comments, but the reviewer does not seem eager to exchange with us or even to clarify their position.
We believe that this review is not fair and does not meet the standard of a selective conference such as NeurIPS, and thus should not be taken into account in the final decision.
Finally, note that while we disagree with some other reviewers concerning their notes or comments, they did provide us with constructive feedback and discussion, which is greatly appreciated.
Kind regards,
The authors.