# Rebuttal NeurIPS
**Note to AC regarding Reviewer 2Xn4** (do not forget to exclude reviewers from readers of this message)
Dear Area Chair,
We believe that the review of Reviewer 2Xn4 lacks factual justification for some of their claims.
First, the cited reference [1], used by the Reviewer to claim that "aligning source and target distributions is not important and not interesting", doesn't in fact back up this claim: it merely shows that DA methods following the invariant representation learning idea may fail in some pathological scenarios that are, nevertheless, far away from the common DA use-cases considered in this and many other papers. An even earlier work of Ben-David et al. AISTATS'10 on impossibility theorems for DA proved similar results for any DA method, and this didn't diminish the importance of the DA field. Arguably, such methods, often presented under the name of adversarial DA methods (minimization of the source error + alignment of source and target distributions), are still among the best performing and most widely used ones.
Second, the reviewer claims that the linear Monge mapping is widely studied in DA even though, to the best of our knowledge, there is only one arXiv paper that uses it for DA and two other published papers from the computer vision community over the last 15 years. This can hardly be called a well-studied topic; otherwise, the reviewer should include references backing up their claim about the lack of novelty and the incremental nature of our method. This is also confirmed by our new ablation study showing that, by itself, the linear Monge mapping doesn't perform well on the DA task when applied to raw data.
**General reply**
We would like to thank all reviewers for the time that they took to provide us with their valuable feedback. We appreciate that our work was found intriguing and exciting, our paper clearly written and easy to follow, and our idea sound and well-motivated.
To respond to the questions raised by the reviewers, we added all the suggested ablation studies to the Supplementary material. These include: 1) an evaluation of the impact of $\lambda$ on the trade-off between the data fidelity loss and the linear alignment loss; 2) a comparison to a baseline that uses the linear Monge mapping on raw data. The first evaluation also includes the performance for $\lambda=0$, i.e., the case when no alignment is performed. Both studies highlight that 1) the linear Monge mapping on raw data doesn't work in DA and our linear alignability loss term is *necessary* to achieve good performance; 2) our performance is not solely due to the increased quality of the embeddings obtained with the AEs, as without the alignment term these embeddings lead to very poor DA results.
We hope that the reviewers will take our revised manuscript and replies to the raised concerns into account when revising their scores.
We now reply to the individual comments of the reviewers in the threads below.
**Reviewer nXd3** (5: borderline accept)
We thank the reviewer for their feedback, and for using such encouraging words as "exciting" and "intriguing" in relation to our proposed idea. We hope that our additional evaluations highlight its benefits more clearly. We would be delighted to hear any other suggestions the reviewer may have that would help us make our work stronger.
> **Weaknesses:**
> 1. I would be interested to see if such a method could be used for more recent datasets like WILDS, which has much more subtle domain changes.
We agree that Office/Caltech is not the most recent benchmark for DA methods, yet we chose it for three main reasons:
1. Historically, OT-based DA methods were evaluated on this benchmark, most likely because 1) they relied on primal OT (Kantorovich formulation), which doesn't scale to larger datasets, and 2) they were not suited to the image data of more recent benchmarks. As our contribution is at the crossroads of computational OT and DA, we decided to stick to the common evaluation protocol for the sake of a fair comparison.
2. Intuitively, evaluating DA methods on larger datasets from the computer vision (CV) field requires more effort in the feature extraction/representation learning part, while most such methods still rely on fine-tuning applied to the extracted embeddings. In this sense, the evaluation on Office/Caltech skips the representation learning part and directly evaluates the adaptation efficiency of the proposed technique during the fine-tuning stage.
3. Office/Caltech can be used for both heterogeneous and homogeneous DA. Additionally, for both of these settings the performance of OT-based DA methods still has a large margin for improvement, meaning that this is not an easy adaptation task. The small size of the datasets also brings an additional challenge, especially for deep DA methods.
All that being said, we agree with the reviewer that extending LaOT with a different, more powerful backbone that can be applied to more recent datasets is definitely a good idea.
> 2. In depth evaluation of why it does better on some shifts and worse on others would be appreciated.
This is a very good suggestion. Given a recent work on using the Monge mapping for generalized target shift (Kirchmeyer et al. ICLR'22), we believe that our work can also be extended to other types of shift, such as target shift, by incorporating a proper reweighting strategy. We will consider it as an interesting future perspective for our work.
> **Questions:**
> 1. Why is the method mentioned in 3.1 not compared to? It sounded like the method learned invariant feature transformations, which seems like an applicable baseline.
The paper of Seguy et al. tackles a different problem: it aims at showing how stochastic optimization can be used to solve large-scale OT problems. Consequently, it doesn't constitute a DA method per se, but rather an optimization procedure for solving OT on large datasets. As mentioned in their paper, "our goal here is not to compete with the state-of-the-art methods in domain adaptation but to assess that our formulation allows to scale optimal transport based domain adaptation (OTDA) to large datasets".
> 2. Since LaOT takes in network features, can this method theoretically be applied to any network architecture?
Yes, the reviewer is absolutely right about that. This is why extending it to work on more challenging datasets, such as WILDS, can be done by adding an efficient backbone for the specific dataset at hand. In this case, the simple MLP encoder of our AE can be seen as a fine-tuning step that produces linearly alignable features on top of the embeddings extracted by any such backbone.
We will add explanations clarifying these points to the paper.
**Reviewer XmZ5** (6: weak accept)
We thank the reviewer for their feedback, and for finding our idea sound, and our paper clear and well-written. We hope that our additional evaluations highlight its benefits more clearly. We would be delighted to hear any other suggestions the reviewer may have that would help us make our work stronger.
> **Weaknesses:**
>
> 1. I think the paper lacks details about the training process; such as the practical cost of evaluating the closed-form Wasserstein distance at every training iteration. Or the stability of training in terms of trading off the OT and data fidelity loss.
As explained in Section 3.2, the computational cost of evaluating the closed-form Wasserstein distance is linear in $n$ (the size of the mini-batch during training) and cubic in the embedding size. Both are usually quite small and do not exceed 128 and 256, respectively, in our experiments. We would also like to encourage the reviewer to run our code to see that the optimization is very fast and converges quickly: this doesn't even require a GPU!
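To make this cost concrete, below is a minimal NumPy sketch of the closed-form Wasserstein distance between the Gaussians fitted to two mini-batches of embeddings (an illustration of the computation and of its complexity only; the function name is ours and this is not our actual training code):

```python
import numpy as np
from scipy.linalg import sqrtm

def closed_form_w2_sq(Zs, Zt):
    """Squared 2-Wasserstein distance between the Gaussians fitted to two
    mini-batches of embeddings Zs, Zt of shape (n, d).
    Cost: O(n d^2) for the empirical moments + O(d^3) for the matrix roots."""
    ms, mt = Zs.mean(axis=0), Zt.mean(axis=0)
    Cs = np.cov(Zs, rowvar=False)
    Ct = np.cov(Zt, rowvar=False)
    Cs_half = sqrtm(Cs).real                    # d x d principal square root
    cross = sqrtm(Cs_half @ Ct @ Cs_half).real  # Bures cross term
    bures_sq = np.trace(Cs + Ct - 2.0 * cross)
    return float(np.sum((ms - mt) ** 2) + bures_sq)
```

With $n=128$ and $d=256$, the dominant $O(d^3)$ operations remain cheap on a CPU, which is consistent with the observation above that no GPU is required.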
As for the trade-off between data fidelity and linear alignability, the revised version of our manuscript provides an additional study showing the performance of the best-performing LaOT models for different values of $\lambda \in \{0, 0.01, 0.05, 0.1, 0.5, 1\}$ on all UDA tasks. From it, we can see that minimizing only the data fidelity loss leads to very poor DA results, even though the in-domain performance remains very high. This is explained by the fact that, without the alignment term, there is no constraint keeping semantically similar classes close to each other across the two domains in the embedding space. Otherwise, different values of $\lambda > 0$ have only a mild effect on the final performance of the model, showing the robustness of our method to this hyperparameter.
Please let us know if this answers your question or whether we can provide more evaluations/details on this part.
> 2. I also think the paper could include some simpler baselines/ablations: it is my understanding that the linear alignment loss term first fits two Gaussian distributions to the source and target dataset, then compute a distance on those statistics following the OT formulation. I am wondering how the model would compare to simpler moment matching of the mean/covariance of these Gaussian distributions: i.e. how much improvement does the optimal transport formulation brings to the problem.
One important thing to clarify here is that LaOT doesn't fit two Gaussians in the embedding spaces of the two domains: it learns two embeddings that are linearly alignable via a PSD matrix.
Also, we agree with the reviewer that comparing to a moment-matching method is meaningful in our case, as the closed-form Wasserstein distance involves two terms that compare the means and the covariance matrices, the latter via the so-called Bures metric. The baseline suggested by the reviewer is in fact very close to how deep CORAL works: it aligns the covariance matrices of the centered embeddings of the two domains. We compare to deep CORAL in Table 2 and in Table 5 of the Appendix.
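For reference, the closed-form expression we refer to (the 2-Wasserstein distance between Gaussians with the corresponding first and second moments; the symbols below are generic, not the paper's notation) decomposes exactly into a mean-matching term and the Bures metric between the covariance matrices:

$$
W_2^2\big(\mathcal{N}(m_s,\Sigma_s),\,\mathcal{N}(m_t,\Sigma_t)\big)
= \|m_s - m_t\|_2^2
+ \operatorname{tr}\Big(\Sigma_s + \Sigma_t - 2\big(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\big)^{1/2}\Big),
$$

so simple matching of the means and covariances is, in a sense, already subsumed by this objective, with the Bures term providing a geometrically grounded way of comparing the covariances.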
Finally, we also add two ablation studies showing the effect of the linear alignability term in our loss function (i.e., changing $\lambda$ from 0 to a positive value) and a baseline that directly uses the linear Monge mapping on the available raw data. The two studies can be found in the revised version of our manuscript in Table 4 and Figure 3. We can see that both the representation learning step and the alignment term are needed to achieve good performance with the linear Monge mapping, thus justifying our architectural choices.
Do let us know if we can provide more details on this or on the questions answered below.
> **Questions:**
> **Q1** Does the score of the linear alignability loss correlate well with final accuracy ? I am wondering if the model always truly learns linearly alignable representations, or if the proposed loss performs well even without reaching this ideal case in practice.
The plots of the learning dynamics, showing the behaviour of the accuracy and the linear alignability loss, are given in Figure 5 of the Supplementary material for all UDA tasks. The reviewer's intuition is right: the final value of the Wasserstein distance remains greater than 0 at the end of the optimization process, as decreasing it further doesn't provide any additional benefit.
> **Q2** For the unsupervised domain adaptation task, is the classifier trained separately after training the embeddings with equation 7 ? or is a cross entropy loss added to the objective (or even replacing the data fidelity loss) ?
The classifier's cross-entropy loss (line 266, 1 hidden layer with ReLU + softmax) is added to the objective function to promote a low source error and learn a good discriminative embedding for the labelled source domain. However, we afterwards train a cross-validated linear classifier (as mentioned on lines 291-292) on the learned embeddings and report the accuracy of this same classifier for all baselines to ensure a fair comparison.
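Schematically, and with purely illustrative notation that is not taken from the paper, the objective optimized at this stage combines the three terms as

$$
\mathcal{L} \;=\; \underbrace{\mathcal{L}_{\mathrm{rec}}}_{\text{data fidelity (AEs)}} \;+\; \lambda\,\underbrace{\mathcal{L}_{\mathrm{align}}}_{\text{linear alignability}} \;+\; \underbrace{\mathcal{L}_{\mathrm{CE}}}_{\text{source cross-entropy}},
$$

while the cross-validated linear classifier used for the reported accuracies is trained separately on the learned embeddings afterwards.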
**Reviewer ZhTU** (7: accept)
We thank the reviewer for their feedback, and for finding our paper clear and well-written, and our method simple and effective. We hope that our additional evaluations highlight its benefits even more clearly. We would be delighted to hear any other suggestions the reviewer may have that would help us make our work stronger.
> **Weakness**
> There is not a lot of motivation in the paper on why using OT, or learning linearly alignable features is benefitial for DA or have some theoretical advantages over existing invariance-driven learning.
The DA task, in any of its many flavours, relies on tackling the mismatch between the source and target distributions. Its solution thus requires finding a meaningful tool to bridge this distributional gap and reduce the problem to the usual supervised learning task. OT is a geometric tool that does exactly that: it provides the least-effort way of transforming one distribution into another and gives a mapping that achieves this transformation. This explains why it is particularly suited for DA.
We will clarify this in the revised version of our manuscript.
> **Questions:**
>
> 1. Authors note that their proposed approach is a generalised version of learning invariant features --- that is, if the affine transform between two domains $z \rightarrow Az+b$ is carried out by identity matrix $A$ and zero-vector $b$, the representations will be invariant across domains. I wonder what is the benefit of using this more generalised framework, and if the ease of learning caused by less strict constraint results in a performance trade-off in some cases. In addition, one can imagine that the condition can be further relaxed if one adopts a non-linear transform (e.g. MLP) instead of an affine transform: what would be the benefit/disadvantage of doing that?
>
This is a very interesting question! Below we reply to its two distinct parts:
1. We believe that learning representations that are invariant modulo a linear transformation adds an additional degree of freedom to the optimization procedure compared to invariant representation learning. Consequently, it allows the optimization to find a local minimum faster, as the set of feasible solutions becomes larger. While we do not have a rigorous mathematical proof for this claim, the experimental evidence presented in Figure 5 of the Supplementary material shows that our approach reduces the Wasserstein distance and reaches a higher accuracy faster than invariant representation learning.
2. In principle, one can replace the affine transform with a non-linear one but, intuitively, we think that the benefit of non-linearity in this case would be somewhat redundant with respect to the learned encoder. The rationale is that the encoder should do the representation learning job well enough for the alignment problem to become easy in some sense. Also, using a non-linear transform would reduce the appeal of our approach from the OT point of view, as we want to learn two embedding spaces for which this computationally cheap mapping is readily available in closed form. We also note that the reviewer's idea is similar to another idea worth exploring: the use of a linear Gromov-Monge mapping between the metric-measure spaces of the two embeddings. The mapping in this case would still be linear, yet it would align the two domains based on their projections to a possibly infinite-dimensional feature space via non-linear feature maps.
> 2. Authors used autoencoders to ensure that information in the data is preserved in the representations. For me this is quite a curious choice, as in classic representation learning autoencoders are not known to be the best for representation learning (can suffer from latent collapse and also tend to waste modelling capacity on reconstructing high-level frequency). I have a question and one suggestion regarding this:
>* Can you show some examples of the reconstructions generated by the autoencoders?
For the DA tasks that we consider, we work with high-dimensional feature representations that do not admit any direct visualization beyond t-SNE, which is provided in Figure 3 of the Supplementary material for all UDA tasks. As we believe that the reviewer is more interested in embeddings that are visually meaningful (just as we were!), we refer to Figure 2 for such an example of an embedding in $\mathbb{R}^2$.
>* Have you considered using the better performing augmentation-based self-supervised learning approaches to learn the two embeddings? (or is there any particular reason that you don't use them?)
We thought about it but were not convinced that augmentation-based methods would work in a comparable way. We acknowledge that there is plenty of work on learning time-, position- and motion-invariant representations. However, one introduces a bias by hand-picking the augmentations used to achieve invariance. To the best of our knowledge, augmentations are applied in the input space of a single domain, and we think it is very difficult to design meaningful augmentations that would facilitate linear alignability between the representations of two domains.
Our method imposes the additional constraint of linear alignability on the optimisation problem of learning two domain-specific representation spaces in parallel. The AE seems well suited for easily adding such a constraint.
Finally, we believe that more complex backbones and more elaborate loss functions could further boost the obtained results, but such gains would not stem from the core of our idea, which is the use of embeddings that are alignable with the linear Monge map in DA.
Let us know if we understood your suggestion correctly!
> 3. It would be helpful to see an ablation study with the performance of the model trained using only the reconstruction loss term (without the linear alignment term). This will help us understand how the linear monge mapping term contribute to DA performance.
We provide these results in Figure 3, where the corresponding results are those obtained with $\lambda=0$. As one can note, these results are rather poor: in this case, the semantic similarity between the same class in the two domains' embedding spaces may not be preserved, i.e., the source embedding of class 1 may end up being closer to some other class in the embedding space of the target domain, as no coupling constraint is used to avoid this scenario. In fact, we even considered presenting a use-case of so-called embedding poisoning based on this idea: learning two embeddings that are very hard to adapt due to this mix-up in their class similarities.
> 4. (optional question, out of curiosity) Are you aware of any literature that draws connections between OT and learning equivariant representations? One could consider a similar approach being proposed under the equivariance assumption: suppose there exists a transformation that maps the input from one domain to another, the linear monge mapping can be considered as this same transformation but in the representation space.
We really liked the idea of extending our proposal to equivariant representation learning using OT! Unfortunately, we are not aware of any work that makes such a link explicit, but it definitely feels like an interesting direction to explore. We thank the reviewer for this valuable feedback.
**Reviewer 2Xn4**
We thank the reviewer for their feedback, and for finding our work clear and easy to follow. We would like to clarify several important misunderstandings regarding our work in the comments below.
> **Weaknesses:**
> 1. Given the counterexample in [1], I personally do not feel targeting 'the alignment of source and target distributions' a very interesting or important direction. The capability of OT does not make it a necessary tool to study the DA problem. The authors need to stress the target of the paper and provide better motivation: what goals are you trying to achieve and what take-aways are you trying to deliver to the audience? These are important parts of a well-grounded paper that are currently missing.
The impossibility theorems from [1] and those proved earlier by Ben-David et al. AISTATS'10 state that for **any** DA method there exist problem instances where it can fail. This happens most often when the two domains are not semantically related in terms of the downstream tasks to be solved, i.e., when there is no classifier that performs well on both domain distributions. This is a pathological case though, as most benchmarks in DA are clearly semantically related and adaptable in the sense of both [1] and Ben-David et al. We therefore disagree with the reviewer: alignment of the source and target domains is still a workhorse of today's DA, and the vast majority of the most prominent methods, including adversarial ones, follow it and achieve good results in practice.
Regarding our goals and take-aways, they are summarized and motivated on page 2 in the “Our contributions” paragraph. We reiterate them in a concise way below:
1. Our **goal** is given in italics on page 2: we want to investigate whether representation learning can help to find an embedding space where the Monge mapping between two discrete measures can be calculated explicitly. The Monge mapping is used in many ML applications such as, for instance, domain adaptation and generative modelling. Usually it is parametrized by a neural network or by an input convex neural network: in both cases the approximation is costly and may not converge to the true solution. To this end, having the Monge mapping available in closed form for arbitrary measures is a very desirable feature.
2. Our main take-away is that the posed question can be answered affirmatively: we propose a simple coupled AE architecture that promotes linear alignability between the embedding spaces of the two AEs. This allows us to use a well-known result from OT theory stating that the Monge mapping admits a closed-form solution for random variables linked through such an affine PSD transformation (sketched below).
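For completeness, the result can be sketched as follows (the symbols here are generic, not the paper's notation): if the source and target embedding distributions, with means $m_s, m_t$ and covariances $\Sigma_s, \Sigma_t$, are linked by an affine map with a PSD linear part, then the Monge map is itself affine and available in closed form,

$$
T(z) \;=\; m_t + A\,(z - m_s),
\qquad
A \;=\; \Sigma_s^{-1/2}\big(\Sigma_s^{1/2}\,\Sigma_t\,\Sigma_s^{1/2}\big)^{1/2}\Sigma_s^{-1/2},
$$

so that transporting source embeddings onto the target embedding space only requires the empirical first- and second-order moments.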
> 2. The paper feels rather incremental. Using different embeddings on the source and target domain appears in [1]. The linear Monge problem as well as its closed-form has been well-studied in the OT literature. The theoretical analyses are mostly built upon the classical DA theory and the authors fail to convince me that their theoretical results are 'strong'.
Below, we reply separately to the three statements of the reviewer:
1. The linear Monge mapping was studied for DA only in Flamary et al. 2019; calling it well-studied in this context is a strong exaggeration. Similarly, there are only two other applications of the linear Monge mapping in the ML community, namely Pitie and Kokaram 2007 (reference [24]) and Mroueh, AISTATS'20 (reference [11]), both from the computer vision field. This can hardly be called "well-studied" given the current size of the ML and CV fields.
2. Our work is not comparable to [1], as the latter is a theoretical study and doesn't provide any new method for DA. Their experimental evaluation only highlights how another popular adversarial DA method, DANN, confirms the theoretical insights derived in their work. Also, their theoretical analysis was not derived for two different embeddings, but for a single transformation function $g$ shared by both domains.
3. Regarding our theoretical results, to the best of our knowledge there is no other algorithm that benefits from a statement similar to Theorem 1. We do not know what the reviewer means by "strong" in this case, but Theorem 1 provably justifies our proposal from the theoretical point of view. Its specificity stems directly from the fact that we use a closed-form solution for alignment, meaning that we can explicitly express several quantities that appear in classic DA bounds.
> 3. The closed-form of the linear Monge problem relies on the Gaussian assumption. What if the data exhibit long-tailed behavior?
The linear Monge mapping **doesn't rely** on the Gaussian assumption! As mentioned on line 35, we precisely say that the Gaussian assumption is unrealistic, which is why the linear Monge mapping is rarely used on raw data that is neither Gaussian nor linearly alignable by itself. This claim is confirmed in Table 4 of the Supplementary material, where applying the linear Monge mapping directly to raw data leads to very poor adaptation results.
>**Questions:**
> 1. Discuss the significance of their work, especially given the failure mode in [1]
The failure mode in [1] is model-agnostic: it applies to all DA methods in some particular pathological cases, whether they are based on OT or not. As noted in Ben-David et al. '10, there is nothing one can do about it without having access to target labels. We note, however, that in practice the considered benchmarks clearly do not correspond to this failure mode, as the considered source and target domains are semantically close.
The significance of our work lies in the fact that we show how representation learning can help to solve the DA task via the simplest, most intuitive solution based on OT theory. In particular, our method learns an embedding space where the OT problem between general probability measures supported on possibly high-dimensional spaces admits a simple closed-form solution. This solution can be used in several settings of interest, such as homogeneous and heterogeneous DA, with attractive computational and statistical guarantees.
> 2. Highlight how their theoretical results are 'strong' (compared to classical bounds)
Our bounds are explicit: they express common terms from classic DA bounds thanks to the fact that we know the analytic solution of the instance of our problem. Additionally, they are strong from the sample complexity point of view: the estimation of the Wasserstein distance in high-dimensional spaces suffers from the curse of dimensionality, while the sample complexity of estimating the linear Monge mapping is dimension-free.
> 3. Discuss how violation of the Gaussian assumption will affect the algorithm
As explained above, we do not rely on the Gaussian assumption. The linear Monge mapping is optimal both for Gaussian distributions and for random variables linked through an affine PSD transformation. We exploit the latter case, not the former.