# ICML 2023 Rebuttal

## Main concerns

- Space complexity (6hi2)
    - $A$ is sparse with $O(n^2)$ non-zero entries.
    - We only use implicit forms of $A$ when performing matrix multiplication.
    - Thanks to the formulation, $Ax$ and $A^\top \lambda$ are both performed in $O(n^2)$.
    - Point to specific lines in the algorithms.
- Whether APDAGD and Dual Extrapolation are simply extensions of their OT versions or actually novel (dUuu, 32ad)
    - Acknowledge APDAGD and DE are NOT new algorithms proposed in this paper.
    - But there is no rounding procedure in the literature.
    - The novelty is:
        - POT constraints are different from OT constraints
        - Adaptation to POT, but not a direct naive adoption of APDAGD
            - Bounding $R$ for APDAGD
            - Implementation of DE (cf. Jambulapati et al.)
- Why is rounding necessary? (dUuu)
- Complexity w.r.t. $s$ (32ad)
- Experiments
    - Space complexity (6hi2)
    - Color transfer is "not enough" (dUuu)
- Ethics flag
    - Put in official comment
    - This work is purely technical and all ethical concerns are application-specific

## Reviewer 6hi2

### Weaknesses

- My main concern of the method comes from its space complexity, which is $O(n^3)$, which is much higher than Sinkhorn's $O(n^2)$. The authors should not focus only on the time complexity and overlook the space complexity, which actually makes an unfair comparison in the paper. In my view, the space complexity of the proposed methods is prohibitively high and makes it unusable in real applications. I guess this is also the reason that in the experimental section only small scale POT is computed. The question is, with such small scale problems, they can be solved by linear programming, why do we need so complex conductions?
- Sinkhorn returns an infeasible solution, so is it really a comparison?

> Thank you for your review. We would like to clarify a subtle point: our space complexity is $O(n^2)$ per iteration, which is the same as that of Sinkhorn (refer to $\textbf{Q1}$). Because the space complexity is the same for both APDAGD and Sinkhorn, we would kindly contend that the paper does not make "an unfair comparison." We will also address the scalability of APDAGD with a $10000 \times 10000$ example as requested in $\textbf{Q2}$ below.

> $\textbf{Q1}$: [Space complexity] *"My main concern of the method comes from its space complexity, which is $O(n^3)$, which is much higher than Sinkhorn's $O(n^2)$."*

> $\textbf{Reply}$: We want to clarify that our space complexity matches Sinkhorn's at $O(n^2)$. Although the matrix $A \in \mathbb{R}^{(2n+1)\times(n^2+2n)}$, the implementations of both our algorithms exploit the fact that $A$ is an augmented bipartite matrix with $O(n^2)$ non-zero entries. To be more specific, operations of the form $Ax$ (line 1 of APDAGD Algorithm 6; lines 10, 13 of DE Algorithm 4) or $A^\top \lambda$ (lines 9, 12 of DE Algorithm 4; lines 1, 4 of AM Algorithm 8) are all implemented implicitly in $O(n^2)$, utilizing the given structure of the matrix $A$ (refer to the following code).

```python
import numpy as np

def A_x(x):
    """Implicitly compute A x = [X 1 + p; X^T 1 + q; 1^T X 1].

    args:
        x: flattened variables (vec(X), p, q), of shape (n**2 + 2*n,)
    Complexity: O(n^2)
    """
    n = int(round(np.sqrt(x.size + 1))) - 1  # x.size = n^2 + 2n = (n+1)^2 - 1
    X = x[:n**2].reshape(n, n)
    p = x[n**2:n**2 + n]
    q = x[n**2 + n:]
    return np.hstack((X.sum(axis=1) + p, X.sum(axis=0) + q, np.sum(X)))

def A_T_lambda(y, z, t):
    """Implicitly compute A^T lambda for lambda = (y, z, t).

    args:
        y: dual variables for the row-marginal constraints X 1 + p = r, shape (n,)
        z: dual variables for the column-marginal constraints X^T 1 + q = c, shape (n,)
        t: dual variable for the total-mass constraint 1^T X 1 = s, scalar
    Complexity: O(n^2)
    """
    return np.hstack((np.add.outer(y, z).flatten() + t, y, z))
```

> Specifically, this code makes use of the explicit form $(\mathbf{A} \mathbf{x})^\top = ((\mathbf{X} \mathbb{1} + \mathbf{p})^\top, (\mathbf{X}^\top \mathbb{1} + \mathbf{q})^\top, \mathbb{1}^\top \mathbf{X} \mathbb{1})$ (line 117 in our paper). In all these calculations, only $O(n^2)$ space is used, for storing $\mathbf{X} \in \mathbb{R}^{n \times n}$ and $\mathbf{p}, \mathbf{q} \in \mathbb{R}^{n}$.
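> As a quick sanity check (illustrative, not in the paper), the two implicit operators can be validated against each other via the adjoint identity $\lambda^\top (A x) = (A^\top \lambda)^\top x$, without ever materializing $A$:

```python
# Adjointness check: lambda^T (A x) == (A^T lambda)^T x for random inputs.
# Relies on A_x and A_T_lambda defined in the block above.
import numpy as np

n = 5
rng = np.random.default_rng(0)
x = rng.random(n**2 + 2 * n)             # stacked (vec(X), p, q)
y, z, t = rng.random(n), rng.random(n), rng.random()
lam = np.concatenate([y, z, [t]])        # stacked dual vector (y, z, t)
assert np.isclose(lam @ A_x(x), A_T_lambda(y, z, t) @ x)
```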
> $\textbf{Q2}$: [Scalability of APDAGD] *"With such small scale problems, they can be solved by linear programming, why do we need so complex conductions?"; "It is necessary to prove its effectiveness on problems with scales to be $10,000 \times 10,000$ or even larger"; "The main limitation is that if the proposed problem can be generalized to solve relatively large scale problem."*

> $\textbf{Reply}$: Given the complexity claims in the paper, it is important to show how the running time scales with $n$, the number of supports in each marginal. In the figure below, we show the wall-clock time of running APDAGD to solve the POT problem between two Gaussian mixtures (described at the end of the appendix). We fix the error tolerance to $\varepsilon = 10^{-1}$ and vary $n$ over the set $\{ 10, 30, 100, 300, 1000, 3000, 10000 \}$. The figure plots wall-clock time in seconds against $n$. The expected slope of the best-fit line between $\log(\text{time})$ and $\log(n)$ should be at most $2.5$, since the complexity is $\widetilde{O}(n^{2.5} / \varepsilon)$. We find that the slope is $2.19$, which is consistent with this upper bound. Furthermore, the correlation is statistically significant.

![](https://i.imgur.com/7laVapL.png)
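> For reference, the slope and its significance come from ordinary least squares on the log-log data; a minimal sketch (the timing values below are placeholders for the measured wall-clock seconds, not our actual data):

```python
import numpy as np
from scipy import stats

ns = np.array([10, 30, 100, 300, 1000, 3000, 10000])
# Placeholder values only -- substitute the measured wall-clock seconds.
times = 1e-4 * ns.astype(float) ** 2.19

# Slope of the best-fit line between log(time) and log(n), with p-value.
fit = stats.linregress(np.log(ns), np.log(times))
print(f"slope = {fit.slope:.2f}, p-value = {fit.pvalue:.2e}")
```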
> With the clarification of space complexity and the evidence of APDAGD's scalability, we hope that the reviewer can positively re-evaluate our work.

### Questions

- The OT or POT can be formulated as linear programming (LP) problems, when the scale is large, we cannot use typical LP solvers. This is the reason that entropic OT or other solutions are used. For the authors' methods, it is necessary to prove its effectiveness on problems with scales to be $10,000 \times 10,000$ or even larger.
    - Real datasets' settings are default.
    - Plus new experimental results (synthetic...)

### Limitations

The main limitation is that if the proposed problem can be generalized to solve relatively large scale problem.

## Reviewer dUuu

### Weaknesses

- The paper is targeted to a rather specific audience of researchers (those who work on optimization for discrete OT). This is not a problem itself (of course!), but this fact indirectly but negatively affect the papers’ exposition. Namely, understanding this paper requires being familiar with a ton of related work (many of which is very recent). The authors devoted very limited part of the paper explaining how the existing algorithms work. There is no separate background or related work section for this – a lot of non-authors results are introduced directly in the sections where I expect to see the authors’ contributions.
    - Discrete OT is common in ICML and the relevant literature (cite some papers).
    - Why discrete OT and its variants are important: to approximate continuous OT.
    - Existing algorithms aren't directly applicable to the POT framework.
    - If these algorithms were introduced properly... space constraints -> refer to the original papers.
    - APDAGD: differences in formulation and bounding $R$ (the technique is novel, with reference to the appendix, (67)--(69)).
    - DE: formulation and normalization step.
    - Rounding algorithm -> NOVEL.

> Thank you for reviewing our work and making detailed comments. We acknowledge your comment that our topic, discrete POT, targets "a rather specific audience of researchers", yet there is a growing line of work in the ICML community focusing on discrete OT and its variants (refer to $\mathbf{Q1}$). We summarize your main concerns in the following four questions.

> $\textbf{Q1}$ [Why discrete POT/OT is important]: *"The paper is targeted to a rather specific audience of researchers (those who work on optimization for discrete OT). This is not a problem itself (of course!), but this fact indirectly but negatively affect the papers’ exposition"*

> $\textbf{Reply}$: There have been a number of works on discrete OT and its variants in ICML and the relevant literature, such as (Lin et al., 2019; Altschuler et al., 2017; Blondel et al., 2018; Pham et al., 2020; Nguyen et al., 2022). With the introduction of entropic regularization (Cuturi, 2013), discrete OT and its variants can be solved very efficiently, and they offer practically accurate approximations to continuous OT (Ferradans et al., 2014). Discrete POT is a newer formulation with far fewer works, and its current SOTA solver (Sinkhorn) is provably incorrect (as shown in our Theorem 3.1). We aim to introduce the first valid rounding procedure for POT, as well as two correct and efficient gradient-based algorithms to solve the discrete POT problem.
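> For context, the entropic approach mentioned above is the Sinkhorn matrix-scaling iteration (Cuturi, 2013); a minimal sketch for standard full-mass OT, not one of our POT algorithms:

```python
import numpy as np

def sinkhorn_ot(r, c, C, gamma=0.05, iters=1000):
    """Entropic (full-mass) OT via Sinkhorn scaling (Cuturi, 2013):
    alternately rescale K = exp(-C / gamma) to match the marginals r, c.
    For POT, the analogous scaling is primal infeasible (our Theorem 3.1)."""
    K = np.exp(-C / gamma)
    u, v = np.ones_like(r), np.ones_like(c)
    for _ in range(iters):
        u = r / (K @ v)        # fix row marginals
        v = c / (K.T @ u)      # fix column marginals
    return u[:, None] * K * v[None, :]  # transport plan with marginals ~(r, c)
```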
- Continuing the previous point: the authors contribution is not well-introduced. As far as I understand, the rounding procedure is indeed novel. Yet, it is not clear to which extent the proposed algorithms (APDAPGD and Dual Extrapolation) are novel or they are just slight modifications of existing approaches. The same applies to the theoretical results presented in the paper. For example, the authors write: “We have the following guarantees for APDAGD, as proven in (Dvurechensky et al., 2018, Theorem 3).” Is this theorem a new result, a modification of result by Dvurechensky et al. or just a copy-paste of their theorem? The authors definitely should better distinguish what is new and what is not.
    - Acknowledge Theorem 5.2 (and possibly others...) is not novel.
    - But bounding $R$ is novel. In fact, in the original paper (Dvurechensky et al., 2018), $R$ is not bounded and was kept in that form. We use a new technique to achieve the bound for $R$ (see (67)--(69)).

> $\textbf{Q2}$ [Novelty within our two algorithms]: *"The authors' contribution is not well-introduced."; "It is not clear to which extent the proposed algorithms (APDAGD and Dual Extrapolation) are novel or they are just slight modifications of existing approaches."; "There is no separate background or related work section for this – a lot of non-authors' results are introduced directly in the sections where I expect to see the authors’ contributions."*

> $\textbf{Reply}$: Besides our novel rounding algorithm, we would like to clarify the novelty within the two algorithms proposed in this paper. We take two existing algorithms and provide non-trivial adaptations from OT to POT built on our novel rounding procedure. Theorem 5.2 is exactly Theorem 3 in (Dvurechensky et al., 2018), which states APDAGD's guarantees. However, our bound on $R$, the upper bound for the norm of the dual variables, is the first such bound for POT in the literature. In its proof, we also use a new technique that utilizes the primal optimal values to evaluate the dual variables (refer to equations (67) and (68)), which does not appear in typical $R$-bounding analyses such as (Lin et al., 2019). Additionally, we include Corollary 1 of (Jambulapati et al., 2019) verbatim as Lemma F.1, since it gives the Dual Extrapolation algorithm's guarantees. However, we rewrite and adapt the Dual Extrapolation algorithm from OT to POT, with new bounds on the number of iterations for Alternating Minimization (Theorem 6.6) and Dual Extrapolation (Theorem 6.7). We are also able to implement the Dual Extrapolation algorithm practically, whereas the original paper's authors had to use mirror prox "for numerical stability considerations" (Jambulapati et al., 2019). We will address these points in a revised version of our paper.

- One of the main claims of the paper is a new method recovering the feasible solution. Yet it remains unclear to me why recovering a feasible solution is indeed beneficial. In which applied problems it is beneficial to have a strictly feasible solution? I do not see an answer to this question in the paper. I have looked through the applications of POT which the authors list (2nd column, lines 14-30) but It seems to me that in most of them it is ok to have a good but not necessarily feasible approximation of the solution.
    - Theoretical guarantee is the focus of this paper.
    - **Is there any practical application that requires arbitrary precision?**
    - Try $s = \min\{r, c\} - 10^{-5}$
    - Stick to Definition 1.
    - Robust domain adaptation: control the exact amount of noise -> **violation**
    - Few-shot segmentation

> $\textbf{Q3}$ [Benefits of a feasible solution]: *"Yet it remains unclear to me why recovering a feasible solution is indeed beneficial. In which applied problems it is beneficial to have a strictly feasible solution?"*

> $\textbf{Reply}$: We first hope to clarify that the focus of our paper is theoretical, so it is our intention to introduce algorithms that find correct, feasible solutions. We believe the whole point of formulating POT stems from the need to transport a prescribed partial mass $s$ (Chapel et al., 2020; Benamou et al., 2015), which leads to the extra equality constraint involving $s$. Hence, from an optimization perspective, respecting this constraint is the point of POT, and its violation implies that Sinkhorn **does not solve** the POT problem. We kindly note that in applications where strictly enforcing POT's constraints is unnecessary, other variants, such as unbalanced optimal transport, already exist. Within the context of POT, our novel rounding algorithm allows our two algorithms to actually solve the POT problem and respect the constraints. Aside from this theoretical benefit, recovering a feasible solution makes experimental results more accurate. To demonstrate the practical benefits of theoretical correctness, we highlight two applications that require strict adherence to feasibility:

> - **Point cloud registration** (Qin et al., 2022): In this paper, the authors note that "the hard marginal constraint provides an explicit parameter to adjust the ratio of points that should be accurately matched, and helps avoid incorrect many-to-many correspondences." Respecting the equality in practice can thus increase the accuracy of point cloud registration.

> - **Mini-batch optimal transport** (Nguyen et al., 2022): Another application requiring strict enforcement of the total-mass constraint is solving mini-batch OT using POT. In this paper, the authors note that reducing $s$ can mitigate misspecification but can also exclude potentially good matches. Consequently, it is imperative to transport exactly the appropriate fraction of mass to achieve an optimal mapping, which is vital for the performance of the downstream ML models. The current Sinkhorn solver's inaccuracy in the transported mass fraction prevents obtaining such optimal mappings.

> We will include these works in the Related Work section of a revised version of this paper. Finally, as we show in our answer to Question 4 below, recovering a primal feasible solution to POT leads to better results in domain adaptation, another application of OT and POT.
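> Feasibility is also directly checkable in practice. A small utility (ours, for illustration) measures how far a candidate $(\mathbf{X}, \mathbf{p}, \mathbf{q})$ is from satisfying the POT constraints; a feasible plan drives this to zero, while Theorem 3.1 implies Sinkhorn's output for POT generally does not:

```python
import numpy as np

def pot_violation(X, p, q, r, c, s):
    """Maximum violation of the POT constraints
    X 1 + p = r,   X^T 1 + q = c,   1^T X 1 = s   (with X, p, q >= 0)."""
    return max(
        np.abs(X.sum(axis=1) + p - r).max(),   # row marginals
        np.abs(X.sum(axis=0) + q - c).max(),   # column marginals
        abs(X.sum() - s),                      # total transported mass
    )
```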
- The claim "Extensive experiments" is a little bit an over-claim as only one (still toy) application (color transfer) is covered. Overall, in my opinion, evaluation in the color transfer, despite being popular in OT/ML community, is absolutely non-representative. It is known that this problem can be solved by a dozen of (much easier) methods than OT; there are no widely acceptable quantitative metric of performance. Importantly, in this problem the visual results significantly depend on the pair of images in view. Therefore, I do not consider this experiment to be any representative. Indeed, I would expect the authors to provide an experiments which demonstrate in some more practical and convincing application that (a) the feasibility of the solution is beneficial and (b) improve in complexity w.r.t. epsilon leads to improvement (practically) in convergence time.
    - (a) last column; domain adaptation if possible; maybe also refer to mini-batch OT.
    - (b) The motivation behind APDAGD is its better dependence on $\epsilon$ than Sinkhorn. However, directly comparing APDAGD with Sinkhorn for POT is not straightforward, since using Sinkhorn is incorrect for POT... Wrong solution --> Sinkhorn unable to reach the desired accuracy ($\varepsilon = 10^{-2}$) even when choosing sufficiently small $\gamma$.

> $\textbf{Q4}$ [Experiments]: *"Provide experiments which demonstrate in some more practical and convincing application that (a) the feasibility of the solution is beneficial and (b) improvement in complexity w.r.t. $\varepsilon$ leads to improvement in practical convergence time."*

> $\textbf{Reply (4a)}$: Here we provide a new experimental result on domain adaptation. In Courty et al. (2017), the authors present an OT-based method to transform a source domain to a reference target domain such that their densities are approximately equal. We consider a binary classification setting on the "moons" dataset. The source dataset is displayed as colored scatter points. The target dataset comes from the same distribution, but its points are rotated by 60 degrees (black scatter points below), representing a domain covariate shift.

> In our setting, the source and target datasets contain different numbers of data points ($N_s = 300$ and $N_t = 400$, respectively). For OT, we use Sinkhorn to approximate the OT matrix of size $300 \times 400$, then transform the source examples following Courty et al. (2017). For POT, we use K-means clustering to summarize the target domain as a histogram with $300$ bins. The two unbalanced marginals are then normalized so that $\left\| r \right\|_1 = 0.75$ and $\left\| c \right\|_1 = 1.0$, and we set $s = 0.999 \times \min\{ \left\| r \right\|_1, \left\| c \right\|_1 \}$ before using APDAGD to find the POT matrix between these two unbalanced marginals. The source examples are transformed in the same way. Finally, for both methods (OT and POT), we train a support vector machine with the radial basis function kernel (parameter $\sigma^2 = 1$) on the transformed source domain. In the figure below, we report the accuracy on the target domain using 20,000 test examples. POT offers flexibility in choosing the transported mass and avoids the need to normalize both marginals to sum to $1$. POT also yields higher accuracy than OT.

![](https://i.imgur.com/cUwTgQx.png)
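> The "transform the source examples" step above is the standard barycentric mapping of Courty et al. (2017); a minimal sketch with a generic transport plan (our illustration, not the exact experiment code):

```python
import numpy as np

def barycentric_map(X, targets):
    """Map source point i to the barycenter of its matches under plan X:
    source_i -> (sum_j X_ij * target_j) / (sum_j X_ij)  (Courty et al., 2017).
    X: (N_s, N_t) transport plan; targets: (N_t, d) target support points."""
    row_mass = X.sum(axis=1, keepdims=True)
    return (X @ targets) / np.maximum(row_mass, 1e-16)  # guard near-empty rows
```

> Under POT, rows of the plan carrying (near-)zero mass are essentially left unmatched, which is the mechanism that lets POT ignore outlier points.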
> In response to Q4 (a), we additionally compare the two methods of approximating the POT matrix considered in this paper: Sinkhorn (which outputs a primal infeasible solution) and APDAGD. All hyperparameters are equal. The results indicate that, in addition to producing a primal feasible solution, APDAGD achieves higher accuracy than Sinkhorn.

![](https://i.imgur.com/n1tiIMU.png)

> In addition, revisiting the color transfer example in our paper, we can show that "the feasibility of the solution is beneficial". In the last column of the figure below ((d), (h), and (l)), LP denotes the exact solution (available for small $n$) retrieved by the Gurobi solver. The solution produced by APDAGD closely matches this exact solution, in contrast to Sinkhorn's.

![](https://i.imgur.com/Tk8dbGd.jpg)
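> For small $n$, this exact LP baseline is straightforward to reproduce with any LP solver; a sketch using SciPy's `linprog` (our experiments used Gurobi; SciPy is shown here for accessibility):

```python
import numpy as np
from scipy.optimize import linprog

def pot_lp(C, r, c, s):
    """Exact POT:  min <X, C>  s.t.  X 1 + p = r,  X^T 1 + q = c,
    1^T X 1 = s,  X, p, q >= 0.  Variables are stacked as (vec(X), p, q)."""
    n = len(r)
    A = np.zeros((2 * n + 1, n**2 + 2 * n))
    for i in range(n):
        A[i, i * n:(i + 1) * n] = 1.0      # row i of X ...
        A[i, n**2 + i] = 1.0               # ... plus p_i
        A[n + i, i:n**2:n] = 1.0           # column i of X ...
        A[n + i, n**2 + n + i] = 1.0       # ... plus q_i
    A[2 * n, :n**2] = 1.0                  # total transported mass
    obj = np.concatenate([C.ravel(), np.zeros(2 * n)])
    b = np.concatenate([r, c, [s]])
    res = linprog(obj, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.x[:n**2].reshape(n, n)      # exact optimal plan X*
```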
> $\textbf{Reply (4b)}$: The motivation behind APDAGD is its better dependence on $\varepsilon$ compared to Sinkhorn (Dvurechensky et al., 2018). However, directly comparing APDAGD with Sinkhorn for POT w.r.t. practical convergence time is not straightforward, since Sinkhorn returns an incorrect, infeasible solution for POT. As shown in sub-figure (c) above for color transfer, Sinkhorn fails to converge to the true solution (the desired accuracy is $\varepsilon = 10^{-2}$, red horizontal line) even when choosing a sufficiently small value of $\gamma$.

<!-- > Hence, the notion of "practical improvement in convergence time" is tricky to compare between algorithms when Sinkhorn converges quickly to a wrong solution and stays there (figure (c)). -->

### Questions

Please consider the questions in the section above. Additionally, could the authors please elaborate more on application of partial OT to machine learning problems and beyond? May be provide more examples in the rebuttal.
- There are several other instances of ...

### Limitations

The authors did not discuss any limitations and potential negative societal impact of their work. At the same time I personally do not see any specific negative societal impact of this work.

> $\textbf{Societal Impact}$: Thank you for the reminder about societal impacts, which we will discuss in future versions. Our paper adopts a theoretical viewpoint, and the algorithms we discuss have several potential real-world applications such as color transfer and domain adaptation. To this end, we believe our work does not present any specific ethical or societal concerns.

> $\textbf{References}$:
> [1] Lin, T., Ho, N., & Jordan, M. (2019). On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms. *Proceedings of the 36th International Conference on Machine Learning*, PMLR 97:3982-3991. https://proceedings.mlr.press/v97/lin19a.html.
> [2] Altschuler, J., Niles-Weed, J., & Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. *Advances in Neural Information Processing Systems*, 30.
> [3] Blondel, M., Seguy, V., & Rolet, A. (2018). Smooth and Sparse Optimal Transport. *Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics*, PMLR 84:880-889. https://proceedings.mlr.press/v84/blondel18a.html.
> [4] Pham, K., Le, K., Ho, N., Pham, T., & Bui, H. (2020). On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm. *Proceedings of the 37th International Conference on Machine Learning*, PMLR 119:7673-7682. https://proceedings.mlr.press/v119/pham20a.html.
> [5] Nguyen, K., Nguyen, D., Nguyen, Q. D., Pham, T., Bui, H., Phung, D., Le, T., & Ho, N. (2022). On Transportation of Mini-batches: A Hierarchical Approach. *Proceedings of the 39th International Conference on Machine Learning*, PMLR 162:16622-16655. https://proceedings.mlr.press/v162/nguyen22d.html.
> [6] Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. *Advances in Neural Information Processing Systems*, 26.
> [7] Jambulapati, A., Sidford, A., & Tian, K. (2019). A Direct Õ(1/ε) Iteration Parallel Algorithm for Optimal Transport. arXiv:1906.00618.
> [8] Dvurechensky, P., Gasnikov, A., & Kroshnin, A. (2018). Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn’s Algorithm. *Proceedings of the 35th International Conference on Machine Learning*, PMLR 80:1367-1376. https://proceedings.mlr.press/v80/dvurechensky18a.html.
> [9] Chapel, L., Alaya, M. Z., & Gasso, G. (2020). Partial optimal transport with applications on positive-unlabeled learning. *Advances in Neural Information Processing Systems*, 33, 2903-2913.
> [10] Benamou, J. D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. *SIAM Journal on Scientific Computing*, 37(2), A1111-A1138.
> [11] Nguyen, K., Nguyen, D., Pham, T., & Ho, N. (2022). Improving mini-batch optimal transport via partial transportation. *International Conference on Machine Learning*, PMLR, 16656-16690.
> [12] Qin, H., Zhang, Y., Liu, Z., & Chen, B. (2022). Rigid Registration of Point Clouds Based on Partial Optimal Transport. *Computer Graphics Forum*, 41(6), 365-378.
> [13] Ferradans, S., Papadakis, N., Peyré, G., & Aujol, J. F. (2014). Regularized discrete optimal transport. *SIAM Journal on Imaging Sciences*, 7(3), 1853-1882.
> [14] Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2017). Optimal Transport for Domain Adaptation. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 39(9), 1853-1865.

## Reviewer 32ad

### Weaknesses

- I wonder if there shall be a multiplicative $s$ dependence in the running time for both methods as well.
Can author elaborate a bit on that, or give a counterexample if other way around?
    - Thank you :-)
    - In the POT literature, $s$ is treated as $O(1)$.
    - In general, if it is not, we rescale by $s$:
        - $\epsilon' = \epsilon / s$
        - So the complexity scales with $s$.
- The paper is a little bit incremental over prior work, as both algorithms are already developed for OT settings so it is unclear how much new ideas are in the algorithmic design. The most novel part I found are the proposal of rounding algorithm, which is definitely of its own interest, and also the real-world application in color transfer.
    - Acknowledge APDAGD and DE are NOT new algorithms proposed in this paper.
    - But there is no rounding procedure in the literature.
    - The novelty is:
        - POT constraints are different from OT constraints
        - Adaptation to POT
            - This is not a direct naive adoption of APDAGD
            - Bounding $R$ for APDAGD
            - Implementation of DE (cf. Jambulapati et al.)

> Thank you very much for reviewing our work and your insightful comments. We would like to confirm that your intuition is correct: the complexity grows linearly with $s$. We would also like to clarify that (i) the non-trivial algorithmic adaptation based on our rounding algorithm, (ii) the bound for $R$ and its proof in the APDAGD case, and (iii) the bounds on the number of iterations and the practical implementation in the DE case are all novel. Below, we summarize your comments in two main questions.

> $\textbf{Q1}$ [$s$ dependence]: *"I wonder if there shall be a multiplicative $s$ dependence in the running time for both methods as well. Can the authors elaborate a bit on that, or give a counterexample if other way around?"*

> $\textbf{Reply}$: The assumption $s = \mathcal{O}(1)$ made in this paper is common in the OT literature. When $s$ is not $\mathcal{O}(1)$ for POT, we can rescale the problem by $s$ as follows. First note the original formulation (P1):

>\begin{align} \text{(P1)} \: \begin{split} &\min_{\mathbf{X} \geq 0, \mathbf{p} \geq 0, \mathbf{q} \geq 0} \langle \mathbf{X},\mathbf{C} \rangle \\ &\text{s.t. } \: \mathbf{X} \mathbb{1}_n + \mathbf{p} = \mathbf{r}, \mathbf{X}^\top \mathbb{1}_n + \mathbf{q} = \mathbf{c}, \mathbb{1}_n^\top \mathbf{X} \mathbb{1}_n = s. \end{split} \end{align}

> By letting $\mathbf{X'} = \mathbf{X}/s$ and dividing all the constraints by $s$, we obtain the reformulation (P2):

> \begin{align} \text{(P2)} \: \begin{split} &\min_{\mathbf{X'} \geq 0, \mathbf{p'} \geq 0, \mathbf{q'} \geq 0} \langle \mathbf{X'},\mathbf{C} \rangle \\ &\text{s.t. } \: \mathbf{X'} \mathbb{1}_n + \mathbf{p'} = \mathbf{r}/s, \mathbf{X'}^\top \mathbb{1}_n + \mathbf{q'} = \mathbf{c}/s, \mathbb{1}_n^\top \mathbf{X'} \mathbb{1}_n = 1. \end{split} \end{align}

> With our proposed algorithms, we set $\varepsilon' = \varepsilon / s$ and obtain an $\varepsilon'$-approximate solution $\mathbf{X'^\ast}$ to problem (P2) in $\mathcal{O}(n^2/\varepsilon')$ by Dual Extrapolation or $\mathcal{O}(n^{2.5}/\varepsilon')$ by APDAGD. Hence, we obtain an $\varepsilon$-approximate solution to problem (P1) as $\mathbf{X^\ast} = s \mathbf{X'^\ast}$ in $\mathcal{O}(s n^2/\varepsilon)$ and $\mathcal{O}(s n^{2.5}/\varepsilon)$ for DE and APDAGD, respectively. As the reviewer suggests, the final complexity grows linearly with $s$. In most applications, $s$ is small or a constant that we can choose, so this does not present a serious problem for the two algorithms.
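> In code, this reduction is a thin wrapper around either solver; a sketch, where `solve_pot_unit` is a hypothetical handle to APDAGD or DE on the normalized problem (P2):

```python
def solve_pot_general_s(C, r, c, s, eps, solve_pot_unit):
    """Reduce (P1) with total mass s to the unit-mass problem (P2).
    solve_pot_unit(C, r, c, eps) is a hypothetical solver handle
    (APDAGD or DE) returning an eps-approximate plan for mass 1."""
    X_unit = solve_pot_unit(C, r / s, c / s, eps / s)  # eps' = eps / s
    return s * X_unit  # eps-approximate solution of (P1)
```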
> $\textbf{Q2}$ [Novelty within our algorithms]: *"It is unclear how much new ideas are in the algorithmic design."*

> $\textbf{Reply}$: The algorithmic designs of APDAGD and Dual Extrapolation were proposed for OT and are not first introduced in our work. However, naive adaptations of these algorithms from OT to POT would, like Sinkhorn for POT (Theorem 3.1), lead to infeasible solutions. By proposing a novel rounding algorithm, we are able to non-trivially adapt these algorithms to POT, where there are additional constraints and structural changes in the matrix $A$ and the marginal vector $b$. For APDAGD, the bound on $R$, the upper bound for the norm of the dual variables, is novel for POT. In the proof of Lemma 5.3 (which bounds $R$), we also use a new technique utilizing the primal optimal values to evaluate the dual variables (refer to equations (67) and (68)); this has not appeared in other $R$-bounding analyses for OT such as that in (Lin et al., 2019). Regarding Dual Extrapolation for POT, we prove new bounds on the number of iterations for Alternating Minimization (Theorem 6.6) and Dual Extrapolation (Theorem 6.7). We can also practically implement the Dual Extrapolation algorithm, while the original paper's authors had to use mirror prox "for numerical stability considerations" (Jambulapati et al., 2019).

> $\textbf{References}$:
> [1] Lin, T., Ho, N., & Jordan, M. (2019). On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms. *Proceedings of the 36th International Conference on Machine Learning*, PMLR 97:3982-3991. https://proceedings.mlr.press/v97/lin19a.html.
> [2] Jambulapati, A., Sidford, A., & Tian, K. (2019). A Direct Õ(1/ε) Iteration Parallel Algorithm for Optimal Transport. arXiv:1906.00618.

## Reviewer zZwd

### Weaknesses

The experiment is not enough to support the effectiveness of the algorithms.
- $10000 \times 10000$ experiment.
- Remind: the appendix has more experiments.
- $\color{red}{\text{Domain Adaptation}}$ ![](https://i.imgur.com/cUwTgQx.png)

### Questions

- In figure 3, why should the window preserve the white frames? How to evaluate the result of color transfer?
    - Qualitative evaluation?
    - POT allows for tuning -> produces a wide range of result images.
    - Compare with the exact solver's result ($s = 0.95$).
    - APDAGD's result is closer to the exact solution.
- Compared with the existing methods, what are the advantages of the proposed method?
    - Respect for POT primal constraints, compared to Sinkhorn + existing rounding.
    - Both of our algorithms achieve the best complexities among all admissible solvers (i.e., those that output admissible couplings of the two marginals...).
- Ethics flag: yes? Clarifications.

> Thank you for reviewing our work and your comments. We would like to clarify the evaluation of our color transfer experiment and the advantages of our methods as follows.

> $\textbf{Q1}$ [Color transfer evaluations]: *"How to evaluate the result of color transfer?"; "In figure 3, why should the window preserve the white frames?"*

> $\textbf{Reply}$: We acknowledge that evaluating the result of color transfer is subjective. However, in this case, the "white window frame" refers to the fact that the exact solution of the problem retrieved by the LP solver shows the white window frame. Refer to the image below, where we set $\alpha = 0.9$. In this figure, our algorithm APDAGD shows better results than Sinkhorn by being qualitatively closer to the true LP solution. This comes as no surprise: as our Theorem 3.1 suggests, the solution returned by Sinkhorn is not primal feasible.

![](https://i.imgur.com/RwHbWTO.png)
> $\textbf{Q2}$ [Advantages of our algorithms]: *"Compared with existing methods, what are the advantages of the proposed method?"*

> $\textbf{Reply}$: We would like to emphasize that the current SOTA solver, the Sinkhorn algorithm (Le et al., 2022), returns a primal infeasible POT solution (refer to our Theorem 3.1). Hence, the first advantage of our proposed methods is that, thanks to our novel rounding algorithm, both of our algorithms return feasible solutions that respect the POT equality constraints. Moreover, the APDAGD algorithm achieves the best complexity of $\widetilde{\mathcal{O}}(n^{2.5} / \varepsilon)$ for entropic regularized POT among all admissible solvers. As illustrated in our paper and in our answers to other reviewers, APDAGD also produces better experimental results than Sinkhorn due to its correctness. Lastly, we show that the Dual Extrapolation algorithm achieves the best known complexity for POT solvers in the literature.

> $\textbf{Q3}$ [Experimental results]: *"The experiment is not enough to support the effectiveness of the algorithms."*

> $\textbf{Reply}$: In our responses to other reviewers, we have provided several additional experiments. For example, in our reply to Reviewer dUuu's Question 4, we show another application of POT in machine learning: domain adaptation. We demonstrate that POT produces favorable results compared to OT, and that the primal feasible solution given by APDAGD is superior to that given by Sinkhorn. We also illustrate, in our response to Reviewer 6hi2, the relationship between $n$, the number of supports in the marginals, and the wall-clock time of running APDAGD for approximating a POT solution. Our results empirically validate the scalability of our algorithm for POT even when $n$ is large (up to 10,000).

> $\textbf{Ethics flag}$: We believe that our work adopts a theoretical perspective and does not present any negative societal impact. We would appreciate it if the reviewer could clarify what potential ethical concerns could be raised by our work.

> $\textbf{References}$:
> [1] Le, K., Nguyen, H., Nguyen, K., Pham, T., & Ho, N. (2022). On Multimarginal Partial Optimal Transport: Equivalent Forms and Computational Complexity. *Proceedings of The 25th International Conference on Artificial Intelligence and Statistics*, PMLR 151:4397-4413. https://proceedings.mlr.press/v151/le22a.html.

Official comments:

Dear reviewer 6hi2,

As the discussion period has started, we would like to kindly inquire whether we have properly addressed your concerns. In short, we have clarified the concern about APDAGD's space complexity and have run a large-scale $10000 \times 10000$ experiment. We are more than happy to address any further questions. Thank you very much; we look forward to your response.

Dear reviewer dUuu,

As the discussion period has started, we would like to kindly inquire whether we have properly addressed your concerns. In brief, we have provided both theoretical and practical examples that show the benefits of a feasible solution, and we have clarified our novel contributions in the two algorithms. We are more than happy to address any further concerns. Thank you very much; we look forward to your response.

Dear reviewer 32ad,

As the discussion period has started, we would like to kindly inquire whether we have properly addressed your concerns.
In summary, we have shown the linear dependence of the complexity on $s$ and clarified the novel points of our algorithms. We hope that you will let us know if there is any further concern. Thank you very much; we look forward to your response.

Dear reviewer zZwd,

As the discussion period has started, we would like to kindly inquire whether we have properly addressed your concerns. Specifically, we have clarified the evaluation of the color transfer experiment and the advantages of our algorithms. We are more than happy to address any further concerns. Thank you very much; we look forward to your response.

Last reminder:

Dear xyz,

There are only 14 hours before the discussion period ends; we hope that you can reply to our rebuttal and confirm whether our points have sufficiently addressed your concerns. We would be happy to clarify any further concerns in the short remaining time. Thank you.

To AC:

Dear AC,

As there is little time left in the discussion period, we hope that you can help encourage some reviewers to discuss and/or acknowledge our rebuttals. We have run all the extra suggested experiments and pointed out Reviewer 6hi2's factually incorrect claim that our space complexity is $O(n^3)$. Thank you very much for your work.

Final remark:

Dear AC,

Thank you for the kind response and your hard work. As a final point, we want to highlight that the main contributions of this work are two different, correct algorithms, proposed after showing that Sinkhorn is unsound in the context of POT. Moreover, as we have shown in the rebuttals, our algorithms have better time complexity than, and space complexity equal to, the (now invalid) SOTA Sinkhorn, and they scale to the $10000 \times 10000$ case and beyond. Thank you.