Final message:

Dear AC,

Thank you again for your effort in coordinating the reviews and responses during this period. We believe that so far we have had generally fruitful discussions with the reviewers, which have helped clarify and enhance the contributions of our paper. *For convenience, here we would like to reiterate the main points in our rebuttal and responses to reviewers... -> Should we do this or just refer the AC to our official rebuttal?*

While the discussions between us and the reviewers have been productive, we would again like to bring to your attention the unresponsiveness of reviewer VicL in particular. It is our belief that the main concerns raised by this reviewer were adequately addressed in our rebuttal. Nevertheless, disregarding our empirical results, reviewer VicL continued to avoid acknowledging our answers or engaging in further open discussion. We really appreciate your frequent reminders throughout this author-reviewer discussion period. Unfortunately, we think that reviewer VicL, unlike the other three, did not want to commit to such a respectful and transparent discourse, and this has led to an unreasonable judgment of our work. We therefore urge you to consider the other reviews, especially their productive follow-up discussions, carefully when assessing this paper. While we cannot force reviewer VicL to respond, we believe our rebuttal should be taken into account in good faith, as we believe the other reviewers did.

Thank you again,
The authors

General response: All of the additional experiments are included in our attached PDF.

- $\textbf{G1:}$ Quantitative bound for Theorem 4.1.
- $\textbf{Reply:}$ We hereby revise Theorem 4.1 to provide a quantitative bound on the constraint violation of Sinkhorn. While it suggests increasing $A$ as a way to alleviate the infeasibility (note that $n$ and $\varepsilon$ should be the same inputs for both APDAGD and Sinkhorn), we highlight that Sinkhorn's complexity depends on $\|\mathbf{\tilde{C}}\|_\infty^2 = A^2$ (Dvurechensky et al., 2018), thereby exploding as $A \to \infty$.
- $\textbf{Revised Theorem 4.1:}$ We have
\begin{equation} \mathbf{1}^\top \bar{\mathbf{X}} \mathbf{1} \geq s + \exp\left(-\dfrac{12A \log{n}}{\varepsilon} - \mathcal{O}(\log{n})\right). \end{equation}
$\textbf{Proof:}$ Recalling that $\mathbf{1}^\top \bar{\mathbf{X}} \mathbf{1} = s + \tilde{X}_{n+1,n+1}$, we proceed to lower-bound $\tilde{X}_{n+1,n+1}$. The reformulated OT problem that Sinkhorn solves (with entropic regularization $\gamma$) is
\begin{equation} \min_{\mathbf{\tilde{X}} \geq 0} \langle \mathbf{\tilde{X}}, \mathbf{\tilde{C}} \rangle \quad \text{s.t.} \quad \mathbf{\tilde{X}} \mathbf{1} = \mathbf{\tilde{r}}, \quad \mathbf{\tilde{X}}^\top \mathbf{1} = \mathbf{\tilde{c}}. \end{equation}
Using the well-known dual formulation for entropic OT (Lin et al., 2019), we can minimize the Lagrangian w.r.t. $\mathbf{\tilde{X}}$ and obtain
\begin{equation} \mathbf{\tilde{X}} = \exp \left( - \dfrac{\mathbf{\tilde{C}}}{\gamma} + \mathbf{u} \mathbf{1}^\top + \mathbf{1} \mathbf{v}^\top\right), \end{equation}
where $\mathbf{u}, \mathbf{v}$ are the dual variables associated with the two marginal constraints. Considering only the $(n+1,n+1)$-th element, we deduce
\begin{equation} \tilde{X}_{n+1,n+1} = \exp \left(-\frac{\tilde{C}_{n+1,n+1}}{\gamma} + u_{n+1} + v_{n+1}\right) \geq \exp \left( - \dfrac{A}{\gamma} - \|\mathbf{u}\|_\infty - \|\mathbf{v}\|_\infty\right). \end{equation}
Similar to our Lemma F.2 for POT, the bound on the infinity norm of the OT dual variables in Lemma 3.2 of (Lin et al., 2019) shows that $\| \mathbf{u} \|_\infty, \| \mathbf{v} \|_\infty \leq R$, where $R = \dfrac{A}{\gamma} + \log(n) - 2 \log\left(\min_{1\leq i, j\leq n}\{r_i, c_j\}\right)$. With $\gamma = \dfrac{\varepsilon}{4 \log{n}}$ set by Sinkhorn, we obtain
\begin{equation} \tilde{X}_{n+1,n+1} \geq \exp\left(-\dfrac{12A \log{n}}{\varepsilon} - \mathcal{O}(\log{n})\right). \end{equation}
Next, through both wall-clock time and iteration count (see $\textbf{Figure 1}$ in our attached PDF), we empirically verify the tightness of our bound and show the practical outperformance of APDAGD over Sinkhorn.
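To make the revised Theorem 4.1 concrete, here is a minimal numerical sketch (our own illustration, not the paper's experiment code): a textbook Sinkhorn routine run on the dummy-point reformulation of a toy POT instance, printing the corner mass $\tilde{X}_{n+1,n+1}$, which equals the constraint violation $\mathbf{1}^\top \bar{\mathbf{X}} \mathbf{1} - s$. The instance, the fixed $\gamma$, and the values of $A$ are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def sinkhorn(C, r, c, gamma, n_iter=5000):
    # Textbook Sinkhorn iterations for entropic OT:
    #   min <X, C> - gamma * H(X)  s.t.  X 1 = r, X^T 1 = c.
    K = np.exp(-C / gamma)
    u, v = np.ones_like(r), np.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n, s = 20, 0.4
r = rng.random(n); r /= r.sum()          # ||r||_1 = 1
c = rng.random(n); c /= c.sum()          # ||c||_1 = 1
C = rng.random((n, n)); C /= C.max()     # normalized so ||C||_inf = 1

for A in [1.0, 2.0, 4.0]:
    # Dummy-point reformulation of POT as OT (Le et al., 2021):
    # augmented marginals and cost, with the (n+1, n+1) entry set to A.
    C_aug = np.zeros((n + 1, n + 1))
    C_aug[:n, :n] = C
    C_aug[n, n] = A
    r_aug = np.append(r, c.sum() - s)
    c_aug = np.append(c, r.sum() - s)
    X = sinkhorn(C_aug, r_aug, c_aug, gamma=0.1)
    # 1^T X_bar 1 = s + X[n, n], so the corner mass is the violation.
    print(f"A = {A}: violation 1^T X 1 - s = {X[n, n]:.3e}")
```

Consistent with the discussion above, the printed violation is strictly positive and shrinks as $A$ grows, while larger $A$ inflates $\|\mathbf{\tilde{C}}\|_\infty$ and hence Sinkhorn's complexity.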
- $\textbf{G2:}$ Whether Sinkhorn can be fixed (e.g., by applying ROUND-POT to Sinkhorn); whether the intolerable error degrades target applications' performance.
- $\textbf{Reply:}$ We remind the reviewers that in our paper's Figure 1(e), we applied our novel feasible rounding routine ROUND-POT to Sinkhorn, but the infeasibility problem remains; this shows that the infeasibility of Sinkhorn is fundamental to its algorithmic design. To highlight that such intolerable error can noticeably degrade the performance of a target application, we refer the reviewers to our Appendix J.3 on the $\textbf{Point Cloud Registration}$ application. Finally, we note that across all three real-world applications in this paper, APDAGD consistently improves over Sinkhorn, thereby validating the benefits of an efficient and feasible solver.
- $\textbf{G3:}$ Run-time comparison between APDAGD and Sinkhorn.
- $\textbf{Reply:}$ Regarding per-iteration cost, we report that the ratio between APDAGD's and Sinkhorn's is 0.68 in the same setting as Figure 1 of our paper. In our attached PDF, Figure 2 provides additional experiments showing that APDAGD also beats Sinkhorn in aggregate wall-clock time. Moreover, previous works have shown that gradient methods, and APDAGD in particular, are practically faster than Sinkhorn (see also Figure 1 in (Dvurechensky et al., 2018)).

$\textbf{Summary}$: All in all, we believe this work serves to inform practitioners of the potential constraint violation of Sinkhorn, which can worsen final results in real-world applications both quantitatively (Figure 4, Figure 2 ($c$)) and qualitatively (Figure 2 last column, Figure 7), especially in robust regimes such as point cloud registration. To this end, we provide novel algorithms filling both this gap in practicality (via APDAGD) and the unanswered question of the optimal theoretical rate (via DE) for solving POT. While we reported iteration counts in the paper as the most direct way to verify our theory, the new experiments on wall-clock time demonstrate the practical outperformance of APDAGD over the infeasible Sinkhorn and will be integrated into the future revision.

[1] Tianyi Lin, Nhat Ho, and Michael Jordan. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pages 3982–3991, 2019.
[2] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning, pages 1367–1376, 2018.
1. Reviewer 1vmH:
- Thank you very much for reviewing our work and for your insightful comments.

> It seems the authors argue that the current method used for solving POT – which I will simply call "Sinkhorn" as the authors do – produces a fixed error in the violation of the preservation-of-mass constraint, which their methods APDAGD and NE asymptotically will not produce. Figure 1 $(c)$ seems to indicate, however, that Sinkhorn can converge faster to some error level compared to APDAGD. Then, my question is, when would Sinkhorn be more desirable than APDAGD (or DE)? The paper does not specify such situations. I imagine there may be applications where the error produced by Sinkhorn could be tolerable as part of a tradeoff between iterations (computational time) vs. accuracy (i.e., running the algorithm for more iterations will be better in reducing the constraint violation error, but will take more time and computational effort). There might be problems perhaps where we are dealing with large support dimension or other settings that could make the dimensionality of the constraints larger, and thus larger the computational effort per iteration. In such a case, a practitioner may perhaps prefer to tolerate more error since iterations of the solver are more expensive. Thus, Figure 1 $(c)$ indicates that one would prefer Sinkhorn since it seems to "converge" quicker than APDAGD to its solution (though not an exact solution, but totally OK if our epsilon-error is large enough). So, are there specific examples when one would prefer Sinkhorn over the proposed algorithms? Is there perhaps an application where the "speed" of using Sinkhorn makes up for its constraint violation error? Figure 1 $(c)$ is a little bit misleading: it seems that Sinkhorn "converges" in almost no iterations (if any).

- $\textbf{Q1}$ (Regarding Figure 1 $(c)$): "Figure 1 $(c)$ seems to indicate that Sinkhorn can converge faster up to some error... Such error 'seems' pretty low..."
- $\textbf{Reply}$: For relatively simple datasets such as CIFAR-10, Sinkhorn can converge faster to some error levels, but it cannot improve past that point due to its inherent constraint violation (Figure 1 ($c$)). However, for more skewed datasets, which are more prevalent in real-world applications, Sinkhorn fails to achieve an adequate magnitude of error. A salient example is our color transfer experiment, in which $\textbf{Figure 2 (c)}$ indicates that Sinkhorn does not satisfy the desired tolerance of the problem, represented by the red line (lines 318–319). Another situation where intolerable error from Sinkhorn results in unsatisfactory performance is point cloud registration (details in the reply to $\textbf{Q2}$).
- $\textbf{Q2}$: "When would Sinkhorn be more desirable than APDAGD (or DE)?"; can the "speed" of Sinkhorn make up for its constraint violation, i.e., is the asymptotic error of Sinkhorn tolerable?
- $\textbf{Reply}$: Similar to our answer to the previous question, Sinkhorn might occasionally be preferred to APDAGD for adequately simple datasets. However, in the vast majority of real applications, datasets tend to be more skewed or nuanced. Then the infeasibility of Sinkhorn, which can result in intolerable errors (e.g., Figure 2 $(c)$), significantly degrades the target performance.
To illustrate this claim, we refer the reviewer to both the color transfer and point cloud registration experiments. The former illustrates that APDAGD (Figure 2 (l)) qualitatively outperforms Sinkhorn (Figure 2 (h)) by being much closer to the actual POT solution (Figure 2 (d)). Furthermore, in robust regimes (Remark 4.2) that require strict adherence to feasible solutions, like point cloud registration, APDAGD also shows significantly better results than Sinkhorn (Figure 7). In Figures 7 $(c)$ and (d), the POT solution leads to a much better result than the OT solution: the left halves of the rabbit are closer together. Importantly, the POT solution obtained by APDAGD leads to an even better alignment between the two point clouds compared to Sinkhorn, which is known to produce incorrect transport plans due to infeasibility.

> Also, it would be good if the authors could provide more plots like Figure 1 $(c)$ for all the experiments shown in subsections 7.1, 7.2, and 7.3. It is relevant for a practitioner to know whether there is evidence that Sinkhorn actually has a faster drop in the error, i.e., a faster convergence to its solution (of course, knowing that its solution will have some fixed error). Numerical simulations are important – after all, it was numerically how the authors found that APDAGD is faster than DE, which is valuable information for a practitioner.

> Also related to these points, in Figure 1 $(c)$, the y-axis represents a measure of violation of constraints. Such error "seems" pretty low, on the scale of 10^-8 for Sinkhorn. It would be better to have an idea of why such a small error in the violation of the constraints is important – the magnitude of the error is important as far as how it compares to the quantities being measured. There may be situations where the asymptotic error produced by Sinkhorn is tolerable.

- Please refer to the answers above.

> There is something that would be important to know. According to the paragraph in lines 167–171, it seems that the current approach to solve POT problems (as specified in reference [15]) is first to start with Sinkhorn and then do the rounding procedure from reference [11]. Then the question is, is it possible to replace the rounding procedure from [11] by the rounding procedure proposed by the authors, namely, ROUND-POT (with the addition of some adaptation for its direct use)? If so, I think it would be of utmost importance to see what happens if such a change in the rounding procedure is carried out – maybe the resulting algorithm will have a better error in satisfying the constraints than the current approach proposed by the authors. It could also happen that such a new algorithm could empirically show faster convergence than APDAGD (and DE), with zero asymptotic error. Can the authors investigate this and include it in the paper?

- $\textbf{Q3}$ (ROUND-POT for Sinkhorn): "... is it possible to replace the rounding procedure from [11] by ... ROUND-POT ... It could also happen that such new algorithm could empirically show faster convergence than APDAGD (and DE), with zero asymptotic error. Can the authors investigate this and include it in the paper?"
- $\textbf{Reply}$: We would like to kindly note that we have already implemented what the reviewer requested in Figure 1(e), stating in the caption that all iterates of both APDAGD and Sinkhorn go through ROUND-POT to ensure primal feasibility. This figure highlights that even with our ROUND-POT, the infeasibility of Sinkhorn cannot be fixed.
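For context on the rounding step under discussion, the sketch below shows the classical OT rounding of reference [11] (Altschuler et al., 2017, Algorithm 2), which the Sinkhorn pipeline uses to project an approximate plan onto $\mathcal{U}(\mathbf{r}, \mathbf{c})$; ROUND-POT (our Algorithm 1) plays the analogous role for the POT constraints and is not reproduced here. This is a generic textbook implementation, assuming a plan with strictly positive row and column sums (as Sinkhorn iterates are).

```python
import numpy as np

def round_ot(X, r, c):
    # Algorithm 2 of Altschuler et al. (2017): project an approximate
    # plan X >= 0 onto U(r, c) = {X : X 1 = r, X^T 1 = c}.
    row_scale = np.minimum(r / X.sum(axis=1), 1.0)   # shrink overfull rows
    X = X * row_scale[:, None]
    col_scale = np.minimum(c / X.sum(axis=0), 1.0)   # shrink overfull columns
    X = X * col_scale[None, :]
    err_r = r - X.sum(axis=1)                        # remaining row deficit
    err_c = c - X.sum(axis=0)                        # remaining column deficit
    if err_r.sum() > 0:                              # rank-one mass top-up
        X = X + np.outer(err_r, err_c) / err_r.sum()
    return X
```

The point of Figure 1(e) is that no such post hoc projection, whether this routine or ROUND-POT, can remove the mass that Sinkhorn places on the dummy entry $\tilde{X}_{n+1,n+1}$.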
> Theorem 4.1 only mentions that the mass-preservation equality constraint doesn't hold on the POT solution by [15]. However, it would be good to know whether such error reduces as one or more parameters of the problem change, e.g., the dimension of the supports of the measures n, the mass chosen to be transferred, etc. I know this may be difficult to prove, but at least empirical evidence should be included to complete that which couldn't be proved mathematically.

- $\textbf{Q4}$ (Regarding Theorem 4.1): We refer the reviewer to our global response addressing the theory and practical implications of Theorem 4.1.

> I am trying to understand why the statement $\| \mathbf{A}\|_1 = 3$ (or 2?) holds, as explained in line 251. Just by looking at lines 107-108, it doesn't seem immediate to reach such a conclusion. An extra explanation should be added. Also, as a follow-up, can more detail be added as to why [28] has a different value for $\| \mathbf{A}\|_1$?

- $\textbf{Q5}$ ($\|\mathbf{A}\|_1$): "I am trying to understand why the statement $\| \mathbf{A}\|_1 = 3$ (or 2?) holds, as explained in line 251... why [28] has a different value for $\| \mathbf{A}\|_1$"
- $\textbf{Reply}$: The matrix $\mathbf{A}$ in [28] is only equivalent to the $\mathbf{A}'$ part of our matrix $\mathbf{A}$ (line 108). As the 1-norm of a matrix is its maximum column sum, we can deduce from the construction of our matrix $\mathbf{A}$ that each column has at most one extra $1$-entry compared to the matrix $\mathbf{A}$ in [28], since we have the extra last row $[\mathbf{1}_{n^2}^\top \quad \mathbf{0}_{2n}^\top]$. So our $\| \mathbf{A}\|_1$ equals that of [28] plus 1. Since [28] gives an example showing that their $\| \mathbf{A}\|_1 = 2$, our $\| \mathbf{A}\|_1 = 3$.

> Why aren't Theorem 6.2 and Theorem 5.2 written in the same style? It would be much better to see them both written in an analogous way to make both results easier to compare to each other. For example, Theorem 6.2 mentions iterations in its statement, introducing the T, whereas Theorem 5.2 doesn't do any mentioning.

- $\textbf{Q6}$ (Theorem 5.2 & Theorem 6.2): "Why aren't Theorem 6.2 and Theorem 5.2 written in the same style?"
- $\textbf{Reply}$: We hope to clarify that Theorem 5.2 is the main convergence theorem for APDAGD, while the main convergence theorem for DE is Theorem 6.3 rather than the Theorem 6.2 in question. Theorem 6.2 is an intermediate result used to analyze the convergence of Alternating Minimization (AM), which is a subroutine of our main algorithm DE. The appearance of $T$ is merely due to the algorithmic difference between the two methods. In particular, while APDAGD ensures convergence by directly checking the errors (in the form of the dual gap and the constraint violation), DE runs its routine, i.e., the main for-loop, for enough iterations, i.e., the counter $T$, which ultimately yields provable convergence in the form of Theorems 6.2 and 6.3.

> Other issues...

- $\textbf{Other issues}$: We thank you for your detailed comments and will improve our work in future versions accordingly.
- "H in line 105?" - We hope to clarify that there is a typo in line 105; the correct version should be $\mathbf{b} \in H \subseteq \mathbb{R}^{2n+1}$. Specific details for $H$ are in Section E.2 (lines 611-613).
- "In line 202, is $H^\ast$ the dual space of $H$?"; "dimensions of $\lambda$?" - We clarify in line 613 that $H^\ast$ is the dual space of $H$.
As the dual space of $\mathbb{R}^{2n+1}$ is $\mathbb{R}^{2n+1}$ itself, we deduce $\mathbf{\lambda} \in H^\ast \subseteq \mathbb{R}^{2n+1}$.
- "In the caption of Figure 2, the symbol $f^\ast$ is used, which I guess is some dual function of $f$?" - $f^\ast$ is the common notation for the optimal value of the objective $f$ (to be clarified in the future version).
- "Also, why is the optimality gap important to track in Figure 2?" - The optimality gap is the prevalent metric in optimization for evaluating an algorithm's performance, in the sense of an $\varepsilon$-approximate solution (Definition 2.1). For example, by tracking the optimality gap in Figure 2 ($c$), we can quantitatively detect that Sinkhorn fails to reach the desired error $\varepsilon$.

2. Reviewer Abdp:
- Thank you very much for your review of and comments on our work.

> The authors should explain the main differences between the proposed APDAGD algorithm and [21] in this paper.

- $\textbf{Q1}$ (Differences from APDAGD in [21]): "The authors should explain the main differences between the proposed APDAGD algorithm and [21] in this paper."
- $\textbf{Reply}$: Due to the intricate differences in the coupling constraints between OT and POT, the POT problem leads to totally different primal and dual formulations than standard OT. To this end, the methodology in [21], which is designed for standard OT, is not directly applicable to the POT problem. In particular, our reformulation of the POT problem (lines 103–107) introduces a different structure for the matrix $\mathbf{A}$ and constraints $\mathbf{b}$, and requires an extra dual variable (see lines 223–224), which ultimately requires a more involved and refined analysis, as discussed in our proof sketch of Theorem 5.2.

> If possible, the authors should also add the complexity result of [21] to Table 1 for solving POT problems.

- $\textbf{Q2}$ ([21]'s complexity in Table 1): "If possible, the authors should also add the complexity result of [21] to Table 1 for solving POT problems."
- $\textbf{Reply}$: We kindly note that the result of [21] (Dvurechensky et al., 2018) is for the $\textit{standard OT}$ problem, while our table compares algorithms that solve the POT problem. A direct application of the algorithm in [21] would not solve the POT problem, due to the missing suitable rounding algorithm and the incompatibility in formulation and analysis between POT and OT, as mentioned in our reply to $\textbf{Q1}$.

> In Table 1, the iteration complexity of APDAGD is $\mathcal{O}(\sqrt{n}/ \varepsilon)$ while that of (infeasible) Sinkhorn is $\mathcal{O}(1/\varepsilon^2)$; the relationship between n and $\varepsilon$ should be discussed.

- $\textbf{Q3}$ (Relationship between $n$ and $\varepsilon$): "In Table 1, ... the relationship between n and $\varepsilon$ should be discussed."
- $\textbf{Reply}$: We thank the reviewer for the suggestion. From the theoretical rates provided, we can deduce that when $\sqrt{n} \leq 1 / \varepsilon$, APDAGD has a better rate than Sinkhorn, and vice versa. In practical applications, this means that when we want high accuracy, i.e., very small $\varepsilon$, APDAGD will perform better. We will add this clarification to the revised manuscript.
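As a quick illustration of this crossover (a toy computation under the Table 1 rates, not a wall-clock benchmark):

```python
import numpy as np

# Iteration-complexity rates from Table 1: APDAGD ~ sqrt(n)/eps,
# (infeasible) Sinkhorn ~ 1/eps^2; APDAGD wins once sqrt(n) <= 1/eps.
for n in [10**2, 10**4, 10**6]:
    eps_star = 1 / np.sqrt(n)
    print(f"n = {n:>7}: APDAGD's rate is better for eps <= {eps_star:.0e}")
```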
> Can the algorithms proposed by the authors be extended to solve UOT objectives?

- $\textbf{Q4}$ (UOT): "Can the algorithms proposed by the authors be extended to solve UOT objectives?"
- $\textbf{Reply}$: UOT is another alternative to POT, and it is quite independent of our work due to a fundamentally different objective formulation. There are several works on solving the UOT problem, such as (Chizat et al., 2018) and (Pham et al., 2020). We hope to emphasize that, unlike for UOT, the literature on POT is much more scant, and our work aims to fill this gap.

- $\textbf{References}$:
  - [1] Tianyi Lin, Nhat Ho, and Michael Jordan. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pages 3982–3991, 2019.
  - [2] Chizat, Lenaic, et al. "Scaling algorithms for unbalanced optimal transport problems." Mathematics of Computation 87.314 (2018): 2563–2609.
  - [3] Pham, Khiem, et al. "On unbalanced optimal transport: An analysis of Sinkhorn algorithm." International Conference on Machine Learning. PMLR, 2020.
  - For notational convenience, we use the same reference number [21] as the reviewer and our submission to cite:
  - [21] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning, pages 1367–1376, 2018.

3. Reviewer TPpx:

> I think the paper would be improved with a clearer outline of the existing failure mode - specifically, Theorem 4.1 should be augmented with a quantitative bound on infeasibility, at least for some example. The fact that this gap need not shrink to 0, as demonstrated by Figure 1 (d), is rather interesting, but it would be nice to know how large it can be compared to the problem parameters (and if it can be arbitrarily large).

- $\textbf{Q1}$ (Theorem 4.1): "Theorem 4.1 should be augmented with a quantitative bound on infeasibility..."
- $\textbf{Reply}$: We first refer the reviewer to our global response G1, which presents the revised Theorem 4.1 incorporating a quantitative bound on the constraint violation and empirically verifies that this bound is tight. From the revised theorem, while this error may be reduced by letting $A$ (or $\|\mathbf{\tilde{C}}\|_{\text{max}}$) tend towards infinity, we note that the complexity of Sinkhorn will then explode due to its dependence on $\|\mathbf{\tilde{C}}\|_\infty^2 = A^2$ (Dvurechensky et al., 2018).

> In particular, in the special case where both input measures have equal mass (where the POT problem can be relaxed without loss of generality to optimize over all measures with mass at least s and marginals dominated by the input measures), are these rounding issues still as problematic? I would consider increasing my score if quantitative bounds could be provided for the previous two issues.

- $\textbf{Q2}$: "In particular, in the special case where both input measures have equal mass (where the POT problem can be relaxed without loss of generality to optimize over all measures with mass at least $s$ and marginals dominated by the input measures), are these rounding issues still as problematic?"
- $\textbf{Reply}$: We kindly note that relaxing the constraint from "equal to" $s$ to "at least" $s$, though mathematically interesting, (i) may not have as much practical meaning, and (ii) would result in a totally different problem. For point (i), the whole point of $\textbf{partial}$ optimal transport is that one can transport partially and thus fine-tune the mass to suit a given scenario in realistic applications; also, one would not consider "less than" $s$, since in that case the optimal coupling would trivially be $\mathbf{0}$. For point (ii), we note that if the constraint is changed to "at least" $s$, one can prove that the equivalence between POT and standard OT no longer holds. This equivalence has been the workhorse for using Sinkhorn to solve standard OT before converting the solution back to POT, as done in the literature; as pointed out in this paper, that pipeline subtly results in infeasible POT solutions. Thus, in the "at least" $s$ case, the recipe of using Sinkhorn to solve standard OT and then rounding may not even be possible in the first place.

> I see the main contribution of this work as the rounding algorithm, which is pretty simple, since the two first-order algorithms follow somewhat standard approaches (though I recognize that refined error analysis was required). One question here: is there anything preventing one from composing this new rounding procedure with an application of Sinkhorn / other standard solvers to the dummy point reformulation? It would be nice if you could pinpoint why new algorithms are needed on top of the improved rounding procedure. If it is appropriate, Sinkhorn + Round-POT results should be included in the experiments. If not, please clarify any issues that would arise.

- $\textbf{Q3}$ (Sinkhorn + ROUND-POT): "Sinkhorn + Round-POT results should be included in the experiments."
- $\textbf{Reply}$: We hope to clarify that we have in fact run Sinkhorn + ROUND-POT in Figure 1 (e). As stated in the caption, we examine whether the primal gap is reduced when ROUND-POT is applied to Sinkhorn. This highlights that even with our ROUND-POT, the infeasibility of Sinkhorn cannot be fixed.

> It would be best to clarify SOTA complexities for standard OT. In particular, Table 1 only presents Sinkhorn guarantees for the "dummy point" approach, which can be improved in various ways. The authors cite a pretty thorough amount of related work on POT, but it is a bit incomplete when it comes to robust statistics applications. For example, [1] below provides formal guarantees for robust estimation based on POT and seems worth mentioning. I would have preferred that code be included in the supplement, though the authors promise to include it with the final version.

- $\textbf{Q4}$ (Minor nits): "... is a bit incomplete when it comes to robust statistics applications."
- $\textbf{Reply}$: Thank you for pointing out this interesting POT application in robust statistics, which we were not aware of; we will definitely include citations on this topic, such as (Nietert et al., 2023), in the future version of the work. For the other minor nits, we are grateful for your comments and will revise our work accordingly.

- [1] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning, pages 1367–1376, 2018.
- [2] Nietert, Sloan, Rachel Cummings, and Ziv Goldfeld. "Robust estimation under the Wasserstein distance." arXiv preprint arXiv:2302.01237 (2023).
"Robust Estimation under the Wasserstein Distance." arXiv preprint arXiv:2302.01237 (2023). 4. Reviewer VicL: - Thank you very much for your review and comments for our work. With your intiative to have an open discussion on the main issues, we will reply as follows and are more than happy to explain or discuss more at your request in the author-reviewer discussion period. > ... While this is true, it lacks quantitative results. For instance, I believe that when $\gamma \rightarrow 0$, or even when $A \rightarrow \infty$, $X_{n+1,n+1} \rightarrow 0$ mitigating this infeasibility. Of course, these asymptotic regimes implies more iterations in the Sinkhorn loop (so reducing the feasibility gap is not free), but having a deeper understanding of this would be better. As currently stated, it is hard to assess from Theorem 4.1 how bad the inequality $\mathbf{1}^\top \tilde{\mathbf{X}}\mathbf{1} \geq s$ is. - $\textbf{Q1}$ (Theorem 4.1): "... when $\gamma \rightarrow 0$, or even when $A \rightarrow \infty$, $X_{n+1,n+1} \rightarrow 0$ mitigating this infeasibility... how bad the inequality $\mathbf{1}^\top \tilde{\mathbf{X}}\mathbf{1} \geq s$ is." - $\textbf{Reply}$: We kindly refer the reviewer to our global response G1 for providing the revised Theorem 4.1 with quantitative bound on the constraint violation. We thank the reviewer for thoughtful comments and intution that helped us to improve our work. As discussed in G1, increasing $A$ does help to alleviate the constraint violation but would significantly increase the overall cost of Sinkhorn due to its dependence on $A^2$. <!-- - For the sake of correctness, we will assume the reviewer want to ask about the inequality $\mathbf{1}^\top \bar{\mathbf{X}}\mathbf{1} \geq s$ instead of $\mathbf{1}^\top \tilde{\mathbf{X}}\mathbf{1} \geq s$ since $\bar{\mathbf{X}}$ is the final POT matrix. We first want to address that $A \rightarrow \infty$ will imply $\|C\|_{\text{max}} \rightarrow \infty$, causing the Sinkhorn computational complexity which includes this term to explode. - <Theorem 4.1>: varying $\gamma$ experiment --> > The experimental comparison between Sinkhorn and APDAGD (Figure 1) is only made in term of iterations. Is it clear that this comparison is fair/meaningful? I agree that both methods can be seen as gradient-descent-like algorithms, but it is unclear to me if the steps are comparable, either from a "theoretical" viewpoint (stepsize, role of $\gamma$ etc.) or a practical one (computational burden of one iteration for each algorithm). - $\textbf{Q2}$ "wall-clock per iteration cost? role of $\gamma$ in per iteration cost? In particular, The experimental comparison between Sinkhorn and APDAGD (Figure 1) is only made in term of iterations. Is it clear that this comparison is fair/meaningful? ... if the steps are comparable, either from a "theoretical" viewpoint (stepsize, role of $\gamma$ etc.) or a practical one (computational burden of one iteration for each algorithm)." - $\textbf{Reply}$: First, we clarify that $\gamma$ does not affect the per-iteration cost, but only affects the convergence rate of the algorithms. To this end, we reported the iteration count in the paper to directly and most accurately verify the convergence rates established by our theories. We really appreciate the reviewer's insight on the practical concern on the actual per-iteration cost and thus provide experiments showing the better wall-clock per iteration cost of APDAGD in Figure 3 of our attached PDF. 
- $\textbf{Q3}$: "role of $\gamma$?"; "the value of $\gamma$ is not provided as far as I can tell ... This question should be investigated for varying $\gamma$ I think." - $\textbf{Reply}$: We kindly note that the choice for $\gamma$ for APDAGD is given in Algorithm 3 (first line) and tuning $\gamma$ is in fact just equivalent to tuning $\varepsilon$. This misunderstanding that $\gamma$ can be an independent hyperparameter is indeed very common in the OT/POT/UOT literature. Nevertheless, to complement our understanding and discussion, we include total wall-clock time for both algorithms for varying $\varepsilon$ (and thus $\gamma$ as requested) in Figure 2 of our attached PDF. Finaly, (as also mentioned in G3), gradient methods and APDAGD are indeed known to be practically faster than Sinkhorn in wall-clock time (see Figure 1 in (Dvurechensky et al., 2018)). <!-- As we have given the tolerance $\varepsilon$ and $n$ for most experiments, it is a matter of plugging into the formula to get $\gamma$. For Sinkhorn ??? - the varying $\gamma$ experiments --> <!-- - As table 1 suggested, the two algorithms have a similar bound for cost per iteration, so it is natural for us to consider number of iterations. Regarding your question about stepsize, we hope to clarify that Sinkhorn does not use any stepsize while APDAGD got its stepsize from the line search subroutine. For your question regarding "the role of $\gamma$", --> <!-- - The computational burden of one iteration for each algorithm will be discussed later in reply to $\textbf{Q3}$. --> > From what the experiments currently tell us, even though Sinkhorn+Rounding does not satisfy the feasibility constraint, it is on the other hand much faster (in terms of iterations at least) at start (and seemingly "converges" faster). Even though asymptotically APDAGD is better (in terms of feasibility), it takes thousands of iterations to beat Sinkhorn. It is not clear to me (i) how long (running time) does it actually take? (ii) if it is worth it then. - $\textbf{Q4}$ (running time): "how long (running time) does it actually take?" - $\textbf{Reply}$: Please refer to our answers to Q2 and Q3 above. <!-- - We first note that Sinkhorn and APDAGD for POT have the same per-iteration complexity of $\tilde{O}(n^2)$, implying that, in practice, their per-iteration running time is only different by a constant factor. To provide more information on Figure 1 (comparing the convergence rate and constraint feasibility), after tracking the running time of both algorithms, we find that on average, each iteration of APDAGD takes 1.8 times as long as that of Sinkhorn. --> - $\textbf{Q5}$: "...if [APDAGD] is worth it then." - $\textbf{Reply}$: We do believe that the use APDAGD will worth it in robust applications that demands a strict adherence to the feasible solutions (Remark 4.2) like $\textbf{point cloud registration}$ (Figure 7), and thus serve as a complement to Sinkhorn. In Figures 7 ($c$) and (d), the POT solution obtained by APDAGD leads to a much better alignment between the two point clouds, compared to Sinkhorn, which produces incorrect transport plans due to infeasibility. Furthermore, in Figure 2 ($c$), we can quantitatively check whether the optimality gap of the two algorithms Sinkhorn and APDAGD will be lower than the tolerance $\varepsilon$ of the problem. 
> From what the experiments currently tell us, even though Sinkhorn+Rounding does not satisfy the feasibility constraint, it is on the other hand much faster (in terms of iterations at least) at the start (and seemingly "converges" faster). Even though asymptotically APDAGD is better (in terms of feasibility), it takes thousands of iterations to beat Sinkhorn. It is not clear to me (i) how long (running time) does it actually take? (ii) if it is worth it then.

- $\textbf{Q4}$ (Running time): "how long (running time) does it actually take?"
- $\textbf{Reply}$: Please refer to our answers to Q2 and Q3 above.
- $\textbf{Q5}$: "...if [APDAGD] is worth it then."
- $\textbf{Reply}$: We do believe that APDAGD is worth it in robust applications that demand strict adherence to feasible solutions (Remark 4.2), like $\textbf{point cloud registration}$ (Figure 7), and that it thus serves as a complement to Sinkhorn. In Figures 7 ($c$) and (d), the POT solution obtained by APDAGD leads to a much better alignment between the two point clouds compared to Sinkhorn, which produces incorrect transport plans due to infeasibility. Furthermore, in Figure 2 ($c$), we can quantitatively check whether the optimality gap of either algorithm falls below the tolerance $\varepsilon$ of the problem. In short, as the optimality gap of Sinkhorn cannot be reduced below the tolerance (the red line mentioned in line 319), Sinkhorn fails to produce an adequate $\varepsilon$-approximate solution regardless of its speed. Also, Figure 2 (d, h, l) qualitatively shows how the violation of the feasibility constraint leads to a worse result for Sinkhorn, farther from the real solution (Figure 2 (d)).

> Related to the point (ii) above, the third experiment (domain adaptation) brings scarce insights to me. From my understanding, it is a single experiment on a synthetic dataset which showcases a relatively minor improvement in terms of performances for APDAGD over Sinkhorn. It is hard to conclude from this single experiment that overcoming the feasibility issue in Sinkhorn+Rounding is worth it. This is especially striking given the various parameters to be investigated ($\gamma$, $\epsilon$, number of iterations, running time, etc.)

- $\textbf{Q6}$: "Related to the point (ii) above, the third experiment (domain adaptation) brings scarce insights to me... It is hard to conclude from this single experiment that overcoming the feasibility issue in Sinkhorn+Rounding is worth it... More extensive ML experiments that showcase the importance of fixing the infeasibility issue occurring in Sinkhorn+Rounding would also make the significance of the contribution much more convincing."
- $\textbf{Reply}$: We believe that the improvement in domain adaptation, though modest in the reviewer's opinion, is still noticeable, and it is only one of the three real-world applications in this paper, in all of which APDAGD showcases its benefits. For an even more significant improvement, we kindly refer the reviewer to $\textbf{point cloud registration}$ (Figure 7). Finally, we highlight that across all three applications, the results also demonstrate that the infeasibility of Sinkhorn does degrade the target performance. Regarding Sinkhorn+Rounding, we refer the reviewer to our global response G2: even adopting our feasible rounding routine cannot fix Sinkhorn.

$\textbf{Summary:}$ We believe that Sinkhorn is indeed an important class of algorithms, and future studies addressing this infeasibility of Sinkhorn, which can be inspired by or benefit from existing results in this work, would be a worthy and interesting topic. Also, the fact that there has been no work designing and analyzing first-order methods for POT (despite the rich literature on the OT counterpart) already gives merit to the two proposed gradient algorithms in this paper!

- Reply to reviewer 3's comment: Thank you very much for the prompt and interesting response. While your scaling idea is insightful, the new coupling subtly impacts the quality of the $\varepsilon$-approximate solution. Let the new coupling (in our discrete context) be $\mathbf{X}'$ after scaling every element of $\mathbf{X}$ by $\dfrac{s}{\mathbf{1}^\top \mathbf{X} \mathbf{1}}$. Note that
\begin{align} | \langle \mathbf{C},\mathbf{X}' \rangle - \langle \mathbf{C},\mathbf{X} \rangle | &= \left| \frac{s}{\mathbf{1}^\top \mathbf{X} \mathbf{1}} \langle \mathbf{C},\mathbf{X} \rangle - \langle \mathbf{C},\mathbf{X} \rangle \right| \\ &= \left|\frac{s}{\mathbf{1}^\top \mathbf{X} \mathbf{1}} - 1 \right| \langle \mathbf{C},\mathbf{X} \rangle \\ &= \frac{|\mathbf{1}^\top \mathbf{X} \mathbf{1} - s|}{\mathbf{1}^\top \mathbf{X} \mathbf{1}} \langle \mathbf{C},\mathbf{X} \rangle \\ &\geq \frac{|\mathbf{1}^\top \mathbf{X} \mathbf{1} - s|}{\min \{\|\mathbf{r}\|_1,\| \mathbf{c}\|_1\}} \langle \mathbf{C},\mathbf{X} \rangle \\ &\geq |\mathbf{1}^\top \mathbf{X} \mathbf{1} - s| \langle \mathbf{C},\mathbf{X} \rangle, \end{align}
where the last two inequalities come from the fact that $\mathbf{1}^\top \mathbf{X} \mathbf{1} \leq s \leq \min \{\|\mathbf{r}\|_1,\| \mathbf{c}\|_1\} \leq 1$ (or equal to 1 in your example). Thus, the scaling step suggested by the reviewer would always incur an additional error lower-bounded by $|\mathbf{1}^\top \mathbf{X} \mathbf{1} - s| \langle \mathbf{C},\mathbf{X} \rangle$, i.e., governed by how far $\mathbf{1}^\top \mathbf{X} \mathbf{1}$ is from $s$, so changing "equal to" $s$ to "at least" $s$ would not completely resolve the problem.
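A small numerical check of this lower bound (our own sanity check on a random instance; the sizes and masses are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 30, 0.5
C = rng.random((n, n))
X = rng.random((n, n))
X *= 0.45 / X.sum()                       # plan with mass 1^T X 1 = 0.45 <= s
X_scaled = X * (s / X.sum())              # the suggested rescaling step

lhs = abs((C * X_scaled).sum() - (C * X).sum())
rhs = abs(X.sum() - s) * (C * X).sum()    # lower bound, valid since 1^T X 1 <= 1
print(lhs >= rhs)                         # True on this instance
```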
Reviewer VicL: Thank you for the acknowledgement. We are more than happy to address any of your concerns in this period, regarding not only your comments but also those of the other reviewers.

Dear Area Chairs,

Thank you very much for your reminders to the reviewers. We hope that the next reminders will encourage the rest of the reviewers to respond, especially Reviewer 1vmH and Reviewer Abdp. Following Reviewer VicL's reply and previous request for "an open discussion," we would be very pleased to respond to any arising questions from them. If possible, please kindly encourage reviewer VicL to direct any remaining questions in the private discussion between reviewers to us, which we would be happy to answer!

Best regards,
The authors

Reviewer TPpx: Thank you for your response and acknowledgment of our previous answer. For the new question, we first hope to reiterate that, theoretically, the complexity of Sinkhorn even with ROUND-POT will involve $\|\mathbf{\tilde{C}}\|_\infty^2 = A^2$ (General Response G1). On the other hand, we normalize the cost matrix so that $\|\mathbf{C}\|_\infty = 1$, in accordance with most practical OT/POT implementations, and we do not have to choose a sufficiently large $A$ to ensure feasibility. Hence, the term $\|\mathbf{C}\|_\infty$ does not affect our computational complexity in practice. Furthermore, we refer the reviewer to $\textbf{Figure 1}$ in our attached PDF, in which we show that Sinkhorn takes increasingly more iterations than APDAGD to converge, as predicted by the theoretical complexity bounds. More iterations undoubtedly lead to more aggregate wall-clock time, as we also show that APDAGD can have a lower per-iteration cost (General Response G3). We also note that some examples with complex datasets will require $A$ to be greater than $10^6$ for the error to be tolerable, affecting the overall performance of Sinkhorn.

Reviewer TPpx: Thank you for the clarification of your question. We will give a brief reasoning to obtain the complexity of Sinkhorn + ROUND-POT as $\mathcal{O} \left(\dfrac{n^2 \|\mathbf{\tilde{C}}\|_\infty^2 \ln(n)}{\varepsilon^2}\right)$, i.e., $\mathcal{O} \left(\dfrac{n^2 A^2 \ln(n)}{\varepsilon^2} \right)$, since we do not have enough space to formally re-prove the Sinkhorn complexity. There are several works in the literature providing the complexity of Sinkhorn with the standard OT rounding (Dvurechensky et al., 2018; Altschuler et al., 2017). Firstly, (Dvurechensky et al., 2018) provides the tightest bound in the literature for the Sinkhorn algorithm for OT with the well-known OT rounding of (Altschuler et al., 2017): $\mathcal{O} \left(\dfrac{n^2 \|\mathbf{C}\|_\infty^2 \ln(n)}{\varepsilon^2} \right)$. Specifically, in Algorithm 2 of (Dvurechensky et al., 2018), the rounding algorithm from (Altschuler et al., 2017) projects the Sinkhorn solution onto the OT constraint set $\mathcal{U}(\mathbf{r}, \mathbf{c})$ at a cost of $\mathcal{O}(n^2)$ (Altschuler et al., 2017). Since our ROUND-POT achieves similar guarantees (but for the constraints of the POT problem) in $\mathcal{O}(n^2)$ (our Theorem 3.2), we obtain the same overall complexity $\mathcal{O} \left(\dfrac{n^2 \|\mathbf{C}\|_\infty^2 \ln(n)}{\varepsilon^2} \right)$ for Sinkhorn + ROUND-POT to solve the POT problem. As (Le et al., 2021) augments the cost matrix $\mathbf{C}$ to $\mathbf{\tilde{C}}$ (see lines 162–163), the new $(n+1,n+1)$-th element $A$ is required to be greater than the original $\|\mathbf{C}\|_\infty$ (Chapel et al., 2020). Therefore, $A$ becomes $\|\mathbf{\tilde{C}}\|_\infty$, and the final complexity for this augmented-cost-matrix reformulation has the term $\|\mathbf{\tilde{C}}\|_\infty^2$ instead of $\|\mathbf{C}\|_\infty^2$. We will be more than happy to further clarify if the reviewer has any other concerns regarding this question.

References:
[1] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning, pages 1367–1376, 2018.
[2] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
[3] Khang Le, Huy Nguyen, Tung Pham, and Nhat Ho. On multimarginal partial optimal transport: Equivalent forms and computational complexity, 2021. URL https://arxiv.org/abs/2108.07992.
[4] Laetitia Chapel, Mokhtar Z. Alaya, and Gilles Gasso. Partial optimal transport with applications on positive-unlabeled learning. In Advances in Neural Information Processing Systems 33, 2020.
Reviewer TPpx: Thank you for the clarification of your concern. Note that we have shown previously that the constraint violation $|\mathbf{1}^\top \mathbf{X} \mathbf{1} - s|$ equals $\tilde{X}_{n+1,n+1}$, where $\mathbf{\tilde{X}}$ is the solution returned after running Sinkhorn and the well-known rounding algorithm for OT (Altschuler et al., 2017). Using an approach similar to our general comment G1, we can express $\tilde{X}_{n+1,n+1}$ as
\begin{equation} \tilde{X}_{n+1,n+1} = \exp \left(-\frac{\tilde{C}_{n+1,n+1}}{\gamma} + u_{n+1} + v_{n+1}\right). \end{equation}
Now, as $u_{n+1} \leq \|\mathbf{u}\|_\infty$ and $v_{n+1} \leq \|\mathbf{v}\|_\infty$, we can deduce
\begin{align} \tilde{X}_{n+1,n+1} &\leq \exp \left( - \dfrac{A}{\gamma} + \|\mathbf{u}\|_\infty + \|\mathbf{v}\|_\infty\right). \end{align}
We again utilize the bound on the infinity norm of the OT dual variables, Lemma 3.2 of (Lin et al., 2019), showing that $\| \mathbf{u} \|_\infty, \| \mathbf{v} \|_\infty \leq R$, where $R = \dfrac{A}{\gamma} + \log(n) - 2 \log\left(\min_{1\leq i, j\leq n}\{r_i, c_j\}\right)$. With $\gamma = \dfrac{\varepsilon}{4 \log{n}}$ set by Sinkhorn, we obtain
\begin{equation} \tilde{X}_{n+1,n+1} \leq \exp\left(\dfrac{4A \log{n}}{\varepsilon} + \mathcal{O}(\log{n})\right). \end{equation}
The equality-constraint violation is below the tolerance when
\begin{equation} \exp\left(\dfrac{4A \log{n}}{\varepsilon} + \mathcal{O}(\log{n})\right) \leq \varepsilon. \end{equation}

Reviewer TPpx: Thank you for the clarification of your question. We first want to clarify that our ROUND-POT (Algorithm 1) takes three inputs: an $n \times n$ matrix and two $n$-dimensional vectors. This implies that for the Sinkhorn solution $\mathbf{\tilde{X}}$ in the form (lines 166–167)
\begin{align} \left( \begin{array}{cc} \bar{\mathbf{X}} & \tilde{\mathbf{p}} \\ \tilde{\mathbf{q}}^\top & \tilde{X}_{n+1,n+1} \end{array} \right) \in \mathbb{R}_{+}^{(n+1) \times (n+1)}, \end{align}
we input into ROUND-POT only $\bar{\mathbf{X}}$, $\tilde{\mathbf{p}}$, and $\tilde{\mathbf{q}}$. The equality violation due to $\tilde{X}_{n+1,n+1}$ therefore cannot be improved by ROUND-POT. Next, we provide an upper bound for the violation and then evaluate how large $A$ should be to achieve the $\varepsilon$ tolerance. Consider the feasible transportation matrix $\mathbf{\hat{X}}$ with $\hat{X}_{i,j} = \dfrac{s}{n^2}$ for $i,j\in[n]$ and $\hat{X}_{i,j} = 0$ otherwise. As $\mathbf{\tilde{X}}$ is a Sinkhorn $\varepsilon$-approximate solution of POT, we have
\begin{align} \langle \mathbf{\tilde{C}}, \mathbf{\tilde{X}} \rangle - \gamma H(\mathbf{\tilde{X}}) &\leq \langle \mathbf{\tilde{C}}, \mathbf{\hat{X}} \rangle - \gamma H(\mathbf{\hat{X}}) + \varepsilon\\ &= \frac{s}{n^2} \sum^{n}_{i,j} C_{ij} - \gamma s \log\frac{s}{n^2} + \varepsilon \\ &\leq s \|\mathbf{C}\|_\infty - \gamma s \log\frac{s}{n^2} + \varepsilon. \end{align}
On the other hand, we also have
\begin{align} \langle \mathbf{\tilde{C}}, \mathbf{\tilde{X}} \rangle - \gamma H(\mathbf{\tilde{X}}) &\geq A \tilde{X}_{n+1,n+1} + \gamma \sum^{n+1}_{i,j} \tilde{X}_{ij} \log(\tilde{X}_{ij}) \\ &\geq A \tilde{X}_{n+1,n+1} + \gamma \left(\sum^{n+1}_{i,j} \tilde{X}_{ij}\right) \log \left( \frac{1}{n+1} \sum^{n+1}_{i,j} \tilde{X}_{ij} \right)\\ &= A \tilde{X}_{n+1,n+1} + \gamma \left(\sum^{n+1}_{i,j} \tilde{X}_{ij}\right) \log \left(\sum^{n+1}_{i,j} \tilde{X}_{ij} \right) - \gamma \left(\sum^{n+1}_{i,j} \tilde{X}_{ij}\right) \log \left( n+1\right)\\ &\geq (A + \gamma - \gamma \log(n+1))\tilde{X}_{n+1,n+1} + \gamma \left( \sum^{n}_{i,j} \tilde{X}_{ij} - 1 \right) - \gamma \left(\sum^{n}_{i,j} \tilde{X}_{ij}\right) \log \left( n+1\right), \end{align}
where the second step follows from Jensen's inequality, and the last step follows from the fact that $y\log{y} \geq y-1$ for all $y > 0$. Combining the above two inequalities, we deduce
\begin{align} \tilde{X}_{n+1,n+1} &\leq \frac{1}{A - \gamma (\log(n+1)-1)} \left( s \|\mathbf{C}\|_\infty - \gamma s \log\frac{s}{n^2} + \varepsilon + \gamma \left( 1 - \sum^{n}_{i,j} \tilde{X}_{ij} \right) + \gamma \left(\sum^{n}_{i,j} \tilde{X}_{ij}\right) \log \left( n+1\right) \right) \\ &= \mathcal{O}\left(\frac{\|\mathbf{C}\|_\infty}{A}\right). \end{align}
In order to achieve the $\varepsilon$ tolerance, we need
\begin{align} \mathcal{O}\left(\frac{\|\mathbf{C}\|_\infty}{A}\right) \leq \varepsilon, \end{align}
yielding the choice of $A$ to be
\begin{align} A = \mathcal{O}\left(\frac{\|\mathbf{C}\|_\infty}{\varepsilon}\right). \end{align}
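Combining the two results above makes the cost of repairing Sinkhorn explicit (a sketch of the substitution, using only the quantities derived above): plugging this choice of $A$ into the Sinkhorn + ROUND-POT complexity turns the $\varepsilon^{-2}$ rate into $\varepsilon^{-4}$,
\begin{align} \mathcal{O}\left(\frac{n^2 A^2 \ln(n)}{\varepsilon^2}\right) \xrightarrow{\; A = \mathcal{O}(\|\mathbf{C}\|_\infty / \varepsilon) \;} \tilde{\mathcal{O}}\left(\frac{n^2 \|\mathbf{C}\|_\infty^2}{\varepsilon^4}\right). \end{align}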
\end{align}$ Reviewer 1vmH: > What do the authors mean by "more skewed datasets" when referring those datasets where Sinkhorn tends to fail, i.e., when it gives a less tolerable error? - $\textbf{P1}$: "more skewed datasets" - > Given that Sinkhorn's problem still persists even when using POUND-POT, I think the paper needs to explain why this is the case. Though the proof might indicate why this is the case, the main paper should explain why it is that no matter which algorithm one might use to ensure feasibility, Sinkhorn has the intrinsic problem of violating the constraints. Can the authors provide such explanation? What is it? - $\textbf{P2}$: "[Why] Sinkhorn's problem still persists even when using POUND-POT"; "Sinkhorn becomes more complex when A increases (what kind of complexity?)"; "how slowly Sinkhorn becomes?" - $\textbf{Reply}$: Firstly, we note that in our submission, we follow the previous's work procedure for POT reformulation to OT with a small $A$ (Le et al., 2021). Hence, while having a relatively fast speed, Sinkhorn fails to return feasible solutions even with Round-POT. - Thanks to the comments and discussions with the reviewers, we continue to show that our Round-POT with a choice of a very large $A$, $\mathcal{O} \left( \frac{\|\mathbf{C}\|_{\text{max}}}{\varepsilon} \right)$, can fix the infeasibility of Sinkhorn's solutions. However, this large $A$ for Sinkhorn leads to a worse theoretical complexity of $\tilde{\mathcal{O}} \left(\frac{n^2 \|\mathbf{C}\|^2_{\text{max}} }{\varepsilon^4} \right)$ for Sinkhorn compared to $\tilde{\mathcal{O}} \left(\frac{n^2 \|\mathbf{C}\|^2_{\text{max}} }{\varepsilon^2} \right)$ as (Le et al., 2021) claimed. We have addressed how large A should be in terms of $\varepsilon$ and computational complexity with in details in $\textbf{Reply to Reviewer TPpx Fourth Comment}$. We have also practically showed that Sinkhorn takes increasingly more iterations to converge with an increasingly larger $A$ in Figure 1 of the attached PDF in the general response. Moreover, we show the wall-clock time per iteration for Sinkhorn is slower than APDAGD for varying $\varepsilon$ in Figure 2 of the attached PDF in the general response, reproducing similar result to $\textbf{Figure 1 (Dvurechensky et al., 2018)}$ (more details in our global response G3). > Following my previous point, I am still not convinced that in most relevant cases the violation of Sinkhorn is problematic. From what I understood, I think there is still room for cases where Sinkhorn is still desirable for its speed, albeit having its final error. For example, in the general response given by the authors it says that increasing A can make the error being lower, while it also says that the problem of this is making Sinkhorn more complex (what kind of complexity is this?). Now, could the authors provide simulations about this? About how Sinkhorn error may decrease, and check how slowly it becomes? This is important since Sinkhorn seems much faster than APDAGD in any case. - "could the authors provide simulations about this?" > The authors should include this quantification of the error in terms of A in the new statement of Theorem 4.1 How would the new statement of Theorem 4.1 look after this change? - $\textbf{P3}$: "The authors should include this quantification of the error in terms of A in the new statement of Theorem 4.1"; "How would the new statement of Theorem 4.1 look after this change?" 
- We note that in its current form in G1, the revised theorem 4.1 is insufficient to derive how large A is to make Sinkhorn feasible. As mentioned in our previous replies to $\textbf{P2}$, the detailed discussion on the the error in terms of $A$ is in our Reply to Reviewer TPpx Fourth Comment. - For the sake of clarity, we can further extend the revised Theorem 4.1 as follows \begin{align} s+ \tilde{\mathcal{O}}\left( \frac{\|\mathbf{C}^2\|_{\text{max}}}{A}\right) \geq \mathbf{1}^\top \bar{\mathbf{X}} \mathbf{1} \geq s + \exp\left(\frac{-12A \log{n}}{\varepsilon}\right) - \mathcal{O}(\log n) \end{align} References: - Khang Le, Huy Nguyen, Tung Pham, and Nhat Ho. On multimarginal partial optimal transport: Equivalent forms and computational complexity, 2021. URL https://arxiv.org/abs/2108.07992. - P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In International conference on machine learning, pages 1367–1376, 2018. <!-- - Laetitia Chapel, Mokhtar Z. Alaya, and Gilles Gasso. Partial optimal tranport with applications on positive-unlabeled learning. In Advances in Neural Information Processing Systems 33, 2020. --> Thank you very much for reading our reply and listing out your further queries. We would like to address them as follows. - $\textbf{P1}$: "more skewed dataset." - $\textbf{Reply}$: We observe this empirical Sinkhorn performance with the synthetic data which we can make the data spread unevenly. - $\textbf{P2}$: "[Why] Sinkhorn's problem still persists even when using POUND-POT"; "Sinkhorn becomes more complex when A increases (what kind of complexity?)"; "how slowly Sinkhorn becomes?" - $\textbf{Reply}$: Firstly, we note that in our submission, we follow the previous's work procedure for POT reformulation to OT with a small $A$ (Le et al., 2021). Hence, while having a relatively fast speed, Sinkhorn fails to return feasible solutions even with Round-POT. Thanks to the comments and discussions with the reviewers, we continue to show that our Round-POT with a choice of a very large $A$, $\mathcal{O} \left( \frac{\|\mathbf{C}\|_{\text{max}}}{\varepsilon} \right)$, can fix the infeasibility of Sinkhorn's solutions. However, this large $A$ for Sinkhorn leads to a worse theoretical complexity of $\tilde{\mathcal{O}} \left(\frac{n^2 \|\mathbf{C}\|^2_{\text{max}} }{\varepsilon^4} \right)$ for Sinkhorn compared to $\tilde{\mathcal{O}} \left(\frac{n^2 \|\mathbf{C}\|^2_{\text{max}} }{\varepsilon^2} \right)$ as (Le et al., 2021) claimed. We have addressed how large A should be in terms of $\varepsilon$ and computational complexity with in details in $\textbf{Reply to Reviewer TPpx Fourth Comment}$. We have also practically showed that Sinkhorn takes increasingly more iterations to converge with an increasingly larger $A$ in Figure 1 of the attached PDF in the general response. Moreover, we show the wall-clock time per iteration for Sinkhorn is slower than APDAGD for varying $\varepsilon$ in Figure 2 of the attached PDF in the general response, reproducing similar result to $\textbf{Figure 1 (Dvurechensky et al., 2018)}$ (more details in our global response G3). - $\textbf{P3}$: "The authors should include this quantification of the error in terms of A in the new statement of Theorem 4.1"; "How would the new statement of Theorem 4.1 look after this change?" 
- $\textbf{P4}$: "In the statement of Theorem 6.3: how is the 'time' related to $T$?"; "Are $T$ iterations of the main for-loop run for every time step of the DE algorithm? If so, why isn't this affecting the computational complexity of the algorithm?"
- $\textbf{Reply}$: Thank you for this feedback. The $T$ iterations are indeed the number of iterations needed for DE to achieve an $\varepsilon$-approximate solution (see lines 728-729 of the main paper). We would like to clarify that $T$ does appear in our proof of Theorem 6.2 (line 747, etc.), hence it affects the complexity of AM and, consequently, of DE. We agree with the reviewer and will add two or three sentences clarifying this subtle relation in future manuscripts for the sake of clarity.

References:

[1] Pham, K., Le, K., Ho, N., Pham, T., & Bui, H. (2020). On unbalanced optimal transport: An analysis of Sinkhorn algorithm. In International Conference on Machine Learning (pp. 7673-7682). PMLR.
[2] Nguyen, Q. M., Nguyen, H. H., Zhou, Y., & Nguyen, L. M. (2022). On unbalanced optimal transport: Gradient methods, sparsity and approximation error. arXiv preprint arXiv:2202.03618.
[3] Le, K., Nguyen, H., Pham, T., & Ho, N. (2021). On multimarginal partial optimal transport: Equivalent forms and computational complexity. arXiv preprint arXiv:2108.07992.
[4] Qin, H., Zhang, Y., Liu, Z., & Chen, B. (2022). Rigid registration of point clouds based on partial optimal transport. In Computer Graphics Forum (Vol. 41, No. 6, pp. 365-378).
[5] Nguyen, K., Nguyen, D., Pham, T., & Ho, N. (2022). Improving mini-batch optimal transport via partial transportation. In International Conference on Machine Learning (pp. 16656-16690). PMLR.
[6] Dvurechensky, P., Gasnikov, A., & Kroshnin, A. (2018). Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In International Conference on Machine Learning (pp. 1367-1376).
