# ICML23_Geodesic_Interpolation_Rebuttal

<!-- ## Letter to Area Chair [Use this after discussion]
Dear Area Chairs, Senior Area Chairs, and Program Chairs,
We thank all the reviewers, ACs, SACs, and PCs for the time and effort spent assessing the contribution of our paper. We have added justification and experimental validation as requested during the discussion, and we hope this fully addresses the reviewers' concerns. However, we have a concern about the reply to reviewer fwar, as the comments are superficial and non-informative. This puts our work in an unfair position in the rebuttal and discussion phase. -->

## General response for all audience

Here is a list of modifications we have made:

* Writing:
  * Related works (by reviewers NEt5 and hvma)
  * Explicitly introduced Sinkhorn (by reviewer NEt5)
  * More about multi-marginal OT; introduced its limitation and an example in the appendix (by reviewers NEt5 and hvma)
* Methodology:
  * Multi-marginal OT and its limitation (by reviewers NEt5 and hvma)
  * The role of the geodesic (by reviewer hvma)
  * Alternating optimization of the regularized minimax objective for adversarial robustness (by reviewer hvma)
* Experiments and evaluation:
  * More about geodesic interpolation (by reviewer hvma)
  * Additional adversarial robustness metrics (by reviewer fwar)
  * Standard error in empirical robustness

## Response to Reviewer NEt5

First of all, we thank **reviewer NEt5** for the thorough and in-depth review of our paper! The critical comments allowed us to revisit the work and broaden our view of it. We have modified our draft (changes shown in <span style="color:blue">blue</span>), and we believe the comments and suggestions greatly strengthen this work.

<span style="color:blue">Modification highlights:</span>
* A clearer introduction of the McCann interpolation.
* A modified discussion of K-mixup (Greenewald et al., 2021).
* A detailed discussion of multi-marginal OT and the differences between multi-marginal OT and conventional 2-marginal OT, with a rigorous description that can be leveraged later in experiments; the benefits and limitations of our binary approximation in both Section 3.1 and the appendix.
* A detailed and explicit discussion of our OT operation in the embedding space, in Section 3.1.
* Related works (Liu, Gong, Liu 2022), (Albergo and Vanden-Eijnden 2022), and a more explicit introduction of the Sinkhorn algorithm.
* Corrected typos.

#### 1. Choose either $\mu_t:=((1-t)\,\mathrm{Id} + t T) \# \nu_0$ or $T_t(x)=(1-t)x+tT(x)$ for the McCann interpolation.

Thank you for the suggestion! Yes, it is indeed a little tautological to use both expressions for the same object. Initially, we used Eq. (3) as a natural extension that bridges the Wasserstein barycenter with the DRO objective, and Eq. (7) to induce the specific interpolation. Following the reviewer's suggestion, we now emphasize Eq. (3), $\mu_t:=((1-t)\,\mathrm{Id} + t T) \# \nu_0$, for its interpretation, and only invoke $T_t(x)=(1-t)x+tT(x)$ within Proposition 3.2. We hope this helps readers better appreciate our formulation.
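For concreteness, here is a minimal NumPy sketch of the displacement (McCann) interpolation $T_t(x)=(1-t)x+tT(x)$ on empirical samples. The map `transport_map` is a hypothetical stand-in for any estimated Monge map; this is our own illustration, not the paper's code.

```python
import numpy as np

def mccann_interpolate(x0, transport_map, t):
    """Displacement interpolation T_t(x) = (1 - t) x + t T(x).

    x0            : (n, d) array of samples from the source measure nu_0
    transport_map : callable mapping source samples to their images under an
                    (estimated) Monge map T -- a placeholder in this sketch
    t             : interpolation time in [0, 1]
    """
    x1 = transport_map(x0)          # T(x) for each source sample
    return (1.0 - t) * x0 + t * x1  # samples from mu_t = ((1-t) Id + t T)#nu_0

# Toy usage with a known affine map T(x) = x + 2 (an assumption for illustration).
x0 = np.random.randn(5, 2)
xt = mccann_interpolate(x0, lambda x: x + 2.0, t=0.5)
print(xt.shape)  # (5, 2)
```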
#### 2. Sub-optimal coupling from mini-batch OT still provides a reasonable path and improves training, as in (Greenewald et al., 2021).

We appreciate this important reminder. We have modified our related work section as follows, so as to provide a more objective comparison.

"One recent work (Greenewald et al., 2021) also explores mixing multi-batch samples with OT alignment. ~~However, their method relies on optimal coupling. In contrast, this work looks for an interpolation path given by optimal transport maps.~~ <span style="color:blue">That work demonstrates that K-mixup better preserves the data manifold, whereas our method proposes to find the worst-case Wasserstein barycenter realized by interpolation via transport maps.</span>"

#### 3. The limitation of multi-marginal OT in our formulation and the computational difficulties

We appreciate the reviewer pointing this out! As in our Eq. (4), we intended to formulate the adversarial barycenter with multi-marginal OT since we target a realistic multi-class classification problem. Indeed, the current mathematical machinery for multi-marginal OT (MOT) is still under development. The standard MOT linear programming (LP) solution can scale up to $O(n^{3m})$, where each of the $m$ measures is supported on $n$ atoms; even an accelerated Sinkhorn algorithm for MOT requires $\tilde{O}(m^3 n^{m+1})$. Seeking the worst-case barycenter ideally means solving a *free-support* Wasserstein barycenter, and we aim to incorporate these OT computations into large-scale deep learning pipelines. We therefore choose to exemplify the multi-marginal OT problem with $N=2$ marginal OT scenarios. As detailed in our response to your later question, we also clarify the limitation of doing so. We have changed the discussion in Section 3.1 to cover the current limitations of OT computation and of our approximation. Additionally, we incorporate an explicit explanation of the adversarial distributions in the distribution polytope in our limitation section.

<span style="color:blue">Modification</span>
"Here we make an important choice by focusing on the interpolation between $K=2$ distributions ... ~~We emphasize here that this is not a simplification but~~ <span style="color:blue">However, to date the efficient computational solutions for multi-marginal OT are still nascent (Lin et al., 2022, JMLR). Solving an MOT with LP would scale as $O(n^K)$, and would be even more costly when we want to solve for (free-support) barycenters to facilitate robustness in deep learning pipelines. Focusing on $K=2$ not only allows us to provide computationally feasible algorithms (Perrot et al.; Flamary et al.; Cuturi et al.),</span> but also to focus on the decision boundary between the different categories..."

#### 4. Clarify that we are interpolating within the embedding space

We have modified our narrative to emphasize this so as not to distract the readers. In fact, we have tried different embedding (dimensionality reduction) approaches, as shown in our appendix (Tables 12, 15, and 16); we observe that a better embedding leads to better certifiable robustness. We added a clarification in Section 3.3 to emphasize this right after the discussion of interpolating with OT maps, as follows.

<span style="color:blue">Modification</span>
"... Moreover, the map can project *out-of-sample* points, which improves generalization. <span style="color:blue">For the case where real data lie on a complex manifold, it is necessary to utilize an embedding space. Specifically, the Wasserstein geodesic and interpolation are computed in an embedding space, which is usually more regularised (Kingma et al., 2013).</span> Such OT results for labeled data distributions have been shown to be effective..."

#### 5. Related works, an explicit statement of the Sinkhorn algorithm, and the typos

We agree! We have corrected all the typos. We also modified the paragraph on **OT map estimation** with an explicit introduction of the Sinkhorn algorithm, and we have extended our related works so that readers can better learn the OT interpolation methodologies and theories. We also note that the suggested related work could serve as one map-estimation baseline in our work.

<span style="color:blue">Modification:</span>
"... interpolate the source dataset towards the target in a probabilistic fashion following the OT schema. <span style="color:blue">The recently proposed rectified flow (Liu et al., 2022) facilitates a generative process by interpolating data distributions along their optimal paths. In addition, the distribution interpolant (Albergo et al., 2022) has motivated an efficient flow model which avoids a costly ODE solver and instead minimizes a quadratic objective that controls the Wasserstein distance between the source and the target.</span>"

"... regarded as the two-sample estimation problem (Manole et al., 2021). <span style="color:blue">Among them, the celebrated Sinkhorn algorithm (Cuturi, 2013) solves the entropic Kantorovich objective $\hat{\pi}_{\epsilon,n}:=\arg\min_{\pi \in \Pi (\mathbb{R}^{n \times n})} \langle \pi, M \rangle + \epsilon H(\pi)$, where $H(\pi):=\sum_{ij} \pi_{ij} \log \pi_{ij}$ is the negative entropy, and provides an efficient $O(n^2)$ solution. Furthermore, </span>it was shown in Section 6.2 of (Pooladian & Niles-Weed, 2021) that an entropy-regularised map (given by $\hat{T}_{\epsilon,n}$) for empirical distributions can approximate..."
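To make the entropic objective above concrete, here is a minimal NumPy sketch of the Sinkhorn iteration for two empirical measures with uniform weights, together with the barycentric projection that yields an approximate transport map. Function and variable names are our own illustration (not the paper's implementation), and the cost rescaling is an assumption made for numerical stability.

```python
import numpy as np

def sinkhorn(X0, X1, eps=0.05, n_iter=500):
    """Entropic OT between uniform empirical measures on X0 and X1 (n x d each).
    Minimizes <pi, M> + eps * sum_ij pi_ij log pi_ij over couplings pi.
    The cost is rescaled by its max for stability; log-domain updates are
    preferable in practice."""
    n = X0.shape[0]
    M = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
    M = M / M.max()                              # rescale cost to [0, 1]
    K = np.exp(-M / eps)                         # Gibbs kernel
    a = b = np.full(n, 1.0 / n)                  # uniform marginal weights
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iter):                      # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]           # pi = diag(u) K diag(v)

def barycentric_map(pi, X1):
    """Barycentric projection T(x_i) = sum_j pi_ij x1_j / sum_j pi_ij."""
    return (pi @ X1) / pi.sum(axis=1, keepdims=True)

# Estimate the map between two toy point clouds and take one interpolation step.
X0 = np.random.randn(64, 2)
X1 = np.random.randn(64, 2) + 2.0
pi = sinkhorn(X0, X1)
Xt = 0.5 * X0 + 0.5 * barycentric_map(pi, X1)    # McCann interpolation at t = 0.5
```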
#### 6. Question: What is the limitation of just the 2-marginal optimal transport formulation?

Thank you for this question. It gave us a chance to revisit our work and explore it at a deeper level of understanding. Consider the setting of Eq. (4) in the paper. In the 2-class setting, $\tilde{\alpha}=(\alpha_1, 1-\alpha_1)$ is restricted to a straight line for varying values of $\alpha_1$ and, depending on the desired level of robustness $\epsilon$, can be constrained to lie in $(0,\epsilon)$. Generating augmented samples from the $\tilde{\alpha}$-barycenter leads to the geodesic perturbation in the paper. For the multi-class setting with $K$ classes, $\tilde{\alpha}=(\alpha_1,\dots,\alpha_K)$ is restricted to the polytope $\Delta_{K}=\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ \alpha_i \geq 0\}$, whose vertices $e_i$ (with 1 at the $i$-th position and 0 elsewhere) correspond to the $i$-th distribution weight. Depending on the assumed form of adversarial attack, augmented data samples can be generated from the $\tilde{\alpha}$-barycenter with $\tilde{\alpha}$ constrained to lie in specific subregions of the polytope. For example, if an equal $\epsilon$-level of robustness is desired for all forms of adversarial perturbation, $\tilde{\alpha}$ can be restricted to the set $\Delta^{(\epsilon,\dots,\epsilon)}_{K}=\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ 0 \leq \alpha_i \leq \epsilon\}$. Varying levels of conservativeness in classification can also be accommodated: when an $\epsilon_i$-level of resistance to attacks is desired for class $i$, $\tilde{\alpha}$ can be restricted to the set $\Delta^{(\epsilon_1,\dots,\epsilon_K)}_{K}=\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ 0 \leq \alpha_i \leq \epsilon_i\}$. The constraint sets can additionally be adjusted to cover specific attacks. Suppose it is known that attacks are designed to transform class $i$ objects into class $j$ objects and vice versa only, without affecting other classes. In this scenario the constraint set can be chosen as $\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ 0 \leq \alpha_i \leq \epsilon,\ 0 \leq \alpha_j \leq \epsilon,\ \alpha_l=0\ \forall l \notin \{i,j\}\}$. Similarly, three-way interactions can be addressed accordingly. This places our method as a basic foundation for achieving resistance to various forms of attack and establishes the novel aspect of our approach. The details of these extensions will be explored in future work.

<span style="color:blue">Modification:</span> We have included this discussion in our appendix. We sincerely hope this discussion of limitations further piques the readers' interest and helps them recognise the potential of this work.
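To illustrate the constraint sets described in question 6 above, here is a small NumPy sketch (our own illustration, not part of the paper) that draws barycenter weights $\tilde{\alpha}$ uniformly from the simplex and keeps only those inside $\Delta_K^{(\epsilon_1,\dots,\epsilon_K)}$ by rejection sampling.

```python
import numpy as np

def sample_constrained_alpha(eps, n_samples=1000, rng=None):
    """Draw weights alpha uniformly from the simplex Delta_K and keep those with
    alpha_i <= eps_i for every class i (rejection sampling).

    eps : length-K array of per-class robustness levels epsilon_i.
    Note: the constrained set is empty if sum(eps) < 1, and rejection
    sampling becomes slow as sum(eps) approaches 1.
    """
    rng = np.random.default_rng(rng)
    K = len(eps)
    kept = []
    while len(kept) < n_samples:
        alpha = rng.dirichlet(np.ones(K))   # uniform on the simplex
        if np.all(alpha <= eps):            # inside the constrained polytope
            kept.append(alpha)
    return np.array(kept)

# Equal epsilon-level robustness for K = 3 classes: alpha_i <= 0.6 for all i.
alphas = sample_constrained_alpha(np.array([0.6, 0.6, 0.6]), n_samples=5)
print(alphas.sum(axis=1))  # each row sums to 1
```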
## Response to Reviewer fwar

We are glad that **reviewer fwar** finds our work clearly written, finds the motivation of the worst-case Wasserstein barycenter reasonable and novel, and finds our mathematical justification well explained. So far, the only noted limitation is the lack of an additional $l_\infty$ robustness evaluation protocol, the PGD attack, beyond the FGSM evaluation we already report. Fortunately, we do have an implementation of the PGD attack, and since it shares similarities with FGSM we have provided the additional evaluation results as suggested. In addition, reviewer fwar suggested that we report the standard error.

<span style="color:blue">Modification highlights:</span>

Table 2. Tiny-ImageNet (64x64)

| | ERM | Mixup | Manifold | CutMix | AugMix | PuzzleMix | **Ours** |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| clean acc | 58.28 $\pm$ 0.87 | 56.32 $\pm$ 0.49 | 58.08 $\pm$ 0.13 | 57.71 $\pm$ 0.53 | 56.13 $\pm$ 0.95 | 63.31 $\pm$ 0.59 | **64.15 $\pm$ 0.37** |
| FGSM acc | 8.22 $\pm$ 0.55 | 11.21 $\pm$ 0.85 | 10.69 $\pm$ 0.92 | 11.61 $\pm$ 0.52 | 11.07 $\pm$ 0.27 | 7.82 $\pm$ 0.62 | 13.46 $\pm$ 0.66 |
| PGD acc | 2.12 $\pm$ 0.55 | 3.11 $\pm$ 0.85 | 5.02 $\pm$ 0.12 | 5.92 $\pm$ 0.28 | 5.12 $\pm$ 0.21 | 6.09 $\pm$ 0.32 | 7.14 $\pm$ 0.16 |
| Deepfool acc | 2.12 $\pm$ 0.55 | 3.11 $\pm$ 0.85 | 5.02 $\pm$ 0.12 | 5.92 $\pm$ 0.28 | 5.12 $\pm$ 0.21 | 6.09 $\pm$ 0.32 | 7.56 $\pm$ 0.21 |

#### 1. In all experiments, the accuracy is provided without standard error, which makes it hard to identify whether the benefit of the proposed method is really significant.

We would like to address this concern from two aspects.

(A) The **certifiable robustness** of (Cohen et al., 2019) provides a **lower bound** on the robust accuracy under **any** attack, under certain conditions and with certain *robust training* methods. In other words, the values in our Tables (4)-(18) are all **accuracy lower bounds**. More specifically, evaluating the robustness of a predictor around an input requires a Monte Carlo step in which we draw $n$ samples of $f_\theta(x + \delta)$ to approximate the prediction probability (Section 3.2.1 of (Cohen et al., 2019)) and eliminate the randomness. We built our implementation following the pipeline in https://github.com/locuslab/smoothing. In conclusion, certifiable robustness evaluates a **lower-bound** accuracy with a sampling scheme, and **all** the baseline methods' papers report certifiable robustness as we did.

(B) For Tables (1)-(3), we do have experimental results from multiple trials. Specifically, we kept the top 5 results for the experiments on both CIFAR-100 and Tiny-ImageNet, with many more ablation studies on MNIST. Initially, we omitted the standard error only to stay consistent with the certifiable robustness tables. We observe limited variance in the performance: the experiments on such large datasets are relatively stable. Moreover, we believe this is because our method starts from a probabilistic sampling perspective.
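For readers unfamiliar with the Monte Carlo step mentioned in (A), the following is a simplified PyTorch-style sketch of the smoothed prediction in Cohen et al. (2019): sample $n$ Gaussian-perturbed copies of the input and take a majority vote over the base classifier's predictions. It omits the binomial hypothesis test and abstain option used for the actual certificate, and names such as `base_classifier` and `sigma` are placeholders rather than the paper's code.

```python
import torch

@torch.no_grad()
def smoothed_predict(base_classifier, x, num_classes, sigma=0.25, n=1000, batch=100):
    """Monte Carlo majority vote for the smoothed classifier
    g(x) = argmax_c P(f(x + delta) = c), with delta ~ N(0, sigma^2 I).
    `x` is a single input of shape (C, H, W)."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    remaining = n
    while remaining > 0:
        b = min(batch, remaining)
        noisy = x.unsqueeze(0) + sigma * torch.randn(b, *x.shape)  # n noisy copies
        preds = base_classifier(noisy).argmax(dim=1)               # base predictions
        counts += torch.bincount(preds, minlength=num_classes)     # class vote counts
        remaining -= b
    return counts.argmax().item()
```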
#### 2. For CIFAR-100 and Tiny-ImageNet, robust accuracy is reported only for FGSM attacks, so it is unclear whether the benefits of the method are generalizable to other attacks (PGD, DeepFool, etc.; https://foolbox.readthedocs.io/en/stable/modules/attacks.html)

Thank you for your question. We agree that it is important to evaluate the robustness of our proposed method against other adversarial attacks. In fact, we already had PGD attack results. To be more specific, we set $\epsilon=2$, step size$=?$, and iteration$=4$.
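As a reference for this evaluation protocol, here is a minimal PyTorch sketch of a generic $l_\infty$ PGD attack loop with a configurable step size and iteration count. It is our own illustration (names such as `model` are placeholders, and inputs are assumed to lie in $[0,1]$), not the exact evaluation script used in the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, step_size, n_iter):
    """l_inf PGD: iteratively ascend the loss with signed gradients and
    project the perturbation back into the eps-ball around x."""
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)  # random start
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()        # gradient ascent step
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)           # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv
```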
#### Q.1 How can you explain that (Table 5 in the Appendix) the proposed augmentation method combined with SmoothMix and SmoothAdv works worse than SmoothMix and SmoothAdv in many settings?

Our experimental results demonstrate that our proposed approach of combining data augmentation with regularization significantly enhances the robustness of the model across a wider range of values (when $\sigma \geq 2.00$) compared to SmoothMix and SmoothAdv. This characteristic aligns with the desired properties for certifiable robustness. We hypothesize that the interpolation introduced by our approach fills the data distribution beyond the local area surrounding each data point, leading to the observed improvements in robustness. In contrast, both SmoothMix and SmoothAdv rely on adversarial samples to achieve robustness.

## Response to Reviewer hvma

Thank you for your encouraging assessment of our work! We are glad that you find that our "method is novel and the intuition behind why using the Wasserstein barycenter to improve robustness is well-explained." Our work proposes the new problem of the worst-case Wasserstein barycenter and covers a wide range of topics, e.g., optimal transport, adversarial robustness, and experimental results on large-scale image classification. We therefore truly appreciate your time, effort, and expertise in providing constructive comments on this work.

<span style="color:blue">Modification highlights</span>
* We added a more detailed discussion of the geodesic regularization, including a roadmap to new theoretical results and new experimental results.
* We added more discussion of how we approximate the multi-marginal OT problem with the 2-marginal OT from different aspects: we are constrained by the current mathematical machinery and we seek an efficient approach for large-scale deep learning applications.
* We now explicitly explain the alternating optimization scheme for the regularized minimax problem.
* We added the missing related works with discussion (Liutkus et al., 2019; Hua et al., 2023; Mroueh et al., 2021).

We hope our response addresses your concerns. If so, we would be grateful if you would consider improving your assessment of our work.

#### 1. Intuition of geodesic regularization, with empirical and theoretical justification

Thank you for the advice! Here we provide more discussion of our intuition for the geodesic regularization, a roadmap for a theoretical justification, and more clarification of the empirical effect of the geodesic regularization.

* More intuition and theoretical justification:
  * In our Eq. (9), we regularize the smoothness of the predictive function as the input distribution changes smoothly from one class to another, $t: 0 \mapsto 1$. This directly reduces the **"undesirable oscillations when predicting outside the training examples" [Zhang 2017, mixup]**, and thus yields **"an increased boundary thickness" [Yang 2020, thickness]**, which is shown to produce better robustness.
  * More directly, as shown in our Eq. (13) and Eq. (15), we are regularizing the **Jacobian** $\nabla f_\theta$ (see the sketch at the end of this response). It has been shown that Jacobian regularization [Hoffman 2020] improves the adversarial robustness of DNNs. Precisely, **"the Frobenius norm of the Jacobian at a given point is related to the closest adversarial example and to the curvature of the network's decision boundaries" [Jakubovitz 2018]**.
  * Our regularization admits Mixup as a special case when the transport maps are defined as $T^x(X_0)=X_1$, $T^y(Y_0)=Y_1$. Recent work has theoretically shown that the Mixup loss corresponds to minimizing an upper bound of the adversarial loss (Theorem 3.1 of [Zhang et al., 2021]). In addition, their Lemma 3.1 shows that the higher-order approximation of the Mixup-regularized loss contains $\nabla f_\theta(x)$, which further aligns with the special case of our regularization.

  Due to the time constraint, we are not providing a new theorem here, but this allows us to further investigate the benefit of OT geodesic regularization theoretically as a next step. We appreciate the reviewer's understanding.

<!-- * We have added additional analysis for the role of the geodesic regularization. Specifically, following proposition X, we have... [closed form solution, and compare it with]. -->

* Empirically, we provide additional explanation and experimental results on the MNIST dataset.
  * As shown in Figure (2) and Table (1), solely using the geodesic regularization term of Eq. (9) still improves the performance.
  * As shown in Table A below, just adding the regularization improves the empirical robustness; we select the regularization strength $\alpha_{reg}=5$ empirically.

Table A. The effect of the regularization (MNIST).

| | ERM ($\alpha_{reg}=0$) | ERM + $\alpha_{reg}$ Reg ($\alpha_{reg}=0.1$) | ERM + $\alpha_{reg}$ Reg ($\alpha_{reg}=1$) | ERM + $\alpha_{reg}$ Reg ($\alpha_{reg}=5$) |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| clean acc | 99.09 $\pm$ 0.02 | 99.17 $\pm$ 0.08 | 99.41 $\pm$ 0.06 | **99.27 $\pm$ 0.03** |
| robust acc | 31.47 $\pm$ 0.12 | 34.92 $\pm$ 0.13 | 35.19 $\pm$ 0.16 | **35.66 $\pm$ 0.20** |

<span style="color:blue">Modification: </span>
* We have put this table in our appendix.
* We have also modified the discussion at the end of Section 3.2 to better illustrate the role of the geodesic regularization, as follows:
  "... ~~This agrees with the recent theoretical understanding of mixup and robustness (Zhang et al., 2021)~~ <span style="color:blue">As shown later in Section 3.3, our regularization includes the Jacobian $\nabla f(x)$, which is shown to improve robustness by smoothing the decision boundary (Jakubovitz et al., 2018; Hoffman et al., 2020). Moreover, when the map is defined as $T^x(X_0)=X_1$, $T^y(Y_0)=Y_1$, our regularization admits Mixup as a special case, which has been theoretically shown to upper bound the adversarial loss (Theorem 3.1 of (Zhang et al., 2021)).</span>"
* We have included the whole discussion in the appendix.
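To make the Jacobian point above concrete, here is a minimal PyTorch sketch of a Frobenius-norm Jacobian penalty estimated with random projections through autograd, in the spirit of Hoffman et al. (2020). It is our own generic illustration of Jacobian regularization, not the paper's geodesic regularizer Eq. (15); `model` and the weighting `lam` are placeholders.

```python
import torch

def jacobian_frobenius_penalty(model, x, n_proj=1):
    """Unbiased estimate of E_x ||J_f(x)||_F^2 via random projections:
    for v ~ N(0, I), E ||v^T J||_2^2 = ||J||_F^2 (Hutchinson-style estimator)."""
    x = x.clone().detach().requires_grad_(True)
    out = model(x)                                   # shape (batch, num_classes)
    penalty = 0.0
    for _ in range(n_proj):
        v = torch.randn_like(out)                    # random output-space direction
        grad_x, = torch.autograd.grad((out * v).sum(), x, create_graph=True)
        penalty = penalty + grad_x.pow(2).flatten(1).sum(dim=1).mean()
    return penalty / n_proj

# Usage (hypothetical): total_loss = ce_loss + lam * jacobian_frobenius_penalty(model, x)
```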
#### 2. The paper currently is based on the assumption that there are two classes and optimizes the decision boundary between the two classes.

Thanks for raising this question; this is a key point! Our proposed method indeed targets the multi-marginal OT setting, as in the initial formulation Eq. (4) of our main text and Eq. (59) in Appendix Section C.3. However, we choose not to focus on the multi-class setting for the following reasons.

**(A)** The current mathematical machinery for efficient optimization of multi-marginal OT (MOT) is *limited*. Specifically, as also noted by **reviewer NEt5**, the MOT problem has a linear programming (LP) formulation with $mn$ constraints and $n^m$ variables (where $n$ is the number of support points and $m$ the number of marginals). Many existing LP algorithms *cannot* solve it with near-linear complexity, which makes it non-trivial to optimize over the space of (free-support) barycenters.

**(B)** In contrast, there are ample existing studies of the OT problem between $2$ measures, e.g., map estimation, gradient flows, and geodesic interpolation, including the two related works shared by **reviewer hvma** [Liutkus et al., 2019; Hua et al., 2023]. We build our work with the help of all of these.

**(C)** For adversarial robustness, we are able to study the sample complexity gap in the induced binary classification setting following a line of established works (Schmidt et al., 2018, NeurIPS; Carmon et al., 2019, NeurIPS; Dan et al., 2020, ICML; Deng et al., 2021, 2022), which helps us rigorously understand the relationship among *robustness improvement*, *sample complexity*, and *the barycenter position*, as shown in Proposition 3.3 and Theorem 4.3.

Lastly, **(D)** we have demonstrated that our approximation is effective even for large-scale datasets, which essentially encouraged us to present this work to the community.

In short, we recognize the multi-marginal OT setting but focus on 2-marginal OT problems for both theoretical and practical considerations. Multi-marginal OT and its (free-support) barycenter optimization are not yet ready, and are beyond the scope of this paper since we aim at a data-driven method for deep learning robustness. The relaxation into multiple 2-marginal OT problems allows us to exploit existing algorithms and theories from both the OT and robustness literatures. In addition, our approach is effective in experimental results. However, we do believe it is better to add more clarification and discussion of related works. We have added additional discussion in our Section 3.1 as well as in the appendix. More details can be found in our detailed responses to your Q.1 and to Q.6 of reviewer NEt5.

<!-- For example <span style="color:blue">Modification: </span> * In section 3.1 "Here we make an important choice by focusing on the interpolation between $N=2$ distributions rather than a generalized $K$-margin barycenter. <span style="color:blue">The computation of $K$-margin... </span>" * We have put the aforementioned discussion in the appendix. -->
<!-- <span style="color:red">Todo?: a more intuitive toy experiment suggested by the first reviewer? </span> -->

#### 3. There is no formula that says the final optimization problem being solved in the numerical experiments. The question of how to combine the Wasserstein barycenter and the explicit regularization term remains unclear. There is also no complexity or runtime analysis.

Thank you for pointing this out!
We have added more clarification of our numerical solution, in addition to our Fig. (1). Specifically, following the conventional minimax adversarial training scheme, we iteratively (A) update the neural network by minimizing the classification loss with the regularization term Eq. (15), and (B) obtain the augmented data by maximizing Eq. (6). We have included a more detailed discussion and a pseudocode in our response to your question 2, and we have made this clearer in our **Section 3.3** before the "scalability and batch OT" part, as follows.

<span style="color:blue">Modification:</span>
* "... and pretrained ResNets to obtain the embedding space. <span style="color:blue">We follow a standard adversarial training scheme where we iteratively (a) update the predictor $f_\theta$ with the objective $\min_\theta \mathbb{E}_{(x,y) \sim \sum_i P^i + \sum \nu_{adv}} [\,l(f_\theta(x), y)\,] + \lambda\, \text{Reg}(f_\theta)$, where the geodesic regularization $\text{Reg}(f_\theta)$ is computed via Eq. (15); and (b) find and store the augmented data by maximizing Eq. (6). A pseudocode, Algo. (2), is attached in the Appendix.</span>"
* As emphasized in our response to your question 2, we have changed the equation in Fig. (1), 2(b) to make the usage of the regularization explicit.

#### Q.1 Can the paper easily adapt to multi-class settings? How does the complexity of the algorithm scale up with the number of classes?

Thank you for raising this question. The short answer is: our method works effectively, but with some factors to explore! From the empirical perspective, when training large-scale neural networks, we iteratively sample data from two different categories so as to improve the robustness. Since neural networks can be viewed as universal function approximators, our approach smooths the decision boundary by regularizing the Jacobian of the network. On the other hand, from the mathematical perspective, it has been shown that multi-marginal OT does not yet have practical computational solutions [Lin et al., JMLR]. As a result, our choice is a reasonable strategy for approximating the multi-marginal adversarial barycenter while the mathematical machinery is not fully ready.

For the computational complexity, our method scales reasonably with the number of classes. Specifically, we list the computational cost of all the components:
* OT map estimation: we need $1$ map for binary classification; for $K$ classes, we need $K(K-1)/2$ transport maps. With $n$ samples in each batch, we use the Sinkhorn algorithm, $O(n^2)$, and the barycentric projection, $O(1)$. We deem this acceptable.
* Neither data augmentation nor geodesic regularization requires additional OT computation. In fact, most of the computational burden comes from the Jacobian of the neural network.
* Deep learning: in our experiments, the computational bottleneck remains the training of the deep neural network. Training a ResNet50 on Tiny-ImageNet on an A5000 GPU takes approximately 10 hours for 1000 epochs. We used techniques such as multiprocessing to accelerate the process.

In conclusion, our approach is effective for the multi-category deep learning classification task. We approximate the multi-marginal barycenter with an ensemble of two-class OT problems, not only for simplicity but also due to the lack of mathematical tools for the general problem. In implementation, the complexity of our approach scales reasonably with the number of classes.
<span style="color:blue">Modification:</span> We have added the above discussion to our appendix. Moreover, as also suggested by **reviewer NEt5**, we clarify the current limitation in our Section 3.1. For example, "Here we make an important choice by focusing on the interpolation between $K=2$ distributions ... ~~We emphasize here that this is not a simplification but~~ <span style="color:blue">However, to date the efficient computational solutions for multi-marginal OT are still nascent (Lin et al., 2022, JMLR). Solving an MOT with LP would scale as $O(n^K)$, and would be even more costly when we want to solve for (free-support) barycenters to facilitate robustness in deep learning pipelines. Focusing on $K=2$ not only allows us to provide computationally feasible algorithms (Perrot et al.; Flamary et al.; Cuturi et al.),</span> but also to focus on the decision boundary between the different categories..."

#### Q.2 In Proposition 3.3, there is a closed-form $\theta$ in the simplified problem. Is it used in the experiments? What is the algorithm or numerical method you use to solve the minimax problem with regularization?

Yes! We did use Eq. (15) in our code. We use the standard adversarial training pipeline (outer minimization, inner maximization) following the seminal works (cite, cite). In practice, we cast the *(min + reg) minimization of the classification loss* and the *(max) maximization for the worst-case barycenter* into an **alternating optimization** scheme, which is in fact shown as Fig. (1) 2(a) and 2(b), respectively.

* During the *min* step, we use *Adam* stochastic gradient descent to minimize the empirical loss with the geodesic regularization, with respect to both the original data and the augmented dataset: $\min_\theta \mathbb{E}_{(x,y) \sim \sum_i P^i + \sum \nu_{adv}} [\,l(f_\theta(x), y)\,] + \lambda\, \text{Reg}(f_\theta)$, where the geodesic regularization $\text{Reg}(f_\theta)$ is computed via Eq. (15). In addition, our closed-form solution in the embedding space allows us to take advantage of pretrained embeddings for large-scale datasets.
* During the *max* step, we freeze the classifier and find the adversarial barycenter that maximizes Eq. (8). The reason we switch from the barycenter to the Monge map is that this allows us to approach the geodesic optimization by Monte Carlo in a relatively efficient manner: we only need to sample the locations $t \in \mathbb{R}^1$ and evaluate the pushforward samples realized by the (entropic) transport map, as demonstrated in our Eq. (x). When training a deep neural network on a large dataset (for example, ImageNet (64x64)), the sample size is huge, so we sample each batch of data from only two different classes.

<span style="color:blue">Modification:</span>
1. We have changed the equation in Fig. (1), 2(b) from xxx to xxx to make the usage of the regularization explicit.
2. We have added a brief pseudocode to illustrate the practical adversarial training algorithm in the appendix, as follows.

```
Input: dataset {X, Y}, model, loss criterion, number of epochs, data-augmentation flag da
for every epoch:
    if da:  # data augmentation step
        {hat{x}_j, hat{y}_j} <- Algorithm 1, where Eq.(8) is invoked
    for every batch {X, Y} in the data:
        # a dedicated sampler draws each batch from only two classes
        pred = model(X)
        classification_loss = criterion(Y, pred)
        {X^0, Y^0}, {X^1, Y^1} = split {X, Y} by class
        pi_eps = Sinkhorn(X^0, X^1)                   # entropic coupling
        reg_loss = Reg(X^0, Y^0, X^1, Y^1, pi_eps)    # Eq.(15)
        total_loss = classification_loss + lambda * reg_loss
        total_loss.backward()
```

#### Q.3 Table 1, the title says "Ours: DA+Reg". Does it mean the accuracy in the second column is our accuracy?

In Table 1, **ERM+Ours** and **PGD+Ours** are the two use cases of our proposed method. The first means applying **DA (data augmentation)** and **Reg (regularization)** to standard **ERM (empirical risk minimization)**, and the second means applying **DA** and **Reg** to adversarial training with **PGD (projected gradient descent)**. The second column should read **ERM+Reg**, which means enhancing vanilla **ERM** with *only* the geodesic regularization; this is a standalone ablation study to evaluate the effectiveness of our regularization method.

More analysis of this table: (1) Using our proposed data augmentation and regularization, one can significantly boost the robustness, as shown in **ERM+Ours**. (2) Adversarial training is usually considered sufficient for providing robustness; however, we can further improve upon it, as shown in **PGD+Ours**, which achieves the best robust accuracy. (3) It is worth noting that the data augmentation is computationally expensive, as it involves solving a minimax problem. However, our results show that even using only our proposed regularization (without data augmentation) still improves the robustness, as shown in the **ERM+Reg** column.

<span style="color:blue">Modification:</span>

Table 1 (modified). Adversarial robustness (MNIST).

| | ERM | PGD | ERM+Reg | ERM+Ours | PGD+Ours |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Clean Acc. (%) | 99.09 $\pm$ 0.02 | 98.94 $\pm$ 0.03 | 99.27 $\pm$ 0.06 | **99.39 $\pm$ 0.03** | 99.34 $\pm$ 0.02 |
| Robust Acc. (%) | 31.47 $\pm$ 0.12 | 81.23 $\pm$ 0.17 | 35.66 $\pm$ 0.20 | 81.23 $\pm$ 0.13 | **82.74 $\pm$ 0.16** |

#### Q.4 Why are the mixup methods not compared in the certifiable robustness section?

We have provided an additional experimental result using the Mixup method! In general, vanilla mixup (linear interpolation) is not considered significantly helpful for *certifiable robustness*, since it uniformly selects samples from all categories and interpolates them according to some Beta distribution. We suspect the interpolations between data from the same category are not effective. Moreover, it is usually not included among certifiable robustness baselines, as can be seen in the related works (Cohen et al.).

<span style="color:blue">Modification:</span>

Table 5 (modified). (Part of the) certified accuracy (MNIST).

| $\sigma$ | Models (MNIST) | 2.00 | 2.25 | 2.50 | 2.75 |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 1.00 | Gaussian | 32.5 | 19.7 | 10.9 | 5.8 |
| 1.00 | Vanilla Mixup | 36.7 | 28.2 | 19.3 | 7.6 |
| 1.00 | **Ours** | **45.0** | **37.8** | **30.7** | **23.2** |

#### Q.5 At lines 305-306, $P^1$ and $P_2$ are not consistent.

Thank you for pointing this out! We have now unified all the notation for the data distributions as $P^i$.

#### Q.6 I recommend citing the following recent papers on optimal transport:

Thanks for bringing these works to our attention! We have added these papers to our related work section. Below is how we introduce them in our paper:

<span style="color:blue">Modification:</span> "... for large-scale high-dimensional datasets, and OT between labeled datasets (Alvarez-Melis & Fusi, 2020; Fan & Alvarez-Melis).
<span style="color:blue">Nonparametric gradient flows [1] provide satisfactory data generation with theoretical guarantees. Further, data synthesis on the feature-Gaussian manifold [2], which can be realized by the maximum mean discrepancy (MMD) gradient flow [3], has proven effective for transfer learning tasks.</span>"

[1] Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., & Stöter, F. R. (2019). Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning (pp. 4104-4113). PMLR.
[2] Hua, X., Nguyen, T., Le, T., Blanchet, J., & Nguyen, V. A. (2023). Dynamic flows on curved space generated by labeled data. arXiv preprint arXiv:2302.00061.
[3] Mroueh, Y., & Nguyen, T. (2021). On the convergence of gradient descent in GANs: MMD GAN as a gradient flow. In International Conference on Artificial Intelligence and Statistics (pp. 1720-1728). PMLR.
