# ICML23_Geodesic_Interpolation_Rebuttal

<!-- ## Letter to Area Chair [Use this after discussion]
Dear Area Chairs, Senior Area Chairs, and Program Chairs,
We thank all the reviewers, ACs, SACs, and PCs for the time and effort spent assessing the contribution of our paper. We have added justification and experimental validation as requested during the discussion, and we hope this fully addresses the reviewers' concerns. However, we have a concern about the reply to reviewer fwar, as the comments are superficial and non-informative. This puts our work in an unfair position in the rebuttal and discussion phase. -->

## General response for all audience

Here is a list of modifications we have made:

* Writing:
  * Related works (by reviewers NEt5 and hvma)
  * Explicitly introduced Sinkhorn (by reviewer NEt5)
  * More about multi-marginal OT; introduced its limitation and an example in the appendix (by reviewers NEt5 and hvma)
* Methodology:
  * Multi-marginal OT and its limitation (by reviewers NEt5 and hvma)
  * The role of the geodesic (by reviewer hvma)
  * Alternating optimization of the regularized minimax objective for adversarial robustness (by reviewer hvma)
* Experiments and evaluation:
  * More about geodesic interpolation (by reviewer hvma)
  * Additional adversarial robustness metrics (by reviewer fwar)
  * Standard error in empirical robustness

## Response to Reviewer NEt5

First of all, we thank **reviewer NEt5** for the thorough and in-depth review of our paper! The critical comments allowed us to revisit the work and broaden our view of it. We have modified our draft (changes shown in <span style="color:blue">blue</span>), and we believe the comments and suggestions greatly strengthen this work.

<span style="color:blue">Modification highlights:</span>
* A clearer introduction of the McCann interpolation.
* A modified discussion of K-mixup (Greenewald et al., 2021).
* A detailed discussion of multi-marginal OT and the differences between multi-marginal OT and conventional 2-marginal OT, with a rigorous description that can be leveraged later in experiments; the benefits and limitations of our binary approximation in both Section 3.1 and the appendix.
* A detailed and explicit discussion of our OT operation in the embedding space, in Section 3.1.
* Related works (Liu, Gong, Liu 2022), (Albergo and Vanden-Eijnden 2022), and a more explicit introduction of the Sinkhorn algorithm.
* Corrected typos.

#### 1. Choose either $\mu_t:=((1-t)\,\mathrm{Id} + t T) \# \nu_0$ or $T_t(x)=(1-t)x+tT(x)$ for the McCann interpolation.

Thank you for the suggestion! Yes, it is indeed a little tautological to use both expressions for the same object. Initially, we used Eq. (3) as a natural extension that bridges the Wasserstein barycenter with the DRO objective, and Eq. (7) to induce the specific interpolation. Following the reviewer's suggestion, we now emphasize Eq. (3), $\mu_t:=((1-t)\,\mathrm{Id} + t T) \# \nu_0$, for its interpretation, and only invoke $T_t(x)=(1-t)x+tT(x)$ within Proposition 3.2. We hope this helps readers better appreciate our formulation.
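For concreteness, here is a minimal NumPy sketch of the displacement (McCann) interpolation $T_t(x)=(1-t)x+tT(x)$ on empirical samples. The map `transport_map` is a hypothetical stand-in for any estimated Monge map; this is our own illustration, not the paper's code.

```python
import numpy as np

def mccann_interpolate(x0, transport_map, t):
    """Displacement interpolation T_t(x) = (1 - t) x + t T(x).

    x0            : (n, d) array of samples from the source measure nu_0
    transport_map : callable mapping source samples to their images under an
                    (estimated) Monge map T -- a placeholder in this sketch
    t             : interpolation time in [0, 1]
    """
    x1 = transport_map(x0)          # T(x) for each source sample
    return (1.0 - t) * x0 + t * x1  # samples from mu_t = ((1-t) Id + t T)#nu_0

# Toy usage with a known affine map T(x) = x + 2 (an assumption for illustration).
x0 = np.random.randn(5, 2)
xt = mccann_interpolate(x0, lambda x: x + 2.0, t=0.5)
print(xt.shape)  # (5, 2)
```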
#### 2. Sub-optimal coupling from mini-batch OT still provides a reasonable path and improves training, as in (Greenewald et al., 2021).

We appreciate this important reminder. We have modified our related work section as follows, so as to provide a more objective comparison.

"One recent work (Greenewald et al., 2021) also explores mixing multi-batch samples with OT alignment. ~~However, their method relies on optimal coupling. In contrast, this work looks for an interpolation path given by optimal transport maps.~~ <span style="color:blue">That work demonstrates that K-mixup better preserves the data manifold, whereas our method proposes to find the worst-case Wasserstein barycenter realized by interpolation via transport maps.</span>"

#### 3. The limitation of multi-marginal OT in our formulation and the computational difficulties

We appreciate the reviewer pointing this out! As in our Eq. (4), we intended to formulate the adversarial barycenter with multi-marginal OT since we target a realistic multi-class classification problem. Indeed, the current mathematical machinery for multi-marginal OT (MOT) is still under development. The standard MOT linear programming (LP) solution can scale up to $O(n^{3m})$, where each of the $m$ measures is supported on $n$ atoms; even an accelerated Sinkhorn algorithm for MOT requires $\tilde{O}(m^3 n^{m+1})$. Seeking the worst-case barycenter ideally means solving a *free-support* Wasserstein barycenter, and we aim to incorporate these OT computations into large-scale deep learning pipelines. We therefore choose to exemplify the multi-marginal OT problem with $N=2$ marginal OT scenarios. As detailed in our response to your later question, we also clarify the limitation of doing so. We have changed the discussion in Section 3.1 to cover the current limitations of OT computation and of our approximation. Additionally, we incorporate an explicit explanation of the adversarial distributions in the distribution polytope in our limitation section.

<span style="color:blue">Modification</span>
"Here we make an important choice by focusing on the interpolation between $K=2$ distributions ... ~~We emphasize here that this is not a simplification but~~ <span style="color:blue">However, to date the efficient computational solutions for multi-marginal OT are still nascent (Lin et al., 2022, JMLR). Solving an MOT with LP would scale as $O(n^K)$, and would be even more costly when we want to solve for (free-support) barycenters to facilitate robustness in deep learning pipelines. Focusing on $K=2$ not only allows us to provide computationally feasible algorithms (Perrot et al.; Flamary et al.; Cuturi et al.),</span> but also to focus on the decision boundary between the different categories..."

#### 4. Clarify that we are interpolating within the embedding space

We have modified our narrative to emphasize this so as not to distract the readers. In fact, we have tried different embedding (dimensionality reduction) approaches, as shown in our appendix (Tables 12, 15, and 16); we observe that a better embedding leads to better certifiable robustness. We added a clarification in Section 3.3 to emphasize this right after the discussion of interpolating with OT maps, as follows.

<span style="color:blue">Modification</span>
"... Moreover, the map can project *out-of-sample* points, which improves generalization. <span style="color:blue">For the case where real data lie on a complex manifold, it is necessary to utilize an embedding space. Specifically, the Wasserstein geodesic and interpolation are computed in an embedding space, which is usually more regularised (Kingma et al., 2013).</span> Such OT results for labeled data distributions have been shown to be effective..."

#### 5. Related works, an explicit statement of the Sinkhorn algorithm, and the typos

We agree! We have corrected all the typos. We also modified the paragraph on **OT map estimation** with an explicit introduction of the Sinkhorn algorithm, and we have extended our related works so that readers can better learn the OT interpolation methodologies and theories. We also note that the suggested related work could serve as one map-estimation baseline in our work.

<span style="color:blue">Modification:</span>
"... interpolate the source dataset towards the target in a probabilistic fashion following the OT schema. <span style="color:blue">The recently proposed rectified flow (Liu et al., 2022) facilitates a generative process by interpolating data distributions along their optimal paths. In addition, the distribution interpolant (Albergo et al., 2022) has motivated an efficient flow model which avoids a costly ODE solver and instead minimizes a quadratic objective that controls the Wasserstein distance between the source and the target.</span>"

"... regarded as the two-sample estimation problem (Manole et al., 2021). <span style="color:blue">Among them, the celebrated Sinkhorn algorithm (Cuturi, 2013) solves the entropic Kantorovich objective $\hat{\pi}_{\epsilon,n}:=\arg\min_{\pi \in \Pi (\mathbb{R}^{n \times n})} \langle \pi, M \rangle + \epsilon H(\pi)$, where $H(\pi):=\sum_{ij} \pi_{ij} \log \pi_{ij}$ is the negative entropy, and provides an efficient $O(n^2)$ solution. Furthermore, </span>it was shown in Section 6.2 of (Pooladian & Niles-Weed, 2021) that an entropy-regularised map (given by $\hat{T}_{\epsilon,n}$) for empirical distributions can approximate..."
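To make the entropic objective above concrete, here is a minimal NumPy sketch of the Sinkhorn iteration for two empirical measures with uniform weights, together with the barycentric projection that yields an approximate transport map. Function and variable names are our own illustration (not the paper's implementation), and the cost rescaling is an assumption made for numerical stability.

```python
import numpy as np

def sinkhorn(X0, X1, eps=0.05, n_iter=500):
    """Entropic OT between uniform empirical measures on X0 and X1 (n x d each).
    Minimizes <pi, M> + eps * sum_ij pi_ij log pi_ij over couplings pi.
    The cost is rescaled by its max for stability; log-domain updates are
    preferable in practice."""
    n = X0.shape[0]
    M = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
    M = M / M.max()                              # rescale cost to [0, 1]
    K = np.exp(-M / eps)                         # Gibbs kernel
    a = b = np.full(n, 1.0 / n)                  # uniform marginal weights
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iter):                      # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]           # pi = diag(u) K diag(v)

def barycentric_map(pi, X1):
    """Barycentric projection T(x_i) = sum_j pi_ij x1_j / sum_j pi_ij."""
    return (pi @ X1) / pi.sum(axis=1, keepdims=True)

# Estimate the map between two toy point clouds and take one interpolation step.
X0 = np.random.randn(64, 2)
X1 = np.random.randn(64, 2) + 2.0
pi = sinkhorn(X0, X1)
Xt = 0.5 * X0 + 0.5 * barycentric_map(pi, X1)    # McCann interpolation at t = 0.5
```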
#### 6. Question: What is the limitation of just the 2-marginal optimal transport formulation?

Thank you for this question. It gave us a chance to revisit our work and explore it at a deeper level of understanding. Consider the setting of Eq. (4) in the paper. In the 2-class setting, $\tilde{\alpha}=(\alpha_1, 1-\alpha_1)$ is restricted to a straight line for varying values of $\alpha_1$ and, depending on the desired level of robustness $\epsilon$, can be constrained to lie in $(0,\epsilon)$. Generating augmented samples from the $\tilde{\alpha}$-barycenter leads to the geodesic perturbation in the paper. For the multi-class setting with $K$ classes, $\tilde{\alpha}=(\alpha_1,\dots,\alpha_K)$ is restricted to the polytope $\Delta_{K}=\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ \alpha_i \geq 0\}$, whose vertices $e_i$ (with 1 at the $i$-th position and 0 elsewhere) correspond to the $i$-th distribution weight. Depending on the assumed form of adversarial attack, augmented data samples can be generated from the $\tilde{\alpha}$-barycenter with $\tilde{\alpha}$ constrained to lie in specific subregions of the polytope. For example, if an equal $\epsilon$-level of robustness is desired for all forms of adversarial perturbation, $\tilde{\alpha}$ can be restricted to the set $\Delta^{(\epsilon,\dots,\epsilon)}_{K}=\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ 0 \leq \alpha_i \leq \epsilon\}$. Varying levels of conservativeness in classification can also be accommodated: when an $\epsilon_i$-level of resistance to attacks is desired for class $i$, $\tilde{\alpha}$ can be restricted to the set $\Delta^{(\epsilon_1,\dots,\epsilon_K)}_{K}=\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ 0 \leq \alpha_i \leq \epsilon_i\}$. The constraint sets can additionally be adjusted to cover specific attacks. Suppose it is known that attacks are designed to transform class $i$ objects into class $j$ objects and vice versa only, without affecting other classes. In this scenario the constraint set can be chosen as $\{(\alpha_1,\dots,\alpha_K):\sum_i \alpha_i=1,\ 0 \leq \alpha_i \leq \epsilon,\ 0 \leq \alpha_j \leq \epsilon,\ \alpha_l=0\ \forall l \notin \{i,j\}\}$. Similarly, three-way interactions can be addressed accordingly. This places our method as a basic foundation for achieving resistance to various forms of attack and establishes the novel aspect of our approach. The details of these extensions will be explored in future work.

<span style="color:blue">Modification:</span> We have included this discussion in our appendix. We sincerely hope this discussion of limitations further piques the readers' interest and helps them recognise the potential of this work.
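To illustrate the constraint sets described in question 6 above, here is a small NumPy sketch (our own illustration, not part of the paper) that draws barycenter weights $\tilde{\alpha}$ uniformly from the simplex and keeps only those inside $\Delta_K^{(\epsilon_1,\dots,\epsilon_K)}$ by rejection sampling.

```python
import numpy as np

def sample_constrained_alpha(eps, n_samples=1000, rng=None):
    """Draw weights alpha uniformly from the simplex Delta_K and keep those with
    alpha_i <= eps_i for every class i (rejection sampling).

    eps : length-K array of per-class robustness levels epsilon_i.
    Note: the constrained set is empty if sum(eps) < 1, and rejection
    sampling becomes slow as sum(eps) approaches 1.
    """
    rng = np.random.default_rng(rng)
    K = len(eps)
    kept = []
    while len(kept) < n_samples:
        alpha = rng.dirichlet(np.ones(K))   # uniform on the simplex
        if np.all(alpha <= eps):            # inside the constrained polytope
            kept.append(alpha)
    return np.array(kept)

# Equal epsilon-level robustness for K = 3 classes: alpha_i <= 0.6 for all i.
alphas = sample_constrained_alpha(np.array([0.6, 0.6, 0.6]), n_samples=5)
print(alphas.sum(axis=1))  # each row sums to 1
```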
## Response to Reviewer fwar

We are glad that **reviewer fwar** finds our work clearly written, finds the motivation of the worst-case Wasserstein barycenter reasonable and novel, and finds our mathematical justification well explained. So far, the only noted limitation is the lack of an additional $l_\infty$ robustness evaluation protocol, the PGD attack, beyond the FGSM evaluation we already report. Fortunately, we do have an implementation of the PGD attack, and since it shares similarities with FGSM we have provided the additional evaluation results as suggested. In addition, reviewer fwar suggested that we report the standard error.

<span style="color:blue">Modification highlights:</span>

Table 2. Tiny-ImageNet (64x64)

| | ERM | Mixup | Manifold | CutMix | AugMix | PuzzleMix | **Ours** |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| clean acc | 58.28 $\pm$ 0.87 | 56.32 $\pm$ 0.49 | 58.08 $\pm$ 0.13 | 57.71 $\pm$ 0.53 | 56.13 $\pm$ 0.95 | 63.31 $\pm$ 0.59 | **64.15 $\pm$ 0.37** |
| FGSM acc | 8.22 $\pm$ 0.55 | 11.21 $\pm$ 0.85 | 10.69 $\pm$ 0.92 | 11.61 $\pm$ 0.52 | 11.07 $\pm$ 0.27 | 7.82 $\pm$ 0.62 | 13.46 $\pm$ 0.66 |
| PGD acc | 2.12 $\pm$ 0.55 | 3.11 $\pm$ 0.85 | 5.02 $\pm$ 0.12 | 5.92 $\pm$ 0.28 | 5.12 $\pm$ 0.21 | 6.09 $\pm$ 0.32 | 7.14 $\pm$ 0.16 |
| Deepfool acc | 2.12 $\pm$ 0.55 | 3.11 $\pm$ 0.85 | 5.02 $\pm$ 0.12 | 5.92 $\pm$ 0.28 | 5.12 $\pm$ 0.21 | 6.09 $\pm$ 0.32 | 7.56 $\pm$ 0.21 |

#### 1. In all experiments, the accuracy is provided without standard error, which makes it hard to identify whether the benefit of the proposed method is really significant.

We would like to address this concern from two aspects.

(A) The **certifiable robustness** of (Cohen et al., 2019) provides a **lower bound** on the robust accuracy under **any** attack, under certain conditions and with certain *robust training* methods. In other words, the values in our Tables (4)-(18) are all **accuracy lower bounds**. More specifically, evaluating the robustness of a predictor around an input requires a Monte Carlo step in which we draw $n$ samples of $f_\theta(x + \delta)$ to approximate the prediction probability (Section 3.2.1 of (Cohen et al., 2019)) and eliminate the randomness. We built our implementation following the pipeline in https://github.com/locuslab/smoothing. In conclusion, certifiable robustness evaluates a **lower-bound** accuracy with a sampling scheme, and **all** the baseline methods' papers report certifiable robustness as we did.

(B) For Tables (1)-(3), we do have experimental results from multiple trials. Specifically, we kept the top 5 results for the experiments on both CIFAR-100 and Tiny-ImageNet, with many more ablation studies on MNIST. Initially, we omitted the standard error only to stay consistent with the certifiable robustness tables. We observe limited variance in the performance: the experiments on such large datasets are relatively stable. Moreover, we believe this is because our method starts from a probabilistic sampling perspective.
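For readers unfamiliar with the Monte Carlo step mentioned in (A), the following is a simplified PyTorch-style sketch of the smoothed prediction in Cohen et al. (2019): sample $n$ Gaussian-perturbed copies of the input and take a majority vote over the base classifier's predictions. It omits the binomial hypothesis test and abstain option used for the actual certificate, and names such as `base_classifier` and `sigma` are placeholders rather than the paper's code.

```python
import torch

@torch.no_grad()
def smoothed_predict(base_classifier, x, num_classes, sigma=0.25, n=1000, batch=100):
    """Monte Carlo majority vote for the smoothed classifier
    g(x) = argmax_c P(f(x + delta) = c), with delta ~ N(0, sigma^2 I).
    `x` is a single input of shape (C, H, W)."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    remaining = n
    while remaining > 0:
        b = min(batch, remaining)
        noisy = x.unsqueeze(0) + sigma * torch.randn(b, *x.shape)  # n noisy copies
        preds = base_classifier(noisy).argmax(dim=1)               # base predictions
        counts += torch.bincount(preds, minlength=num_classes)     # class vote counts
        remaining -= b
    return counts.argmax().item()
```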
#### 2. For CIFAR-100 and Tiny-ImageNet, robust accuracy is reported only for FGSM attacks, so it is unclear whether the benefits of the method are generalizable to other attacks (PGD, DeepFool, etc.; https://foolbox.readthedocs.io/en/stable/modules/attacks.html)

Thank you for your question. We agree that it is important to evaluate the robustness of our proposed method against other adversarial attacks. In fact, we already had PGD attack results. To be more specific, we set $\epsilon=2$, step size$=?$, and iteration$=4$.
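As a reference for this evaluation protocol, here is a minimal PyTorch sketch of a generic $l_\infty$ PGD attack loop with a configurable step size and iteration count. It is our own illustration (names such as `model` are placeholders, and inputs are assumed to lie in $[0,1]$), not the exact evaluation script used in the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, step_size, n_iter):
    """l_inf PGD: iteratively ascend the loss with signed gradients and
    project the perturbation back into the eps-ball around x."""
    x_adv = x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)  # random start
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()        # gradient ascent step
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)           # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv
```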
#### Q.1 How can you explain that (Table 5 in the Appendix) the proposed augmentation method combined with SmoothMix and SmoothAdv works worse than SmoothMix and SmoothAdv in many settings?

Our experimental results demonstrate that our proposed approach of combining data augmentation with regularization significantly enhances the robustness of the model across a wider range of values (when $\sigma \geq 2.00$) compared to SmoothMix and SmoothAdv. This characteristic aligns with the desired properties for certifiable robustness. We hypothesize that the interpolation introduced by our approach fills the data distribution beyond the local area surrounding each data point, leading to the observed improvements in robustness. In contrast, both SmoothMix and SmoothAdv rely on adversarial samples to achieve robustness.

## Response to Reviewer hvma

Thank you for your encouraging assessment of our work! We are glad that you find that our "method is novel and the intuition behind why using the Wasserstein barycenter to improve robustness is well-explained." Our work proposes the new problem of the worst-case Wasserstein barycenter and covers a wide range of topics, e.g., optimal transport, adversarial robustness, and experimental results on large-scale image classification. We therefore truly appreciate your time, effort, and expertise in providing constructive comments on this work.

<span style="color:blue">Modification highlights</span>
* We added a more detailed discussion of the geodesic regularization, including a roadmap to new theoretical results and new experimental results.
* We added more discussion of how we approximate the multi-marginal OT problem with the 2-marginal OT from different aspects: we are constrained by the current mathematical machinery and we seek an efficient approach for large-scale deep learning applications.
* We now explicitly explain the alternating optimization scheme for the regularized minimax problem.
* We added the missing related works with discussion (Liutkus et al., 2019; Hua et al., 2023; Mroueh et al., 2021).

We hope our response addresses your concerns. If so, we would be grateful if you would consider improving your assessment of our work.

#### 1. Intuition of geodesic regularization, with empirical and theoretical justification

Thank you for the advice! Here we provide more discussion of our intuition for the geodesic regularization, a roadmap for a theoretical justification, and more clarification of the empirical effect of the geodesic regularization.

* More intuition and theoretical justification:
  * In our Eq. (9), we regularize the smoothness of the predictive function as the input distribution changes smoothly from one class to another, $t: 0 \mapsto 1$. This directly reduces the **"undesirable oscillations when predicting outside the training examples" [Zhang 2017, mixup]**, and thus yields **"an increased boundary thickness" [Yang 2020, thickness]**, which is shown to produce better robustness.
  * More directly, as shown in our Eq. (13) and Eq. (15), we are regularizing the **Jacobian** $\nabla f_\theta$ (see the sketch at the end of this response). It has been shown that Jacobian regularization [Hoffman 2020] improves the adversarial robustness of DNNs. Precisely, **"the Frobenius norm of the Jacobian at a given point is related to the closest adversarial example and to the curvature of the network's decision boundaries" [Jakubovitz 2018]**.
  * Our regularization admits Mixup as a special case when the transport maps are defined as $T^x(X_0)=X_1$, $T^y(Y_0)=Y_1$. Recent work has theoretically shown that the Mixup loss corresponds to minimizing an upper bound of the adversarial loss (Theorem 3.1 of [Zhang et al., 2021]). In addition, their Lemma 3.1 shows that the higher-order approximation of the Mixup-regularized loss contains $\nabla f_\theta(x)$, which further aligns with the special case of our regularization.

  Due to the time constraint, we are not providing a new theorem here, but this allows us to further investigate the benefit of OT geodesic regularization theoretically as a next step. We appreciate the reviewer's understanding.

<!-- * We have added additional analysis for the role of the geodesic regularization. Specifically, following proposition X, we have... [closed form solution, and compare it with]. -->

* Empirically, we provide additional explanation and experimental results on the MNIST dataset.
  * As shown in Figure (2) and Table (1), solely using the geodesic regularization term of Eq. (9) still improves the performance.
  * As shown in Table A below, just adding the regularization improves the empirical robustness; we select the regularization strength $\alpha_{reg}=5$ empirically.

Table A. The effect of the regularization (MNIST).

| | ERM ($\alpha_{reg}=0$) | ERM + $\alpha_{reg}$ Reg ($\alpha_{reg}=0.1$) | ERM + $\alpha_{reg}$ Reg ($\alpha_{reg}=1$) | ERM + $\alpha_{reg}$ Reg ($\alpha_{reg}=5$) |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| clean acc | 99.09 $\pm$ 0.02 | 99.17 $\pm$ 0.08 | 99.41 $\pm$ 0.06 | **99.27 $\pm$ 0.03** |
| robust acc | 31.47 $\pm$ 0.12 | 34.92 $\pm$ 0.13 | 35.19 $\pm$ 0.16 | **35.66 $\pm$ 0.20** |

<span style="color:blue">Modification: </span>
* We have put this table in our appendix.
* We have also modified the discussion at the end of Section 3.2 to better illustrate the role of the geodesic regularization, as follows:
  "... ~~This agrees with the recent theoretical understanding of mixup and robustness (Zhang et al., 2021)~~ <span style="color:blue">As shown later in Section 3.3, our regularization includes the Jacobian $\nabla f(x)$, which is shown to improve robustness by smoothing the decision boundary (Jakubovitz et al., 2018; Hoffman et al., 2020). Moreover, when the map is defined as $T^x(X_0)=X_1$, $T^y(Y_0)=Y_1$, our regularization admits Mixup as a special case, which has been theoretically shown to upper bound the adversarial loss (Theorem 3.1 of (Zhang et al., 2021)).</span>"
* We have included the whole discussion in the appendix.
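To make the Jacobian point above concrete, here is a minimal PyTorch sketch of a Frobenius-norm Jacobian penalty estimated with random projections through autograd, in the spirit of Hoffman et al. (2020). It is our own generic illustration of Jacobian regularization, not the paper's geodesic regularizer Eq. (15); `model` and the weighting `lam` are placeholders.

```python
import torch

def jacobian_frobenius_penalty(model, x, n_proj=1):
    """Unbiased estimate of E_x ||J_f(x)||_F^2 via random projections:
    for v ~ N(0, I), E ||v^T J||_2^2 = ||J||_F^2 (Hutchinson-style estimator)."""
    x = x.clone().detach().requires_grad_(True)
    out = model(x)                                   # shape (batch, num_classes)
    penalty = 0.0
    for _ in range(n_proj):
        v = torch.randn_like(out)                    # random output-space direction
        grad_x, = torch.autograd.grad((out * v).sum(), x, create_graph=True)
        penalty = penalty + grad_x.pow(2).flatten(1).sum(dim=1).mean()
    return penalty / n_proj

# Usage (hypothetical): total_loss = ce_loss + lam * jacobian_frobenius_penalty(model, x)
```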
#### 2. The paper currently is based on the assumption that there are two classes and optimizes the decision boundary between the two classes.

Thanks for raising this question; this is a key point! Our proposed method indeed targets the multi-marginal OT setting, as in the initial formulation Eq. (4) of our main text and Eq. (59) in Appendix Section C.3. However, we choose not to focus on the multi-class setting for the following reasons.

**(A)** The current mathematical machinery for efficient optimization of multi-marginal OT (MOT) is *limited*. Specifically, as also noted by **reviewer NEt5**, the MOT problem has a linear programming (LP) formulation with $mn$ constraints and $n^m$ variables (where $n$ is the number of support points and $m$ the number of marginals). Many existing LP algorithms *cannot* solve it with near-linear complexity, which makes it non-trivial to optimize over the space of (free-support) barycenters.

**(B)** In contrast, there are ample existing studies of the OT problem between $2$ measures, e.g., map estimation, gradient flows, and geodesic interpolation, including the two related works shared by **reviewer hvma** [Liutkus et al., 2019; Hua et al., 2023]. We build our work with the help of all of these.

**(C)** For adversarial robustness, we are able to study the sample complexity gap in the induced binary classification setting following a line of established works (Schmidt et al., 2018, NeurIPS; Carmon et al., 2019, NeurIPS; Dan et al., 2020, ICML; Deng et al., 2021, 2022), which helps us rigorously understand the relationship among *robustness improvement*, *sample complexity*, and *the barycenter position*, as shown in Proposition 3.3 and Theorem 4.3.

Lastly, **(D)** we have demonstrated that our approximation is effective even for large-scale datasets, which essentially encouraged us to present this work to the community.

In short, we recognize the multi-marginal OT setting but focus on 2-marginal OT problems for both theoretical and practical considerations. Multi-marginal OT and its (free-support) barycenter optimization are not yet ready, and are beyond the scope of this paper since we aim at a data-driven method for deep learning robustness. The relaxation into multiple 2-marginal OT problems allows us to exploit existing algorithms and theories from both the OT and robustness literatures. In addition, our approach is effective in experimental results. However, we do believe it is better to add more clarification and discussion of related works. We have added additional discussion in our Section 3.1 as well as in the appendix. More details can be found in our detailed responses to your Q.1 and to Q.6 of reviewer NEt5.

<!-- For example <span style="color:blue">Modification: </span> * In section 3.1 "Here we make an important choice by focusing on the interpolation between $N=2$ distributions rather than a generalized $K$-margin barycenter. <span style="color:blue">The computation of $K$-margin... </span>" * We have put the aforementioned discussion in the appendix. -->
<!-- <span style="color:red">Todo?: a more intuitive toy experiment suggested by the first reviewer? </span> -->

#### 3. There is no formula that says the final optimization problem being solved in the numerical experiments. The question of how to combine the Wasserstein barycenter and the explicit regularization term remains unclear. There is also no complexity or runtime analysis.

Thank you for pointing this out!
We have added more clarification of our numerical solution, in addition to our Fig. (1). Specifically, following the conventional minimax adversarial training scheme, we iteratively (A) update the neural network by minimizing the classification loss with the regularization term Eq. (15), and (B) obtain the augmented data by maximizing Eq. (6). We have included a more detailed discussion and a pseudocode in our response to your question 2, and we have made this clearer in our **Section 3.3** before the "scalability and batch OT" part, as follows.

<span style="color:blue">Modification:</span>
* "... and pretrained ResNets to obtain the embedding space. <span style="color:blue">We follow a standard adversarial training scheme where we iteratively (a) update the predictor $f_\theta$ with the objective $\min_\theta \mathbb{E}_{(x,y) \sim \sum_i P^i + \sum \nu_{adv}} [\,l(f_\theta(x), y)\,] + \lambda\, \text{Reg}(f_\theta)$, where the geodesic regularization $\text{Reg}(f_\theta)$ is computed via Eq. (15); and (b) find and store the augmented data by maximizing Eq. (6). A pseudocode, Algo. (2), is attached in the Appendix.</span>"
* As emphasized in our response to your question 2, we have changed the equation in Fig. (1), 2(b) to make the usage of the regularization explicit.

#### Q.1 Can the paper easily adapt to multi-class settings? How does the complexity of the algorithm scale up with the number of classes?

Thank you for raising this question. The short answer is: our method works effectively, but with some factors to explore! From the empirical perspective, when training large-scale neural networks, we iteratively sample data from two different categories so as to improve the robustness. Since neural networks can be viewed as universal function approximators, our approach smooths the decision boundary by regularizing the Jacobian of the network. On the other hand, from the mathematical perspective, it has been shown that multi-marginal OT does not yet have practical computational solutions [Lin et al., JMLR]. As a result, our choice is a reasonable strategy for approximating the multi-marginal adversarial barycenter while the mathematical machinery is not fully ready.

For the computational complexity, our method scales reasonably with the number of classes. Specifically, we list the computational cost of all the components:
* OT map estimation: we need $1$ map for binary classification; for $K$ classes, we need $K(K-1)/2$ transport maps. With $n$ samples in each batch, we use the Sinkhorn algorithm, $O(n^2)$, and the barycentric projection, $O(1)$. We deem this acceptable.
* Neither data augmentation nor geodesic regularization requires additional OT computation. In fact, most of the computational burden comes from the Jacobian of the neural network.
* Deep learning: in our experiments, the computational bottleneck remains the training of the deep neural network. Training a ResNet50 on Tiny-ImageNet on an A5000 GPU takes approximately 10 hours for 1000 epochs. We used techniques such as multiprocessing to accelerate the process.

In conclusion, our approach is effective for the multi-category deep learning classification task. We approximate the multi-marginal barycenter with an ensemble of two-class OT problems, not only for simplicity but also due to the lack of mathematical tools for the general problem. In implementation, the complexity of our approach scales reasonably with the number of classes.
<span style="color:blue">Modification:</span> We have added the above discussion to our appendix. Moreover, as also suggested by **reviewer NEt5**, we clarify the current limitation in our Section 3.1. For example, "Here we make an important choice by focusing on the interpolation between $K=2$ distributions ... ~~We emphasize here that this is not a simplification but~~ <span style="color:blue">However, to date the efficient computational solutions for multi-marginal OT are still nascent (Lin et al., 2022, JMLR). Solving an MOT with LP would scale as $O(n^K)$, and would be even more costly when we want to solve for (free-support) barycenters to facilitate robustness in deep learning pipelines. Focusing on $K=2$ not only allows us to provide computationally feasible algorithms (Perrot et al.; Flamary et al.; Cuturi et al.),</span> but also to focus on the decision boundary between the different categories..."

#### Q.2 In Proposition 3.3, there is a closed-form $\theta$ in the simplified problem. Is it used in the experiments? What is the algorithm or numerical method you use to solve the minimax problem with regularization?

Yes! We did use Eq. (15) in our code. We use the standard adversarial training pipeline (outer minimization, inner maximization) following the seminal works (cite, cite). In practice, we cast the *(min + reg) minimization of the classification loss* and the *(max) maximization for the worst-case barycenter* into an **alternating optimization** scheme, which is in fact shown as Fig. (1) 2(a) and 2(b), respectively.

* During the *min* step, we use *Adam* stochastic gradient descent to minimize the empirical loss with the geodesic regularization, with respect to both the original data and the augmented dataset: $\min_\theta \mathbb{E}_{(x,y) \sim \sum_i P^i + \sum \nu_{adv}} [\,l(f_\theta(x), y)\,] + \lambda\, \text{Reg}(f_\theta)$, where the geodesic regularization $\text{Reg}(f_\theta)$ is computed via Eq. (15). In addition, our closed-form solution in the embedding space allows us to take advantage of pretrained embeddings for large-scale datasets.
* During the *max* step, we freeze the classifier and find the adversarial barycenter that maximizes Eq. (8). The reason we switch from the barycenter to the Monge map is that this allows us to approach the geodesic optimization by Monte Carlo in a relatively efficient manner: we only need to sample the locations $t \in \mathbb{R}^1$ and evaluate the pushforward samples realized by the (entropic) transport map, as demonstrated in our Eq. (x). When training a deep neural network on a large dataset (for example, ImageNet (64x64)), the sample size is huge, so we sample each batch of data from only two different classes.

<span style="color:blue">Modification:</span>
1. We have changed the equation in Fig. (1), 2(b) from xxx to xxx to make the usage of the regularization explicit.
2. We have added a brief pseudocode to illustrate the practical adversarial training algorithm in the appendix, as follows.

```
Input: dataset {X, Y}, model, loss criterion, number of epochs, data-augmentation flag da
for every epoch:
    if da:  # data augmentation step
        {hat{x}_j, hat{y}_j} <- Algorithm 1, where Eq.(8) is invoked
    for every batch {X, Y} in the data:
        # a dedicated sampler draws each batch from only two classes
        pred = model(X)
        classification_loss = criterion(Y, pred)
        {X^0, Y^0}, {X^1, Y^1} = split {X, Y} by class
        pi_eps = Sinkhorn(X^0, X^1)                   # entropic coupling
        reg_loss = Reg(X^0, Y^0, X^1, Y^1, pi_eps)    # Eq.(15)
        total_loss = classification_loss + lambda * reg_loss
        total_loss.backward()
```

#### Q.3 Table 1, the title says "Ours: DA+Reg". Does it mean the accuracy in the second column is our accuracy?

In Table 1, **ERM+Ours** and **PGD+Ours** are the two use cases of our proposed method. The first means applying **DA (data augmentation)** and **Reg (regularization)** to standard **ERM (empirical risk minimization)**, and the second means applying **DA** and **Reg** to adversarial training with **PGD (projected gradient descent)**. The second column should read **ERM+Reg**, which means enhancing vanilla **ERM** with *only* the geodesic regularization; this is a standalone ablation study to evaluate the effectiveness of our regularization method.

More analysis of this table: (1) Using our proposed data augmentation and regularization, one can significantly boost the robustness, as shown in **ERM+Ours**. (2) Adversarial training is usually considered sufficient for providing robustness; however, we can further improve upon it, as shown in **PGD+Ours**, which achieves the best robust accuracy. (3) It is worth noting that the data augmentation is computationally expensive, as it involves solving a minimax problem. However, our results show that even using only our proposed regularization (without data augmentation) still improves the robustness, as shown in the **ERM+Reg** column.

<span style="color:blue">Modification:</span>

Table 1 (modified). Adversarial robustness (MNIST).

| | ERM | PGD | ERM+Reg | ERM+Ours | PGD+Ours |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| Clean Acc. (%) | 99.09 $\pm$ 0.02 | 98.94 $\pm$ 0.03 | 99.27 $\pm$ 0.06 | **99.39 $\pm$ 0.03** | 99.34 $\pm$ 0.02 |
| Robust Acc. (%) | 31.47 $\pm$ 0.12 | 81.23 $\pm$ 0.17 | 35.66 $\pm$ 0.20 | 81.23 $\pm$ 0.13 | **82.74 $\pm$ 0.16** |

#### Q.4 Why are the mixup methods not compared in the certifiable robustness section?

We have provided an additional experimental result using the Mixup method! In general, vanilla mixup (linear interpolation) is not considered significantly helpful for *certifiable robustness*, since it uniformly selects samples from all categories and interpolates them according to some Beta distribution. We suspect the interpolations between data from the same category are not effective. Moreover, it is usually not included among certifiable robustness baselines, as can be seen in the related works (Cohen et al.).

<span style="color:blue">Modification:</span>

Table 5 (modified). (Part of the) certified accuracy (MNIST).

| $\sigma$ | Models (MNIST) | 2.00 | 2.25 | 2.50 | 2.75 |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 1.00 | Gaussian | 32.5 | 19.7 | 10.9 | 5.8 |
| 1.00 | Vanilla Mixup | 36.7 | 28.2 | 19.3 | 7.6 |
| 1.00 | **Ours** | **45.0** | **37.8** | **30.7** | **23.2** |

#### Q.5 At lines 305-306, $P^1$ and $P_2$ are not consistent.

Thank you for pointing this out! We have now unified all the notation for the data distributions as $P^i$.

#### Q.6 I recommend citing the following recent papers on optimal transport:

Thanks for bringing these works to our attention! We have added these papers to our related work section. Below is how we introduce them in our paper:

<span style="color:blue">Modification:</span> "... for large-scale high-dimensional datasets, and OT between labeled datasets (Alvarez-Melis & Fusi, 2020; Fan & Alvarez-Melis).
<span style="color:blue">Nonparametric gradient flows [1] provide satisfactory data generation with theoretical guarantees. Further, data synthesis on the feature-Gaussian manifold [2], which can be realized by the maximum mean discrepancy (MMD) gradient flow [3], has proven effective for transfer learning tasks.</span>"

[1] Liutkus, A., Simsekli, U., Majewski, S., Durmus, A., & Stöter, F. R. (2019). Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning (pp. 4104-4113). PMLR.
[2] Hua, X., Nguyen, T., Le, T., Blanchet, J., & Nguyen, V. A. (2023). Dynamic flows on curved space generated by labeled data. arXiv preprint arXiv:2302.00061.
[3] Mroueh, Y., & Nguyen, T. (2021). On the convergence of gradient descent in GANs: MMD GAN as a gradient flow. In International Conference on Artificial Intelligence and Statistics (pp. 1720-1728). PMLR.
