Hoang Anh Just
# 2nd-round Responses

# Reviewer 4TXZ (2nd Reviewer)

## Adding to the choice of OT over other dissimilarity measures

*Most of this should have been covered in the previous response, but the language here is more accurate and concise.*

OT is a celebrated choice for measuring the discrepancy between probability distributions [1]. Compared to other notable dissimilarity measures such as the Kullback-Leibler divergence [2], the Total Variation norm [3], or Maximum Mean Discrepancies (MMD) [4], the mathematically well-defined OT distance has advantageous analytical properties (it is a valid metric; it is compatible with sparse-support distributions; it is stable with respect to deformations of the distributions' supports), and it is also computationally tractable and computable from finite samples [5, 6, 7].

[1] Villani, Cédric (2009). Optimal Transport, Old and New. Grundlehren der mathematischen Wissenschaften. Vol. 338. Springer-Verlag Berlin Heidelberg. p. 10. doi:10.1007/978-3-540-71050-9. ISBN 978-3-540-71049-3.
[2] Kullback, Solomon, and Richard A. Leibler. "On information and sufficiency." The Annals of Mathematical Statistics 22.1 (1951): 79-86.
[3] Rudin, Leonid I., Stanley Osher, and Emad Fatemi. "Nonlinear total variation based noise removal algorithms." Physica D: Nonlinear Phenomena 60.1-4 (1992): 259-268.
[4] Székely, Gábor J., Maria L. Rizzo, et al. "Hierarchical clustering via joint between-within distances: extending Ward's minimum variance method." Journal of Classification 22.2 (2005): 151-184.
[5] Alvarez-Melis, David, and Nicolò Fusi. "Geometric dataset distances via optimal transport." Advances in Neural Information Processing Systems 33 (2020): 21428-21439.
[6] Feydy, Jean, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. "Interpolating between optimal transport and MMD using Sinkhorn divergences." The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019: 2681-2690.
[7] Genevay, Aude, Gabriel Peyré, and Marco Cuturi. "Learning generative models with Sinkhorn divergences." International Conference on Artificial Intelligence and Statistics. PMLR, 2018: 1608-1617.

## Follow-up on the responses with our new updates

*\courtesy words* *\summary of changes*

We would like to express again our appreciation of Reviewer 4TXZ's insights and comments. We have addressed the questions and would like to provide additional explanations, which are incorporated in the newest version of our paper. In particular, we would like to highlight that we have revised our theoretical contribution section with a clearer statement and proof, which should now be more intuitive for the reader.

## Theorem section rewritten

We have fundamentally rewritten Theorem 4 and reorganized Subsection 2.2, which contains the theorem. We have also thoroughly revised the notation scheme to reduce the number of notations to a strict minimum, in an effort to ease the comprehension of the proposed theorem and ensure maximum consistency throughout the paper. We take this chance to invite the reviewer to look at our comprehensively revised manuscript with essential improvements in the presentation of the theorem. Specifically, we have deferred the rigorous definitions and detailed explanations of assumptions to Appendix A.1 and kept only the introductions necessary for presenting the conceptual results of the theorem. Now the theorem is self-explanatory, and the subsection contains all the information necessary for interpreting the idea. With mild assumptions on the Lipschitzness of the model $f$ (labeling function) and the loss function, and boundedness of the output of the labeling functions, the theorem extends the celebrated Kantorovich-Rubinstein duality (KR-duality) to include the hybrid distance metric $\mathcal{C}$ between feature-label pairs.
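In schematic form (notation ours and hedged; the precise statement with its exact constants is Eq. (6) of the manuscript), the bound has the shape

```latex
% Schematic shape of the bound (symbols ours, not the manuscript's):
% expected validation loss <= training loss
%   + Wasserstein distance under the hybrid metric C
%   + a constant from the Lipschitz assumptions.
\mathbb{E}_{(x,y)\sim \mu_v}\!\left[\ell\big(f(x), y\big)\right]
\;\le\;
\mathbb{E}_{(x,y)\sim \mu_t}\!\left[\ell\big(f(x), y\big)\right]
\;+\; k \, \mathcal{W}_{\mathcal{C}}\!\left(\mu_t, \mu_v\right)
\;+\; \varepsilon
```

where $\mu_t$ and $\mu_v$ stand for the training and validation distributions and $k, \varepsilon$ absorb the Lipschitz constants.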
Even though the hybrid distance metric hierarchically defined on features and labels has shown promising results in practical applications, the much-desired theoretical justification has long been missing. We note that the proposed theorem fills this gap by establishing a theoretical connection between the validation performance and the Wasserstein distance under the hybrid distance metric. With the hybrid distance metric formulated in Eq. (5), the main theoretical result is presented in Eq. (6). It states that the expected validation loss for the given model is upper-bounded by its training loss, plus the Wasserstein distance between the training and validation distributions under the hybrid distance metric $\mathcal{C}$, plus a constant depending on the Lipschitz conditions. In practice, when a model with large enough capacity is used, the training loss (the first term on the RHS) is small, and the upper bound on the validation loss is largely determined by the Wasserstein distance between the distributions under the metric $\mathcal{C}$. This result also offers a direct explanation for the correlations between validation performance and Wasserstein distance depicted in Fig. 1. Rigorous definitions of all the assumptions, conditions, and notations, together with their detailed explanations, can be found in Appendix A.1, along with a complete proof of the theorem and discussions of the ideas involved.

# We hope this helps to clarify the confusion and addresses your questions. Thank you for pointing out the typos; we made sure to fix all of them. We have updated the theory in the revised version of the paper for better clarity.

# Reviewer czdG (3rd Reviewer)

[TO Feiyang: Some key points that we'd like to mention in the response include but are not limited to:

- Directly address the question and explain that, because the labeling function is binary in nature, it cannot satisfy deterministic Lipschitzness, but probabilistic Lipschitzness is reasonable.
- Say that we actually find the probabilistic assumption is not needed, because this assumption is designed to deal with binary-valued functions, but $f$ outputs a continuous probability value.
- Mention that we rewrote Section 2.2 (focus the main text there on the storyline and provide more intuition for the theoretical result; in the appendix, explicate the assumptions and give a detailed understanding of them).

Thanks.]

---

## Response to the reviewer's question

*\courtesy words*

Thank you for the explanation; we now understand your concern, which we address here.

*\summary of response* *\summary of changes*

(tl;dr: )

**What is the labeling function?**

The labeling function $y=f(x)$ is a function that maps a feature $x$ to a real value between 0 and 1, i.e., $f: \mathcal{X}\rightarrow [0,1]$. In the theorem, we consider three labeling functions: (1) $f$, representing a model that associates an output probability with an input feature $x$; (2) $f_v$, representing a function that maps an input feature to the corresponding label in the validation set; and (3) $f_t$, representing a function that maps an input feature to the corresponding label in the training set. $f_t$ and $f_v$ are ground-truth labeling functions that generate the true label for each sample, whose explicit expressions are rarely known in practice, while the model $f$ is trained on the given feature-label pairs to minimize the loss. While the output space of $f_v$ and $f_t$ is still $[0,1]$, they actually map the input features to a binary label $\{0,1\}$. In the example kindly formulated by the reviewer, the labeling function corresponding to $P_\text{true}$ is a binary function, i.e.,

$$f(x)=\begin{cases} 0, & \text{if } x<0, \\ 1, & \text{otherwise.} \end{cases}$$

**Lipschitz Assumptions.**

The standard deterministic Lipschitz condition is stated for real-valued functions: for all $x,y$, $|f(x)-f(y)|\leq \epsilon \|x-y\|$ for some constant $\epsilon$. This condition can readily be applied to the first labeling function, (1), whose output is a *continuous* variable. However, when we consider labeling functions (2) and (3), which are in fact deterministic mappings to $\{0,1\}$, the Lipschitz constant $\epsilon$ would impose a gap of $1/\epsilon$ between points of different classes. ***Thus, it requires the domain to be disconnected or the labeling function to be constant, rendering it rarely practical for applications of this kind.*** Probabilistic Lipschitzness generalizes this overly stringent condition, extending it to fit practical scenarios. By requiring the Lipschitz condition to hold only with high probability, this practical relaxation encapsulates the concepts of "clusters" and "low density" in the sense that it explicitly penalizes placing high probability density in areas with heterogeneous labels. For a detailed formulation and its discussion, please refer to the original work [1]. For deterministic functions with binary outputs, the standard $\epsilon$-Lipschitz condition (which imposes a constant gap between different classes) may not be satisfied. The generalized probabilistic condition requires the Lipschitzness to hold with some high probability (rather than everywhere), allowing for a non-constant gap between different classes. The labeling function provided by the reviewer satisfies the Probabilistic Lipschitzness condition, with the parameter depending on how much probability density $P_\text{true}$ concentrates near the boundary $x=0$ where the labels flip. These probabilistic conditions have been widely used [1, 2, 3]; however, we find they turn out to be **not necessary for the development of the proposed theorem**.
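As a toy numeric illustration of the Lipschitz discussion above (our own sketch, nothing here beyond the step function is from the paper or the review): the binary labeling function violates any fixed Lipschitz constant near the boundary, yet the $\epsilon$-Lipschitz condition holds with high probability when $P_\text{true}$ places little mass near $x=0$.

```python
import numpy as np

# The reviewer's binary labeling function: f(x) = 0 if x < 0, else 1.
def f(x):
    return np.where(np.asarray(x) < 0, 0.0, 1.0)

# Deterministic Lipschitzness fails: a pair straddling x = 0 makes the
# ratio |f(x) - f(y)| / |x - y| arbitrarily large as the pair tightens.
x, y = -1e-6, 1e-6
ratio = abs(f(x) - f(y)) / abs(x - y)  # ~5e5, and unbounded in the limit

# Probabilistic Lipschitzness: taking P_true = N(0, 1), the condition
# |f(x) - f(y)| <= eps * |x - y| is violated only by pairs of opposite
# sign closer than 1/eps, an event of small probability.
rng = np.random.default_rng(0)
xs, ys = rng.standard_normal(200_000), rng.standard_normal(200_000)
eps = 10.0
violation_rate = np.mean(np.abs(f(xs) - f(ys)) > eps * np.abs(xs - ys))
# violation_rate is well under 1% here, and it shrinks further whenever
# P_true concentrates less density near the boundary x = 0.
```

The violation rate is exactly the "probabilistic density near the boundary" quantity mentioned above: moving mass away from $x=0$ drives it to zero.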
The crucial difference is that the labeling function $f$ considered in the theorem is a mapping to the range $[0,1]$, i.e., its output is continuous rather than binary. Thus, the standard Lipschitz condition can be assumed. *If one were to model the labeling function with binary outputs, then the probabilistic Lipschitz condition would become necessary.* The Probabilistic Cross-Lipschitzness stated in Definition 3 extends the concept to function pairs: it bounds the probability of finding pairs of training and validation instances with similar features but different binary labels. This condition is still required, as the ground-truth labeling functions $f_t$ and $f_v$ are considered to have binary outputs.

**Evaluation of Eq. 6**

The left-hand side (LHS) of Eq. (6) is the expectation of the validation loss. Given samples from the validation distribution, we can take the sample average to approximate the expectation; by the uniform law of large numbers, the sample average converges to the expectation. For example, one may take the validation loss to be the cross-entropy loss, $-\sum_y q_y \log(p_y)$, where $q_y$ is the ground-truth one-hot encoding of a label and $p_y$ is the probability output by the classifier for each class.

[1] Urner, Ruth, Shai Shalev-Shwartz, and Shai Ben-David. "Access to unlabeled data can speed up prediction time." ICML, 2011.
[2] Ben-David, Shai, and Ruth Urner. "Domain adaptation: can quantity compensate for quality?" Annals of Mathematics and Artificial Intelligence 70.3 (2014): 185-202.
[3] Courty, Nicolas, et al. "Joint distribution optimal transportation for domain adaptation." Advances in Neural Information Processing Systems 30 (2017).

---

## Theorem section rewritten

(tl;dr: Motivated by your comments and questions, we have revised the theoretical contribution section to enhance the clarity of the theorem's statement and proof.
In addition, we provide explanations and simple examples to give the reader a stronger intuitive understanding. We appreciate your insights, as they are valuable for improving our paper.)

We have fundamentally rewritten Theorem 4 and reorganized Subsection 2.2, which contains the theorem. We have also thoroughly revised the notation scheme to reduce the number of notations to a strict minimum, in an effort to ease the comprehension of the proposed theorem and ensure maximum consistency throughout the paper. We take this chance to invite the reviewer to look at our **comprehensively revised manuscript with essential improvements in the presentation of the theorem.** Specifically, we have deferred the rigorous definitions and detailed explanations of assumptions to Appendix A.1 and kept only the introductions necessary for presenting the conceptual results of the theorem. Now the theorem is self-explanatory, and the subsection contains all the information necessary for interpreting the idea. With mild assumptions on the Lipschitzness of the model $f$ (labeling function) and the loss function, and boundedness of the output of the labeling functions, the theorem extends the celebrated Kantorovich-Rubinstein duality (KR-duality) to include the hybrid distance metric $\mathcal{C}$ between feature-label pairs.

Even though the hybrid distance metric hierarchically defined on features and labels has shown promising results in practical applications, the much-desired theoretical justification has long been missing. *We note that the proposed theorem fills this gap by establishing a theoretical connection between the validation performance and the Wasserstein distance under the hybrid distance metric.* With the hybrid distance metric formulated in Eq. (5), the main theoretical result is presented in Eq. (6).
It states that the expected validation loss for the given model is upper-bounded by its training loss, plus the Wasserstein distance between the training and validation distributions under the hybrid distance metric $\mathcal{C}$, plus a constant depending on the Lipschitz conditions. In practice, when a model with large enough capacity is used, the training loss (the first term on the RHS) is small, and the upper bound on the validation loss is largely determined by the Wasserstein distance between the distributions under the metric $\mathcal{C}$. This result also offers a direct explanation for the correlations between validation performance and Wasserstein distance depicted in Fig. 1. Rigorous definitions of all the assumptions, conditions, and notations, together with their detailed explanations, can be found in Appendix A.1, along with a complete proof of the theorem and discussions of the ideas involved.

# We hope this helps to clarify the confusion and addresses your questions. Thank you for pointing out the typos; we made sure to fix all of them. We have updated the theory in the revised version of the paper for better clarity.

# Rebuttal for LAVA

# Reviewer RGDd

### Comment C1. [theoretical contribution]

"*...heavily built upon existing methods... does not contribute to the effective solution... or the scalability/efficiency.*"

### Explanation E1:

#### (tl;dr: offering novel theoretical insights into computational aspects associated with downstream applications, e.g., approximation errors from entropy penalization do not affect the accuracy of rankings of datapoint values)

We agree that this work does not expand the frontier of efficient solutions to the optimal transport problem. We would like to note, however, that this work does provide **novel theoretical results** on computational aspects more closely associated with downstream applications.
For example, the entropy-penalized OT solutions are subject to the approximation error introduced by the penalty terms. However, in this work, we show that in the context of evaluating and ranking datapoints, the differences between the values of the datapoints can be **precisely recovered** from the approximated solutions, such that the **order of values** for the datapoints is actually **preserved**. We believe this offers a new insight for a variety of applications: there need not be a tradeoff between efficient solution (scalability) and solution quality (accuracy).

---

### C2. [reliance on target examples]

"*The performance of the method highly relies on target examples (quantity and quality), without which it cannot properly function.*"

### E2:

#### (tl;dr: it is cheap to obtain high-quality samples at the scale needed by the proposed method)

This method is remarkably **data-efficient** with respect to clean validation examples---in our experiments, a few thousand samples are generally enough for the method to perform strongly. To illustrate this point, we compare with the best-performing baseline, TracIn-Self, on mislabeled-data detection. Given 10,000 validation samples, TracIn-Self achieves a detection rate of 80% at a data-removal budget of 25,000; by contrast, our method achieves the same detection rate at the same budget using only 2,000 validation samples. It is fair to say that the reliance on validation samples is much weaker than in existing data-valuation methods. Also, it is generally considered **not difficult to find high-quality samples at this scale**---crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) can help prepare such samples economically and conveniently (e.g., acquiring a high-quality validation set of 1,000 samples costs only \$12 on MTurk).

---

### C3. [upper bounds given by the OT distance]

"*...OT distance upper bounds the performance... it is unknown how or whether this bound is approached during the training...*"

### E3.

Thank you for mentioning this interesting point; we acknowledge it is worth noting. Further investigations are underway to explore these theoretical relationships.

---

### C4. [type of experiments]

"*...confined to the detection of abnormal data... other relevant valuation scenarios... training the model on the subsets with high utility...*"

### E4.

#### (tl;dr: additional experiment results are provided in this regard)

Thank you for pointing out this interesting direction. We agree these experiments are very helpful in demonstrating the capabilities of the proposed approach. We have conducted new experiments whose results provide new findings. By selecting datapoints with high values and removing datapoints with poor values, we construct subsets of the original dataset with high utility.

In an example on CIFAR-10, we demonstrate that performance is well maintained even with smaller subsets of the original dataset. Remarkably, even after reducing a clean training set (25,000 samples) by 15% based on our method's valuation, the performance stays relatively high while outperforming the other valuation baselines, as seen in Fig. 12 (in the updated Appendix).

<hr style="border:1px dashed gray">

### Question Q1. [why performance is good without model]

"*why the performance of the model-agnostic approach is even higher ... than model-based approaches...*"

### Answer A1.

#### (tl;dr: model training involves high stochasticity and tends to tolerate bad data, thus not necessarily leading to better valuation)

Modern machine learning approaches introduce significant stochasticity during training and often exhibit high variance in model performance [1, 2].
Model-based data-valuation approaches crucially rely on the performance scores of models trained on different subsets to determine the value of data. Hence, model-based data-valuation approaches are naturally susceptible to noise, and the accuracy of the valuation results is often affected [2]. Also, modern machine learning models (e.g., overparameterized DNNs) are known to tolerate problematic data in the training dataset [3]. In other words, the model performance might not change significantly when a problematic datapoint is included. Hence, it is actually difficult to identify problematic datapoints through the inspection of model performance scores, as is done in existing model-based data-valuation approaches. Overall, we think the better performance of our model-agnostic approach is attributable to two factors: it does not suffer from learning stochasticity, and it bypasses the "noise-tolerant" modeling step.

[1] Hyland, Stephanie L., and Shruti Tople. "An Empirical Study on the Intrinsic Privacy of SGD." arXiv preprint arXiv:1912.02919 (2019).
[2] Wang, Tianhao, and Ruoxi Jia. "Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity." arXiv preprint arXiv:2205.15466 (2022).
[3] Rolnick, David, et al. "Deep learning is robust to massive label noise." arXiv preprint arXiv:1705.10694 (2017).

---

### Q2. [rebalance biased datasets]

"*...examine bias in the given data... help with rebalancing unbalanced datasets?*"

### A2.

#### (tl;dr: Yes! Additional experiment results are provided in this regard.)

Thank you for suggesting a promising direction. To rebalance an unbalanced dataset, we calculate the value of each point and retain only the points with the highest values. We conducted an experiment on CIFAR-10 in which the Frog class contained twice as many samples (5,000) as each of the remaining classes (2,500).
After removing over 1,000 samples, we are able to rebalance the dataset and improve the model accuracy, as shown in Fig. 12 (in the updated Appendix). At the same time, the other valuation methods were unable to steadily increase the model accuracy and quickly degraded the model performance, which in turn shows the stronger effectiveness of our method.

---

### Q3. [scale that can apply]

"*...how large is the dataset you can apply the method to... is there a known limitation...*"

### A3.

#### (tl;dr: We implemented the method on ImageNet with 100k samples. The practical limitation is GPU memory.)

The proposed data-valuation method is based on Sinkhorn's algorithm, which solves entropy-penalized OT problems exceptionally efficiently. With state-of-the-art implementations and GPU acceleration for large-scale problems, near-linear time and memory overhead can be achieved in practice. In the paper, we reported results on the full CIFAR-10 (50,000 samples). During the rebuttal period, we successfully applied our method to ImageNet-100 (100,000 samples). To showcase the efficiency of our method, we report its computation time on ImageNet-100 with a Tesla T4 GPU (16 GB), in comparison with other baselines:

| LAVA | KNN-SV | TracIn-Clean | TracIn-Self |
| ----------- | ----------- | ------------ | ----------- |
| 1 hr 54 min | 4 hr 21 min | 7 hr 50 min | 7 hr 51 min |

We note that the limiting factor for applying the method to even larger problems is actually the GPU memory requirement, as GPU memory is often not so scalable.

---

### Q4. [limitation of the method]

"*Is there a case... does not work... known limitation... weak at/or weaker than the model-based methods...*"

### A4.

#### (tl;dr: the method may not work well for some adversarial attacks designed to match the features and labels of normal data, as discussed in the limitations at the end of the paper.)
As noted in the question, the proposed method is based on a distributional distance over features and labels; hence, if the abnormal data share highly similar features and labels with the good data (e.g., clean-label backdoor attacks [4, 5]), we suspect they may escape detection by our method. On the other hand, these adversarially crafted attacks are designed not to affect model accuracy on clean validation data. Hence, we suspect that existing model-based data-valuation approaches are also susceptible to these attacks, as they all assign values based on the validation-performance change induced by removing a training point. An in-depth study of the security aspects of data-valuation approaches is an interesting and important direction for future work.

[4] Souri, Hossein, et al. "Sleeper Agent: Scalable hidden trigger backdoors for neural networks trained from scratch." arXiv preprint arXiv:2106.08970 (2021).
[5] Zeng, Yi, et al. "NARCISSUS: A Practical Clean-Label Backdoor Attack with Limited Information." arXiv preprint arXiv:2204.05255 (2022).

### Q5. [explainability of valuation results]

"*Does this method offer explainability... how should a practitioner interpret...*"

### A5.

#### (tl;dr: the gradients predict how adding/removing a datapoint will change the validation performance)

The valuation is given by the calibrated gradients of the OT distance between the training and validation datasets with respect to the probability mass of each datapoint. The probability mass represents the strength of the datapoint's presence in the dataset: removing the datapoint reduces its probability mass to zero, while duplicating the datapoint doubles its probability mass.

With Theorem 4, as well as the empirical results connecting the OT distance to upper bounds on the validation performance of downstream models, the valuation can be interpreted as a prediction of how adding or removing the datapoint will change the validation performance.
The gradients may be positive or negative, representing different contributions (negative and positive, respectively) of the datapoint in the training dataset to the target task (validation dataset). For example, adding datapoints with positive gradients to the training dataset will increase the OT distance to the target dataset, lowering the predicted performance on the target dataset for downstream models trained on this dataset.

<hr style="border:1px dashed gray">

# Reviewer 4TXZ

### Comment C1. [theoretical contribution of the paper]

"...*the idea of estimating a certain divergence between two datasets is not novel... widely used in measuring the "transferability"... theoretical contribution is limited*..."

### Explanation E1.

#### (tl;dr: The main contribution of the paper is a novel approach for valuing individual datapoints for a target task, without requiring information about downstream models or using a proxy model. We did not invent new measures of discrepancy between dataset distributions; we do, however, provide novel theoretical insights associated with downstream applications.)

Most existing methods for data valuation (measuring the value of individual datapoints for a target task) are model-based, an approach that generally involves massive model retraining to examine the performance of models trained on different subsets. For example, the celebrated Shapley-value-based techniques often require retraining the model on an exponential number of subsets of the given dataset [1]. This computational intractability makes many such methods demanding in computational power and especially poor in scalability.
Also, the stochasticity inevitably introduced by model training causes high variance in performance [2, 3], affecting the accuracy of the valuation results as well.

In this work, leveraging existing theories in optimal transport and transfer learning, as you kindly noted, we proposed a model-agnostic approach that measures the value of individual datapoints for a target task based on distributional distances. Specifically, we found it is possible to extract the gradients of the OT distance with respect to the strength of presence of each individual datapoint directly from the solutions to the OT problem, completely free of additional computation. With Theorem 4, as well as the empirical results connecting the OT distance to upper bounds on the validation performance of downstream models, the valuation can be interpreted as a prediction of how adding or removing a datapoint will change the validation performance. This leads to a valuation approach that is **exceptionally efficient**, orders of magnitude faster than other valuation methods, and **applicable at a scale far beyond that of existing methods**. Moreover, the valuation results are also of **good quality**---demonstrating strong effectiveness in a variety of challenging tasks, all in a small fraction of the computing time of its predecessors.

Besides, this work offers **novel theoretical insights** on computational aspects more closely associated with downstream applications. For example, the entropy-penalized OT solutions are subject to the approximation error introduced by the penalty terms; however, in this work, we show that in the context of evaluating and ranking datapoints, the differences between the values of the datapoints can be precisely recovered from the approximated solutions, such that the order of values for the datapoints is actually preserved.
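As a minimal sketch of the mechanism (a toy NumPy implementation of Sinkhorn's iterations with our own variable names and data; not the paper's code), the dual potentials of the entropy-regularized problem act as the per-point sensitivities whose ranking the valuation relies on:

```python
import numpy as np

def sinkhorn_duals(a, b, M, reg=0.05, iters=2000):
    """Entropy-regularized OT via Sinkhorn scaling; returns dual potentials."""
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    # Dual potentials (defined up to an additive constant); by the envelope
    # theorem they behave as gradients of the OT cost w.r.t. the masses a, b.
    return reg * np.log(u), reg * np.log(v)

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 2))   # toy "training" features
valid = rng.normal(size=(40, 2))   # toy "validation" features
M = ((train[:, None, :] - valid[None, :, :]) ** 2).sum(-1)
M /= M.max()                       # normalize costs for numerical stability
a, b = np.full(50, 1 / 50), np.full(40, 1 / 40)

f_pot, g_pot = sinkhorn_duals(a, b, M)
# Ranking training points by dual potential orders them by how much extra
# mass on each point would increase the OT distance to the validation set.
ranking = np.argsort(-f_pot)
```

The rank-preservation claim above is precisely about this `ranking`: the entropy penalty perturbs the potentials' values, but not the order in which it sorts the training points.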
We believe this offers new insights for a variety of applications: there may not necessarily be a tradeoff between efficient solution (scalability) and solution quality (accuracy).<br />
<u>In summary, our work contributes to solving a question that has puzzled the data valuation field for years, namely, how to estimate the value of data in a meaningful and efficient manner. Towards that end, we couple insights from statistical learning (to bound the validation performance with a distributional distance) and optimization theory (to calculate gradients of OT through duals). While distributional distances date back more than half a century, this is the first time that the idea of a distribution distance is leveraged in the context of data valuation, and importantly, it significantly boosts the efficiency and utility of data valuation applications. In addition, our theoretical result that the dual solutions to entropy-regularized OT preserve the rank of the dual solutions to the original OT is novel and actually very surprising: in contrast to the traditional wisdom that there is a direct tradeoff between efficiency and accuracy, our result shows that the efficiency of the entropy regularizer does not trade off against accuracy when only the rank of the dual solutions is of interest, as in the data valuation context.</u><br />
[1]. Jia, Ruoxi, et al. "Towards efficient data valuation based on the Shapley value." The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.
[2]. Ilyas, Andrew, et al. "Datamodels: Predicting predictions from training data." arXiv preprint arXiv:2202.00622 (2022).
[3]. Wang, Tianhao, and Ruoxi Jia. "Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity." arXiv preprint arXiv:2205.15466 (2022).

<hr style="border:1px dashed gray">

### Q1. [discrepancy measure, existing research, and Theorem 4]

"*Theorem 4 is not presented clearly...
such a result can be easily obtained using the Kantorovich-Rubinstein duality of Wasserstein distance... does not provide anything new... different type of divergences, e.g., KL divergence... does not convince... Wasserstein distance is the correct metric.*"

### A1.

#### (tl;dr: 1. The Wasserstein distance is mathematically well-defined and advantageous over KL-divergence for its analytical properties in this application; 2. The authors agree that Theorem 4 leverages the Kantorovich-Rubinstein duality (KR-duality) of Wasserstein distance, as in the reference kindly provided by the reviewer; however, the proposed theorem is anything but a trivial application. In contrast to the 1-Lipschitz condition assumed in the KR-duality, the theorem is built for a hierarchically defined hybrid cost function, which has shown strong potential in practical applications. Employing a series of mathematical constructions (on the basis of Def. 1 and 2), this novel theoretical progress should provide guidance for a broader field of research.)

**Why Wasserstein distance?** The Wasserstein distance is mathematically well-defined and provides desirable analytical properties in many respects. For example, KL-divergence generally requires the discrete probability distributions to be defined on the same probability space to calculate the relative entropy. Yet, in this work, when measuring the distributional discrepancy between the datasets, we treat each dataset directly as the empirical distribution of its datapoints, rather than approximating it with additional assumptions (e.g., Gaussian). If the datapoints are images, then each non-identical image is located at a different point in the high-dimensional space. KL-divergence between such empirical distributions is ill-defined and may not be a proper metric, while the Wasserstein distance provides better support for such analysis. Moreover, the Wasserstein distance is also favorable from practical considerations.
It is generally considered insensitive to small variations in distributions while remaining sensitive to large deviations [1], making it particularly suitable for the data-valuation scenario. More specifically, the OT distance may ignore normal variations within clean data while being sensitive to abnormal data with large distributional deviations, which we empirically verified in this work.<br />
**Novelty of Theorem 4.** The authors are appreciative of the reviewer for kindly providing the reference on the Kantorovich-Rubinstein duality (KR-duality) of Wasserstein distance and agree that it is one of the key pillars of the proof (as already noted in the proof, line 561). However, we would like to clarify that **Theorem 4 cannot be trivially accomplished by naively applying the KR-duality result**. Firstly, the KR-duality holds when $h$ (defined in https://courses.cs.washington.edu/courses/cse599i/20au/resources/L12_duality.pdf) is 1-Lipschitz, as can be seen from the first inequality on the second page. In our case, $h$ is a composite function $(x,y)\rightarrow l(y,f(x))$. Hence, we need to understand the Lipschitz property of this function to make use of the KR-duality result. **[We hope this addresses the confusion about why the Lipschitz continuity assumptions in Definitions 1 and 2 are needed.]** Let $y_1=f^*(x_1)$ and $y_2=f^*(x_2)$. Let us assume that $f^*$ and $f$ are $L^*$- and $L$-Lipschitz, respectively, for simplicity (the probabilistic Lipschitz assumption used in our paper is a relaxation of this assumption). Let $l$ be $k$-Lipschitz. Leveraging the Lipschitz assumptions in Definitions 1 and 2 and the triangle inequality, we get
$$\begin{align} |l(y_1,f(x_1))-l(y_2,f(x_2))| &= |l(f^*(x_1),f(x_1))-l(f^*(x_2),f(x_2))| \\ & \leq |l(f^*(x_1),f(x_1)) - l(f^*(x_2),f(x_1))| + |l(f^*(x_2),f(x_1)) - l(f^*(x_2),f(x_2))| \\ & \leq kL^*d(x_1,x_2) + kLd(x_1,x_2) = k(L^*+L)d(x_1,x_2).
\end{align}$$
**Simply applying the KR-duality gives us an upper bound of the validation performance which is characterized by a Wasserstein distance grounded on the cost function $d(\cdot, \cdot)$, but this is far from what we want to prove**. In fact, the Wasserstein distance we used is grounded on a special **hybrid cost**, which is
$$ d(x,x') + W_d(P(X|y), P(X'|y')). $$
The Wasserstein distance with this particular hybrid cost has shown great empirical performance in detecting abnormal features and labels in our experiments. In addition, this Wasserstein distance has also been shown in past work [2] to characterize the domain and task difference. However, all of these observations remain at an empirical level. This paper provides the first theoretical justification for the Wasserstein distance with the hybrid cost by showing that it characterizes the upper bound of the validation performance. Our proof might be of independent interest to a broader community involving meta learning and domain adaptation, as it can be readily adapted to justify why this Wasserstein distance is a good proxy for predicting transfer learning and meta learning performance. **The novelty of the proof lies in a series of mathematical constructions to make the Wasserstein distance grounded on the hybrid cost appear in the upper bound, thus providing this Wasserstein distance a formal justification**.<br />
A final note is that the reviewer might be curious why we did not directly employ the Wasserstein distance grounded on a metric $d$ in the feature space. This is because such a Wasserstein distance ignores the label part of the training dataset and would be incapable of detecting labeling errors, a common type of bad data in practice.<br />
Overall, proving an upper bound depending on the Wasserstein distance grounded on the hybrid cost cannot be accomplished simply by using the KR-duality theorem.
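As a quick numerical sanity check, the composition bound derived above can be verified on a toy 1-D instance. The loss and the two Lipschitz functions below are hypothetical illustrative choices, not quantities from the paper:

```python
import numpy as np

# Hypothetical 1-D instance of the composition bound (illustrative choices):
# l = absolute-error loss (k = 1), f*(x) = 2x (L* = 2), f(x) = 0.5x (L = 0.5),
# and d(x1, x2) = |x1 - x2|.
k, L_star, L = 1.0, 2.0, 0.5
f_star = lambda x: 2.0 * x
f = lambda x: 0.5 * x
loss = lambda y, y_hat: np.abs(y - y_hat)   # 1-Lipschitz in each argument

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-10, 10, size=(2, 1000))
lhs = np.abs(loss(f_star(x1), f(x1)) - loss(f_star(x2), f(x2)))
rhs = k * (L_star + L) * np.abs(x1 - x2)
assert np.all(lhs <= rhs + 1e-9)            # the bound holds on every pair
```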
Moreover, the Wasserstein distance is advantageous over other distributional distances in terms of many analytical properties. We hope that this long answer can address your concern. Please let us know if you have further comments.<br />
[1]. Villani, Cédric. Topics in Optimal Transportation. Vol. 58. American Mathematical Soc., 2021.
[2]. Alvarez-Melis, David, and Nicolo Fusi. "Geometric dataset distances via optimal transport." Advances in Neural Information Processing Systems 33 (2020): 21428-21439.

---

### Q2. [benefit on computation]

"...*the choice of Wasserstein distance... can be estimated easily from empirical distributions...this benefit on computation is not evident*..."

### A2.

#### (tl;dr: Existing data-valuation methods require massive retraining of the model on subsets, scaling poorly and being computationally demanding even on small datasets. The proposed method is orders of magnitude faster and applicable to a scale far beyond that of existing methods; also, the entropy-penalized OT can be solved efficiently while still providing precise results for the proposed approach.)

As mentioned above, most existing data-valuation methods (measuring the value of individual datapoints for a target task) are model-based, and this approach generally involves massive model retraining to examine the performance of models trained on different subsets [2]. For example, the celebrated Shapley-value-based approaches often require retraining the model on an exponential number of subsets of the given dataset [1]. This computational intractability makes many such methods demanding in computational power and especially poor in scalability.
Also, the inevitable stochasticity introduced by model training causes high variance in performance, affecting the accuracy of valuation results as well [2, 3].<br />
In this work, leveraging existing theories in Optimal Transport and transfer learning, we proposed a model-agnostic approach to measure the value of individual datapoints to a target task based on distributional distances. Specifically, we found it is possible to extract the gradients of the OT distance w.r.t. the strength of existence of each individual datapoint directly from the solutions to the OT problem, completely free from additional computations. With Theorem 4 as well as empirical results connecting the OT distance to the upper bounds of the validation performance for downstream models, the valuation can be interpreted as a prediction of how adding/removing the datapoint will change the validation performance. This leads to a valuation approach that is **exceptionally efficient**, orders of magnitude faster than other valuation methods, and **applicable to a scale far beyond that of existing methods**. Moreover, the valuation results are also of good quality, demonstrating strong effectiveness in a variety of challenging tasks, all in a small fraction of the computing time of its predecessors.<br />
In the experiments, we report runtime comparisons only for 2,000 test samples, as this is the scale existing methods can solve within a reasonable time (a day). It showcases the computing-efficiency advantage that the proposed approach enjoys over other methods. In this work, we employ solution methods based on Sinkhorn's algorithm, which solves entropy-penalized OT problems exceptionally efficiently. With state-of-the-art implementations and GPU acceleration for large-scale problems, near-linear time and memory overhead may be achieved in practice.
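For illustration, the Sinkhorn iterations behind this can be sketched in a few lines of NumPy. This is a toy example with uniform masses and a hand-picked regularization strength, not our actual GeomLoss-based implementation:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, n_iter=1000):
    """Entropy-penalized OT between histograms a, b with cost matrix C.
    Returns the transport plan and the dual potentials (f, g)."""
    K = np.exp(-C / eps)                      # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                       # row scaling
        v = b / (K.T @ u)                     # column scaling
    P = u[:, None] * K * v[None, :]           # primal transport plan
    f, g = eps * np.log(u), eps * np.log(v)   # dual potentials
    return P, f, g

# Toy example: two small point clouds with uniform probability masses.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(4, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)    # squared-distance cost
a, b = np.full(5, 1 / 5), np.full(4, 1 / 4)
P, f, g = sinkhorn(a, b, C)
assert np.allclose(P.sum(axis=1), a) and np.allclose(P.sum(axis=0), b)
```

The dual potentials `f` and `g` are the byproducts from which, as discussed in this response, per-datapoint sensitivity information can be read off without any additional computation.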
For instance, we provide results on computation time vs. training size demonstrating the near-linear time complexity of our method on the CIFAR-10 dataset in Fig. 13 (in the updated Appendix version).<br />
Additionally, we compare our computation time with other baselines to showcase the large efficiency advantage of our method on the ImageNet-100 dataset below:<br />

| LAVA | KNN-SV | TracIn-Clean | TracIn-Self |
| ----------- | ----------- | ------------ | ----------- |
| 1 hr 54 min | 4 hr 21 min | 7 hr 50 min | 7 hr 51 min |

---

### Q3. [free calculation for data values]

"...*the calibrated gradient can be obtained for free from off-the-shelf optimization solvers...more details about how to use the duality form ... are needed*."

### A3.

#### (tl;dr: we provide below a complete explanation of the calculation of gradients from the dual solutions, with both a conceptual storyline and technical details)

The original formulation of the optimal transport problem is a linear program that minimizes the transport cost from the empirical distribution of the training dataset to that of the target dataset. These empirical distributions are discrete distributions consisting of the probability mass of each datapoint, with the transport cost (or price) between datapoints in the training and target datasets calculated as some distance in high-dimensional space. The optimal solution to this problem (the optimal transport map) yields the minimum transport cost, i.e., the 2-Wasserstein distance.<br />
We would like to calculate the gradients of this Wasserstein distance w.r.t. the probability mass of the datapoints.
This gives us information on how the OT distance would change if we increase/decrease the probability mass of a datapoint (e.g., duplicating the datapoint in the dataset or removing it).<br />
When the OT problem is written in dual form, the weights in the cost vector (the objective function) change from the transport prices to the probability masses of the datapoints. Taking the derivative of the objective w.r.t. the probability mass of a datapoint, this gradient turns out to be exactly the dual potential corresponding to that datapoint. As OT solvers solve the dual problem in practice, this dual potential is always directly available! (For LP and broader optimization problems, the primal problem, often defined with real-world values and constraints, tends to be extremely sparse or singular, causing numerical problems in solvers, while the dual formulation almost always leads to much better numerical stability, making it the de facto choice in solution programs.)<br />
Beyond this conceptual storyline, there are additional issues that need to be taken care of before the valuation can be precisely obtained, which leads to further theoretical contributions in this work:
1. We notice that the gradients directly obtained from the dual actually lack one critical constraint: the primal formulation of the OT problem is essentially always redundant, and one constraint is dropped during the primal-dual transform. The "real gradient" (calibrated gradient) is obtained by explicitly enforcing this constraint, bridging the gap in both theory and practice.
2. In practice, it is the entropy-penalized OT that is solved, and what is returned are approximate solutions.
We provide novel theoretical findings in Theorem 5 that, in the context of evaluating and ranking the datapoints, the differences between the values of the datapoints can be precisely recovered from the approximated solutions, such that the order of values for datapoints is actually preserved. We believe this offers new insights for a variety of applications: there may not necessarily be a tradeoff between efficient solution (scalability) and solution quality (accuracy).

<hr style="border:1px dashed gray">

**Minor Comments MC. 1** "*Many notations are used before the definition. For example, ...on page 3*..."

**Clarification Cl. 1** The authors apologize for the typos and inconsistencies and for the confusion they caused. We extend sincere gratitude to the reviewer for helping identify the issues and greatly appreciate the patience and effort devoted to improving this work. We have fixed all the issues pointed out, with additional clarifications added to the paper, and have also checked the contents and fixed other inconsistencies to the best of our ability. Yes. By representing labels as distributions over the feature space they define and regarding features of data points belonging to each label as its samples, $y\rightarrow \alpha_y(X)\triangleq P(X|Y=y)$, we use a p-Wasserstein distance between labels, $\mathbf{W}_p^p(\alpha_y, \alpha_{y'})$. Here $\alpha_y$ and $\alpha_{y'}$ are empirical distributions over data points conditioned on labels $y$ and $y'$, respectively.

---

**MC. 2** "*In Theorem 4, line 135, both the notations $\Pi^*$ and $\pi^*$ are used; are they the same? Also, $P_t^{f_t}=(x,f_t(x))_{x\sim\mu_t}$ and $P_v^{f_v}=(x,f_v(x))_{x\sim\mu_v}$ are used without any explanation. Are they denoting the true distribution of the training and validation data?*"

**Cl.
2** $\Pi$ denotes the set of couplings of $(P_t^{f_t}, P_v^{f_v})$ and represents the solution space, while $\pi$ denotes a transport map, i.e., a solution in this space, $\pi\in\Pi(P_t^{f_t},P_v^{f_v})$. $\pi^*$ is the optimal solution, the minimizer of the OT problem. $P_t^{f_t}=(x,f_t(x))_{x\sim\mu_t}$ and $P_v^{f_v}=(x,f_v(x))_{x\sim\mu_v}$ represent the underlying distributions of the training and validation datasets, respectively.

---

**MC. 3** "*Line 173, equation (6). The quantity is defined by maximizing over f and g, so OT should not be a function of f and g anymore*."

**Cl. 3** Yes, OT should not be a function of the variables f and g. It is either defined as a function of the underlying distributions $(\mu_t, \mu_v)$ or as a function of the optimal solutions $(f^*, g^*)$.

---

**MC. 4** "*More details about Figure 1 are needed. I cannot understand how different curves are generated based on the information provided in the main paper*."

**Cl. 4** We apologize for the brief explanation in the paper due to page limitations. Figure 1 presents the relation between Wasserstein distance and model performance. Each curve corresponds to a dataset trained on a specific model to obtain its performance. Since each dataset is of a different size and structure, the Wasserstein distances are on different scales; therefore, we normalized the distances to a common scale. The figure shows that, across different datasets and models, performance decreases as the distance increases.

<hr style="border:1px dashed gray">

# Reviewer czdG

### Comment C1. [novelty and contribution]

"*Measuring distance between datasets... and computing the gradient... are not new... the novelty... from the connection between this distance and the performance..*."

### Explanation E1.
#### (tl;dr: The main contribution of the paper is a novel approach for valuing individual datapoints with respect to a target task, without requiring information about downstream models or using a proxy model. The proposed method enjoys wide advantages over existing approaches, being orders of magnitude faster and applicable to a scale far beyond that of existing methods. Also, novel theoretical insights on optimal transport are provided in association with downstream applications.)

Most existing data-valuation methods (measuring the value of individual datapoints for a target task) are model-based, an approach which generally involves massive model retraining to examine the performance of models trained on different subsets. For example, the celebrated Shapley-value-based techniques often require retraining the model on an exponential number of subsets of the given dataset. This computational intractability makes many such methods demanding in computational power and especially poor in scalability. Also, the inevitable stochasticity introduced by model training causes high variance in performance, affecting the accuracy of valuation results as well. In this work, leveraging existing theories in Optimal Transport and transfer learning, as you kindly noted, we proposed a model-agnostic approach to measure the value of individual datapoints to a target task based on distributional distances. Specifically, we found it is possible to extract the gradients of the OT distance w.r.t. the strength of existence of each individual datapoint directly from the solutions to the OT problem, completely free from additional computations. With Theorem 4 as well as empirical results connecting the OT distance to the upper bounds of the validation performance for downstream models, the valuation can be interpreted as a prediction of how adding/removing the datapoint will change the validation performance.
This leads to a valuation approach that is **exceptionally efficient**, orders of magnitude faster than other valuation methods, and **applicable to a scale far beyond that of existing methods** (as shown in the table below).

| LAVA | KNN-SV | TracIn-Clean | TracIn-Self |
| ----------- | ----------- | ------------ | ----------- |
| 1 hr 54 min | 4 hr 21 min | 7 hr 50 min | 7 hr 51 min |

Moreover, the valuation results are also of **good quality**, demonstrating strong effectiveness in a variety of challenging tasks, all in a small fraction of the computing time of its predecessors. Beyond efficiency, this work offers **novel theoretical insights** on computational aspects more closely associated with downstream applications. For example, the entropy-penalized OT solutions are subject to the approximation error introduced by the penalty terms; however, in this work, we show that in the context of evaluating and ranking the datapoints, the differences between the values of the datapoints can be precisely recovered from the approximated solutions, such that the order of values for datapoints is actually preserved. We believe this offers new insights for a variety of applications: there may not necessarily be a tradeoff between efficient solution (scalability) and solution quality (accuracy). <u>In summary, our work contributes to solving a question that has puzzled the data valuation field for years, namely, how to estimate the value of data in a meaningful and efficient manner. Towards that end, we couple insights from statistical learning (to bound the validation performance with a distributional distance) and optimization theory (to calculate gradients of OT through duals). While distributional distances date back more than half a century, this is the first time that the idea of a distribution distance is leveraged in the context of data valuation, and more importantly, it significantly boosts the efficiency and utility of data valuation applications.
In addition, our theoretical result that the dual solutions to entropy-regularized OT preserve the rank of the dual solutions to the original OT is novel and actually very surprising: in contrast to the traditional wisdom that there is a direct tradeoff between efficiency and accuracy, our result shows that the efficiency of the entropy regularizer does not trade off against accuracy when only the rank of the dual solutions is of interest, as in the data valuation context.</u>

---

### Comment C2. [requirement on validation data]

"...*investigating the trade-off between the validation size and detection rate ... such a large amount of data... for validating is not realistic*...."

### E2.

#### (tl;dr: More experiments were conducted in this regard. The proposed method LAVA performs well with a few thousand clean samples. It is cheap to obtain high-quality samples at this scale. The reliance on data is much weaker than in existing data-valuation methods.)

The proposed method is remarkably **data-efficient** with respect to clean validation examples: in our experiments, a few thousand samples are generally sufficient for the method to perform strongly. As an example to illustrate this point, we compare with the best-performing baseline, TracIn-Self, on mislabeled-data detection. Given 10,000 validation samples, TracIn-Self achieves a detection rate of 80% at a data-removal budget of 25,000; by contrast, our method achieves the same detection rate at the same budget using only 2,000 validation samples. **It is fair to say that the reliance on validation samples is much weaker than in existing data-valuation methods**. Also, **it is generally not difficult to find high-quality samples at this scale**: crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) can help prepare such samples economically and conveniently (e.g., **acquiring a high-quality validation set of 1,000 samples costs only $12 on MTurk**).
Also, the authors appreciate the reviewer for pointing out the need for additional experiments investigating the relationship between validation size and model performance and comparing with existing methods. **Additional experiments were conducted, and the results and analysis are combined into the answer below the next comment**.

---

### C3. [performance under small validation size]

"...*when you can accurately approximate the ground-truth distribution ...the performance of LAVA when the validation size is 200 or 500 samples ...just above random selection... thorough experiments on smaller validation sizes*."

### E3.

#### (tl;dr: the referenced experiment evaluates the lower bound for the scale of samples required and was presented without sufficient elaboration. Additional experiments were conducted with comparisons to appropriate baselines: the proposed method is data-efficient compared to existing approaches.)

The authors agree that the results with very small validation datasets are close to random selection and would like to note that the intention of presenting these results is to give a comprehensive view of how the proposed method performs with different validation-set sizes. The near-random results on very small validation datasets suggest **a lower bound for the scale of samples required for the method to perform well**: the performance improves significantly when the validation set grows from 200 to 500 samples. The authors acknowledge presenting potentially confusing experiments without sufficient elaboration and appreciate the reviewer for kindly pointing out the issues. We present results on detecting mislabeled data given a validation set of 500 samples on CIFAR-10 and compare the detection rate with other baselines at the same validation size in Fig. 10 (in the updated Appendix version).
**The performance of our proposed LAVA method is still much better than the best baseline under the limited-validation-data setting**. On the other hand, the best-performing baseline at a validation size of 10,000 is now worse than the random baseline at a validation size of 500. In comparison, the performance of our method is more robust to validation-size reduction.

<hr style="border:1px dashed gray">

### Question Q1. [theoretical contribution of Theorem 4]

"...*Theorem 4 seems to be the main theoretical contribution of this paper...truly incomprehensible... for the list of typos*."

### Answer A1.

#### (tl;dr: We have fixed the typos that the reviewer kindly pointed out, as well as other inconsistencies we found during double-checking; we also present here more insights into Theorem 4 and its proof.)

The authors apologize for the typos and inconsistencies and for the confusion they caused. We extend sincere gratitude to the reviewer for helping identify the issues and greatly appreciate the patience and effort devoted to improving this work. We have fixed all the issues pointed out, with additional clarifications added to the paper, and have also checked the contents and fixed other inconsistencies to the best of our ability. In a nutshell, Theorem 4 bounds the gap between training and validation loss by the Wasserstein distance grounded on the specific hybrid cost:
$$d_{\mathcal{Z}}\left((x, y),\left(x^{\prime}, y^{\prime}\right)\right) \triangleq\left(d_{\mathcal{X}}\left(x, x^{\prime}\right)^{p}+W_{p}^{p}\left(\alpha_{y}, \alpha_{y^{\prime}}\right)\right)^{\frac{1}{p}}.$$
The Wasserstein distance with this particular hybrid cost has shown great empirical performance in detecting abnormal features and labels in our experiments. In addition, this Wasserstein distance has also been shown in past work [1] to characterize the domain and task difference.
However, all of these observations remain at an empirical level. This paper provides the first theoretical justification for the Wasserstein distance with the hybrid cost by showing that it characterizes the upper bound of the validation performance. Our proof might be of independent interest to a broader community involving meta learning and domain adaptation, as it can be readily adapted to justify why this Wasserstein distance is a good proxy for predicting transfer learning and meta learning performance. **The novelty of the proof lies in a series of mathematical constructions to make the Wasserstein distance grounded on the hybrid cost appear in the upper bound of the validation performance, thus providing this Wasserstein distance a formal justification.**

[1]. Alvarez-Melis, David, and Nicolo Fusi. "Geometric dataset distances via optimal transport." Advances in Neural Information Processing Systems 33 (2020): 21428-21439.

---

### Q2. [selection of reference points in Theorem 5]

"*Does the ground-truth difference... depend on the choice of the reference datapoint*..."

### A2.

#### (tl;dr: The recovery of precise differences between OT gradients does not depend on the choice of reference points.)

The recovery of precise differences between OT gradients does not depend on the choice of reference points. Selecting any pair of reference points leads to the same recovered results. As the differences between OT gradients can be precisely recovered, the ranking of datapoint values is well preserved and not affected by the approximated solution of the OT problem.

---

### Q3. [Interpretation of valuation results]

"*For Shapley-based value... values as the contribution of data points to the performance... interpretation of LAVA’s values*?"

### A3.

#### (tl;dr: the gradients predict how adding/removing the datapoint will change the validation performance)

The valuation is given by the calibrated gradients of the OT distance between the training and validation datasets w.r.t.
the probability mass of the datapoints. The probability mass represents the strength of existence of the datapoint in the dataset: removing the datapoint reduces its probability mass to zero, while duplicating the datapoint doubles its probability mass. With Theorem 4 as well as empirical results connecting the OT distance to the upper bounds of the validation performance for downstream models, the valuation can be interpreted as a prediction of how adding/removing the datapoint will change the validation performance. The gradients may be positive or negative, representing the negative or positive contribution of the datapoint in the training dataset to the target task (validation dataset). For example, adding datapoints with positive gradients to the training dataset suggests that this will increase the OT distance to the target dataset, lowering the predicted performance on the target dataset for downstream models trained on this dataset.

<hr style="border:1px dashed gray">

# Reviewer yYLk

We express our gratitude to the reviewer for taking the time and effort to review the work and for the helpful comments and advice.

### Q1. [computational complexity]

"...*concern is the computational cost...the computation of the Wasserstein distance may scale cubically... the sinkhorn approach may probably be much faster*?"

### A1.

#### (tl;dr: The Sinkhorn algorithm computes an approximate solution orders of magnitude faster and achieves near-linear time complexity in practice.)

The original formulation for computing the Wasserstein distance is a linear programming problem, for which the best theoretical bounds on computational complexity scale cubically with the number of samples [1]. Indeed, when solving this problem exactly with state-of-the-art commercial solvers such as Gurobi, we encountered significant computational issues at just over 10,000 samples.
Thus, in this work, we solve the entropy-penalized formulation of the OT problem, which gives an approximate solution to the original problem. The entropy-penalized OT problem can be solved iteratively with Sinkhorn's algorithm at a time complexity of $O(n^2 \log(n)/\varepsilon^2)$ to reach $\varepsilon$-accuracy [2][3]. For approximation errors $\varepsilon$ that are not excessively small, this approach solves the OT problem orders of magnitude faster. Theorem 5 shows that the precise relative differences and the ranking of the gradients, on which our valuation of datapoints relies, are preserved despite the approximation error.

Moreover, the algorithm generally performs much better in practice than the theoretical bound suggests: state-of-the-art implementations of Sinkhorn's algorithm can achieve near-linear time complexity [4]. In this work, we build our implementation on the GeomLoss package [5], which supports GPU-accelerated computation for large-scale problems with near-linear memory overhead.

[1]. Anstreicher, K. M. (1999). Linear programming in O([n³/ln n]L) operations. SIAM Journal on Optimization, 9(4), 803-812.

[2]. Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26.

[3]. Chizat, L., Roussillon, P., Léger, F., Vialard, F.-X., & Peyré, G. (2020). Faster Wasserstein distance estimation with the Sinkhorn divergence. Advances in Neural Information Processing Systems, 33, 2257-2269.

[4]. Scetbon, M., & Cuturi, M. (2020). Linear time Sinkhorn divergences using positive features. Advances in Neural Information Processing Systems, 33.

[5]. Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-I., Trouvé, A., & Peyré, G. (2019). Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 2681-2690). PMLR.

---

### Q2.
[scale of applications]

"...*more comments/experiments on how the approach compares to others in terms of the number of samples*."

### A2.

#### (tl;dr: 2,000 samples is the scale existing methods can handle in a reasonable time. We successfully ran our method on full CIFAR-10 (50,000 samples) and on ImageNet (100,000 samples).)

In the experiments, we report runtime comparisons only for 2,000 test samples, as this is the scale at which existing methods finish in a reasonable time (within a day). The comparison showcases the computational efficiency the proposed approach enjoys over other methods. That said, we thank the reviewer for pointing out that the linear-programming-based approach may become computationally intractable as the scale grows; we agree this is a crucial issue that requires better elaboration.

As mentioned under the question above, Sinkhorn's algorithm solves entropy-penalized OT problems far more efficiently, orders of magnitude faster than LP-based solution methods. With state-of-the-art implementations and GPU acceleration for large-scale problems, near-linear time and memory overhead can be achieved in practice.

We report additional experimental results on how the time complexity of our approach scales with the number of samples. We successfully ran our method on full CIFAR-10 (50,000 samples, as provided in the paper) and on ImageNet (100,000 samples). We demonstrate the near-linear time complexity of our method on the CIFAR-10 dataset in Fig. 11 (in the updated Appendix). Moreover, we validate the computational efficiency of our method by comparing with existing valuation methods on the ImageNet-100 dataset and present the results below. Our method is significantly faster than the existing valuation methods, which demonstrates its practicality.
| LAVA | KNN-SV | TracIn-Clean | TracIn-Self |
| ----------- | ----------- | ------------ | ----------- |
| 1 hr 54 min | 4 hr 21 min | 7 hr 50 min | 7 hr 51 min |

<hr style="border:1px dashed gray">

# Summary of Changes

We sincerely appreciate all reviewers' insightful comments, questions, and suggestions. We are pleased that our work was recognized as a novel and effective data valuation method (yYLk) that tackles an important (4TXZ) and interesting (czdG) problem, and that contributes accessible and applicable research to the community with practical implementations (RGDd). To further improve our work, we have updated the following sections and addressed the key problems in the new version of our paper:

---

* Revised and fixed the typos in Theorem 4.
* Added explanation and insights on the importance of the theoretical contribution that Theorem 4 provides in Appendix A.2.
* Provided an interpretation of the data values produced by LAVA in Appendix A.4.
* Added insights on data-based vs. model-based valuation in Appendix A.5.
* Added various practical experiments in the Appendix (B.2-B.4), such as rebalancing unbalanced datasets and reducing training datasets.
* Explained and demonstrated the scalability and computational complexity of the proposed method (B.5).

---

We have revised our paper to include all of the additional comments and experiments suggested by the reviewers. Please let us know if you have any further suggestions.

Cordially,
Authors of Paper2217
