# ICML KDS 2025 Rebuttal
## Follow up to R3
We thank the reviewer for adjusting the score. We observe that correlations across models are generally high—for instance, (bloomz, phi3-small) = 0.99, (phi3-small, qwen2.5-14b) = 0.98, and (llama3.1-8b, qwen2.5-7b) = 0.86. We will also ensure that the final version includes a discussion on rephrasing-based evasion.
Thanks again for your support!
<!--
| | bloomz | gemma2 | llama3.1-8b | mistral | phi3-small | qwen2.5-14b | qwen2.5-7b |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| bloomz | 1.00 | 0.87 | 0.69 | 0.83 | 0.99 | 1.00 | 0.72 |
| gemma2 | | 1.00 | 0.59 | 0.81 | 0.87 | 0.88 | 0.57 |
| llama3.1-8b | | | 1.00 | 0.69 | 0.75 | 0.69 | 0.86 |
| mistral | | | | 1.00 | 0.79 | 0.86 | 0.65 |
| phi3-small | | | | | 1.00 | 0.98 | 0.71 |
| qwen2.5-14b | | | | | | 1.00 | 0.71 |
| qwen2.5-7b | | | | | | | 1.00 | -->
## Follow up
Thank you for the thorough read of our rebuttal and for probing with insightful questions. We address the remaining ones below.
_A3. Reason for negative correlation_
We fully agree with the reviewer's view. Accordingly, we have investigated this in more depth. As a case study, we examined the Zlib baseline on the HackerNews benchmark, with a correlation of -0.956. We hypothesize that these negative correlations may stem from the following factors:
- Many baseline methods operate at the instance level and then aggregate scores across a dataset. We observed that the overall score distribution is highly similar between small and large contamination rates $\lambda$, making the aggregated score relatively insensitive to gradual contamination. In such cases, even a few outliers can disproportionately influence the overall score and introduce non-monotonic behavior.
- Upon closer examination, we found that unseen examples can contain a large number of code snippets with predictable formatting and repetitive syntax and tokens, which makes them highly compressible. As a result, the Zlib score at a low contamination rate ($\lambda=0.1$) can be slightly higher than at a high one ($\lambda=0.9$), an effect compounded by the outliers described above.
We believe our results can inspire follow-up works to carefully re-examine widely used contamination detection methods, which may be worth an extensive study on its own. We thank the reviewer again for the comment. We will make this discussion more prominent in the revised version.
_A4. Score and contamination rate_
We report below the scores across different benchmarks and different contamination rates. While a perfect correspondence between score and contamination rate remains difficult to establish, we observe that the overall score trends are consistently monotonic across all three benchmarks.
| **Contamination Rate** | **WikiMIA** | **ArxivTection** | **BookMIA** |
|:---:|:---:|:---:|:---:|
| 0.10 | 0.110 | 0.172 | 0.158 |
| 0.20 | 0.228 | 0.300 | 0.302 |
| 0.30 | 0.382 | 0.437 | 0.425 |
| 0.40 | 0.441 | 0.496 | 0.511 |
| 0.50 | 0.487 | 0.517 | 0.619 |
| 0.60 | 0.597 | 0.638 | 0.722 |
| 0.70 | 0.722 | 0.768 | 0.763 |
| 0.80 | 0.809 | 0.863 | 0.802 |
| 0.90 | 0.920 | 0.918 | 0.832 |
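The monotonicity claim can be checked directly from the table. The snippet below is a minimal, dependency-free sketch of a Spearman rank correlation (Pearson correlation on average ranks) applied to the values copied from the table above; the helper names `ranks`, `pearson`, and `spearman` are illustrative and not taken from our codebase.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Values copied from the A4 table.
rates = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
scores = {
    "WikiMIA": [0.110, 0.228, 0.382, 0.441, 0.487, 0.597, 0.722, 0.809, 0.920],
    "ArxivTection": [0.172, 0.300, 0.437, 0.496, 0.517, 0.638, 0.768, 0.863, 0.918],
    "BookMIA": [0.158, 0.302, 0.425, 0.511, 0.619, 0.722, 0.763, 0.802, 0.832],
}
correlations = {name: spearman(rates, column) for name, column in scores.items()}
```

Because every column increases strictly with the contamination rate, the rank correlation is 1 for all three benchmarks, even though the scores do not equal the rates themselves.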
_A5. Correlation between benchmark difficulty and KDS_
We should have used the wording "relatively low" in our previous response. As suggested, we also computed the correlation for more models and observed consistent findings (e.g., Llama3.2-1b and Qwen2.5-1.5b both show weak correlation). We could not afford to run the full set of inferences (number of benchmarks $\times$ number of models) in such a short time window, since our limited compute forces us to serialize the inference runs.
We would like to clarify that difficulty and contamination are distinct notions. A benchmark's difficulty reflects how challenging its questions or text are for human (or model) reasoning. In contrast, contamination reflects whether the exact or near-verbatim content of that benchmark was present in the model's pretraining corpus.
> "_Using a similar argument as the authors, one could argue that training on more difficult samples the model is less used to, would cause more deviation on the sample._"
This does not necessarily hold. **A difficult sample may still cause minimal deviation if the model has already seen similar content during pretraining**. Sometimes, complex or niche material—such as advanced math problems or scientific text—can appear on public educational sites, making it possible for a model to memorize them during training. In such cases, despite the sample’s difficulty, the model is already well-prepared to handle it, and fine-tuning will make only minor adjustments.
Conversely, even a simple sample can cause a large deviation if it is truly novel to the model, e.g., a common-sense sentence phrased in an unfamiliar way or drawn from a novel domain. What matters is not whether the sample is inherently hard, but whether it introduces a new learning signal. Our method captures this signal by measuring representation shift induced by fine-tuning, which we argue is a more direct proxy for novelty (and hence contamination) than for difficulty.
_A7. Similarity plot_
We now understand your concern. Yes -- detection would become easier when one distribution becomes more concentrated. However, **we would like to clarify that Figure 2 was constructed primarily for clear illustration purposes**. We intentionally chose a non-i.i.d. configuration to make the effect of fine-tuning on unseen examples more visually prominent, particularly for demonstrating how our method captures representation shifts for those samples.
We would like to emphasize that **all of our quantitative results and benchmark evaluations in the main paper are based on the standard i.i.d. setup**, where seen and unseen samples are drawn from the same distribution. This ensures that contamination detection is evaluated under more realistic conditions.
# Reviewer 5cjP (4896 chars)
We sincerely thank the reviewer for the thorough and constructive feedback. Below, we address the concerns in detail.
---
_A1. Baseline comparison in Table 5 (Pile dataset)_
We provide the comparison below. KDS achieves the highest average correlation.
| Spearman Corr. | Wikipedia | PhilPapers | Enron | HackerNews | Pile_CC | StackExchange | **Average** |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Zlib | 0.861 | **1.000** | **1.000** | -0.956 | -0.782 | 0.990 | 0.352 |
| Zlib + FSD | **1.000** | 0.991 | 0.999 | 0.323 | 0.894 | 0.999 | 0.868 |
| Perplexity | -0.886 | 0.999 | 0.999 | -0.999 | -0.251 | 0.999 | 0.144 |
| Perplexity + FSD | **1.000** | 0.990 | 0.999 | 0.118 | **0.908** | **1.000** | 0.836 |
| Min-K% | -0.645 | 0.996 | **1.000** | -0.955 | 0.690 | 0.999 | 0.348 |
| Min-K% + FSD | 0.997 | 0.952 | 0.997 | 0.421 | **0.908** | **1.000** | 0.879 |
| Min-K%++ | -0.482 | 0.960 | -0.842 | 0.561 | 0.514 | 0.697 | 0.235 |
| Min-K%++ + FSD | -0.536 | 0.994 | -0.770 | 0.705 | -0.358 | 0.210 | 0.041 |
| KDS (Ours) | 0.891 | 0.982 | **1.000** | **0.897** | 0.895 | **1.000** | **0.944** |
_A2. Significance over Perplexity_
In the A1 table above, **KDS outperforms Perplexity on the Pile dataset, with an average correlation of 0.944 compared to 0.144**.
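For reference, the two averages can be reproduced from the per-subset values in the A1 table. This is a small arithmetic sketch only; the dictionary layout and variable names are illustrative.

```python
from statistics import mean

# Per-subset Spearman correlations copied from the A1 table, in column order:
# Wikipedia, PhilPapers, Enron, HackerNews, Pile_CC, StackExchange.
per_subset = {
    "Perplexity": [-0.886, 0.999, 0.999, -0.999, -0.251, 0.999],
    "KDS (Ours)": [0.891, 0.982, 1.000, 0.897, 0.895, 1.000],
}

# Unweighted mean over the six subsets (the "Average" column).
averages = {method: mean(values) for method, values in per_subset.items()}

# Number of subsets on which each method is negatively correlated.
negative_subsets = {
    method: sum(value < 0 for value in values)
    for method, values in per_subset.items()
}
```

Beyond the gap in averages, the count of negatively correlated subsets shows where the difference comes from: Perplexity is negative on three of the six subsets, whereas KDS is negative on none.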
_A3. Negative correlations_
We share the reviewer's concern. To rule out potential causes, **we reviewed our implementation and found no errors**. Since our code builds directly on established prior work, an implementation error is unlikely (the code will be released).
---
_A4. Main metrics and Practical utility_
(1) Main metrics
We agree that absolute contamination estimation and exact correspondence would be valuable. However, a **necessary step toward this goal** is to first validate the correctness of any scoring function. The use of Spearman/Pearson correlation in our controlled experiments allows us to rigorously assess whether KDS (and other approaches) produce consistent and monotonic scores, which are necessary conditions for any meaningful scoring function. **This provides a development-stage metric to ensure that the score behaves predictably and reliably in principle, which is essential before deploying it in practical settings**. To our knowledge, no prior work has investigated this practically important angle.
Concerningly, we discovered existing scores do not satisfy these necessary conditions. In contrast, KDS consistently satisfies these conditions under varying benchmarks (including the challenging ones), providing stronger foundations for future method design.
(2) Practical utility
The lack of access to pre-training corpora makes ground-truth estimation fundamentally challenging. Moreover, pre-training data is often non-overlapping across models, further complicating any cross-model normalization or comparison. _To our knowledge, no existing work has effectively established a correspondence between the scoring function and dataset contamination_. **This reflects the intrinsic difficulty of the problem, rather than a unique shortcoming of our approach**.
That said, we would like to emphasize that **KDS is highly practical: it supports real-world applications such as safety auditing and benchmark selection**, where knowing which dataset is more contaminated is often more important than an absolute estimate. It enables dataset creators or auditors to prioritize and select the least contaminated benchmarks for evaluation purposes. We state the practical utility in **Lines 108-117**.
---
_A5. Correlation between benchmark difficulty and KDS_
Thank you for this insightful question. During rebuttal, we computed the correlation of {Llama3.1-8b, Mistral-7b, Qwen2.5-7b}'s performance on {GSM, MPP, MPL, TFQA} with KDS. The correlation values are 0.2, 0.4, 0.2, respectively, which are all very low. This indicates that **benchmark difficulty is not what determines the scale of KDS.**
---
_A6. Why do we need Requirement 2?_
While the two requirements could be combined, separating them into orthogonal objectives offers clearer guidance for future research by isolating distinct performance aspects.
---
_A7. Clarification on Figure 2 (left)_
It is true that unseen samples are similar to each other in Figure 2 (left), and we would like to clarify that this is the similarity _before_ fine-tuning. However, **our method relies on the changes in the similarity or distance before and after fine-tuning, which is much more significant for unseen examples**. As shown in Figure 2 (middle, without gating) and Figure 2 (right, with gating), unseen samples exhibit significantly larger shifts in embedding geometry, whereas seen samples remain relatively stable. This differential behavior is the core signal that KDS captures.
P.S. The performance when using only Figure 2 (left) is reported in Table 2 under "w/o (2) Fine-tuning".
---
_A8. Is comparison with SRCT accurate?_
**We preserved the canonical ordering of samples in SRCT** by not shuffling the dataset.
---
# Reviewer 9vAw (4986 chars)
We thank the reviewer for the thoughtful and constructive feedback. Below, we address your concerns in detail.
---
_A1. Related work_
We thank the reviewer for pointing out the work by Dekoninck et al. [1], which is highly relevant! Missing it was an oversight on our part at the time of submission, and we will make sure to discuss it in the updated version.
[1] Dekoninck et al. "ConStat: Performance-Based Contamination Detection in Large Language Models." NeurIPS 2024.
---
_A2. Open weight model_
We acknowledge that KDS currently requires open-weight models. This is a fair point, and we will clearly note this limitation in the paper. However, we'd like to emphasize that focusing on open-weight models is valuable for academic research because they provide transparency **essential for developing a deeper understanding of how contamination manifests inside LLMs**. In future work, adapter-based probing techniques could be explored to extend the principles of KDS to proprietary models.
---
_A3. Significance w.r.t. Min-K%_
While Min-K% works well on easy benchmarks such as WikiMIA and BookMIA, its performance degrades significantly on more challenging datasets. For example, **in the three hardest PILE subsets in Table 5, Min-K% has significantly lower performance than KDS**, as shown below (numbers are Spearman correlation):
| PILE Dataset | Min-K% | KDS |
|:---:|:---:|:---:|
| Wikipedia | -0.645 | **0.891** |
| HackerNews | -0.955 | **0.897** |
| Pile_CC | 0.690 | **0.895** |
---
_A4. Practical utility_
We agree that absolute contamination estimation is valuable. However, **a necessary step toward this goal** is to ensure that any proposed contamination scoring function satisfies core correctness criteria—namely, monotonicity and consistency with ground-truth contamination levels in controlled settings. **These conditions ensure that the score behaves predictably and reliably in principle, which is essential before deploying it in practical settings**. To our knowledge, no prior work has investigated this practically important angle.
Concerningly, we discovered existing scores do not satisfy these necessary conditions. In contrast, KDS consistently satisfies these conditions under varying benchmarks (including the challenging ones), providing stronger foundations for future method design.
Lastly, we would like to emphasize that KDS is also highly practical, because it supports real-world applications such as safety auditing and benchmark selection, where knowing which dataset is more contaminated is often more important than an absolute estimate. **To further demonstrate the practical utility, we include an in-the-wild evaluation across 11 public benchmarks (Appendix E)**. For these reasons, we believe KDS offers a meaningful and practical step forward for the field.
---
_A5. Minimal contamination rate to be regarded as "significantly contaminated"_
The required rate to be considered "significantly contaminated" is **0.10 (p=0.014)** under significance level $\alpha=0.05$, and **0.15 (p=0.004)** under $\alpha=0.01$.
---
_A6. Histogram on the values of embedding similarity change_
Since plots cannot be updated or shown at this stage, we provide the histogram as a frequency table below. The table is computed at a contamination rate of 0.5.
| bins | Seen-Seen | Seen-Unseen | Unseen-Unseen |
|:---:|:---:|:---:|:---:|
| x < -0.10 | 0 | 202 | 22 |
| -0.10 <= x < 0.00 | 21529 | 44112 | 6030 |
| 0.00 <= x < 0.10 | 40953 | 80338 | 52270 |
| 0.10 <= x | 18 | 348 | 4178 |
| **total** | 62500 | 125000 | 62500 |
_A7. Embedding similarities should increase for previously unseen data. Why take the absolute value of change?_
We would like to clarify that the embedding distances increase (i.e., embedding similarities decrease) for previously unseen data after fine-tuning. This is also evidenced in the table above, where $\log \frac{\Phi(Z)}{\Phi(Z')} = \|Z'_i-Z'_j\|^2_2 - \|Z_i-Z_j\|^2_2$ is mostly positive for unseen-unseen pairs.
We take the absolute value of the change to ensure that we capture the magnitude of deviation, regardless of direction. This makes the score robust to mixed behavior and focuses on detecting how much the geometry shifts. We will clarify this in the final version.
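The "mostly positive" statement can be read off the A6 frequency table. The sketch below assumes that `x` in that table denotes the change in squared distance given above; the helper `share` and the bin indices are illustrative.

```python
# Counts copied from the A6 frequency table (contamination rate 0.5).
# Bin order: x < -0.10, -0.10 <= x < 0.00, 0.00 <= x < 0.10, 0.10 <= x.
counts = {
    "Seen-Seen": [0, 21529, 40953, 18],
    "Seen-Unseen": [202, 44112, 80338, 348],
    "Unseen-Unseen": [22, 6030, 52270, 4178],
}


def share(column, selected_bins):
    """Fraction of pairs in a column that fall into the selected bins."""
    return sum(column[i] for i in selected_bins) / sum(column)


# Fraction of pairs whose change is non-negative (last two bins).
non_negative = {name: share(col, [2, 3]) for name, col in counts.items()}

# Fraction of pairs with a large change in either direction (|x| >= 0.10).
large_magnitude = {name: share(col, [0, 3]) for name, col in counts.items()}
```

Under this reading, roughly 90% of unseen-unseen pairs have a non-negative change, compared with about 66% of seen-seen pairs, and large-magnitude changes are concentrated in the unseen-unseen column. The outer bins also show that a small number of pairs move in the negative direction, which is the mixed behavior the absolute value is meant to absorb.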
---
_A8. Effect of Fine-tuning and Rephrased Data_
Following the suggestion, we fine-tuned on "unseen" samples from WikiMIA and ArxivTection and observed KDS increases of 22% for WikiMIA and 11% for ArxivTection. This shows that our scoring approach effectively captures contamination introduced by fine-tuning.
Additionally, we experimented on WikiMIA by mixing unseen data with rephrased "seen" data at varying proportions, yielding a Spearman correlation of -0.791. This outcome is expected, as rephrasing likely blurs the distinction between seen and unseen samples.
---
_A9. Other suggestions_
Thank you for all the constructive suggestions! We will incorporate them by (1) mentioning the use of LoRA in the main paper, (2) bold-facing all the best results in Table 1, and (3) adding a histogram based on our response in A6.
---
# Reviewer CbNy (3933 chars)
We thank the reviewer for the thoughtful and constructive feedback. Below, we address the key concerns:
---
_A1. Clarification on problem setting_
We absolutely agree with your viewpoint that contamination in real LLMs occurs during pre-training! We'd like to clarify that **our central goal is indeed to quantify contamination originating from pre-training, not fine-tuning**. Our methodology and evaluation are designed around pre-training leakage.
In particular, we ask this question (Sec 2.1):
> Given an LLM $\mathcal{M}$ and a benchmark dataset $\mathcal{D}$ (e.g., WikiMIA), to what extent has this dataset been exposed during pre-training?
To address this, our key idea is the following:
> If the model $\mathcal{M}$ has already been pre-trained on $\mathcal{D}$, then further fine-tuning $\mathcal{M}$ on $\mathcal{D}$ would cause minimal changes in embeddings due to prior exposure during pre-training (and vice versa).
Hence, we use a lightweight fine-tuning step as a probe to reveal how much of the evaluation dataset was likely seen during pre-training. In other words, **we are not using fine-tuning to inject contamination, but only as a mechanism to probe whether $\mathcal{D}$ was used in pre-training**. This design choice aligns with recent work (e.g., Zhang et al. 2025, FSD) that also leverages fine-tuning to surface memorized content. Our approach differs in that it leverages embedding-level structural changes, enabling a more holistic dataset-level assessment.
---
_A2. "the contamination is artificially introduced via fine-tuning, not pre-training"_
We would like to clarify that **our experimental setup simulates contamination w.r.t. pre-training, which is precisely the phenomenon our method aims to quantify**. Specifically, our controlled experiments leverage datasets such as WikiMIA, with two subsets: seen/unseen in the pre-training corpus of the LLM (e.g., Mistral-7B). By mixing these pre-training seen and unseen subsets at varying proportions, we simulate datasets with known and controllable contamination ratios relative to pre-training—not fine-tuning.
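The mixing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our experimental code: the function `mix_dataset`, its parameters, and the placeholder pools are hypothetical, and it assumes each pool is large enough to sample from without replacement.

```python
import random


def mix_dataset(seen, unseen, contamination_rate, size, seed=0):
    """Build a dataset of `size` examples with a given fraction of seen samples.

    `seen` and `unseen` are pools of examples labeled by whether they appear
    in the pre-training corpus; no fine-tuning is involved in this step.
    """
    if not 0.0 <= contamination_rate <= 1.0:
        raise ValueError("contamination_rate must lie in [0, 1]")
    rng = random.Random(seed)
    n_seen = round(contamination_rate * size)
    n_unseen = size - n_seen
    # Sample without replacement from each pool, then shuffle the mixture.
    mixed = rng.sample(seen, n_seen) + rng.sample(unseen, n_unseen)
    rng.shuffle(mixed)
    return mixed


# Placeholder pools standing in for the seen/unseen splits of a benchmark.
seen_pool = [f"seen-{i}" for i in range(100)]
unseen_pool = [f"unseen-{i}" for i in range(100)]
mixed = mix_dataset(seen_pool, unseen_pool, contamination_rate=0.3, size=50)
```

Sweeping `contamination_rate` over a grid such as 0.1 to 0.9 yields a family of datasets whose contamination ratio relative to pre-training is known by construction, which is what the correlation metrics are computed against.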
---
_A3. "Have the authors tested KDS on LLMs with partially known pre-training corpora?"_
Excellent suggestion! We further evaluated on **Pythia-6.9b**, which is known to be pre-trained on the PILE dataset. The Spearman correlation coefficient is **0.999** on the Enron subset.
Furthermore, our evaluation includes WikiMIA, BookMIA, and ArxivTection benchmarks, which label samples that are likely to have been seen in general LLM pre-training as "seen". We explicitly simulate varying contamination ratios and show in **Table 1** that **KDS exhibits near-perfect monotonicity and consistency, outperforming all baselines across three datasets.**
---
_A4. Theory_
While our paper primarily focuses on providing a practical and robust empirical method for quantifying pre-training contamination, we agree that theoretical analysis is an important direction. Our core intuition, that fine-tuning shifts the embeddings of unseen pre-training data more significantly than those of seen data, is empirically grounded, as shown in Figure 2, and supported by strong monotonicity and consistency across all benchmarks (Tables 1, 5, 6). Moreover, the mathematical formulation in Eqs. (3)-(5) provides a principled and interpretable construction of KDS, which integrates:
- A soft gating function that emphasizes initially similar pairs, and
- A distance shift measure capturing how much pairwise embedding geometry is altered after fine-tuning.
That said, we acknowledge that a full theoretical analysis of why kernel divergence correlates with contamination remains nontrivial. Such analysis would require carefully defined assumptions about the structure of the embedding space, the dynamics of fine-tuning, and the underlying data distribution. We consider this a promising direction and plan to explore these theoretical foundations in future work.
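To make the two components listed above concrete, the following is a minimal sketch of a gated distance-shift score. It assumes an RBF-style gate $\exp(-\gamma d)$ on the pre-fine-tuning squared distance $d$, with a hypothetical bandwidth parameter `gamma`; it does not reproduce the exact normalization or sign convention of Eqs. (3)-(5), and the function names are illustrative.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def gated_distance_shift(before, after, gamma=1.0):
    """Average gated shift in pairwise squared distances.

    `before` and `after` hold one embedding per sample, extracted before and
    after fine-tuning. `gamma` is an assumed kernel bandwidth.
    """
    n = len(before)
    total, n_pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_before = squared_distance(before[i], before[j])
            d_after = squared_distance(after[i], after[j])
            # Soft gate: close to 1 for pairs that were initially similar.
            gate = math.exp(-gamma * d_before)
            # Distance shift: magnitude of the change, regardless of direction.
            shift = gamma * abs(d_after - d_before)
            total += gate * shift
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Toy embeddings: three initially close points.
before = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
stable = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]  # geometry unchanged
moved = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # geometry spread out

shift_stable = gated_distance_shift(before, stable)
shift_moved = gated_distance_shift(before, moved)
```

In this toy case the unchanged geometry yields a shift of zero, while the spread-out geometry yields a positive value, mirroring the intuition that samples already seen in pre-training move less under fine-tuning. How the raw shift is signed and normalized into the reported score is left to the paper's definitions.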
# Reviewer J5f8
Dear Reviewer J5f8,
We sincerely appreciate your positive feedback and the time you've dedicated to reviewing our manuscript. Your insights are invaluable to us. Please let us know if you have any further questions.