## Global Response
**Summary of Reviews and Response**: We thank all reviewers for their valuable feedback and insightful questions. In general, the reviewers have provided positive feedback about our work and appreciated the problem formulation and analysis (see Review highlights below). While the feedback has been largely positive, we acknowledge the suggestions for an enhanced experimental design incorporating state-of-the-art detectors and generators. We have now included various new experiments, detailed in Rebuttal Table 1 below, to verify our claim. We include the detailed results with figures in the PDF attached to this general response.
Additionally, we recognize the call for more intuitive explanations of our theorems and are committed to addressing these aspects comprehensively in the final version of our paper.
**Rebuttal Table 1:** We performed new experiments, as suggested by the reviewers, on a variety of state-of-the-art text generator-detector pairs. We remark that our claim of increasing detectability holds for all of these new experiments as well.
| New Experiment | Text Generator | Text Detector | Datasets | Satisfies Our Claim | Figure No. in Rebuttal PDF |
| -------------- | -------------- | ------------------------ | ----------- | ------------------- | -------------------------- |
| 1 | GPT-3.5 Turbo | OpenAI's RoBERTa (Large) | XSum, SQuAD | Yes | Figure 1(a) |
| 2 | GPT-3.5 Turbo | OpenAI's RoBERTa (Base) | XSum, SQuAD | Yes | Figure 1(b) |
| 3 | Llama | OpenAI's RoBERTa (Large) | XSum, SQuAD | Yes | Figure 1(c) |
| 4 | GPT-3.5 Turbo | ZeroGPT (SOTA detector) | XSum, SQuAD | Yes | Figure 2(a) |
| 5 | Llama-2 (13B) | ZeroGPT (SOTA detector) | XSum, SQuAD | Yes | Figure 2(b) |
| 6 | Llama-2 (70B)* | ZeroGPT (SOTA detector) | XSum, SQuAD | Yes | Figure 2(c) |
*Llama-2 (70B) is the state-of-the-art open-source text generator.
**Review highlights:**
Reviewer LTNg mentioned that "*results presented in this work are general and **interesting***" and "*They **address the important drawback** from the previous work*".
Reviewer 8upM appreciates that *"paper investigates an **important problem**"* and that the "*paper is **well-written**/presented*".
Reviewer NSUG believes that the paper *"addresses an important and timely topic"* and "*the author(s) discovers the **hidden possibility** and provides a theoretical analysis*".
Reviewer hBT7 mentioned that "*I find the research **very interesting***" and believes that our work "*is **an important contribution** that needs to be **heard by a larger audience**, especially in a conference*".
The reviewer also ***appreciated the organization*** of the paper: "*The visualization (Figure 1) placed early on the paper also helps the reader to further contextualize the problem.*"
---
**Summary of core contributions:**
1. **Possibility of AI Text Detection**: We use a mathematically rigorous approach to answer the question of whether AI-generated text detection is possible. We conclude that there is always a ***hidden possibility*** of detecting AI-generated text, and that it improves with the text sequence length.
2. **Precise Sample Complexity Bounds for AI Text Detection**: We derive sample complexity bounds, the first of their kind tailored to the detection of AI-generated text, for both the IID and non-IID scenarios.
3. **Comprehensive Empirical Evaluations**: We have conducted extensive empirical evaluations on real datasets (XSum, SQuAD, IMDb, and Fake News) with state-of-the-art generators (GPT-2, GPT-3.5 Turbo, Llama, Llama-2 (13B), Llama-2 (70B)) and detectors (OpenAI's RoBERTa (large), OpenAI's RoBERTa (base), and ZeroGPT (SOTA detector)).
**Remark**: In light of the recent pessimistic undetectability results (Sadasivan et al., 2023), which are based on a sequence length of 1, we offer a new direction toward detectability. We believe that rigorous characterization of AI-generated text detection is crucial in the current times, and our work provides a first step in that direction. We encourage the community to work more collaboratively on detection and not be hindered by the perceived impossibility of the task.
-----------------------------------------------------------------
## Response to Reviewer LTNg [Score 4, confidence 4]
We are thankful to the reviewer for dedicating their valuable time and effort to evaluating our manuscript, which has allowed us to strengthen it. We have thoroughly responded to the reviewer's inquiries below.
> **Weakness 1:** The non-iid setup (sec.3.2) is not clear. Assumptions about the type of conditional dependency of samples do not look realistic.
**Response to Weakness 1:** Thanks for the question, and we apologize for any confusion. In practice, ***the assumption about conditional dependency in our analysis is sensible***. We leverage the autocorrelation structure present in natural language texts, where the strength of association decreases over the sequence length, and there can be independent blocks (each internally dependent) within a passage.
For example, consider the passage
*"While some species are adapting to the changing climate, others are facing severe challenges. Polar bears in the Arctic, for example, are struggling to find sufficient food due to melting ice. On the other hand, migratory birds are altering their flight patterns to cope with shifting weather conditions."*
In the above passage, there are two distinct structures, one focusing on the challenges faced by polar bears and the other on migratory birds. These structures are independent of each other, but they both contribute to the overarching theme of the impact of climate change on wildlife.
***Connection to our Analysis:*** Hence, in our analysis we consider $L$ independent subsets ($L$ can also be $1$, i.e., there are no such subsets) and $\rho$ as the dependency between the sequences. The worst that can happen is that all of the prior $t-1$ samples are related to the $t^{th}$ sample almost exactly, and we even account for that in our analysis. Naturally, a decreasing dependency can be captured with minor modifications.
With these assumptions, the most important aspect of the result is that we are able to derive the precise sample complexity bound (cf. Equation 17), which is a function of the strength of association between the sequences, given by $\rho$, and reduces to the standard Hoeffding inequality when $\rho = 0$ (see Equations 16, 17).
These bounds (Equation 17) are derived in the context of detection and are quite general. As mentioned, $L=1$ is the case where there are no independent subsets; naturally, detection is comparatively harder there, as reflected in the final bounds of Theorem 2, Equation 17.
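For intuition, the following minimal simulation sketch illustrates this effect (the AR(1)-style dependence form, the parameter values, and the function names are illustrative assumptions, not the construction used in the proof of Theorem 2): concentration of the sample mean, and hence detectability, degrades gracefully as $\rho$ grows and recovers the IID behavior at $\rho = 0$.
```python
import numpy as np

rng = np.random.default_rng(0)

def deviation_prob(n=100, rho=0.0, eps=0.2, trials=2000):
    """Fraction of trials where the mean of n rho-correlated (AR(1),
    unit-variance) samples deviates from its expectation by more than eps."""
    count = 0
    for _ in range(trials):
        z = rng.standard_normal(n)
        x = np.empty(n)
        x[0] = z[0]
        for t in range(1, n):  # each sample depends on its predecessor
            x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * z[t]
        count += abs(x.mean()) > eps
    return count / trials

for rho in (0.0, 0.5, 0.9):
    # the deviation probability grows with rho: more samples are needed
    # for the same confidence, matching the rho-dependence of Equation 17
    print(rho, deviation_prob(rho=rho))
```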
> **Weakness 2:** The experimental setup causes questions. The authors consider texts generated by GPT-2 model (which is not the best text generator nowadays), and quite simple detectors like logreg, random forest, and two-layers neural networks, which are not SOTA in the area.
**Response to Weakness 2:** To address this concern, we have performed detailed new experiments (please refer to the PDF attached to the global response) leveraging both SOTA detectors and SOTA generators, as ***shown in Figures 1(a-c) and 2(a-c)*** summarized in the global rebuttal; in all of those cases our result holds true.
> **Weakness 4:** Overall the theoretical results have limited interest, because there is no practical prediction that can be derived from these results. It would be good to provide an experimental comparison of the derived upper bounds with the latest generators and the SOTA detectors.
**Response to Weakness 4:** We respectfully disagree with the reviewer that the theoretical results have limited interest. We emphasize that ***our work is the first attempt to theoretically show the "hidden possibility" of AI-generated text detection*** and to derive precise sample complexity bounds for the detection problem, which is non-trivial. These theoretical results provide the insight that detectability increases as the number of samples (tokens) increases, and they also give the rate of that increase.
Further, we highlight that the characterization of the sample complexity results via Chernoff information is a critical insight, and we provide additional insights (detailed in Appendix A) on:
1. **Watermark Design**: We highlight how the insights from our analysis can be combined with the design of more efficient/robust watermarks. One can leverage Chernoff information and our sample complexity to adaptively design watermarks: if the number of samples required for detection is low, we watermark those sequences less, and vice versa (see the sketch following this list). This could make watermark design much more robust against attacks. This is part of our future research.
2. **Imposing Length Constraints in Exams (to prevent cheating)**: Our precise sample complexity bounds characterize the number of samples required for detection given the complexity of the scenario, which can be quantified through the information-based metric (Chernoff information). For example, given the context and the scenario, the teacher can always ask the students to write summaries of a particular length, which makes detection easier.
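As a rough illustration of the adaptive rule in point 1 (a minimal sketch; the function names, the error-decay form $\exp(-n \cdot I_C)$, and the parameter values below are our assumptions, not an implemented system):
```python
import math

def sample_complexity(chernoff_info, delta=0.05):
    """Samples needed for misclassification probability <= delta, assuming
    an exponential error decay exp(-n * I_C) (schematic stand-in for Eq. 17)."""
    return math.ceil(math.log(1 / delta) / chernoff_info)

def watermark_strength(chernoff_info, budget_samples, base=1.0):
    """Adaptive rule: watermark lightly when detection is already easy
    (few samples required), strongly when it is hard."""
    needed = sample_complexity(chernoff_info)
    return base * min(1.0, needed / budget_samples)

print(watermark_strength(chernoff_info=0.5, budget_samples=20))   # easy: weak mark
print(watermark_strength(chernoff_info=0.01, budget_samples=20))  # hard: strong mark
```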
We have also added additional experiments in the attached pdf to empirically verify our claims.
**Remark:** In light of the recent pessimistic undetectability results (Sadasivan et al., 2023), which are based on a sequence length of 1, we offer a new direction towards detectability. We encourage the community to work more collaboratively on detection and not be hindered by the perceived impossibility of the task.
> **Weakness 5:** In experiments on artificial text detection, it's important to distinguish the black-box scenario from the situation when the classifier has access to the generator's weights or a large set of samples generated by the same model.
**Response to Weakness 5:** Thanks for the question. For all our experiments, we prompted OpenAI's GPT models with prompts designed from the XSum, SQuAD, etc. datasets, as in [1, 2, 3], and then used various detectors such as OpenAI's RoBERTa, ZeroGPT, and simple ML classifiers. In all of these cases ***we have no access to the weights/gradients of the generator*** (ChatGPT/Llama). Also, we only prompted the generator with a limited set of samples (300-500), as in [1, 2, 3], and tested our hypothesis. It may be true that OpenAI's detector was trained on many samples from GPT-2 (though likely not on GPT-2 outputs generated from XSum or SQuAD prompts); even so, we vary the detectors and generators in our new experiments to rule out such cases.
> **Question 1:** Sec. 4.2 What does "zero-shot" mean in this section? If we consider the classifier trained by OpenAI engineers having access to the large amount of natural and generated data, in which sense it can be considered a "zero-shot"?
**Response to Question 1:** That's a good point. In our context, zero-shot detection means that the model is not trained/fine-tuned on the dataset we use for detection. For example, OpenAI's RoBERTa model has been trained on data generated using GPT-2 and prior models, but has not been specifically trained or fine-tuned on the summaries generated using prompts from XSum, SQuAD, etc. That said, this is a valid concern: due to the black-box nature of the models and detectors, we cannot be absolutely sure which texts were used to train them. Hence, in the new experiments (see Rebuttal Table 1 and the attached PDF in the global response) we also pair SOTA generators with older detectors as well as SOTA detectors to deal with such potential biases.
> **Question 2:** Sec. 3.4 in Eq. 16, if $s_i$ denote text samples, what is the meaning of $\sum_{s_k}$? How do we sum up text samples?
**Response to Question 2:** This is a great question! Eq. 16 provides a way to measure the dependence between random variables, which is later used to extend the Chernoff bound to non-IID cases. Such a measure holds for any representation of $s_k$, such as a one-hot encoding or an embedding (a distributed representation, as in text2vec). In SOTA LLMs, $s_k$ is represented by embeddings, and in this case the sum $\sum_{s_k}$ captures the "average meaning" of these text samples; for example, "woman" + "royalty" may have a similar meaning to "queen".
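To make this concrete, here is a toy sketch (the three-dimensional embedding values are hypothetical, purely for illustration) of how summing distributed representations yields an "average meaning":
```python
import numpy as np

# Hypothetical toy embeddings; real LLM embeddings are high-dimensional.
emb = {
    "woman":   np.array([0.9, 0.1, 0.0]),
    "royalty": np.array([0.0, 0.9, 0.8]),
    "queen":   np.array([0.8, 0.9, 0.7]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

composed = emb["woman"] + emb["royalty"]  # the kind of sum appearing in Eq. 16
print(cosine(composed, emb["queen"]))     # high similarity: the sum is meaningful
```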
> **Limitations:** The paper does not consider the realistic situation of domain shift; The paper claims the theoretical possibility of Artificial texts detection but does not provide a recipe.
**Response to Limitations:** It is not entirely clear to us which exact domain shift is being referred to here.
***In this work, we rigorously studied whether detection is possible or impossible.*** We obtain a precise sample complexity analysis of the detection problem, and we are the first to mathematically characterize the hardness of detection in terms of sequence length, Chernoff information, and sequence dependence (for the non-IID case), which is very important considering the potential impact of detectors.
Hence, we believe that the insights from our analysis can serve as guidance/a recipe for detector and watermark design, as described above in the response to Weakness 4. We highlight the same for the design of watermarks and detectors in Appendix A of the supplementary material. ***The interesting aspect of our theory is that it holds for any detector/classifier, as also validated through the experiments.*** Although a complete, efficient detector design matching the theory exactly is yet to be developed, this work provides a first step in that direction.
## Response to Reviewer 8upM [Score 5, confidence 3]
We express our gratitude to the reviewer for taking the time to review our manuscript and recognizing the importance of our contribution in characterizing the detectability of AI-generated text. We sincerely appreciate the feedback and provide detailed responses to all questions below.
> **Weakness 1:** One major weakness of this work is the base language model selected to conduct experiments. The 1.5B GPT-2 is largely outdated and the generation performance of this model lags the SOTA models (e.g. ChatGPT, GPT-4) by large margins. To ensure the validity and impact of this work, I strongly suggest the author(s) to provide results using SOTA models. [I am willing to increase my assessment score if this problem is properly addressed by the author(s).]
**Response to Weakness 1:** We thank the reviewer for highlighting this point. As suggested, ***we have performed additional detailed experiments*** with state-of-the-art detectors and generators in the new experiments of Figures 1(a-c) and 2(a-c) [summarized in the PDF attached to the global response]. We have also considered how performance varies when the generator is SOTA but the detector is not (trained only up to GPT-2 outputs). In all scenarios, we observe that our result holds true, i.e., detectability improves with increasing sequence length.
> **Weakness 2:** [Minor] One minor weakness is the assumption of this work. It is not always feasible to obtain multiple samples from the same source. In many cases, we only have access to a single machine-generated sample. I would like to see the claim weakened a bit in the following version.
**Response to Weakness 2:** This is a good point. For the IID case, we agree that there may be situations where we do not have access to more than one sample, which could create issues for detection. We will revise our claim in the final version as you suggest. Interestingly, however, there are many scenarios where we can access IID pairs, such as Twitter accounts or chatbots (which generally post multiple samples continuously), one of the major concerns for the spread of misinformation. So, even if we only have access to pairwise samples (just 2 IID samples), we show in Figure 2c that we can detect with very high confidence. For that, we construct a pairwise-type detector, trained with 2 samples rather than one, for distinguishing machine- and human-generated texts. Such a pairwise-type detector extracts information shared between the pairs and is shown to indeed improve detection significantly, as discussed in Section 4.1 for Figure 2c. Additionally, our non-IID result allows us to work with even a single sample, as long as the text is sufficiently long.
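For concreteness, a minimal sketch of such a pairwise-type detector (TF-IDF features, a logistic-regression head, and the toy data below are illustrative assumptions; our experiments in Section 4.1 may differ in these choices):
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def make_pairs(texts):
    """Group consecutive IID samples from the same source into pairs."""
    return [(texts[i], texts[i + 1]) for i in range(0, len(texts) - 1, 2)]

def featurize(pairs, vec):
    """Concatenate both samples' features so the classifier can exploit
    information shared across the IID pair (the gain shown in Figure 2c)."""
    a = vec.transform([p[0] for p in pairs]).toarray()
    b = vec.transform([p[1] for p in pairs]).toarray()
    return np.hstack([a, b])

# Toy stand-ins; in our experiments these are human/LLM outputs for XSum prompts.
human = ["the match ended in a draw", "rain delayed play on day two",
         "the striker scored twice", "fans left early in the cold"]
machine = ["the game concluded with parity", "precipitation postponed proceedings",
           "the forward netted a brace", "spectators departed prematurely"]

vec = TfidfVectorizer().fit(human + machine)
X = np.vstack([featurize(make_pairs(human), vec),
               featurize(make_pairs(machine), vec)])
y = np.array([0, 0, 1, 1])  # 0 = human pair, 1 = machine pair
clf = LogisticRegression().fit(X, y)
```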
------------------------------------------
## Response to Reviewer NSUG [Score 5, confidence 2]
> **Weakness 1:** The Eq. 28 in your proof seems not correct or an additional condition is needed. Maybe a step-wise derivation using the Chernoff bound is needed.
**Response to Weakness 1:** Thanks for the question. Equation 28 comes from the standard Chernoff bound (refer to [1, 2]), which gives bounds for both tails. It is important to note that the first bound in Eq. 28 concerns the event that at least $(p+\frac{\delta}{2})n$ samples **from $h(s)$** are in $A$, while the second concerns the event that at most $(p+\frac{\delta}{2})n$ samples **from $m(s)$** are in $A$. So one conditions on samples from $h(s)$ and the other on samples from $m(s)$.
Now, if we have $n$ samples from $h(s)$, $pn$ will be in $A$ on average and if we have $n$ samples from $m(s)$, $(p +\delta)n$ will be in $A$ on average.
For the first inequality of Equation 28, we can rewrite $P(X \geq (p+\frac{\delta}{2})n)$ (samples of $h$ are in $A$) as $P(X \geq pn + \frac{\delta n}{2})$; since the mean is $pn$, substituting these values into the right-tail Chernoff bound gives the first inequality.
Similarly, for the second inequality, we rewrite $P(X \leq (p+\frac{\delta}{2})n)$ (samples of $m$ are in $A$) as $P(X \leq (p+\delta)n - \frac{\delta n}{2})$; since the mean is $(p+\delta)n$, substituting these values into the left-tail Chernoff bound gives the second inequality of Equation 28.
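For completeness, a worked substitution (assuming the additive, Hoeffding-type form of the Chernoff bound for a sum $X$ of $n$ independent variables bounded in $[0,1]$; the multiplicative form substitutes analogously):
\begin{align}
P(X \geq \mu + a) \leq e^{-2a^2/n}, \qquad P(X \leq \mu - a) \leq e^{-2a^2/n}.
\end{align}
Taking $\mu = pn$, $a = \frac{\delta n}{2}$ for the first tail and $\mu = (p+\delta)n$, $a = \frac{\delta n}{2}$ for the second, both tails evaluate to $e^{-\delta^2 n/2}$.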
We will add these details in the proof for better clarity as well.
> **Weakness 2:** I guess there exists some errors in Figure 2(b) since the AUROC should always lies in [0.5,1]
**Response to Weakness 2:** That's a good catch. We checked and found that when the sequence length is very low (0-10%) and the AUC is near 0.5, the plotting package malfunctions and produces some bad values. Thanks for pointing this out; we have rectified it, as can also be seen in the new Figure 3 (general response PDF).
> **Question 1:** I wish I could only blame myself but I don't understand why the discussion and conclusion in 3.3 is not applicable in Non-IID case. Can you explain it a little bit more?
**Response to Question 1:** This is a great question! To prove the conclusion in Section 3.3, i.e., Theorem 1, we need Equation 28 to hold. However, the Chernoff bound used to prove Equation 28 only holds for independent random variables and does not apply to the non-IID case. We will clarify our motivation here in the updated version of this paper.
> **Question 2:** What is the large deviation theory for? I can not see why Eq. 12 will lead to Eq. 13.
**Response to Question 2:** Large deviation theory is the general approach we mentioned; its focus is on understanding the probabilities of extreme events as the number of observations increases. We apologize for the confusion in going from Eq. (12) to Eq. (13). We had mentioned the approach in line 201, following Proposition 1.
To obtain the expression in Eq. (13), we need to start from the inequality presented in Eq. (10) given by
\begin{align}
\text{TPR}_{\gamma}^n \leq \min\{\text{FPR}_{\gamma}^n + \texttt{TV}(m^{\otimes n}, h^{\otimes n}) ,1\},\tag{9}
\end{align}
where TPR is the true positive rate (defined in Eq. (10)) and FPR is the false positive rate (defined in Eq. (11)) in the paper. This relationship between TPR and FPR is known as the ROC curve in the literature. The upper bound (13) on the area under the ROC curve (AUROC) can be obtained by integrating the TPR upper bound in (9) with respect to FPR. Integrating both sides of (9), we can write
\begin{align}
\int_{0}^1\text{TPR}_{\gamma}^n \cdot d\text{FPR}_{\gamma}^n \leq \int_{0}^1\min\{\text{FPR}_{\gamma}^n + \texttt{TV}(m^{\otimes n}, h^{\otimes n}) ,1\}\cdot d\text{FPR}_{\gamma}^n.
\end{align}
Note that the left-hand side of the above expression is nothing but the AUROC; after calculating the right-hand side and substituting the TV norm definition from (12) in the main paper, we obtain the final expression of (13) in the main paper. We will expand this discussion in the final revision.
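Explicitly, writing $\tau := \texttt{TV}(m^{\otimes n}, h^{\otimes n})$ and splitting the integral at $\text{FPR}_{\gamma}^n = 1-\tau$, where the minimum switches from $\text{FPR}_{\gamma}^n + \tau$ to $1$, the right-hand side evaluates to
\begin{align}
\int_{0}^{1-\tau} (x + \tau)\, dx + \int_{1-\tau}^{1} 1\, dx = \frac{(1-\tau)^2}{2} + \tau(1-\tau) + \tau = \frac{1}{2} + \tau - \frac{\tau^2}{2},
\end{align}
which gives the closed-form upper bound of (13).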
## Response to Reviewer hBT7 [Score 6, confidence 4]
We appreciate the reviewer's valuable time in assessing our manuscript and recognizing its novel contributions. We have addressed the reviewer's queries in the responses provided below.
> **Weakness 1:** While I do find valuable merit with the mathematical derivations shown in the paper, I believe the experiments conducted in the later part are not diverse enough to support the claims, especially since the setup of choice seems to be a prompted base GPT-2 model (see concern on Questions section below). I would have expected the authors to survey a wide variety of models, not just GPT-2, but also task-specific customized/finetuned models such as BART for summarization, a SQUAD-trained GPT-2 model, or something similar. Any findings from this type of experiment would further test the validity of the “hidden possibility” claim and how it can/cannot extend to various types of models.
**Response to Weakness 1:** We agree with the reviewer and thank them for highlighting this point. To validate this, ***we performed detailed additional experiments*** with state-of-the-art detectors and generators in the ***new experiments of Figures 1(a-c) and 2(a-c)*** (detailed in the PDF attached to the rebuttal in the global response). We have also considered how performance varies when the generator is SOTA but the detector is not (trained only up to GPT-2 outputs). In all scenarios, we observe that our result holds true, i.e., detectability improves with increasing sequence length.
> **Weakness 2:** I also did not find proper justification as to why the author/s seem to simply use specific subsets from the datasets such as the Wikipedia from SQUAD. Are these the same datasets used in related works for testing effectiveness of AI-generated text detectors?
**Response to Weakness 2:** Yes, we have used the XSum and SQuAD datasets in the ***exact same way as*** all the recent literature [Sadasivan et al., Mitchell et al., Kirchenbauer et al., etc.] so as to be able to comment on detectability in comparison. However, as requested, we have also considered random samples from the XSum and SQuAD datasets and performed the ablation, with similar outcomes, as reported in the new results (***Figures 3(a)-(b) in the general response PDF***). Additionally, we have also considered the IMDb and Fake News datasets to test the efficacy of our proposal (Appendix Section C), and the results support our hypothesis.
> **Weakness 3:** While I understood the objectives and the main idea of the paper at first read, I suggest the key message be properly organized to preserve the narrative structure of the paper. I don’t understand the purpose of the hanging quote in Section 1 as I already understood the message even before reaching that part.
**Response to Weakness 3:** We thank the reviewer for the feedback. Our main motive for the quote is to provide a quick, high-level message of the paper for the general audience, without going into the details of the introduction and contributions. But we agree with you and will remove it from the final version.
> **Question 1:** Why treat OpenAI as the gold standard (as mentioned in the Abstract)? What’s the motativation behind this? I suggest more context in the form of discussion should be added to this as to not confuse the readers.
**Response to Question 1:** Our main motivation of quoting "OpenAI" in the abstract stems from its pivotal role as a leading entity in the development and dissemination of state-of-the-art language models like GPT-2, ChatGPT, and GPT-4. OpenAI's technical reports serve as one of the most comprehensive and authoritative sources of information on these models. Given the significant impact these models have had in recent years, and considering the limited access to proprietary details outside of OpenAI's official reports, it becomes imperative to use them as a benchmark. When our findings or theories resonate with the insights presented in these reports, it not only provides empirical validation to our results but also situates our work within a broader, recognized context. This approach ensures that our research is both grounded in established knowledge and contributes to the ongoing discourse in the field.
> **Question 2:** It seems like we need to be clear about the setup of the testing experiments involving the different datasets and tasks and prompting the LLM (GPT-2) to generate continuations and using this for the detection. In my understanding of what the author/s have written, the setup used is zero-shot which the GPT-2 is not finetuned at all but is prompted to generated text. Is this correct? If so, a major concern would now be that the results from the experiments conducted in this study would be methodologically different if a finetuned GPT-2 model is used or any model customized for a specific task (ex. BART models for summarization).
**Response to Question 2:** Thanks for the question. First, we want to highlight that we have followed the exact same experimental setup as in [1, 2, 3, 4] to be able to comment on the possibilities under the current detection and impossibility claims. In this setup, we prompt the LLMs with prompts generated from XSum and SQuAD and perform detection on the generated text, as in [1, 2, 3, 4]. SQuAD and XSum consist of articles and passages from Wikipedia or news articles, on most of which standard LLMs are already trained (either in pre-training or fine-tuning). That is also precisely the reason for the superior zero-shot performance of LLMs on such datasets [5, 6]. Hence, we believe the impact of distribution shift is not that critical in these scenarios. However, to diversify, we have tried additional SOTA generators and both old and SOTA detectors to validate our claim in the new experiments (see the PDF attached to the global response), and we found that our claim holds true.
> **Question 3:** How is TV distance different from the MAUVE metric (Pillutla et al 2021)? Do you think that MAUVE will have the same relationship as TV distance since both essentially just quantifies the dissimilarities between humans and AI-generated texts?
**Response to Question 3:** This is a great question! One motivation for using the TV distance stems from the mathematical analysis via Le Cam's lemma connecting it to the AUROC, as can be seen in Proposition 1, Eq. (13), of the submitted paper. Due to this, one can characterize the detection performance mathematically with the TV distance.
However, the TV distance is not always efficiently computable in high-dimensional scenarios. Hence, as the reviewer points out, the MAUVE metric, which uses a modified KL/JS-type divergence as in Equation 2 of (Pillutla et al., 2021), can be an alternative measure of divergence. Additionally, one can replace TV with KL theoretically using Pinsker's inequality and obtain an equivalent upper bound. We will definitely include a discussion of this in the updated draft.
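Concretely, by Pinsker's inequality,
\begin{align}
\texttt{TV}(m, h) \leq \sqrt{\tfrac{1}{2}\,\texttt{KL}(m \,\|\, h)},
\end{align}
so substituting this into the upper bound of Eq. (13) yields an equivalent (slightly looser) bound in terms of the KL divergence, the quantity underlying MAUVE-style metrics.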
> **Limitations:** The authors should be very clear with the limitations of the experiment procedure, especially with the type of language model used for generating texts for the detection part with traditional ML and Roberta models. This can be added as an additional section in the main paper of Appendix.
**Response to Limitations:** We thank the reviewer for pointing this out. We have now added an additional set of experiments covering model diversity, SOTA detectors like DetectGPT, and distribution shift across detectors and models. As the reviewer suggests, we will update the final version with the limitations of the updated experiments.
================ AFTER SUBMITTING THE REBUTTAL ================
> **Comment 1:** This is to acknowledge that I have fully read both my fellow reviewers' feedback as well as the author/s' response to my own assessment. For the rebuttal period, the authors have provided some additional information for my concerns listed in my review (see weaknesses section). With this, I would like to summarize some points that the authors are strongly recommended add to the final paper in case of acceptance. This will ensure that any claims made on the paper are properly supported with evidence as well as discussed thoroughly.
**Response to Comment 1:** We are extremely thankful to the reviewer for appreciating our efforts and increasing the score. The feedback provided by the reviewer has been invaluable in improving the quality of our paper.
We will make sure to update the final version of our paper with responses to Weakness #1, Question #2, and Question #3.
-----------------------------------------------------------------
## Response to Reviewer RcT9 [Score 3, Confidence 4]
We would like to express our gratitude to the reviewer for dedicating their valuable time and effort towards evaluating our manuscript. The detailed responses are provided below.
>**Weakness 1:** The paper's contribution is relatively small since the bound on the AUROC (first half of Section 3.1) was already presented in previous work by Sadasivan et al. The paper extends this bound to take into account the availability of a collection of IID text samples, rather than a single sample. However, this extension is not written very clearly, and I had trouble following the logic, since the actual definition is a "sample" is never articulated.
**Response to Weakness 1:** We respectfully disagree with the reviewer regarding the contribution. Our paper provides the first rigorous mathematical framework to characterize the hardness of the detection problem and derive a precise sample-complexity analysis, which we believe is extremely crucial at the current time.
To be precise, we would first like to clarify that ***it is not a direct extension*** of the results of Sadasivan et al. to IID text samples; rather, it ***rectifies the misleading impression of the impossibility result in Sadasivan et al.*** We would like to highlight that the AUROC bound mentioned in Sadasivan et al. is also not new --- it comes from Le Cam's lemma, which we clearly mention in our paper (line 158). Furthermore, we explicitly discuss in Appendix B.1 (lines 587-598) the relationship of our analysis to that of Sadasivan et al. We not only derived the total variation upper bounds but also a precise sample complexity analysis relating the AUROC to the number of samples required and the Chernoff information.
***Impact:*** The sample complexity results not only provide a theoretical characterization of detectability but can also help design adaptive and better watermarks by utilizing the trade-off between Chernoff information and sample complexity. We highlight this for the design of watermarks and detectors in Appendix A of the supplementary material. Additionally, it can help in designing regulations, such as imposing length constraints in exams for improved detectability.
>**Weakness 2:** Also, I am concerned some of the authors claims are over-stated. For example, in the abstract they write "we provide evidence that it should almost always be possible to detect aI-generated text unless the distributions of human and machine generated texts are exactly the same over the entire support." This sentence should have the added condition, "when there are sufficient number of samples." Furthermore, the authors claim "we can always gather more samples to spot AI generated text" (line 64) but this really isn't true. There are critical applications of AI-generated text detection where it is not straightforward to collect more samples. These include checking for cheating in student essays (where only a single essay may be available per student) and identifying fraudulent product reviews (where a bad actor might post from many different accounts).
**Response to Weakness 2:**
We admit that some claims should be made more precise by adding conditions as suggested by the reviewer, and we will make the modifications accordingly. However, immediately following the quoted line in our abstract, in the very next sentence, ***we discuss the necessity of additional samples***.
We agree that it may not always be possible to collect more samples, but our non-IID results show that if the sequence length is sufficiently high, detectability increases significantly, as can be seen from Theorem 2 and the empirical evidence in Figures 2, 3, and 4 (submitted paper) and Figures 1, 2, and 3 (in the new PDF). In all the empirical experiments (other than Figure 2c in the submitted paper), we do not assume IID samples; rather, the improvement in detection is shown as a function of the sequence length, which is reasonable to obtain in most cases.
Also, as discussed in our insights, catching students cheating using detectors is not the most critical application; flagging hateful/violent/offensive content from social bots is more critical, and in many of those cases we actually can obtain many IID samples.
> **Issue 1:** At the start of the Datasets section, it says four datasets were employed, but I only see results on two. Additionally, the description of what is in these datasets and why they were chosen for evaluation should be much longer. The motivating example in Section 3.1 is Twitter, but no Twitter datasets are used in evaluation. This omission should be justified or a different motivating example should be used in 3.1.
**Response to Issue 1:** We want to ***clarify that we indeed used four datasets: XSum, SQuAD, IMDb, and Fake News*** (the last two analyzed in the Appendix). We consider datasets and experiments along the same lines as contemporary works [2, 3, 4] to be able to comment on the possibilities. We have also added new experiments leveraging SOTA generators and SOTA detectors to validate our hypothesis; please refer to the PDF submitted with the global response.
> **Issue 2:** Figure 2 appears in the text before it is described. It should be moved to the end of Section 4.3 to avoid reader confusion.
**Response to Issue 2:** Thanks! We will re-position it in the updated draft.
> **Issue 3:** The paper's use of the word "ngram" in Section 4.2 is nonstandard and could create confusion (it did for me). If I understand correctly, when the ngram level is set to 1, the detector is making a decision based off of one word; when it is set to 5, the detector is making a decision off of 5 words, and so on. My original interpretation was that you were taking all the ngrams in a document, and making a classification based on this bucket of ngrams. Also, 4.2(1) talks about choosing n such that the ngram size reaches the paragraph level, but Figure 2A also shows ngram up to n=6, which is far less than a paragraph.
**Response to Issue 3:** We respectfully disagree with the reviewer. The ***use of n-grams is extremely standard in NLP and ML***, especially for TF-IDF-based representation learning, and is standard in several pipelines (e.g., nltk, sklearn) and algorithms. However, for readers unfamiliar with these concepts, we will definitely explain it in detail in the updated draft.
Secondly, the reviewer is correct that when we consider n-grams of size 6 or 7, this does not mean the paragraph level but rather the sentence level. The ***main purpose of using n-grams*** was to be able to ***compute the total variation exactly***, without any approximation, and thereby provide an exact empirical counterpart to our theoretical analysis. For example, in Figure 2a of the submitted version, it is clear that TV increases with length, and here the TV is exact: when $\Omega$ is countable, we can compute the TV distance as $TV(P, Q) = \frac{1}{2} \sum_{\omega} |P(\omega) - Q(\omega)|$.
Hence, our intention was to bridge the gap between theory and practice exactly, which is missing from several works. The main point to emphasize is that in all such cases, even with an n-gram level of 6, the TV distance increases significantly, from 50% to 95%, with the n-gram level. ***This clearly highlights the hidden possibility in detection ignored by existing works***.
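For reference, a minimal sketch of this exact TV computation over empirical n-gram distributions (the helper names and word-level tokenization are illustrative assumptions):
```python
from collections import Counter

def ngrams(text, n):
    """Extract word-level n-grams from a text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def empirical_tv(human_texts, machine_texts, n=3):
    """Exact TV distance between empirical n-gram distributions:
    TV(P, Q) = 0.5 * sum_w |P(w) - Q(w)| over the joint support."""
    p, q = Counter(), Counter()
    for t in human_texts:
        p.update(ngrams(t, n))
    for t in machine_texts:
        q.update(ngrams(t, n))
    p_total, q_total = sum(p.values()), sum(q.values())
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[w] / p_total - q[w] / q_total) for w in support)
```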
> **Issue 4:** Figure 3 shows results that are pretty similar to Figure 1a in the paper "Automatic Detection of Generated Text is Easiest when Humans are Fooled." It might be worth citing this.
**Response to Issue 4:** Thanks! We will cite the work in the updated draft.
> **Issue 5:** While, I appreciate the authors including their code, it would have been very helpful if they could have also included examples CSVs, or a description of the CSV format, as well as a README with run instructions.
**Response to Issue 5:** We have added a link to the generated samples for reference.
> **Question 1:** The paper makes it very unclear what unit a sample is. Is a sample a single word? A single sentence? A single Tweet? Does it matter for the bounds described? Similarly, what unit is considered an independent subset (Section 3.4)?
**Response to Question 1:** We define $s$ as the random variable indicating the textual representation over which $h(s)$ and $m(s)$ are defined, as in [Sadasivan et al., Mitchell et al.]. Hence, the unit can be anything: a word, a sentence, or a passage, depending on the domain and context.
Our point is that if, in a particular domain/context, we consider detection at the sentence level and the TV distance between the human and machine distributions is low at that level, then the prior result says it is impossible to detect. Our result states that if we can collect multiple IID samples (sentences), or even a passage consisting of multiple sentences, we can still detect with very high confidence whether the text came from a machine or a human, thereby highlighting the hidden possibility for detection. The specific choice of unit to represent $s$ does not affect our bounds; the sample-complexity result takes care of this.
> **Question 2:** What is the motivation for section 3.4? Why is the IID case insufficient to represent the real world?
**Response to Question 2:** The primary motivation stems from the fact that it may not always be possible to collect more IID samples in some scenarios, for example, when detecting cheating by students (as also mentioned by the reviewer in Weakness 2): we may have only one answer per prompt. However, extending the results to the non-IID case helps, as we can ask the students for longer answers or set word limits. It is much easier to obtain longer samples/contexts, and in such cases even a small increase in length improves detectability, as can be seen in all our empirical demonstrations as well as the new experiments.
> **Question 3:** "this is a natural assumption in NLP where a large paragraph often consists of multiple topics, and sentences for each topic are dependent." This sentence feels weird to me. In NLP we often have documents which consist of multiple paragraphs, which are dependent on each other. Can you clarify what you mean by this assumption?
**Response to Question 3:** Here we wanted to emphasize the autocorrelation structure typically present in natural language texts, where the strength of association decreases over the sequence length, and there can be independent structures in the passage, with the number of independent components given by $L$. As the reviewer mentions, it can happen that there are no independent components and all the sequences depend on each other. Our theory already handles such a structure with $L=1$, and even in that case it is natural to assume that the dependency decreases over the sequence, which is a standard assumption for any autocorrelated structure.
> **Question 4:** In the experiments, how long on average are the sequences being detected?
**Response to Question 4:** The maximum length of the sequences considered is 500-600 tokens, and we show that as the percentage of the length used increases, detectability improves significantly with reasonable detectors.
> **Limitations:** The paper's Conclusions and Future Works section addresses some limitations of the method, such as the need for a large number of samples when the human and AI distributions are very close. Other limitations and societal impacts the authors could consider mentioning are:
1. Some domains of human-written text might have distributions which differ more from the AI text than others.
2. The paper does not address the case where a document (or a corpus of documents) contains a mixture of human-written and AI-generated content. An example of this would be a Twitter account where half the posts are generated. (I agree this case is well beyond the scope of the paper, but it is probably worth mentioning.)
3. When more samples are available, the harm being caused by the AI-generated content is also greater, since there is a greater chance people will see and be fooled by the content.
**Response to Limitations:**
For point 1, we agree: certain domains that require specialized knowledge, such as medicine, advanced mathematics, physics, and business, produce human/expert-written texts quite different from AI-generated ones, making them easier for detectors. We will add this discussion to the updated draft.
For point 2, we agree that this is a harder problem and is part of our future research. As suggested, we will add it to the discussion in the updated draft.
For point 3, when more samples are generated, the content can be detected more easily by the detector, and hence we can flag it as AI-generated before it can fool users.
-------------------