## Commitment Message
We would like to express our sincere appreciation for your dedicated efforts in facilitating the conference, as well as those of the action editor and the five reviewers.
Based on the insightful feedback from the action editor, we plan to revise the final version of our manuscript as follows:
1. We will expand the discussion on the partial contribution of the document to the summary, supplementing the current discussion in Appendix H.
2. We will clarify that the human evaluation mentioned in the meta-review represents the human-machine agreement rate. This result is already included in the manuscript (Lines 267–271).
3. We will provide a more detailed explanation of the implementation details, enhancing the current explanation in Appendix C.
These proposed additions and revisions are supplementary and can be seamlessly integrated into the camera-ready version without compromising the integrity of the current manuscript.
Lastly, regarding the “recommendation for venues” in the meta-review, we fully appreciate the thorough consideration of the action editor and reviewers. However, we would like to emphasize that we submitted our paper to the “Resources and Evaluation” area, underscoring our dedication to introducing a new public dataset, Multi-News+, that addresses the limitations of previous datasets, rather than to performing experiments on other tasks and datasets.
Once again, we extend our gratitude for your time and consideration in the conference process. We look forward to refining our work to meet the high standards of EMNLP.
## FcuV (S3 O3) -> New Reviewer
Thank you for dedicating your time to a comprehensive review of our manuscript. We deeply appreciate your acknowledgment of our clear motivation and organization, the contribution that leverages LLMs to enhance the quality of existing datasets, and our efforts to establish reproducibility. We would like to address the reviewer’s concerns as follows:
### Response to Weakness 1 - Weak Baseline
> The experiments in the article are slightly weak. The main experiments only consider the BART and T5 models. The author should consider adding more baseline language model evaluations, such as Llama.
Thank you for your insightful feedback. While we acknowledge the importance of including LLaMA as a baseline model, we were constrained by the page limit. However, we would like to highlight that Appendix D includes additional experiments using LLMs, specifically Mistral-7B-Instruct-v0.2 and Llama-2-7b-chat-hf. In these experiments, we found that the inclusion of noisy examples in a two-shot summarization scheme leads to a decrease in the quality of the summaries generated by the LLM. This experiment underscores the importance of filtering noisy data for LLM-based summarization. We plan to reorganize the manuscript to incorporate this analysis into the main content upon acceptance, as an additional page is provided for accepted papers.
### Response to Weakness 2 - Methodological Novelty
> The design of the article's methodology is somewhat weak, although the motivation is clear. It seems that the article only uses LLMs and voting mechanisms, which is somewhat like a preliminary exploration work.
We understand the concern regarding the novelty of our methodology. While our method may not seem methodologically innovative, it is worth noting that, after a thorough investigation including a recent survey of the field [[1]](https://arxiv.org/pdf/2402.13446), we found no existing studies that leverage majority voting for LLM-based data annotation. To our knowledge, our work is the first to demonstrate the effectiveness of this approach.
Additionally, according to [call for papers of ACL rolling review](https://aclrollingreview.org/cfp), a short paper is not required to describe a ‘completed’ work, unlike long papers. Our work aligns with this guideline by proposing a focused contribution—the effectiveness of majority voting in LLM-based data annotation through a case study—and the release of the Multi-News+ dataset.
```
[1] Tan et al., Large Language Models for Data Annotation: A Survey, ArXiv Preprint arXiv:2402.13446 (2024).
```
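For concreteness, the following is a minimal sketch of the majority-voting annotation step, assuming five independent GPT-3.5 turbo judgments per document; the identifiers and the tie-handling rule are illustrative only, and the exact prompts and procedure are those described in the manuscript and the released repository.
```python
# Minimal illustrative sketch: aggregate several independent LLM annotator
# labels for one document by majority vote. Identifiers are hypothetical.
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the label chosen by most annotators; keep the document on a tie."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return "relevant"  # conservative choice: never delete on a tie
    return top_label

# Example: five independent GPT-3.5 turbo judgments for one document
votes = ["relevant", "noisy", "relevant", "relevant", "noisy"]
print(majority_vote(votes))  # -> "relevant", so the document is kept
```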
Once again, we appreciate your thorough review and feedback. We hope this response addresses your concerns.
## a69B (S3.5 O3.5) -> New Reviewer
We are grateful for your thorough review and positive evaluation of our manuscript, especially your recognition of our primary motivation: improving the quality of existing datasets without human annotation costs. Below, we address the reviewer’s concerns and comments.
### Response to Weakness - Manual Analysis
> Instead of a case analysis in the appendix, can you manually analyze a small portion of the dataset, let's say only 30-50 instances, and compare the results with LLM, so we know the precision / recall of the LLM's judgments? I think the soundness score can be further improved, if this is added.
We extend our gratitude for this constructive feedback. In response, we randomly selected 60 instances from the validation set; these instances comprise 153 source documents in total. We defined our confusion matrix as follows:
- True Positive (TP): Documents that are relevant and correctly classified as relevant.
- False Positive (FP): Documents classified as relevant but are actually not relevant.
- True Negative (TN): Documents that are not relevant and correctly classified as not relevant.
- False Negative (FN): Documents that are relevant but incorrectly classified as not relevant.
After examination, we found that 127 documents were TP, while there were 24 TN documents and 2 FN documents. The annotation framework predicted 26 documents as noisy or irrelevant, approximately 17% of the 153 documents in total. This aligns with the statistics from Table 3 in Appendix A, which indicate that 18.6% of documents in the validation set are classified as noisy.
Based on this result, the precision is 1.0 as there are no FP documents, and the recall is approximately 0.984. Additionally, we found that 17 out of the 24 TN documents could be considered noisy system messages such as "This will appear next to all of your comments this will not appear anywhere on newser," while the remaining 7 documents were irrelevant to the summary.
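The figures above can be reproduced with the following small worked check (plain arithmetic over the counts reported here, not part of the annotation pipeline):
```python
# Worked check of the reported numbers: TP = 127, FP = 0, TN = 24, FN = 2
# over the 153 inspected documents.
tp, fp, tn, fn = 127, 0, 24, 2

precision = tp / (tp + fp)                           # 127 / 127 = 1.0
recall = tp / (tp + fn)                              # 127 / 129 ≈ 0.984
predicted_noisy = tn + fn                            # 26 documents flagged as noisy
noisy_ratio = predicted_noisy / (tp + fp + tn + fn)  # 26 / 153 ≈ 0.17

print(f"precision={precision:.3f}, recall={recall:.3f}, noisy_ratio={noisy_ratio:.3f}")
```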
Moreover, we investigated the 2 FN cases. In one case, only the very end of the summary was connected to the misclassified document. In the other case, the misclassified document provided context for the summary but was not directly connected to it.
It is noteworthy that, while a single annotator sometimes produced incorrect classifications, the majority voting process was helpful in correcting these errors. This indicates that our proposed method is beneficial for ensuring the quality of data annotation and dataset cleansing.
We are grateful for this constructive feedback and we hope that this analysis satisfies the reviewer.
### Response to Comment 1 - Usage of GPT-4.5
> The price for the GPT-4.5 turbo reduced a lot recently. Will stronger LLM lead to better performance?
We appreciate this insightful comment. Indeed, the recently released GPT-4.5 turbo (also known as GPT-4o) has seen a significant price reduction. To investigate its potential in data annotation, we tested GPT-4.5 turbo on the two FN cases from the analysis above, where GPT-3.5-based data annotation had failed. We found that GPT-4.5 turbo correctly identified both cases that GPT-3.5 turbo had missed.
However, despite its superior performance, GPT-4.5 turbo remains significantly more expensive than GPT-3.5 turbo. Thus, we propose two cost-effective strategies to leverage GPT-4.5 turbo’s strengths:
- **Weighted Majority Voting**: Here, GPT-4.5 turbo is used in combination with GPT-3.5 turbo annotators. Given its superior performance, GPT-4.5 turbo is assigned more weight in the annotation process. For instance, instead of using five GPT-3.5 turbo models, we could use three GPT-3.5 turbo models and one GPT-4.5 turbo model, where GPT-4.5 turbo has double the weight of a GPT-3.5 turbo annotator. This approach treats GPT-4.5 turbo as an expert annotator, where agreement between at least one GPT-3.5 annotator and GPT-4.5 turbo results in the deletion of the document (a minimal sketch of this decision rule follows the list below).
- **Verification of Annotation Result**: In this approach, GPT-4.5 turbo does not participate in the initial annotation process. Instead, it verifies the results of GPT-3.5 turbo annotators by taking the set of documents, summaries, and annotation results as input, thus acting as an expert annotator supervising the outputs of standard annotators.
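For illustration, below is a minimal sketch of the Weighted Majority Voting strategy above, assuming three GPT-3.5 turbo annotators with weight 1 and one stronger "expert" annotator with weight 2; the annotator names, weights, and decision rule are hypothetical and would be finalized in future work.
```python
# Hypothetical weighted-voting rule: standard annotators carry weight 1,
# the expert annotator carries weight 2; a document is deleted only when
# the weighted "noisy" score strictly exceeds the weighted "relevant" score.
def weighted_vote(votes: dict[str, str], weights: dict[str, float]) -> str:
    score = {"relevant": 0.0, "noisy": 0.0}
    for annotator, label in votes.items():
        score[label] += weights.get(annotator, 1.0)
    return "noisy" if score["noisy"] > score["relevant"] else "relevant"

votes = {"gpt35_a": "noisy", "gpt35_b": "relevant", "gpt35_c": "relevant", "expert": "noisy"}
weights = {"gpt35_a": 1.0, "gpt35_b": 1.0, "gpt35_c": 1.0, "expert": 2.0}
print(weighted_vote(votes, weights))  # -> "noisy": the expert plus one standard annotator agree
```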
We are committed to exploring these methods in future work and will include this discussion in the revised manuscript. We appreciate your valuable feedback.
### Response to Comment 2 - Relevant Work
> Please consider citing the paper below as it is relevant. arXivEdits: Understanding the Human Revision Process in Scientific Writing; Chao Jiang, Wei Xu, Samuel Stevens; EMNLP 2022
We appreciate this suggestion. We have reviewed the paper and found it highly relevant to our work, particularly in its exploration of the revision process to enhance document quality. We will ensure to cite this work in the updated manuscript.
Your feedback and insights will significantly strengthen our study, and we once again extend our sincere gratitude.
## Yc5G (S3 O2.5) -> New Reviewer
We appreciate your comprehensive review and are grateful for recognizing the relevance and importance of our research problem. We have carefully considered your comments and concerns and address them as follows:
### Response to Weakness 1 - Generalizability of Proposed Approach
> Without further experiments on more datasets and more models, the generalizability of this approach is unclear. The minuscule improvement in performance further corroborates this concern.
We understand the concern regarding the generalizability of our approach. However, it is important to note that our paper is a short paper, as defined by the [call for papers](https://aclrollingreview.org/cfp). This type of paper is intended to propose a small, focused contribution. Accordingly, we presented a case study to validate the effectiveness of dataset cleansing using LLM-based data annotation with majority voting.
Hence, in this short paper, we focused on enhancing the soundness of our manuscript rather than expanding the approach to other datasets. In this spirit, we followed the reviewer's guidance by adding experiments with other dataset cleansing baselines, as detailed below.
### Response to Weakness 2 - Other Data Cleansing Baselines
> Lack of any other data cleansing baseline results. It is important to answer how effective other data cleansing methods are on your experimental setup.
We appreciate this feedback. In response, we applied the rule-based dataset cleansing method proposed in a previous study [[1]](https://aclanthology.org/2022.lrec-1.614.pdf), which we also analyzed in Appendix E. Using this method, we created a dataset named Multi-News Ablation and trained BART-large-CNN and T5-base models on it.
The results are shown below. They suggest that the application of this dataset cleansing method actually degrades downstream task performance, likely due to the different characteristics of multi-document summarization datasets compared to single-document summarization datasets. This supports our analysis in Appendix E. We will include these results in the updated manuscript. We once again appreciate your insightful feedback.
| **BART-large-CNN** | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** | **BERTScore** | **BARTScore** |
|:---------------------------------:|:-----------:|:-----------:|:-----------:|:-------------:|:-------------:|
| Multi-News (Original) | 48.64 | 18.86 | 24.11 | 0.6401 | -2.763 |
| Multi-News+ (Ours) | 49.17 | 19.04 | 24.36 | 0.6418 | -2.698 |
| Multi-News Ablation (TeSum-based) | 47.48 | 18.27 | 23.81 | 0.6362 | -2.767 |

| **T5-base** | **ROUGE-1** | **ROUGE-2** | **ROUGE-L** | **BERTScore** | **BARTScore** |
|:---------------------------------:|:-----------:|:-----------:|:-----------:|:-------------:|:-------------:|
| Multi-News (Original) | 40.11 | 13.90 | 21.58 | 0.6003 | -2.407 |
| Multi-News+ (Ours) | 40.45 | 14.17 | 21.84 | 0.6027 | -2.362 |
| Multi-News Ablation (TeSum-based) | 39.30 | 13.65 | 21.42 | 0.5967 | -2.457 |
```
[1] Urlana et al., TeSum: Human-Generated Abstractive Summarization Corpus for Telugu, LREC 2022.
```
### Response to Weakness 3 - Methodological Insights
> The paper lacks methodological or algorithmic insight and is a simple majority vote on a prompting output.
We acknowledge the reviewer's concern regarding the methodological insight of our manuscript. However, while our method may not appear methodologically innovative, our work is the first to validate the effectiveness of majority voting for LLM-based data annotation: after a comprehensive investigation, including a survey of the field [[1]](https://arxiv.org/pdf/2402.13446), we could not find any previous work utilizing this approach.
Additionally, we believe simplicity is crucial in LLM-based data annotation. A complex method increases the likelihood of errors during the annotation process, leading to higher costs and instability. For example, a previous study had to set an error-tolerance threshold and exclude data exceeding it [[2]](https://aclanthology.org/2024.findings-eacl.2.pdf). A straightforward approach reduces these risks and makes the method easier to understand and implement, especially for non-experts in deep learning.
Lastly, our contribution also includes the release of the enhanced dataset, Multi-News+, which is already available in the anonymized repository linked in our manuscript. We believe this provides a significant and compelling contribution for a short paper.
```
[1] Tan et al., Large Language Models for Data Annotation: A Survey, ArXiv Preprint arXiv:2402.13446 (2024).
[2] Choi et al., GPTs Are Multilingual Annotators for Sequence Generation Tasks, EACL 2024 Findings.
```
We once again extend our gratitude for your comprehensive review, which has helped us improve the strength of our study. We hope this response addresses your concerns.
## jLu7 (S4, O2.5) -> Returning Reviewer
First of all, we would like to extend our sincere gratitude for the continued engagement of the reviewer. Your comments from the previous round significantly enhanced the quality of our study. We are pleased to address your current concerns as follows:
### Response to Weakness 1 - Contribution and Novelty
> I believe that the contribution of the paper remains insufficient. Utilizing LLMs as annotators with a self-consistency mechanism is not methodologically novel. Applying this approach to a single downstream task and dataset does not have enough impact.
We acknowledge the reviewer's concern regarding the contribution and novelty of our work. However, while our method may not appear highly innovative, it is important to note that this is the first study to leverage majority voting for LLM-based data annotation. Our thorough literature review, including a survey paper in this field [[1]](https://arxiv.org/pdf/2402.13446), found no previous work utilizing this approach. Thus, our study makes a significant contribution by validating the effectiveness of majority voting in this context.
Moreover, we believe simplicity in LLM-based data annotation is beneficial. A complex method can lead to higher error rates and costs. For example, a previous study set a threshold for error patience and excluded data exceeding this threshold [[2]](https://aclanthology.org/2024.findings-eacl.2.pdf). A straightforward framework reduces these risks and makes the method accessible to non-experts in deep learning.
Furthermore, we would like to note that our contribution also includes the release of the enhanced dataset, Multi-News+, in the anonymized repository linked in our manuscript. This dataset is a valuable resource for future research. Consequently, we believe that our method is novel within LLM-based data annotation, that our study provides insights for future research, and that the released dataset will facilitate future work building on this approach.
```
[1] Tan et al., Large Language Models for Data Annotation: A Survey, ArXiv Preprint arXiv:2402.13446 (2024).
[2] Choi et al., GPTs Are Multilingual Annotators for Sequence Generation Tasks, EACL 2024 Findings.
```
### Response to Weakness 2 - Expanding to Various Tasks
> As stated in my original review: To increase the paper's impact, the authors should consider conducting a more comprehensive survey rather than focusing on a single use case study. A categorization of dataset quality issues across various NLP tasks and a generalizable LLM-based solution would be particularly valuable.
We fully acknowledge the reviewer's suggestion to expand the current study to other tasks and datasets. We are already exploring methods to improve the quality of other datasets, particularly those for multimodal tasks. Specifically, we are investigating methods to verify labels and re-annotate noisy data.
However, we believe these ideas differ significantly from our current study and warrant a separate manuscript. This paper is a short paper, which is intended to propose a small, focused contribution, according to the [call for papers](https://aclrollingreview.org/cfp). We have focused on enhancing the soundness of our current work and providing the Multi-News+ dataset as a valuable contribution.
We once again extend our sincere gratitude for your constructive feedback and continued engagement, which have been crucial in enhancing the quality of our study. We hope this response addresses your concerns.
## sugy (S3, O3) -> Returning Reviewer
We are grateful for your continued engagement from the previous round of ARR. Your comprehensive review and insightful feedback were of significant assistance in enhancing the quality of our paper. For instance, based on your guidance, we were able to include an additional experiment that leverages LLMs as a baseline, as well as a qualitative analysis. We are delighted that these efforts led to an improved evaluation from the reviewer. We are pleased to address your remaining concerns as follows:
### Response to Weakness - Partial Contribution
> How does the proposed approach handle cases where the documents contribute partially to the summary?
We are grateful for this feedback. Based on your comment from the previous round, we included a dedicated error analysis in Appendix H. This analysis revealed that errors primarily arise from documents that contribute only partially to the final summary. Specifically, GPT-3.5 models struggle to distinguish relevant from irrelevant information within a single document.
To address these issues in future research, we plan to leverage superior models such as the recently released GPT-4o. We propose two potential strategies:
- **Weighted Majority Voting**: Here, GPT-4.5 turbo is used alongside GPT-3.5 turbo annotators. Given its superior performance, GPT-4.5 turbo is assigned more weight in the annotation process. For example, instead of using five GPT-3.5 turbo models, we could use three GPT-3.5 turbo models and one GPT-4.5 turbo model, where GPT-4.5 turbo has double the weight of a GPT-3.5 turbo annotator. This approach treats GPT-4.5 turbo as an expert annotator, where agreement between at least one GPT-3.5 annotator and GPT-4.5 turbo results in the deletion of the document.
- **Verification of Annotation Result**: In this approach, GPT-4.5 turbo does not participate in the initial annotation process. Instead, it verifies the results of GPT-3.5 turbo annotators by taking the set of documents, summaries, and annotation results as input, thus acting as an expert annotator supervising the outputs of standard annotators.
We believe these strategies could significantly reduce errors and improve annotation quality in future studies.
Once again, we extend our profound gratitude for your continued engagement and constructive feedback, which have been instrumental in enhancing the quality of our study. We hope this response addresses your concerns.