# ACL 2024 Rebuttal - 1921 (Multi-News+)

## Reviewer bSjA (Soundness 3, Overall 2)

Thank you for dedicating your time to a comprehensive review of our manuscript. We deeply appreciate your acknowledgment of the significance of dataset cleansing for enhancing the quality of the given dataset. Our intention to make the Multi-News+ dataset publicly available stems from our commitment to advancing the field and serving the research community. Your feedback is invaluable to us, and we are eager to address and clarify any concerns you have raised to ensure the quality and clarity of our work.

### Response to Weakness 1: Limited Novelty

> There is a lack of novelty and significance. In fact, dataset cleansing/filtering with language models is not a new idea. There is already a line of work on improving the factuality of summarization datasets through filtering [1, 2, 3]. Simply using GPT-3.5 to filter one dataset does not seem like enough contribution.

We appreciate your feedback regarding the lack of comparison between our work and previous works. Our paper aims to introduce large language models to enhance the quality of real-world datasets, which distinguishes it from the existing studies [1, 2, 3]. The previous works the reviewer mentioned applied data filtering to training datasets to improve the factual consistency of abstractive summarization models. In contrast, we focus on the significant noise and extensive cleaning required in datasets primarily obtained through web crawling [4]. To address this data cleaning challenge, we utilized large language models to reduce human effort, eliminating the need for external modules and expensive fine-tuning processes, unlike in previous studies [1, 2, 3]. Furthermore, instead of using multiple fine-tuned models and filtering data based on fixed thresholds through intersection [3], we implemented methods such as majority voting and chain-of-thought reasoning to effectively imitate the processes of human experts.

This approach not only provides a rationale for decision-making and improves transparency, but also removes the need to fine-tune each model individually. Therefore, our paper emphasizes our attempt, particularly using large language models, to enhance real-world datasets such as those obtained through web crawling. Furthermore, we believe that a core contribution of our paper lies in the release of the Multi-News+ dataset, an enhanced version of the original Multi-News dataset. As stated in the manuscript, we plan to publicly release the dataset for future research. We are committed to including this discussion of previous works in our updated manuscript.

```
[1] Kazuki Matsumaru, Sho Takase, and Naoaki Okazaki. 2020. Improving truthfulness of headline generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1335–1346, Online. Association for Computational Linguistics.
[2] Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, and Bing Xiang. 2021. Entity-level factual consistency of abstractive text summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, Online. Association for Computational Linguistics.
[3] Yanzhu Guo, Chloé Clavel, Moussa Kamal Eddine, and Michalis Vazirgiannis. 2022. Questioning the validity of summarization datasets and improving their factual consistency. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5716–5727, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
[4] Julia Kreutzer et al. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
```

### Response to Weakness 2: Necessity of Extensive Experiments

> Downstream task experiments are limited to fine-tuning T5 and BART. State-of-the-art instruction tuning approaches for decoder-only models were not tested.

We acknowledge the concern raised by the reviewer regarding the limited extent of the experiments. We are planning to conduct additional experiments using LLMs such as Mistral and LLaMA. Although we could not include the experimental results in this initial rebuttal due to the limited time, we will make sure to include this experiment in the updated version.

### Response to Weakness 3: Evaluation on Annotation

> Annotations by GPT-3.5 were not evaluated.

We deeply appreciate the reviewer's feedback concerning the evaluation of the annotations performed by GPT-3.5. Our initial intention was to showcase the feasibility of dataset cleansing with the proposed method, as the manuscript is a short paper. However, we acknowledge the importance of evaluating the annotations. In response to your invaluable feedback, we performed an initial evaluation on the 379 sets classified as containing no relevant source articles, as mentioned in Lines 213-214 and Appendix C. We examined these examples and found that 93.93% of the 379 sets were correctly classified as consisting entirely of noisy documents. Moreover, we provide an error analysis of several wrongly classified cases. Upon examining IDX 1890, we observe that sentences picked up during crawling, such as "You always have the option to delete your tweet location history. Learn more," are included in the original text. The frequency of such sentences, especially when they do not clearly specify proper nouns, can result in a lack of correlation with the summary. Additionally, in IDX 18846, the summary describes content related to the victims of a suicide bombing, whereas the original text is written from the perspective of the suicide bomber.
This indicates that GPT may not distinguish differences in expression that arise when a narrative is converted into a summary. We are committed to incorporating this evaluation and analysis into the updated version of the manuscript. Once again, we sincerely appreciate your invaluable feedback.

### Response to Comment 1: Contribution of the Paper

> While the topic of data quality enhancement is of interest to the community, the contribution of this paper is too trivial to warrant publication at a major venue. In its current form, it might be of interest for a workshop. To broaden the impact of the paper, the authors may consider conducting a more comprehensive survey instead of a single use case study. It would be interesting to see a categorization of dataset quality defaults across different NLP tasks and a generalizable LLM based solution.

We acknowledge the reviewer's viewpoint regarding the contribution of our paper. However, we ask the reviewer to consider that the manuscript is a short paper, which is intended to present *[small, focused contributions](https://aclrollingreview.org/cfp)* such as a *[proof of concept](https://aclrollingreview.org/reviewertutorial)*. In this paper, we presented an initial idea for cleansing and enhancing the quality of a real-world dataset, which aligns with the purpose of a short paper. Additionally, our proposed method exhibits various strengths, such as exposing the rationale behind each annotation through the adoption of the chain-of-thought technique, which eases further investigation by human annotators. Lastly, we presented a cleansed version of an existing dataset, Multi-News+. We will incorporate this discussion into the updated version of the paper, strengthening the presentation of our contributions.

### Response to Comment 2: Construction Process of Multi-News

> There should have been more detailed explanations on how the Multi-News was constructed (in addition to "automated crawling from the Internet Archive"), for the readers to understand why the noisy documents were included in the dataset.

We appreciate this kind comment on the necessity of a more detailed explanation of the construction process of the original Multi-News dataset. We will make sure to update the manuscript accordingly. We appreciate your thorough review and your attention to detail, which helps us enhance the quality and clarity of our work.

## Reviewer sBB6 (Soundness 1.5, Overall 1)

We appreciate your insightful feedback and suggestions regarding our manuscript. We are grateful for the recognition of the value of our cleansed Multi-News+ through the positive rating on the 'Datasets' score. Below, we address the reviewer's comments and suggestions.

### Response to Weakness A: Real-world Case of Multi-News

> The authors tried to filter out noise documents. Although noise documents are not relevant to the general topic, they might demonstrate a real-world case where sometimes not all automatically-extracted documents are relevant. In fact, the authors created a cleaner version that fits cases where the documents were extracted manually or with high precision. This is a worthy cause, but it should be discussed in the paper.

We acknowledge the reviewer's viewpoint regarding this real-world scenario. While we fully understand the concern, we believe it can be addressed by the complementary use of our Multi-News+ and the original Multi-News dataset. For instance, one could use the original Multi-News dataset when the trained model is expected to consistently handle noisy documents at inference time and no pre-defined strategy exists for filtering them out. Otherwise, for cases where the model is expected to handle only clean documents, it will be more beneficial to use our proposed Multi-News+ dataset for training. We are grateful for your feedback on this point and plan to incorporate the discussion into the manuscript.

### Response to Weakness B: Easier Task of Multi-News+

> In practice, the authors filtered out not only documents that are not related to the topic of the set of documents, but also documents that are not related to the summary. This may cause filtering relevant documents that do not contain salient information. In fact, it makes the task easier while focusing on the salient information only. I would at least expect manual verification showing the rate of actual noise filtering.

We acknowledge the reviewer's perspective on this issue. While we understand the concern, we would like to carefully point out that every experiment was conducted on the test set of Multi-News+ to ensure a fair comparison between models. For instance, the clean test set of Multi-News+ would also be easier to handle for a model trained on the original Multi-News dataset, which includes noisy documents. Additionally, we include an initial verification showing the rate of correct noise filtering in the next response.

### Response to Weakness C: Evaluation on the Quality of the Filtering

> No manual annotation at all. Not to assess the quality of filtering, and not to assess the quality of the result summaries.

We fully understand the concern raised by the reviewer. Our initial intention was to showcase the feasibility of LLM-based dataset cleansing and to present a cleansed version of a real-world dataset, in contrast to previous studies on LLM-based data annotation. This intention was based on the purpose of a short paper according to the [CFP](https://aclrollingreview.org/cfp) and the [reviewer guide](https://aclrollingreview.org/reviewertutorial), which suggest that a *small, focused contribution* such as a *proof of concept* is suitable for a short paper. Furthermore, we were planning to publicly release the dataset, which facilitates future investigation of the quality of the annotation performed in our study. However, thanks to the reviewer's invaluable feedback, we now acknowledge the importance of human verification of the annotation results. In response, we conducted an examination of the 379 document sets classified as containing no documents relevant to the summary. As a result, we found that 93.93% of the annotations are valid. We are planning to incorporate this evaluation result and a thorough error analysis of failure cases. We once again extend our sincere gratitude for the reviewer's feedback on the importance of human verification of the annotation results.

### Response to Weakness D: Limited Novelty

> The filtering method is not so novel. It is very similar to things that were done in the past.

We acknowledge your concern regarding the novelty of our proposed method. In response to your valuable feedback, we conducted a comparison between our method and previous studies. Our paper aims to introduce large language models to enhance the quality of real-world datasets, which distinguishes it from the existing studies [1, 2, 3]. Previous works applied data filtering to training datasets to improve the factual consistency of abstractive summarization models. In contrast, we focus on the significant noise and extensive cleaning required in datasets primarily obtained through web crawling [4].
To address this data cleaning challenge, we utilized large language models to reduce human effort, eliminating the need for external modules and expensive fine-tuning processes, unlike in previous studies [1, 2, 3]. Furthermore, instead of using multiple fine-tuned models and filtering data based on fixed thresholds through intersection [3], we implemented methods such as majority voting and chain-of-thought reasoning to effectively imitate the processes of human experts. This approach not only provides a rationale for decision-making and improves transparency, but also removes the need to fine-tune each model individually. Therefore, our paper emphasizes our attempt, particularly using large language models, to enhance real-world datasets such as those obtained through web crawling. We appreciate your kind feedback on this issue. We plan to improve the presentation of our method's novelty in the updated version of the paper.

```
[1] Kazuki Matsumaru, Sho Takase, and Naoaki Okazaki. 2020. Improving truthfulness of headline generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1335–1346, Online. Association for Computational Linguistics.
[2] Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, and Bing Xiang. 2021. Entity-level factual consistency of abstractive text summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, Online. Association for Computational Linguistics.
[3] Yanzhu Guo, Chloé Clavel, Moussa Kamal Eddine, and Michalis Vazirgiannis. 2022. Questioning the validity of summarization datasets and improving their factual consistency. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5716–5727, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
[4] Julia Kreutzer et al. 2022. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
```

### Response to Comment: Modification of Figure 2

> It is hard to distinguish the different colors in Figure 2 in the high "number of articles".

We appreciate this feedback from the reviewer. We will make sure to revise the presentation of Figure 2. Once again, we extend our genuine appreciation for the reviewer's feedback. We are committed to incorporating these suggestions and discussions into the updated version of the manuscript.

## Reviewer n44F (Soundness 2.5, Overall 2)

Thank you for your review and thoughtful feedback on our manuscript. We have carefully considered your comments and concerns and would like to address them as follows:

### Response to Weakness 1: Lack of Analysis of the Multi-News Dataset

> The study lacks a thorough analysis of the Multi-NEWS dataset. Metrics specific to automatic summarization reveal the characteristics of the Multi-NEWS dataset, shedding light on the types of noisy samples it contains. Previous works have proposed several filtration strategies to uphold the dataset's high quality.

We acknowledge your concern regarding the necessity of a thorough analysis of the Multi-News dataset. Our initial intention was to showcase the feasibility of dataset cleansing with the proposed method, as the manuscript is a short paper. However, we acknowledge the importance of a thorough analysis of the noise patterns in the Multi-News dataset. We are planning to analyze the characteristics and patterns of this noise, leveraging the methods from the studies the reviewer mentioned. We appreciate the reviewer's suggestion and believe such an analysis will help readers understand the problems of the Multi-News dataset.
### Response to Weakness 2: The Usage of Summarization-specific Filtration Strategies

> The initial step in enhancing the quality of any summarization dataset involves applying summarization-specific filtration strategies. From the remaining samples, identifying noisy ones becomes more feasible. This approach could potentially reduce data annotation costs to under $500.

We acknowledge the possibility of combining previous filtration strategies with our proposed method, further decreasing the required cost. We are committed to including a discussion of this future extension. We sincerely appreciate this valuable feedback.

### Response to Weakness 3: Qualitative Analysis

> Despite the performance enhancements on the Multi-NEWS dataset, the qualitative analysis of the models is lacking.

We deeply appreciate the reviewer's feedback concerning the evaluation of the annotations performed by GPT-3.5. In response to your invaluable feedback, we performed an initial evaluation on the 379 sets classified as containing no relevant source articles, as mentioned in Lines 213-214 and Appendix C. We examined these examples and found that 93.93% of the 379 sets were correctly classified as consisting entirely of noisy documents. We are committed to incorporating this evaluation and an analysis of failure cases into the updated version of the manuscript. Once again, we sincerely appreciate your invaluable feedback.

### Response to Weakness 4: Partial Contribution of the Documents to the Summary

> How does the proposed approach handle cases where one document contributes only partially to the summary?

We appreciate this insightful question. While we could not attach the results of the analysis regarding this question, we are planning to incorporate this analysis into the updated version of the manuscript.

### Response to Weakness 5: Necessity of LLM Baselines

> It is suggested to include one or two baselines with Large Language Models (LLMs).

We acknowledge the concern raised by the reviewer regarding the limited extent of the experiments. We are planning to conduct additional experiments using LLMs such as Mistral and LLaMA. Although we could not include the experimental results in this initial rebuttal due to the limited time, we will make sure to include this experiment in the updated version. We are grateful for your thorough feedback, which gives us the opportunity to refine our work.
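For concreteness, the majority-voting and chain-of-thought annotation scheme discussed in the responses above can be sketched as follows. This is a minimal illustrative sketch, not our exact implementation: the `query_llm` stub, the prompt wording, and the `Label:` output convention are hypothetical placeholders standing in for the actual GPT-3.5 API calls and prompts.

```python
from collections import Counter

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-3.5 API call. In a real pipeline this
    would sample the model with nonzero temperature, so the chain-of-thought
    answers can differ across votes; here it returns a fixed response."""
    return ("Reasoning: the document only repeats site boilerplate and never "
            "mentions the entities in the summary. Label: NOISE")

def parse_label(response: str) -> str:
    # The prompt asks the model to end its reasoning with
    # "Label: RELEVANT" or "Label: NOISE"; extract the final label.
    return response.rsplit("Label:", 1)[-1].strip()

def classify_document(summary: str, document: str, n_votes: int = 3) -> str:
    """Annotate one (summary, document) pair by majority vote over several
    independent chain-of-thought samples."""
    prompt = (
        "Decide whether the document is a relevant source for the summary.\n"
        f"Summary: {summary}\n"
        f"Document: {document}\n"
        "Think step by step, then end with 'Label: RELEVANT' or 'Label: NOISE'."
    )
    votes = [parse_label(query_llm(prompt)) for _ in range(n_votes)]
    label, _count = Counter(votes).most_common(1)[0]
    return label
```

Keeping the model's step-by-step reasoning alongside each vote is what allows a human annotator to audit individual decisions, as discussed in the responses on transparency above.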
