# COLING Rebuttal 3385 (KICK)
## Response to Reviewer 1
We highly value your insightful comments and constructive suggestions on our manuscript. We have addressed each point raised by the reviewer as follows:
### [1. The language of the proposed dataset]
We acknowledge the need for clarity regarding our dataset's language. KICK is exclusively in Korean, as stated in the manuscript. To highlight this more effectively, we will add parallel Korean examples in the appendix, which were not permitted in the initial submission. Furthermore, the figures depict the interaction between commentators and casters using English-translated examples. This was a deliberate choice to communicate the features of KICK to a broader audience within our space limitations. In the camera-ready version, we will explicitly state the dataset's language and revise the figures accordingly to make its linguistic aspects easier to understand.
### [2. The definition of "belief state"]
Thank you for pointing out the need for a clearer definition of "belief state". In our paper, the term conforms to standard dialogue state tracking (DST) usage. A belief state in DST is an essential component that encapsulates a subject and its specific content, generally represented as (domain-slot-value) pairs; in our context, it is confined to (slot-value) pairs within a single domain. The model's role is to predict these belief states at each dialogue turn, with joint goal accuracy (JGA) measuring the accuracy of the accumulated belief state and turn goal accuracy (TGA) measuring the accuracy at each individual turn.
Although various papers that propose DST datasets use the term "belief state" without a distinct definition [1-3], we recognize that a more comprehensive definition would better assist readers less familiar with DST. Therefore, we will elaborate on this in the camera-ready version with relevant references [3-5] and support understanding by releasing the source code. A minimal illustrative sketch follows the references below.
[1] Budzianowski et al., "MultiWOZ -- A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling", EMNLP 2018.
[2] Henderson et al., "The Second Dialog State Tracking Challenge", SIGDIAL 2014.
[3] Park et al., "KLUE: Korean Language Understanding Evaluation", NeurIPS 2021.
[4] Kim et al., "Mismatch between Multi-turn Dialogue and its Evaluation Metric in Dialogue State Tracking", ACL 2022.
[5] Dey et al., "Towards Fair Evaluation of Dialogue State Tracking by Flexible Incorporation of Turn-level Performances", ACL 2022.
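To make this definition concrete, the following minimal sketch shows how a single-domain belief state can be represented; the slot names and values here are hypothetical illustrations, not the actual KICK annotation schema:

```python
# A minimal sketch of a single-domain belief state in DST.
# Slot names and values are hypothetical, not the KICK schema.

# With a single domain (a football match), each belief state is a
# set of (slot, value) pairs rather than (domain, slot, value) triples.
belief_state_at_turn_3 = {
    "scoring_team": "home",
    "score": "1:0",
    "event": "goal",
}

# The model predicts such a state at every dialogue turn; TGA checks
# the state predicted for each individual turn, while JGA checks the
# state accumulated over all turns so far.
```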
### [3. The statement on Section 3.1.1]
We appreciate your attention to detail in Section 3.1.1. The statement regarding 'adding 45 minutes to the second half' was intended to convey that the 45 minutes represent the full duration of the first half, rather than a literal extension of the second half. This practice aligns with standard football statistical conventions and prevents confusion between the two halves. For instance, if the raw data indicates '11 minutes into the second half,' it is adjusted to '56 minutes (11 + 45)'. In short, we aimed to use expressions commonly employed in prior research and standard football statistical systems [1-3]; a small illustrative sketch follows the references below.
[1] https://www.fifa.com/fifaplus/en/match-centre
[2] https://www.fotmob.com/
[3] Taniguchi et al., "Generating Live Soccer-Match Commentary from Play Data", AAAI 2019.
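As a concrete illustration of this convention, the sketch below converts a half-relative minute into cumulative match time; the helper is hypothetical and not code from our preprocessing pipeline:

```python
def to_match_time(half: int, minute: int) -> int:
    """Convert a half-relative minute to cumulative match time.

    Following standard football statistics conventions, second-half
    minutes are offset by the 45-minute duration of the first half.
    """
    return minute + 45 if half == 2 else minute


# '11 minutes into the second half' becomes minute 56 (11 + 45);
# first-half minutes are left unchanged.
assert to_match_time(2, 11) == 56
assert to_match_time(1, 30) == 30
```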
### [4. Details on Section 3.1.2]
Your feedback on Section 3.1.2 is invaluable. We will explicitly elaborate on the annotators' roles and tasks mentioned in item 1 of Section 3.1.2, detailing their responsibility to correct ASR-generated transcripts and assign timestamps to events in the conversation according to the guidelines. Specifically, the annotators are required to:
- Correct sentences transcribed from ASR to ensure accuracy.
- Ensure precise alignment with provided metadata for player names and home-away information.
- Assign timestamps corresponding to highlights within the match, ensuring they align with the game time indicated in the highlights video provided.
- Differentiate speakers between "caster" and "commentator".
We are committed to updating the manuscript to clarify this process. Additionally, the English and Korean versions of these guidelines will be available in the code repository for better transparency. Furthermore, we will include examples of the annotation process in the appendix to offer a concrete illustration of the annotation procedure.
### [5. Organization in Section 4 and 5]
We acknowledge the necessity for clarity in Sections 4 and 5. Section 4 outlines the tasks defined on the dataset, and Subsection 4.2 discusses the employed metrics. We introduce accumulated belief state accuracy, i.e., joint goal accuracy (JGA), and turn goal accuracy (TGA) to compare the conversational characteristics of commentators and casters. The relative goal index (RGI) metric is also presented to measure the balance between turn-level and dialogue-level accuracy. A higher RGI value (closer to 1) indicates an emphasis on understanding the overall dialogue flow, while a lower RGI value (closer to 0) suggests a focus on local information.
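To make the relationship between these metrics concrete, the sketch below shows one common formulation of JGA and TGA over per-turn belief states; the exact RGI formula follows the definition in the paper and is not reproduced here:

```python
# A minimal sketch of JGA and TGA, assuming gold and predicted belief
# states are given as one dict of slot-value pairs per turn. RGI is
# derived from the balance of these two quantities as defined in the
# paper, so it is not reimplemented here.

def joint_goal_accuracy(gold_turns, pred_turns):
    """Fraction of turns whose accumulated belief state is fully correct."""
    gold_acc, pred_acc, correct = {}, {}, 0
    for gold, pred in zip(gold_turns, pred_turns):
        gold_acc.update(gold)  # accumulate gold slot-value pairs
        pred_acc.update(pred)  # accumulate predicted slot-value pairs
        correct += gold_acc == pred_acc
    return correct / len(gold_turns)

def turn_goal_accuracy(gold_turns, pred_turns):
    """Fraction of turns whose turn-level belief state is fully correct."""
    return sum(g == p for g, p in zip(gold_turns, pred_turns)) / len(gold_turns)
```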
In light of your feedback, we will improve the contextualization of Tables 1, 2, and 3 in the revised version for better comprehension. In Table 1, we will outline the slot-value pairs representing match events, including the unique challenge of predicting non-categorical scores without predefined values. The numerical values in Table 2 will be explained more clearly, specifically highlighting the "avg. Tokens / Turn" feature. For Table 3, we will include explanations in the caption to clarify terms, particularly RGI.
### [6. Overall writing]
We recognize the importance of manuscript organization and writing quality. We will undertake a thorough revision to enhance clarity and integrate the reviewer's feedback effectively.
### [7. The comparison with other datasets]
We appreciate your emphasis on comparing our dataset with others. In the revised version, we will include a detailed comparison with zero-shot DST on MultiWOZ, where GPT-3.5-turbo achieved a state-of-the-art JGA score of 56.44 [1]. While we acknowledge the value of direct comparisons, our dataset's unique focus on 'user-user' interactions in sports commentary sets it apart. This characteristic distinguishes it from previous datasets, which are based on 'user-system' dialogues. We will add an entry on this interaction type to Table 2 to highlight these unique properties.
[1] Heck et al., "ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity?", ACL 2023.
We hope these planned revisions align with your expectations and address the points you raised. We appreciate your detailed review and attention to detail, which help us enhance the quality and clarity of our work. Thank you once again for your insightful comments.
## Response to Reviewer 2
Thank you for your review and thoughtful feedback on our manuscript. We are glad that you recognized the main contribution of our paper: a new dataset of dialogues between two people in different roles. Below, we address the reviewer's comment on the extent of our experiments.
We acknowledge the need for experiments with finetuning-based classifiers. Inspired by the BERT-based TripPy [1], we performed experiments using RoBERTa. However, its performance was notably worse than the prompt-based approaches, even in the first-half setting described in the manuscript (JGA: -15.29%p, TGA: -20.63%p, SA: -51.02%p, RSA: -57.05%p compared to GPT-3.5-turbo).
We examined the results and identified several possible explanations for this outcome:
1. "KICK" stands out with an average of 40.4 tokens per turn, handling longer sentences per turn compared to other datasets, as stated in the manuscript. As a result, traditional methodologies that rely on inputting the entire conversation history may not perform optimally for this dataset.
2. We observed difficulty in distinguishing between "home" and "away" values, suggesting the necessity for additional preprocessing to differentiate them within the same slot.
3. Unlike previous datasets with a straightforward 'user-system' turn order, "KICK" presents a dynamic ordering of caster and commentator turns, making it challenging for the model to explicitly distinguish between their roles.
We have also conducted additional experiments using the GPT-4-1106-preview model. In the scenario using both caster and commentator utterances, we observed strong overall performance (JGA: +1.59%p, TGA: +12.64%p, SA: +7.40%p, RSA: +4.80%p compared to GPT-3.5-turbo, RGI: 0.6074), indicating better comprehension of the overall dialogue flow than GPT-3.5. Its RGI is also markedly higher than that of GPT-3.5 (RGI: 0.3517), a trend similarly observed when evaluating performance solely based on TGA. Moreover, we found that JGA was higher when using only commentator utterances, while TGA was higher when using only caster utterances, consistent with the previous experiments in the manuscript. We will include the detailed experimental results in the camera-ready version.
Furthermore, we conducted a human evaluation to complement the quantitative analysis. To ensure fairness, it was conducted with five non-experts unrelated to the football field. The results indicated a significant improvement over the baseline in the first-half setting (JGA: +19.42%p, TGA: +35.49%p, SA: +3.46%p, RSA: +12.64%p compared to GPT-3.5-turbo, RGI: 0.8155), suggesting that while large language models perform well across various tasks, they still fall short of human performance on the KICK dataset. We assume there are two primary reasons for this. First, current large language models lack explicit coreference capabilities [2-3], such as memorization, which limits their ability to track the history of long conversations like those in KICK. Second, given that KICK is in Korean, model performance may be affected by the model's multilingual ability [4].
[1] Heck et al., "TripPy: A Triple Copy Strategy for Value Independent Neural Dialog State Tracking", SIGDIAL 2020.
[2] Heck et al., "ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity?", ACL 2023.
[3] Mullick et al., "Better Handling Coreference Resolution in Aspect Level Sentiment Classification by Fine-Tuning Language Models", EMNLP 2023 Workshop.
[4] OpenAI, "GPT-4 Technical Report", arXiv Preprint 2023.
Your constructive feedback on the need for additional experiments has significantly contributed to the refinement of our work. Thank you once again for your insightful comments.
## General Response
We extend our sincere gratitude to the reviewers for their thoughtful feedback and insightful suggestions on our manuscript. We appreciate the positive recognition of our dataset's strengths.
We have diligently addressed the major concerns raised by the reviewers:
- **Clarification of the dataset's language (R1)**: To ensure clarity, we will explicitly state KICK's language in the camera-ready version and incorporate parallel Korean examples in the appendix.
- **Elaboration on "belief state" (R1)**: To enhance understanding, we will provide a detailed definition in the revised version, supported by relevant references and accompanying source code release.
- **Data preprocessing procedure (R1)**: We will clarify the data preprocessing steps in Section 3.1.1 to address the concerns raised and ensure a clear delineation of each procedure, such as adjusting the time notation between the first and second halves.
- **Annotation guidelines (R1)**: We will enhance Section 3.1.2 by providing clear explanations of the annotators' tasks, including transcribing ASR-generated text and timestamps, along with guidelines and examples in the appendix.
- **Clarification on metrics and tables (R1)**: We aim to clarify the significance of each metric in Sections 4 and 5, ensuring their contributions to evaluating dataset quality are clearly outlined. Furthermore, detailed explanations will be included in the captions of Tables 1, 2, and 3 to facilitate the interpretation of numerical values.
- **Organization and writing issues (R1)**: We are committed to revising the writing of the paper to align with the reviewers' feedback.
- **Comparison with other datasets (R1)**: We will provide a detailed comparison with zero-shot DST on MultiWOZ, showcasing the GPT-3.5-turbo's state-of-the-art JGA score. Additionally, we will clarify the 'user-user' interactions in KICK, explaining the challenges of direct comparisons due to the distinctive nature of our dataset.
- **Necessity of additional experiments (R2)**: We have conducted additional experiments with RoBERTa, GPT-4, and human evaluation, and report their results.
We sincerely thank the reviewers for their constructive feedback, which has played a crucial role in enhancing the quality and clarity of our work.
## Response to Chairs
Dear Chairs,
We are grateful for the detailed reviews from the reviewers. However, while we appreciate all of the reviewers' efforts, we wish to address some concerns regarding the evaluation from Reviewer 1.
Regarding the "belief state" concept, it is a standard term in dialogue state tracking(DST), often not elaborated upon in similar papers. The notation of adding 45 minutes to mark the second half in Section 3.1.1 is a standard practice in football commentary for clarity. The metric joint goal accuracy(JGA), mentioned in Section 4.2, is a well-established metric in DST.
We believe that the reviews from Reviewer 1 may not fully reflect the nuances of the sports domain or provide a comprehensive evaluation of the DST aspects. As our paper has received only two reviews, we hope these clarifications aid in a more accurate understanding of the concepts discussed.
Thank you for your thoughtful consideration and invaluable service to the community.
Sincerely,
Paper3385 Authors