# Natural Questions: A Benchmark for Question Answering Research

###### tags: `notes`, `study notes`, `NLP`, `TACL 2019`

> https://aclanthology.org/Q19-1026/

## Abstract

- Motivation: address the lack of large, real-world datasets for open-domain question answering (QA) in order to advance natural language understanding (NLU) and machine learning methods.
- The paper introduces the **Natural Questions** corpus, a large dataset for QA research that pairs real user queries from Google search with high-quality annotations of answers found in Wikipedia pages. It aims to provide a robust benchmark for developing and evaluating QA systems.

## Introduction

- Recent advances in machine learning, especially neural methods, have significantly improved performance on tasks such as machine translation, speech recognition, and image recognition. These successes are attributed both to methodological advances and to the availability of large training datasets.
- Open-domain QA is highlighted as a benchmark task for NLU, with the potential to drive methodological innovation. Although QA datasets already exist, the main challenge lies in the annotation and evaluation process, which Natural Questions addresses by providing a large-scale, high-quality dataset for end-to-end training and evaluation.

## Methodology: The Natural Questions corpus

- Questions are real, anonymized, aggregated queries issued to Google search, which ensures their naturalness and diversity. The corpus includes over 300,000 training examples, with additional sets for development and testing.
- Annotators are presented with a question and a relevant Wikipedia page, from which they identify a long answer (typically a paragraph) and a short answer (one or more entities), or mark the example as null if no answer is present (see the record-format sketch at the end of these notes).

## Experiments

- Datasets: 307,373 training examples, 7,830 development examples, and 7,842 test examples, all annotated with the goal of reflecting real-world question-answering scenarios.
- Metrics: the paper introduces robust evaluation metrics that account for the variability of acceptable answers and demonstrates high human upper bounds on these metrics, establishing a challenging benchmark for QA systems (a simplified metric sketch appears at the end of these notes).

## Takeaways

- The Natural Questions dataset provides a large-scale, high-quality benchmark for open-domain QA, effectively driving progress in natural language understanding and machine learning methods.
- By using real Google search queries as questions, Wikipedia pages as the answer source, and a rigorous annotation process, the dataset reflects real-world question-answering scenarios.
- The work also introduces robust evaluation metrics that account for variability in acceptable answers and demonstrates a high human upper bound, setting a challenging bar for QA systems.

> STATEMENT: The contents shared herein are quoted verbatim from the original author and are intended solely for personal note-taking and reference purposes following a thorough reading. Any interpretation or annotation provided is strictly personal and does not claim to reflect the author's intended meaning or context.
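## Sketch: NQ record format

The annotation scheme above implies a per-example record with a question, a tokenized Wikipedia document, and annotations holding long/short answer spans (or nulls). Below is a minimal reading sketch in Python. The field names (`question_text`, `document_text`, `annotations`, `long_answer`, `short_answers`, `candidate_index`) follow my reading of the released simplified data format and should be checked against the official release; the file path is hypothetical.

```python
import json

def load_nq_examples(path):
    """Yield one Natural Questions example per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def extract_answers(example):
    """Recover the first annotator's long and short answers as text.

    Assumption: in the simplified format the document text is pre-tokenized
    (whitespace-separated) and spans are [start_token, end_token) offsets;
    candidate_index == -1 marks a null long answer.
    """
    tokens = example["document_text"].split(" ")
    annotation = example["annotations"][0]

    long_answer = None
    long_span = annotation["long_answer"]
    if long_span["candidate_index"] != -1:
        long_answer = " ".join(tokens[long_span["start_token"]:long_span["end_token"]])

    short_answers = [
        " ".join(tokens[sa["start_token"]:sa["end_token"]])
        for sa in annotation["short_answers"]
    ]
    return long_answer, short_answers

# Hypothetical usage:
# for ex in load_nq_examples("simplified-nq-train.jsonl"):
#     long_ans, short_ans = extract_answers(ex)
#     print(ex["question_text"], "->", short_ans or long_ans or "NULL")
```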
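## Sketch: simplified evaluation metric

The "robust metrics" rely on multiple annotations per evaluation example. As a rough illustration, the sketch below assumes a 5-way annotation scheme in which an example counts as answerable when at least `threshold` annotators gave a non-null answer, and a prediction counts as correct when it exactly matches any non-null gold annotation. The official evaluation script may differ in details, and the `(start, end)` span representation here is invented for the example.

```python
def nq_f1(gold_annotations, predictions, threshold=2):
    """Simplified NQ-style precision/recall/F1.

    gold_annotations: per example, a list of gold answer spans from the
    annotators, with None standing for a null annotation.
    predictions: per example, the predicted span or None for a null answer.
    """
    tp = fp = fn = 0
    for golds, pred in zip(gold_annotations, predictions):
        non_null = [g for g in golds if g is not None]
        has_answer = len(non_null) >= threshold  # example deemed answerable
        if pred is not None:
            if has_answer and pred in non_null:
                tp += 1  # matched some annotator's answer
            else:
                fp += 1  # answered an unanswerable example, or wrong span
        elif has_answer:
            fn += 1      # stayed silent on an answerable example
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with 5-way annotations and (start, end) token spans:
golds = [
    [(10, 20), (10, 20), None, None, None],  # answerable: 2 non-null annotations
    [None, None, None, None, (3, 7)],        # not answerable: only 1 non-null
]
preds = [(10, 20), None]
print(nq_f1(golds, preds))  # -> (1.0, 1.0, 1.0)
```

Scoring against any of several annotations is what makes the metric tolerant of the natural variability in acceptable answers that the paper emphasizes.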