# JaQuAD: Japanese Question Answering Dataset

## Summary

* [Introduction](#introduction)
* [Dataset Structure](#dataset_structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)

## Introduction

Japanese Question Answering Dataset (JaQuAD), released in 2022, is a human-annotated dataset created for Japanese Machine Reading Comprehension. JaQuAD was developed to provide a SQuAD-like QA dataset in Japanese. It contains 39,696 question-answer pairs. Questions and answers were manually curated by human annotators, and contexts were collected from Japanese Wikipedia articles. Fine-tuning [BERT-Japanese](https://huggingface.co/tohoku-nlp/bert-base-japanese) on JaQuAD achieves an F1 score of 78.92% and an exact match (EM) score of 63.38%.

## Dataset Structure

### Data Instances

* **Size of dataset files**: 24.6 MB
* **Size of the generated dataset**: 48.6 MB
* **Total amount of disk used**: 73.2 MB

An example of 'validation':

```
{
    "id": "de-001-00-000",
    "title": "イタセンパラ",
    "context": "イタセンパラ(板鮮腹、Acheilognathuslongipinnis)は、コイ科のタナゴ亜科タナゴ属に分類される淡水魚の一種。\n別名はビワタナゴ(琵琶鱮、琵琶鰱)。",
    "question": "ビワタナゴの正式名称は何?",
    "question_type": "Multiple sentence reasoning",
    "answers": {
        "text": "イタセンパラ",
        "answer_start": 0,
        "answer_type": "Object"
    }
}
```

### Data Fields

* `id`: a string feature.
* `title`: a string feature.
* `context`: a string feature.
* `question`: a string feature.
* `question_type`: a string feature.
* `answers`: a dictionary feature containing:
  * `text`: a string feature.
  * `answer_start`: an int32 feature.
  * `answer_type`: a string feature.

### Data Splits

JaQuAD consists of three sets: train, validation, and test. They were created from disjoint sets of Wikipedia articles. The test set has not been publicly released yet. The following table shows statistics for each set.

| Set        | Number of Articles | Number of Contexts | Number of Questions |
|:---------- |:------------------:|:------------------:|:-------------------:|
| Train      | 691                | 9713               | 31748               |
| Validation | 101                | 1431               | 3939                |
| Test       | 109                | 1479               | 4009                |

## Reference

We would like to acknowledge ByungHoon So et al. for creating and maintaining the JaQuAD dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the JaQuAD dataset and its creators, please visit [the JaQuAD website](https://github.com/SkelterLabsInc/JaQuAD).

## License

The dataset has been released under the Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) License.

## Citation

```
@misc{so2022jaquad,
    title={{JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension}},
    author={ByungHoon So and Kyuhong Byun and Kyungwon Kang and Seongjin Cho},
    year={2022},
    eprint={2202.01764},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
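
As a rough illustration of how the fields listed under Data Fields fit together, the sketch below checks that an answer's `text` appears in `context` at `answer_start`, and that the per-split question counts from the Data Splits table add up to the 39,696 pairs quoted in the Introduction. It rests on several assumptions: `answer_start` is treated as a character offset (as in SQuAD), the `answers` fields are scalars as in the example instance, the helper name `answer_matches_context` is hypothetical, and the record is a copy of the 'validation' example with its context truncated for brevity.

```python
from typing import Any, Dict


def answer_matches_context(record: Dict[str, Any]) -> bool:
    """Return True if the answer text occurs in the context at answer_start.

    Assumption: answer_start is a character offset into context, and the
    answers fields are scalars as in the example instance. If a loaded copy
    stores them as lists, index the first element before calling this.
    """
    answers = record["answers"]
    start = answers["answer_start"]
    text = answers["text"]
    return record["context"][start:start + len(text)] == text


# Copy of the 'validation' example instance; context truncated for brevity.
record = {
    "id": "de-001-00-000",
    "title": "イタセンパラ",
    "context": "イタセンパラ(板鮮腹、Acheilognathuslongipinnis)は、コイ科の",
    "question": "ビワタナゴの正式名称は何?",
    "question_type": "Multiple sentence reasoning",
    "answers": {
        "text": "イタセンパラ",
        "answer_start": 0,
        "answer_type": "Object",
    },
}

# Question counts per split, taken from the Data Splits table.
questions_per_split = {"train": 31748, "validation": 3939, "test": 4009}
total_questions = sum(questions_per_split.values())

print(answer_matches_context(record))
print(total_questions)
```

A check like this can be run over every record of a split before fine-tuning, since span-extraction models depend on `answer_start` and `text` being aligned with `context`.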