
# CommonGen

## Summary

* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)

## Introduction

CommonGen is a constrained text generation task, with an associated benchmark dataset, that explicitly tests machines' ability to perform generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.

![](https://s3.w3s.aioz.network/w3ai-platform-v2/uploads/samples/54deb202-59ab-491b-b167-e0d95f9c4eb7/2024/07/12/1720774926-aaRmW5YyYH95dSghngnmwB.png?AWSAccessKeyId=FT7EO3IGQNMIILHXIDZRVTJHWE&Signature=0ipUGsFgKn36ZhK%2FWs1jtq48Ks0%3D&Expires=2351494926)

CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowd-sourcing on Amazon Mechanical Turk (AMT) and existing caption corpora, consists of 30k concept-sets and 50k sentences in total.

## Dataset Structure

### Data Instances

* **Size of downloaded dataset files**: 1.85 MB
* **Size of the generated dataset**: 7.21 MB
* **Total amount of disk used**: 9.06 MB

An example from the 'train' split looks as follows.

```
{
  "concept_set_idx": 0,
  "concepts": ["ski", "mountain", "skier"],
  "target": "Three skiers are skiing on a snowy mountain."
}
```

### Data Fields

The data fields are the same among all splits.

* `concept_set_idx`: an int32 feature.
* `concepts`: a list of string features.
* `target`: a string feature.

### Data Splits

| name | train | validation | test |
|:-------:|:-----:|:----------:|:----:|
| default | 67389 | 4018 | 1497 |

## Reference

We would like to acknowledge Lin, Bill Yuchen et al. for creating and maintaining the CommonGen dataset as a valuable resource for the natural language processing and machine learning research community.
For more information about the CommonGen dataset and its creators, please visit [the CommonGen website](https://inklab.usc.edu/CommonGen/index.html).

## License

The dataset has been released under the MIT License.

## Citation

```
@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen and Zhou, Wangchunshu and Shen, Ming and Zhou, Pei and Bhagavatula, Chandra and Choi, Yejin and Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    doi = "10.18653/v1/2020.findings-emnlp.165",
    pages = "1823--1840"
}
```
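
To make the record layout described under Data Fields concrete, here is a minimal Python sketch that checks a record against the three documented fields and then runs a rough concept-coverage check on the target sentence. The helper names `is_valid_record` and `missing_concepts` are hypothetical and not part of any CommonGen tooling, and the prefix-matching rule is an illustrative assumption, not the benchmark's evaluation metric.

```python
from typing import List, TypedDict


class CommonGenRecord(TypedDict):
    """One row, mirroring the fields listed under Data Fields."""
    concept_set_idx: int
    concepts: List[str]
    target: str


def is_valid_record(record: dict) -> bool:
    """Return True if the record has the three documented fields with the expected types."""
    return (
        isinstance(record.get("concept_set_idx"), int)
        and isinstance(record.get("concepts"), list)
        and all(isinstance(c, str) for c in record["concepts"])
        and isinstance(record.get("target"), str)
    )


def missing_concepts(record: CommonGenRecord) -> List[str]:
    """List concepts that no word in the target starts with (hypothetical helper).

    Assumption: a concept counts as covered when some target word begins with it,
    so "ski" matches "skiing". A serious check would lemmatize instead.
    """
    words = [w.strip(".,!?").lower() for w in record["target"].split()]
    return [
        c for c in record["concepts"]
        if not any(w.startswith(c.lower()) for w in words)
    ]


# The 'train' example shown under Data Instances.
example: CommonGenRecord = {
    "concept_set_idx": 0,
    "concepts": ["ski", "mountain", "skier"],
    "target": "Three skiers are skiing on a snowy mountain.",
}

print(is_valid_record(example))
print(missing_concepts(example))
```

Prefix matching is deliberately crude: it misses irregular inflections and can over-match short concepts, so treat it as a quick sanity check on generated sentences and not as a score.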