# AfriBERTa's Corpus

## Summary

* [Introduction](#introduction)
* [Dataset Structure](#dataset_structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)

## Introduction

This is the corpus on which AfriBERTa was trained. Most of the data comes from the BBC news website, but some languages also include data from Common Crawl.

* Homepage: [Afriberta](https://github.com/keleog/afriberta)
* Models:
  * [Afriberta Small](https://huggingface.co/castorini/afriberta_small)
  * [Afriberta Base](https://huggingface.co/castorini/afriberta_base)
  * [Afriberta Large](https://huggingface.co/castorini/afriberta_large)
* Paper: [Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages](https://aclanthology.org/2021.mrl-1.11/)
* Point of Contact: kelechi.ogueji@uwaterloo.ca

## Dataset Structure

### Data Instances

Each data point is a line of text. An example from the Igbo dataset:

```
{
  "id": "6",
  "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."
}
```

### Data Fields

The data fields are:

* `id`: id of the example
* `text`: content as a string

## Reference

We would like to acknowledge Kelechi Ogueji et al. for creating and maintaining the AfriBERTa corpus as a valuable resource for the natural language processing and machine learning research community.

For more information about the AfriBERTa corpus and its creators, please visit [the AfriBERTa project page](https://github.com/keleog/afriberta).

## License

The dataset has been released under the Apache License 2.0.

## Citation

```
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
```
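
As a small illustration of the record layout described under Data Fields, the sketch below parses one record into a typed object. It assumes records are available as JSON objects, one per line; that serialization, the `Example` class, and the `parse_example` helper are hypothetical names introduced here and are not part of the dataset's tooling.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    """One corpus record, mirroring the `id` and `text` fields."""
    id: str
    text: str


def parse_example(line: str) -> Example:
    """Parse one JSON-encoded record (hypothetical helper, assumed format)."""
    record = json.loads(line)
    return Example(id=str(record["id"]), text=str(record["text"]))


# The Igbo record shown under Data Instances, written as a single JSON line.
sample = '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'
example = parse_example(sample)
print(example.id, example.text)
```

Keeping `id` as a string matches the example record, where the value is quoted rather than numeric.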