# AfriBERTa's Corpus
## Summary
* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)
## Introduction
This is the corpus on which AfriBERTa was trained. Most of the data comes from the BBC news website, but some languages also include data from Common Crawl.
* Homepage: [AfriBERTa](https://github.com/keleog/afriberta)
* Models:
  * [AfriBERTa Small](https://huggingface.co/castorini/afriberta_small)
  * [AfriBERTa Base](https://huggingface.co/castorini/afriberta_base)
  * [AfriBERTa Large](https://huggingface.co/castorini/afriberta_large)
* Paper: [Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages](https://aclanthology.org/2021.mrl-1.11/)
* Point of Contact: kelechi.ogueji@uwaterloo.ca
## Dataset Structure
### Data Instances
Each data point is a single line of text with an identifier. An example from the Igbo dataset:
```
{
"id": "6",
"text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."
}
```
### Data Fields
The data fields are:
* `id`: identifier of the example, as a string
* `text`: content as a string
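
As a minimal sketch of how a record with these two fields might be consumed, the snippet below parses one JSON-encoded example into a typed object. It assumes each record is available as a JSON string shaped like the instance shown above; the on-disk serialization is not specified here, and the names `Example` and `parse_example` are hypothetical helpers introduced only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    """One corpus record with the two documented fields."""
    id: str
    text: str


def parse_example(line: str) -> Example:
    """Parse a JSON-encoded record (assumed format) into an Example."""
    record = json.loads(line)
    # Keep `id` as a string, matching the instance shown in this card.
    return Example(id=str(record["id"]), text=record["text"])


# Sample record copied from the Data Instances section.
sample = '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'
example = parse_example(sample)
print(example.id, example.text)
```

Keeping `id` as a string mirrors the example instance; convert it to an integer only if downstream code needs numeric ordering.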
## Reference
We would like to acknowledge Kelechi Ogueji et al. for creating and maintaining the AfriBERTa corpus as a valuable resource for the natural language processing and machine learning research community. For more information about the corpus and its creators, please visit [the AfriBERTa repository](https://github.com/keleog/afriberta).
## License
The dataset has been released under the Apache License 2.0.
## Citation
```
@inproceedings{ogueji-etal-2021-small,
title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
author = "Ogueji, Kelechi and
Zhu, Yuxin and
Lin, Jimmy",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrl-1.11",
pages = "116--126",
}
```