# PG-19
## Summary
* [Introduction](#introduction)
* [Dataset Structure](#dataset-structure)
* [Reference](#reference)
* [License](#license)
* [Citation](#citation)
## Introduction
This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Project Gutenberg library that were published before 1919, along with metadata giving each book's title and publication date.
PG-19 is over double the size of the Billion Word benchmark and contains documents that are, on average, 20x longer than those in the WikiText long-range language modeling benchmark. Books are partitioned into train, validation, and test sets. Book metadata is stored in `metadata.csv`, which contains `(book_id, short_book_title, publication_date)`.
Unlike prior benchmarks, the authors do not constrain the vocabulary size (i.e. by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom, to placeholder tokens. Users are free to model the data at the character level, at the subword level, or via any mechanism that can model an arbitrary string of text.

To compare models, the authors propose to continue measuring word-level perplexity, calculated as the total likelihood of the dataset (via any chosen subword vocabulary or character-based scheme) divided by the number of tokens specified in the dataset statistics table below.

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks that require long-range reasoning, such as LAMBADA or NarrativeQA. We would not recommend using this dataset to train a general-purpose language model, e.g. for a production dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.
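As a rough illustration of that evaluation protocol, the sketch below (hypothetical function and example numbers, not code from the PG-19 repository) normalizes a model's total negative log-likelihood, computed under any tokenization, by a fixed word count to obtain a word-level perplexity:

```python
import math

def word_level_perplexity(total_neg_log_likelihood: float, num_words: int) -> float:
    """Word-level perplexity from a total NLL (in nats), computed under any
    tokenization (character, subword, ...), normalized by the word count."""
    return math.exp(total_neg_log_likelihood / num_words)

# Hypothetical numbers: a model assigns a total NLL of 4.2e8 nats to a split
# containing 1.0e8 words.
print(word_level_perplexity(4.2e8, 1.0e8))  # ~66.7
```

Because the normalizer (the word count) is fixed per split, scores computed this way remain comparable across models that use different vocabularies or tokenizers.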
## Dataset Structure
### Data Instances
**default**
* **Size of downloaded dataset files**: 11.74 GB
* **Size of the generated dataset**: 11.51 GB
* **Total amount of disk used**: 23.25 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "publication_date": 1907,
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "text": "\"\\n\\n\\n\\nProduced by Ted Garvin, Dave Morgan and PG Distributed Proofreaders\\n\\n\\n\\n\\nLA FIAMMETTA\\n\\nBY\\n\\nGIOVANNI BOCCACCIO\\n...",
    "url": "http://www.gutenberg.org/ebooks/10006"
}
```
### Data Fields
The data fields are the same among all splits.
**default**
* `short_book_title`: a string feature.
* `publication_date`: an int32 feature.
* `url`: a string feature.
* `text`: a string feature.
### Data Splits
| name | train | validation | test |
|:------- |:-----:|:----------:|:----:|
| default | 28602 | 50 | 100 |
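A minimal sketch of loading the splits with the Hugging Face `datasets` library, assuming the dataset is available on the Hub under an identifier such as `deepmind/pg19` (adjust the name to wherever your copy is hosted):

```python
from datasets import load_dataset

# Assumes the dataset is hosted on the Hugging Face Hub under this identifier.
dataset = load_dataset("deepmind/pg19")

print(dataset)  # split names and sizes
sample = dataset["train"][0]
print(sample["short_book_title"], sample["publication_date"])
print(sample["text"][:200])  # first 200 characters of the book
```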
## Reference
We would like to acknowledge Rae, Jack W et al. for creating and maintaining the PG-19 dataset as a valuable resource for the natural language processing and machine learning research community. For more information about the PG-19 dataset and its creators, please visit [the PG-19 GitHub repository](https://github.com/google-deepmind/pg19).
## License
The dataset has been released under the Apache License 2.0.
## Citation
```
@article{raecompressive2019,
  author  = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
             Hillier, Chloe and Lillicrap, Timothy P},
  title   = {Compressive Transformers for Long-Range Sequence Modelling},
  journal = {arXiv preprint},
  url     = {https://arxiv.org/abs/1911.05507},
  year    = {2019},
}
```