---
tags: hathi
---
# Hathidata overview
## Hathidata in numbers
- Total number of documents: 108 395
- Estimated distinct titles: 51 196
- Estimated distinct authors: 19 481

*The true numbers are probably lower: my cleaning method did not catch all duplicates, and the same text sometimes appears under different titles.*
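To illustrate why such an estimate is an upper bound, here is a minimal sketch of title deduplication. The `title_key` helper and the sample titles are hypothetical and are not the cleaning method actually used on the corpus; they only show that simple normalization merges case and punctuation variants but not genuinely different titles of the same text.

```python
import re

def title_key(title: str) -> str:
    """Hypothetical normalization: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())

# Illustrative titles, not taken from the corpus metadata.
titles = [
    "Oliver Twist.",
    "OLIVER TWIST",
    "Oliver Twist; or, The parish boy's progress",
]

# The first two collapse to one key; the long variant title stays separate,
# so the distinct count overestimates the number of distinct works.
distinct_keys = {title_key(t) for t in titles}
print(len(distinct_keys))
```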
### Distribution over time
By decade (each decade label is missing its final zero, which can be ignored)

- earliest year: 1701
- latest year: 2011
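A per-decade count like this can be sketched as follows. This is a minimal example under assumptions: the `decade_of` and `count_by_decade` helpers are hypothetical, and the sample years are made up for illustration rather than drawn from the corpus. Keeping the decade as a full year (1850 rather than 185) keeps the labels unambiguous.

```python
from collections import Counter

def decade_of(year: int) -> int:
    """Return the first year of the decade, e.g. 1701 -> 1700."""
    return (year // 10) * 10

def count_by_decade(years):
    """Count documents per decade, sorted chronologically."""
    return dict(sorted(Counter(decade_of(y) for y in years).items()))

# Illustrative publication years only.
sample_years = [1701, 1709, 1855, 1859, 1860, 2011]
decade_counts = count_by_decade(sample_years)
print(decade_counts)
```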
### Authors

**Authors with the most texts in the corpus**
(names cleaned and lowercased)

- dickens charles: 1107
- scott walter sir: 981
- balzac honore de: 813
- thackeray william makepeace: 769
- defoe daniel: 704
- cooper james fenimore: 694
- lytton edward bulwer lytton: 666
- trollope anthony: 616
- stevenson robert louis: 574
- dumas alexandre: 546
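One way such cleaned, lowercased names and counts could be produced is sketched below. This is an assumption, not the actual cleaning code: `clean_author` is a hypothetical helper, and the raw strings assume a catalogue-style format with life dates (e.g. `Dickens, Charles, 1812-1870.`).

```python
import re
import unicodedata
from collections import Counter

def clean_author(raw: str) -> str:
    """Hypothetical cleaning: strip accents, dates and punctuation, then lowercase."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only letters and whitespace; digits and punctuation become spaces.
    letters = re.sub(r"[^A-Za-z\s]", " ", no_accents)
    return " ".join(letters.lower().split())

# Illustrative records, assumed format.
raw_authors = [
    "Dickens, Charles, 1812-1870.",
    "DICKENS, CHARLES",
    "Balzac, Honoré de, 1799-1850.",
    "Scott, Walter, Sir, 1771-1832.",
]

author_counts = Counter(clean_author(a) for a in raw_authors)
for name, n in author_counts.most_common():
    print(f"{name}: {n}")
```

Variant spellings that differ by more than case, accents, or punctuation would still be counted as separate authors under this scheme.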
## The English subset
2730 titles, manually checked, evenly distributed over time:


*Our data contains 1678 of these; the more recent files are missing.*

### Preliminary Hurst exploration

*Quite stable over decades!*
## Nobel set
Distinct authors: 27 (out of 118)

- The year 1800 is a mistake and will be corrected manually.

An overrepresentation of Kipling:

## Documentation
- About the collection: https://wiki.htrc.illinois.edu/display/COM/About+the+Collection
- Tutorials: https://wiki.htrc.illinois.edu/display/COM/All+HTRC+Tutorials
- Validating the Nobel workset? https://analytics.hathitrust.org/validateworkset
- HathiTrust on GitHub: https://github.com/htrc
- Manually checked title subset (noveltmmeta metadata): https://github.com/tedunderwood/noveltmmeta/tree/master/metadata#4--the-manually-checked-title-subset