---
tags: hathi
---

# Hathidata overview

## Hathidata in numbers

- Total number of documents: 108 395
- Estimate of distinct titles: 51 196
- Estimate of distinct authors: 19 481

*The real numbers are probably lower, since my cleaning method did not capture all duplicates. The same text may also appear under different titles.*

### Temporal distribution

Decades (the last zero of each decade label is missing; this can be ignored):

![](https://i.imgur.com/fftKrHi.png)

- earliest year: 1701
- latest year: 2011

### Authors

![](https://i.imgur.com/hbThcWM.png)

**Authors with the most texts in the corpus** (names cleaned and lowercased)

- dickens charles: 1107
- scott walter sir: 981
- balzac honore de: 813
- thackeray william makepeace: 769
- defoe daniel: 704
- cooper james fenimore: 694
- lytton edward bulwer lytton: 666
- trollope anthony: 616
- stevenson robert louis: 574
- dumas alexandre: 546

## The English subset

2730 titles, manually checked, evenly distributed over time:

![](https://i.imgur.com/Jb35lqw.png)

![](https://i.imgur.com/ZsClQzN.png)

*In our data: 1678. The more recent files are missing.*

![](https://i.imgur.com/xGjW599.png)

#### Preliminary Hurst exploration

![](https://i.imgur.com/Gx38kck.png)

*Quite stable over decades!*

## Nobel set

Distinct authors: 27 (out of 118)

![](https://i.imgur.com/UnDZZfD.png)

- The year 1800 is a mistake and will be corrected manually.

Kipling is overrepresented:

![](https://i.imgur.com/2ViGiFF.png)

## Documentation

- About the collection: https://wiki.htrc.illinois.edu/display/COM/About+the+Collection
- Tutorials: https://wiki.htrc.illinois.edu/display/COM/All+HTRC+Tutorials
- Validating the Nobel workset? https://analytics.hathitrust.org/validateworkset
- HathiTrust on GitHub: https://github.com/htrc
- https://github.com/tedunderwood/noveltmmeta/tree/master/metadata#4--the-manually-checked-title-subset
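
A note on the author counts in the Authors section: they depend on names being cleaned and lowercased before counting. The exact cleaning method is not recorded in these notes, so the following is only a minimal sketch of what such a step might look like. The function names `clean_name` and `count_authors` and the raw name format (surname, forename, life dates) are assumptions for illustration, not the method actually used.

```python
import re
import unicodedata
from collections import Counter


def clean_name(raw: str) -> str:
    """Hypothetical cleaner: strip accents, punctuation and digits, then lowercase."""
    # Decompose accented characters and drop the combining marks (e.g. é -> e).
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep only ASCII letters and whitespace; this removes commas, periods,
    # and life dates such as "1812-1870". Letters that do not decompose
    # (e.g. "ø") are dropped too, which is a known limitation of this sketch.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", unaccented)
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(letters_only.lower().split())


def count_authors(raw_names):
    """Count documents per cleaned author name, skipping empty values."""
    return Counter(clean_name(name) for name in raw_names if name)


if __name__ == "__main__":
    # Made-up example records, assuming a "surname, forename, dates" format.
    sample = [
        "Dickens, Charles, 1812-1870.",
        "Dickens, Charles",
        "Balzac, Honoré de, 1799-1850",
    ]
    for author, n in count_authors(sample).most_common():
        print(f"{author}: {n}")
```

A cleaner like this merges spelling variants that differ only in punctuation, accents, or dates, but it cannot merge genuinely different name forms (initials versus full names, pseudonyms), which is one reason the distinct-author estimate is likely too high.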