---
tags: useful!
---
# Datasets on culture / varia
Quite messy at the moment; entries that turn out to be less useful will be removed. E.g. there are 5 sources on Goodreads, none ideal. Note that the document is editable by all, so feel free to add things here. It would be great to learn of new resources and collectively build a better overview.
## Deep history
- [Pantheon](https://pantheon.world/about/vision) - 70k biographies from Wikipedia.
- Notable people - 2.29M people from Wikipedia, [data](https://data.sciencespo.fr/dataset.xhtml?persistentId=doi:10.21410/7E4/RDAG3O)
- Global Technology Diffusion [TidyTuesday version](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-07-19), building on the Cross-country Historical Adoption of Technology (CHAT) [database](https://data.nber.org/data-appendix/w15319/), 100+ technologies, 150+ countries, 1800-2020.
- D-PLACE - geography, language, culture, and environment of over 1400 human societies, [data](https://d-place.org/), [paper](https://pure.mpg.de/rest/items/item_3258633_1/component/file_3262060/content)
- Seshat data - [download](http://seshatdatabank.info/datasets/)
- 29k natural or manmade disasters 1960-2018, [data](https://www.emdat.be/explanatory-notes), [paper](https://www.nature.com/articles/s41597-021-00846-6)
## Text search
- The Simpsons subtitles: [subtitles example](https://frinkiac.com/caption/S09E06/428043), [overview](https://langui.sh/2016/02/02/frinkiac-the-simpsons-screenshot-search-engine/), [Wikidata overview](https://linkedwiki.com/query/wikidata_The_Simpsons_television_series_episodes_list_by_season), [dataset](https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset)
- [TED talk corpus](https://github.com/kinnaird-laudun/data)
- Drama corpus [DraCor](https://dracor.org/), Artyom Shelya has a structured version
- Film: Movie conversations, 9k characters in 617 movies. [Movie dialog corpus](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)
- Film: Hollywood film dialog by gender and age. [Intro Cultural Analytics dataset](https://melaniewalsh.github.io/Intro-Cultural-Analytics/00-Datasets/00-Datasets.html#hollywood-film-dialogue-by-character-gender-and-age), [Original version](https://github.com/matthewfdaniels/scripts/)
- Google Books Ngrams [search/download](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
## Media
- Art: Tate art collection and artists [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-01-12/readme.md)
- Celebrities 2015-2017, Wikipedia popularity [data](https://github.com/the-pudding/wiki-billboard-data#historical-data)
- Books: Amazon most popular books 2017, [data](https://github.com/luminati-io/Amazon-popular-books-dataset)
- Books: Goodreads - [UCSD Goodreads metadata on books, scraped 2017](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home), [extended 10k books from Goodreads archive](https://github.com/malcolmosh/goodbooks-10k-extended/blob/master/README.md), [original 6M ratings for 10k books](https://github.com/zygmuntz/goodbooks-10k), [52k books on Goodreads, metadata 2020](https://zenodo.org/record/4265096), [code](https://github.com/scostap/goodreads_bbe_dataset), [list of all books on Goodreads from 2019](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks)
- Books: Book summaries from Wikipedia [data](https://www.kaggle.com/datasets/ymaricar/cmu-book-summary-dataset)
- Books: Book ratings - 433,000 numeric ratings of 186,000 books by 78,000 users on the book-tracking website BookCrossing [data](http://www2.informatik.uni-freiburg.de/~cziegler/BX/)
- Film: 26M+ movie ratings @ [MovieLens](https://movielens.org/), [Kaggle dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) with 45k films + 26M ratings.
- Film: Netflix show ratings [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md)
- Film: Bechdel test [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-03-09/readme.md)
- Games: Board Games database. [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12), [TidyTuesday dataset v2](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-01-25/readme.md)
- Music: 1 million Bandcamp sales from a few weeks in late 2020 [data](https://components.one/datasets/bandcamp-sales)
- Music: Discographies - Discogs [data](https://data.discogs.com/)
- Music: Spotify track API [R package](https://www.rcharlie.com/spotifyr/), [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md)
- Music: Audio features of songs [Million Song dataset](http://millionsongdataset.com/)
- Music: Hip Hop vocabulary [Pudding](https://pudding.cool/projects/vocabulary/index.html)
### Media history
- Art: Museum of Modern Art New York exhibitions 1929-1989 [data](https://github.com/MuseumofModernArt/exhibitions)
- Books: New York Times bestseller list [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-05-10/readme.md), 1931-2020, top 18 books per week (?). [Description](https://data.post45.org/wp-content/uploads/2022/01/NYT-Data-Description.pdf), [Summary viz](https://towardsdatascience.com/finding-trends-in-ny-times-best-sellers-55cdd891c8aa).
- Film: IMDb [datasets](https://www.imdb.com/interfaces/), [download](https://datasets.imdbws.com/)
- Film: TV drama ratings 1990-2008 [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-01-08)
- Food: New York Public Library menu dataset [CA intro dataset](https://melaniewalsh.github.io/Intro-Cultural-Analytics/00-Datasets/00-Datasets.html#the-new-york-public-librarys-menu-dataset), [original version](http://menus.nypl.org/data)
- Food: Historic American Cookbook dataset [Feeding America](http://www.lib.msu.edu/feedingamericadata/)
- Food: Lindenfors et al: simple overview of food ingredients from a few cookbooks 1200-2000 [Dryad dataset](https://datadryad.org/stash/dataset/doi:10.5061/dryad.rq43r), [paper1](http://www.lindenfors.se/publications/Lindenfors_et_al_2015.pdf) [paper2](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0122092)
- Music: Top 100 Billboard per week [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-09-14/readme.md), position + audio features.
- Music: Billboard top 100 per year, with lyrics [data](https://github.com/walkerkq/musiclyrics)
- Music: The New York Philharmonic’s performance history dataset 1842-2020 [data](https://github.com/nyphilarchive/PerformanceHistory)
- Theatre: Broadway weekly grosses 1985-2020 [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-04-28/readme.md)
- Theatre: Operabase - 600k+ performances around the world since 1996, [database](https://www.operabase.com/en), [one version](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8LUFN8)
- Theatre: 140 years of London theatre - 50k performances in 17c-18c (1660-1800), [data](https://londonstagedatabase.usu.edu/data.php)
- Varia: Nobel Prize history [API](https://www.nobelprize.org/about/developer-zone-2/), [CA intro course dataset](https://melaniewalsh.github.io/Intro-Cultural-Analytics/00-Datasets/00-Datasets.html#nobel-prize-winners), [Wikidata query](https://linkedwiki.com/query/List_of_Nobel_Prize)
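
The IMDb download in the list above serves gzipped tab-separated tables in which `\N` marks a missing value. A minimal sketch of reading one, assuming the column layout of `title.basics.tsv.gz` (`tconst`, `titleType`, `primaryTitle`, `startYear`, `genres`, among others). `read_imdb_tsv` is a hypothetical helper, the sample rows are made up for illustration, and nothing is downloaded:

```python
import csv
import gzip
import io

def read_imdb_tsv(fileobj):
    """Yield one dict per row from an IMDb-style TSV, mapping '\\N' to None."""
    # QUOTE_NONE: titles may contain double quotes that are not CSV quoting.
    reader = csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}

# Made-up sample in the assumed title.basics layout (not real IMDb rows).
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tshort\tExample Short\t1894\tDocumentary,Short\n"
    "tt0000002\tmovie\tExample Feature\t\\N\t\\N\n"
)

# Round-trip through gzip to mimic a downloaded .tsv.gz file.
compressed = gzip.compress(SAMPLE.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as handle:
    rows = list(read_imdb_tsv(handle))

print(rows[0]["primaryTitle"], rows[0]["startYear"])
print(rows[1]["startYear"])  # missing value, read as None
```

For a real file, replace the in-memory sample with `gzip.open("title.basics.tsv.gz", mode="rt", encoding="utf-8", newline="")`. The full tables are large, so iterating row by row (as here) is gentler on memory than loading everything at once.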
## Old things
- The most classic DH dataset, the Index Thomisticus, [data](http://www.corpusthomisticum.org/it/index.age)
- Encyclopedia Britannica 1768-1860. [data](https://data.nls.uk/data/digitised-collections/encyclopaedia-britannica/)
- 19c British fiction, 19k titles, 4k authors [data](https://www.victorianresearch.org/atcl/search.php)
- 18c texts, [ECCO-TCP](https://textcreationpartnership.org/tcp-texts/ecco-tcp-eighteenth-century-collections-online/)
- US presidents inaugural speeches, all 57, [download](http://oldsite.english.ucsb.edu/faculty/ayliu/unlocked/presidents/inaugural_speeches.zip)
- Magazine of Early American Datasets (MEAD) [list](https://repository.upenn.edu/mead/)
## Varia
- Crossword puzzle clues [TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-04-19/readme.md)
- Oldest playable movies on Wikidata [query](https://linkedwiki.com/query/The_oldest_movies_in_Wikidata)
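
Many of the TidyTuesday links in this document point at a `readme.md` page or a folder on GitHub, while the CSV files sit in that same folder. A small sketch of turning such a link into a raw-file URL. `raw_url` is a hypothetical helper, it assumes a branch name without slashes, and the file names used here (`times.csv`, `board_games.csv`) are assumptions that should be checked against the readme of each dataset:

```python
from urllib.parse import urlsplit

def raw_url(github_url: str, filename: str) -> str:
    """Map a GitHub blob/tree link to the raw URL of `filename` in that folder."""
    parts = urlsplit(github_url).path.strip("/").split("/")
    # Expected shape: owner / repo / blob|tree / branch / path...
    owner, repo, kind, branch, *path = parts
    if kind not in ("blob", "tree"):
        raise ValueError(f"not a blob/tree link: {github_url}")
    if kind == "blob":
        path = path[:-1]  # drop the linked file (e.g. readme.md), keep its folder
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, branch, *path, filename])

readme = "https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-04-19/readme.md"
folder = "https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-12"

print(raw_url(readme, "times.csv"))        # file name is an assumption
print(raw_url(folder, "board_games.csv"))  # file name is an assumption
```

The resulting URL can be passed to any CSV reader that accepts URLs, for example `pandas.read_csv`.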
## Estonian stuff
- Estonian fiction: collection 1 [ngrams](https://datadoi.ee/handle/33/41), [fulltexts](https://datadoi.ee/handle/33/76), [stopwords](https://datadoi.ee/handle/33/78); collection 2 [data](https://github.com/peeter-t2/tidy_ilukirj)
- Music: SkyPlus Yearly Top 40 Songs 1995-2020 + Lyrics [data](https://github.com/peeter-t2/EestiTop40_laulus6nad)
## Bibliographies
- British Library, British National Bibliography [data](https://www.bl.uk/collection-metadata/downloads#lodbnb), [v2](https://old.datahub.io/dataset/jiscopenbib-bl_bnb-1)
- French National Library [data](https://data.bnf.fr/semanticweb)
- Internet Archive Open Library [data](https://openlibrary.org/developers/dumps)
- Finnish National Bibliography [data](https://github.com/COMHIS/fennica)
- Swedish National Bibliography [data](https://github.com/comhis/kungliga)
## Newspapers
- Chronicling America [API](https://chroniclingamerica.loc.gov/about/api/)
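
The Chronicling America API linked above is queried through plain URLs. A minimal sketch of building a page-search request, assuming the search endpoint and the parameter names (`andtext`, `dateFilterType`, `date1`, `date2`, `page`, `format`) described on that API page are still in service. `search_url` is a hypothetical helper and no request is sent:

```python
from urllib.parse import urlencode

# Assumed search endpoint; check the linked API page before relying on it.
BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"

def search_url(text: str, year_from: int, year_to: int, page: int = 1) -> str:
    """Build a full-text page search URL restricted to a range of years."""
    params = {
        "andtext": text,                # words that must all appear on the page
        "dateFilterType": "yearRange",  # interpret date1/date2 as years
        "date1": year_from,
        "date2": year_to,
        "page": page,
        "format": "json",               # ask for JSON instead of HTML
    }
    return BASE + "?" + urlencode(params)

print(search_url("suffrage", 1900, 1910))
```

The URL can then be fetched with `urllib.request.urlopen` or a similar client; paging through results means calling the helper again with a higher `page`.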
## General data sources
- DBpedia [API](https://www.dbpedia.org/), [data](https://databus.dbpedia.org/dbpedia/)
- Long abstracts from Wikipedia [DBpedia](https://databus.dbpedia.org/dbpedia/text/long-abstracts)
- Short abstracts from Wikipedia [DBpedia](https://databus.dbpedia.org/dbpedia/text/short-abstracts)
- Wikipedia article texts [DBpedia](https://databus.dbpedia.org/dbpedia/text/nif-context)
- Wikipedia article structure [DBpedia](https://databus.dbpedia.org/dbpedia/text/nif-page-structure)
- Extracted facts from Wikipedia infoboxes [DBpedia](https://databus.dbpedia.org/dbpedia/generic/infobox-properties/)
- Wikipedia categories on articles [DBpedia](https://databus.dbpedia.org/dbpedia/generic/categories/)