About the Chicago spreadsheet: readme, metadata, features
For an overview of each feature (readme), please see the bottom of this note.
## Quality standards
For now, the corpus has been annotated for the following quality standards:

Some of these prizes have very few matches in Chicago (e.g. the British Fantasy Awards, 3 matches), so some of them only make sense to use in conjunction with others (see the readme at the bottom of this page for exact numbers per category).
## Features
The corpus has been annotated for the following textual/arc features:

# README for the Chicago Corpus measures Excel sheet #
Also on UCloud: DAT/Chicago_corpus/Readme_Chicago_excel
https://cloud.sdu.dk/app/files/properties/%2F47946%2FChicago_corpus%2FReadme_Chicago_excel.md
# Textual measures #
#### TITLE_LENGTH
Number of letters in title.
#### WORDCOUNT
Global.
Number of words in text.
#### SENTENCE_LENGTH
Global.
Length of sentences in the text, measured in characters.
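A minimal sketch of how these basic counts could be computed (the whitespace word tokenization and the regex sentence splitting are simplifying assumptions, not necessarily the pipeline's tokenizer):

```python
import re

def basic_measures(title: str, text: str) -> dict:
    """Rough illustration of TITLE_LENGTH, WORDCOUNT and SENTENCE_LENGTH."""
    words = text.split()  # naive whitespace tokenization (assumption)
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]  # naive split (assumption)
    return {
        "TITLE_LENGTH": sum(ch.isalpha() for ch in title),  # letters only
        "WORDCOUNT": len(words),
        # averaging over sentences is an assumption; the sheet may store another aggregate
        "SENTENCE_LENGTH": sum(len(s) for s in sentences) / len(sentences),
    }
```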
#### MSTTR-100
Global.
Mean Segmental Type-Token Ratio (MSTTR) is a measure of lexical richness. It splits the text into segments of a given size (here 100 words, a common standard), calculates the Type-Token Ratio for each segment, and then takes the average of the segment ratios over the whole text.
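As an illustration, a minimal MSTTR sketch (assuming pre-tokenized words and dropping the final partial segment; the actual pipeline may handle the remainder differently):

```python
def msttr(tokens: list[str], segment_size: int = 100) -> float:
    """Mean Segmental Type-Token Ratio: average TTR over consecutive fixed-size segments."""
    ratios = []
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size)  # TTR of this segment
    return sum(ratios) / len(ratios)
```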
#### BZIP_NEW
Global.
Compressibility of the sentiment arcs, calculated by dividing the original bit-size of the arc by the compressed bit-size (using bzip2 compression). We calculated the compression ratio (original bit-size / compressed bit-size) for the first 1500 sentence-sentiment values of each text.
_We use bz2 in Python_
Settings: `bz2.compress(arc.encode(), compresslevel=9)`
#### BZIP_TXT
Global.
Compressibility of the text files, calculated by dividing the original bit-size of the text by the compressed bit-size (using bzip2 compression).
We calculated the compression ratio (original bit-size / compressed bit-size) for the first 1500 sentences of each text.
_We use bz2 in Python_
Settings: `bz2.compress(text.encode(), compresslevel=9)`
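A minimal sketch of the compression ratio under the settings above (how the arc values and sentences are serialized to a string before compression is an assumption):

```python
import bz2

def compression_ratio(s: str) -> float:
    """Original size divided by bzip2-compressed size (byte ratio equals bit ratio)."""
    raw = s.encode()
    return len(raw) / len(bz2.compress(raw, compresslevel=9))

# BZIP_TXT: the first 1500 sentences joined into one string (assumed serialization)
# bzip_txt = compression_ratio(" ".join(sentences[:1500]))
# BZIP_NEW: the first 1500 sentence-sentiment values serialized as text (assumed serialization)
# bzip_new = compression_ratio(" ".join(str(v) for v in sentarc[:1500]))
```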
*
## Syntactic features
A range of spaCy tags extracted using the small spaCy model (en_core_web_sm).
#### SPACY_ADJ
Adjective frequency of each text (not normalized, e.g., by wordcount)
#### SPACY_NOUN
Noun frequency of each text (not normalized)
#### SPACY_VERB
Verb frequency of each text (not normalized)
#### SPACY_PRON
Pronoun frequency of each text (not normalized)
#### SPACY_PUNCT
Punctuation-mark frequency of each text (not normalized)
#### SPACY_STOPS
Stopword frequency of each text (not normalized)
#### SPACY_SBJ
Nominal subject frequency of each text (not normalized)
#### SPACY_PASSIVE
Passive auxiliary frequency of each text (not normalized)
#### SPACY_AUX
Auxiliary frequency of each text (not normalized)
#### SPACY_RELATIVE
Relative clause modifier frequency of each text (not normalized)
#### SPACY_NEGATION
Negation modifier frequency of each text (not normalized)
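A minimal sketch of how such raw counts can be obtained with the small English model; the dependency labels used below for subjects, passives, relative clauses and negation (nsubj, auxpass, relcl, neg) are assumptions based on spaCy's standard English label set, not confirmed settings:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_counts(text: str) -> dict:
    doc = nlp(text)
    return {
        "SPACY_ADJ": sum(t.pos_ == "ADJ" for t in doc),
        "SPACY_NOUN": sum(t.pos_ == "NOUN" for t in doc),
        "SPACY_VERB": sum(t.pos_ == "VERB" for t in doc),
        "SPACY_PRON": sum(t.pos_ == "PRON" for t in doc),
        "SPACY_PUNCT": sum(t.pos_ == "PUNCT" for t in doc),
        "SPACY_STOPS": sum(t.is_stop for t in doc),
        "SPACY_SBJ": sum(t.dep_ == "nsubj" for t in doc),        # nominal subjects (assumed label)
        "SPACY_PASSIVE": sum(t.dep_ == "auxpass" for t in doc),  # passive auxiliaries (assumed label)
        "SPACY_AUX": sum(t.pos_ == "AUX" for t in doc),
        "SPACY_RELATIVE": sum(t.dep_ == "relcl" for t in doc),   # relative clause modifiers (assumed label)
        "SPACY_NEGATION": sum(t.dep_ == "neg" for t in doc),     # negation modifiers (assumed label)
    }
```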
*
## Readabilities
#### READABILITY_FLESCH_EASE
A measure of readability based on the average sentence length (ASL) and the average number of syllables per word (word length; ASW), with a higher weight on word length (Crossley et al., 2011). It should be noted that the weight on word length is higher in the Flesch Reading Ease score than in the Flesch-Kincaid Grade Level. It returns a readability score between 0 and 100, where higher scores indicate easier texts (Hartley, 2016). The formula is:
Flesch Reading Ease = 206.835 - (1.015 * ASL) - (84.6 * ASW)
Why it was selected:
- It is one of the most common scores and has in several publications been argued to be the best measure compared to other readability scores (see Hartley, 2016).
- Unlike all the other scores, it does not return a US grade level (which can be difficult to interpret) but instead a score.
What to be aware of (also described in Hartley, 2016):
- The score may be outdated and has several issues, which also apply to other readability scores (Hartley, 2016):
  - Having many syllables does not mean that a word is more difficult to understand.
  - The meaning of words is not taken into account.
  - There are individual differences between readers.
#### READABILITY_FLESCH_GRADE
A revised version of the Flesch Reading Ease score. Like the former, it is based on the average sentence length (ASL) and the number of syllables per word (ASW). It also weighs word length more than sentence length, but the weight is smaller than in the Flesch Reading Ease score. It returns a US grade level (Crossley et al., 2011). The formula is:
Flesch-Kincaid Grade Level = (0.39 * ASL) + (11.8 * ASW) - 15.59
Why it was selected:
- It is also one of the most common and traditional scores for assessing readability.
What to be aware of:
- See Flesch Reading Ease above.
- The score was initially developed for documents for the US Navy, so it may be questioned how well it applies to literature.
#### READABILITY_SMOG
A readability score introduced by McLaughlin. It measures readability based on the average sentence length and the number of words with 3 or more syllables (number of polysyllables), and returns a US grade. It thus treats all words with 3 or more syllables as polysyllables, rather than using word length as a continuous measure. It was developed as an easier (and more accurate) alternative to the Gunning Fog Index, and is based on the McCall-Crabbs Standard Test Lessons in Reading (Zhou et al., 2017). The formula is:
SMOG Index = 1.0430 * sqrt(number of polysyllables * (30 / number of sentences)) + 3.1291
Why it was selected:
- The main reason for selecting this measure was as a (better) alternative to the Gunning Fog Index, and as an alternative to the Flesch scores.
- The McCall-Crabbs Standard Test Lessons in Reading contain non-fiction but also fiction texts, which may be relevant for the texts we are looking at.
What to be aware of:
- The SMOG Index is widely used for health documents, so it is unclear how accurate it is when applied to literature.
- The McCall-Crabbs Standard Test Lessons in Reading have been revised multiple times, which means that the formula itself might also be inaccurate (Zhou et al., 2017).
#### READABILTY_ARI
A readability score based on the average sentence length and the number of characters per word (word length); it returns a US grade. Unlike the previous scores, word length is defined not by the number of syllables but by the number of characters in the word. It was developed to test the readability of documents from the US Air Force, and was defined using 24 books and their associated grade levels (Zhou et al., 2017). The formula is:
ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
Why it was selected:
- It was mostly selected because it uses an alternative measure of word length compared to the Flesch scores and the SMOG Index.
What to be aware of:
- Since it was developed for rather technical documents, it may be debated how well it applies to literature.
#### READABILTY_DALE_CHALL_NEW
A 1995 revision of the Dale-Chall readability score. It is based on the average sentence length (ASL) and the percentage of "difficult words" (PDW), defined as words that do not appear on the Dale-Chall word list of words that 80 percent of fifth-graders would know. See: https://countwordsworth.com/download/DaleChallEasyWordList.txt
The Dale-Chall Readability Score also returns a US grade, but differs from all the other scores in that it determines word difficulty not by length but by a predefined list. The raw score is adjusted by adding 3.6365 if the share of difficult words (all words not on the list of familiar words) is above 5%. The formula for the raw score is:
New Dale-Chall Readability Score = 0.1579 * (difficult words / words * 100) + 0.0496 * (words / sentences)
Why it was selected:
- This score was mainly selected because it addresses an issue shared by all the other scores, namely that long words are not necessarily difficult to understand (e.g. "interesting" is a long word, but may be familiar to many and thus easy to read).
What to be aware of:
- The list of familiar words may not apply to all students and genres of text.
- Since the list of familiar words is based on 5th-grade students, this index may be most relevant for that age group.
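A minimal sketch of the four length-based formulas above (the vowel-group syllable counter is a crude heuristic and an assumption; DALE_CHALL_NEW additionally needs the familiar-word list and is omitted here):

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic (assumption, not the pipeline's syllable counter)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(words: list[str], n_sentences: int) -> dict:
    asl = len(words) / n_sentences                             # average sentence length (words)
    asw = sum(count_syllables(w) for w in words) / len(words)  # average syllables per word
    chars_per_word = sum(len(w) for w in words) / len(words)
    polysyllables = sum(count_syllables(w) >= 3 for w in words)
    return {
        "FLESCH_EASE": 206.835 - 1.015 * asl - 84.6 * asw,
        "FLESCH_GRADE": 0.39 * asl + 11.8 * asw - 15.59,
        "SMOG": 1.0430 * (polysyllables * 30 / n_sentences) ** 0.5 + 3.1291,
        "ARI": 4.71 * chars_per_word + 0.5 * asl - 21.43,
    }
```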
*
## Simple sentiment-arc measures
#### MEAN_SENT
Mean sentiment of all sentences in text
#### STD_SENT
Std. deviation of sentiment in text (sentence-based)
#### END_SENT
Mean sentiment of the last 10% of each text
#### BEGINNING_SENT
Mean sentiment of the first 10% of each text
#### DIFFERENCE_ENDING_TO_MEAN
Difference in mean sentiment between the main chunk of the text and the last 10% of the text
#### ARC_SEGMENTS_MEANS
List of the mean sentiment of each segment when splitting the text into 20 segments
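A minimal sketch of these measures for a sentence-level sentiment arc (splitting by sentence count and the sign convention of the difference are assumptions):

```python
import numpy as np

def simple_arc_measures(sentarc: list[float]) -> dict:
    arc = np.asarray(sentarc, dtype=float)
    cut = max(1, len(arc) // 10)  # 10% of the sentences (assumed split)
    return {
        "MEAN_SENT": arc.mean(),
        "STD_SENT": arc.std(),
        "BEGINNING_SENT": arc[:cut].mean(),
        "END_SENT": arc[-cut:].mean(),
        # sign convention (main chunk minus ending) is an assumption
        "DIFFERENCE_ENDING_TO_MEAN": arc[:-cut].mean() - arc[-cut:].mean(),
        "ARC_SEGMENTS_MEANS": [seg.mean() for seg in np.array_split(arc, 20)],
    }
```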
*
## Complex sentiment-arc measures
#### HURST
Linear.
Hurst exponent of the sentiment arcs.
Sentiment arcs were extracted with the VADER lexicon.
_We use ??_
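The implementation used for the Hurst exponent is not documented here; purely as an illustration, it could be estimated with rescaled-range analysis via the nolds package (an assumption, not necessarily the pipeline's choice):

```python
import nolds

def hurst_exponent(sentarc: list[float]) -> float:
    """Rescaled-range (R/S) estimate of the Hurst exponent (illustrative only)."""
    return nolds.hurst_rs(sentarc)
```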
#### APPENT
Linear.
Approximate Entropy of the sentiment arcs, calculated with an embedding dimension of 2 sentences.
Sentiment arcs were extracted with the VADER lexicon.
Approximate entropy is a technique used to quantify the amount of regularity and the unpredictability of fluctuations in time-series data.
_We use NeuroKit2 (https://neuropsychology.github.io/NeuroKit/functions/complexity.html#entropy)_
Settings: `app_ent = nk.entropy_approximate(sentarc, dimension=2, tolerance='sd')`
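A runnable version of the setting above, assuming NeuroKit2 is installed (recent NeuroKit2 versions return the value together with an info dict; the toy arc is made up):

```python
import numpy as np
import neurokit2 as nk

sentarc = np.array([0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, 0.5, -0.3, 0.2])  # toy sentiment arc
app_ent, info = nk.entropy_approximate(sentarc, dimension=2, tolerance='sd')
print(app_ent)
```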
# Quality measures #
*
## Continuous
These are all title-based (except for the Wikipedia PageRank, which is author-based).
#### LIBRARIES
“Libraries” corresponds to the number of library holdings as listed in WorldCat.
Note from Hoyt Long: However, you should know that after ranking novels by number of holdings, we then proceeded to acquire whatever was available digitally. This means that some works that were ranked high did not make it into the corpus.
#### RATING_COUNT
Number of ratings for the title on Goodreads. Scraped with the Goodreads scraper; see the scraper's Readme file for details: https://cloud.sdu.dk/app/files/properties/%2F178949%2FReadme.txt
#### AVG_RATING
Average rating of the title on Goodreads. Scraped with the Goodreads scraper; see the scraper's Readme file for details: https://cloud.sdu.dk/app/files/properties/%2F178949%2FReadme.txt
#### GOODREADS_PRODUCT
Average rating x rating count of each book.
#### AUDIBLE_AVG_RATING
Average rating of title on Audible.
From the large Audible dataset: https://github.com/elipickh/Audible_full_scraper
663 in Chicago
#### AUDIBLE_RATING_COUNT
Number of ratings for title on Audible.
From the large Audible dataset: https://github.com/elipickh/Audible_full_scraper
663 in Chicago
#### AUDIBLE_CATEGORY
Category ("genre") assigned on Audible
From the large Audible dataset: https://github.com/elipickh/Audible_full_scraper
663 in Chicago
#### TRANSLATIONES
Number of translations of the title as listed in the Index Translationum (https://www.unesco.org/xtrans/bsform.aspx), which covers translations in the period 1979-2019.
5082 in Chicago with a value > 0
#### AUTH_PageRank
NB. Author-based
An author's "PageRank Complete" on Wikipedia, based on data from the World Literature group (Frank), who used Wikipedia PageRanks: https://arxiv.org/pdf/1701.00991.pdf
An author has a high PageRank if many other articles that themselves have a high PageRank link to the author's page.
3558 in Chicago with a value > 0
*
## Prestige/canon-lists ##
#### GOODREADS_CLASSICS
Author-based
Authors mentioned on the Goodreads Classics list are marked 1. https://www.goodreads.com/shelf/show/classics
62 in Chicago
#### GOODREADS BEST 20TH CENTURY BOOKS
Author-based
Authors mentioned on the Goodreads Best Books of the 20th Century list are marked 1. https://www.goodreads.com/list/show/6.Best_Books_of_the_20th_Century
44 in Chicago
#### OPENSYLLABUS
Author-based
Works that also appear in the top 1000 titles on the Opensyllabus list of English Literature are marked 1. https://opensyllabus.org/result/field?id=English+Literature
477 in Chicago
#### NORTON_ENGLISH
Author-based
Authors mentioned in the 10th edition of the Norton Anthology of English Literature (British & American literature) are marked 1.
339 in Chicago
#### NORTON_AMERICAN
Author-based
Authors mentioned in the 10th edition of the Norton Anthology of American Literature are marked 1.
62 in Chicago
#### NORTON
Author-based
NORTON_ENGLISH and NORTON_AMERICAN combined.
401 in Chicago
#### PENGUIN_CLASSICS_SERIES_TITLEBASED
Title-based
Titles that have been published in the Penguin Classics series (https://www.penguin.com/penguin-classics-overview/)
(1326 titles in total)
77 in Chicago
#### PENGUIN_CLASSICS_SERIES_AUTHORBASED
Author-based
Authors that have been published in the Penguin Classics series (https://www.penguin.com/penguin-classics-overview/)
(1326 titles in total)
335 in Chicago
*
## Awards ##
## General fiction
#### NOBEL
Author-based
Works by Nobel Prize winners are marked 1.
85 in Chicago
#### PULITZER
Longlisted works
Title-based
Works shortlisted (winners) for the Pulitzer Prize are marked W, and works that were longlisted (finalists) are marked F.
53 in Chicago
#### NBA
Longlisted works
Title-based
Works shortlisted (winners) for the National Book Award are marked W, and works that were longlisted (finalists) are marked F.
108 in Chicago
*
## Scifi
#### HUGO
Longlisted works
Title-based
(1953-2022)
Works shortlisted (winners) for the Hugo Awards are marked W, and works that were longlisted (finalists) are marked F.
96 in Chicago
#### LOCUS_SCIFI
Shortlisted works (Scifi)
Title-based
Locus award for best scifi novel 1980-2022
12 in Chicago
#### NEBULA
Longlisted works (Scifi)
Title-based
Nebula awards 1966-2022
92 in Chicago
#### PHILIP_K_DICK_AWARD
Longlisted works (Scifi)
Title-based
US Scifi award 1982-2022
4 in Chicago
#### J_W_CAMPBELL_AWARD
Longlisted works (Scifi)
Title-based
Scifi award 1973-2022
35 in Chicago
#### PROMETHEUS_AWARD
Longlisted works (Scifi)
Title-based
US "libertarian" scifi award 1979-2022
20 in Chicago
*
## Fantasy
#### LOCUS_FANTASY
Shortlisted works
Title-based
5 in Chicago
#### BFA
Shortlisted works
Title-based
British Fantasy Awards (aka. the August Derleth Fantasy Award) 1972-2022
3 in Chicago
#### WORLD_FANTASY_AWARD
Longlisted works
Title-based
Fantasy award 1975-2022
28 in Chicago
#### MYTHOPOEIC_AWARDS
Longlisted works (Fantasy)
Title-based
US fantasy award 1971-2022
5 in Chicago
*
## Horror
#### LOCUS_HORROR
Shortlisted works
Title-based
Locus awards for horror fiction/dark fantasy (1989-2022)
5 in Chicago
#### BRAM_STOKER_AWARD
Longlisted works
Title-based
Award for dark & horror fiction (1987-2022)
14 in Chicago
*
## Mystery
#### EDGAR_AWARDS
Shortlisted works (Mystery (Crime, etc.))
10 in Chicago
*
## Umbrella-categories
#### SCIFI_AWARDS
Title-based
163 in Chicago
#### HORROR_AWARDS
Title-based
19 in Chicago
#### FANTASY_AWARDS
Title-based
40 in Chicago
#### ROMANTIC AWARDS
Author-based
54 in Chicago
# Metadata #
#### PASC_GENRE
Genre manually annotated by Pascale
#### GENDER
Gender assigned based on the author's first name (genderize).
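A minimal sketch of a first-name lookup against the genderize.io web API (assuming the public HTTP endpoint; the actual assignment may have used a wrapper library or a cached database instead):

```python
import requests

def guess_gender(first_name: str):
    """Most likely gender for a first name according to genderize.io (illustrative)."""
    resp = requests.get("https://api.genderize.io", params={"name": first_name}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("gender")  # "male", "female", or None if unknown
```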
#### DECADE
Publication date by decade.