Gold standard dataset

---
tags: reviews
---

# Gold standard dataset

### Final content of gold_standard.csv:

`<class 'pandas.core.frame.DataFrame'>`
RangeIndex: 57369 entries, 0 to 57368
Data columns (total 28 columns):

| # | Column | Non-Null Count | Dtype |
| --- | ------ | -------------- | ----- |
| 0 | title_id | 57369 non-null | int64 |
| 1 | year_published_first | 57369 non-null | int64 |
| 2 | date_published_first | 57290 non-null | object |
| 3 | publication_count | 57369 non-null | int64 |
| 4 | publishing_house_ids | 57369 non-null | object |
| 5 | publishing_houses | 57369 non-null | object |
| 6 | id | 57369 non-null | int64 |
| 7 | created_by | 57369 non-null | int64 |
| 8 | date | 57369 non-null | object |
| 9 | headline | 57369 non-null | object |
| 10 | url | 46053 non-null | object |
| 11 | quote | 46193 non-null | object |
| 12 | grades_transformed_6 | 57369 non-null | float64 |
| 13 | rev_id | 57369 non-null | int64 |
| 14 | rev_gender_updated | 45787 non-null | object |
| 15 | media_importance | 57369 non-null | int64 |
| 16 | media_name | 57369 non-null | object |
| 17 | media_type_id | 57369 non-null | int64 |
| 18 | media_type_name | 57369 non-null | object |
| 19 | title_title | 57369 non-null | object |
| 20 | title_keywords | 51965 non-null | object |
| 21 | title_language_primary | 57369 non-null | object |
| 22 | author_entity_ids_c | 57369 non-null | int64 |
| 23 | first_name | 57369 non-null | object |
| 24 | last_name | 57095 non-null | object |
| 25 | birth_year | 23997 non-null | float64 |
| 26 | demean_grades | 57369 non-null | float64 |
| 27 | author_gender | 57369 non-null | object |

dtypes: float64(3), int64(9), object(16)

### Filters used

Columns in fiction-reviews.csv:

```
Index(['id', 'created_by', 'media_id', 'date', 'headline', 'quote', 'grade',
       'grade_normalized', 'url', 'referenced_title_ids', 'referenced_titles',
       'reviewer_id', 'underheading', 'grading_max', 'grading_min',
       'grading_precision', 'grading_symbol', 'media_importance',
       'media_initials', 'media_name', 'media_type_id', 'media_url',
       'media_type_name', 'clean_grades_scale6', 'clean_grading_max_scale6',
       'grades_transformed_6', 'clean_grades_scale5',
       'clean_grading_max_scale5', 'grades_transformed_5', 'rev_first_name',
       'rev_last_name', 'rev_gender', 'rev_id', 'title_id', 'title_title',
       'title_language_primary', 'title_language_translated_from',
       'title_is_translated', 'title_category_id', 'title_keywords',
       'category_id', 'category_name', 'title_ids', 'title_name',
       'category_c', 'author_entity_ids_c', 'multiple_titles',
       'review_title_id_c', 'multiauthors', 'rev_gender_updated'],
      dtype='object')
```

### rev_gender_updated column:
* We have retrieved the gender of 16,309 reviewers with help from Kirsten.
* For the fiction dataset this gives us 14,760 reviewer genders.

### From the fiction-reviews.csv dataset...
* Filter away reviews that cover more than one book title or more than one book author ==> removing 4,906 reviews out of 65,474 in total, which is 7.49%.
* Get the subset of columns: ['id','created_by', 'date', 'headline', 'url', 'quote','grades_transformed_6','rev_id','rev_gender_updated','media_importance','media_name','media_type_id','media_type_name','title_title','title_keywords','title_language_primary','title_id','author_entity_ids_c']
* Filter away reviews without an id for the book author ==> removing 147 reviews.
* Filter away reviews without a grade ==> removing 2,957 reviews.
* Filter away reviews without a media name ==> removing 1 review.
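
The review filters listed above can be sketched in pandas roughly as follows. This is a minimal illustration and not the script that was actually run: it assumes that `multiple_titles` and `multiauthors` are boolean flags and that missing author ids, grades and media names are stored as NaN/None. The function name `filter_reviews` and the toy data are hypothetical.

```python
import pandas as pd

# Columns kept for the gold standard (copied from the subset list above).
KEEP = ['id', 'created_by', 'date', 'headline', 'url', 'quote',
        'grades_transformed_6', 'rev_id', 'rev_gender_updated',
        'media_importance', 'media_name', 'media_type_id', 'media_type_name',
        'title_title', 'title_keywords', 'title_language_primary',
        'title_id', 'author_entity_ids_c']


def filter_reviews(reviews: pd.DataFrame) -> pd.DataFrame:
    """Apply the review filters in the order they are listed in the note."""
    # Assumption: multiple_titles / multiauthors are boolean flag columns.
    multi = reviews['multiple_titles'].astype(bool) | reviews['multiauthors'].astype(bool)
    subset = reviews.loc[~multi, KEEP]
    # Assumption: a missing author id, grade or media name is NaN/None.
    subset = subset.dropna(subset=['author_entity_ids_c'])
    subset = subset.dropna(subset=['grades_transformed_6'])
    subset = subset.dropna(subset=['media_name'])
    return subset.reset_index(drop=True)


# Toy data: only the first review passes every filter.
toy = pd.DataFrame({col: ['x'] * 5 for col in KEEP})
toy['id'] = [1, 2, 3, 4, 5]
toy['multiple_titles'] = [False, True, False, False, False]
toy['multiauthors'] = [False, False, False, False, False]
toy['author_entity_ids_c'] = ['[10]', '[11]', None, '[12]', '[13]']
toy['grades_transformed_6'] = [4.0, 5.0, 3.0, None, 2.0]
toy['media_name'] = ['Avis A', 'Avis A', 'Avis A', 'Avis A', None]

result = filter_reviews(toy)
print(len(toy) - len(result), 'reviews removed')
```

Keeping each `dropna` as its own step makes it easy to log how many reviews every filter removes, which is how the counts in the list above are reported.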

### From books:
* Get the following information: ['year_published_first', 'date_published_first', 'publication_count', 'publishing_house_ids', 'publishing_houses', 'id'] from referenced_titles in reviews-graph.ndjson.
* Remove all reviews without a book id ==> removing 74 reviews in total.
* The book id column is called ['title_id'].

### Merge:
* Merge the two dataframes on 'title_id'.
* Remove all reviews without an id.
* Remove duplicates on 'id' and 'title_id'.

Then we end up with a (57463, 22) dataframe.

#### Info:

`<class 'pandas.core.frame.DataFrame'>`
Int64Index: 57369 entries, 0 to 57462
Data columns (total 26 columns):

| # | Column | Non-Null Count | Dtype |
| --- | ------ | -------------- | ----- |
| 0 | title_id | 57369 non-null | float64 |
| 1 | year_published_first | 57105 non-null | float64 |
| 2 | date_published_first | 57290 non-null | object |
| 3 | publication_count | 57369 non-null | float64 |
| 4 | publishing_house_ids | 57369 non-null | object |
| 5 | publishing_houses | 57369 non-null | object |
| 6 | id | 57369 non-null | float64 |
| 7 | created_by | 57369 non-null | float64 |
| 8 | date | 57369 non-null | object |
| 9 | headline | 57369 non-null | object |
| 10 | quote | 46193 non-null | object |
| 11 | grades_transformed_6 | 57369 non-null | float64 |
| 12 | rev_id | 34635 non-null | float64 |
| 13 | rev_gender_updated | 45787 non-null | object |
| 14 | media_importance | 57369 non-null | float64 |
| 15 | media_name | 57369 non-null | object |
| 16 | media_type_id | 57369 non-null | float64 |
| 17 | media_type_name | 57369 non-null | object |
| 18 | title_title | 57369 non-null | object |
| 19 | title_keywords | 51965 non-null | object |
| 20 | title_language_primary | 57369 non-null | object |
| 21 | author_entity_ids_c | 57369 non-null | int64 |
| 22 | first_name | 57369 non-null | object |
| 23 | last_name | 57095 non-null | object |
| 24 | birth_year | 23997 non-null | float64 |
| 25 | gender | 57369 non-null | object |

dtypes: float64(10), int64(1), object(15)
memory usage: 11.8+ MB

### From author_gender.csv
* Make a dataframe with ['author_entity_ids_c', 'first_name', 'last_name', 'birth_year', 'gender'], where author_entity_ids_c is the id column renamed.
* That gives us 12,822 unique author ids.
* In the big dataframe, change author_entity_ids_c to an int (123) instead of a string with square brackets ('[123]').
* Merge the author gender dataframe with the big (result) dataframe using an outer join --> author_entity_ids_c = author_entity_ids_c.
* Drop all rows where the id is nan ==> author id for all 57,369 reviews.
* Rename the column 'gender' to 'author_gender' so it is not mistaken for 'rev_gender'.

### Further additions:
* Some newspapers' websites were listed as online media. This has now been changed.
* I have added a column with the demeaned grades.

### Questions
* If we need reviewer gender, then we remove 11,582 reviews but would still have 45,787 reviews.
* If we need reviewer id, then we need to remove 22,734 reviews ==> leading to 34,635 reviews.
* If we need both reviewer id and reviewer gender, then we will have 31,746 reviews.
* Do we need publishing_house_ids or publishing_houses?
    * Example:
        * publishing_house_ids: [6125, 7979]
        * publishing_houses: [{'id': 6125, 'name_main': 'Forlaget Brændpunkt'}, {'id': 7979, 'name_main': 'Lindhardt og Ringhof', 'name_sub': 'SAGA Egmont'}]

# Goldstandard.csv: (57369, 27)

| Column name | Content | Gold standard filter | Notes |
| ----------- | ------- | -------------------- | ----- |
| id | unique for each row | Should be there: filter out those without | |
| created_by | an integer, having values [2, 880, 906, 960, 200, 197, 5, 424, 544] | | |
| date | when the review was published | | |
| headline | review headline | | |
| url | URL of the review | | |
| quote | quote from the review (not the book) | | |
| grades_transformed_6 | grade on the 6-point Likert scale | | |
| rev_id | unique id for reviewer | | we only have 34706 reviewer ids |
| rev_gender_updated | gender of reviewer, updated with help from Kirsten | | |
| media_importance | values: 100, 200, 300, nan | | |
| media_name | 304 unique names | | |
| media_type_id | values: 1, 2, 3, 4, 5, 6, 9, nan (7 and 8 missing?) | | |
| media_type_name | name of the media type; unique types in the data: Blog, Onlinemedie, Ugeblad, Avis, Fagblad, Onlinemedie uden citat, Regional avis, nan | | |
| title_title | the title of the book | only reviews of one book ==> single title_title | |
| title_keywords | list of all keywords associated with the title | | |
| title_language_primary | | | |
| title_id | unique book id | Should be there: filter out those without | |
| year_published_first | | | |
| date_published_first | | | |
| publisher: publishing_house_ids OR publishing_houses | publishing house | | see question above |
| publication_count | number of publications | | |
| author_entity_ids_c | unique id for author | Should be there: filter out those without | |
| first_name | first name of author | | |
| last_name | last name of author | | |
| birth_year | birth year of author | | |
| author_gender | gender of author | | |
| demean_grades | demeaned value of the grades | | |
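
The author columns in the table (author_entity_ids_c, first_name, last_name, birth_year, author_gender) come from the author_gender.csv step described earlier. A minimal sketch of that step is given here under stated assumptions: the id column in author_gender.csv is taken to be called `id` (the note only says it is "the id column renamed"), and "drop all id which is nan" is read as dropping rows whose review `id` is missing after the outer join. The function names and toy data are hypothetical.

```python
import pandas as pd

AUTHOR_COLS = ['author_entity_ids_c', 'first_name', 'last_name',
               'birth_year', 'gender']


def parse_author_id(value) -> int:
    """Turn a bracketed id string such as '[123]' into the integer 123."""
    return int(str(value).strip('[] '))


def add_author_gender(reviews: pd.DataFrame, authors: pd.DataFrame) -> pd.DataFrame:
    reviews = reviews.copy()
    reviews['author_entity_ids_c'] = reviews['author_entity_ids_c'].map(parse_author_id)
    # Assumption: the author file stores its key in a column named 'id'.
    authors = authors.rename(columns={'id': 'author_entity_ids_c'})[AUTHOR_COLS]
    merged = reviews.merge(authors, on='author_entity_ids_c', how='outer')
    # The outer join also keeps authors without any review; those rows have
    # no review id and are dropped (assumed reading of "drop all id which is nan").
    merged = merged.dropna(subset=['id'])
    # Rename so the column is not mistaken for the reviewer gender.
    return merged.rename(columns={'gender': 'author_gender'})


# Toy data: author 12 has no review and disappears after the dropna.
toy_reviews = pd.DataFrame({
    'id': [1, 2],
    'author_entity_ids_c': ['[10]', '[11]'],
})
toy_authors = pd.DataFrame({
    'id': [10, 11, 12],
    'first_name': ['Anna', 'Bo', 'Cecilie'],
    'last_name': ['Hansen', 'Jensen', 'Nielsen'],
    'birth_year': [1970.0, None, 1950.0],
    'gender': ['f', 'm', 'f'],
})

out = add_author_gender(toy_reviews, toy_authors)
print(out[['id', 'author_entity_ids_c', 'author_gender']])
```

An outer join followed by a `dropna` on the review id ends up with the same rows as a left join from the reviews; the outer join is used here only because that is what the note describes.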