---
tags: reviews
---
# Gold standard dataset
### Final content of gold_standard.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57369 entries, 0 to 57368
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title_id 57369 non-null int64
1 year_published_first 57369 non-null int64
2 date_published_first 57290 non-null object
3 publication_count 57369 non-null int64
4 publishing_house_ids 57369 non-null object
5 publishing_houses 57369 non-null object
6 id 57369 non-null int64
7 created_by 57369 non-null int64
8 date 57369 non-null object
9 headline 57369 non-null object
10 url 46053 non-null object
11 quote 46193 non-null object
12 grades_transformed_6 57369 non-null float64
13 rev_id 57369 non-null int64
14 rev_gender_updated 45787 non-null object
15 media_importance 57369 non-null int64
16 media_name 57369 non-null object
17 media_type_id 57369 non-null int64
18 media_type_name 57369 non-null object
19 title_title 57369 non-null object
20 title_keywords 51965 non-null object
21 title_language_primary 57369 non-null object
22 author_entity_ids_c 57369 non-null int64
23 first_name 57369 non-null object
24 last_name 57095 non-null object
25 birth_year 23997 non-null float64
26 demean_grades 57369 non-null float64
27 author_gender 57369 non-null object
dtypes: float64(3), int64(9), object(16)
### Filters used
Columns in fiction-reviews.csv:
```
Index(['id', 'created_by', 'media_id', 'date', 'headline', 'quote', 'grade',
'grade_normalized', 'url', 'referenced_title_ids', 'referenced_titles',
'reviewer_id', 'underheading', 'grading_max', 'grading_min',
'grading_precision', 'grading_symbol', 'media_importance',
'media_initials', 'media_name', 'media_type_id', 'media_url',
'media_type_name', 'clean_grades_scale6', 'clean_grading_max_scale6',
'grades_transformed_6', 'clean_grades_scale5',
'clean_grading_max_scale5', 'grades_transformed_5', 'rev_first_name',
'rev_last_name', 'rev_gender', 'rev_id', 'title_id', 'title_title',
'title_language_primary', 'title_language_translated_from',
'title_is_translated', 'title_category_id', 'title_keywords',
'category_id', 'category_name', 'title_ids', 'title_name', 'category_c',
'author_entity_ids_c', 'multiple_titles', 'review_title_id_c',
'multiauthors', 'rev_gender_updated'],
dtype='object')
```
### rev_gender_updated column:
* We have retrieved the gender of 16.309 reviewers with help from Kirsten.
* For the fiction dataset this gives us 14.760 reviewer genders.
### From the fiction-reviews.csv dataset...
* Filter away reviews of more than one book title and reviews of books with more than one author ==> removing 4.906 reviews out of 65.474 in total, which is 7,49%.
* Get the subset of columns: ['id','created_by', 'date', 'headline', 'url', 'quote','grades_transformed_6','rev_id','rev_gender_updated','media_importance','media_name','media_type_id','media_type_name','title_title','title_keywords','title_language_primary','title_id','author_entity_ids_c']
* Filter away reviews without a book author id ==> removing 147 reviews.
* Filter away reviews without a grade ==> removing 2957 reviews.
* Filter away reviews without a media name ==> removing 1 review.
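
The filters above can be sketched in pandas roughly as follows. This is a minimal sketch, not the script that produced the counts: it assumes that `multiple_titles` and `multiauthors` are boolean flags, and the toy rows and the helper name `apply_filters` are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for fiction-reviews.csv with only the columns the filters touch.
reviews = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "multiple_titles": [False, True, False, False, False],   # assumed boolean flag
    "multiauthors": [False, False, False, False, False],     # assumed boolean flag
    "author_entity_ids_c": ["[10]", "[11]", None, "[12]", "[13]"],
    "grades_transformed_6": [4.0, 5.0, 3.0, None, 6.0],
    "media_name": ["Media A", "Media B", "Media A", "Media C", None],
})


def apply_filters(df):
    """Apply the filters in order and record how many rows each one removes."""
    steps = [
        ("more than one title or author",
         lambda d: ~(d["multiple_titles"] | d["multiauthors"])),
        ("no author id", lambda d: d["author_entity_ids_c"].notna()),
        ("no grade", lambda d: d["grades_transformed_6"].notna()),
        ("no media name", lambda d: d["media_name"].notna()),
    ]
    removed = {}
    for label, keep in steps:
        before = len(df)
        df = df[keep(df)]          # each mask is computed on the rows still left
        removed[label] = before - len(df)
    return df, removed


filtered, removed = apply_filters(reviews)
print(removed)
```

Recording the number of removed rows per step makes it easy to check the counts listed above against a rerun.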
### From books:
* Get the following information: ['year_published_first', 'date_published_first', 'publication_count', 'publishing_house_ids', 'publishing_houses', 'id'] from referenced_titles in reviews-graph.ndjson
* Remove all reviews without a book id ==> removing 74 reviews in total
* The book id column is called 'title_id'
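
A minimal sketch of how the book fields could be pulled out of an NDJSON file. The layout is an assumption: each line is taken to be one review object whose `referenced_titles` key holds a list of title objects. The helper name `books_from_ndjson` and the two sample lines are made up.

```python
import io
import json

import pandas as pd

BOOK_FIELDS = ["year_published_first", "date_published_first", "publication_count",
               "publishing_house_ids", "publishing_houses", "id"]


def books_from_ndjson(lines):
    """Collect one row per referenced title from an iterable of NDJSON lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        # Assumed layout: a list of title dicts under 'referenced_titles'.
        for title in review.get("referenced_titles") or []:
            rows.append({field: title.get(field) for field in BOOK_FIELDS})
    books = pd.DataFrame(rows, columns=BOOK_FIELDS)
    books = books.dropna(subset=["id"])               # drop entries without a book id
    return books.rename(columns={"id": "title_id"})   # the book id is called title_id


# Made-up sample; in practice this would be open("reviews-graph.ndjson").
sample = io.StringIO(
    '{"id": 1, "referenced_titles": [{"id": 501, "year_published_first": 2015}]}\n'
    '{"id": 2, "referenced_titles": [{"year_published_first": 2018}]}\n'
)
books = books_from_ndjson(sample)
print(books)
```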
### Merge:
* Merge the two dataframes on 'title_id'
* Remove all reviews without an id
* Remove duplicates on 'id' and 'title_id'

Then we end up with a (57463, 22) dataframe.
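
A minimal sketch of the merge steps above on made-up data. The join type is not stated in these notes, so `how="outer"` is an assumption; an outer join produces rows without a review id, which would explain why such rows have to be removed afterwards.

```python
import pandas as pd

# Made-up stand-ins for the filtered reviews and the book table.
reviews = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "title_id": [501, 501, 502, 999],
    "grades_transformed_6": [4.0, 4.0, 5.0, 3.0],
})
books = pd.DataFrame({
    "title_id": [501, 502, 503],
    "publication_count": [2, 1, 4],
})

# Assumption: outer join, so books without a review survive with a NaN review id.
merged = reviews.merge(books, on="title_id", how="outer")
merged = merged.dropna(subset=["id"])                        # remove rows without a review id
merged = merged.drop_duplicates(subset=["id", "title_id"])   # one row per (review, book)

print(merged.shape)
print(merged.dtypes)
```

Note that integer columns which receive NaN during the join are upcast to float64 by pandas, which may be why several id columns show up as float64 in the info output below.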
#### Info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57369 entries, 0 to 57462
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title_id 57369 non-null float64
1 year_published_first 57105 non-null float64
2 date_published_first 57290 non-null object
3 publication_count 57369 non-null float64
4 publishing_house_ids 57369 non-null object
5 publishing_houses 57369 non-null object
6 id 57369 non-null float64
7 created_by 57369 non-null float64
8 date 57369 non-null object
9 headline 57369 non-null object
10 quote 46193 non-null object
11 grades_transformed_6 57369 non-null float64
12 rev_id 34635 non-null float64
13 rev_gender_updated 45787 non-null object
14 media_importance 57369 non-null float64
15 media_name 57369 non-null object
16 media_type_id 57369 non-null float64
17 media_type_name 57369 non-null object
18 title_title 57369 non-null object
19 title_keywords 51965 non-null object
20 title_language_primary 57369 non-null object
21 author_entity_ids_c 57369 non-null int64
22 first_name 57369 non-null object
23 last_name 57095 non-null object
24 birth_year 23997 non-null float64
25 gender 57369 non-null object
dtypes: float64(10), int64(1), object(15)
memory usage: 11.8+ MB
### From author_gender.csv
* Make a dataframe with ['author_entity_ids_c', 'first_name', 'last_name', 'birth_year', 'gender'], where author_entity_ids_c is the id column renamed.
* That gives us 12.822 unique author ids.
* In the big dataframe, change author_entity_ids_c to an int (123) instead of a string with square brackets ('[123]').
* Merge the author gender dataframe with the big (result) dataframe using an outer join --> author_entity_ids_c = author_entity_ids_c
* Drop all rows where the id is nan ==> we have an author id for all 57.369 reviews.
* Rename the column 'gender' to 'author_gender' so it's not mistaken for 'rev_gender'
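
The author steps above could look roughly like this in pandas. It is a sketch on made-up rows (names and gender codes are placeholders), and it assumes that every `author_entity_ids_c` value is a string holding exactly one id, such as `'[123]'`.

```python
import pandas as pd

# Made-up stand-in for the big (result) dataframe.
big = pd.DataFrame({
    "id": [1, 2, 3],
    "author_entity_ids_c": ["[123]", "[456]", "[123]"],
})

# Made-up stand-in for author_gender.csv; 'id' is renamed to match the big dataframe.
authors = pd.DataFrame({
    "id": [123, 456, 789],
    "first_name": ["Anna", "Bo", "Carla"],
    "last_name": ["A", "B", "C"],
    "birth_year": [1970.0, None, 1985.0],
    "gender": ["f", "m", "f"],
}).rename(columns={"id": "author_entity_ids_c"})

# '[123]' -> 123: strip the square brackets, then cast to int.
big["author_entity_ids_c"] = big["author_entity_ids_c"].str.strip("[]").astype(int)

result = big.merge(authors, on="author_entity_ids_c", how="outer")
result = result.dropna(subset=["id"])                        # authors without any review
result = result.rename(columns={"gender": "author_gender"})

print(result)
```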
### Further additions:
* Some of the newspapers' websites were listed as online media. This has now been changed.
* I have added a column with the demeaned grades (demean_grades).
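
These notes do not say which mean is subtracted in `demean_grades`. The sketch below assumes the overall mean grade and also shows a per-media variant for comparison; the column name `demean_grades_by_media` and the data are hypothetical.

```python
import pandas as pd

# Made-up grades for two media outlets.
df = pd.DataFrame({
    "media_name": ["A", "A", "B", "B"],
    "grades_transformed_6": [2.0, 4.0, 5.0, 6.0],
})

# Assumption: demeaning subtracts the overall mean grade.
df["demean_grades"] = df["grades_transformed_6"] - df["grades_transformed_6"].mean()

# Hypothetical alternative: subtract the mean grade of each media outlet instead.
media_mean = df.groupby("media_name")["grades_transformed_6"].transform("mean")
df["demean_grades_by_media"] = df["grades_transformed_6"] - media_mean

print(df)
```

Either way, the demeaned column sums to zero within the group whose mean was subtracted, which is a quick sanity check on the real data.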
### Questions
* If we need reviewer gender, then we remove 11.582 reviews, but we would still have 45.787 reviews.
* If we need reviewer id, then we need to remove 22.734 reviews ==> leading to 34.635 reviews.
* If we need both reviewer id and reviewer gender, then we will have 31.746 reviews.
* Do we need publishing_house_ids or publishing_houses?
    * Example:
        * publishing_house_ids: [6125, 7979]
        * publishing_houses: [{'id': 6125, 'name_main': 'Forlaget Brændpunkt'}, {'id': 7979, 'name_main': 'Lindhardt og Ringhof', 'name_sub': 'SAGA Egmont'}]
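
The review counts in the questions above can be recomputed with boolean masks. A minimal sketch on made-up rows, assuming missing reviewer ids and genders are stored as NaN:

```python
import pandas as pd

# Made-up rows; on the real data this would be the gold standard dataframe.
df = pd.DataFrame({
    "rev_id": [7.0, None, 9.0, None],
    "rev_gender_updated": ["f", "m", None, None],
})

has_id = df["rev_id"].notna()
has_gender = df["rev_gender_updated"].notna()

counts = {
    "total": len(df),
    "with reviewer gender": int(has_gender.sum()),
    "with reviewer id": int(has_id.sum()),
    "with both": int((has_id & has_gender).sum()),
}
print(counts)
```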
# gold_standard.csv: (57369, 27)
| Column name | Content | Gold standard filter | Notes |
| ----------- | ------- | -------------------- | ----- |
| id | Unique for each row | Should be there: filter out those without | |
| created_by | An integer with values [2, 880, 906, 960, 200, 197, 5, 424, 544] | | |
| date | When the review was published | | |
| headline | Review headline | | |
| url | URL of the review | | |
| quote | Quote from the review (not the book) | | |
| grades_transformed_6 | Grade on the 6-point Likert scale | | |
| rev_id | Unique id for reviewer | | We only have 34706 reviewer ids |
| rev_gender_updated | Gender of reviewer, updated with help from Kirsten | | |
| media_importance | Values: 100, 200, 300, nan | | |
| media_name | 304 unique names | | |
| media_type_id | Values: 1, 2, 3, 4, 5, 6, 9, nan (7 and 8 missing?) | | |
| media_type_name | Name of the media type; unique types in the data: Blog, Onlinemedie, Ugeblad, Avis, Fagblad, Onlinemedie uden citat, Regional avis, nan | | |
| title_title | The title of the book | Only reviews of one book ==> single title_title | |
| title_keywords | List of all keywords associated with the title | | |
| title_language_primary | | | |
| title_id | Unique book id | Should be there: filter out those without | |
| year_published_first | | | |
| date_published_first | | | |
| publisher: publishing_house_ids OR publishing_houses | Publishing house | | See question above |
| publication_count | Number of publications | | |
| author_entity_ids_c | Unique id for author | Should be there: filter out those without | |
| first_name | First name of author | | |
| last_name | Last name of author | | |
| birth_year | Birth year of author | | |
| author_gender | Gender of author | | |
| demean_grades | Demeaned value of the grades | | |
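
On the publisher question: if `publishing_houses` is stored in the CSV as the string form of a Python list of dicts (an assumption based on the example above), the ids and names can be recovered from it, which would make `publishing_house_ids` redundant. A minimal sketch using only the standard library:

```python
import ast

# The publishing_houses example value from the questions above, as a CSV cell string.
raw = ("[{'id': 6125, 'name_main': 'Forlaget Brændpunkt'}, "
       "{'id': 7979, 'name_main': 'Lindhardt og Ringhof', 'name_sub': 'SAGA Egmont'}]")

# literal_eval parses Python literals only, so it is safer than eval for data cells.
houses = ast.literal_eval(raw)

house_ids = [house["id"] for house in houses]
house_names = [house["name_main"] for house in houses]
sub_names = [house.get("name_sub") for house in houses]   # None when there is no sub-name

print(house_ids)
print(house_names)
```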