---
tags: reviews
---

# review sprint

## descriptive stats

### Proportion of female/male reviewers and books

With reviewer-id duplicates:

![](https://i.imgur.com/lA7lGwT.png)

Without book-id duplicates (so apparently, not many books are reviewed more than once in 'Online media'):

![](https://i.imgur.com/Aq4AopH.png)

![](https://i.imgur.com/U7qWiDD.png)

- note that the blogosphere is dominated by women, whereas newspapers are dominated by men (gender labeling: reviewerGender_authorGender)

### Author gender over years:

![](https://i.imgur.com/otpS5wf.png)

- so books written by men are, in general, reviewed more often
- does this reflect a bias in publications?

### Nr of unique reviewers per year by gender:

![](https://i.imgur.com/b9Tsmct.png)

- overall: female = 499, male = 451, NA = 291
- NB! rev_id = 0 is probably a mistake! (female 0 = 14041; NA 0 = 8693)

### Top reviewers:

![](https://i.imgur.com/x7dE0fh.png)

- so, some reviewers are highly productive!
- *can this be true?*
- mean number of reviews per reviewer: 27.95 (SD = 73.4)

### grade inflation over time

![](https://i.imgur.com/HqBhtCo.png)

So we see an increase in the average grades, both in newspapers and in blogs:

![](https://i.imgur.com/IQx2akX.png)

The number of reviews in each media type differs over the years:

![](https://i.imgur.com/4cxrMxU.png)

and with the same y-axis:

![](https://i.imgur.com/9PcJmvV.png)

How does the average grade develop over the years in the different media types?

![](https://i.imgur.com/lEJCMLJ.png)

Here we see that grades given in newspapers are quite stable, with a small increase. From 2016 to 2021, the grades in blogs are also quite stable but higher than in newspapers. And from the plots above, we can see a marked increase in the number of blog reviews in that period.
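
The per-year, per-media-type averages behind the last plot could be computed roughly as follows. This is a minimal sketch in plain Python, not the code actually used for the plots: the toy rows, the grade values, and the function name `mean_grade_by_year_and_media` are hypothetical, and only the field names `rev_year` and `media_type_name` come from these notes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows: (rev_year, media_type_name, grade).
# The grade values are made up for illustration only.
reviews = [
    (2016, "Avis", 4), (2016, "Avis", 3), (2016, "Blog", 5),
    (2017, "Avis", 4), (2017, "Blog", 5), (2017, "Blog", 4),
]


def mean_grade_by_year_and_media(rows):
    """Return {(year, media_type): mean grade}, sorted by year then media type."""
    groups = defaultdict(list)
    for year, media, grade in rows:
        groups[(year, media)].append(grade)
    return {key: mean(grades) for key, grades in sorted(groups.items())}


summary = mean_grade_by_year_and_media(reviews)
for (year, media), avg in summary.items():
    print(year, media, avg)
```

With the reviews in a pandas DataFrame, the same summary is a `groupby` on the two columns followed by `.mean()` on the grade column.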
### full-star reviews

![](https://i.imgur.com/d8SsJ0j.png)

![](https://i.imgur.com/QWMgxoj.png)

Author-gender proportion in all reviews:

**NB** The numbers on the bars are the number of books in question (for both genders).

![](https://i.imgur.com/kAlD3UM.png)

Author-gender proportion in full-star reviews:

![](https://i.imgur.com/i0139Oe.png)

Author-gender proportion in multiple full-star reviews:

![](https://i.imgur.com/n7cNM8l.png)

## correlations

![](https://i.imgur.com/IQ7RqSA.png)

- we can see that the most important correlations are media_type_name\*media_importance, rev_gender_updated\*media_importance, rev_gender_updated\*media_type_name, rev_year\*media_importance, rev_year\*media_type_name

## models - simple

### RQ1: gender-gender

- Intercept: female reviewer and female author: 0.0554
- C(author_gender)[T.male]: female reviewer and male author: 0.0554 - 0.0512 = 0.0042
- C(rev_gender_updated)[T.male]: male reviewer and female author: 0.0554 - 0.2702 = -0.2148
- C(author_gender)[T.male]:C(rev_gender_updated)[T.male]: male reviewer and male author: 0.0554 - 0.0512 - 0.2702 + 0.1445 = -0.1215

![](https://i.imgur.com/HShGrjV.png)

### RQ2: gender-media_type

* Intercept: female reviewer in newspapers: **-0.3090**
* C(rev_gender_updated)[T.male]: male reviewer in newspapers: -0.3090 + 0.1572 = **-0.1518**
* C(media_type_name)[T.Blog]: female reviewer in blogs: -0.3090 + 0.5282 = **0.2192**
* C(media_type_name)[T.Fagblad]: female reviewer in trade journals: **not statistically significant**
* C(media_type_name)[T.Onlinemedie]: female reviewer in online media: -0.3090 + 0.3193 = **0.0103**
* C(media_type_name)[T.Onlinemedie uden citat]: female reviewer in online media without quote: -0.3090 + 0.5031 = **0.1941**
* C(media_type_name)[T.Regional avis]: female reviewer in regional newspapers: -0.3090 + 0.1440 = **-0.1650**
* C(media_type_name)[T.Ugeblad]: female reviewer in weekly magazines: -0.3090 + 0.5051 = **0.1961**
* C(rev_gender_updated)[T.male]:C(media_type_name)[T.Blog]: male reviewer in blogs: -0.3090 + 0.1572 + 0.5282 - 0.3581 = **0.0183**
* C(rev_gender_updated)[T.male]:C(media_type_name)[T.Fagblad]: male reviewer in trade journals: -0.3090 + 0.1572 - 0.1230 + 0.4866 = **0.2118**
* C(rev_gender_updated)[T.male]:C(media_type_name)[T.Onlinemedie]: male reviewer in online media: -0.3090 + 0.1572 + 0.3193 - 0.3251 = **-0.1576**
* C(rev_gender_updated)[T.male]:C(media_type_name)[T.Onlinemedie uden citat]: male reviewer in online media without quote: **not statistically significant**
* C(rev_gender_updated)[T.male]:C(media_type_name)[T.Regional avis]: male reviewer in regional newspapers: -0.3090 + 0.1572 + 0.1440 - 0.2023 = **-0.2101**
* C(rev_gender_updated)[T.male]:C(media_type_name)[T.Ugeblad]: male reviewer in weekly magazines: **non-existent**

$\implies$

* female reviewers grade the lowest in newspapers (-0.3090) and the highest in blogs (0.2192).
* male reviewers grade the lowest in regional newspapers (-0.2101) and the highest in trade journals (0.2118).
* the lowest grades are (on average) given by female reviewers in newspapers (-0.3090).
* the highest grades are (on average) given by female reviewers in blogs (0.2192).
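
The RQ2 sums above all follow one pattern under treatment coding: start from the intercept (female reviewer in newspapers), then add the reviewer-gender effect, the media-type effect, and their interaction as applicable. A small sketch of that bookkeeping, assuming the coefficients as copied into these notes; the helper name `cell_mean` and the dictionary layout are hypothetical, and statistical significance is not taken into account:

```python
# Coefficients copied from the RQ2 notes (reference cell: female reviewer in newspapers).
INTERCEPT = -0.3090
MALE = 0.1572  # C(rev_gender_updated)[T.male]

# Media-type main effects, C(media_type_name)[T.<name>].
MEDIA = {
    "Blog": 0.5282,
    "Fagblad": -0.1230,  # reported as not statistically significant
    "Onlinemedie": 0.3193,
    "Onlinemedie uden citat": 0.5031,
    "Regional avis": 0.1440,
    "Ugeblad": 0.5051,
}

# Interactions C(rev_gender_updated)[T.male]:C(media_type_name)[T.<name>].
# Only the interactions whose values appear in the notes are listed.
MALE_X_MEDIA = {
    "Blog": -0.3581,
    "Fagblad": 0.4866,
    "Onlinemedie": -0.3251,
    "Regional avis": -0.2023,
}


def cell_mean(male, media=None):
    """Predicted demeaned grade for one reviewer-gender x media-type cell.

    media=None means the reference media type (newspapers). A KeyError is
    raised for male cells whose interaction value is not given in the notes.
    """
    total = INTERCEPT
    if male:
        total += MALE
    if media is not None:
        total += MEDIA[media]
        if male:
            total += MALE_X_MEDIA[media]
    return round(total, 4)


for media in MALE_X_MEDIA:
    print(media, "female:", cell_mean(False, media), "male:", cell_mean(True, media))
```

Keeping the coefficients in dictionaries makes it easy to recompute every cell when the model is refitted, instead of redoing the sums by hand.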
![](https://i.imgur.com/ad8qxSs.png)

### RQ3: genre (?), blog genre preference

### RQ4: media_importance\*gender

| | Estimate | Std.Error | t-value | p(>\|t\|) |
| ------------------- | --------- | --------- | ------- | ----------- |
| (Intercept) | 0.175539 | 0.175539 | 23.673 | < 2e-16 *** |
| media_importance200 | -0.279354 | 0.021101 | -13.239 | < 2e-16 *** |
| media_importance300 | -0.418904 | 0.013467 | -31.107 | < 2e-16 *** |
| media_importance200:rev_gender_updatedmale | 0.223404 | 0.036086 | 6.191 | 6.04e-10 *** |
| media_importance300:rev_gender_updatedmale | 0.402571 | 0.027957 | 14.400 | < 2e-16 *** |

As a plot:

![](https://i.imgur.com/Hd8L2KN.png)

### RQ5: blog positivity bias (media type)

- Intercept: Avis: -0.2188
- C(media_type_name)[T.Blog]: -0.2188 + 0.4458 = 0.2270
- C(media_type_name)[T.Fagblad]: -0.2188 + 0.2298 = 0.0110
- C(media_type_name)[T.Onlinemedie]: -0.2188 + 0.2722 = 0.0534
- C(media_type_name)[T.Onlinemedie uden citat]: -0.2188 + 0.4427 = 0.2239
- C(media_type_name)[T.Regional avis]: not statistically significant
- C(media_type_name)[T.Ugeblad]: -0.2188 + 0.3812 = 0.1624

$\implies$ 'Newspapers' give the lowest grades, 0.2188 points below average. 'Blogs' and 'Online media without quote' give the highest grades, about 0.22 points above average.

## models - mixed effects

gender-gender (1|media_type_name)

`res = smf.mixedlm(formula='demean_grades ~ C(rev_gender_updated)', data=df_gender, groups=df_gender["media_type_name"]).fit()`

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|-----------------------------------------------|--------|----------|--------|---------|--------|--------|
| Intercept | -0.204 | 0.071 | -2.878 | 0.004 | -0.342 | -0.065 |
| C(media_type_name)[T.Blog] | 0.352 | 0.013 | 27.429 | 0.000 | 0.327 | 0.377 |
| C(media_type_name)[T.Fagblad] | 0.157 | 0.083 | 1.887 | 0.059 | -0.006 | 0.320 |
| C(media_type_name)[T.Onlinemedie] | 0.098 | 0.015 | 6.390 | 0.000 | 0.068 | 0.128 |
| C(media_type_name)[T.Onlinemedie uden citat] | 0.331 | 0.031 | 10.765 | 0.000 | 0.271 | 0.392 |
| C(media_type_name)[T.Regional avis] | -0.043 | 0.024 | -1.802 | 0.072 | -0.090 | 0.004 |
| C(media_type_name)[T.Ugeblad] | 0.397 | 0.029 | 13.798 | 0.000 | 0.340 | 0.453 |
| Group Var | 0.059 | | | | | |

gender-gender + rev_year + (1|media_type_name)

`res = smf.mixedlm(formula='demean_grades ~ C(rev_gender_updated) * C(year)', data=df_gender, groups=df_gender["media_type_name"]).fit()`

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|-----------------------------------------------|--------|----------|--------|---------|--------|--------|
| Intercept | -0.443 | 0.082 | -5.396 | 0.000 | -0.603 | -0.282 |
| C(rev_gender_updated)[T.male] | 0.297 | 0.049 | 6.113 | 0.000 | 0.202 | 0.392 |
| C(year)[T.2011] | 0.105 | 0.053 | 1.998 | 0.046 | 0.002 | 0.208 |
| C(year)[T.2012] | 0.265 | 0.052 | 5.110 | 0.000 | 0.164 | 0.367 |
| C(year)[T.2013] | 0.257 | 0.052 | 4.942 | 0.000 | 0.155 | 0.359 |
| C(year)[T.2014] | 0.326 | 0.051 | 6.348 | 0.000 | 0.226 | 0.427 |
| C(year)[T.2015] | 0.375 | 0.046 | 8.242 | 0.000 | 0.286 | 0.464 |
| C(year)[T.2016] | 0.518 | 0.043 | 11.927 | 0.000 | 0.433 | 0.604 |
| C(year)[T.2017] | 0.566 | 0.044 | 13.001 | 0.000 | 0.481 | 0.651 |
| C(year)[T.2018] | 0.548 | 0.044 | 12.569 | 0.000 | 0.463 | 0.634 |
| C(year)[T.2019] | 0.533 | 0.044 | 12.242 | 0.000 | 0.448 | 0.619 |
| C(year)[T.2020] | 0.469 | 0.044 | 10.745 | 0.000 | 0.383 | 0.554 |
| C(year)[T.2021] | 0.444 | 0.045 | 9.850 | 0.000 | 0.355 | 0.532 |
| C(rev_gender_updated)[T.male]:C(year)[T.2011] | 0.024 | 0.066 | 0.358 | 0.720 | -0.106 | 0.153 |
| C(rev_gender_updated)[T.male]:C(year)[T.2012] | -0.020 | 0.066 | -0.301 | 0.763 | -0.150 | 0.110 |
| C(rev_gender_updated)[T.male]:C(year)[T.2013] | -0.098 | 0.066 | -1.476 | 0.140 | -0.228 | 0.032 |
| C(rev_gender_updated)[T.male]:C(year)[T.2014] | -0.104 | 0.066 | -1.584 | 0.113 | -0.233 | 0.025 |
| C(rev_gender_updated)[T.male]:C(year)[T.2015] | -0.219 | 0.059 | -3.692 | 0.000 | -0.336 | -0.103 |
| C(rev_gender_updated)[T.male]:C(year)[T.2016] | -0.347 | 0.057 | -6.073 | 0.000 | -0.458 | -0.235 |
| C(rev_gender_updated)[T.male]:C(year)[T.2017] | -0.410 | 0.057 | -7.180 | 0.000 | -0.522 | -0.298 |
| C(rev_gender_updated)[T.male]:C(year)[T.2018] | -0.383 | 0.057 | -6.708 | 0.000 | -0.496 | -0.271 |
| C(rev_gender_updated)[T.male]:C(year)[T.2019] | -0.334 | 0.057 | -5.834 | 0.000 | -0.446 | -0.222 |
| C(rev_gender_updated)[T.male]:C(year)[T.2020] | -0.279 | 0.058 | -4.787 | 0.000 | -0.393 | -0.165 |
| C(rev_gender_updated)[T.male]:C(year)[T.2021] | -0.271 | 0.061 | -4.452 | 0.000 | -0.390 | -0.152 |
| Group Var | 0.035 | 0.021 | | | | |

media_type_name (1|rev_year): (I get a convergence warning in python statsmodels...)

`res = smf.mixedlm(formula='demean_grades ~ C(media_type_name)', data=df_gender, groups=df_gender["year"]).fit()`

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|-----------------------------------------------|--------|----------|--------|---------|--------|--------|
| Intercept | -0.204 | 0.071 | -2.878 | 0.004 | -0.342 | -0.065 |
| C(media_type_name)[T.Blog] | 0.352 | 0.013 | 27.429 | 0.000 | 0.327 | 0.377 |
| C(media_type_name)[T.Fagblad] | 0.157 | 0.083 | 1.887 | 0.059 | -0.006 | 0.320 |
| C(media_type_name)[T.Onlinemedie] | 0.098 | 0.015 | 6.390 | 0.000 | 0.068 | 0.128 |
| C(media_type_name)[T.Onlinemedie uden citat] | 0.331 | 0.031 | 10.765 | 0.000 | 0.271 | 0.392 |
| C(media_type_name)[T.Regional avis] | -0.043 | 0.024 | -1.802 | 0.072 | -0.090 | 0.004 |
| C(media_type_name)[T.Ugeblad] | 0.397 | 0.029 | 13.798 | 0.000 | 0.340 | 0.453 |
| Group Var | 0.059 | | | | | |

To control for multiple reviews of a single book, add a random intercept per book, something like (1 | title_id).

## book clustering

highbrow-middlebrow-lowbrow

![](https://i.imgur.com/T8OWLSS.png)