Wojciech Hardy, Jakub Michańków
E-mail: wojciechhardy@uw.edu.pl
Follow -> https://hackmd.io/@WHardy/RR2024
Check if the results hold when the 'experiment' or process is exactly repeated
E.g. new researchers reach the same outcome (with the same data)
Check if results hold in a different setting / with different methods / frameworks, etc.
The code should be
Rob, R. and Waldfogel, J. (2007). Piracy on the Silver Screen. Journal of Industrial Economics.
Ende et al. (2018). Global Online Piracy Study. Google Report.
No. | Sample | Years | % unpaid | Coef.
---|---|---|---|---
1 | 500 US students | 2002-05 | 5.2% | -0.76** |
2a | 372 Chinese students | 2006-08 | 74% | -0.14** |
2b | 3852 Chinese internet users | 2006-08 | 64% | 0.01 |
3 | 30k internet users; 6 countries | 2011-13 | 12% | -0.42*** |
4 | 35k internet users; 13 countries | 2015-17 | 10% | -0.22*** |
Study | Respondents | Content | Year of data | Coefficient |
---|---|---|---|---|
Rob and Waldfogel (2006) | US students | Hit CD albums | 1999-2003 | -0.16* |
Waldfogel (2009) | US students | TV series | 2005-06 or 06-07 | -0.31** |
Waldfogel (2010) | US students | popular songs | 2008-09 or 09-10 | -0.15**/-0.24** |
Hardy (2021) | Online readers | comic books | 2019 | -0.34***/-0.40*** |
See "Replication Crisis":
A 2016 study in the journal Science found that about one-third of 18 experimental studies from two top-tier economics journals (AER and QJE) failed to replicate successfully.
Also: you can browse through some non-replicable (and sometimes retracted) marketing studies here, but note that it's not a random sample.
Publication bias or the file drawer problem:
p-hacking and questionable research practices:
Quick guide by XKCD and even a related R package.
Even if there were no biases… Let's recall statistical testing.
We need a null hypothesis, e.g. that a coefficient is equal to 0.
And an alternative, e.g. that it's larger than 0.
If the null is true, we'd expect our test statistic to fall somewhere around zero.
If it's very far away, we conclude that it's probably from a different distribution after all.
That's testing in a nutshell.
We typically choose a significance level (e.g. 5%) and compare it with the p-value (equivalently, we compare the test statistic with the corresponding critical value).
Two ways we can go wrong though:
Type I error: rejecting a true null (a false positive)
Type II error: failing to reject a false null (a false negative)
And so, by statistics alone, we err every now and then (the simulation sketch below illustrates this).
Take a look at the XKCD strip again and count the 'testing' frames. That's a case of a Type I error followed by selection bias. Here's a study on that as well.
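To see how often such false positives arise by chance alone, here is a minimal simulation sketch in R (an illustration added here, not part of any cited study): it runs many t-tests on data for which the null hypothesis is true and counts how often the 5% threshold is crossed.

```r
# Minimal illustration of the Type I error rate: the null is true in every
# simulated dataset, yet about 5% of tests come out "significant".
set.seed(123)                        # fix the RNG so the simulation is reproducible

n_tests  <- 10000
p_values <- replicate(n_tests, {
  x <- rnorm(30)                     # data from N(0, 1), so the true mean is 0
  t.test(x, mu = 0)$p.value          # test H0: mean = 0
})

mean(p_values < 0.05)                # share of false positives, roughly 0.05
```

Because the p-value is uniformly distributed under a true null, the share of "significant" results should settle close to the nominal 5% level.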
First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020 [at Harvard Dataverse]. (…) We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied, showing that many errors can be prevented with good coding practices.
Registered studies (mainly in medical journals, increasingly in others too)
Study marking (e.g. Biostatistics journal and its "C", "D", and "R")
Data / code sharing
Stricter p-value requirements
Publishing replication studies
Meticulous code with comments, made publicly available (a small illustration follows this list)
Well-documented process, environment, etc.
We'll cover some ground rules later in the course
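As a small illustration of the "meticulous code" and "well-documented environment" points above (a sketch of common habits, not a required workflow for this course), in R one can fix the random seed, save intermediate objects, and record the session information:

```r
# A small, self-contained illustration of basic reproducibility habits.
set.seed(2024)                      # fix the seed so random steps are repeatable

x   <- rnorm(100)                   # placeholder data for the example
fit <- lm(x ~ 1)                    # placeholder analysis step
saveRDS(fit, "fit.rds")             # store the result object so others can inspect it

# Record the exact R version and package versions used in this run
writeLines(capture.output(sessionInfo()), "session_info.txt")
```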
Here's a call to ease up with the interpretation of p-values (and it's not the first or last): Scientists rise up against statistical significance
The influence of hidden researcher decisions in applied microeconomics
(…) Two published causal empirical results are replicated by seven replicators each. We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No two replicators reported the same sample size. Statistical significance varied across replications, and for one of the studies the effect's sign varied as well.
Aggregates numerous existing studies and analyses them jointly.
Different goals:
Requires detailed and meticulous documentation of the (often arbitrary) procedures -> for reproducibility.
There are also meta-analyses of meta-analyses… (one last XKCD, for today). Here's something of this sort: https://www.bmj.com/content/344/bmj.d7762
Basic approaches (a minimal code sketch follows this list):
find a common measure across studies (a coefficient in the same units, a t-statistic or p-value, a difference, etc.)
regress it on study/data/author/period/etc. characteristics
look at distributions of statistics
search for irregularities or inconsistencies
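A minimal R sketch of the first two approaches. The effect sizes below loosely echo the coefficients shown earlier, while the standard errors and study characteristics are invented purely for illustration; this is not data from any of the cited studies.

```r
# Hypothetical meta-analysis dataset: one row per study (illustration only)
studies <- data.frame(
  effect   = c(-0.16, -0.31, -0.15, -0.34, -0.42, -0.22),  # common measure across studies
  se       = c(0.08, 0.12, 0.06, 0.10, 0.09, 0.07),        # (invented) standard errors
  students = c(1, 1, 1, 0, 0, 0),                          # study characteristic: student sample?
  year     = c(2006, 2009, 2010, 2021, 2014, 2018)         # study characteristic: year
)

# Meta-regression: explain the effect sizes with study characteristics,
# weighting more precise studies (smaller SE) more heavily
meta_fit <- lm(effect ~ students + year, data = studies, weights = 1 / se^2)
summary(meta_fit)
```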
Studied >135,000 p-values from >5,000 studies.
Found that "some authors choose the mode of reporting their results in an arbitrary way"
"Fig 2. Distribution of directly reported p values (restricted to .001–.15, equal weight of each paper)." from Krawczyk (2015)
"Fig 3. Distribution of re-calculated actual p values (restricted to .001–.15)." from Krawczyk (2015)
"Effect sizes reported in the biomedical literature" from Monsarrat and Vergnes, 2018
The project isn't aimed at testing your programming, econometric or statistical skills. The correctness of your analytic reasoning will not be graded in this course.
Instead, your projects should focus on:
No tolerance for plagiarism, including self-plagiarism (!)
Plagiarism is not only a matter of missing citations: copying large parts from one project to another, especially without adequate modifications, also counts.
Project topics (and scope!) are agreed upon via e-mail. If you change them without approval, your project will not be graded (or will be graded according to the initially agreed scope).
This course is not about your programming skills. It's about understanding the importance of reproducibility and the ways of achieving it.
You can use AI (e.g. LLMs) to support your writing and coding. If so, please state:
(Also cite other online sources, even if only for code snippets.)
Work in teams of 2. If you can't find a team by the time of the deadline, you will be assigned to a team.
If you run into any problems with your collaboration, address them as they arise. You may involve the course coordinator if needed.
Do not wait until the deadline to notify the coordinator about not having been able to work with other team members.
Along the way, we'll sometimes do class assignments (i.e. during classes).
You can finish the project earlier and ask to be graded earlier as well.
Coursera - Reproducible Research course by Roger Peng
NBIS - Reproducible Research course
*You should try to exploit the advantages and features of the 'new' language.
**A data update can include collecting new survey data (it doesn't need to be much), using newer datasets, etc.
***You should discuss whether you managed to reproduce the original results accurately or not, and what the reasons might be. Also discuss whether your extension changed the conclusions and why that might be.