<style> .reveal { font-size: 28px; } </style>

# Reproducible Research

## Introduction

Wojciech Hardy, Łukasz Nawaro

E-mail: [wojciechhardy@uw.edu.pl](mailto:wojciechhardy@uw.edu.pl)

Follow -> https://hackmd.io/@WHardy/RR2023DS

---

## Technicalities

- The course is taught at the faculty (i.e. in person), with Zoom kept as a backup plan
- Course materials will be available via Google Drive / GitHub / on the [lecturer's website](https://www.acep.uw.edu.pl/hardy/reproducible-research/)
- Installing Git and registering on GitHub (or a similar service) will be necessary later on
- Hit us up in case of trouble
- If you fall behind (especially during the first few classes), you might have trouble catching up

---

## A roadmap of sorts (key elements of the course)

- What's the point anyway?
- Version Control Systems (Git - used throughout the course)
- Quarto and Markdown
- Ground rules (clean code, documentation, etc.)
- Tools (e.g. Kaggle, cloud services, Linux shell)
- AI use cases
- Your projects

---

## The Three Rs [(Peng, 2011)](https://pubmed.ncbi.nlm.nih.gov/22144613/)

### Repetition

Check if the results hold when the 'experiment' or process is exactly repeated

### Reproduction

E.g. new researchers reach the same outcome (with the same data)

### Replication

Check if results hold in a different setting / with different methods / frameworks, etc.

<!-- Talk about it. Repetition mostly e.g. in chemistry, to check the variability of measurements of an instrument. Reproduction in economics would involve the same data or collection methodology. As few changes as possible, but usually done by different scientists. Replication is often done to check if results hold in a different setting -->

<!-- Note: the borders between these terms are blurry. In one discipline, a change in the scientist might cause the study to be a replication (e.g. if subjects react to the scientist). In some disciplines you'll collect new data with each repetition. In others, you keep using the same dataset. In economics, collecting new large surveys is not always possible, etc. -->

---

### The Five Rs? [(Benureau and Rougier, 2018)](https://www.frontiersin.org/articles/10.3389/fninf.2017.00069/full)

The code should be:

- Re-runnable (executable)
- Repeatable (producing the same result more than once) <!-- think of setting seeds for pseudo-random number generation -->
- Reproducible (allowing an investigator to reobtain the published results)
- Reusable (easy to use, understand and modify)
- Replicable (acting as an available reference for any ambiguity in the algorithmic descriptions of the article).

<!-- this is more code-oriented and might be a better fit for non-academic coding! -->
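---

### Repeatability in practice: seeds

A minimal sketch in base R (the numbers are arbitrary): fixing the seed of the pseudo-random number generator makes a stochastic script give the same result on every run.

```r
# Without a fixed seed, every run of a stochastic script gives different numbers
mean(rnorm(100))

# Fixing the seed makes the draws repeatable
set.seed(1234)
draws_first <- rnorm(100)

set.seed(1234)
draws_second <- rnorm(100)

identical(draws_first, draws_second)  # TRUE: the same pseudo-random draws
```

Note that RNG defaults can change between R versions (e.g. `sample()` changed in R 3.6.0), so recording the R version matters too.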
---

## Peng (2011) as a graph (note: it's a spectrum)

<!-- .slide: style="font-size: 24px;" -->

![](https://i.imgur.com/PVcy2f6.png =600x)

Source: [Camfield & Palmer-Jones (2013) - Three "Rs" of Econometrics: Repetition, Reproduction and Replication](https://www.tandfonline.com/doi/abs/10.1080/00220388.2013.807504)

---

<!-- .slide: style="font-size: 24px;" -->

## Why replicate studies?

1) [Rob, R. and Waldfogel, J. (2007). Piracy on the Silver Screen. _Journal of Industrial Economics_.](https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-6451.2007.00316.x)
2) [Bai, J. and Waldfogel, J. (2012). Movie piracy and sales displacement in two samples of Chinese consumers. _Information Economics and Policy_.](https://www.sciencedirect.com/science/article/pii/S0167624512000388)
3) [Herz, B. and Kiljański, K. (2018). Movie piracy and displaced sales in Europe: Evidence from six countries. _Information Economics and Policy_.](https://www.sciencedirect.com/science/article/abs/pii/S016762451730152X)
4) [Ende et al. (2018). Global Online Piracy Study. _Google Report_.](https://www.ivir.nl/projects/global-online-piracy-study/)

#### Relationship between piracy and sales across studies

| Study | Sample | Years | % unpaid | Coef. |
| --- | -------------------------------- | ------- |:--------:|:--------:|
| 1 | 500 US students | 2002-05 | 5.2% | -0.76** |
| 2a | 372 Chinese students | 2006-08 | 74% | -0.14** |
| 2b | 3852 Chinese internet users | 2006-08 | 64% | 0.01 |
| 3 | 30k internet users; 6 countries | 2011-13 | 12% | -0.42*** |
| 4 | 35k internet users; 13 countries | 2015-17 | 10% | -0.22*** |

<!-- In this example the methods used were the same, but contexts differed. Thus, outcomes may differ due to: different country (culture), different sample (internet users vs students), years (change in access, internet speed, etc.). So: replication is important even if original findings are correct. -->

---

## Even more similar studies

| Study | Respondents | Content | Year of data | Coefficient |
| ------------------------ | --------------- | ------------- | ---------------- | ------------------- |
| Rob and Waldfogel (2006) | US students | Hit CD albums | 1999-2003 | -0.16* |
| Waldfogel (2009) | US students | TV series | 2005-06 or 06-07 | -0.31** |
| Waldfogel (2010) | US students | popular songs | 2008-09 or 09-10 | -0.15**/-0.24** |
| Hardy (2021) | Online readers | comic books | 2019 | -0.34***/-0.40*** |

<!-- These studies also used different types of content and slight variations of the original methodology (still similar though) -->

---

## Why reproduce research / make research reproducible?

- Transfer of knowledge
- Replication
- Work
- Own workflow
- Check for errors
- etc. (all good reasons)

<!-- transfer of knowledge - i.e. you learn when going through the steps of other researchers -->
<!-- replication - you might need to retrace someone's steps to then apply the same method to a different context. This doesn't necessarily mean you want to invalidate the original thing, but that you might want to compare different settings -->
<!-- work - you need other people in the workplace to be able to reuse your code and you hope that if you take over a piece of code from other people, it will be readable and documented -->
<!-- you actually also need it for yourself, when you're returning to a previously written piece of code -->
<!-- and sometimes basically to uncover errors -->

---

## Why is research *not* reproduced?
- Reproducing research is regarded as 'unoriginal'
- Time might be considered 'wasted'
- A successful reproduction does not contribute to a researcher's 'portfolio' (no returns to the work input)
- Might create tensions with other researchers or journal editors
- Might invalidate the priors underlying other completed work
- Researchers have little or no incentive to publish code and data to make research reproducible (fortunately, this is changing in the right direction)

see: [Camfield and Palmer-Jones (2013)](https://www.tandfonline.com/doi/full/10.1080/00220388.2013.807504)

---

## However, reproduction **is** important

See "[Replication Crisis](https://en.wikipedia.org/wiki/Replication_crisis)":

> _A [2016 study in the journal Science](https://science.sciencemag.org/content/351/6280/1433) found that one-third of 18 experimental studies from two top-tier economics journals (AER and QJE) failed to successfully replicate._

Also: you can browse through some non-replicable (and sometimes retracted) [marketing studies here](https://openmkt.org/research/replications-of-marketing-studies/), but note that it's not a random sample.

<!-- So basically there's this huge thing that started with psychology research, but as it turned out it actually goes far beyond that. Lots of published research cannot be replicated and sometimes even reproduced. To check if they can be replicated, you have to reproduce at least parts (to make sure you're doing the same thing). In psychology the problem is often small samples. Economic experiments seem to suffer from similar issues.

So what we're going to talk about now is:
1) WHY it's sometimes impossible to replicate findings.
2) What can be done about it in a systematic way.
3) What we can do to make our research easily reproducible.

https://science.sciencemag.org/content/351/6280/1433 -->

---

## So what's wrong with research? (1)

Publication bias or the file drawer problem:

- tendency to submit/publish statistically significant results, in line with prior research
- tendency to omit insignificant results and unconfirmed hypotheses
- less intuitive results might take more time to publish (we don't see them yet)

[As explained by XKCD](https://xkcd.com/882/).

<!-- much more. E.g. it might be more difficult to get the funding to translate a non-English paper that's not in line with prior research -->

---

## So what's wrong with research? (2)

p-hacking and questionable research practices:

- fiddling with model specifications (models, final sample, choice of control variables)
- leaving out contradicting results
- writing down the hypotheses after the results are in (HARKing)
- creative interpretation of p-values

[Quick guide by XKCD](https://xkcd.com/1478/) and even a related [R package](https://gist.github.com/rasmusab/8956852).

<!-- the whole system actually encourages this. You are rewarded for publishing, punished if you don't. You need funding and you get funding if you publish. You also go to conferences if you have interesting results. In numerous disciplines or regions you can lose your job if you don't publish. And it's difficult to just leave something you've worked on for a long time -->

<!-- another comment is this: in machine learning you don't really care that much. You can just run anything against everything else and continuously look for a better fit. You're only interested in the final predictive power, not in where it comes from -->
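---

<!-- .slide: style="font-size: 24px;" -->

### Why fiddling 'works': a quick simulation

A rough sketch in R (purely illustrative numbers): if you test 20 true null hypotheses at the 5% level, the chance that at least one comes out 'significant' is already about 64%.

```r
set.seed(882)  # arbitrary seed (a nod to the XKCD strip)

n_tests <- 20      # e.g. 20 jelly bean colours
n_obs   <- 100     # observations per group in each test
alpha   <- 0.05

# Simulate 20 tests in which the null hypothesis is true by construction
p_values <- replicate(n_tests, t.test(rnorm(n_obs), rnorm(n_obs))$p.value)

sum(p_values < alpha)        # 'significant' results obtained by chance alone
1 - (1 - alpha)^n_tests      # theoretical chance of at least one: ~0.64
```

Report only the 'green jelly bean' test and you have a publishable false positive.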
---

## So what's wrong with research? (3)

<!-- .slide: style="font-size: 24px;" -->

Even if there were no biases...

Let's recall statistical testing. We need a null hypothesis, e.g. that a coefficient is equal to 0. And an alternative, e.g. that it's larger than 0.

If the null is true, we'd expect our test statistic to be somewhere around the 0 point. If it's very far away, we conclude that it's probably from a different distribution after all.

![](https://i.imgur.com/XYWODfW.jpg)

---

## So what's wrong with research? (3)

<!-- .slide: style="font-size: 24px;" -->

That's testing in a nutshell. We typically choose a cutoff (e.g. 5%) for the test and compare it with the p-value.

Two ways we can go wrong though:

- Type I error: rejecting the null although it's true (rare things can happen)
- Type II error: failing to reject the null although it's false

And so, by statistics alone, we err every now and then.

Take a look at the [XKCD strip again](https://xkcd.com/882/) and count the 'testing' frames. That's a case of a Type I error followed by selection bias. Here's a [study on that as well](https://pubmed.ncbi.nlm.nih.gov/16895820/).

<!-- describe the process of testing. You start with the null and alternative hypotheses. Important bit: you can reject the null, but you can't confirm it. You calculate your test statistic. It typically measures how consistent a relationship is between two variables. The question is: how likely would it be for this statistic to take that value if the null hypothesis were true. If it's extremely unlikely, then you rather reject the null in favour of the alternative -->

<!-- the p-value tells you that chance of observing such a value -->

<!-- now consider this: you can make two kinds of mistakes:

type I error is the rejection of a true null hypothesis

type II error is the non-rejection of a false null hypothesis

The first basically depends on the chosen significance level (e.g. 5%)

The second also depends on the sample size.

Now consider a set of studies. By statistics alone, at least some of them are actually erroneous. E.g. with a significance level of 5% you have a 1 in 20 risk of making a mistake when rejecting the null. -->

---

## Also other stuff, like

### Bad coding practices

> First, we retrieve and analyze more than 2000 replication datasets with over 9000 unique R files published from 2010 to 2020 [at Harvard Dataverse]. (...) We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied, showing that many errors can be prevented with good coding practices.

[Trisovic et al. (2022)](https://www.nature.com/articles/s41597-022-01143-6)

---

## Solutions - systemic

Registered studies (mainly in medical journals)

Study marking (e.g. [Biostatistics journal and its "C", "D", and "R"](https://academic.oup.com/biostatistics/pages/General_Instructions))

Data / code sharing

Stricter p-value requirements

Publishing replication studies

<!-- even if they present no new findings -->

---

## Solutions - as authors

Meticulous code with comments, made publicly available

Well-documented process, environment, etc.

We'll cover some ground rules later in the course

Here's a call to ease up on the interpretation of p-values (and it's not the first or last): [Scientists rise up against statistical significance](https://www.nature.com/articles/d41586-019-00857-9)
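---

### A small habit that helps

A minimal sketch in R (the file name is just an example): recording the package versions and platform alongside your results makes the environment part of the documentation.

```r
# At the end of an analysis script: capture the exact environment used
session <- sessionInfo()
print(session)

# Keep a copy next to the results so others (and future you) can match versions
writeLines(capture.output(session), "session_info.txt")
```

Tools like `renv` or containers go further, but even this one step removes a lot of guesswork.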
---

## (why proper descriptions are so important...)

[The influence of hidden researcher decisions in applied microeconomics](https://onlinelibrary.wiley.com/doi/abs/10.1111/ecin.12992)

> _(...) Two published causal empirical results are replicated by seven replicators each. We find large differences in data preparation and analysis decisions, many of which would not likely be reported in a publication. No two replicators reported the same sample size. Statistical significance varied across replications, and for one of the studies the effect's sign varied as well._

---

## Solutions - meta-research or a study of studies

<!-- .slide: style="font-size: 28px;" -->

- Aggregates numerous existing studies and analyses them jointly.
- Different goals:
    - find the true effect size,
    - verify biases in the literature,
    - reduce standard errors,
    - identify new relationships.

<!-- talk about other biases or study characteristics. E.g. some of the biases might be specific to the authors' characteristics or the chosen method or country of study, etc. Stuff that can't be controlled for in a single study -->

- Requires detailed and meticulous documentation of the (often arbitrary) procedures -> for reproducibility.
- There are also meta-analyses of meta-analyses... [(one last XKCD - for today)](https://xkcd.com/1447/). Here's something of this sort: https://www.bmj.com/content/344/bmj.d7762

---

## Meta-research

Basic approaches:

- find a common measure across studies (coefficient in same units, a t-statistic or p-value, a difference etc.)
- regress it on study/data/author/period/etc. characteristics
- look at distributions of statistics
- search for irregularities or inconsistencies
- [funnel plots](https://en.wikipedia.org/wiki/Funnel_plot#/media/File:Funnelplot.png)

---

<!-- .slide: style="font-size: 24px;" -->

### Funnel plot - example

* The author conducts a meta-analysis of studies on the impact of ICT on economic performance.
* The figure below shows the size of the estimates (on the x-axis) against their precision (inverted standard error) on the y-axis.
* We can observe that the estimates are crowded mostly on the positive part of the x-axis, a sign of publication bias.
* Graphical tests provide the easiest method for the detection of publication bias (Polák, 2017).

![](https://i.imgur.com/CuJtULv.jpg)

[*Source*: Polák, P. (2017), The productivity paradox: A meta-analysis. *Information Economics and Policy*, vol. 38.](https://www.sciencedirect.com/science/article/pii/S0167624516301524#fig0002)

---

## Meta-research - example

<!-- .slide: style="font-size: 24px;" -->

[Krawczyk, M. (2015) - The Search for Significance: A Few Peculiarities in the Distribution of P Values in Experimental Psychology Literature. *PLOS ONE 10*(6).](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127872)

Studied >135,000 p-values from >5,000 studies. Found that "some authors choose the mode of reporting their results in an arbitrary way":

- "they frequently report p values “just above” significance thresholds directly, whereas other values are reported by means of inequalities (e.g. “p<.1”)"
- "they round the p values down more eagerly than up"
- "[they] appear to choose between the significance thresholds and between one- and two-sided tests only after seeing the data"
- "about 9.2% of reported p values are inconsistent with their underlying statistics (e.g. F or t)"
- "it appears that there are “too many” “just significant” values."

---

<!-- .slide: style="font-size: 24px;" -->

"Fig 2. Distribution of directly reported p values (restricted to .001–.15, equal weight of each paper)." From Krawczyk (2015)

![](https://i.imgur.com/z37vNWU.png =600x)

---

<!-- .slide: style="font-size: 24px;" -->

"Fig 3. Distribution of re-calculated actual p values (restricted to .001–.15)." From Krawczyk (2015)

![](https://i.imgur.com/yPOeMGN.png =600x)
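---

<!-- .slide: style="font-size: 24px;" -->

### Sketch: what a funnel plot is made of

A minimal simulation in R (all numbers are made up) of a funnel plot like the one in the Polák (2017) example a few slides back: each point is one study's estimate plotted against its precision. With no publication bias, the points form a symmetric funnel around the true effect.

```r
set.seed(2017)

n_studies   <- 200
true_effect <- 0.3

# Smaller studies -> larger standard errors -> noisier estimates
sample_sizes <- sample(30:2000, n_studies, replace = TRUE)
std_errors   <- 1 / sqrt(sample_sizes)
estimates    <- rnorm(n_studies, mean = true_effect, sd = std_errors)

plot(estimates, 1 / std_errors,
     xlab = "Estimated effect", ylab = "Precision (1 / standard error)",
     main = "Simulated funnel plot (no publication bias)")
abline(v = true_effect, lty = 2)
```

Selectively dropping 'insignificant' estimates from one side of this cloud produces the kind of asymmetry visible in the Polák (2017) figure.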
---

<!-- .slide: style="font-size: 24px;" -->

## Passing the course: Final project (50 pts.)

- In teams of **3**, with teamwork graded
- The topic choice has to be confirmed with the coordinator
- Topics at the end of this presentation and in course materials
- Must-haves for all projects:
    1. project stored in a repository from the start (viewable *history* of *well-described* contributions from *all* team members who *collaborate*) (20 pts.)
    2. code and results with documentation (e.g. Markdown) appropriate for *full* reproducibility (including software versions, etc.) (20 pts.)
    3. code in a clean and easily readable format (10 pts.)

<!-- this is what we grade -->

---

## The core idea of the project

The project isn't aimed at testing your programming, econometric or statistical skills. The correctness of your analytic reasoning **will not** be graded in this course.

Instead, your projects should focus on:

1) the process of reproduction (did it succeed? what were the challenges? what was missing in the original source? what didn't work? what could have been done in the original work instead?)
2) making your own process reproducible
3) documentation and good coding practices
4) collaboration and version control via Git

---

## More about the rules: plagiarism

No tolerance for plagiarism, **including self-plagiarism (!)**

Avoiding plagiarism is not just a matter of citing the source: copying *large* parts from one project to another, especially without adequate modifications, also counts as plagiarism.

Project topics (and scope!) are agreed upon via e-mail. If you change them without approval, your project will not be graded (or will be graded according to the initially agreed scope).

---

## More about the rules: AI policy

This course is not about your programming skills. It's about understanding the importance of reproducibility and the ways of achieving it.

You can use AI to support your writing and coding. If so, please state:

- the scope of the support,
- the AI model and version,
- the (exact) prompts used.

(Also cite other online sources, even if only of code snippets)

---

## More about the rules: examples of improper behaviour

*Example 1*: the team reproduced a study based on a Kaggle notebook. They provided a source at the beginning of the documentation, but the only thing they changed was the color of lines and dots in charts.

*Example 2*: the team reproduced most of the study alone, but one function is copied verbatim from a notebook published by a GitHub user without providing a source.

*Example 3*: the team used ChatGPT to do a literature review without stating the prompts.

The behaviour in **all** of the examples above will not be accepted and will result in lower grades (and, in some cases, failing the course).

---

## More about the rules: teamwork

Work in teams of **3**. If you can't find a team by the time of the deadline, you will be assigned to a team.

If you run into any problems regarding your collaboration, address them on the go. You may involve the course coordinator if needed.

**Do not wait** until the deadline to notify the coordinator about not having been able to work with other team members.

---

## Passing the course: activity (50 pts.)

Along the way, we'll sometimes be doing class activities (i.e. during classes).

- You'll be asked to upload them to an online repository at the end of the class (as a prerequisite for having them counted).
- Not every week
- It's not about finishing everything, but about checking activity during the classes
- Each activity will be counted as done (1) or not done (0). At the end, the total will be rescaled for a maximum of 50 points towards the final grade.
    - i.e. if you do 4 out of 7 activities, you'll get 4*(50/7) = 28.6, i.e. 29 pts. towards the final grade.

---

## Deadlines

- March 31, 2023 -> send information about teams (send an e-mail including the names of the three members and indicating who is the leader)
- April 16, 2023 -> send a link to the team's GitHub repository (or invite via GitHub)
- June 18, 2023 -> no further changes in the project repository allowed

You can finish the project earlier and ask for an earlier grading as well.

---

# Some references / interesting sources / reminders

----

## Statistical testing and hypotheses

[Wikipedia: Statistical hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing)

[Wikipedia: Type I and Type II errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors)

----

<!-- .slide: style="font-size: 32px;" -->

## Reproducible research
### and what's wrong with research

[Coursera - Reproducible Research course by Roger Peng](https://www.coursera.org/learn/reproducible-research)

[NBIS - Reproducible Research course](https://nbis-reproducible-research.readthedocs.io/en/latest/)

[Wikipedia: Replication crisis](https://en.wikipedia.org/wiki/Replication_crisis)

[Wikipedia: Misuse of p-values](https://en.wikipedia.org/wiki/Misuse_of_p-values)

[Wikipedia: Data dredging](https://en.wikipedia.org/wiki/Data_dredging)

[Wikipedia: Researcher degrees of freedom](https://en.wikipedia.org/wiki/Researcher_degrees_of_freedom)

[Wikipedia: HARKing](https://en.wikipedia.org/wiki/HARKing)

[Wikipedia: Publication bias](https://en.wikipedia.org/wiki/Publication_bias)

[Wikipedia: Reporting bias](https://en.wikipedia.org/wiki/Reporting_bias)

----

## Meta-analysis

[Ioannidis, J.P.A. (2018). Meta-research: Why research on research matters. PLoS Biology 16(3).](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005468)

[Havránek et al. (2020). Reporting Guidelines for Meta-Analysis in Economics. Journal of Economic Surveys 34(3), pp. 469-475.](https://onlinelibrary.wiley.com/doi/full/10.1111/joes.12363)

[Wikipedia: Metascience](https://en.wikipedia.org/wiki/Metascience)

----

## and the links provided earlier in the presentation

---

# Thank you!

---

# Topic Examples

---

<!-- .slide: style="font-size: 24px;" -->

1) Take an econometric / statistical analysis (e.g. from your bachelor thesis or Kaggle):

- Translate the code to a different programming language (e.g. R to Python or Stata to R, etc.)\*
- Reproduce the results.
- Pick a way to improve the study or update the data\*\* or perform a robustness check, and do it.
- Discuss the findings, potential problems, inconsistencies and conclusions.

\*you should try exploiting the advantages and functions of the 'new' language

\*\*a data update can include: collecting new survey data (not many), using newer datasets, etc.

---

2) Take a published (in a peer-reviewed journal) research paper with no code attached. Reproduce its (main) findings. Report on the problems along the way.

---

3) Take a simple meta-analysis study (examples below). Then do the following:

- Reproduce the obtained results using the reported sample of studies.
- Add 2-5 newer studies, preferably using the selection process reported in the original study.
- Replicate the results with the extended sample.
- Describe your findings and discuss them.

Note: check for data availability (for one of these it is not that easy).

Examples:

Card and Krueger (1995) - Time-Series Minimum-Wage Studies

Görg and Strobl (2001) - Multinational Companies and Productivity Spillovers

Glass, G. V., & Smith, M. L. (1979). Meta-Analysis of Research on Class Size and Achievement. Educational Evaluation and Policy Analysis, 1(1), 2–16.

---
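<!-- .slide: style="font-size: 24px;" -->

### A starting point for topic 3

A minimal sketch in R (the coefficients and standard errors below are invented): the simplest 'common measure' pooling is an inverse-variance weighted average of the reported estimates.

```r
# Invented example inputs: one estimate and one standard error per study
estimates  <- c(0.12, 0.25, -0.05, 0.18)
std_errors <- c(0.08, 0.12, 0.06, 0.10)

# Fixed-effect (inverse-variance) pooled estimate and its standard error
weights   <- 1 / std_errors^2
pooled    <- sum(weights * estimates) / sum(weights)
pooled_se <- sqrt(1 / sum(weights))

c(pooled = pooled, se = pooled_se)
```

Dedicated packages (e.g. `metafor` or `meta`) implement this and much more, but reproducing a published meta-analysis usually starts from exactly this kind of table of estimates and standard errors.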
{"metaMigratedAt":"2023-06-17T20:01:04.591Z","metaMigratedFrom":"YAML","title":"Reproducible Research 1 - why RR?","breaks":true,"description":"View the slide with \"Slide Mode\".","contributors":"[{\"id\":\"1c10bb23-6c4c-4c1b-8586-5f8d56305139\",\"add\":54937,\"del\":31201},{\"id\":\"d3dd1f20-d416-420a-903a-4f2cafc555ef\",\"add\":1163,\"del\":124}]"}