---
title: (2025) Good practices for reproducible data science
author:
  - Miguel Ponce de León
  - Patricio Reyes
date: 2025-02-24
slideOptions:
  transition: slide
abstract: >
  'Today, computers have become essential for data management. One of the main reasons for this is that all scientific disciplines share the common need to deal with large volumes of data: to organise, pre-process, analyze and, finally, convert a collection of raw datasets into pieces of knowledge or data-supported actions. Due to the central role data plays in research and industrial applications, it is also critical to follow guidelines and use standard analysis workflows that generate **reproducible results**.
  Although the concept of reproducibility itself is at the heart of data analysis and scientific research (think of a researcher carrying out an experiment in a chemistry lab), the majority of data practitioners, students and researchers have no formal training in reproducible scientific computing. In most cases, data scientists and researchers acquire their technical skills by doing, and they learn good practices the hard way, i.e. by making a lot of mistakes and spending countless hours trying to find out why things no longer work, or why the surprising result we found yesterday cannot be replicated when we want to show it to a colleague (or an advisor!).
  In this short tutorial, we will present a collection of good practices for reproducible data science, drawn from our own experience as well as from the experiences shared by colleagues. These practices can (or should) be adopted by any data scientist or researcher, regardless of their current level of computational skills.'
---

## Relevant Topics (1)

* How to organize a data science project
* Project templating
* Tools: Cookiecutter / Lazybones
* Software Carpentry
* Version control (Git)
* Random number generation (seeds)
* Notebooks vs scripts

----

## Relevant Topics (2)

* Collaborative working
* Fancy tools or the good old Bash command line
* The power of Bash filters and AWK

---

### Reproducible Research: General (1)

* Overview
* Open Research
* Version Control
* Licensing
* Research Data Management
* **Reproducible Environments**
* Document!!

----

### Reproducible Research: Coding (2)

* Code quality
* Code Testing
* Code Reviewing Process
* Continuous Integration and Delivery (CI/CD)
* Reusable Code
* BinderHub (allows users to share reproducible interactive computing environments)

----

### Reproducible Research: Research (3)

* Project Design
* Case Studies
* Risk Assessment
* Communication
* Collaboration
* Ethical Research

---

# Good practices for reproducible data science

* Hands-on material: https://gitlab.bsc.es/patc/gprds

---

## The 21st century could be considered the century of complexity

**Complex problems are found in many different domains**

* Climate change
* Molecular biology
* Public health
* Urban/Social Science
* ...

Data is fundamental to all of them!

Reproducibility is also critical!

---

## Why are we talking about reproducible research in the first place?

Isn't reproducibility the cornerstone of scientific knowledge?

What are we talking about?

> Replication is one of the central issues in any empirical science. To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception.
>
> A replication experiment to demonstrate that the same findings can be obtained in any other place by any other researcher is conceived as an operationalization of objectivity.
>
> It is the proof that the experiment reflects knowledge that can be separated from the specific circumstances (such as time, place, or persons) under which it was gained.

Source: https://en.wikipedia.org/wiki/Replication_crisis

---

## Have you heard about the reproducibility crisis?

---

![image](https://hackmd.io/_uploads/SyE7sZR5a.png)

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124

**"... There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims..."**

Oops... Houston, we have a (serious) problem...

---

### 1,500 scientists lift the lid on reproducibility

Reference: Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a

----

### A crisis in numbers

![image](https://hackmd.io/_uploads/HkIa8taqT.png)

Reference: Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a

----

### A crisis in numbers

![image](https://hackmd.io/_uploads/By64wYpqT.png)

Reference: Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a

----

### A crisis in numbers

![image](https://hackmd.io/_uploads/BkGtPY6q6.png)

Reference: Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a

---

Is science really facing a reproducibility crisis? (Not everybody agrees...)

![image](https://hackmd.io/_uploads/rJ6G42RhJl.png)

Reference: Fanelli, D. (2018). Is science really facing a reproducibility crisis, and do we need it to? PNAS, 115(11), 2628-2631. https://doi.org/10.1073/pnas.1708272114

> Scientific misconduct and questionable research practices (QRP) occur at frequencies that, while nonnegligible, are relatively small and therefore unlikely to have a major impact on the literature.

----

According to Fanelli:

* Scientific misconduct and questionable research practices (QRP) occur at frequencies that, while nonnegligible, are relatively small and therefore unlikely to have a major impact on the literature.
* Contemporary science could be more accurately portrayed as facing “new opportunities and challenges” or even a “revolution”.

---

But still, I have often found it really hard to reproduce other researchers' work, especially when implementing mathematical models or reproducing machine learning projects.

So where is the problem?

----

## Behavioural components of the reproducibility crisis

### The four horsemen of the reproducibility apocalypse

* HARKing (hypothesizing after the results are known)
* Publication bias
* Low statistical power
* p-hacking

----

## Reproducibility in model sharing

* Sharing the equations is a must, but sharing the code is also critical!
* Use community-specific model repositories (e.g. BioModels) or general ones (e.g. Zenodo)
* For those using agent-based models, check the ODD protocols:
    * https://github.com/OpenDataDynamics/ODD-protocols

----

## Reproducibility in machine learning: leakage

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.

----

#### No raw data, no science: another possible source of the reproducibility crisis

![image](https://hackmd.io/_uploads/SJkEuFpqT.png)

Reference: Miyakawa, T. No raw data, no science: another possible source of the reproducibility crisis. Mol Brain 13, 24 (2020). https://doi.org/10.1186/s13041-020-0552-2

---

*"... Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process ..."*

---

## The FAIR Principles

#### Findability, Accessibility, Interoperability, and Reuse of digital assets.

https://www.go-fair.org/fair-principles/

---

![image](https://hackmd.io/_uploads/rkh43F1iT.png)

Reference: Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

--- Terms and Abbreviations

* **DOI**: Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.
* **Interoperability**: the ability of data or tools from non-cooperating resources to integrate or work together with minimal effort.
* **FAIR**: Findable, Accessible, Interoperable, Reusable.
* **Provenance**: refers to the lineage of data, and processes that act on data and agents that are responsible for those processes.
* **Data lineage**: includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process.

--- ![image](https://hackmd.io/_uploads/rJvPgF6qT.png)

Reference: Heise et al. (2023) Ten simple rules for implementing open and reproducible research practices after attending a training course.

---

### Phases for implementing practices after a robust research course.

![image](https://hackmd.io/_uploads/Bkxdbz09p.png)

---

## Provenance is important!

#### Especially in large and complex workflows where raw data is manipulated (*)

Good practices:

* Store the original data and record the source from which it was retrieved (pointer, reference, URL, DOI, etc.).
* Persist the code used to process it.
* Make your work reproducible using a workflow management system such as Snakemake or Nextflow, or Bash scripts.
* Publish your code and data in **Zenodo** or a similar public repository.

(*) Filtering, aggregation, imputation, etc.
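
The first good practice (keeping the raw data together with a record of where it came from) can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `sha256_of` and `write_provenance`, the sidecar file name and the fields of the record are assumptions made for this example, not part of any standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(raw_file: Path, source: str) -> Path:
    """Write a JSON sidecar next to a raw file describing its origin.

    The record layout is hypothetical; adapt the fields to your project.
    """
    record = {
        "file": raw_file.name,
        "source": source,  # pointer, reference, URL, DOI, ...
        "sha256": sha256_of(raw_file),
        "size_bytes": raw_file.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = raw_file.with_name(raw_file.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Calling `write_provenance(Path("data/raw/input.csv"), "<URL or DOI of the source>")` would leave an `input.csv.provenance.json` file next to the data (the path is a placeholder). Committing the sidecar, but not necessarily the data itself, lets anyone check later that they are working with the same raw file.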

--- ## What is Zenodo?

Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, data sets, research software, reports, and any other research-related digital artefacts. For each submission, a persistent digital object identifier (DOI) is minted, which makes the stored items easily citable.

--- ## Why use Zenodo?

* **Safe**: your research is stored safely for the future in CERN’s Data Centre for as long as CERN exists.
* **Trusted**: built and operated by CERN and OpenAIRE to ensure that everyone can join in Open Science.
* **Citable**: every upload is assigned a Digital Object Identifier (DOI), to make it citable and trackable.
* **No waiting time**: uploads are made available online as soon as you hit publish, and your DOI is registered within seconds.
* **Open or closed**: share e.g. anonymized clinical trial data with only medical professionals via the restricted access mode.
* **Versioning**: easily update your dataset with the versioning feature.
* **GitHub integration**: easily preserve your GitHub repository in Zenodo.
* **Usage statistics**: all uploads display standards-compliant usage statistics.

---

## Let's pick an example

![image](https://hackmd.io/_uploads/S1PdVURcp.png)

- https://www.nature.com/articles/s41597-021-01093-5

---

Let's have a look

https://zenodo.org/communities/flow-maps

https://github.com/bsc-flowmaps

---

Hands-on: link your data to Zenodo

We will:

1. Create a Git repository
2. Make some changes and push them
3. Create a release
4. Connect our release to Zenodo

---

### Reproducible Research Version Control

1. Today's task: https://datacarpentry.org/rr-version-control/02-git-in-github/index.html
2. For Git novices: https://swcarpentry.github.io/git-novice/
3. Creating a Zenodo entry from a GitHub repository: https://zenodo.org/account/settings/github/

----

Clients generally authenticate either using passwords (less secure and not recommended) or SSH keys, which are very secure. Password logins are encrypted and are easy to understand for new users. Two ways to access:

- Password
- RSA keys (recommended)

![image](https://hackmd.io/_uploads/HkPSL805T.png)

----

### Generating and Working with SSH Keys

SSH keys: a matching set of cryptographic keys which can be used for authentication. Each set contains a public and a private key.

* The public key can be shared freely without concern.
* The private key must be vigilantly guarded and never exposed to anyone.

To authenticate using SSH keys, a user must have:

* An SSH key pair on their local computer.
* On the remote server, a copy of the public key in a file within the user’s home directory at `~/.ssh/authorized_keys`.

----

```
Host bsc
    User username
    HostName mn1.bsc.es
    IdentityFile ~/.ssh/id_rsa

Host bsc0
    User username
    HostName mn0.bsc.es
    IdentityFile ~/.ssh/id_rsa

Host bscdt
    User username
    HostName dt01.bsc.es
    IdentityFile ~/.ssh/id_rsa
```

---

## Good enough practices in scientific computing

* **Data management**: saving both raw and intermediate forms, documenting all steps, creating tidy data amenable to analysis.
* **Software**: writing, organizing, and sharing scripts and programs used in an analysis.
* **Collaboration**: making it easy for existing and new collaborators to understand and contribute to a project.
* **Project organization**: organizing the digital artifacts of a project to ease discovery and understanding.
* **Tracking changes**: recording how various components of your project change over time.
* **Manuscripts**: writing manuscripts in a way that leaves an audit trail and minimizes manual merging of conflicts.

> Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal. PLOS Comput Biol (2017)

---

## Project layout

A good starting point is to have an organized project.

Use a project templating tool, for example Cookiecutter.
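
To make the idea of a project skeleton concrete, here is a minimal sketch that builds one with the Python standard library only. It does not use Cookiecutter or its templates; the folder names in `LAYOUT` and the function name `create_project` are illustrative assumptions, so adapt them to the conventions of your group.

```python
from pathlib import Path

# Hypothetical layout: these folder names are assumptions for illustration.
LAYOUT = [
    "data/raw",        # original data, never edited by hand
    "data/processed",  # outputs of cleaning and transformation steps
    "notebooks",       # exploratory analysis
    "src",             # reusable scripts and modules
    "results",         # figures and tables generated by the code
]


def create_project(root: Path) -> list[Path]:
    """Create the skeleton under root and return the directories created."""
    created = []
    for relative in LAYOUT:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (directory / ".gitkeep").touch()
        created.append(directory)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(
            f"# {root.name}\n\n"
            "Describe the project, its data sources and how to run it.\n"
        )
    return created
```

Running `create_project(Path("my-analysis"))` (the project name is a placeholder) gives every new project the same shape, which is the main benefit a templating tool provides, along with extras such as prompts and reusable templates.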

---

## Reproducibility and beyond

![image](https://hackmd.io/_uploads/BkK7CHR9p.png)

Fig. 1 The Turing Way project illustration by Scriberia. DOI:10.5281/zenodo.3332807

---

Reproducibility is like brushing your teeth. Once you learn it, it becomes a habit :smile:

---

### References

* Baker, M. (2016) 1,500 scientists lift the lid on reproducibility. Nature, 533, 452–454.
* Heise, V. et al. (2023) Ten simple rules for implementing open and reproducible research practices after attending a training course. PLOS Computational Biology, 19, e1010750.
* Ioannidis, J.P.A. (2005) Why Most Published Research Findings Are False. PLOS Medicine, 2, e124.
* Miyakawa, T. (2020) No raw data, no science: another possible source of the reproducibility crisis. Molecular Brain, 13, 24.
* Taylor, S.J.E. et al. (2018) Crisis, what crisis – does reproducibility in modeling & simulation really matter? 2018 Winter Simulation Conference (WSC).
* Wilkinson, M.D. et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018.

### Keep learning :pray:

---

## Do we still have time?

----

## Let's go to the sandbox and learn some tricks

---

### Resources

* https://the-turing-way.netlify.app/
* https://software-carpentry.org/lessons/index.html
* https://datacarpentry.org/
* https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html