---
title: (2025) Good practices for reproducible data science
author:
- Miguel Ponce de León
- Patricio Reyes
date: 2025-02-24
slideOptions:
  transition: slide
abstract: >
  'Today, computers have become essential for data management. One of the main reasons for this is that all scientific disciplines share the common need to deal with large volumes of data: to organise, pre-process, analyze and, finally, convert a collection of raw datasets into pieces of knowledge or data-supported actions. Due to the central role data plays in research and industrial applications, it is also critical to follow guidelines and use standard analysis workflows that generate **reproducible results**. Although the concept of reproducibility itself is at the heart of data analysis and scientific research (think of a researcher carrying out an experiment in a chemistry lab), the majority of data practitioners, students and researchers have no formal training in reproducible scientific computing. In most cases, data scientists and researchers acquire their technical skills by doing, and they also learn the lessons about good practices the hard way, i.e. by making a lot of mistakes and spending a lot of infernal hours trying to find out why things no longer work, or why the surprising result we found yesterday cannot be replicated when we want to show it to a colleague (or an advisor!). In this short tutorial, we will present a collection of good practices for reproducible data science, from our own experiences as well as from the experiences shared by colleagues. These practices can (or should) be adopted by any data scientist or researcher, regardless of their current level of computational skills.'
---

## Relevant Topics (1)

* How to organize a data science project
* Project Templating
* Tools: Cookiecutter / Lazybones
* Software carpentry
* Version control (Git)
* Random number generation (seeds)
* Notebooks vs scripts

----

## Relevant Topics (2)

* Collaborative working
* Fancy tools or the good old bash command line
* The power of bash filters and AWK

---

### Reproducible Research: General (1)

* Overview
* Open Research
* Version Control
* Licensing
* Research Data Management
* **Reproducible Environments**
* Document!!

----

### Reproducible Research: Coding (2)

* Code quality
* Code Testing
* Code Reviewing Process
* Continuous Integration and Delivery (CI/CD)
* Reusable Code
* BinderHub (allows users to share reproducible interactive computing environments)

----

### Reproducible Research: Research (3)

* Project Design
* Case Studies
* Risk Assessment
* Communication
* Collaboration
* Ethical Research

---

# Good practices for reproducible data science

* Hands-on material: https://gitlab.bsc.es/patc/gprds

---

## The 21st century could be considered the century of complexity

**Complex problems are found in many different domains**

* Climate change
* Molecular biology
* Public Health
* Urban/Social Science
* ...

Data is fundamental in all of them! Reproducibility is also critical!

---

## Why are we talking about reproducible research in the first place?

Isn't reproducibility the cornerstone of scientific knowledge? What are we talking about?

> Replication is one of the central issues in any empirical science. To confirm results or hypotheses by a repetition procedure is at the basis of any scientific conception.
>
> A replication experiment to demonstrate that the same findings can be obtained in any other place by any other researcher is conceived as an operationalization of objectivity.
>
> It is the proof that the experiment reflects knowledge that can be separated from the specific circumstances (such as time, place, or persons) under which it was gained.

<div style="text-align: left; font-size: 20px">Source: <a href="https://en.wikipedia.org/wiki/Replication_crisis">https://en.wikipedia.org/wiki/Replication_crisis</a></div>

---

## Have you heard about the reproducibility crisis?

---

![image](https://hackmd.io/_uploads/SyE7sZR5a.png)

<div style="text-align: left; font-size: 18px">Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124</div>

<br>

**"... There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims..."**

Oops... Houston, we have a (serious) problem...

---

### 1500 scientists lift the lid on reproducibility.

<img src="https://hackmd.io/_uploads/rJg1XFT5T.png" width="500"/>

<br>

<p style="font-size: 24px"><b>Reference:</b> Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a</p>

----

### A crisis in numbers

![image](https://hackmd.io/_uploads/HkIa8taqT.png)

<p style="text-align: left; font-size: 26px"><b>Reference:</b> Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a</p>

----

### A crisis in numbers

![image](https://hackmd.io/_uploads/By64wYpqT.png)

<p style="text-align: left; font-size: 26px"><b>Reference:</b> Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a</p>

----

### A crisis in numbers

![image](https://hackmd.io/_uploads/BkGtPY6q6.png)

<p style="text-align: left; font-size: 26px"><b>Reference:</b> Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). https://doi.org/10.1038/533452a</p>

---

Is science really facing a reproducibility crisis? (Not everybody agrees...)

![image](https://hackmd.io/_uploads/rJ6G42RhJl.png)

<p style="font-size: 24px"><b>Reference:</b> Fanelli, D. (2018).
Is science really facing a reproducibility crisis, and do we need it to? PNAS, 115(11), 2628-2631. https://doi.org/10.1073/pnas.1708272114</p>

> Scientific misconduct and questionable research practices (QRP) occur at frequencies that, while nonnegligible, are relatively small and therefore unlikely to have a major impact on the literature.

----

According to Fanelli:

* Scientific misconduct and questionable research practices (QRP) occur at frequencies that, while nonnegligible, are relatively small and therefore unlikely to have a major impact on the literature.
* Contemporary science could be more accurately portrayed as facing “new opportunities and challenges” or even a “revolution”.

---

But still, I've often found it really hard to reproduce other researchers' work, especially when implementing mathematical models or reproducing machine learning projects.

So where is the problem?

----

## Behavioural components of the reproducibility crisis

### The four horsemen of the reproducibility apocalypse

* HARKing (hypothesizing after the results are known)
* Publication bias
* Low statistical power
* p-hacking

----

## Reproducibility in model sharing

* Sharing the equations is a must, but sharing the code is also critical!
* Use community-specific model repositories (e.g. BioModels) or general ones (e.g.
  Zenodo)
* For those using agent-based models, check the ODD protocols:
  * https://github.com/OpenDataDynamics/ODD-protocols

----

## Reproducibility in machine learning: Leakage

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.

----

#### No raw data, no science: another possible source of the reproducibility crisis

![image](https://hackmd.io/_uploads/SJkEuFpqT.png)

<p style="text-align: left; font-size: 16px"><b>Reference:</b> Miyakawa, T. No raw data, no science: another possible source of the reproducibility crisis. Mol Brain 13, 24 (2020). https://doi.org/10.1186/s13041-020-0552-2</p>

---

*"... Good data management is not a goal in itself, but rather is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process ..."*

---

## The FAIR Principles

#### Findability, Accessibility, Interoperability, and Reuse of digital assets.

https://www.go-fair.org/fair-principles/

---

![image](https://hackmd.io/_uploads/rkh43F1iT.png)

<p style="text-align: left; font-size: 26px"><b>Reference:</b> Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18</p>

---

Terms and Abbreviations

<div style="text-align: left; font-size: 26px">

* **DOI**: Digital Object Identifier; a code used to permanently and stably identify (usually digital) objects. DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.
* **Interoperability**: the ability of data or tools from non-cooperating resources to integrate or work together with minimal effort.
* **FAIR**: Findable, Accessible, Interoperable, Reusable.
* **Provenance**: the lineage of data, the processes that act on data, and the agents that are responsible for those processes.
* **Data lineage**: includes the data origin, what happens to it, and where it moves over time. Data lineage provides visibility and simplifies tracing errors back to the root cause in a data analytics process.

</div>

<p style="text-align: left; font-size: 22px"><b>Reference:</b> Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18</p>

---

![image](https://hackmd.io/_uploads/rJvPgF6qT.png)

<p style="text-align: left; font-size: 26px"><b>Reference:</b> Heise et al. (2023) Ten simple rules for implementing open and reproducible research practices after attending a training course</p>

---

### Phases for implementing practices after a robust research course.

![image](https://hackmd.io/_uploads/Bkxdbz09p.png)

---

## Provenance is important!

#### Especially in large and complex workflows where raw data is manipulated (*)

<p style="text-align: left; font-size: 38px"><b>Good practices:</b></p>

<div style="text-align: left; font-size: 36px">
<ol>
<li>Store the original data and the source from which it was retrieved (pointer, reference, URL, DOI, etc.).</li>
<li>Persist the code used to process it.</li>
<li>Make your work reproducible using a workflow management system such as Snakemake or Nextflow, or bash scripts.</li>
<li>Publish your code and data in <b>Zenodo</b> or similar public repositories.</li>
</ol>
</div>

<br><br>

<div style="text-align: left; font-size: 30px">(*) Filtering, aggregation, imputation, etc.</div>

---

## What is Zenodo?
<div style="text-align: left; font-size: 32px">Zenodo is a <span style="color: red; font-weight: bold">general-purpose open repository</span> developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit research papers, data sets, research software, reports, and any other research related digital artefacts. For each submission, a <span style="color: red; font-weight: bold">persistent digital object identifier (DOI)</span> is minted, which makes the <span style="color: red; font-weight: bold">stored items easily citeable</span>.</div>

---

## Why use Zenodo?

<div style="text-align: left; font-size: 24px">
<ul>
<li><strong>Safe</strong>: your research is stored safely for the future in CERN’s Data Centre for as long as CERN exists.</li>
<li><strong>Trusted</strong>: built and operated by CERN and OpenAIRE to ensure that everyone can join in Open Science.</li>
<li><strong>Citeable</strong>: every upload is assigned a Digital Object Identifier (DOI), to make it citable and trackable.</li>
<li><strong>No waiting time</strong>: uploads are made available online as soon as you hit publish, and your DOI is registered within seconds.</li>
<li><strong>Open or closed</strong>: share e.g. anonymized clinical trial data with only medical professionals via the restricted access mode.</li>
<li><strong>Versioning</strong>: easily update your dataset with the versioning feature.</li>
<li><strong>GitHub integration</strong>: easily preserve your GitHub repository in Zenodo.</li>
<li><strong>Usage statistics</strong>: all uploads display standards-compliant usage statistics.</li>
</ul>
</div>

---

## Let's pick an example

![image](https://hackmd.io/_uploads/S1PdVURcp.png)

- https://www.nature.com/articles/s41597-021-01093-5

---

Let's have a look

https://zenodo.org/communities/flow-maps

https://github.com/bsc-flowmaps

---

Hands-on: link your data to Zenodo

We will:

1. Create a git repository
2. Make some changes and push them
3. Create a release
4. Connect our release to Zenodo

---

### Reproducible Research Version Control

1. Today's task: https://datacarpentry.org/rr-version-control/02-git-in-github/index.html
2. For Git novices: https://swcarpentry.github.io/git-novice/
3. Creating a Zenodo entry from a GitHub repository: https://zenodo.org/account/settings/github/

----

<div style="text-align: left; font-size: 24px">Clients generally authenticate either using passwords (less secure and not recommended) or SSH keys, which are very secure. Password logins are encrypted and are easy to understand for new users. Two ways to access:</div>

- Password
- RSA keys (recommended)

![image](https://hackmd.io/_uploads/HkPSL805T.png)

----

### Generating and Working with SSH Keys

<div style="text-align: left; font-size: 28px">
<b>SSH keys</b>: a matching set of cryptographic keys which can be used for authentication. Each set contains a public and a private key.
<ol>
<li>The public key can be shared freely without concern.</li>
<li>The private key must be vigilantly guarded and never exposed to anyone.</li>
</ol>
To authenticate using SSH keys, a user must have:
<ol>
<li>An SSH key pair on their local computer.</li>
<li>The public key copied to the remote server, in the file ~/.ssh/authorized_keys within the user’s home directory.</li>
</ol>
</div>

----

```
# SSH client configuration (~/.ssh/config); "username" is a placeholder
Host bsc
    User username
    HostName mn1.bsc.es
    IdentityFile ~/.ssh/id_rsa

Host bsc0
    User username
    HostName mn0.bsc.es
    IdentityFile ~/.ssh/id_rsa

Host bscdt
    User username
    HostName dt01.bsc.es
    IdentityFile ~/.ssh/id_rsa
```

---

## Good enough practices in scientific computing

<div style="text-align: left; font-size: 24px">
<ol>
<li>Data management: saving both raw and intermediate forms, documenting all steps, creating tidy data amenable to analysis.</li>
<li>Software: writing, organizing, and sharing scripts and programs used in an analysis.</li>
<li>Collaboration: making it easy for existing and new collaborators to
understand and contribute to a project.</li>
<li>Project organization: organizing the digital artifacts of a project to ease discovery and understanding.</li>
<li>Tracking changes: recording how various components of your project change over time.</li>
<li>Manuscripts: writing manuscripts in a way that leaves an audit trail and minimizes manual merging of conflicts.</li>
</ol>
</div>

> Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal. PLOS Comput Biol (2017)

---

## Project layout

<div style="text-align: left; font-size: 32px">
<p>A good starting point would be to have an organized project <img src="https://hackmd.io/_uploads/Hkte_IA9p.png"></p>
<p>Use a tool for project templating: <a href="https://cookiecutter.readthedocs.io/en/stable/">Check cookiecutter</a></p>
</div>

---

## Reproducibility and beyond

![image](https://hackmd.io/_uploads/BkK7CHR9p.png)

<div style="text-align: left; font-size: 26px">Fig. 1 The Turing Way project illustration by Scriberia. DOI:10.5281/zenodo.3332807</div>

---

Reproducibility is like brushing your teeth. Once you learn it, it becomes a habit :smile:

---

### References

<div style="text-align: left; font-size: 26px">
<ul>
<li>Baker, M. (2016) 1,500 scientists lift the lid on reproducibility. Nature, 533, 452–454.</li>
<li>Heise, V. et al. (2023) Ten simple rules for implementing open and reproducible research practices after attending a training course. PLOS Computational Biology, 19, e1010750.</li>
<li>Ioannidis, J.P.A. (2005) Why Most Published Research Findings Are False. PLOS Medicine, 2, e124.</li>
<li>Miyakawa, T. (2020) No raw data, no science: another possible source of the reproducibility crisis. Molecular Brain, 13, 24.</li>
<li>Taylor, S.J.E. et al. (2018) CRISIS, WHAT CRISIS – DOES REPRODUCIBILITY IN MODELING & SIMULATION REALLY MATTER? 2018 (WSC).</li>
<li>Wilkinson, M.D. et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship.
Sci Data, 3, 160018.</li>
</ul>
</div>

### Keep learning :pray:

---

## Do we still have time?

----

## Let's go to the sandbox and learn some tricks

---

### Resources

<div style="text-align: left; font-size: 28px">
<ul>
<li>https://the-turing-way.netlify.app/</li>
<li>https://software-carpentry.org/lessons/index.html</li>
<li>https://datacarpentry.org/</li>
<li>https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html</li>
</ul>
</div>

<br>
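
---

### Sandbox trick: recording data provenance

The provenance slide asks us to store the original data together with the source it was retrieved from. The sketch below shows one way this could look in Python. It is not taken from the hands-on material: the helper names `sha256_of`, `record_provenance` and `verify`, the manifest fields, and the `example.org` URL are all hypothetical choices made for illustration. The idea is to write a small JSON manifest next to each raw file with its source and SHA-256 checksum, so that anyone can later check that the file has not changed.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(data_file: Path, source: str) -> Path:
    """Write `<data_file>.provenance.json` describing where the file came from.

    Hypothetical helper: the manifest fields are an illustrative choice,
    not a standard format.
    """
    manifest = {
        "file": data_file.name,
        "source": source,  # pointer, reference, URL, DOI, ...
        "sha256": sha256_of(data_file),
        "size_bytes": data_file.stat().st_size,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    manifest_path = data_file.with_name(data_file.name + ".provenance.json")
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def verify(data_file: Path, manifest_path: Path) -> bool:
    """Check that the file still matches the checksum stored in its manifest."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return manifest["sha256"] == sha256_of(data_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.csv"
        raw.write_text("id,value\n1,3.14\n", encoding="utf-8")
        # Placeholder URL: replace with the real origin of your data.
        manifest = record_provenance(raw, "https://example.org/raw.csv")
        print("unchanged:", verify(raw, manifest))
```

Committing the manifest to Git alongside the processing code keeps the pointer to the raw data under version control even when the data itself is too large to commit.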
