Reproducible workflows in R
(A new draft started on 4.5.2023 after the old one in HedgeDoc disappeared.)
Things that can affect reproducibility of R workflows:
- data management
- R scripts
- data storage and accessibility
- R version
- R package versions
- tip: specify the R version when loading the r-env module on Puhti: r-env/432 etc.
- operating system, its version and other underlying parts
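The version-related items in this list can be read from inside a running R session. A minimal sketch using only base R; the name `env_record` is made up for illustration, and the `stats` package is used only because it ships with every R installation:

```r
# Collect the version information that most often differs between runs.
# env_record is a hypothetical name, not an established convention.
sys <- Sys.info()

env_record <- list(
  r_version  = R.version.string,      # human-readable R version string
  platform   = R.version$platform,    # platform R was built for
  os         = sys[["sysname"]],      # operating system name
  os_release = sys[["release"]],      # operating system release
  stats_pkg  = as.character(packageVersion("stats"))  # version of one package
)

# Print a compact overview that can be pasted into a log or a README.
str(env_record)
```

For packages other than `stats`, the same `packageVersion()` call works as long as the package is installed.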
Topics to cover:
- R Markdown and Quarto
- projects in R
- don't save workspace (save objects and scripts instead)
- keep original data untouched - modified data should be stored as a separate copy
- R script reproducibility
- commenting
- file paths
- general readability
- functions for repeating sections of code
- set.seed()
- aim for scripts that can produce the output again at any time (instead of relying on storage of output)
- version control
- renv
- the ultimate tool for reproducibility
- BUT be careful when using on Puhti
- containers
- package versions
- packages on Puhti tied to a specific date
- sessionInfo()
- Posit Public Package Manager snapshots
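Several of the topics in this list (package snapshots, functions for repeated code, set.seed(), file paths, regenerating output, sessionInfo()) can be combined in one stand-alone script. The following is only a sketch under stated assumptions: `simulate_means()`, `out_dir` and the file names are hypothetical, the snapshot date is illustrative, and the snapshot URL assumes the `<base>/cran/<YYYY-MM-DD>` pattern used by Posit Public Package Manager. `tempdir()` is used so the sketch runs anywhere; in a real project the output folder would be a path relative to the project root.

```r
# Point install.packages() at a dated snapshot so later installs resolve to
# the same package versions. Assumption: the date below is only an example.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2023-05-04"))

# Wrap the repeated step in a function instead of copy-pasting it.
# simulate_means() is a hypothetical helper, not part of any package.
simulate_means <- function(n_draws, n_reps) {
  vapply(seq_len(n_reps), function(i) mean(rnorm(n_draws)), numeric(1))
}

# Fix the random number stream so the simulation is repeatable.
set.seed(2023)
first_run <- simulate_means(n_draws = 100, n_reps = 5)

# Resetting the seed and rerunning gives exactly the same numbers.
set.seed(2023)
second_run <- simulate_means(n_draws = 100, n_reps = 5)
runs_match <- identical(first_run, second_run)

# Build paths with file.path() instead of hard-coding absolute paths.
out_dir <- file.path(tempdir(), "results")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# The output is a product of the script and can be regenerated at any time.
write.csv(
  data.frame(rep = seq_along(first_run), mean = first_run),
  file.path(out_dir, "means.csv"),
  row.names = FALSE
)

# Store the session information next to the output it describes.
writeLines(
  capture.output(sessionInfo()),
  file.path(out_dir, "session_info.txt")
)
```

Keeping `session_info.txt` beside the results means the R version and loaded package versions are recorded even if the script is later run in a different environment.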
Structure draft
- R versions on Puhti
- R package versions on Puhti
- minimum information to record for reproducibility
- light-weight tools and tips for R reproducibility
- stand-alone script principle
- version control
- general script tips
- heavy-weight tools for R reproducibility
Useful links