enrico.glerean@aalto.fi
Looking for contributors/co-authors. All text under CC-BY 4.0. Video presentation about this document.
reusable
/riːˈjuːzəb(ə)l/
adjective: able to be used again or more than once.
"reusable shopping bags"
Science that cannot be verified is not science. How about code?
Fragile code that cannot be re-run is not going to help science advance.
The reproducibility matrix, from The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. http://doi.org/10.5281/zenodo.3233986
Inspired by the four categories of the reproducibility matrix, I try to build a taxonomy for reusable code. Reusable code can mean different things.
Reusable code becomes sustainable, and it greatly contributes to the community of its users and beyond.
At this stage, you might not even need to be able to open the code and read it, you just want to make sure you can re-run all the needed steps and obtain the same results you had.
You can't use a car if you don't remember where you parked it. Relying on your memory of which folder on your computer holds the latest version of your code is not going to help you in 18 months, when you need to go back to it and reproduce your results.
Furthermore, others will also want to find your code: you might win the lottery tomorrow, disappear and move to a desert island, but your colleagues would still like to reproduce your results.
Your code must be stored publicly and shared with collaborators. It has a unique persistent identifier, so that everyone can find it and access it. (Ever heard of the FAIR principles?)
Solutions and references: github.com, gitlab.com, zenodo.org, CodeRefinery lesson - Version Control with Git
Code is dynamic. A piece of software will change in time. Ideally the code has a version number and a date. Ideally the temporal evolution of the code is documented with version control (e.g. with git), i.e. temporal snapshots of the state of the code. This allows you to retrieve a specific version from the past.
Solutions and references: github.com, gitlab.com, zenodo.org, CodeRefinery lesson - Version Control with Git
The environment is a fragile snapshot in time which silently accompanies the code. It can include the human who operated the software, the steps the human took to prepare the data, the hardware, the OS, the libraries, and external packages/toolboxes/dependencies. All this can be carefully documented so that another human can repeat exactly the same steps.
Solutions and references: README file, list of dependencies and 3rd party libraries, CodeRefinery lesson on recording dependencies
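As a minimal sketch of recording the environment, the snippet below collects the Python version, the operating system and the installed packages using only the standard library. The function name describe_environment and the layout of the returned dictionary are assumptions made for illustration; a real project might instead rely on a requirements file or an environment file.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Return a dictionary describing the current software environment.

    Hypothetical helper: the keys used here are an illustrative choice.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installations that lack a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot so it can be saved next to the results.
    print(json.dumps(describe_environment(), indent=2))
```

Saving this output together with each set of results gives a future reader a starting point for rebuilding the same environment.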
In academia, often the human who set up the environment is also the one who wrote the code and the one who knows the exact order of steps needed to be able to re-run the code and reproduce the results. This could surely be carefully documented for another human to re-do it … but humans tend to be unreliable and bad at documenting (see next point).
Solutions and references: Readme file, CodeRefinery lesson on Reproducible Research
You do not want to depend on humans. They tend to make errors even if they do not have bad intentions. So you want your environment to be scripted and re-created when needed, and you want your sequence of operations to be run by a pipeline script that glues together the whole sequence of steps.
Solutions and references: docker and singularity, pipeline scripts, DAG script (e.g. snakemake)
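The idea of a pipeline script can be sketched in a few lines. The three steps below (load_data, clean_data, summarise) and the toy data are hypothetical placeholders; the point is only that a single function, run_pipeline, fixes the order of operations so that no human needs to remember it. Dedicated workflow tools offer much more, such as re-running only the steps whose inputs changed.

```python
def load_data():
    """Step 1: stand-in for reading raw data from disk (toy values)."""
    return [3.0, 1.0, 2.0, 4.0]


def clean_data(values, threshold=2.0):
    """Step 2: keep only the values at or above the threshold."""
    return [value for value in values if value >= threshold]


def summarise(values):
    """Step 3: compute simple summary statistics."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def run_pipeline():
    """Run every step in a fixed, documented order."""
    raw = load_data()
    cleaned = clean_data(raw)
    return summarise(cleaned)


if __name__ == "__main__":
    print(run_pipeline())
```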
You do not want to have details specific to your data or analysis parameters hardcoded into the code. If something can become a reusable function, separate it from the hardcoded parameters and turn it into something (re)usable on its own. Make the modules pure: given the same input, a pure function always returns the same value.
Solutions and references: CodeRefinery Modular Code Development lesson
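A small illustration of separating parameters from a pure function, under the assumption of a toy analysis that rescales measurements. The names rescale, PARAMETERS and measurements are invented for this sketch.

```python
def rescale(values, lower, upper):
    """Linearly rescale values to the interval [lower, upper].

    Pure function: the result depends only on the arguments,
    and the input list is not modified.
    """
    smallest, largest = min(values), max(values)
    if smallest == largest:
        raise ValueError("values must not all be identical")
    span = largest - smallest
    return [lower + (upper - lower) * (value - smallest) / span for value in values]


# Project-specific settings live outside the reusable function.
PARAMETERS = {"lower": 0.0, "upper": 1.0}

measurements = [2.0, 4.0, 6.0]
rescaled = rescale(measurements, **PARAMETERS)
print(rescaled)  # [0.0, 0.5, 1.0]
```

Because the bounds are passed in rather than hardcoded, the same function can be reused in another project with different settings.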
You might not know yet how your code will be re-used in the future, but you can prevent misuse by checking which parameters are allowed and testing that invalid ones are rejected.
Solutions and references: CodeRefinery lesson on Automated testing
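One way to express which parameters are allowed is to check them inside the function and then test that invalid values are rejected. The moving-average function smooth and its window parameter are hypothetical; the test functions follow the naming convention that test runners such as pytest discover, but they are written here so they can also be called directly.

```python
def smooth(values, window):
    """Moving average of values, with explicit checks on the window size."""
    if not isinstance(window, int) or isinstance(window, bool):
        raise TypeError("window must be an integer")
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def test_smooth_accepts_valid_window():
    assert smooth([1.0, 2.0, 3.0, 4.0], window=2) == [1.5, 2.5, 3.5]


def test_smooth_rejects_invalid_window():
    for bad_window in (0, -1, 5):
        try:
            smooth([1.0, 2.0, 3.0, 4.0], window=bad_window)
        except ValueError:
            continue  # the invalid value was rejected, as intended
        raise AssertionError(f"window={bad_window} should have been rejected")


if __name__ == "__main__":
    test_smooth_accepts_valid_window()
    test_smooth_rejects_invalid_window()
    print("all tests passed")
```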
Separate even further the specifics of your project from the parts that can be reused in your other projects or by other people.
Solutions and references: Packaging software, Software packaging in Python
Make sure you attach a license to your code and specify how you want to be cited when people re-use it.
Solutions and references: CodeRefinery lesson on Social Coding, 4 Simple recommendations for Open Source Software
Nowadays, we rarely need to write code optimised for machines. It pays more to write code optimised for other humans so they can read it (including your future self). A cryptic one-liner with obscure variable names is not any faster or more efficient than the same logic split into multiple steps with readable variable names that make sense. Furthermore, using coding conventions will help other readers.
Solutions and references: CodeRefinery lesson on Modular Code Development, Python PEP 8, Clang
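To make the contrast concrete, here is an invented example: the same computation written first as a cryptic one-liner and then in readable steps. Both function names and the task (the mean of the positive values) are assumptions chosen for illustration.

```python
# Cryptic version: short, but the reader has to decode it.
def f(d): return sum(x for x in d if x > 0) / max(1, len([x for x in d if x > 0]))


# Readable version: same result, with names that explain each step.
def mean_of_positive_values(values):
    positive_values = [value for value in values if value > 0]
    if not positive_values:
        return 0.0  # no positive values: define the mean as zero
    return sum(positive_values) / len(positive_values)


sample = [-1.0, 2.0, 4.0]
print(f(sample), mean_of_positive_values(sample))  # 3.0 3.0
```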
Write comments before writing the actual code. Imagine that somebody could just read the comments and skip all the code bits between comments and get a full picture of what is going on as if they read the whole code.
Solutions and references: Writing comments before writing code
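A sketch of what this can look like in practice: the numbered comments below were written first, as a plan, and the code was then filled in underneath each one. The word-counting task and the function name count_words are hypothetical.

```python
from collections import Counter


def count_words(text):
    # 1. Lower-case the text so that "The" and "the" are counted together.
    lowered = text.lower()

    # 2. Split the text into words on whitespace.
    words = lowered.split()

    # 3. Remove punctuation attached to the start or end of each word.
    stripped = [word.strip(".,;:!?\"'") for word in words]

    # 4. Discard anything that became empty after stripping.
    cleaned = [word for word in stripped if word]

    # 5. Count how many times each word occurs.
    return dict(Counter(cleaned))


print(count_words("The cat. The dog!"))  # {'the': 2, 'cat': 1, 'dog': 1}
```

Reading only the five comments is enough to understand what the function does.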
Document the signature of your functions: state what the inputs and outputs are, and add useful examples. Get inspired by the documentation of large OSS projects.
Solutions and references: CodeRefinery lesson on documentation
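As an illustration, a documented function might look like the sketch below. The function celsius_to_kelvin is a made-up example, and the docstring layout follows the style commonly used in scientific Python projects, which is only one of several conventions. The examples in the docstring can be checked automatically with the standard doctest module.

```python
import doctest


def celsius_to_kelvin(temperature_celsius):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    temperature_celsius : float
        Temperature in degrees Celsius; must not be below -273.15.

    Returns
    -------
    float
        The same temperature expressed in kelvin.

    Raises
    ------
    ValueError
        If the temperature is below absolute zero.

    Examples
    --------
    >>> celsius_to_kelvin(0.0)
    273.15
    >>> celsius_to_kelvin(-273.15)
    0.0
    """
    if temperature_celsius < -273.15:
        raise ValueError("temperature is below absolute zero")
    return temperature_celsius + 273.15


if __name__ == "__main__":
    # Run the examples embedded in the docstring as tests.
    doctest.testmod()
```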