How to make your code reusable

Work in progress by enrico.glerean@aalto.fi. Looking for contributors/co-authors. All text under CC-BY 4.0.
Video presentation about this document.

reusable

/riːˈjuːzəb(ə)l/
adjective

able to be used again or more than once.

"reusable shopping bags"

Code reusability is a requirement for reproducible science

Science that cannot be verified is not science. How about code?

Fragile code that cannot be re-run is not going to help advance science.

Reproducible science according to the Guide for Reproducible Research

The reproducibility matrix, from The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. http://doi.org/10.5281/zenodo.3233986

What does 'reusable code' mean?

Inspired by the four categories of the reproducibility matrix, I try to build a taxonomy for reusable code. Reusable code can mean different things:

  1. Reproducible code
    You (or somebody else) can re-use the code to reproduce the same exact thing you did
  2. Replicable & Robust code
    You (or others) can re-use the code to do the thing you did with different data/different parameters
  3. Generalisable code
    Others (or your future self) can build on your code to extend it and improve it

Reusable code becomes sustainable, and it greatly contributes to the community of its users and beyond.

What are the steps to make your code reusable?

TL;DR

  1. REPRODUCIBLE CODE - Make sure you (or somebody else) can re-use your code to do the same exact thing you did
    1.1. Make sure you can find it (in space)
    1.2. Make sure you can find it (in time)
    1.3. Make sure you can recreate the environment where it lived at a specific time
    1.4. Make sure you can execute the same sequence of operations
    1.5. Make sure your environment and sequence of operations are robust and no human is needed to replicate what was done
  2. REPLICABLE & ROBUST CODE - Make sure you (or others) can re-use the code to do the thing you did, but with different data/different parameters
    2.1. Remove hardcoded bits and make the code modular
    2.2. Test that the modules you made can take different types of input data or parameters
    2.3. Turn the modules into a package/toolbox
  3. GENERALISABLE CODE - Make sure others can build on your code to extend it and improve it
    3.1. License your code and get citations
    3.2. Make sure your code is readable by humans
    3.3. Make sure comments are present
    3.4. Write useful documentation

1. REPRODUCIBLE CODE - Make sure you (or somebody else) can re-use your code to do the same exact thing you did

At this stage, you might not even need to be able to open the code and read it, you just want to make sure you can re-run all the needed steps and obtain the same results you had.

1.1. Make sure you can find it (in space)

You can't use a car if you don't remember where you parked it. Likewise, relying on memory to recall which folder on your computer holds the latest version of your code is not going to help you in 18 months, when you need to go back to it and reproduce your results.

Furthermore, others will also want to find your code: you might win the lottery tomorrow, disappear and move to a desert island, but your colleagues would still like to reproduce your results.

Your code must be stored publicly and shared with collaborators. It should have a unique persistent identifier, so that everyone can find it and access it. (Ever heard of the FAIR principles?)

Solutions and references: github.com, gitlab.com, zenodo.org, CodeRefinery lesson - Version Control with Git

1.2. Make sure you can find it (in time)

Code is dynamic. A piece of software changes over time. Ideally the code has a version number and a date. Ideally the temporal evolution of the code is also documented with version control (e.g. with git), i.e. as temporal snapshots of the state of the code. This allows you to retrieve a specific version from the past.
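
One way to tie a result to a snapshot in time is to store the commit hash next to every output. The following is a minimal sketch, assuming the code lives in a git repository and the `git` command is installed; the function names `current_commit` and `provenance_stamp` are made up for illustration.

```python
import subprocess
from datetime import datetime, timezone


def current_commit(repo_dir="."):
    """Return the hash of the checked-out commit, or "unknown" if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not a repository with at least one commit
        return "unknown"
    return result.stdout.strip()


def provenance_stamp(repo_dir="."):
    """Build a small record to save alongside each result file."""
    return {
        "commit": current_commit(repo_dir),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
```

Saving `provenance_stamp()` together with a figure or table makes it possible to check out the same commit later and re-run the code that produced it.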

Solutions and references: github.com, gitlab.com, zenodo.org, CodeRefinery lesson - Version Control with Git

1.3. Make sure you can recreate the environment where it lived at a specific time

The environment is a fragile snapshot in time which silently accompanies the code. It can include the human who operated the software, the steps the human took to prepare the data, the hardware, the OS, the libraries, and external packages/toolboxes/dependencies. All this can be carefully documented for another human to re-do all the same exact steps.
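
Part of this documentation can be generated rather than typed by hand. As a rough sketch using only the Python standard library, the function below (the name `describe_environment` is an assumption, not an existing tool) records the interpreter, the operating system, and the installed package versions. It does not capture hardware or manual data preparation steps, which still need to be written down.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Collect the interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


# The record is plain JSON, so it can be committed next to the code.
environment_report = json.dumps(describe_environment(), indent=2)
```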

Solutions and references: README file, list of dependencies and 3rd party libraries, CodeRefinery lesson on recording dependencies

1.4. Make sure you can execute the same sequence of operations

In academia, the human who set up the environment is often also the one who wrote the code and the one who knows the exact order of steps needed to re-run the code and reproduce the results. This could surely be carefully documented for another human to re-do it, but humans tend to be unreliable and bad at documenting (see next point).

Solutions and references: README file, CodeRefinery lesson on Reproducible Research

1.5. Make sure your environment and sequence of operations are robust and no human is needed

You do not want to depend on humans. They tend to make errors even if they do not have bad intentions. So you want your environment to be scripted, so that it can be re-created when needed, and you want your sequence of operations to be run by a pipeline script that glues all the steps together.
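
A pipeline script can be as small as one function that calls every step in the right order. The sketch below is illustrative only: the three steps and their toy data are hypothetical stand-ins for real loading, cleaning, and analysis code, and a workflow tool would add caching and dependency tracking on top of this idea.

```python
def load_data():
    """Hypothetical step 1: stand-in for reading the raw measurements."""
    return [3.0, 4.5, None, 6.0]


def clean(values):
    """Hypothetical step 2: drop missing measurements."""
    return [value for value in values if value is not None]


def analyse(values):
    """Hypothetical step 3: compute the mean of the cleaned values."""
    return sum(values) / len(values)


def run_pipeline():
    """Run every step in the one order that produced the results."""
    raw = load_data()
    cleaned = clean(raw)
    return analyse(cleaned)


if __name__ == "__main__":
    print(run_pipeline())
```

With a single entry point like `run_pipeline`, nobody has to remember which script to run first.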

Solutions and references: Docker and Singularity, pipeline scripts, DAG script (e.g. Snakemake)

2. REPLICABLE & ROBUST CODE - Make sure you (or others) can re-use it to do the thing you did, but with different data/different parameters

2.1. Remove hardcoded bits and make the code modular

You do not want to have details specific to your data or analysis parameters hardcoded into the code. If something can become a reusable function, separate it from the hardcoded parameters and turn it into something (re)usable on its own. Make the modules pure: given the same input, a pure function always returns the same value.
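
As a small illustration, the commented-out function below buries a file path and a threshold inside its body, while the two functions after it take them as arguments. The path, the column name, and the threshold are all invented for this sketch.

```python
import csv
import io

# Before (hardcoded): the path and the threshold cannot be changed by the caller.
# def count_outliers():
#     rows = csv.DictReader(open("/home/me/project/data_final_v2.csv"))
#     return sum(1 for row in rows if float(row["value"]) > 42.0)


def read_column(lines, column):
    """Parse one numeric column from CSV text (an open file or any iterable of lines)."""
    return [float(row[column]) for row in csv.DictReader(lines)]


def count_above(values, threshold):
    """Pure function: the result depends only on the arguments."""
    return sum(1 for value in values if value > threshold)


# Usage with in-memory data instead of a file on one particular computer.
sample = io.StringIO("value\n10\n50\n42.5\n")
values = read_column(sample, "value")
print(count_above(values, threshold=42.0))  # prints 2
```

Because `count_above` does not read files or global state, it can be reused with any list of numbers and is straightforward to test.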

Solutions and references: CodeRefinery Modular Code Development lesson

2.2. Test that the modules you made can take different types of input data or parameters

You might not know yet how your code will be re-used in the future, but you can prevent misuse by checking which inputs and parameters are allowed, and by testing that those checks work.
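
A minimal sketch of this idea: a function that rejects inputs it cannot handle, plus two tests that pin down both the accepted and the rejected cases. The function `normalise` is a made-up example, and the tests use plain `assert` and `try`/`except` so they run without a test framework; a test runner such as pytest would collect the `test_` functions automatically.

```python
def normalise(values, lower=0.0, upper=1.0):
    """Rescale values linearly so that they span [lower, upper]."""
    if not values:
        raise ValueError("values must not be empty")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    smallest, largest = min(values), max(values)
    if smallest == largest:
        raise ValueError("values must not all be identical")
    scale = (upper - lower) / (largest - smallest)
    return [lower + (value - smallest) * scale for value in values]


def test_normalise_spans_requested_range():
    assert normalise([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalise_rejects_bad_input():
    # An empty list and a constant list are both outside the allowed inputs.
    for bad_values in ([], [5, 5]):
        try:
            normalise(bad_values)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad_values!r}")
```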

Solutions and references: CodeRefinery lesson on Automated testing

2.3. Turn the modules into a package/toolbox

Separate the specifics of your project even further from the bits that can be reused in your other projects or by other people.

Solutions and references: Packaging software, Software packaging in Python

3. GENERALISABLE CODE - Make sure others can build on your code to extend it and improve it

3.1. License your code and get citations

Make sure you attach a license to your code and specify how you want to be cited when people re-use it.

Solutions and references: CodeRefinery lesson on Social Coding, 4 Simple recommendations for Open Source Software

3.2. Make sure your code is readable by humans

Nowadays, we rarely need to write code optimised for machines. It pays more to write code optimised for other humans so they can read it (including your future self). A cryptic one-liner with obscure variable names is rarely any faster or more efficient than the same logic split into multiple steps with readable variable names that make sense. Furthermore, using coding conventions will help other readers.
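
To make the contrast concrete, here is the same logic written twice. The data and all names are invented for this sketch; the point is only that both versions return the same result, while the second one explains itself.

```python
# Cryptic version: correct, but the reader has to reverse-engineer the intent.
def f(d, t):
    return [k for k, v in d.items() if sum(v) / len(v) > t]


# Readable version: same behaviour, and each step has a name.
def participants_above_threshold(scores_by_participant, threshold):
    """Return the participants whose mean score exceeds the threshold."""
    selected = []
    for participant, scores in scores_by_participant.items():
        mean_score = sum(scores) / len(scores)
        if mean_score > threshold:
            selected.append(participant)
    return selected


scores = {"p01": [0.9, 0.7], "p02": [0.2, 0.4], "p03": [0.8, 1.0]}
assert f(scores, 0.5) == participants_above_threshold(scores, 0.5) == ["p01", "p03"]
```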

Solutions and references: CodeRefinery lesson on Modular Code Development, Python PEP 8, Clang

3.3. Make sure comments are present

Write comments before writing the actual code. Imagine that somebody could just read the comments and skip all the code bits between comments and get a full picture of what is going on as if they read the whole code.
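
As a sketch of what this can look like, the numbered comments in the function below were written first as an outline, and the code was then filled in underneath each one. The function, the reaction-time data, and the 100 ms cut-off are hypothetical choices made for the example.

```python
def summarise_reaction_times(times_ms):
    # 1. Discard implausibly fast responses (below 100 ms, an assumed cut-off).
    plausible = [time for time in times_ms if time >= 100]
    if not plausible:
        raise ValueError("no plausible reaction times left")

    # 2. Sort so that the median can be read off the middle of the list.
    ordered = sorted(plausible)

    # 3. Return the median, averaging the two middle values when the count is even.
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2
```

Reading only the three comments is enough to know what the function does and in which order.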

Solutions and references: Writing comments before writing code

3.4. Write useful documentation

Document the signature of your functions: what the inputs are, what the outputs are, and how to call them, with useful examples. Get inspired by the documentation of large OSS projects.
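
Below is one possible shape for such documentation, as a sketch rather than a prescribed standard. The function `zscore` is made up for the example, the section layout follows the NumPy docstring style (an assumption, since other styles work equally well), and the example in the docstring can be checked automatically with the standard `doctest` module.

```python
import math


def zscore(values):
    """Standardise values to zero mean and unit (population) standard deviation.

    Parameters
    ----------
    values : list of float
        At least two numbers that are not all identical.

    Returns
    -------
    list of float
        One z-score per input value, in the same order.

    Raises
    ------
    ValueError
        If the list is empty or all values are identical.

    Examples
    --------
    >>> zscore([1.0, 3.0])
    [-1.0, 1.0]
    """
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    if variance == 0:
        raise ValueError("values must not all be identical")
    std = math.sqrt(variance)
    return [(value - mean) / std for value in values]


if __name__ == "__main__":
    import doctest

    doctest.testmod()  # re-runs the example in the docstring
```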

Solutions and references: CodeRefinery lesson on documentation


Useful resources