FOR FUTURE LEARNERS & INSTRUCTORS ALIKE: FEEL FREE TO REUSE! Bear in mind this event has just passed, and I (Nadine) will follow up on this doc with some post-processing.
This is a tutorial on best practices in research software engineering using Python as an example programming language:
It is a modified 90-minute mini-version of the Intermediate Research Software Development course from the Carpentries Incubator, so for anyone wanting to delve into more detail or get more practice… looking into that course is probably one of the best options.
Here, you'll get a condensed, hands-on introduction to its core topics.
Some content/sections in this tutorial are meant for you to (optionally) go through before or after the tutorial.
We'll use GitHub CodeSpaces – a cloud-powered development environment that one can configure to one’s liking.
Disclaimer: rather than this being a tutorial about how to do collaborative research software engineering with a particular Python lens, we use Python as a vehicle to convey fairly general RSE principles.
For time reasons, we can only dip into the tutorial's topics - each of which is a little (or big) world of its own.
If this tutorial incentivises you to delve more deeply into anything related to research software engineering, this tutorial will have been a full success!
0. Welcome (5 min)
0.1 Let's introduce ourselves
0.2 Recap & motivation: why collaboration and best research software engineering practices in the first place?
0.3 Difference between "mere" coding and research software engineering
0.4 What you’ll learn
0.5 Whom this tutorial is for
0.6 What you need to have done before the event (for participants to read before the event)
1. Let's start! Introduction into the project & setting up the environment (15 min)
1.1 The project
1.2 GitHub CodeSpaces
1.3 Integrated Development Environments
1.4 Git and GitHub
1.5 Creating virtual environments
2. Ensuring correctness of software at scale (20 min)
2.1 Unit tests
2.2 Scaling up unit tests
2.3 Debugging code & code coverage
2.4 Continuous integration
3. Software design (20 min)
3.1 Programming paradigms
3.2 Object-oriented programming
3.3 Functional programming
4. Writing software with - and for - others: workflows on GitHub, APIs, and code packages (20 min)
4.1 GitHub pull requests
4.2 How users can use the program you write: application programming interfaces
4.3 Producing a code package
4.4 Personal experience with research software, academic and professional
5. Wrap-up (5 min)
6. Further resources
7. License
8. Original course
9. Funding & Acknowledgements
Feel free to write down some or all of the following: name / institution or affiliation / contact details, e.g., mail, Twitter or Mastodon, if you'd like to share those / why you came to this tutorial, or what your motivation is.
Please write down your answers here:
The terms programming (or even coding) and software engineering are often used interchangeably, but they don't mean the same thing. Programmers or coders tend to focus on one part of software development: implementation. Also, in the context of academic research, they often write software just for themselves and are the sole stakeholders.
Someone who is engineering software takes a broader view on code which also considers: who will (re)use it, how it will be maintained and extended, and how its correctness can be verified over its whole lifetime.
Bearing the difference between coding and software engineering in mind, how much do scientists actually need to do either of them? Should they rather code or write software, or do both (and if both, when do they do what)? This is a hard question and will very much depend on a given research project. In Scientific coding and software engineering: what's the difference?, it is argued that "scientists want to explore, engineers want to build". Both too little and too much of an engineering component in writing code can be a hindrance in the research process.
To boil the challenge down, when you start out writing code for your research, you need to ask yourself:
How much do you want to generalise and consider factors in the software lifecycle upfront, to spare work at a later time-point, vs. stay specific and write single-use code, to avoid doing (potentially) large amounts of unnecessary work if you (unexpectedly) abandon paths taken in your research?
While this is a question every coder/software engineer needs to ask themselves, it's a particularly important one for researchers.
It may not be easy to find a sweet spot, but, as a heuristic, you may err on the side of incorporating software engineering into your coding as soon as (1) people other than yourself will use or build on your code, or (2) you yourself will re-use it beyond the immediate analysis at hand.
More often than not, one or both points will apply fairly quickly.
This tutorial equips you with a solid foundation for working on software development in a team, using practices that help you write code of higher quality, and that make it easier to develop and sustain code in the future – both by yourself and others. The topics covered concern core, intermediate skills covering important aspects of the software development life-cycle that will be of most use to anyone working collaboratively on code.
At the start, we'll address how to set up a shared development environment: GitHub Codespaces, the VSCode IDE, Git and GitHub, and virtual environments.
Regarding testing software, you'll learn how to write unit tests, scale them up via parameterisation, debug code and measure code coverage, and automate testing with continuous integration.
Regarding software design, you'll particularly learn about programming paradigms, object-oriented programming, and functional programming.
With respect to working on software with - and for - others, you'll hear about GitHub pull requests, application programming interfaces, and producing a code package.
Some of you will likely have written much more complex code than the code you'll encounter in this tutorial. We nevertheless call the skills taught "intermediate", because for code development in teams, you need more than just the right tools and languages - you need a strategy (best practices) for how you'll use these tools as a team, or at least for potential re-use by people outside your team (a team that may very well consist only of you). Thus, it's less about the complexity of the code as such within a self-contained environment, and more about the complexity that arises from other people either working on it, too, or re-using it for their purposes.
The best way to check whether this tutorial is for you is to browse its contents in this HackMD main document.
This tutorial is targeted at anyone who writes code as part of their research, whatever the amount or programming language.
It is suitable for all career levels - from students to (very) senior researchers for whom writing code is part of their job, and who either are eager to up-skill and learn things anew, or would like a proper refresher and/or new perspectives on research software development.
If you're keen on learning how to restructure existing code such that it is more robust, reusable and maintainable, how to automate the process of testing and verifying software correctness, and how to collaboratively work with others in a way that mimics a typical software development process within a team, then we're looking forward to seeing you!
The only thing you need to do before the event is to create an account on GitHub, if you haven't done so already.
In this tutorial, we will use the Patient Inflammation Study Project which has been set up for educational purposes by the course creators, and is stored on GitHub. The project's purpose is to study the effect of a treatment for arthritis by analysing the inflammation levels in patients who have been given this treatment.
The data: each file in the `data` folder of the repository represents inflammation measurements from one separate clinical trial of the drug.
The project as seen on the repository is not finished and contains some errors. We will work incrementally on the existing code to fix those and add features during the tutorial.
Goal: Write an application for the command line interface (CLI) to easily retrieve patient inflammation data, and display simple statistics such as the daily mean or maximum value (using visualization).
The code:
- `inflammation-analysis.py`, which provides the main entry point in the application - this is the script we'll eventually run in the CLI, and for which we need to provide inputs (such as which data files to use),
- `inflammation`, which contains collections of functions in `views.py` and `models.py`,
- `data`, and `tests`, which contains tests for our functions in `inflammation`,
- a `README` file (describing the project, its usage, installation, authors and how to contribute).

We will use GitHub CodeSpaces. A codespace is a cloud-powered development environment that you can configure to your liking. It can be accessed from the browser, or from a locally installed editor such as VSCode (see below).
GitHub CodeSpaces' superpower is that you can code from any device and get a standardized environment as long as you have internet. This is perfect for our purposes - and maybe for some of yours in the future, too! - as we'll avoid the hassle of installing programs on your machines and copying/cloning GitHub repositories before you're able to code. This spares us unexpected problems that would very likely occur when setting up the environment we need.
Python in particular can be a mess when it comes to dependencies between different components… see this XKCD webcomic for an insightful illustration:
Creative Commons Attribution-NonCommercial 2.5 License
! FOLLOW ALONG IN YOUR CODESPACE ! Let's instantiate a GitHub codespace: click `Use this template`, then choose `Open in a codespace`. Your codespace should load and be ready in a few seconds. Let's inspect what we see…
In the cloud, you'll see that we use VSCode as an Integrated Development Environment (IDE); however, you could also use a codespace via your locally installed VSCode program by adding a GitHub Codespaces extension to it.
- `Explorer`: browse, open, and manage all of the files and folders in your project,
- `Run and Debug`: see all information related to running and debugging,
- `Extensions`: add languages, debuggers, and tools to your installation,
- `Source control` (or version control): track and manage changes to code,
- `Terminal`: an interface in which you can type and execute text-based commands - here, we'll use Bash, which is both a Unix shell and command language.
! FOLLOW ALONG IN YOUR CODESPACE ! Let's install the Python and Jupyter extensions for VSCode created by Microsoft by clicking the extensions icon at the bottom of the sidebar within the IDE, searching for Python (selecting the IntelliSense extension) and Jupyter, and clicking install.
VSCode supports version control using Git: at the lower left corner, we can see which branch - something like a "container" storing a particular version of our code - in our version control system we're currently in (normally, if you didn't change into another branch, it's the one called `main`).
- Stage a new or modified file with `git add filename`,
- after further changes to a tracked file, run the `git add filename` command again to update it in the staging area,
- record the staged changes with the `git commit -m "some message indicating what you commit/had changed"` command. Each commit is a new, permanent "snapshot" (checkpoint, or record) of your project in time which you can share and get back to,
- `git status` allows you to check the current status of your working directory and local repository, e.g., whether there are files which have been changed in the working directory, but not staged for commit (another command you'd probably use very often),
- share your commits by pushing them to the remote repository using `git push origin branch-name`, and, if collaborating with other people, pull their changes using `git pull` or `git fetch` to keep your local repository in sync with others.

The difference between `git fetch` and `git pull` is that the latter copies changes from a remote repository directly into your working directory, while `git fetch` copies changes only into your local Git repo.
Git workflow from PNGWing.
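Putting these commands together, a typical everyday cycle might look like this (a sketch - the file name and branch are just examples):

```bash
$ git status                           # check what has changed
$ git add analysis.py                  # stage a new/modified file
$ git commit -m "Describe the change"  # snapshot the staged changes locally
$ git push origin main                 # share the new commit with the remote
$ git pull                             # fetch + merge collaborators' changes
```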
- There is a main branch (called `main`) which is the version of the code that is fully tested, stable and reliable,
- there is usually a development branch (called `develop` or `dev` by convention) that we use for work-in-progress code. Feature branches get first merged into `develop` after having been thoroughly tested. Once `develop` has been tested with the new features, it will get merged into `main`,
- `git branch develop` creates a new branch called `develop`,
- `git merge branch-name` allows you to merge `branch-name` with the one you're currently in,
- `git branch` tells you which branch you're currently in (something you'd check probably very frequently) as well as gives you a list of which branches exist (the one you're in is denoted by a star symbol),
- `git checkout branch-name` allows you to switch from your current branch into `branch-name`.
Git feature branches, adapted by original course creators from Git Tutorial by sillevl (Creative Commons Attribution 4.0 International License)
venv
In `inflammation/models.py`, we see that we import two external libraries:
from matplotlib import pyplot as plt
import numpy as np
There are several tools for creating and managing virtual environments in Python (here, we'll use `venv`):
- `venv`,
- `virtualenv`,
- `pipenv`,
- `conda`,
- `poetry`.

! FOLLOW ALONG IN YOUR CODESPACE ! Let's create our virtual environment by creating a new folder called `venv`, and instantiating a virtual environment equally called `venv` in the terminal:
$ python3 -m venv venv # creating a new folder called "venv",
# and instantiating a virtual environment
# equally called "venv"
$ source venv/bin/activate # activate virtual environment
(venv) $ which python3 # check whether Python from venv is used
Output:
/workspaces/python-intermediate-inflammation/venv/bin/python3
(venv) $ deactivate # deactivate virtual environment
Our code depends on two external packages (`numpy`, `matplotlib`). We need to install those into the virtual environment to be able to run the code, using a package manager tool such as `pip`:
(venv) $ pip3 install numpy matplotlib
When you are collaborating on a project with a team, you will want to make it easy for them to replicate equivalent virtual environments on their machines. With pip, virtual environments can be exported, saved and shared with others by creating a file called, e.g., `requirements.txt` (you can name it as you like, but it's common practice to label this file as "requirements") in your current directory, and producing a list of packages that have been installed in the virtual environment:
(venv) $ pip3 freeze > requirements.txt # produce list of packages
(venv) $ pip3 list # view packages installed
If someone else is trying to use your library within their own virtual environment, instead of manually installing every dependency, they can just use the command below to install everything specified in the `requirements.txt` file.
(venv) $ pip3 install -r requirements.txt # install packages from
# requirements file
Let's check the status of our repository using `git status`. We get the following output:
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
requirements.txt
venv/
nothing added to commit but untracked files present (use "git add" to track)
While you do not want to commit the newly created directory `venv` and share it with others - as it is specific to your machine and setup only (containing local paths to libraries on your system specifically) - you will want to share `requirements.txt` with your team, as this file can be used to replicate the virtual environment on your collaborators' systems.
To tell Git to ignore and not track certain files and directories, you need to specify them in the `.gitignore` text file in the project root. You can also ignore multiple files at once that match a pattern (e.g., "*.jpg" will ignore all jpeg files in the current directory). Let's add the necessary lines into the `.gitignore` file:
# Virtual environments
venv/
.venv/
Let's make a first commit to our local repository:
$ git add .gitignore requirements.txt
$ git commit -m "Initial commit of requirements.txt. Ignoring virtual env. folder."
Why is testing good?
We'll get into: unit tests and how to scale them up, debugging code & code coverage, and continuous integration.
Let's create a new branch called `test-suite` where we'll write our tests. It is good practice to write tests at the same time as we write new code on a feature branch. But since the code already exists, we're creating a feature branch just for writing tests this time. Generally, it is encouraged to use branches even for small bits of new work.
! FOLLOW ALONG IN YOUR CODESPACE ! Let's generate a new feature branch:
$ git checkout develop
$ git branch test-suite
$ git checkout test-suite
Now let's look at the `daily_mean()` function in `inflammation/models.py`. It calculates the daily mean of inflammation values across all patients. Let's first think about how we could manually test this function.
One way to test whether this function does the right thing is to think about which output we'd expect given a certain input. We can test this manually by creating an input and output variable, and use, e.g., `npt.assert_array_equal()` to check whether the outcome of `daily_mean()` given the input variable matches the output variable.
- To call `daily_mean()`, we need to import it. To import it, we need to instantiate a directory for the codespace,
- we can execute single lines of code (e.g., `import numpy as np`) by selecting them in the editor, right-clicking, and choosing `Run in Interactive Window` and `Run Selection/Line in Interactive Window`.
import os
# get the current working directory
os.getcwd()

import numpy as np
import numpy.testing as npt
from inflammation.models import daily_mean
test_input = np.array([[1, 2], [3, 4], [5, 6]])
test_result = np.array([3, 4])
npt.assert_array_equal(daily_mean(test_input), test_result)
We can think about multiple pairs of expected output given a certain input:
test_input = np.array([[2, 0], [4, 0]])
test_result = np.array([2, 0])
npt.assert_array_equal(daily_mean(test_input), test_result)
test_input = np.array([[0, 0], [0, 0], [0, 0]])
test_result = np.array([0, 0])
npt.assert_array_equal(daily_mean(test_input), test_result)
However, we get a mismatch between input and output for the first test:
...
AssertionError:
Arrays are not equal
Mismatched elements: 1 / 2 (50%)
Max absolute difference: 1.
Max relative difference: 0.5
x: array([3., 0.])
y: array([2, 0])
The reason here is that one of our specified outputs is wrong - which reminds us that tests themselves can be written in a wrong way, so it's good to keep them as simple as possible so as to minimize errors.
We could put these tests in a separate script to automate running them. However, a Python script stops at the first failed assertion, so if we get one for whatever reason, all subsequent tests wouldn't be run at all - this calls for a testing framework such as Pytest, where each test is run independently, and we get a summary of all passes and failures at the end.
Let's look at `tests/test_models.py`, where we see one test function called `test_daily_mean_zeros()`:
def test_daily_mean_zeros():
"""Test that mean function works for an array of zeros."""
from inflammation.models import daily_mean
test_input = np.array([[0, 0],
[0, 0],
[0, 0]])
test_result = np.array([0, 0])
# Need to use NumPy testing functions to compare arrays
    npt.assert_array_equal(daily_mean(test_input), test_result)
Generally, each test function requires
- an input, such as the `test_input` NumPy array,
- the imported `daily_mean()` function so we can use it (we only import the necessary library function we want to test within each test function),
- an assertion - calling `daily_mean()` with our `test_input` array and using `npt.assert_array_equal()` to test its validity,
- and, for `PyTest` to discover it, the letters 'test_' at the beginning of the function name.

! FOLLOW ALONG IN YOUR CODESPACE ! Let's install `PyTest`:
$ pip3 install pytest
We can run `PyTest` in the CLI…
$ python -m pytest tests/test_models.py
… and get the following output:
======================================================== test session starts =========================================================
platform linux -- Python 3.10.8, pytest-7.3.2, pluggy-1.2.0
rootdir: /workspaces/python-intermediate-inflammation
collected 2 items
tests/test_models.py .. [100%]
========================================================= 2 passed in 1.06s ==========================================================
We can also run single test functions in our `test_models.py` file. To do that, we need to configure our testing setup by clicking on the testing icon, choosing `Configure Python Tests`, then `Pytest`, and then the folder the tests are in.
! TASK 1 ! Write a new test case that tests the `daily_max()` function, adding it to `tests/test_models.py`. Also regenerate your `requirements.txt` file, commit your changes, and merge the `test-suite` branch with the `develop` branch. (5-10 min)
- Use the same structure as for `daily_mean()`, defining input and expected output variables followed by the equality assertion,
- run `python -m pytest tests/test_models.py`, and have a look at your new tests pass.

Solutions will be shown in this hackmd.io file. (Don't look into it before you've given it a try yourself.)
We have now used two different testing functions for different kinds of inputs (arrays of zeros vs. positive integers). Writing a separate test function for each case is quite inefficient - that's where test parameterisation comes in handy.
Instead of writing a separate function for each different test, we can parameterise the tests with multiple test inputs, e.g., in `tests/test_models.py`, we can rewrite `test_daily_mean_zeros()` from above and `test_daily_mean_integers()` from the solutions hackmd.io file into a single test function:
@pytest.mark.parametrize(
"test, expected",
[
([ [0, 0], [0, 0], [0, 0] ], [0, 0]),
([ [1, 2], [3, 4], [5, 6] ], [3, 4]),
])
def test_daily_mean(test, expected):
"""Test mean function works for array of zeroes and positive integers."""
from inflammation.models import daily_mean
npt.assert_array_equal(daily_mean(np.array(test)), np.array(expected))
Here, we specify two names - `test` for inputs, and `expected` for outputs - as well as the inputs and outputs themselves that correspond to these names. Each row within the square brackets following the `"test, expected"` arguments corresponds to one test case. Let's look at the first row:
- `[ [0, 0], [0, 0], [0, 0] ]` would be the input, corresponding to the input name `test`,
- `[0, 0]` would be the output, corresponding to the output name `expected`.

The `parametrize()` function is a Python decorator: a Python decorator is a function that takes as an input a function, adds some functionality to it, and then returns it (more about this in the section on functional programming). `parametrize()` is a decorator in that it takes as an input the respective testing function, adds functionality to it by specifying multiple input and expected output test cases, and calls the function over each of these inputs automatically when this test is called.

! TASK 2 ! Rewrite your test functions for `daily_max()` using test parameterisation. (5-10 min)
- Do this on the `test-suite` branch,
- run `python -m pytest tests/test_models.py`, and have a look at your new tests pass,
- merge the `test-suite` branch with the `develop` branch.

Solutions will be shown in this hackmd.io file.
We can find problems in our code conveniently in VSCode using breakpoints (points at which we want code execution to stop) and our testing functions.
! FOLLOW ALONG IN YOUR CODESPACE !
- Let's look at the function `daily_max()`, and set a breakpoint somewhere within that function by left-clicking the space to the left of the line numbers,
- in the Testing tab, right-click on `test_daily_max()`, then choose `Debug Test`,
- we can now use the `DEBUG CONSOLE` to check whether our function does what it's supposed to do, e.g., run `np.max(data, axis=0)` and see whether it's giving the expected output (i.e., `array([0, 0])`).

While Pytest is an indispensable tool to speed up testing, it can't help us decide what to test and how many tests to run.
As a heuristic, we should try to come up with tests that cover the different paths through our code, including typical inputs and edge cases.
This ensures a high degree of code coverage. A Python package called `pytest-cov` that is used by Pytest gives you exactly this - the degree to which you've covered your code with tests.

! FOLLOW ALONG IN YOUR CODESPACE ! Let's install `pytest-cov` and assess code coverage:

$ pip3 install pytest-cov
$ python -m pytest --cov=inflammation.models tests/test_models.py

`--cov` is an additional named argument to specify the code that is to be analysed for test coverage. Output:
================================================= test session starts =================================================
platform linux -- Python 3.10.8, pytest-7.3.2, pluggy-1.2.0
rootdir: /workspaces/python-intermediate-inflammation
plugins: cov-4.1.0
collected 7 items
tests/test_models.py ....... [100%]
---------- coverage: platform linux, python 3.10.8-final-0 -----------
Name Stmts Miss Cover
--------------------------------------------
inflammation/models.py 9 2 78%
--------------------------------------------
TOTAL 9 2 78%
================================================== 7 passed in 1.25s ==================================================
Not all statements in `inflammation.models` are covered yet (78%). To see which ones have not yet been tested, we can use the following line in the terminal:

python -m pytest --cov=inflammation.models --cov-report term-missing tests/test_models.py
Output:
================================================= test session starts =================================================
platform linux -- Python 3.10.8, pytest-7.3.2, pluggy-1.2.0
rootdir: /workspaces/python-intermediate-inflammation
plugins: cov-4.1.0
collected 7 items
tests/test_models.py ....... [100%]
---------- coverage: platform linux, python 3.10.8-final-0 -----------
Name Stmts Miss Cover Missing
------------------------------------------------------
inflammation/models.py 9 2 78% 18, 32
------------------------------------------------------
TOTAL 9 2 78%
================================================== 7 passed in 0.29s ==================================================
$ pip3 freeze > requirements.txt
$ git status
$ git add ./
$ git commit -m "Add coverage support"
$ git checkout develop
$ git merge test-suite
What is Test Driven Development?
In test-driven development, we first write the tests, and then the code, i.e., the thinking process goes from the expected behaviour of the code to its implementation, rather than the other way around.
This way, the set of tests acts like a specification of what the code does. The main advantages are that we never forget to write tests for our code, and that we are forced to think through the expected behaviour before implementing it.
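As a minimal sketch of this workflow - assuming a hypothetical `daily_min()` function that isn't implemented yet - we would first write the (failing) test, and only then the code that makes it pass:

```python
# tests/test_models.py - written FIRST, before the implementation exists
import numpy as np
import numpy.testing as npt

def test_daily_min():
    """Test that the (yet to be written) min function works for positive integers."""
    from inflammation.models import daily_min  # fails until implemented
    test_input = np.array([[4, 2], [1, 6], [2, 9]])
    test_result = np.array([1, 2])
    npt.assert_array_equal(daily_min(test_input), test_result)
```

```python
# inflammation/models.py - written SECOND, just enough to make the test pass
import numpy as np

def daily_min(data):
    """Calculate the daily minimum of a 2D inflammation data array."""
    return np.min(data, axis=0)
```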
If we're collaborating on a software project with multiple people who push a lot of changes to one of the major repositories, we'd need to constantly pull down their changes to our local machines, and do our tests with the newly pulled down code - this would result in a lot of back and forth, slowing us down quite a bit. That's where Continuous integration (CI) comes in handy:
There are many CI infrastructures and services. We’ll be looking at GitHub Actions - which, unsurprisingly, is available as part of GitHub.
YAML (a recursive acronym which stands for "YAML Ain't Markup Language") is a text format used by GitHub Actions workflow files. YAML files use key-value pairs for scalar values, dashes for list entries, and indentation for nested values (maps), e.g.:
name: Kilimanjaro
height_metres: 5892
first_scaled_by: Hans Meyer

first_scaled_by:
- Hans Meyer
- Ludwig Purtscheller

height:
  value: 5892
  unit: metres
  measured:
    year: 2008
    by: Kilimanjaro 2008 Precise Height Measurement Expedition
! FOLLOW ALONG IN YOUR CODESPACE ! Let's set up CI using GitHub Actions: with a GitHub repository, there's a way we can set up CI to run our tests automatically when we commit changes. To do this, we need to add a new file in a particular directory of our repository (make sure you're on the `test-suite` branch).
Let's create a new directory `.github/workflows`, which is used specifically for GitHub Actions, as well as a new file called `main.yml`:

$ mkdir -p .github/workflows
$ vim .github/workflows/main.yml

In the `main.yml`, we'll write the following:
name: CI
# We can specify which Github events will trigger a CI build
on: push
jobs:
build:
# we can also specify the OS to run tests on
runs-on: ubuntu-latest
# a job is a seq of steps
steps:
# Next we need to check out our repository, and set up Python
# A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
- name: Checkout repository
uses: actions/checkout@v2
- name: Set up Python 3.9
uses: actions/setup-python@v2
with:
python-version: "3.9"
- name: Install Python dependencies
run: |
python3 -m pip install --upgrade pip
pip3 install -r requirements.txt
- name: Test with PyTest
run: |
python -m pytest --cov=inflammation.models tests/test_models.py
- `name: CI`: the name of our workflow,
- `on: push`: indication that we want this workflow to run when we push commits to our repository,
- `jobs: build:`: the workflow itself is made of a single job named `build`, but could contain any number of jobs after this one, each of which would run in parallel,
- `runs-on: ubuntu-latest`: statement about which operating systems we want to use, in this case just Ubuntu,
- `steps:`: the steps that our job will undertake in turn, to 1) set up the job's environment (think of it as a freshly installed machine, albeit virtual, with very little installed on it) and 2) run our tests. Each step has a name (which you can choose to your liking - a 'name' is just an optional label shown in the log) and a way to be executed (as specified by `uses`/`run`):
  - `name: Checkout repository for the job`: use a GitHub Action called `checkout`,
  - `name: Set up Python 3.9`: here, we use the `setup-python` Action, indicating that we want Python version 3.9,
  - `name: Install Python dependencies`: install the latest version of `pip`, dependencies, and our `inflammation` package: in order to locally install our inflammation package, it's good practice to upgrade the version of pip that is present first, then use pip to install our package dependencies,
  - `name: Test with PyTest`: finally, we let PyTest run our tests in `tests/test_models.py`, including code coverage.

To address whether our code works on different target user platforms (e.g., Ubuntu, Mac OS, or Windows), with different Python installations (e.g., 3.8, 3.9 or 3.10), we can use a feature called build matrices. Setting up our tests across all these platforms and program versions by hand would take a lot of effort - that's where a build matrix comes in handy.
! FOLLOW ALONG IN YOUR CODESPACE !
- In `main.yml`, we can specify environments (such as operating systems) and parameters (such as Python versions), and new jobs will be created that run our tests for each permutation of these,
- we define a `strategy` as a `matrix` of operating systems and Python versions within `build`. We then use `matrix.os` and `matrix.python-version` to reference these configuration possibilities in the `job`:
# We can specify which Github events will trigger a CI build
on: push
# now define a single job 'build' (but could define more)
jobs:
build:
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: ["3.8", "3.9", "3.10"]
runs-on: ${{ matrix.os }}
# a job is a seq of steps
steps:
# Next we need to check out our repository, and set up Python
# A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
- name: Checkout repository
uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
Let's add the new folder/file to the local repository and merge with `develop`:
$ git add .github
$ git commit -m "Add GitHub Actions configuration & build matrix for os and Python version
$ git checkout develop
$ git merge test-suite
Whenever you push your changes to a remote repository, GitHub will run CI as specified by the `main.yml`. You can check its status on the website of your remote repository under `Actions`. For each push, you get a report about which of the steps ran successfully or failed.
You may also look into these resources on unit testing, scaling it up, and continuous integration:
Different things can be meant by the term "software design": algorithm design, software architecture, user interface design, or code design.
Design patterns are typical solutions to commonly occurring problems in software design (from any of the domains/levels mentioned above). From Refactoring Guru:
Programming paradigms such as object-oriented or functional programming (we'll get to those in a minute!) are not so straightforward to allocate with respect to the different facets of software design mentioned above: a programming paradigm represents its own way of thinking about and structuring code, with pros and cons when used to solve particular types of problems.
Technical debt: If we don't follow best practices around code, including addressing design questions, we may build up too much technical debt - the cost of refactoring code due to having chosen a quick-and-dirty solution instead of having used a better approach that would have initially taken longer.
There are two major families that we can group the common programming paradigms into: Imperative and Declarative.
We will look into two major paradigms from the imperative and declarative families that may be useful to you - functional programming and object-oriented programming.
In object-oriented programming, objects encapsulate data in the form of attributes and code in the form of methods that manipulate the objects’ attributes and define how objects can behave (in interaction with each other).
A class is a template for a structure and a set of permissible behaviors that we want our data to comply with; thus, each time we create some data using a class, we can be certain that it has the same structure.
If you know about Python lists and dictionaries, you may recognize that they behave similarly to how we may define a class ourselves:
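For instance, every Python list is an instance of the built-in `list` class: it bundles data (its elements) with methods (such as `append()`) that operate on that data:

```python
my_list = [1, 2]      # create an instance of the built-in list class
my_list.append(3)     # call a method that manipulates the instance's data
print(my_list)        # [1, 2, 3]
print(type(my_list))  # <class 'list'>
```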
Encapsulating data
Let's have a look at a simple class:
class Patient:
def __init__(self, name):
self.name = name
self.observations = []
Alice = Patient('Alice')
print(Alice.name)
Output:
Alice
- `__init__` is the initialiser method, which sets up the initial values and structure of the data inside a new instance of the class. We call the `__init__` method every time we create a new instance of the class, as in `Patient('Alice')`. The argument `self` refers to the instance on which we are calling the method and gets filled in automatically by Python whenever we instantiate a new class instance,
- we access an instance's attributes using dot notation (e.g., `Alice.name`).

Encapsulating behavior
Let's add a method to the above class which operates on the data that the class contains: adding a new observation to a Patient instance.
class Patient:
"""A patient in an inflammation study."""
def __init__(self, name):
self.name = name
self.observations = []
def add_observation(self, value, day=None):
if day is None:
try:
day = self.observations[-1]['day'] + 1
except IndexError:
day = 0
new_observation = {
'day': day,
'value': value,
}
self.observations.append(new_observation)
return new_observation
Alice = Patient('Alice')
print(Alice)
observation = Alice.add_observation(3)
print(observation)
print(Alice.observations)
Output:
<__main__.Patient object at 0x7f67f424c190>
{'day': 0, 'value': 3}
[{'day': 0, 'value': 3}]
- The first parameter of a method is always `self` (using this name is not strictly necessary, but is a very strong convention). Similar to the initialiser method, when we call a method on an object, the value of `self` is automatically set to this object - hence the name,
- we call methods on a class instance using dot notation (e.g., `Alice.add_observation(3)`).

Dunder Methods
The `__init__` method begins and ends with a double underscore - it is a dunder method. These dunder methods (also called magic methods) are not meant to be invoked directly by you; the invocation happens internally from the class on a certain action. Built-in classes in Python such as the `int` class define many magic methods.
- When we called `print(Alice)`, it returned `<__main__.Patient object at 0x7fd7e61b73d0>`, which is the string representation of the Alice object. Functions like `print()` or `str()` use `__str__()`,
- we can override the `__str__` method within our class to display the object's name instead of the object's default string representation.
class Patient:
"""A patient in an inflammation study."""
def __init__(self, name):
self.name = name
self.observations = []
def add_observation(self, value, day=None):
if day is None:
try:
day = self.observations[-1]['day'] + 1
except IndexError:
day = 0
new_observation = {
'day': day,
'value': value,
}
self.observations.append(new_observation)
return new_observation
def __str__(self):
return self.name
Alice = Patient('Alice')
print(Alice)
Output:
Alice
Relationships between classes
There are two fundamental types of object characteristics, which also denote the relationships among classes: composition ("has a") and inheritance ("is a").
Composition
In object oriented programming, we can make things components of other things, e.g., we may want to say that a doctor has patients or that a patient has observations. In the way we have written our class so far, a patient already has observations - which is a case of composition.
Let's separate the two, making a separate Observation class, and make use of it in the Patient class.
class Observation:
def __init__(self, day, value):
self.day = day
self.value = value
def __str__(self):
return str(self.value)
class Patient:
"""A patient in an inflammation study."""
def __init__(self, name):
self.name = name
self.observations = []
def add_observation(self, value, day=None):
if day is None:
try:
day = self.observations[-1].day + 1
except IndexError:
day = 0
new_observation = Observation(day, value)
self.observations.append(new_observation)
return new_observation
def __str__(self):
return self.name
Alice = Patient('Alice')
obs = Alice.add_observation(3, 3)
print(obs)
Output:
3
Inheritance
Inheritance is about data and behaviour that two or more classes share: if class X inherits from (is a) class Y, we say that Y is the superclass or parent class of X, or X is a subclass of Y - X gets all attributes and methods of Y.
If we want to extend the previous example to also manage people who aren’t patients we can add another class Person. But Person will share some data and behaviour with Patient - in this case both have a name and show that name when you print them. Since we expect all patients to be people (hopefully!), it makes sense to implement the behaviour in Person and then reuse it in Patient.
To write our class in Python, we use the class keyword, the name of the class, and then a block of the functions that belong to it. If the class inherits from another class, we include the parent class name in brackets.
class Observation:
def __init__(self, day, value):
self.day = day
self.value = value
def __str__(self):
return str(self.value)
class Person:
def __init__(self, name):
self.name = name
def __str__(self):
return self.name
class Patient(Person):
"""A patient in an inflammation study."""
def __init__(self, name):
super().__init__(name)
self.observations = []
def add_observation(self, value, day=None):
if day is None:
try:
day = self.observations[-1].day + 1
except IndexError:
day = 0
new_observation = Observation(day, value)
self.observations.append(new_observation)
return new_observation
- We make Patient inherit from Person both in the class definition (`class Patient(Person)`), as well as in the initialiser (`super().__init__(name)`),
- if we hadn't defined an `__init__` method for our subclass, Python would look for one on the parent class and use it automatically. This is true of all methods - if we call a method which doesn't exist directly on our class, Python will search for it among the parent classes,
- the line `self.name = name` in the Patient class becomes obsolete.

! QUESTION 1 ! What outputs do you expect here?
Alice = Patient('Alice')
print(Alice)
obs = Alice.add_observation(3)
print(obs)
Bob = Person('Bob')
print(Bob)
obs = Bob.add_observation(4)
print(obs)
Final note: When deciding how to implement a model of your particular system, you often have a choice of either composition or inheritance, where there is no obviously correct choice - multiple implementations may be equally good. (See more on that in The Composition Over Inheritance Principle.)
! TASK 3 ! Write a Doctor class to hold the data representing a single doctor: think about whether it should inherit from Person, and whether it should hold a list of patients (composition). Write some tests for it in `tests/test_patient.py`.
In functional programming, programs apply and compose/chain functions. It is based on the mathematical definition of a function `f()`, which does a transformation/mapping from input `x` to output `f(x)`.
Contrary to imperative paradigms, it does not entail a sequence of steps during which the state of the code is updated to reach a final desired state. It describes the transformations to be done without producing such side effects.
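As a quick illustration (a sketch using only built-in functions), the functional style below expresses a computation as a chain of transformations, without updating any intermediate state:

```python
numbers = [1, 2, 3, 4]

# compose two transformations: square each number, then sum the results;
# no variable is modified along the way
total_of_squares = sum(map(lambda x: x * x, numbers))
print(total_of_squares)  # 30
```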
The following two code examples implement the calculation of a factorial in procedural and functional styles, respectively. The factorial of a number `n` (denoted by `n!`) is calculated as the product of integer numbers from 1 to `n`.
Procedural style factorial function
def factorial(n):
"""Calculate the factorial of a given number.
:param int n: The factorial to calculate
:return: The resultant factorial
"""
if n < 0:
raise ValueError('Only use non-negative integers.')
factorial = 1
for i in range(1, n + 1): # iterate from 1 to n
# save intermediate value to use in the next iteration
factorial = factorial * i
return factorial
Here, we store intermediate state in a variable (`factorial` in the for loop) and advance towards the result step by step.

Functional style factorial function
def factorial(n):
"""Calculate the factorial of a given number.
:param int n: The factorial to calculate
:return: The resultant factorial
"""
if n < 0:
raise ValueError('Only use non-negative integers.')
if n == 0 or n == 1:
return 1 # exit from recursion, prevents infinite loops
else:
return n * factorial(n-1) # recursive call to the same function
- This function is pure: it does not use intermediate state (such as the variable `factorial` in the above example), or modify data that exists outside the current function, including the input data (e.g., printing text, writing to a file, modifying the value of an input argument, or changing the value of a global variable),
- the functional implementation stays closer to the mathematical definition for any `n`, while the procedural impl. runs faster. It is vital to consider your use case before choosing which kind of paradigm to use for your software.

! QUESTION 2 ! Which of these functions are pure?
def add_one(x):
return x + 1
def say_hello(name):
print('Hello', name)
def append_item_1(a_list, item):
a_list += [item]
return a_list
def append_item_2(a_list, item):
result = a_list + [item]
return result
Benefits of pure functions include testability, composability, and parallelisability:
As an example of composability, let's look at Python decorators: as we saw in the episode on parametrising our unit tests, a decorator can take a function, modify/decorate it, then return the resulting function. This is possible because in Python, functions can be passed around as normal data. Here, we discuss decorators in more detail and learn how to write our own. Let's look at the following code for ways to "decorate" functions.
# define function where additional functionality is to be added
def ordinary():
print("I am an ordinary function")
# define decorator, or outer function for first function
def decorate(func):
# define the inner function
def inner():
# add some additional behavior to original function
print("I am a decorator")
# call original function
func()
# return the inner function
return inner
# decorate the ordinary function
decorated_func = decorate(ordinary)
# call the decorated function
decorated_func()
Output:
I am a decorator
I am an ordinary function
- `ordinary()` is the function to be decorated,
- `decorate(func)` is the function that decorates another function,
- `decorate(ordinary)` builds another function that adds functionality to `ordinary()`.

Another way to use decorators is to add `@decorate` before the function to be decorated:
# define decorator, or outer function for first function
def decorate(func):
# define the inner function
def inner():
# add some additional behavior to original function
print("I am a decorator")
# call original function
func()
# return the inner function
return inner
# define function where additional functionality is to be added
@decorate
def ordinary():
print("I am an ordinary function")
# call the decorated function
ordinary()
Output:
I am a decorator
I am an ordinary function
! TASK 4 ! Write a decorator that measures the time taken to execute a particular function using the `time.process_time_ns()` function.
- You'll need to import the library `time`,
- get a time stamp right before the function in question is executed, using `start = time.process_time_ns()`, and get another time stamp once the calculation in question is done, using `end = time.process_time_ns()`,
- you can try out your decorator on the following function:
def measure_me(n):
total = 0
for i in range(n):
total += i * i
return total
Pull requests let you tell others about changes you've pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch. Code review plays an essential role in this process.
Code review is one of the most important practices of collaborative software development that improves code quality and increases knowledge about the codebase across the team. Before contributions are merged into the main branch, code will need to be reviewed, e.g., by the maintainer(s) of the repository.
Although the role of code review can't be overstated, we will not go into the details here, as it's better suited for self-study compared to other building blocks in research software engineering that we touch upon in this tutorial. See, e.g., a guide on code review from Kimmo Brunfeldt here.
The way you and your team provide contributions to the shared codebase depends on the type of development model you use in your project. Two commonly used models are the fork and pull model (contributors fork the repository and open pull requests from their forks) and the shared repository model (contributors push feature branches to a single shared repository).
! FOLLOW ALONG IN YOUR CODESPACE !
! TASK 5 ! Let us try to make a small PR adding our names to the `README.md`'s list of participants. (5-10 min)
1. Starting from the `main` branch, create a new branch `add-name`:

git switch main
git switch -C add-name # creates and switches to the branch directly
2. Edit the `README.md` file by adding your name under the "Participants" section.
3. Track the changes and commit them with:
git add README.md
git commit -m "Added my name for PR exercise"
Keep an eye out from here: Git might prompt you whether you would like to push your changes to a fork of the repository. This is where GitHub Codespaces comes in very handy, as it will create a fork of the original project for you, since you probably don't have write permission to our original repository.
4. Push your newly created branch to your own fork:
git push -u origin add-name
5. Create a PR to the original repository
For the sake of simplicity, we create the PR in Github's web interface.
If the previous steps were followed precisely, the `Code` tab of your fork of the project should give you the option to create a PR for your changes in `add-name` to the branch of your choice on the original (`upstream`) repository.
6. From the maintainer's perspective
Once a PR is received, we usually perform a code review.
The "File changed" tab in the PR's interface is a very useful tool to gage the changes that a PR makes to its targeted branch.
Once the code is reviewed, the maintainer can either request some changes, or proceed to merge the PR if it satisfies all the requirements.
In our case, since the changes are quite minimal, we just proceed to merge and close the PR.
Once that is done, your changes will be reflected in the corresponding branch of the project, or in our case, the `main` branch.
We will now have a look at `inflammation-analysis.py` which, in our example, is the entry point of our simple application - users will need to call it within a CLI, alongside a set of arguments:

python3 inflammation-analysis.py data/inflammation-03.csv

How to use the application and which arguments to specify can be accessed via:

python3 inflammation-analysis.py --help

`inflammation-analysis.py` can be run in different ways - as an imported library, or as the top-level script, in which case the global dunder variable `__name__` will be set to `"__main__"`.
__name__

In `inflammation-analysis.py`, we see the following code:
# import modules
def main():
# perform some actions
if __name__ == "__main__":
# perform some actions before main()
main()
`__name__` is a special dunder variable which is set, along with a number of other special dunder variables, by the Python interpreter before the execution of any code in the source file. What value is given by the interpreter to `__name__` is determined by the way in which the file is loaded.

If you run the following command (i.e., run the file as a script), `__name__` will be equal to `"__main__"`, and everything following the if-statement will be executed:
$ python3 inflammation-analysis.py
If you import your file as a module via `import inflammation-analysis`, `__name__` will be set to the module name, i.e., `__name__ = "inflammation-analysis"`.
In other words, the global variable `__name__` allows you to execute code when the file runs as a script, but not when it's imported as a module.
Python sets the global name of a module equal to __main__
if the Python interpreter runs your code in the top-level code environment.
“Top-level code” is the first user-specified Python module that starts running. It’s “top-level” because it imports all other modules that the program needs.
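As a minimal illustration (a hypothetical module `greeting.py`, not part of our project):

```python
# greeting.py - a hypothetical module illustrating __name__
def say_hello():
    print("Hello!")

print(f"__name__ is set to: {__name__}")

if __name__ == "__main__":
    # runs when executed as `python3 greeting.py`,
    # but not when loaded via `import greeting`
    say_hello()
```

Running `python3 greeting.py` prints `__name__ is set to: __main__` followed by `Hello!`, whereas `import greeting` prints `__name__ is set to: greeting` and nothing else.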
To be able to run `inflammation-analysis.py` in the CLI, we need to enable Python to read command line arguments. The standard Python library for reading command line arguments passed to a script is `argparse`. Let's look into `inflammation-analysis.py` again.
# we first initialise the argument parser class,
# passing an (optional) description of the program:
parser = argparse.ArgumentParser(
    description='A basic patient inflammation data system')

# we can now add the arguments that we want argparse
# to look out for; in our case, we only want to process
# the names of the file(s):
parser.add_argument(
'infiles',
nargs='+',
help='Input CSV(s) containing inflammation series for each patient')
# we parse the arguments passed to the script:
args = parser.parse_args()
- We define the name of the argument (`infiles`), the number of arguments to be expected (`nargs='+'`, where `'+'` indicates that there should be 1 or more arguments passed), and a help string for the user (`help='Input CSV(s) containing inflammation series for each patient'`),
- `parser.parse_args()` returns an object (called `args`) containing all the arguments requested. These can be accessed using the names that we have defined for each argument, e.g., `args.infiles` would return the filenames that were used as inputs,
- if we run `python3 inflammation-analysis.py data/inflammation-03.csv`, nothing visible will happen at that point, as `views.py` uses `matplotlib` but our CLI outputs only text; we could, however, add another modality to `views.py` to generate output that is shown in the CLI.
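For reference, here is a minimal, self-contained sketch of the same `argparse` pattern (a hypothetical stand-alone script, not our actual `inflammation-analysis.py`):

```python
# count_rows.py - hypothetical script demonstrating argparse with nargs='+'
import argparse

parser = argparse.ArgumentParser(
    description='Count the rows of one or more CSV files')
parser.add_argument(
    'infiles',
    nargs='+',
    help='Input CSV file(s)')
args = parser.parse_args()

# args.infiles is a list of all file names passed on the command line
for filename in args.infiles:
    with open(filename) as f:
        print(filename, sum(1 for _ in f))
```

Running `python3 count_rows.py data/inflammation-01.csv data/inflammation-02.csv` would print a row count for each of the two files.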
We will now look at how we can package software for release and distribution, using `Poetry` to manage our Python dependencies and produce a code package we can use with a Python package indexing service such as PyPI.
Here, we only marginally touch upon important factors to consider before publishing software, most of which have to do with documentation. Documentation is a foundational pillar in coding/writing software. While its significance can't be overstated, we omit this part in this tutorial, as it's better for self-study compared to other building blocks in research software engineering.
Documentation
Before releasing software for reuse, make sure you have documentation in place: at minimum, a README describing what the project does, how to install and use it, and how to contribute, plus license information.
Marking a software release
There are different ways in which we can make a software release from our code in Git/on GitHub, one of which is tagging: we attach a human-readable label to a specific commit, e.g., "v1.0.0", and push the change to our remote repo:
! FOLLOW ALONG IN YOUR CODESPACE !
$ git tag -a v1.0.0 -m "Version 1.0.0"
$ git push origin v1.0.0
We will use Python's `Poetry` library, which we'll install in our virtual environment (make sure you're in the root directory when activating the virtual environment, and let's check afterwards that we installed `Poetry` within it):
! FOLLOW ALONG IN YOUR CODESPACE !
$ source venv/bin/activate
$ pip3 install poetry
$ which poetry
Poetry uses a `pyproject.toml` file to describe the build system and requirements of the distributable package.

To create a `pyproject.toml` file for our code, we can use `poetry init`, which will guide us through the most important settings (for each prompt, we either enter our data or accept the default).

Below, you see the questions with the recommended responses, so do follow these (and use your own contact details). Importantly, use "inflammation" as the package name so that `Poetry` can automatically find the code.

$ poetry init
Output:
This command will guide you through creating your pyproject.toml config.
Package name [example]: inflammation
Version [0.1.0]: 1.0.0
Description []: Analyse patient inflammation data
Author [None, n to skip]: Nadine Spychala <nadine.spychala@gmail.com>
License []: MIT
Compatible Python versions [^3.8]: ^3.8
Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
Generated file
[tool.poetry]
name = "inflammation"
version = "1.0.0"
description = "Analyse patient inflammation data"
authors = ["Nadine Spychala <nadine.spychala@gmail.com>"]
license = "MIT"
[tool.poetry.dependencies]
python = "^3.8"
[tool.poetry.dev-dependencies]
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
Do you confirm generation? (yes/no) [yes] yes
When we add a dependency using `Poetry`, Poetry will add it to the list of dependencies in the `pyproject.toml` file, and automatically install the package into our virtual environment.

Poetry distinguishes between runtime dependencies - packages the code needs in order to run, such as `matplotlib` and `NumPy` - and development dependencies. The latter are dependencies which are needed/essential in order to develop code, but not required to run it, e.g., `pylint` or `pytest`.

$ poetry add matplotlib numpy
$ poetry add --dev pylint
$ poetry install
Let's build a distributable version of our software:
$ poetry build
This should produce two files for us in the `dist` directory, of which the most important one is the `.whl` or wheel file. This is the file that `pip` uses to distribute and install Python packages, so this is the file we'd need to share with other people who want to install our software.
If we gave this wheel file to someone else, they could install it using `pip`:
$ pip3 install dist/inflammation*.whl
If we need to publish an update, we just update the version number in the `pyproject.toml` file, then use `Poetry` to build and publish the new version. Any re-publishing of the package, no matter how small the changes, needs to come with a new version number.
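For example (a sketch - assuming the current version is 1.0.0, and using Poetry's `version` command to bump it):

```bash
$ poetry version patch   # bumps 1.0.0 -> 1.0.1 in pyproject.toml
$ poetry build           # rebuild the wheel with the new version number
```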
Note to self: General outline,
WORK ON THIS
WORK ON THIS
Giving credit, what has been changed, and license website.
I get support for this tutorial from my fellowship at the Software Sustainability Institute.
WORK ON THIS