### 90 min tutorial at the Artificial Life conference 2023, 24th July

<span style="color:red">**FOR FUTURE LEARNERS & INSTRUCTORS ALIKE: FEEL FREE TO REUSE! Bear in mind this event has just passed, and I (Nadine) will follow up on this doc with some post-processing.**</span> :slightly_smiling_face:

#### Instructors: [Nadine Spychala](https://nadinespy.github.io/) & [Rousslan Dossa](https://dosssman.github.io/)

## Welcome! :relaxed::four_leaf_clover:

This is a tutorial on best practices in research software engineering, using Python as an example programming language:
- writing research software well and collaboratively,
- while explicitly taking into account software sustainability and open and reproducible science.

It is a **modified 90-minute mini-version** of the [Intermediate Research Software Development](https://carpentries-incubator.github.io/python-intermediate-development/) course from the [Carpentries Incubator](https://carpentries-incubator.org/), so for anyone wanting to delve into more detail or get more practice, looking into that course is probably one of the best options. :slightly_smiling_face: Here, you'll get
- little tasters of most sections of the original course
- with a **focus on testing and software design**,
- as well as some **new learning content, resources and tools** that you won't find in the original course.

**Some content/sections** in this tutorial are meant for you to (optionally) go through **before or after the tutorial**.

We'll use [GitHub CodeSpaces](https://github.com/features/codespaces) – a cloud-powered development environment that one can configure to one's liking.
- Everyone will instantiate a **GitHub codespace** within their GitHub account and **all coding will be done from there** - folks will be able to directly apply what is taught in their codespace, work on exercises, and implement solutions.
- Thus, the only thing you will need for this tutorial is an account on [GitHub](https://github.com/). More on GitHub CodeSpaces further below.

Disclaimer: rather than this being a tutorial about how to do collaborative research software engineering with a particular Python lens, we use **Python as a vehicle to convey fairly general RSE principles**. For time reasons, we can only get a little bit into each of the tutorial's topics - each of which is a little (or big) world of its own.
- **This hackmd file will not disappear** - should you like to follow up on the tutorial, or go through (some of) its contents again, you are more than welcome to do so!
- **You may even re-use this template**, should you want to deliver a tutorial yourself, or use the material for other purposes - before doing so, please read the license section at the end of this document.

**If this tutorial incentivises you to delve more deeply into anything related to research software engineering, it will have been a full success!** :sunny:

## Table of Contents (tentative)

**0. Welcome** (5 min)
0.1 Let's introduce ourselves
0.2 Recap & motivation: why collaboration and best research software engineering practices in the first place?
0.3 Difference between "mere" coding and research software engineering
0.4 What you'll learn
0.5 Whom this tutorial is for
0.6 What you need to have done before the event (for participants to read before the event)

**1. Let's start! Introduction into the project & setting up the environment** (15 min)
1.1 The project
1.2 GitHub CodeSpaces
1.3 Integrated Development Environments
1.4 Git and GitHub
1.5 Creating virtual environments
**2. Ensuring correctness of software at scale** (20 min)
2.1 Unit tests
2.2 Scaling up unit tests
2.3 Debugging code & code coverage
2.4 Continuous integration

**3. Software design** (20 min)
3.1 Programming paradigms
3.2 Object-oriented programming
3.3 Functional programming

**4. Writing software with - and for - others: workflows on GitHub, APIs, and code packages** (20 min)
4.1 GitHub pull requests
4.2 How users can use the program you write: application programming interfaces
4.3 Producing a code package
4.4 Personal experience with research software in academic and professional settings

**6. Wrap-up** (5 min)
**7. Further resources**
**8. License**
**9. Original course**
**10. Funding & Acknowledgements**

**Color coding** in this file:
- <span style="color:green">**GREEN**</span>:
    - denotes questions or tasks for the participants, or
    - indicates that you're supposed to follow along in your codespace.
- <span style="color:purple">**PURPLE**</span>: indicates content/sections that participants may (optionally) go through before or after the tutorial.

### 0.1 Let's introduce ourselves

Feel free to write down some or all of the following suggestions: _name / institution or affiliation / contact details, e.g., mail, Twitter or Mastodon, if you'd like to share those / why did you come to this tutorial, or what is your motivation?_

Please write down your answers here:
- Nadine Spychala / University of Sussex, Software Sustainability Institute / tw: @NadineSpychala, mastodon: https://mstdn.social/@nadinespy / I want to contribute to better software and collaborative practices – and therefore better research – in scientific fields I'm involved in myself.
- [Rousslan Dossa](dosssman.github.io) / Researcher at [Araya Inc.](https://research.araya.org/) / tw: [@RousslanDossa](https://twitter.com/RousslanDossa) / Working toward sample efficient reinforcement learning agents with (super) human-level decision-making while producing easy to use and well maintained research code for others to collab. and build upon.
- [Oskar Elek](elek.pub), Postdoc at University of California in Santa Cruz, founder/lead of the cross-disciplinary project [PolyPhy](github.com/PolyPhyHub), want to learn more about sustainability practices when it comes to open-source research software ecosystems
- [M Charity](https://mastermilkx.github.io/) / NYU Tandon - Game Innovation Lab / Discord: MasterMilkX, Twitter: [@MasterMilkX](https://twitter.com/MasterMilkX) / I want to write better, cleaner open source code for people to use in their own research projects
- [Amany Azevedo Amin](https://www.linkedin.com/in/amany-azevedo-amin/) - University of Sussex: I'd like to encourage collaboration in our research group, enabled through code that can be shared and understood

<span style="color:purple">**SECTIONS 0.2-0.6: PLEASE GO THROUGH THOSE PARTS BEFORE THE EVENT.**</span>

### 0.2 Recap & motivation: why collaboration and best research software practices in the first place?

- In science, we often want or need to reproduce results to **build knowledge incrementally**.
- If, for some reason, results can't be reproduced, we at least want to **understand the steps taken** to arrive at the results, i.e., have transparency on the tools used, code written, computations done, and anything else that has been relevant for generating a given research result.
- However, very often, the steps taken - and particularly the **code** written - for generating scientific results are **not available**, and/or **not readily implementable**, and/or **not sufficiently understandable**.
- The consequences are:
    - **Redundant**, or, at worst, **wasted work**, if reproduction of results is essential, but not possible. This, in the grand scheme of things, greatly slows down scientific progress.
    - Code that is not designed to be re-used – and thus scrutinized by others – runs the risk of being flawed and therefore, in turn, producing **flawed results**.
    - It **hampers collaboration** – something that becomes increasingly important as
        - people from all over the world become more inter-connected,
        - more diversified and specialized knowledge is produced (such that different "parts" need to come together to create a coherent "whole"),
        - the mere amount of people working in science increases,
        - many great things can't be achieved alone.
- To manage those developments well and avoid working in silos, it is important to have **structures** in place that **enable people to join forces**, and to respond to and integrate each other's work well - we need more teamwork.
- **Why is it difficult to establish collaborative and best coding practices?** For cultural/scientific-practice reasons, and because of the way academia has set up its **incentives** (the number of papers for which authors are given credit as _individuals_ counts, and the prestige of journals plays a role), special value is placed on individual rather than collaborative research outputs. These incentives also encourage doing things quick-and-dirty rather than right. This needs to change.

### 0.3 Difference between "mere" coding and research software engineering

The terms *programming* (or even *coding*) and *software engineering* are often used interchangeably, but they don't mean the same thing. Programmers or coders tend to focus on one part of software development: implementation. Also, in the context of academic research, they often write software just for themselves and are the sole stakeholders. Someone who is *engineering* software takes a broader view of code which also considers:
- **the lifecycle of software**: viewing software development as a process that proceeds from understanding software requirements, to writing the code and using/releasing it, to what happens afterwards,
- **who will (or may) be involved**: software is written for stakeholders - this may only be one researcher initially, but there is an understanding that others may become involved later on (even if that isn't evident yet),
- **the value of the software itself**: it is not merely a by-product, but may provide benefits in various ways, e.g., in terms of what it can do, the lessons learned throughout its development, and as an implementation of a research approach (i.e. a particular research algorithm, process, or technical approach),
- **its potential reuse**: there is an assumption that (parts of) the software could be reused in the future (this includes your future self!).

**Bearing the difference between coding and software engineering in mind, how much do scientists actually need to do either of them?** Should they rather code or write software, or do both (and if both, when do they do what)? This is a hard question and will very much depend on a given research project.
In [Scientific coding and software engineering: what's the difference?](https://www.software.ac.uk/blog/2016-09-26-scientific-coding-and-software-engineering-whats-difference), it is argued that "*scientists want to explore, engineers want to build*". Both too little and too much of an engineering component in writing code can be a hindrance in the research process:
- Too much engineering right at the start can be a problem because, unlike in other professions, doing research often means venturing into the unknown, and taking/abandoning many different paths and turns - overgeneralizing code can be wasted work, if the research underlying it hasn't gone beyond the exploratory stage (and if no one else except for oneself is engaging in this exploratory stage).
- Too little engineering can be a problem, too: any code that accompanies research beyond the very first exploratory stages, and that other people are involved in using in some way, will face problems.

To boil the challenge down: when you start out writing code for your research, you need to ask yourself: **How much do you want to generalize and consider factors in the software lifecycle *upfront*** in order to spare work at a later time-point, **vs. stay specific and write single-use code** so as not to end up doing (potentially) large amounts of unnecessary work, if you (unexpectedly) abandon paths taken in your research?

While this is a question every coder/software engineer needs to ask themselves, it's a particularly important one for researchers. It may not be easy to find a sweet spot, but, as a heuristic, you may err on the side of incorporating software engineering into your coding as soon as
- you believe that there's a slight chance the code will be re-used by others (including yourself) now or in the future,
- you believe that there's a chance you are on a research trajectory that you will potentially, to some extent, communicate about in the future (in which case you'd need to provide the code alongside your research and have it ready for reuse, e.g., so that others can explore your research and/or build incrementally on top of it).

More often than not, one or both points will apply fairly quickly.

### 0.4 What you'll learn

This tutorial equips you with a solid foundation for working on software development in a team, using practices that help you write code of higher quality, and that make it easier to develop and sustain code in the future – both by yourself and others. The topics covered concern core, intermediate skills covering important aspects of the software development life-cycle that will be of most use to anyone working collaboratively on code.

**At the start, we'll address**
* Integrated Development Environments,
* Git and GitHub,
* virtual environments.

**Regarding testing software**, you'll learn how to
* ensure that results are correct by using unit testing and scaling it up,
* debug code & include code coverage,
* automate testing via continuous integration.

**Regarding software design**, you'll particularly learn about
* object-oriented programming, and
* functional programming.

**With respect to working on software with - and for - others**, you'll hear about
* collaboratively developing software on GitHub (using pull requests),
* application programming interfaces,
* packaging code for release and distribution.
Some of you will likely have written much more complex code than what you'll encounter in this tutorial, yet we call the skills taught "intermediate", because for code development in teams, you need more than just the right tools and languages – you need a strategy (best practices) for how you'll use these tools _as a team_, or at least for potential re-use by people outside your team (which may very well consist only of you). Thus, it's less about the complexity of the code as such within a self-contained environment, and more about the complexity that arises due to other people either working on it, too, or re-using it for their purposes.

### 0.5 Whom this tutorial is for

The best way to check whether this tutorial is for you is to browse its contents in this HackMD main document. This tutorial is targeted at anyone who
* has basic programming skills in Python (or any other programming language – it is not essential to be a Python coder),
* has some basic familiarity with Git/GitHub, and
* aims to learn more about best practices and new ways to tackle research software development (as a team).

It is suitable for all career levels – from students to (very) senior researchers for whom writing code is part of their job, and who either are eager to up-skill and learn things anew, or would like to have a proper refresh and/or new perspectives on research software development. If you're keen on learning how to restructure existing code such that it is more robust, reusable and maintainable, automate the process of testing and verifying software correctness, and collaboratively work with others in a way that mimics a typical software development process within a team, then **we're looking forward to you**!

### 0.6 What you need to do before the event

The only thing you need to do before the event is to create an account on [GitHub](https://github.com/), if you haven't done so already.

## Let's start! :rocket::boom:

## 1. Introduction into the project & setting up the environment

### 1.1 The project

In this tutorial, we will use the [Patient Inflammation Study Project](https://github.com/carpentries-incubator/python-intermediate-inflammation) which has been set up for educational purposes by the course creators, and is stored on GitHub. The project's purpose is to study the effect of a treatment for arthritis by analysing the inflammation levels in patients who have been given this treatment.

**The data**:
- each csv file in the ```data``` folder of the repository represents inflammation measurements from one separate clinical trial of the drug,
- each single csv file contains information for 60 patients whose inflammation levels have been recorded for 40 days whilst participating in the trial:
    - each row holds inflammation measurements for a single patient,
    - each column represents a successive day in the trial,
    - each cell represents an inflammation value on a given day for a given patient (in some arbitrary units of inflammation measurement).

The project as seen on the repository is not finished and contains some errors. We will work incrementally on the existing code to fix those and add features during the tutorial.

**Goal**: Write an application for the command line interface (CLI) to easily retrieve patient inflammation data, and display simple statistics such as the daily mean or maximum value (using visualization).
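To get a first feel for the data before we dive into the project code, here is a minimal sketch (not part of the project's code) of how one such csv file could be inspected with ```NumPy``` - assuming, for illustration, a file named ```data/inflammation-01.csv``` in the repository:

```python=
import numpy as np

# Load one trial's data: rows are patients, columns are days.
# We assume data/inflammation-01.csv exists in the repository's data folder.
data = np.loadtxt('data/inflammation-01.csv', delimiter=',')

print(data.shape)         # (60, 40): 60 patients, 40 days
print(data.mean(axis=0))  # daily mean inflammation across all patients
print(data.max())         # maximum inflammation value in the whole trial
```

Computing exactly these kinds of statistics - in a well-structured and well-tested way - is what the application is about.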
**The code**:
- a Python script ```inflammation-analysis.py``` which provides the main entry point into the application - this is the script we'll eventually run in the CLI, and for which we need to provide inputs (such as which data files to use),
- three directories: ```inflammation```, which contains collections of functions in ```views.py``` and ```models.py```; ```data```; and ```tests```, which contains tests for our functions in ```inflammation```,
- there is also a ```README``` file (describing the project, its usage, installation, authors and how to contribute).
- We'll get into each of these files later on.

### 1.2 GitHub CodeSpaces

We will use [GitHub CodeSpaces](https://github.com/features/codespaces). A codespace is a cloud-powered development environment that you can configure to your liking. It can be accessed from:
- a web browser,
- [Visual Studio Code](https://code.visualstudio.com/),
- the [JetBrains Gateway application](https://www.jetbrains.com/remote-development/gateway/) (a compact desktop app that allows you to work remotely with a JetBrains IDE such as [PyCharm](https://www.jetbrains.com/pycharm/) without necessarily downloading one),
- or by using [GitHub CLI](https://cli.github.com/) (a command-line tool that brings pull requests, issues, GitHub Actions, and other GitHub features to your terminal).

GitHub CodeSpaces' superpower is that you can code from any device and get a standardized environment, as long as you have internet. This is perfect for our purposes - and maybe for some of yours in the future, too! - as we'll avoid the hassle of installing programs on your machines and copying/cloning GitHub repositories remotely/locally before you can start coding. This spares us unexpected problems that would very likely occur when setting up the environment we need. Python in particular can be a mess when it comes to dependencies between different components... see this [XKCD](https://xkcd.com/1987/) webcomic for an insightful illustration:

![](https://hackmd.io/_uploads/HJMYHv1uh.png)
Creative Commons Attribution-NonCommercial 2.5 License

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's instantiate a GitHub codespace**:
1. Log into your GitHub account.
2. Go to the [Patient Inflammation Study Project](https://github.com/nadinespy/python-intermediate-inflammation-for-alife) (a fork of the original repository - we'll get to forks below), and click on the upper right green button ```Use this template```, then choose ```Open in a codespace```. Your codespace should load and be ready in a few seconds.

Let's inspect what we see...

### 1.3 Integrated Development Environments

In the cloud, you'll see that we use VSCode as an **Integrated Development Environment** (IDE); however, you could also use a codespace via your locally installed VSCode program by adding a [GitHub Codespaces extension](https://marketplace.visualstudio.com/items?itemName=GitHub.codespaces) to it.

- What is an [Integrated Development Environment](https://en.wikipedia.org/wiki/Integrated_development_environment)? IDEs help manage all the arising complexities of the software development process, and normally consist of at least a source code editor, build automation tools and a debugger. They often provide support for syntax highlighting, code completion, code refactoring, and embedded version control via Git (we'll get to this below). As your codebase gets bigger and more complex, it will quickly involve many different files and external libraries.
- IDEs are extremely useful and modern software development would be very hard without them.
- VSCode comes with
    - **syntax highlighting**: makes code easier to read, and errors easier to find,
    - **code completion**: speeds up the process of coding by, e.g., reducing typos, offering available variable names, functions from available packages, or parameters of functions,
    - **code definition & documentation references**: helps you get code information by showing, e.g., definitions of symbols (e.g. functions, parameters, classes, fields, and methods),
    - **code search**: helps you find a text string within a project by, e.g., using different scopes to narrow your search process, and
    - **version control** (more on that below).
- We have the standard icons as displayed in VSCode on our local machines:
    - ```Explorer```: browse, open, and manage all of the files and folders in your project,
    - ```Run and Debug```: see all information related to running and debugging,
    - ```Extensions```: add languages, debuggers, and tools to your installation,
    - ```Source control``` (or version control): track and manage changes to code.
- We can use the ```Terminal```: an interface in which you can type and execute text-based commands - here, we'll use [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) which is both a Unix shell and a command language.
    - A [shell](https://en.wikipedia.org/wiki/Shell_(computing)) is a program that provides an entry point to/exposes an operating system's services to a human user or other programs.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's install the Python and Jupyter extensions for VSCode** created by Microsoft by clicking the extensions icon at the bottom of the sidebar within the IDE, searching for Python (selecting the *IntelliSense* extension) and Jupyter, and clicking to install.
- Jupyter will allow us to use the **IPython** console which is much more convenient for running and inspecting code outputs in a more interactive way.

### 1.4 Git and GitHub

VSCode supports **version control** using [Git version control](https://git-scm.com/): at the lower left corner, we can see which branch - something like a "container" storing a particular version of our code - in our version control system we're currently on (normally, if you didn't change into another branch, it's the one called ```main```).

- What is [version control](https://about.gitlab.com/topics/version-control/)? Version control allows you to track and control changes to your code. Its benefits can't be overstated:
    - it gives you and your team a single source of truth - one shared reality: everyone is working with the same set of data (files, code, etc.) and can access it in a common location,
    - it enables parallel development, which becomes important as the need to manage multiple versions of code and files grows,
    - it maintains a full file history - nothing is lost, if mistakes happen, or older code regains importance,
    - it supports automation within the development process by automating tasks, such as testing and deployment,
    - it enables better teamwork by providing visibility into who is working on what,
    - it is particularly effective for tracking text-based files (e.g. source code files, CSV, Markdown, HTML, CSS, TeX).
- Major components and commands used to interact with different parts of the Git infrastructure:
    - **working directory**: a directory (including any subdirectories) where your project files live and where you are currently working. Changes in this area are untracked by Git unless it is explicitly told to save them, which you do by using ```git add filename```,
    - **staging area**: once you tell Git to start tracking changes to files, it saves those changes in the staging area. Each new change to the same file needs to be followed by another ```git add filename``` command to update it in the staging area,
    - **local repository**: this is where Git wraps together all your changes from the staging area when you use the ```git commit -m "some message indicating what you commit/had changed"``` command. Each commit is a new, permanent "snapshot" (checkpoint, or record) of your project in time which you can share and get back to.
        - ```git status``` allows you to check the current status of your working directory and local repository, e.g., whether there are files which have been changed in the working directory, but not staged for commit (another command you'd probably use very often).
    - **remote repository**: this is a version of your project that is hosted somewhere on the Internet (e.g. on [GitHub](https://github.com/), [GitLab](https://about.gitlab.com/) or elsewhere). While you may version-control your local repository, you still run the risk of losing your work, if your machine breaks. When you use a remote repository, you'll need to push your changes from your local repository using ```git push origin branch-name```, and, if collaborating with other people, pull their changes using ```git pull``` or ```git fetch``` to keep your local repository in sync with others.
        - The key difference between ```git fetch``` and ```git pull``` is that the latter copies changes from a remote repository directly into your working directory, while ```git fetch``` copies changes only into your local Git repo.

![](https://hackmd.io/_uploads/HyRaMox_2.png)
Git workflow from [PNGWing](https://www.pngwing.com/en/free-png-sazxf).

- Working with different **branches**: whenever you add a separate and self-contained piece of work, it's best practice to use a new branch (called a **feature branch**) for it, and merge it later on with the main branch as soon as it's safe to do so. Each feature branch should have a suitable name that conveys its purpose (e.g. "issue-fix23").
    - Using different branches enables
        - the main branch to remain stable while you and the team explore and test the new (not-yet-functional) code on a feature branch,
        - you and other collaborators to work on several features at the same time without interfering with each other.
    - Normally, we'd have
        - a **main branch** (most often called ```main```) which is the version of the code that is fully tested, stable and reliable,
        - a **development branch** (often called ```develop``` or ```dev``` by convention) that we use for work-in-progress code. Feature branches first get merged into ```develop``` after having been thoroughly tested. Once ```develop``` has been tested with the new features, it will get merged into ```main```.
- Switching between/merging different branches:
    - ```git branch develop``` creates a new branch called ```develop```,
    - ```git merge branch-name``` allows you to merge ```branch-name``` into the one you're currently on,
    - ```git branch``` tells you which branch you're currently on (something you'd check probably very frequently) as well as gives you a list of which branches exist (the one you're on is denoted by a star symbol),
    - ```git checkout branch-name``` allows you to switch from your current branch to ```branch-name```.

![](https://hackmd.io/_uploads/HJKmcZ8D2.jpg)
Git feature branches, adapted by original course creators from [Git Tutorial by sillevl](https://sillevl.gitbooks.io/git/content/collaboration/workflows/gitflow/) (Creative Commons Attribution 4.0 International License)

### 1.5 Creating virtual environments using ```venv```

In ```inflammation/models.py```, we see that we import two external libraries:

```python=
from matplotlib import pyplot as plt
import numpy as np
```

- In Python, we very often use external libraries that don't come as part of the standard Python distribution. Sometimes, you may need a specific version of an external library (e.g., because your code uses a feature, class, or function from an older version that has been updated since), or a specific version of the Python interpreter.
- Consequently, each project may require a different setup and different dependencies, so it will be very useful to keep those different configurations separate to avoid confusion between projects --> that's where a [virtual environment](https://realpython.com/python-virtual-environments-a-primer/) comes in handy: a virtual environment **creates an isolated "working copy" of a given software project** that uses a specific version of the Python interpreter together with specific versions of a number of external libraries installed into that virtual environment. You can create a self-contained virtual environment for any number of separate projects. The specifics of your virtual environment can then be reproduced by someone else using their own virtual environment.
- Virtual environments make it a lot easier for collaborators to use and work on your project's code, as potential installation problems and package version clashes are spared.
- Most modern programming languages are able to use virtual environments to isolate libraries for a specific project and make it easier to develop, run, test and share code with others; they are not a specific feature of Python.
- A list of commonly used Python virtual environment managers (we'll use ```venv```):
    - ```venv```,
    - ```virtualenv```,
    - ```pipenv```,
    - ```conda```,
    - ```poetry```.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's create our virtual environment** by creating a new folder called ```venv```, and instantiating a virtual environment also called ```venv``` in the terminal:

```bash
$ python3 -m venv venv       # creating a new folder called "venv",
                             # and instantiating a virtual environment
                             # equally called "venv"
$ source venv/bin/activate   # activate virtual environment

(venv) $ which python3       # check whether Python from venv is used
Output: /workspaces/python-intermediate-inflammation/venv/bin/python3

(venv) $ deactivate          # deactivate virtual environment
```

Our code depends on two external packages (```numpy```, ```matplotlib```).
We need to install those into the virtual environment to be able to run the code, using a **package manager tool** such as ```pip```:

```bash
(venv) $ pip3 install numpy matplotlib
```

When you are collaborating on a project with a team, you will want to make it easy for them to replicate equivalent virtual environments on their machines. With pip, virtual environments can be exported, saved and shared with others by creating a file called, e.g., ```requirements.txt``` (you can name it as you like, but it's common practice to label this file as "requirements") in your current directory, and producing a list of packages that have been installed in the virtual environment:

```bash
(venv) $ pip3 freeze > requirements.txt   # produce list of packages
(venv) $ pip3 list                        # view packages installed
```

If someone else is trying to use your library within their own virtual environment, instead of manually installing every dependency, they can just use the command below to install everything specified in the `requirements.txt` file.

```bash
(venv) $ pip3 install -r requirements.txt   # install packages from
                                            # requirements file
```

Let's check the status of our repository using ```git status```. We get the following output:

```bash
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        requirements.txt
        venv/

nothing added to commit but untracked files present (use "git add" to track)
```

While you do not want to commit the newly created directory ```venv``` and share it with others, as it is specific to your machine and setup only (containing local paths to libraries on your system specifically), you will want to share ```requirements.txt``` with your team, as this file can be used to replicate the virtual environment on your collaborators' systems.

To tell Git to ignore and not track certain files and directories, you need to specify them in the ```.gitignore``` text file in the project root. You can also ignore multiple files at once that match a pattern (e.g. "\*.jpg" will ignore all jpeg files in the current directory). Let's add the necessary lines to the ```.gitignore``` file:

```bash
# Virtual environments
venv/
.venv/
```

Let's make a first commit to our local repository:

```bash
$ git add .gitignore requirements.txt
$ git commit -m "Initial commit of requirements.txt. Ignoring virtual env. folder."
```

## 2. Ensuring correctness of software (at scale)

**Why is testing good?**
- We want to increase the chances that we're writing **correct code** for various possible inputs, and that, if it's a contribution to an existing codebase, it doesn't break it. (People often underestimate, or are not aware of, the risk of writing incorrect code.)
    - E.g., for the sake of argument, if each line we write has a 99% chance of being right, then a 70-line program will be wrong more than half the time (see the quick check after this list).
- Debugging code is a hard reality - it will happen (a lot), no matter the coding skills. Our future selves and collaborators will likely have a much **easier time debugging**, if tests had been written beforehand.
- Obviously, writing code with tests takes some additional effort compared to writing code without tests. However, the effort pays off quickly, and likely **saves huge amounts of time in the medium to long term** by allowing us to more comprehensively and rapidly find errors, as well as giving us greater confidence in the correctness of our code.
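That back-of-the-envelope claim is easy to verify (assuming, for simplicity, that lines fail independently):

```python=
# Probability that a 70-line program is entirely correct,
# if each line is right with probability 0.99:
print(0.99 ** 70)   # ~0.495, i.e. the program is wrong more than half the time
```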
**We'll get into:**
- **Automatically testing software** using a test framework called [Pytest](https://docs.pytest.org/en/7.3.x/) that helps us structure and run automated tests.
    - Manual and automatic code testing are not mutually exclusive, but complement each other. Manual testing is particularly good for checking graphical user interfaces and comparing visual outputs against inputs, but very time-consuming for many other things one would like to test for.
- **Scaling up unit testing** by making use of test parameterisation to increase the number of different test cases we can run.
- **Continuous integration** (CI) for automated testing using [GitHub Actions](https://github.com/features/actions) - a CI infrastructure that allows us to automate tasks when some change happens to our code, such as running tests when a new commit is made to a code repository.
- **Debugging code** using the integrated debugger in VSCode which helps us locate an error in our code while it is running, and fix it.

### 2.1 Unit Testing

Let's create a new branch called ```test-suite``` where we'll write our tests. It is good practice to write tests at the same time as we write some new code on a feature branch. But since the code already exists, we're creating a feature branch just for writing tests this time. Generally, it is encouraged to use branches even for small bits of new work.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's generate a new feature branch**:

```
$ git checkout develop
$ git branch test-suite
$ git checkout test-suite
```

Now let's look at the ```daily_mean()``` function in ```inflammation/models.py```. It calculates the daily mean of inflammation values across all patients.

Let's first think about how we could manually test this function. One way to test whether this function does the right thing is to think about which output we'd expect given a certain input. We can **test** this **manually** by creating an input and output variable, and using, e.g., ```npt.assert_array_equal()``` to check whether the outcome of ```daily_mean()``` given the input variable matches the output variable.

- To use ```daily_mean()```, we need to import it - and for the import to work, our interactive session needs to run from the project's root directory (we can check the current working directory with ```os.getcwd()```).
- Let's open the IPython console by right-clicking on ```import numpy as np```, and choosing ```Run in Interactive Window```, and ```Run Selection/Line in Interactive Window```.

```python=
import os

# check the current working directory
os.getcwd()

import numpy as np
import numpy.testing as npt
from inflammation.models import daily_mean

test_input = np.array([[1, 2], [3, 4], [5, 6]])
test_result = np.array([3, 4])
npt.assert_array_equal(daily_mean(test_input), test_result)
```

We can think about multiple pairs of expected output given a certain input:

```python=
test_input = np.array([[2, 0], [4, 0]])
test_result = np.array([2, 0])
npt.assert_array_equal(daily_mean(test_input), test_result)

test_input = np.array([[0, 0], [0, 0], [0, 0]])
test_result = np.array([0, 0])
npt.assert_array_equal(daily_mean(test_input), test_result)
```

However, we get a mismatch between input and output for the first test:

```python
...
AssertionError:
Arrays are not equal

Mismatched elements: 1 / 2 (50%)
Max absolute difference: 1.
Max relative difference: 0.5
 x: array([3., 0.])
 y: array([2, 0])
```

The reason here is that one of our specified outputs is wrong - which reminds us that tests themselves can be written in a wrong way, so it's good to keep them as simple as possible so as to minimize errors.
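The fix is in the test itself: the daily means of ```[[2, 0], [4, 0]]``` are ```[3, 0]```, not ```[2, 0]``` (the mean of ```[2, 4]``` is 3). Continuing in the interactive window, the corrected test case passes:

```python=
test_input = np.array([[2, 0], [4, 0]])
test_result = np.array([3, 0])   # mean of [2, 4] is 3; mean of [0, 0] is 0
npt.assert_array_equal(daily_mean(test_input), test_result)
```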
We could put these tests in a separate script to automate running them. However, a Python script stops at the first failed assertion, so if one assertion fails - for whatever reason - all subsequent tests wouldn't be run at all --> this calls for a **testing framework** such as ```Pytest``` where we
- define the tests we want to run as **functions**,
- are able to run a **plethora of tests** at the same time regardless of whether test failures have occurred, and
- get an **output summary** of the test functions.

#### Unit testing with Pytest

Let's look at ```tests/test_models.py``` where we see one test function called ```test_daily_mean_zeros()```:

```python=
def test_daily_mean_zeros():
    """Test that mean function works for an array of zeros."""
    from inflammation.models import daily_mean

    test_input = np.array([[0, 0], [0, 0], [0, 0]])
    test_result = np.array([0, 0])

    # Need to use NumPy testing functions to compare arrays
    npt.assert_array_equal(daily_mean(test_input), test_result)
```

Generally, each test function requires
- inputs, e.g., the ```test_input``` ```NumPy``` array,
- execution conditions, i.e., what we need to do to set up the testing environment to run our test, e.g., importing the ```daily_mean()``` function so we can use it (we only import the necessary library function we want to test within each test function),
- a testing procedure, e.g., running ```daily_mean()``` with our test_input array and using ```npt.assert_array_equal()``` to test its validity,
- expected outputs, e.g., our test_result NumPy array that we test against,
- if using ```Pytest```, the letters 'test_' at the beginning of the function name.

_________________________________________________________

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's install ```Pytest```**:

```
$ pip3 install pytest
```

We can run ```Pytest``` in the CLI...

```
$ python -m pytest tests/test_models.py
```

... and get the following output:

```
======================================================== test session starts =========================================================
platform linux -- Python 3.10.8, pytest-7.3.2, pluggy-1.2.0
rootdir: /workspaces/python-intermediate-inflammation
collected 2 items

tests/test_models.py ..                                                                                                        [100%]

========================================================= 2 passed in 1.06s ==========================================================
```

We can also run single test functions in our ```test_models.py``` file. To do that, we need to configure our testing setup by clicking on the testing icon, choosing ```Configure Python Tests```, then ```Pytest```, and then the folder the tests are in.
- You will now be able to see a green arrow to the right of any test function. You can click on it, and the test will be run.

<span style="color:green">**! TASK 1 !**</span> **Write a new test case that tests the ```daily_max()``` function**, adding it to ```test/test_models.py```. Also regenerate your ```requirements.txt``` file, commit your changes, and merge the ```test-suite``` branch into the ```develop``` branch. (5-10 min)
- You could choose to write your functions very similarly to ```daily_mean()```, defining input and expected output variables followed by the equality assertion.
- Use test cases that are suitably different.
- Once added, run all the tests again with ```python -m pytest tests/test_models.py```, and watch your new tests pass.

Solutions will be shown in this [hackmd.io file](https://hackmd.io/@nadinespy/ryBKNMLvn).
(Don't look at it before you've given it a try yourself.)

### 2.2 Scaling up unit tests

So far, we have written a separate test function for each test case (e.g., for zeros and for positive integers). Writing a separate function for every different input is quite inefficient - that's where **test parameterisation** comes in handy. Instead of writing a separate function for each different test, we can parameterise the tests with multiple test inputs, e.g., in ```tests/test_models.py```, we can rewrite ```test_daily_mean_zeros()``` from above and ```test_daily_mean_integers()``` from the solutions hackmd.io file into a single test function:

```python=
@pytest.mark.parametrize(
    "test, expected",
    [
        ([ [0, 0], [0, 0], [0, 0] ], [0, 0]),
        ([ [1, 2], [3, 4], [5, 6] ], [3, 4]),
    ])
def test_daily_mean(test, expected):
    """Test mean function works for array of zeroes and positive integers."""
    from inflammation.models import daily_mean
    npt.assert_array_equal(daily_mean(np.array(test)), np.array(expected))
```

- We need to provide input and output names - e.g., ```test``` for inputs, and ```expected``` for outputs - as well as the inputs and outputs themselves that correspond to these names. Each row within the square brackets following the ```"test, expected"``` argument corresponds to one test case. Let's look at the first row:
    - ```[ [0, 0], [0, 0], [0, 0] ]``` would be the input, corresponding to the input name ```test```,
    - ```[0, 0]``` would be the output, corresponding to the output name ```expected```.
- ```parametrize()``` is a [**Python decorator**](https://www.programiz.com/python-programming/decorator): a Python decorator is a function that takes as an input a function, adds some functionality to it, and then returns it (more about this in the section on **functional programming**).
    - ```parametrize()``` is a decorator in that it takes as an input the respective test function, and adds functionality to it by specifying multiple input and expected output test cases, calling the function over each of these inputs automatically when the test is run.

<span style="color:green">**! TASK 2 !**</span> **Rewrite your test functions for ```daily_max()``` using test parameterisation.** (5-10 min)
- Make sure you're back on the ```test-suite``` branch.
- Once added, run all the tests again with ```python -m pytest tests/test_models.py```, and watch your new tests pass.
- Commit your changes, and merge the ```test-suite``` branch into the ```develop``` branch.

Solutions will be shown in this [hackmd.io file](https://hackmd.io/@nadinespy/ryBKNMLvn).

### 2.3 Debugging code & code coverage

#### Debugging code

We can find problems in our code conveniently in VSCode using **breakpoints** (points at which we want code execution to stop) and our test functions.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

- Let's choose a function we want to debug, e.g., ```daily_max()```, and set a breakpoint somewhere within that function by left-clicking the space to the left of the line numbers,
- we then click on the testing icon in the VSCode IDE, look for the respective test function, e.g., ```test_daily_max()```, then choose ```Debug Test```.
- We can now see the local and global variables in the upper left space of the IDE, and play around with those in the ```DEBUG CONSOLE``` to check whether our function does what it's supposed to do, e.g., run ```np.max(data, axis=0)``` and see whether it gives the expected output (i.e., ```array([0, 0])```).
- We can click on the red rectangle at the top to stop the debugging process.

#### Code coverage

While Pytest is an indispensable tool to speed up testing, it can't help us decide *what* to test and *how many* tests to run. As a heuristic, we should try to come up with tests that cover
- as many functions as possible, with test cases as different from each other as possible,
- rather than, let's say, an endless number of rather redundant test cases for the same function.

This ensures a high degree of **code coverage**. A Python package called ```pytest-cov```, a plugin for Pytest, gives you exactly this - the degree to which your code is covered by tests.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's install ```pytest-cov``` and assess code coverage**:

```bash
$ pip3 install pytest-cov
$ python -m pytest --cov=inflammation.models tests/test_models.py
```

- ```--cov``` is an additional named argument to specify the code that is to be analysed for test coverage.
- We also specify the file that contains the tests for the code to be analysed.

Output:

```bash
================================================= test session starts =================================================
platform linux -- Python 3.10.8, pytest-7.3.2, pluggy-1.2.0
rootdir: /workspaces/python-intermediate-inflammation
plugins: cov-4.1.0
collected 7 items

tests/test_models.py .......                                                                                     [100%]

---------- coverage: platform linux, python 3.10.8-final-0 -----------
Name                     Stmts   Miss  Cover
--------------------------------------------
inflammation/models.py       9      2    78%
--------------------------------------------
TOTAL                        9      2    78%

================================================== 7 passed in 1.25s ==================================================
```

- 78% of the statements in ```inflammation.models``` are tested. To see which ones have not yet been tested, we can use the following line in the terminal:

```bash
python -m pytest --cov=inflammation.models --cov-report term-missing tests/test_models.py
```

Output:

```bash
================================================= test session starts =================================================
platform linux -- Python 3.10.8, pytest-7.3.2, pluggy-1.2.0
rootdir: /workspaces/python-intermediate-inflammation
plugins: cov-4.1.0
collected 7 items

tests/test_models.py .......                                                                                     [100%]

---------- coverage: platform linux, python 3.10.8-final-0 -----------
Name                     Stmts   Miss  Cover   Missing
------------------------------------------------------
inflammation/models.py       9      2    78%   18, 32
------------------------------------------------------
TOTAL                        9      2    78%

================================================== 7 passed in 0.29s ==================================================
```

- We'd need to look at lines 18 and 32 to check for the yet untested code.
Finally, let's regenerate ```requirements.txt``` (which now includes ```pytest-cov```), commit our changes, and merge the ```test-suite``` branch into ```develop```:

```bash
$ pip3 freeze > requirements.txt
$ git status
$ git add ./
$ git commit -m "Add coverage support"
$ git checkout develop
$ git merge test-suite
```

**What is Test Driven Development?** In test-driven development, we first write the tests, and then the code, i.e., the thinking process goes from
- defining the feature we want to implement, to writing the tests, to writing only as much code as needed to pass the tests,
- rather than defining the feature, writing the code, and then writing the tests.

This way, the set of tests acts like a specification of what the code does. The main advantages are:
- writing tests can't be avoided or de-prioritized,
- we may get a better idea of how our code will be used before writing it,
- we may refrain from adding things to our code that eventually turn out to be unnecessary anyway.

#### Final notes on testing

- A complex program requires a much higher investment in testing than a simple one --> find a good trade-off between code complexity and test coverage.
- Tests cannot find every bug that may exist in the code - manual testing will remain a crucial component.
- If using data, try to use as many different datasets as possible to test your code, thereby increasing confidence that it does the correct thing.
- Most software projects increase in complexity as they develop - using automated testing can save us a lot of time, particularly in the long term.

### 2.4 Continuous Integration

If we're collaborating on a software project with multiple people who push a lot of changes to one of the major repositories, we'd need to constantly pull down their changes to our local machines, and run our tests with the newly pulled-down code - this would result in a lot of back and forth, slowing us down quite a bit. That's where **continuous integration** (CI) comes in handy:
- It is the practice of merging all developers' working copies into a shared main working copy several times a day. When a new change is committed to a repository, CI clones the repository, builds it if necessary, and runs any tests. Once complete, it presents a report to let you see what happened.
- It thereby gives *early* hints as to whether there are important incompatibilities between a given working copy and the shared main working copy.
- It also allows us to test whether our code works on different target user platforms.

There are many CI infrastructures and services. We'll be looking at [GitHub Actions](https://github.com/features/actions) - which, unsurprisingly, is available as part of GitHub.

#### GitHub Actions

```YAML``` (a recursive acronym which stands for "YAML Ain't Markup Language") is a text format used by GitHub Actions workflow files. ```YAML``` files use
- key-value pairs (values can also be arrays), e.g.,

```
name: Kilimanjaro
height_metres: 5892
# a scalar value...
first_scaled_by: Hans Meyer
# ...or, alternatively, an array of values
first_scaled_by:
  - Hans Meyer
  - Ludwig Purtscheller
```

- maps, which allow us to define nested, hierarchical data, e.g.,

```
height:
  value: 5892
  unit: metres
  measured:
    year: 2008
    by: Kilimanjaro 2008 Precise Height Measurement Expedition
```

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

**Let's set up CI using GitHub Actions**: with a GitHub repository, there's a way we can set up CI to run our tests automatically when we commit changes. To do this, we need to add a new file in a particular directory of our repository (make sure you're on the ```test-suite``` branch).
Let's create a new directory ```.github/workflows```, which is used specifically for GitHub Actions, as well as a new file inside it called ```main.yml```:

```
$ mkdir -p .github/workflows
$ vim .github/workflows/main.yml
```

In ```main.yml```, we'll write the following:

```yaml=
name: CI

# We can specify which GitHub events will trigger a CI build
on: push

jobs:

  build:

    # we can also specify the OS to run tests on
    runs-on: ubuntu-latest

    # a job is a seq of steps
    steps:

    # Next we need to check out our repository, and set up Python
    # A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
    - name: Checkout repository
      uses: actions/checkout@v2

    - name: Set up Python 3.9
      uses: actions/setup-python@v2
      with:
        python-version: "3.9"

    - name: Install Python dependencies
      run: |
        python3 -m pip install --upgrade pip
        pip3 install -r requirements.txt

    - name: Test with PyTest
      run: |
        python -m pytest --cov=inflammation.models tests/test_models.py
```

- ```name: CI```: the name of our workflow.
- ```on: push```: indication that we want this workflow to run when we push commits to our repository.
- ```jobs: build:```: the workflow itself is made of a single job named ```build```, but could contain any number of jobs after this one, each of which would run in parallel.
- ```runs-on: ubuntu-latest```: statement about which operating system we want to use, in this case just Ubuntu.
- ```steps:```: the steps that our job will undertake in turn, to 1) set up the job's environment (think of it as a freshly installed machine, albeit virtual, with very little installed on it) and 2) run our tests. Each step has a name (which you can choose to your liking) and a way to be executed (as specified by ```uses```/```run```).
    - ```name: Checkout repository```: uses a GitHub Action called ```checkout``` to fetch our repository into the job's environment,
    - ```name: Set up Python 3.9```: here, we use the ```setup-python``` Action, indicating that we want Python version 3.9,
    - ```name: Install Python dependencies```: install the latest version of ```pip``` and our dependencies: it's good practice to upgrade the version of ```pip``` that is present first, then use pip to install the package dependencies from ```requirements.txt```,
    - ```name: Test with PyTest```: finally, we let Pytest run our tests in ```tests/test_models.py```, including code coverage.

#### Scaling up testing using build matrices

To address whether our code works on different target user platforms (e.g., Ubuntu, Mac OS, or Windows), with different Python installations (e.g., 3.8, 3.9 or 3.10), we can use a feature called **build matrices**. Setting up and running our tests across all these platforms and program versions by hand would take *a lot* of time - that's where a build matrix comes in handy.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

- Using a build matrix inside our ```main.yml```, we can specify environments (such as operating systems) and parameters (such as Python versions), and new jobs will be created that run our tests for each permutation of these.
- We first define a ```strategy``` as a ```matrix``` of operating systems and Python versions within ```build```.
We then use ```matrix.os``` and ```matrix.python-version``` to reference these configuration possibilities in the job:

```yaml=
name: CI

# We can specify which GitHub events will trigger a CI build
on: push

# now define a single job 'build' (but could define more)
jobs:

  build:

    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.8", "3.9", "3.10"]

    runs-on: ${{ matrix.os }}

    # a job is a seq of steps
    steps:

    # Next we need to check out our repository, and set up Python
    # A 'name' is just an optional label shown in the log - helpful to clarify progress - and can be anything
    - name: Checkout repository
      uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}

    # ...the remaining steps (installing dependencies, running Pytest)
    # stay the same as before
```

Let's add the new folder/file to the local repository and merge with ```develop```:

```
$ git add .github
$ git commit -m "Add GitHub Actions configuration & build matrix for os and Python version"
$ git checkout develop
$ git merge test-suite
```

Whenever you push your changes to a *remote* repository, GitHub will run CI as specified by ```main.yml```. You can check its status on the website of your remote repository under ```Actions```. For each push, you'll get a report about which of the steps have been successfully/unsuccessfully taken.

![](https://hackmd.io/_uploads/rkhzMi4_n.png)
![](https://hackmd.io/_uploads/SkCrMsN_2.png)
![](https://hackmd.io/_uploads/HJXwMjNOh.png)
![](https://hackmd.io/_uploads/S1etGoEu3.png)
![](https://hackmd.io/_uploads/H1dYfiE_2.png)

You may also look into these resources on unit testing, scaling it up, and continuous integration:
- [An introduction to unit testing](https://software.ac.uk/blog/2021-10-22-introduction-unit-testing)
- [Scaling up unit testing using parameterisation](https://software.ac.uk/blog/2021-12-15-scaling-unit-testing-using-parameterisation)
- [Automating unit testing with Continuous Integration](https://software.ac.uk/blog/2022-05-23-automating-unit-testing-continuous-integration)

## 3. Software design

Different things can be meant by the term "software design":
- **algorithm design** - what methods are we going to use in our software to meet the software's requirements?
- **software architecture** - what components will it have and how will they cooperate/interact?
- **system architecture** - what other things will this software have to interact with and how will it do this?
    - getting software and particularly system architecture right requires addressing technical and other organisational challenges/requirements *in conjunction* - an interesting problem space!
- **UI/UX (user interface / user experience)** - how will users interact with the software?

**Design patterns** are typical solutions to commonly occurring problems in software design (from any of the domains/levels mentioned above). From [Refactoring Guru](https://refactoring.guru/design-patterns/what-is-pattern):
- "You can't just find a pattern and copy it into your program, the way you can with off-the-shelf functions or libraries. The pattern is not a specific piece of code, but a general concept for solving a particular problem. You can follow the pattern details and implement a solution that suits the realities of your own program.
- Patterns are often confused with algorithms, because both concepts describe typical solutions to some known problems. [...] An analogy to an algorithm is a cooking recipe: both have clear steps to achieve a goal.
## 3. Software design

Different things can be meant by the term "software design":
- **algorithm design** - what methods are we going to use in our software to meet the software's requirements?
- **software architecture** - what components will it have and how will they cooperate/interact?
- **system architecture** - what other things will this software have to interact with and how will it do this?
    - getting software and particularly system architecture right requires addressing technical and organisational challenges/requirements *in conjunction* - an interesting problem space!
- **UI/UX (user interface / user experience)** - how will users interact with the software?

**Design patterns** are typical solutions to commonly occurring problems in software design (from any of the domains/levels mentioned above). From [Refactoring Guru](https://refactoring.guru/design-patterns/what-is-pattern):
- "You can’t just find a pattern and copy it into your program, the way you can with off-the-shelf functions or libraries. The pattern is not a specific piece of code, but a general concept for solving a particular problem. You can follow the pattern details and implement a solution that suits the realities of your own program.
- Patterns are often confused with algorithms, because both concepts describe typical solutions to some known problems. [...] An analogy to an algorithm is a cooking recipe: both have clear steps to achieve a goal. On the other hand, a pattern is more like a blueprint: you can see what the result and its features are, but the exact order of implementation is up to you."

**Programming paradigms** such as **object-oriented** or **functional programming** (we'll get to those in a minute!) are not so straightforward to allocate w. r. t. the different facets of software design mentioned above: a programming paradigm represents its own way of thinking about and structuring code, with pros and cons when used to solve particular types of problems.

**Technical debt**: If we don't follow best practices around code, including addressing design questions, we may build up too much technical debt - the cost of refactoring code due to having chosen a quick-and-dirty solution instead of a better approach that would have initially taken longer.
- It is normal to accumulate technical debt in a software or code-based project to some degree; however, it can (and often does) go overboard: bad, but quick/easy solutions at the start may make the software too messy and too difficult to understand and maintain, thereby hampering its further development.
- We want to write code such that we are able to respond well to changing requirements in the future - because requirements will change for sure (as in life generally, the only constant is *change*), so we want to design our software to be easily *modifiable* and *extensible*.
- There's of course a trade-off/tension between the time available and the quality of our work: How much effort should we spend on designing our code properly and using good development practices? Look for the solution in this [XKCD comic](https://xkcd.com/844/):

![](https://hackmd.io/_uploads/B1MxYnYu3.png)

### 3.1 Programming paradigms

There are two major families that we can group the common programming paradigms into: **Imperative** and **Declarative**.

- **Imperative programming** prescribes how a program operates *step by step* via a set of explicit instructions.
    - While it executes those steps, it may also produce so-called **side effects** - modifications of some state variable values outside the local environment, or, in other words, the production of observable effects other than the program's primary effect of returning a value to the invoker of the operation.
    - **Procedural programming** falls under imperative programming and is the programming paradigm you're probably most familiar with: it uses lists of instructions that are executed one after the other starting from the top. In this way, code is grouped into *procedures*, or, as we normally call them, *functions* performing a single task, with exactly one entry and one exit point.
    - **Object-oriented programming** is often classified as an extension of the Imperative family of programming paradigms (with the extra feature being the objects one is dealing with), but opinions differ.
        - In object-oriented programming, we represent the data and the things we want to do with it as **objects** which have specific **properties** and **behaviours**.
        - As an example, if we’re writing a simulation for our physics research, we’re probably going to need to represent atoms. Any atom can be characterized by its mass and electric charge, so we can build an object structure that includes mass and electric charge as properties. We can also specify what to do with those properties, e.g., we might want to add two masses of atoms. Moreover, multiple atoms can make up a molecule, which may be modelled as a separate object.
In that case, we can also specify what the relationship between atoms and molecules is using object-oriented programming.
- **Declarative programming** prescribes what data processing should happen, i.e., *what* the outcome is supposed to be rather than *how* it is achieved (as in imperative programming). In other words, a declarative program expresses the logic of a computation in terms of what should be accomplished rather than in terms of its control flow as an explicit sequence of steps.
    - **Functional programming** falls under declarative programming:
        - It is illuminative of the distinction between *code* and *data*, as in functional programming, a function can accept and transform other functions - code *is* data.
        - Side effects are avoided wherever possible.
        - Functional programming is very advantageous w. r. t. **Big Data**, where we can’t move the data around easily and, instead, aim to send our code to where the data is.
        - It's also advantageous w. r. t. running operations in parallel, as each operation is guaranteed *not* to interact with other operations.
        - However, within the research context and apart from Big Data, functional programming will rarely be clearly advantageous - but it's still useful/interesting for you to know about it.

We will look into two major paradigms from the imperative and declarative families that may be useful to you - **functional programming** and **object-oriented programming**.
- Most modern languages can be used with multiple paradigms, and a single program often uses several.
- Python is a multi-paradigm and multi-purpose programming language. Procedural, object-oriented and functional programming all work well. However, as all its core data types (strings, integers, floats, booleans, lists, sets, tuples, dictionaries) as well as functions, modules and classes are objects, it does naturally lend itself to an object-oriented approach.

### 3.2 Object-oriented programming

In object-oriented programming, objects encapsulate data in the form of attributes and code in the form of methods that manipulate the objects’ attributes and define how objects can behave (in interaction with each other).

A class is a template for a structure and a set of permissible behaviors that we want our data to comply with; thus, each time we create some data using a class, we can be certain that it has the same structure.

If you know about Python lists and dictionaries, you may recognize that they behave similarly to how we may define a class ourselves:
- they each hold some data (attributes),
- they provide some methods that describe how the data is supposed to behave (e.g., lists can be appended to, indexed, sliced, and in dictionaries, key-value pairs can be added, etc.)

**Encapsulating data**

Let's have a look at a simple class:

```python=
class Patient:
    def __init__(self, name):
        self.name = name
        self.observations = []

Alice = Patient('Alice')
print(Alice.name)
```

Output:
```python
Alice
```

- Inside the class, we first define ```__init__``` - the initialiser method, which sets up the initial values and structure of the data inside a new instance of the class. We call the ```__init__``` method every time we create a new instance of the class, as in ```Patient('Alice')```. The argument ```self``` refers to the instance on which we are calling the method and gets filled in automatically by Python whenever we instantiate a new class instance.
- We encapsulate the patient’s name and a list of inflammation observations as data/attributes, either by providing values for those as arguments when creating a new class instance, or by setting those values in the initialiser method.
- In the example, we set the patient’s ```name``` attribute to the value provided when creating the instance (here ```'Alice'```), and create an (initially empty) list of inflammation observations for the patient.
- We can access the encapsulated data by calling the attribute alongside the class instance using the dot (as in ```Alice.name```).

**Encapsulating behavior**

Let's add a method to the above class which operates on the data that the class contains: adding a new observation to a Patient instance.

```python=
class Patient:
    """A patient in an inflammation study."""
    def __init__(self, name):
        self.name = name
        self.observations = []

    def add_observation(self, value, day=None):
        if day is None:
            try:
                day = self.observations[-1]['day'] + 1
            except IndexError:
                day = 0

        new_observation = {
            'day': day,
            'value': value,
        }

        self.observations.append(new_observation)
        return new_observation

Alice = Patient('Alice')
print(Alice)

observation = Alice.add_observation(3)
print(observation)
print(Alice.observations)
```

Output:
```python
<__main__.Patient object at 0x7f67f424c190>
{'day': 0, 'value': 3}
[{'day': 0, 'value': 3}]
```

- Methods on classes are the same as normal functions, except that they live inside a class and have an extra first parameter ```self``` (using this name is not strictly necessary, but is a very strong convention). Similar to the initialiser method, when we call a method on an object, the value of ```self``` is automatically set to this object - hence the name.
- We can use the encapsulated method by calling it alongside the class instance using the dot (as in ```Alice.add_observation(3)```).

**Dunder Methods**

The ```__init__``` method begins and ends with a double underscore - it is a dunder method. These dunder methods (also called magic methods) are not meant to be invoked directly by you; the invocation happens internally from the class on a certain action. Built-in Python classes such as ```int``` define many magic methods.
- When we called ```print(Alice)```, it returned something like ```<__main__.Patient object at 0x7f67f424c190>```, which is the string representation of the Alice object. Functions like ```print()``` or ```str()``` use ```__str__()```.
- However, we can override the ```__str__``` method within our class to display the object's name instead of the object's string representation.

```python=
class Patient:
    """A patient in an inflammation study."""
    def __init__(self, name):
        self.name = name
        self.observations = []

    def add_observation(self, value, day=None):
        if day is None:
            try:
                day = self.observations[-1]['day'] + 1
            except IndexError:
                day = 0

        new_observation = {
            'day': day,
            'value': value,
        }

        self.observations.append(new_observation)
        return new_observation

    def __str__(self):
        return self.name

Alice = Patient('Alice')
print(Alice)
```

Output:
```python
Alice
```

**Relationships between classes**

There are two fundamental types of object characteristics, which also denote the relationships among classes:
- ownership - x *has* a y - this is composition,
- identity - x *is* a y - this is inheritance.

**Composition**

In object-oriented programming, we can make things components of other things, e.g., we may want to say that a doctor *has* patients or that a patient *has* observations.
In the way we had written our class so far, a patient already has observations - which is a case of composition. Let's separate the two, make a separate Observation class, and make use of it in the Patient class.

```python=
class Observation:
    def __init__(self, day, value):
        self.day = day
        self.value = value

    def __str__(self):
        return str(self.value)

class Patient:
    """A patient in an inflammation study."""
    def __init__(self, name):
        self.name = name
        self.observations = []

    def add_observation(self, value, day=None):
        if day is None:
            try:
                day = self.observations[-1].day + 1
            except IndexError:
                day = 0

        new_observation = Observation(day, value)

        self.observations.append(new_observation)
        return new_observation

    def __str__(self):
        return self.name

Alice = Patient('Alice')
obs = Alice.add_observation(3, 3)
print(obs)
```

Output:
```python
3
```

**Inheritance**

Inheritance is about data and behaviour that two or more classes share: if class X inherits from (is a) class Y, we say that Y is the *superclass* or *parent class* of X, or X is a *subclass* of Y - X gets all attributes and methods of Y.

If we want to extend the previous example to also manage people who aren’t patients, we can add another class, ```Person```. But Person will share some data and behaviour with Patient - in this case, both have a name and show that name when you print them. Since we expect all patients to be people (hopefully!), it makes sense to implement the behaviour in Person and then reuse it in Patient.

To write our class in Python, we use the ```class``` keyword, the name of the class, and then a block of the functions that belong to it. If the class inherits from another class, we include the parent class name in brackets.

```python=
class Observation:
    def __init__(self, day, value):
        self.day = day
        self.value = value

    def __str__(self):
        return str(self.value)

class Person:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

class Patient(Person):
    """A patient in an inflammation study."""
    def __init__(self, name):
        super().__init__(name)
        self.observations = []

    def add_observation(self, value, day=None):
        if day is None:
            try:
                day = self.observations[-1].day + 1
            except IndexError:
                day = 0

        new_observation = Observation(day, value)

        self.observations.append(new_observation)
        return new_observation
```

- To make the Patient class inherit the Person class, we need to indicate the parent class in the class name (```class Patient(Person)```), as well as in the initialiser (```super().__init__(name)```).
- If we don’t define a new ``__init__`` method for our subclass, Python will look for one on the parent class and use it automatically. This is true of all methods - if we call a method which doesn’t exist directly on our class, Python will search for it among the parent classes.
- ```self.name = name``` in the Patient class becomes obsolete.

<span style="color:green">**! QUESTION 1 !</span> What outputs do you expect here?**

```python=
Alice = Patient('Alice')
print(Alice)

obs = Alice.add_observation(3)
print(obs)

Bob = Person('Bob')
print(Bob)

obs = Bob.add_observation(4)
print(obs)
```

**Final note**: When deciding how to implement a model of your particular system, you often have a choice of either composition or inheritance, where there is no obviously correct choice - multiple implementations may be equally good. (See more on that in [The Composition Over Inheritance Principle](https://python-patterns.guide/gang-of-four/composition-over-inheritance/).)
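To see both kinds of relationship side by side, here is a minimal sketch of our own (not from the original course), reusing the atom/molecule example from section 3.1 - the class and attribute names are made up for illustration:

```python=
# an Ion *is an* Atom (inheritance), a Molecule *has* atoms (composition)
class Atom:
    def __init__(self, mass, charge=0):
        self.mass = mass
        self.charge = charge

class Ion(Atom):
    """Inheritance: an Ion is an Atom with a non-zero charge."""
    def __init__(self, mass, charge):
        super().__init__(mass, charge)

class Molecule:
    """Composition: a Molecule holds a list of Atom objects."""
    def __init__(self, atoms):
        self.atoms = atoms

    @property
    def mass(self):
        # the molecule's mass is derived from its component atoms
        return sum(atom.mass for atom in self.atoms)

water = Molecule([Atom(16), Atom(1), Atom(1)])
print(water.mass)  # 18
```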
<span style="color:green">**! TASK 3 !</span> Write a Doctor class to hold the data representing a single doctor**:
- It should have a name attribute, as well as a list of patients that this doctor is responsible for.
- If you have the time, write corresponding tests in ```test_patient.py```.

### 3.3 Functional programming

In functional programming, programs apply and compose/chain functions. It is based on the mathematical definition of a function ```f()```, which does a transformation/mapping from input ```x``` to output ```f(x)```. Contrary to imperative paradigms, it does not entail a sequence of steps during which the state of the code is updated to reach a final desired state; it describes the transformations to be done without producing such side effects.

The following two code examples implement the calculation of a factorial in procedural and functional styles, respectively. The factorial of a number ```n``` (denoted by ```n!```) is calculated as the product of integer numbers from 1 to n.

**Procedural style factorial function**

```python=
def factorial(n):
    """Calculate the factorial of a given number.

    :param int n: The factorial to calculate
    :return: The resultant factorial
    """
    if n < 0:
        raise ValueError('Only use non-negative integers.')

    factorial = 1
    for i in range(1, n + 1): # iterate from 1 to n
        # save intermediate value to use in the next iteration
        factorial = factorial * i

    return factorial
```

- In the function, we have a list of instructions to change the state of the program (e.g., the variable ```factorial``` in the for loop) and advance towards the result.

**Functional style factorial function**

```python=
def factorial(n):
    """Calculate the factorial of a given number.

    :param int n: The factorial to calculate
    :return: The resultant factorial
    """
    if n < 0:
        raise ValueError('Only use non-negative integers.')

    if n == 0 or n == 1:
        return 1 # exit from recursion, prevents infinite loops
    else:
        return n * factorial(n-1) # recursive call to the same function
```

- In the function, we don't update any program states (as with the variable ```factorial``` in the above example), or modify data that exists outside the current function, including the input data (e.g., printing text, writing to a file, modifying the value of an input argument, or changing the value of a global variable).
- Functional computations only rely on the values that are provided as inputs to a function - this is also referred to as the **immutability of data**.
- Such functions do not create any **side effects**, i.e., they do not perform any action that affects anything other than the value they return.
- Functions without side effects that return the same data each time the same input arguments are provided are called **pure functions**.
- Rather than using iteration to repeat a series of steps as in procedural programming, functional programming typically uses recursion, i.e., a function calls/repeats itself until a particular condition is reached.
- **A bit of an aside:** the Fibonacci function is actually a good illustration of the trade-offs between different paradigms. You will find that the cost in time of a naively recursive (functional-style) implementation rises exponentially w.r.t. the value of `n`, while a procedural implementation runs faster. It is vital to consider your use case before choosing which kind of paradigm to use for your software.

<span style="color:green">**! QUESTION 2 !**</span> Which of these functions are pure?
```python=
def add_one(x):
    return x + 1

def say_hello(name):
    print('Hello', name)

def append_item_1(a_list, item):
    a_list += [item]
    return a_list

def append_item_2(a_list, item):
    result = a_list + [item]
    return result
```

**Benefits of pure functions**:
- **testability**: indicates how easy it is to test the function (usually meaning unit tests). Functions with side effects, or whose outputs may not always be the same given the same input, are more difficult to test.
- **composability**: refers to the ability to make a new function from a chain of other functions by using the output of one as the input to the next. This may be more difficult to do with functions with side effects and/or varying input-output behaviour.
- **parallelisability**: refers to the ability for operations to be performed at the same time (independently). If a function is pure and we handle lots of data, we can often improve performance by splitting data and distributing the computation across multiple processors.

As an example of composability, let's look at **Python decorators**: As we had seen in the episode on parametrising our unit tests, a decorator can take a function, modify/decorate it, then return the resulting function. This is possible because in Python, functions can be passed around as normal data. Here, we discuss decorators in more detail and learn how to write our own. Let’s look at the following code for ways to “decorate” functions.

```python=
# define function where additional functionality is to be added
def ordinary():
    print("I am an ordinary function")

# define decorator, or outer function for first function
def decorate(func):
    # define the inner function
    def inner():
        # add some additional behavior to original function
        print("I am a decorator")
        # call original function
        func()
    # return the inner function
    return inner

# decorate the ordinary function
decorated_func = decorate(ordinary)

# call the decorated function
decorated_func()
```

Output:
```
I am a decorator
I am an ordinary function
```

- ```ordinary()``` is to be decorated,
- ```decorate(func)``` is the function that decorates another function,
- calling ```decorate(ordinary)``` builds another function that adds functionality to ```ordinary()```.

Another way to use decorators is to add ```@decorate``` before the function to be decorated:

```python=
# define decorator, or outer function for first function
def decorate(func):
    # define the inner function
    def inner():
        # add some additional behavior to original function
        print("I am a decorator")
        # call original function
        func()
    # return the inner function
    return inner

# define function where additional functionality is to be added
@decorate
def ordinary():
    print("I am an ordinary function")

# call the decorated function
ordinary()
```

Output:
```
I am a decorator
I am an ordinary function
```

<span style="color:green">**! TASK 4 !</span> Write a decorator that measures the time taken to execute a particular function** using the ```time.process_time_ns()``` function.
- You need to import ```time```.
- To get a time stamp, you can simply write ```start = time.process_time_ns()```, and get another time stamp once the calculation in question is done using ```end = time.process_time_ns()```.
- Use the following function to measure its execution time:

```python=
def measure_me(n):
    total = 0
    for i in range(n):
        total += i * i

    return total
```
## 4. Writing software with - and for - others: workflows on GitHub, APIs, and code packages

### 4.1 GitHub pull requests

Pull requests let you tell others about changes you've pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch. **Code review** plays an essential role in this process.

#### Code review

Code review is one of the most important practices of collaborative software development that improves code quality and increases knowledge about the codebase across the team. Before contributions are merged into the main branch, code will need to be reviewed, e.g., by the maintainer(s) of the repository. Although the role of code review can't be overstated, we will not go into the details here, as it's better suited for self-study compared to other building blocks in research software engineering that we touch upon in this tutorial. See, e.g., a guide on code review from Kimmo Brunfeldt [here](https://www.swarmia.com/blog/a-complete-guide-to-code-reviews/?utm_term=code%20review&utm_campaign=Code+review+best+practices&utm_source=adwords&utm_medium=ppc&hsa_acc=6644081770&hsa_cam=14940336179&hsa_grp=131344939434&hsa_ad=552679672005&hsa_src=g&hsa_tgt=kwd-17740433&hsa_kw=code%20review&hsa_mt=b&hsa_net=adwords&hsa_ver=3&gclid=Cj0KCQiAw9qOBhC-ARIsAG-rdn7_nhMMyE7aeSzosRRqZ52vafBOyMrpL4Ypru0PHWK4Rl8QLIhkeA0aAsxqEALw_wcB).

#### Types of development models

The way you and your team provide contributions to the shared codebase depends on the type of development model you use in your project. Two commonly used models are the following:
- **fork and pull model**: folks fork an existing repository (to create their copy of the project linked to the source) and push changes to their personal fork. A contributor can therefore work independently on their own fork, as they do not need permissions on the source repository to push modifications to their own fork. The project maintainer can then pull the contributors' changes into the source repository on request and after a code review process.
    - One advantage of this model is that it makes it easy for new contributors to join a project, without upfront coordination with source project maintainers. It may be well suited for, e.g., external collaborators as opposed to, e.g., core team members.
- **shared repository model**: folks are granted push access to a single shared code repository, but feature branches for new developments are still created. This model is good for core contributors who may wish to have faster workflows in the testing and merging cycle.

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

<span style="color:green">**! TASK 5 !**</span> **Let us try to make a small PR adding our names to the `README.md`'s list of participants.** (5-10 min)

**1. Starting from the `main` branch, create a new branch `add-name`**:

```bash
git switch main          # make sure we start from the main branch
git switch -C add-name   # creates and switches to the new branch directly
```

**2. Edit the `README.md` file by adding your name under the "Participants" section.**
**3. Track the changes and commit them with**:

```bash
git add README.md
git commit -m "Added my name for PR exercise"
```

<span style="color:red">**Keep an eye out from here: Git might prompt you whether you would like to fork the original repository - if it does, accept.**</span> Here, GitHub Codespaces comes in very handy, as it will create a fork of the original project, since you probably don't have write permission to our original repository.

**4. Push your newly created branch to your own fork**:

```bash
git push -u origin add-name
```

**5. Create a PR to the original repository**

For the sake of simplicity, we create the PR in GitHub's web interface. If the previous steps were followed precisely, the `Code` tab of your fork of the project should give you the option to create a PR for your changes in `add-name` to the branch of your choice on the original (`upstream`) repository.

**6. From the maintainer's perspective**

Once a PR is received, we usually perform a code review. The "Files changed" tab in the PR's interface is a very useful tool to gauge the changes that a PR makes to its targeted branch. Once the code is reviewed, the maintainer can either request some changes, or proceed to merge the PR if it satisfies all the requirements. In our case, since the changes are quite minimal, we just proceed to merge and close the PR. Once that is done, your changes will be reflected in the corresponding branch of the project, or in our case, the `main` branch.

### 4.2 How can others use the programs you write: application programming interfaces (API)

We will now have a look at ```inflammation-analysis.py``` which, in our example, is the entry point of our simple application - users will need to call it within a CLI, alongside a set of arguments:

```bash
python3 inflammation-analysis.py data/inflammation-03.csv
```

How to use the application and which arguments to specify can be accessed via

```
python3 inflammation-analysis.py --help
```

```inflammation-analysis.py``` can be run in different ways - as an imported library, or as the top-level script, in which case the global dunder variable ```__name__``` will be set to ```"__main__"```.

#### Global variable ```__name__```

In ```inflammation-analysis.py```, we see the following code:

```python=
# import modules

def main():
    ...  # perform some actions

if __name__ == "__main__":
    # perform some actions before main()
    main()
```

```__name__``` is a special dunder variable which is set, along with a number of other special dunder variables, by the Python interpreter before the execution of any code in the source file. What value is given by the interpreter to ```__name__``` is determined by the way in which the file is loaded.

If you run the following command (i.e., run the file as a script), ```__name__``` will be equal to ```"__main__"```, and everything following after the if-statement will be executed:

```
$ python3 inflammation-analysis.py
```

If you instead import your file as a module (note that, because of the hyphen in the file name, this would in practice require renaming it to something like ```inflammation_analysis.py```), ```__name__``` will be set to the module name, i.e., ```__name__ = "inflammation_analysis"```.

In other words, the global variable ```__name__``` allows you to execute code when the file runs as a script, but not when it’s imported as a module. Python sets the global name of a module equal to ```__main__``` if the Python interpreter runs your code in the top-level code environment. “Top-level code” is the first user-specified Python module that starts running. It’s “top-level” because it imports all other modules that the program needs.
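As a quick illustration (our own toy example - the file name ```demo.py``` is made up), you can watch ```__name__``` change depending on how the file is loaded:

```python=
# demo.py - a toy script to inspect __name__
print(f"__name__ is set to: {__name__}")

def main():
    print("running main()")

if __name__ == "__main__":
    main()
```

Running ```python3 demo.py``` prints ```__name__ is set to: __main__``` followed by ```running main()```, whereas ```import demo``` from another script or an interpreter session prints ```__name__ is set to: demo``` and does not call ```main()```.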
#### Command-line options

To be able to run ```inflammation-analysis.py``` in the CLI, we need to enable Python to read command line arguments. The standard Python library for reading command line arguments passed to a script is ``argparse``. Let's look into ```inflammation-analysis.py``` again.

```python=
# we first initialise the argument parser class,
# passing an (optional) description of the program:
parser = argparse.ArgumentParser(
    description='A basic patient inflammation data system')

# we can now add the arguments that we want argparse
# to look out for; in our case, we only want to process
# the names of the file(s):
parser.add_argument(
    'infiles',
    nargs='+',
    help='Input CSV(s) containing inflammation series for each patient')

# we parse the arguments passed to the script:
args = parser.parse_args()
```

- We have defined what the argument will be called (``infiles``), the number of arguments to be expected (```nargs='+'```, where ```'+'``` indicates that there should be 1 or more arguments passed), and a help string for the user (``help='Input CSV(s) containing inflammation series for each patient'``).
- You can add as many arguments as you wish, and these can be either mandatory (as the one above) or optional.
- ```parser.parse_args()``` returns an object (here called ```args```) containing all the arguments requested. These can be accessed using the names that we have defined for each argument, e.g., ```args.infiles``` would return the filenames that were used as inputs.
- When we run, e.g., ```python3 inflammation-analysis.py data/inflammation-03.csv```, nothing visible will happen at this point: ```views.py``` uses ```matplotlib``` to display graphs, but our CLI only outputs text. We could, however, add another modality to ```views.py``` to generate output that can be shown directly in the CLI - see the sketch below.
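A minimal sketch of what such a text-based view could look like - this is our own illustration, not code from the project (the function name ```display_statistics``` and the assumed data format are made up):

```python=
# a hypothetical addition to views.py: a plain-text "view" for CLI output
def display_statistics(statistics):
    """Print per-day statistics as text instead of a matplotlib graph.

    :param statistics: dict mapping a statistic's name (e.g. 'mean')
        to a sequence of daily values
    """
    for name, values in statistics.items():
        print(name)
        for day, value in enumerate(values):
            print(f"  day {day}: {value}")

# usage sketch:
display_statistics({'mean': [0.0, 0.4, 1.1], 'max': [0, 1, 2]})
```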
### 4.3 Producing a code package

We will now look at how we can package software for release and distribution, using ```Poetry``` to manage our Python dependencies and produce a code package we can use with a Python package indexing service such as [PyPI](https://pypi.org/).

#### Preparing software for release

Here, we only marginally touch upon important factors to consider before publishing software, most of which have to do with documentation. Documentation is a foundational pillar in coding/writing software. While its significance can't be overstated, we omit this part in this tutorial, as it's better suited for self-study compared to other building blocks in research software engineering.

**Documentation**

Before releasing software for reuse, make sure you have
- documented your code sufficiently (see, e.g., this blog post: [What are best practices for research software documentation?](https://www.software.ac.uk/blog/2019-06-21-what-are-best-practices-research-software-documentation)),
- included essentials such as a README and a LICENSE file.

A README may include the following:
- **installation/deployment**: step-by-step instructions for setting up the software so it can be used,
- **basic usage**: step-by-step instructions that cover using the software to accomplish basic tasks,
- **contributing**: for those wishing to contribute to the software’s development, this is an opportunity to detail what kinds of contribution are sought and how to get involved,
- **contact information/getting help**: which may include things like key author email addresses, and links to mailing lists and other resources,
- **credits/acknowledgements**: where appropriate, be sure to credit those who have helped in the software’s development or inspired it,
- **citation**: particularly for academic software, it’s a very good idea to specify a reference to an appropriate academic publication so other academics can cite use of the software in their own publications and media. You can do this within a separate CITATION text file within the repository’s root directory and link to it from the README.

**Marking a software release**

There are different ways in which we can make a software release from our code in Git/on GitHub, one of which is **tagging**: we attach a human-readable label to a specific commit, e.g., "v1.0.0", and push the change to our remote repo:

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

```
$ git tag -a v1.0.0 -m "Version 1.0.0"
$ git push origin v1.0.0
```

#### Packaging up software

We will use Python's ```Poetry``` library, which we'll install in our virtual environment (make sure you're in the root directory when activating the virtual environment, and let's check afterwards that we installed ```Poetry``` within it):

<span style="color:green">**! FOLLOW ALONG IN YOUR CODESPACE !**</span>

```
$ source venv/bin/activate
$ pip3 install poetry
$ which poetry
```

Poetry uses a ```pyproject.toml``` file to describe the build system and requirements of the distributable package.
- In the context of software development, *build* is the process of “translating” source code files into executable binary code files that can be run directly.
- A *build system* is a collection of software tools that is used to facilitate the build process.

To create a ```pyproject.toml``` file for our code, we can use ```poetry init```, which will guide us through the most important settings (for each prompt, we either enter our data or accept the default). Below, you see the questions with the recommended responses, so do follow these (and use your own contact details).
- We’ve called our package “inflammation” instead of “inflammation-analysis” to match the name of our module package, so that ```Poetry``` can automatically find the code.

```bash
$ poetry init
```

Output:

```bash
This command will guide you through creating your pyproject.toml config.

Package name [example]: inflammation
Version [0.1.0]: 1.0.0
Description []: Analyse patient inflammation data
Author [None, n to skip]: Nadine Spychala <nadine.spychala@gmail.com>
License []: MIT
Compatible Python versions [^3.8]: ^3.8

Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no

Generated file

[tool.poetry]
name = "inflammation"
version = "1.0.0"
description = "Analyse patient inflammation data"
authors = ["Nadine Spychala <nadine.spychala@gmail.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

Do you confirm generation? (yes/no) [yes] yes
```

When we add a dependency using ```Poetry```, Poetry will add it to the list of dependencies in the ```pyproject.toml``` file, and automatically install the package into our virtual environment.
- There are two different types of dependency: *runtime dependencies* and *development dependencies*. The former are those dependencies that need to be installed for our code to run, like ```NumPy```. The latter are dependencies which are needed/essential in order to develop code, but not required to run it, e.g., ```pylint``` or ```pytest```.

```bash
$ poetry add matplotlib numpy
$ poetry add --dev pylint
$ poetry install
```

Let's build a distributable version of our software:

```bash
$ poetry build
```

This should produce two files for us in the ```dist``` directory, of which the most important one is the ```.whl``` or *wheel* file. This is the file that ```pip``` uses to distribute and install Python packages, so this is the file we’d need to share with other people who want to install our software. If we gave this wheel file to someone else, they could install it using ```pip```:

```bash
$ pip3 install dist/inflammation*.whl
```

If we need to publish an update, we just update the version number in the ```pyproject.toml``` file, then use ```Poetry``` to build and publish the new version - see the example below. Any re-publishing of the package, no matter how small the changes, needs to come with a new version number.
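A sketch of what such an update flow could look like, using Poetry's built-in version bumping (note that actually publishing to PyPI additionally requires an account and credentials there):

```bash
$ poetry version patch   # bumps the version in pyproject.toml, e.g., 1.0.0 -> 1.0.1
$ poetry build           # rebuild the distributable files in dist/
$ poetry publish         # upload the new version to PyPI (requires credentials)
```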
### 4.4 Personal experience in academic and professional research software engineering in deep learning

**Note to self**: general outline:
- [CleanRL](https://github.com/vwxyzjn/cleanrl): collaborating with other researchers on an open-source project.
    - Documentation: README.md and even a [documentation website](https://docs.cleanrl.dev/).
    - Tests: unit tests of the algorithms implemented, as well as coding style tests, integrated with GitHub CI.
    - PR templates and instructions for contributing, enforcing conformity to tests.
    - Making contributions myself by following the PR-based workflow.
    - Reviewing others' contributions (PRs) and merging them.
    - Following up on issues from users.
    - Publishing the library as easy-to-use packages.
    - In deep learning, publishing trained model weights, training and evaluation curves, etc. is critical for reproducibility (a neat tool to do so: [Weights & Biases (WandB)](https://wandb.ai)). These are useful to publish along with papers. Consider also publishing the scripts that generate the figures in your papers.
- Currently, as a researcher at [Araya Inc., Japan](https://research.araya.org/):
    - Internal processes do not necessarily require package publishing, but the ability of members of the same or different teams to collaborate on projects is critical. Be it software or more general projects, good documentation and a good onboarding process are key.
    - Multiple teams with different specialities and research projects still need to collaborate; members of different teams collaborate on software projects with a PR-based workflow. Some packages and modules are forked and shared internally too!

## Wrap-up

<span style="color:red">**WORK ON THIS**</span>

## Further resources

- [Research Software Engineering with Python - Building software that makes research possible](https://merely-useful.tech/py-rse/), see also a description of the book [here](https://carpentries.org/blog/2021/07/pyrse-book/).

## Original course

<span style="color:red">**WORK ON THIS**</span> Giving credit, what has been changed, and license website.

## Funding & acknowledgements

### Funding

I get support for this tutorial from my fellowship at the Software Sustainability Institute.

### Acknowledgements

<span style="color:red">**WORK ON THIS**</span>