General info
Do you use containers in your work and, if so, which containers?
Do you think containers help with reproducibility?
Do you use workflow management tools and, if so, which?
Or distributed computing abstractions such as Ray?
This is the fourth event in the Nordic RSE seminar series.
While SLURM itself provides tools for job orchestration such as job arrays, higher-level tools like Snakemake and Ray are cluster-agnostic and can either make use of SLURM or run on a laptop. To make Snakemake and Ray run within Singularity, I present singreqrun, which works by requesting that the host run programs on behalf of the container.
The talk doubles as an introduction to Snakemake and Ray. After some brief background on the main tools (Singularity, SLURM, Snakemake and Ray), we proceed to a shell code-along through several examples.
I end the talk by opening the floor for discussion. Is this a good approach? Can we improve upon it?
If you would like to code along in a pre-prepared CSC environment during the talk, email username frrobert in the domain student.jyu.fi at least 24 hours before the talk, with a CSC username if possible (e.g. if you are at a Finnish university) or from an institutional (ideally university) email address.
Singularity is a container platform for HPC. As well as addressing the security concerns of HPC administrators, its "convention over configuration" approach (e.g. binding the current working directory into the container by default) dovetails well with the needs of software development for HPC environments. In particular, it encourages writing software which can both run in a container in an HPC environment and be tested uncontainerised on a laptop for a faster hack-test loop, while interoperating well with the typical SLURM + networked file system design of modern HPC clusters.
While SLURM provides some relatively high-level tools for job orchestration, such as job arrays, tools like Snakemake and Ray are cluster-agnostic: they can make use of SLURM (via slurm-profile and yaspi) or run on a single laptop. However, SLURM connector plugins typically work by running SLURM utilities like squeue, which are only available on the host. While it is theoretically possible to bind host executables and libraries into the container, this tightly couples the library versions of the container to those of the host. Therefore, I present singreqrun, a shim for requesting, from within the container, that the host run programs on its behalf.
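To make the idea concrete, here is a minimal sketch of a request-run protocol of the kind singreqrun implements: the container drops a request file into a bind-mounted shared directory, and a host-side process picks it up, runs the program, and writes the result back. The function names (request_run, serve_one, read_result) and the JSON file format are illustrative assumptions, not the actual singreqrun interface.

```python
# Illustrative sketch of a host-run request shim (NOT the real
# singreqrun code). The "shared" directory stands in for a path
# bind-mounted into the container.
import json
import subprocess
from pathlib import Path


def request_run(shared: Path, req_id: str, argv: list) -> None:
    """Container side: ask the host to run `argv` on our behalf."""
    (shared / f"{req_id}.request").write_text(json.dumps({"argv": argv}))


def serve_one(shared: Path, req_id: str) -> None:
    """Host side: execute a pending request and record its output."""
    req = json.loads((shared / f"{req_id}.request").read_text())
    proc = subprocess.run(req["argv"], capture_output=True, text=True)
    out = {"returncode": proc.returncode, "stdout": proc.stdout}
    (shared / f"{req_id}.result").write_text(json.dumps(out))


def read_result(shared: Path, req_id: str) -> dict:
    """Container side: collect the result written by the host."""
    return json.loads((shared / f"{req_id}.result").read_text())


# Demo: round-trip one request through a temporary "shared" directory.
import tempfile

with tempfile.TemporaryDirectory() as d:
    shared_dir = Path(d)
    request_run(shared_dir, "r1", ["echo", "hello"])
    serve_one(shared_dir, "r1")
    result = read_result(shared_dir, "r1")
```

A real implementation also has to deal with waiting for the result asynchronously, streaming output, and propagating signals, which is where much of the complexity lies.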
The talk begins with a quick roll call of the players: Singularity, SLURM, HPC, Snakemake and Ray.
In a live shell session (with all Python code pre-prepared), I then demonstrate the ways in which singreqrun can be used:
- Snakemake for heterogeneous (mixture of CPU and GPU nodes) video corpus processing, portable across HPC clusters
- Snakemake for text corpus processing, including using extra Singularity containers for utilities
- Ray for hyperparameter search
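As a taste of the first example, a hypothetical Snakefile fragment (not taken from the talk; rule names, paths and commands are invented for illustration) shows how a rule can declare a GPU resource so that a SLURM profile routes it to GPU nodes while CPU-only rules run elsewhere:

```snakemake
VIDEOS = ["clip1", "clip2"]  # illustrative corpus

rule all:
    input:
        expand("features/{video}.npy", video=VIDEOS)

# CPU-only step: decode frames from each video.
rule extract_frames:
    input:
        "videos/{video}.mp4"
    output:
        directory("frames/{video}")
    shell:
        "ffmpeg -i {input} {output}/%06d.png"

# GPU step: the `gpu` resource lets a cluster profile request a GPU node.
rule extract_features:
    input:
        "frames/{video}"
    output:
        "features/{video}.npy"
    resources:
        gpu=1
    shell:
        "python extract_features.py {input} {output}"
```

Because resource requests are declared per rule rather than per cluster, the same Snakefile can run on different HPC clusters or on a laptop.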
I end the talk by asking for comments. In particular, is this the right direction or a hack too far? Are there better ways to combine general-purpose container orchestrators, Singularity and SLURM? The current implementation is quite hacky, more a proof-of-concept than a finished tool. If the idea is sound, how can we stabilise and improve upon this approach?
The talk expands on some ideas I give in a blog post: https://frankie.robertson.name/research/effective-cluster-computing/#use-monolithic