In general, R scripts can be run just like any other kind of program on an HPC (high-performance computing) system such as the Compute Canada systems. However, there are a few peculiarities of using R that are useful to know about. This document compiles some helpful practices; it is aimed at people who are familiar with R but unfamiliar with HPC, or vice versa.

Some of these instructions will be specific to Compute Canada ca. 2024, and particularly to the Graham cluster.

I assume that you're somewhat familiar with high-performance computing (i.e. you've taken the Compute Canada orientation session and know how to use sbatch/squeue/etc. to work with the batch scheduler).

Below, "batch mode" means running R code from an R script rather than starting R and typing commands at the prompt (i.e. "interactive mode"); "on a worker" means running a batch-mode script via the SLURM scheduler (i.e. using sbatch) rather than in a terminal session on the head node. Commands to be run within R will use an R> prompt, those to be run in the shell will use $.

Getting started

Compute Canada reminders

While this document is not a replacement for taking the webinar and reading the basic documentation, here are some reminders:

  • sbatch is the basic command for submitting a job
  • you can specify all sorts of parameters (required memory, number of nodes, etc.); the only required parameter is the 'wall clock time' estimate
  • the parameters can either be specified on the command line (e.g. --time=02:00:00 for a two-hour limit) or in the batch file with #SBATCH specifications
  • sq (just your jobs) and squeue (all jobs) are the commands for inspecting the queue
  • below I provide a sample minimal batch file

loading modules

R is often missing from the set of programs that is available by default on HPC systems. Most HPC systems use the module command to make different programs, and different versions of those programs, available for your use.

  • If you try to run R in interactive mode on Graham, the system will pop up a long list of possible modules to load and ask you which one you want; the first choice, the default, will be the most recent version available. You can also load it manually with module load r at the command line. If you need a specific version of R, you can run module spider r to see a list of all available R modules/versions.
  • If you try to run R in batch mode without loading the module first, or if you try to run R on a worker, you'll get an error. To run R on a worker you should add module load r to your batch script (specify a version if you want to be careful about reproducibility).
  • Sometimes your scripts may require other modules as well; load them in the same way.

installing packages

  • In order to install your own packages you need to have created and specified a user-writable directory. If you are working interactively, the first time you try to install packages for a particular version of R, R will ask whether you want to create such a directory (yes) and where to put it (the default is ~/R/x86_64-pc-linux-gnu-library/<R-version>). (If you are in batch mode you'll get an error instead.)
  • It's probably easiest to install packages from the head node, either by running R> install.packages("<pkg>") in an interactive R session, or in batch mode by running an R script that defines and installs a list of packages, e.g.
    # you will need to specify a repository; this is a safe default value
    R> options(repos = c(CRAN = "https://cloud.r-project.org"))
    R> pkgs <- c("broom", "mvtnorm", <...>)
    R> install.packages(pkgs)
    
    It's generally OK to run short (<10 minute) jobs like this interactively on the head node.
  • To install Bioconductor packages
    • at present (27 Oct 2023) Bioconductor requires R version <= 4.3.0; Graham has R 4.3.1 (too new), and the next most recent version available is 4.2.2, so run module load r/4.2.2 at the command line.
    • follow the Bioconductor installation instructions in an R session:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.18")

then use BiocManager::install() to install any other (non-base) Bioconductor packages you need.
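For example (the package name here is just an illustration; substitute whichever Bioconductor packages your analysis actually needs):

## hypothetical example: install a Bioconductor package and its dependencies
BiocManager::install("Biostrings")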

  • another, newer option is to use the rspm package to install binary packages from the RStudio Package Manager repository (see e.g. the rspm README):
    • install.packages("rspm")
    • rspm::enable()
    • install.packages("rstan")
      This will be much faster for packages that have compiled code. However, you may run into trouble with binary incompatibilities (e.g. a system library is not installed, or is installed somewhere other than where R is looking for it), in which case you'll have to fall back to installing from source.
  • The reason for doing package installation on the head node is that worker nodes don't have network access, so you won't be able to download packages from CRAN there. If absolutely necessary you can work around this by downloading tarballs from CRAN (onto the head node, or onto your own machine and then copying them to SHARCnet), but this can be really annoying because you will also have to download tarballs for all of the dependencies of the package you want and install them in the right order; when you install directly from CRAN this all gets handled automatically. If you have downloaded a package tarball mypkg_0.1.5.tar.gz, use R> install.packages("mypkg_0.1.5.tar.gz", repos = NULL) from within R or $ R CMD INSTALL mypkg_0.1.5.tar.gz from the shell to install it (see the sketch after this list).
  • If you really do want to install packages via tarballs, we can probably put together some machinery using …
  • Packages typically need to be reinstalled for each x.y release of R (denoted by the second digit in the version number; e.g. reinstallation is needed when going from version 4.3.1 to 4.4.0 but not from 4.4.0 to 4.4.1)
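As a minimal sketch of the tarball route described above (the file names here are hypothetical): download the tarballs for the package and all of its dependencies, copy them over, and install them in dependency order, e.g. in a loop within R:

## install downloaded source tarballs without network access;
## list dependencies before the packages that need them
tarballs <- c("cli_3.6.2.tar.gz",     # hypothetical dependency
              "mypkg_0.1.5.tar.gz")   # hypothetical target package
for (tb in tarballs) {
    install.packages(tb, repos = NULL, type = "source")
}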

running scripts in batch mode

Given an R script stored in a .R file, there are a few ways to run it in batch mode (note: below I use $ to denote the shell prompt):

  • $ r <filename>: an improved batch-R version by Dirk Eddelbuettel. You can install it by installing the littler package from CRAN (see installing packages) and running
    $ mkdir ~/bin
    $ cd ~/bin
    $ ln -s ~/R/x86_64-pc-linux-gnu-library/4.1/littler/bin/r

    in the shell. (You may need to adjust the path name for your R version.)
  • $ Rscript <filename>: by default, output will be printed to the standard output (which will end up in your .log file); you can use > outputfile to redirect the output where you want
    • one weirdness of Rscript is that it does not load the methods package by default, which may occasionally surprise you; if your script directly or indirectly uses stuff from methods, you need to load it explicitly with library("methods")
  • $ R CMD BATCH <filename>: this is similar to Rscript but automatically sends output to <filename>.Rout
  • This StackOverflow question says that r > Rscript > R CMD BATCH (according to the author of r)

See also here for enabling echoing of R commands (if not using R CMD BATCH).
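One general R trick (not specific to Compute Canada): setting options(echo = TRUE) at the top of a script makes subsequent top-level commands get echoed to the output when the script is run with Rscript, similar to what R CMD BATCH does by default. A minimal sketch:

## put this at the top of your script so that Rscript echoes commands
options(echo = TRUE)
## ... rest of the script; both the commands and their output
## will now appear in the log
x <- rnorm(5)
summary(x)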

running RStudio from SHARCnet

If you want to use RStudio instead,

  • log into https://jupyterhub.sharcnet.ca with your Compute Canada username/password
  • click on the 'softwares' icon (left margin) and load the rstudio-server-... module
  • an RStudio icon will appear; click it!
  • this session does not have internet access, but it does see all of the files in your user space (including packages that you have installed locally)
  • You can run Jupyter notebooks, etc. too (I don't know if there is a way to run a Jupyter notebook with a Python kernel)
  • it might make sense to 'reserve' your session in advance (so you don't have to wait a few minutes for it to start up); I'm not yet sure how to do that, maybe via salloc? (Maybe this Princeton web page is useful?) I seem to recall at some point an option to reserve compute resources starting at a scheduled time, but I can't find it any more. There is a "Reservation" box on the JupyterLab startup screen, but it's not clear what you have to do in order to be able to use it (at present the only option I get is "None").

Multiple jobs

running R jobs via job array

  • A job array is the preferred method for submitting multiple batch jobs.
  • To use job arrays effectively with R scripts, you need to know how to use commandArgs() to read command-line arguments from within an R script. For example, if batch.R contains:
    R> cc <- commandArgs(trailingOnly = TRUE)
    R> intarg <- as.integer(cc[1])
    R> chararg <- cc[2]
    R> cat(sprintf("int arg = %d, char arg = %s\n", intarg, chararg))
    then running $ Rscript batch.R 1234 hello will produce
    int arg = 1234, char arg = hello

    Note that all command-line arguments are passed as character strings and must be converted to numeric as necessary. If you want fancier argument processing than base R provides (e.g. default argument values, named rather than positional arguments), see this Stack Overflow question for some options. (A sketch of reading the job-array index from within R appears after this list.)
  • The number of jobs submitted at any given time (not just via job arrays) is restricted by the SLURM configuration variable MaxSubmitJobs; on the Graham cluster it is 1000. You can check by running this on the head node:
    $ sacctmgr list assoc user=<user_name> format=user,assoc,maxjobs,maxsubmit
  • There is also a SLURM configuration variable MaxArraySize that restricts both the number of jobs in the array and the indices used. On Graham this is much larger than MaxSubmitJobs, so in practice the largest job array can still contain only 1000 jobs.
    The array indices are restricted to MaxArraySize-1, meaning that even if you use steps in your array indices (e.g. --array=0-20000:10000, which contains only 3 jobs), the job array will not run if the maximum index (here 20000) is larger than MaxArraySize-1. Run something like this on the head node to determine the configuration:
    $ scontrol show config | grep -E 'MaxArraySize'
    (slurm job array support)
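As an alternative (or complement) to commandArgs(), a job-array script can read the array index directly from the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch (the computation and the output file name are placeholders), assuming the submission script contains something like #SBATCH --array=1-100 and ends with Rscript batch.R:

## batch.R: one independent replicate per array task (sketch)
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
set.seed(task_id)                 # a different, reproducible seed per task
res <- mean(rnorm(1000))          # placeholder for the real computation
saveRDS(res, sprintf("res_%03d.rds", task_id))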

improvement to job arrays using the META package

With a little extra learning, the META package is a nice upgrade from plain job arrays, with additional capabilities; the webinar is easy to follow.

When choosing between SIMPLE mode and META mode, the META documentation gives general guidelines: META mode is best when the number of jobs is greater than MaxSubmitJobs allows and/or when the run time of individual computations (or cases) is less than 20 minutes. As with job arrays, there is a balance between the overall speed/wall-clock time of case completion and resource grabbiness/queue wait time. Some additional considerations:

  • Running in SIMPLE mode is similar to job arrays, with the advantage of the additional features that come with the package (e.g. resubmitting failed jobs, job-handling functions, an automatic post-processing script) and of not being restricted to using or interpreting SLURM_ARRAY_TASK_ID in the execution of the array. You can use any inputs for your program, and any number of programs, as long as you specify them in the table.dat file and the total number of cases is less than MaxSubmitJobs.
  • In META mode you package multiple cases into a single job/chunk, and the run time of individual cases can in general be quite short (minutes or even seconds) because you are running sbatch at the job level, not the individual-case level. Because of the resources required to load R on the nodes before the computation, it is recommended that individual cases take at least on the order of minutes (see the sketch after this list for the general idea of chunking).
  • If your R script generates files, then choosing many short cases (versus larger chunks that run longer) can overload the HPC file system by creating too many files at once, resulting in very slow jobs for other users as well. SHARCnet support recommends creating no more than 10 files per second across all farm jobs. There are also limits on the number of files you can store on Graham: https://docs.alliancecan.ca/wiki/Storage_and_file_management
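The following is a rough sketch of the chunking idea referred to above, written as a plain job-array script rather than with META's own machinery (the chunk size and toy computation are arbitrary): each job processes a block of cases and writes a single output file, so individual jobs run for minutes rather than seconds and far fewer files are created.

## process a block of cases per array task, one output file per block
chunk_size <- 50
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
cases <- seq((task_id - 1) * chunk_size + 1, task_id * chunk_size)
results <- lapply(cases, function(i) {
    set.seed(i)
    mean(rnorm(1000))                  # placeholder for the real computation
})
saveRDS(results, sprintf("chunk_%03d.rds", task_id))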

Parallel computing

levels of parallelization

  • threading: shared memory, lightweight. Base R doesn't do any multithreading. Multithreaded computations can be done in R (1) by making use of a multithreaded BLAS (linear algebra) library, or (2) by using a package that includes multithreaded computation (typically via the OpenMP system) in its C++ code (e.g. glmmTMB).
    • It may sometimes be necessary to suppress multithreaded computation (see the sketch after this list)
      • OpenMP is usually controlled by setting the shell environment variable OMP_NUM_THREADS (e.g. export OMP_NUM_THREADS=1), but there may be controls within an R package as well.
      • BLAS threading specifically can be controlled via RhpcBLASctl::blas_set_num_threads() (see here)
  • multicore/multiprocess: multicore means multiple cores on a single node/computer; multiprocess means multiple processes within a single node or across multiple nodes in a cluster. Parallelization at these levels can be implemented within R. There are a bunch of different packages/front ends to handle this; almost all ultimately rest on either the Rmpi package (see below) or the parallel package: foreach, doParallel, future, furrr, etc.
    • if using these tools (i.e. parallelization within R) you probably want to figure out the number of chunks N, then define a virtual cluster with N cores within R (e.g. R> parallel::makeCluster(N)) and set #SBATCH --ntasks=N in your shell submission script (and let the scheduler pick the number of CPUs etc.)
    • MPI-based: you probably don't want to use MPI unless you are doing 'fancy' parallel computations that require inter-process communication during the course of the job (you can still use MPI but it's a waste if you don't need the communication); requires more specialized cluster resources etc.
  • The most efficient way to handle distributed/embarrassingly parallel problems (where all tasks are independent of each other) is via the batch scheduler. The META package, in conjunction with job arrays, is recommended by SHARCnet support in these cases.
  • a useful primer on threading vs multicore
  • some tips/things to avoid when parallelizing (e.g. be aware of memory constraints, overparallelization)
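A minimal sketch of suppressing multithreading from within R, as mentioned in the list above (assumes the RhpcBLASctl package is installed; setting OMP_NUM_THREADS in the shell before starting R is generally the more reliable route, since some libraries only read it at startup):

## limit OpenMP and BLAS threading to one thread each
Sys.setenv(OMP_NUM_THREADS = "1")        # for OpenMP-based package code
RhpcBLASctl::blas_set_num_threads(1)     # for a multithreaded BLAS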

parallelization and SLURM

Determining the number of nodes/cores/processes to request from SLURM will depend on which R package is used for parallelization. The foreach package supports both multicore and multiprocessing parallelization; this example shows how to run both using foreach, including how to ensure that R and SLURM communicate via the shell script: https://docs.alliancecan.ca/wiki/R#Exploiting_parallelism_in_R

When using HPC you should not let the R package you are using detect and try to use the number of available cores; instead, always specify the number of cores to use explicitly.
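For example, here is a minimal sketch of taking the core count from SLURM instead of detecting it (this assumes the resources were requested with --cpus-per-task; if you follow the --ntasks approach above, use the SLURM_NTASKS variable instead):

## use the number of CPUs that SLURM actually allocated, not parallel::detectCores()
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
cl <- parallel::makeCluster(n_cores)
res <- parallel::parLapply(cl, 1:100, function(i) sqrt(i))  # placeholder workload
parallel::stopCluster(cl)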

When setting SLURM #SBATCH arguments, here are some helpful notes:

Code efficiency

R is intrinsically slower than many other languages. The efficiency loss is worth it for the language's many other advantages, but before using HPC resources you should make sure your R code is running as efficiently as possible (a small profiling sketch follows the list below).

  • a great talk from Noam Ross on speeding up R code (see the "find better packages" and "improving your code" sections in particular)
  • Using tidyverse "tibbles" (via the dplyr package) is often faster than operating on base-R data frames; the data.table package is even faster. If you need fast input, consider the vroom and arrow packages. The collapse package also seems useful.
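As a starting point for the efficiency check mentioned above, base R's system.time() and Rprof() are enough to find out where the time is going (the toy function here is just a placeholder):

## coarse timing plus sampling profiler, base R only
f <- function(n) { x <- rnorm(n); sum(sort(x)) }   # toy workload
system.time(f(1e6))                        # coarse timing

Rprof("profile.out")                       # start the sampling profiler
invisible(replicate(50, f(1e5)))
Rprof(NULL)                                # stop profiling
head(summaryRprof("profile.out")$by.self)  # functions taking the most time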

Miscellaneous useful (?) links

sample minimal batch file

#!/bin/bash
# bash script for submitting a job to the sharcnet Graham queue
# asks for minimal resources (1 node, 1 task per node, 500M) so that
#  it will get run quickly and we can see what's going on

#SBATCH --nodes=1               # number of nodes to use
#SBATCH --time=00-00:10:00         # time (DD-HH:MM:SS)
#SBATCH --job-name="A name"     # job name

#SBATCH --ntasks-per-node=1              # tasks per node
#SBATCH --mem=500M                       # memory per node
#SBATCH --output=sim-%j.log               # log file
#SBATCH --error=sim-%j.err                # error file
#SBATCH --mail-user=your_email@address    # who to email
#SBATCH --mail-type=FAIL                  # when to email
#SBATCH --account=def-bolker
module load r/4.3.1
## self-contained; -e means "R code follows ..."
Rscript -e "print('hello'); set.seed(101); rnorm(3)"

To do

  • mention rclone (as a tool for data movement; tangential to R?)
  • https://docs.alliancecan.ca/wiki/R#Exploiting_parallelism_in_R
  • talk to SHARCnet
  • checkpointing
  • replace 4.3.1 (e.g.) with $CURVERSION in instructions?
  • info about estimation: peakRAM etc.; scaling tests
  • change to first-level headers for ToC
  • installing packages that depend on system libraries (e.g. the units package, upstream of many graphics extension packages, has SystemRequirements: udunits-2 in its DESCRIPTION file; you would normally need to use apt-get install or yum install (see this SO question))
    • the normal procedure on Linux would be "just apt-get install whatever you need", but this depends on having root access
    • one possibility is to ask the SHARCnet admins if they can help (it might be worth checking to see if there is a module that provides the library you need) [for example, the ggVennDiagram package requires the udunits-2 system library; you can make this available via module load udunits (from the command line)]
    • harder but more general: set up a Singularity container; this is like Docker (and can take Docker files as input!), so it would provide you an environment within which you have root access; the Rocker project has instructions on setting up Singularity containers with Rocker images, including a sample SLURM script