In general, R scripts can be run just like any other kind of program on an HPC (high-performance computing) system such as the Compute Canada systems. However, there are a few peculiarities to using R that are useful to know about. This document compiles some helpful practices; it is aimed at people who are familiar with R but unfamiliar with HPC, or vice versa.
Some of these instructions will be specific to Compute Canada ca. 2024, and particularly to the Graham cluster.
I assume that you're somewhat familiar with high-performance computing (i.e. you've taken the Compute Canada orientation session and know how to use sbatch/squeue/etc. to work with the batch scheduler).
Below, "batch mode" means running R code from an R script rather than starting R and typing commands at the prompt (i.e. "interactive mode"); "on a worker" means running a batch-mode script via the SLURM scheduler (i.e. using sbatch
) rather than in a terminal session on the head node. Commands to be run within R will use an R>
prompt, those to be run in the shell will use $
.
While this document is not a replacement for taking the webinar and reading the basic documentation, here are some reminders:

- sbatch is the basic command for submitting a job
- resources can be requested on the command line (e.g. --time=2hours) or in the batch file with #SBATCH specifications
- sq (just your jobs) and squeue (all jobs) are the commands for inspecting the queue

R is often missing from the set of programs that is available by default on HPC systems. Most HPC systems use the module command to make different programs, and different versions of those programs, available for your use.
If you try to start R in interactive mode on Graham without having loaded it first, the system will pop up a long list of possible modules to load and ask you which one you want. The first choice, the default, will be the most recent version available. Manually, you can run module load r at the command line. If you need a specific version of R you can run module spider r to see a list of all available R modules/versions. Add module load r to your batch script (specify a version if you want to be careful about reproducibility).

Packages you install go in a personal library in your home directory (e.g. ~/R/x86_64-pc-linux-gnu-library/<R-version>); the first time you install a package in an interactive session, R will ask whether to create this library. (If you are in batch mode you'll get an error.) You can install packages by running R> install.packages("<pkg>") in an interactive R session, or in batch mode by running an R script that defines and installs a long list of packages, e.g.
# you will need to specify a repository, this is a safe default value
R> options(repos = c(CRAN = "https://cloud.r-project.org"))
R> pkgs <- c("broom", "mvtnorm", <...>)
R> install.packages(pkgs)
Make sure you have loaded the version of R you want to install packages for first, e.g. $ module load r/4.2.2 at the command line.

If you need Bioconductor packages, run

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.18")

… then use BiocManager::install() to install any other (non-base) Bioconductor packages you need …
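For example (the package names here are arbitrary illustrations, not recommendations):

R> BiocManager::install(c("Biostrings", "DESeq2"))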
You can use the rspm package to install binary packages from the RStudio Package Manager repository (see the package's README), e.g.:
install.packages("rspm")
rspm::enable()
install.packages("rstan")
If you have downloaded a package source file such as mypkg_0.1.5.tar.gz, use R> install.packages("mypkg_0.1.5.tar.gz", repos=NULL) from within R, or $ R CMD INSTALL mypkg_0.1.5.tar.gz from the shell, to install it.

Given an R script stored in a .R file, there are a few ways to run it in batch mode (note, below I use $ to denote the shell prompt):
$ r <filename>: an improved batch-R version by Dirk Eddelbuettel. You can install it by installing the littler package from CRAN (see installing packages) and running
$ mkdir ~/bin
$ cd ~/bin
$ ln -s ~/R/x86_64-pc-linux-gnu-library/4.1/littler/bin/r
$ Rscript <filename>: by default, output will be printed to the standard output (which will end up in your .log file); you can use > outputfile to redirect the output where you want. One quirk of Rscript is that it does not load the methods package by default, which may occasionally surprise you: if your script directly or indirectly uses stuff from methods you need to load it explicitly with library("methods").
$ R CMD BATCH <filename>: this is similar to Rscript, but automatically sends output to <filename>.Rout.

Speed comparison: r > Rscript > R CMD BATCH (according to the author of r …).

See also here for enabling echoing of R commands (if not using R CMD BATCH).
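If you are running a script with Rscript and want the commands themselves echoed into the log, not just their output, one common approach (possibly what the linked advice describes) is to turn echoing on at the top of the script:

R> options(echo = TRUE)   # echo subsequent commands in a non-interactive session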
If you want to use RStudio instead:
- there are rstudio-server-... modules
- salloc? (Maybe this Princeton web page is useful?) I seem to recall at some point an option to reserve compute resources starting at a scheduled time, but can't find that any more … There is a "Reservation" box on the JupyterLab startup screen, but it's not clear what you have to do in order to be able to use it (at present the only option I get is "None")

You can use commandArgs() to read command-line arguments from within an R script. For example, if batch.R contains:
R> cc <- commandArgs(trailingOnly = TRUE)
R> intarg <- as.integer(cc[1])
R> chararg <- cc[2]
R> cat(sprintf("int arg = %d, char arg = %s\n", intarg, chararg))
$ Rscript batch.R 1234 hello
will produce
int arg = 1234, char arg = hello
If you are submitting many jobs (e.g. as a job array), note that there is a limit on the number of jobs you can have submitted at once, MaxSubmitJobs. For the Graham cluster, it is 1000. You can check by running this on the head node:

$ sacctmgr list assoc user=<user_name> format=user,assoc,maxjobs,maxsubmit

There is also a setting MaxArraySize that restricts both the number of jobs in the array and the indices used. On Graham this is much larger than MaxSubmitJobs, so the largest job array can only contain 1000 jobs. The largest allowable array index is MaxArraySize - 1, meaning that even if you use steps in your array indices (e.g. --array=0-20000:10000, where the number of jobs is only 3), the job array will not run, because the maximum index, 20000, is larger than MaxArraySize - 1. Run something like this on the head node to determine the configuration:

$ scontrol show config | grep -E 'MaxArraySize'

(see the SLURM documentation on job array support)
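For example, here is a minimal sketch of a job-array submission script; it assumes an R script like batch.R above that reads its arguments with commandArgs(), and the resource requests are placeholders:

#!/bin/bash
#SBATCH --array=1-100            # one array element per case (respect MaxSubmitJobs/MaxArraySize)
#SBATCH --time=00-01:00:00       # time per case (DD-HH:MM:SS)
#SBATCH --mem=1G                 # memory per case
#SBATCH --account=def-bolker
module load r/4.3.1
## pass the array index (and a second argument) to the R script, to be read with commandArgs()
Rscript batch.R "$SLURM_ARRAY_TASK_ID" hello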
With a little extra learning, the META package is a nice upgrade from job arrays, with additional capabilities. The webinar is easy to follow along with.
When choosing between SIMPLE mode and META mode, the META documentation gives the general guideline that META mode is best when the number of jobs is greater than allowed by MaxSubmitJobs and/or the run-time of individual computations (or cases) is less than 20 minutes. As with job arrays, there is a balance between the overall speed/wall-clock time of case completion and the resource grabbiness/queue wait time. These are some additional considerations:

- You do not need to manage SLURM_ARRAY_TASK_ID yourself in the execution of the array. You can use any inputs for your program, and any number of programs, as long as you specify them in the table.dat file and the total number of cases is less than MaxSubmitJobs.
- Resources are requested with sbatch at the job level and not the individual case level. Due to the resources required to load R on nodes prior to the computation, it is recommended that case length be at a minimum more on the order of minutes.

Some computations are parallelized at a low level by multithreading (e.g. via the OpenMP system) within C++ code (e.g. glmmTMB).
OpenMP multithreading is usually controlled by setting a shell environment variable (export OMP_NUM_THREADS=1), but there may be controls within an R package as well. Multithreaded linear algebra (BLAS) can be controlled with RhpcBLASctl::blas_set_num_threads() (see here); a short sketch of these controls follows below.

Explicit (process-level) parallelization within R uses either the Rmpi package (see below) or the parallel package and its extensions: foreach, doParallel, future, furrr, …
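To make the thread-control settings mentioned above concrete, here is a minimal sketch (whether these matter depends on which packages and which BLAS your R installation uses):

$ export OMP_NUM_THREADS=1                 # in the shell or submission script, before running R
R> RhpcBLASctl::blas_set_num_threads(1)    # within R: limit BLAS threads
R> RhpcBLASctl::omp_set_num_threads(1)     # within R: limit OpenMP threads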
A simple approach is to choose a number of worker processes N, define a virtual cluster with N cores within R (e.g. R> parallel::makeCluster(N)), and set #SBATCH --ntasks=N in your shell submission script (and let the scheduler pick the number of CPUs etc.).

Determining the number of nodes/cores/processes to request using SLURM will depend on which R package is used for parallelization. The foreach package supports both multi-core and multi-process parallelization. This is an example of how to run both using foreach, including how to ensure R and SLURM are communicating via the shell script: https://docs.alliancecan.ca/wiki/R#Exploiting_parallelism_in_R. When running on HPC you should not let the R package you are using detect and try to use the number of available cores; instead, always specify the number to use (see the sketch below).
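Here is a minimal sketch of that advice, assuming the submission script requests --ntasks (SLURM then sets the SLURM_NTASKS environment variable to match); the computation is a placeholder:

R> n <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "1"))  # number of tasks granted by SLURM
R> cl <- parallel::makeCluster(n)                            # start exactly that many workers
R> res <- parallel::parLapply(cl, 1:100, function(i) i^2)    # placeholder computation
R> parallel::stopCluster(cl)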
When setting SLURM #SBATCH arguments, here are some helpful notes:

- A task in SLURM is a process; a process uses one CPU core if it is single-threaded.
- Process-level allocation is controlled by --nodes, --ntasks, and --ntasks-per-node (--cpus-per-task is specific to multi-threading).
- Some helpful task allocation examples: requesting a whole node (e.g. --nodes=1 --ntasks-per-node=32 on the Graham cluster) has a scheduling advantage, but can be seen as abuse if this is not required.

R is intrinsically slower than many other languages. The efficiency loss is worth it for the language's many other advantages, but before using HPC resources you should make sure your R code is running as efficiently as possible.
Data-frame manipulation with the tidyverse (i.e., using the dplyr package) is often faster than operations on base-R data frames; the data.table package is even faster. If you need fast input, consider the vroom and arrow packages. The collapse package also seems useful.
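For example, a rough sketch of the faster-input options for a large CSV file (the file name is a placeholder):

R> dat <- read.csv("big_file.csv")           # base R, relatively slow for large files
R> dat <- data.table::fread("big_file.csv")  # data.table, usually much faster
R> dat <- vroom::vroom("big_file.csv")       # vroom, fast (and lazy) reading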
The Alliance's R documentation (linked above) also gives performance advice; note that (1) they discourage running many cases in a single for loop (they prefer job arrays), and (2) much of the document focuses on general performance tips for high-performance computing in R (using vectorization, packages for out-of-memory computation, etc.) that are not specific to running on HPC clusters.

Here is an example batch submission script:

#!/bin/bash
# bash script for submitting a job to the sharcnet Graham queue
# asks for minimal resources (1 node, 1 task per node, 500 M) so that
# it will get run quickly so we can see what's going on
#SBATCH --nodes=1 # number of nodes to use
#SBATCH --time=00-00:10:00 # time (DD-HH:MM:SS)
#SBATCH --job-name="A name" # job name
#SBATCH --ntasks-per-node=1 # tasks per node
#SBATCH --mem=500M # memory per node
#SBATCH --output=sim-%j.log # log file
#SBATCH --error=sim-%j.err # error file
#SBATCH --mail-user=your_email@address # who to email
#SBATCH --mail-type=FAIL # when to email
#SBATCH --account=def-bolker
module load r/4.3.1
## self-contained; -e means "R code follows ..."
Rscript -e "print('hello'); set.seed(101); rnorm(3)"
Miscellaneous notes and to-dos:

- use $CURVERSION in instructions?
- peakRAM etc.; scaling tests
- system dependencies: some packages need system libraries that may not be installed (e.g. the units package, upstream of many graphics extension packages, has SystemRequirements: udunits-2 in its DESCRIPTION file); you would normally need to use apt-get install or yum install (see this SO question). The usual advice is "apt-get install whatever you need", but this depends on having root access. [e.g. the ggVennDiagram package requires the udunits-2 system library; you can make this available via module load udunits (from the command line)]
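For example, a sketch of the udunits/units case mentioned above (module names and versions may differ on other clusters):

$ module load udunits
$ module load r/4.3.1
$ R
R> install.packages("units")   # the configure step should now find the udunits-2 system library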