In general, R scripts can be run just like any other kind of program on an HPC (high-performance computing) system such as the Compute Canada systems. However, there are a few peculiarities of using R that are useful to know about. This document compiles some helpful practices; it is aimed at people who are familiar with R but unfamiliar with HPC, or vice versa.

Some of these instructions will be specific to Compute Canada ca. 2024, and particularly to the Graham cluster.

I assume that you're somewhat familiar with high-performance computing (i.e. you've taken the Compute Canada orientation session and know how to use sbatch/squeue/etc. to work with the batch scheduler).

Below, "batch mode" means running R code from an R script rather than starting R and typing commands at the prompt (i.e. "interactive mode"); "on a worker" means running a batch-mode script via the SLURM scheduler (i.e. using sbatch) rather than in a terminal session on the head node. Commands to be run within R will use an R> prompt, those to be run in the shell will use $.

Getting started

Compute Canada reminders

While this document is not a replacement for taking the webinar and reading the basic documentation, here are some reminders:

  • sbatch is the basic command for submitting a job
  • you can specify all sorts of parameters (required memory, number of nodes, etc.); the only required parameter is the 'wall clock time' estimate
  • the parameters can either be specified on the command line (e.g. --time=02:00:00 for a two-hour limit) or in the batch file with #SBATCH specifications
  • sq (just your jobs) and squeue (all jobs) are the commands for inspecting the queue
  • below I provide a sample minimal batch file

loading modules

R is often missing from the set of programs that is available by default on HPC systems. Most HPC systems use the module command to make different programs, and different versions of those programs, available for your use.

  • If you try to run R in interactive mode on Graham, the system will pop up a long list of possible modules to load and ask you which one you want; the first choice, the default, will be the most recent version available. You can also load it manually with module load r at the command line. If you need a specific version of R, you can run module spider r to see a list of all available R modules/versions.
  • If you try to run R in batch mode without loading the module first, or if you try to run R on a worker, you'll get an error. To run R on a worker you should add module load r to your batch script (specify a version if you want to be careful about reproducibility).
  • Sometimes your scripts may require other modules as well; load them in the same way.

installing packages

  • In order to install your own packages you need to have created and specified a user-writable directory. If you are working interactively, the first time you try to install packages for a particular version of R, R will ask whether you want to create such a directory (yes) and where to put it (the default is ~/R/x86_64-pc-linux-gnu-library/<R-version>). (If you are in batch mode you'll get an error instead.)
  • It's probably easiest to install packages from the head node, either by running R> install.packages("<pkg>") in an interactive R session, or in batch mode by running an R script that defines and installs a list of packages, e.g.
    # you will need to specify a repository; this is a safe default value
    R> options(repos = c(CRAN = "https://cloud.r-project.org"))
    R> pkgs <- c("broom", "mvtnorm", <...>)
    R> install.packages(pkgs)
    
    It's generally OK to run short (<10 minute) jobs like this interactively on the head node.
  • To install Bioconductor packages
    • at present (27 Oct 2023) Bioconductor requires R version <= 4.3.0; Graham has R 4.3.1 (too new), and the next most recent version available is 4.2.2, so run module load r/4.2.2 at the command line.
    • follow the Bioconductor installation instructions in an R session:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.18")

then use BiocManager::install() to install any other (non-base) Bioconductor packages you need.
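For example (the package name here is just an illustration; substitute whichever Bioconductor packages your analysis actually needs):

## hypothetical example: install a Bioconductor package and its dependencies
BiocManager::install("Biostrings")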

  • another, newer option is to use the rspm package to install binary packages from the RStudio Package Manager repository (see e.g. the rspm README):
    • install.packages("rspm")
    • rspm::enable()
    • install.packages("rstan")
      This will be much faster for packages that have compiled code. However, you may run into trouble with binary incompatibilities (e.g. a system library is not installed, or is installed somewhere other than where R is looking for it), in which case you'll have to fall back to installing from source.
  • The reason for doing package installation on the head node is that worker nodes don't have network access, so you won't be able to download packages from CRAN there. If absolutely necessary you can work around this by downloading tarballs from CRAN (onto the head node, or onto your own machine and then copying them to SHARCnet), but this can be really annoying because you will also have to download tarballs for all of the dependencies of the package you want and install them in the right order; when you install directly from CRAN this all gets handled automatically. If you have downloaded a package tarball mypkg_0.1.5.tar.gz, use R> install.packages("mypkg_0.1.5.tar.gz", repos = NULL) from within R or $ R CMD INSTALL mypkg_0.1.5.tar.gz from the shell to install it (see the sketch after this list).
  • If you really do want to install packages via tarballs, we can probably put together some machinery using …
  • Packages typically need to be reinstalled for each x.y release of R (denoted by the second digit in the version number; e.g. reinstallation is needed when going from version 4.3.1 to 4.4.0 but not from 4.4.0 to 4.4.1)
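As a minimal sketch of the tarball route described above (the file names here are hypothetical): download the tarballs for the package and all of its dependencies, copy them over, and install them in dependency order, e.g. in a loop within R:

## install downloaded source tarballs without network access;
## list dependencies before the packages that need them
tarballs <- c("cli_3.6.2.tar.gz",     # hypothetical dependency
              "mypkg_0.1.5.tar.gz")   # hypothetical target package
for (tb in tarballs) {
    install.packages(tb, repos = NULL, type = "source")
}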

running scripts in batch mode

Given an R script stored in a .R file, there are a few ways to run it in batch mode (note: below I use $ to denote the shell prompt):

  • $ r <filename>: an improved batch-R version by Dirk Eddelbuettel. You can install it by installing the littler package from CRAN (see installing packages) and running
    $ mkdir ~/bin
    $ cd ~/bin
    $ ln -s ~/R/x86_64-pc-linux-gnu-library/4.1/littler/bin/r

    in the shell. (You may need to adjust the path name for your R version.)
  • $ Rscript <filename>: by default, output will be printed to the standard output (which will end up in your .log file); you can use > outputfile to redirect the output where you want
    • one weirdness of Rscript is that it does not load the methods package by default, which may occasionally surprise you; if your script directly or indirectly uses stuff from methods, you need to load it explicitly with library("methods")
  • $ R CMD BATCH <filename>: this is similar to Rscript but automatically sends output to <filename>.Rout
  • This StackOverflow question says that r > Rscript > R CMD BATCH (according to the author of r)

See also here for enabling echoing of R commands (if not using R CMD BATCH).
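One general R trick (not specific to Compute Canada): setting options(echo = TRUE) at the top of a script makes subsequent top-level commands get echoed to the output when the script is run with Rscript, similar to what R CMD BATCH does by default. A minimal sketch:

## put this at the top of your script so that Rscript echoes commands
options(echo = TRUE)
## ... rest of the script; both the commands and their output
## will now appear in the log
x <- rnorm(5)
summary(x)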

running RStudio from SHARCnet

If you want to use RStudio instead,

  • log into https://jupyterhub.sharcnet.ca with your Compute Canada username/password
  • click on the 'softwares' icon (left margin) and load the rstudio-server-... module
  • an RStudio icon will appear; click it!
  • this session does not have internet access, but it does see all of the files in your user space (including packages that you have installed locally)
  • You can run Jupyter notebooks, etc. too (I don't know if there is a way to run a Jupyter notebook with a Python kernel)
  • it might make sense to 'reserve' your session in advance (so you don't have to wait a few minutes for it to start up); I'm not yet sure how to do that, maybe via salloc? (Maybe this Princeton web page is useful?) I seem to recall at some point an option to reserve compute resources starting at a scheduled time, but I can't find it any more. There is a "Reservation" box on the JupyterLab startup screen, but it's not clear what you have to do in order to be able to use it (at present the only option I get is "None").

Multiple jobs

running R jobs via job array

  • A job array is the preferred method for submitting multiple batch jobs.
  • To use job arrays effectively with R scripts, you need to know how to use commandArgs() to read command-line arguments from within an R script. For example, if batch.R contains:
    R> cc <- commandArgs(trailingOnly = TRUE)
    R> intarg <- as.integer(cc[1])
    R> chararg <- cc[2]
    R> cat(sprintf("int arg = %d, char arg = %s\n", intarg, chararg))
    then running $ Rscript batch.R 1234 hello will produce
    int arg = 1234, char arg = hello

    Note that all command-line arguments are passed as character strings and must be converted to numeric as necessary. If you want fancier argument processing than base R provides (e.g. default argument values, named rather than positional arguments), see this Stack Overflow question for some options. (A sketch of reading the job-array index from within R appears after this list.)
  • The number of jobs submitted at any given time (not just via job arrays) is restricted by the SLURM configuration variable MaxSubmitJobs; on the Graham cluster it is 1000. You can check by running this on the head node:
    $ sacctmgr list assoc user=<user_name> format=user,assoc,maxjobs,maxsubmit
  • There is also a SLURM configuration variable MaxArraySize that restricts both the number of jobs in the array and the indices used. On Graham this is much larger than MaxSubmitJobs, so in practice the largest job array can still contain only 1000 jobs.
    The array indices are restricted to MaxArraySize-1, meaning that even if you use steps in your array indices (e.g. --array=0-20000:10000, which contains only 3 jobs), the job array will not run if the maximum index (here 20000) is larger than MaxArraySize-1. Run something like this on the head node to determine the configuration:
    $ scontrol show config | grep -E 'MaxArraySize'
    (slurm job array support)
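As an alternative (or complement) to commandArgs(), a job-array script can read the array index directly from the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch (the computation and the output file name are placeholders), assuming the submission script contains something like #SBATCH --array=1-100 and ends with Rscript batch.R:

## batch.R: one independent replicate per array task (sketch)
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
set.seed(task_id)                 # a different, reproducible seed per task
res <- mean(rnorm(1000))          # placeholder for the real computation
saveRDS(res, sprintf("res_%03d.rds", task_id))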

improvement to job arrays using the META package

With a little extra learning, the META package is a nice upgrade from plain job arrays, with additional capabilities; the webinar is easy to follow.

When choosing between SIMPLE mode and META mode, the META documentation gives general guidelines: META mode is best when the number of jobs is greater than MaxSubmitJobs allows and/or when the run time of individual computations (or cases) is less than 20 minutes. As with job arrays, there is a balance between the overall speed/wall-clock time of case completion and resource grabbiness/queue wait time. Some additional considerations:

  • Running in SIMPLE mode is similar to job arrays, with the advantage of the additional features that come with the package (e.g. resubmitting failed jobs, job-handling functions, an automatic post-processing script) and of not being restricted to using or interpreting SLURM_ARRAY_TASK_ID in the execution of the array. You can use any inputs for your program, and any number of programs, as long as you specify them in the table.dat file and the total number of cases is less than MaxSubmitJobs.
  • In META mode you package multiple cases into a single job/chunk, and the run time of individual cases can in general be quite short (minutes or even seconds) because you are running sbatch at the job level, not the individual-case level. Because of the resources required to load R on the nodes before the computation, it is recommended that individual cases take at least on the order of minutes (see the sketch after this list for the general idea of chunking).
  • If your R script generates files, then choosing many short cases (versus larger chunks that run longer) can overload the HPC file system by creating too many files at once, resulting in very slow jobs for other users as well. SHARCnet support recommends creating no more than 10 files per second across all farm jobs. There are also limits on the number of files you can store on Graham: https://docs.alliancecan.ca/wiki/Storage_and_file_management
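The following is a rough sketch of the chunking idea referred to above, written as a plain job-array script rather than with META's own machinery (the chunk size and toy computation are arbitrary): each job processes a block of cases and writes a single output file, so individual jobs run for minutes rather than seconds and far fewer files are created.

## process a block of cases per array task, one output file per block
chunk_size <- 50
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
cases <- seq((task_id - 1) * chunk_size + 1, task_id * chunk_size)
results <- lapply(cases, function(i) {
    set.seed(i)
    mean(rnorm(1000))                  # placeholder for the real computation
})
saveRDS(results, sprintf("chunk_%03d.rds", task_id))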

Parallel computing

levels of parallelization

  • threading: shared memory, lightweight. Base R doesn't do any multithreading. Multithreaded computations can be done in R (1) by making use of a multithreaded BLAS (linear algebra) library, or (2) by using a package that includes multithreaded computation (typically via the OpenMP system) in its C++ code (e.g. glmmTMB).
    • It may sometimes be necessary to suppress multithreaded computation (see the sketch after this list)
      • OpenMP is usually controlled by setting the shell environment variable OMP_NUM_THREADS (e.g. export OMP_NUM_THREADS=1), but there may be controls within an R package as well.
      • BLAS threading specifically can be controlled via RhpcBLASctl::blas_set_num_threads() (see here)
  • multicore/multiprocess: multicore means multiple cores on a single node/computer; multiprocess means multiple processes within a single node or across multiple nodes in a cluster. Parallelization at these levels can be implemented within R. There are a bunch of different packages/front ends to handle this; almost all ultimately rest on either the Rmpi package (see below) or the parallel package: foreach, doParallel, future, furrr, etc.
    • if using these tools (i.e. parallelization within R) you probably want to figure out the number of chunks N, then define a virtual cluster with N cores within R (e.g. R> parallel::makeCluster(N)) and set #SBATCH --ntasks=N in your shell submission script (and let the scheduler pick the number of CPUs etc.)
    • MPI-based: you probably don't want to use MPI unless you are doing 'fancy' parallel computations that require inter-process communication during the course of the job (you can still use MPI but it's a waste if you don't need the communication); requires more specialized cluster resources etc.
  • The most efficient way to handle distributed/embarrassingly parallel problems (where all tasks are independent of each other) is via the batch scheduler. The META package, in conjunction with job arrays, is recommended by SHARCnet support in these cases.
  • a useful primer on threading vs multicore
  • some tips/things to avoid when parallelizing (e.g. be aware of memory constraints, overparallelization)
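A minimal sketch of suppressing multithreading from within R, as mentioned in the list above (assumes the RhpcBLASctl package is installed; setting OMP_NUM_THREADS in the shell before starting R is generally the more reliable route, since some libraries only read it at startup):

## limit OpenMP and BLAS threading to one thread each
Sys.setenv(OMP_NUM_THREADS = "1")        # for OpenMP-based package code
RhpcBLASctl::blas_set_num_threads(1)     # for a multithreaded BLAS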

parallelization and SLURM

Determining the number of nodes/cores/processes to request from SLURM will depend on which R package is used for parallelization. The foreach package supports both multicore and multiprocessing parallelization; this example shows how to run both using foreach, including how to ensure that R and SLURM communicate via the shell script: https://docs.alliancecan.ca/wiki/R#Exploiting_parallelism_in_R

When using HPC you should not let the R package you are using detect and try to use the number of available cores; instead, always specify the number of cores to use explicitly.
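For example, here is a minimal sketch of taking the core count from SLURM instead of detecting it (this assumes the resources were requested with --cpus-per-task; if you follow the --ntasks approach above, use the SLURM_NTASKS variable instead):

## use the number of CPUs that SLURM actually allocated, not parallel::detectCores()
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
cl <- parallel::makeCluster(n_cores)
res <- parallel::parLapply(cl, 1:100, function(i) sqrt(i))  # placeholder workload
parallel::stopCluster(cl)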

When setting SLURM #SBATCH arguments, here are some helpful notes:

Code efficiency

R is intrinsically slower than many other languages. The efficiency loss is worth it for the language's many other advantages, but before using HPC resources you should make sure your R code is running as efficiently as possible (a small profiling sketch follows the list below).

  • a great talk from Noam Ross on speeding up R code (see the "find better packages" and "improving your code" sections in particular)
  • Using tidyverse "tibbles" (via the dplyr package) is often faster than operating on base-R data frames; the data.table package is even faster. If you need fast input, consider the vroom and arrow packages. The collapse package also seems useful.
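As a starting point for the efficiency check mentioned above, base R's system.time() and Rprof() are enough to find out where the time is going (the toy function here is just a placeholder):

## coarse timing plus sampling profiler, base R only
f <- function(n) { x <- rnorm(n); sum(sort(x)) }   # toy workload
system.time(f(1e6))                        # coarse timing

Rprof("profile.out")                       # start the sampling profiler
invisible(replicate(50, f(1e5)))
Rprof(NULL)                                # stop profiling
head(summaryRprof("profile.out")$by.self)  # functions taking the most time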

Miscellaneous useful (?) links

sample minimal batch file

#!/bin/bash
# bash script for submitting a job to the sharcnet Graham queue
# asks for minimal resources (1 node, 1 task per node, 500M) so that
#  it will get run quickly and we can see what's going on

#SBATCH --nodes=1               # number of nodes to use
#SBATCH --time=00-00:10:00         # time (DD-HH:MM:SS)
#SBATCH --job-name="A name"     # job name

#SBATCH --ntasks-per-node=1              # tasks per node
#SBATCH --mem=500M                       # memory per node
#SBATCH --output=sim-%j.log               # log file
#SBATCH --error=sim-%j.err                # error file
#SBATCH --mail-user=your_email@address    # who to email
#SBATCH --mail-type=FAIL                  # when to email
#SBATCH --account=def-bolker
module load r/4.3.1
## self-contained; -e means "R code follows ..."
Rscript -e "print('hello'); set.seed(101); rnorm(3)"

To do

  • mention rclone (as a tool for data movement; tangential to R?)
  • https://docs.alliancecan.ca/wiki/R#Exploiting_parallelism_in_R
  • talk to SHARCnet
  • checkpointing
  • replace 4.3.1 (e.g.) with $CURVERSION in instructions?
  • info about estimation: peakRAM etc.; scaling tests
  • change to first-level headers for ToC
  • installing packages that depend on system libraries (e.g. the units package, upstream of many graphics extension packages, has SystemRequirements: udunits-2 in its DESCRIPTION file; you would normally need to use apt-get install or yum install (see this SO question))
    • the normal procedure on Linux would be "just apt-get install whatever you need", but this depends on having root access
    • one possibility is to ask the SHARCnet admins if they can help (it might be worth checking to see if there is a module that provides the library you need) [for example, the ggVennDiagram package requires the udunits-2 system library; you can make this available via module load udunits (from the command line)]
    • harder but more general: set up a Singularity container; this is like Docker (and can take Docker files as input!), so it would provide you an environment within which you have root access; the Rocker project has instructions on setting up Singularity containers with Rocker images, including a sample SLURM script