# Analysing data in an HPC environment using R

**Date:** 22 April 2021
**Time:** 9:00 - 16:00
**Zoom:** https://kth-se.zoom.us/j/61699320936
**Instructor:** Henric Zazzi

## Schedule

| Time | Lesson |
| --- | --- |
| 9:00 - 9:10 | Welcome |
| 9:10 - 9:40 | What is parallelization |
| 9:40 - 10:20 | Introduction to PDC |
| 10:20 - 10:35 | Coffee break |
| 10:35 - 11:00 | Lab: Login to PDC |
| 11:00 - 11:30 | Serial R using PDC clusters |
| 11:30 - 12:00 | Lab: compare methodologies |
| 12:00 - 13:00 | Lunch |
| 13:00 - 13:30 | Shared memory computing in R |
| 13:30 - 14:15 | Lab: shared memory computing |
| 14:15 - 14:30 | Coffee break |
| 14:30 - 14:45 | Distributed memory computing in R |
| 14:45 - 15:00 | Usable parallelized R functions |
| 15:00 - 16:00 | Lab: distributed memory computing |

## Important links

#### HackMD interactive form
https://hackmd.io/IXDDovCUS-ug1jJqOqYUVA?both

#### Evaluation form
https://docs.google.com/forms/d/e/1FAIpQLSfbnK3O2YmujNZHUQQU9wNwnTYW6KqzBta0G_J_JiyvqYnCrA/viewform?usp=sf_link

#### Course material
https://drive.google.com/file/d/1xOnUlHmflRpLscORxZ2ONTlgi1O1KtOB/view?usp=sharing

### Support information

#### PDC web portal
https://www.pdc.kth.se/

#### Client software for Kerberos installation
https://www.pdc.kth.se/support/documents/login/login.html#step-by-step-login-tutorial

#### SNIC rules for applying for allocations
https://snic.se/allocations/

#### SUPR portal to submit proposals
https://supr.snic.se/

### R packages used

#### parallel package
https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

#### snow package
https://cran.r-project.org/web/packages/snow/snow.pdf

#### foreach package
https://cran.r-project.org/web/packages/foreach/foreach.pdf

## Course allocation information

#### Course cluster
tegner.pdc.kth.se

#### Course allocation
edu21.hpcr

#### Course reservation
hpcr

## Questions

Please enter your questions here: copy an earlier entry and write your own question.
Answers will be given here or orally during the course.

### Question
Do you know the usage of R code at PDC in percentage?
### Answer
Not directly. R is unfortunately not used as much at PDC as I would want, but I do hope this course will change that.

---

### Question
Is it allowed to run many short R jobs at PDC (maybe 1 hr each)?
### Answer
Yes, that's allowed!

---

### Question
Is mpirun the wrapper which already includes the information on the number of cores?
### Answer
mpirun says how many processes will be used, but the number of nodes that should be used is specified using salloc/sbatch.

---

### Question
How do I go ahead with projects? I am not from KTH.
### Answer
You do not have to be from KTH, but you must be part of Swedish academia. In order to submit proposals you must use the SUPR portal (link above).

---

### Question
What does "increasing the number of threads" mean? Is it different from the number of CPUs?
### Answer
Yes. Threads are more like the number of processes that you are using, while CPUs are the hardware we have at our disposal. Usually you use as many processes as you have CPUs, but that is not always the case.

---

### Question
Between lapply and sapply, which one is faster?
### Answer
Difficult to say. I would say that they are equally fast, although I would guess, from a machine-code perspective, that creating an array is faster than creating a list, and therefore sapply should be marginally faster.

---

### Question
What is the recommended practice for creating/debugging R scripts for HPC deployment? Any remote editor to use? Or upload the script with scp every time something is changed?
### Answer
In this case RStudio Server would be a good remote editor to use. Unfortunately we have not installed it yet (perhaps on Dardel). For now I think the best practice is to develop locally and upload the script. You could also save/use scripts in AFS so they are available on both your local computer and the cluster.
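Related to the threads/CPUs question above, the parallel package can report the hardware at your disposal and lets you start however many workers you choose. A minimal sketch (the variable names are my own, not from the course material):

```r
# detectCores() reports the logical CPUs the OS sees; this is the hardware
# side of the question. The number of worker processes is a separate choice.
library(parallel)

n_logical  <- detectCores()                  # logical CPUs
n_physical <- detectCores(logical = FALSE)   # physical cores (may be NA)
cat("logical CPUs:", n_logical, " physical cores:", n_physical, "\n")

# Starting 2 workers regardless of the CPU count:
cl <- makeCluster(2)
res <- parSapply(cl, 1:4, function(i) i^2)
stopCluster(cl)
print(res)  # 1 4 9 16
```

As the answer notes, the worker count need not equal the CPU count; it usually does for compute-bound jobs, but not always.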
---

### Question
Any confirmed AFS clients for Mac OS X Big Sur (11.3)?
### Answer
Apple has made substantial changes which affect the AFS client. Any OpenAFS or AuriStorFS client prior to AuriStorFS v0.160 will panic a High Sierra OS when started, so make sure to install only versions above v0.160.

---

### Question
In the example for Pi, I think sapply is faster because one is not growing the vectors x and y?
### Answer
I quite agree with you, and I would expect that to be the case. Growing vectors takes a lot of computing power, whereas fixed-size vectors are as fast as arrays.

---

### Question
I tried to upload a script from my local machine to Tegner, but cannot find it anywhere:
`scp test.R oleksil@t04n27.pdc.kth.se:/cfs/klemming/scratch/o/oleksil/Private`
UPD: Unfortunately, it didn't work for me:
```
[oleksil@tegner-login-2 ~]$ module add nano test.R
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'nano'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'test.R'
```
UPD2: Thank you, indeed, just calling nano works perfectly.
### Answer
It depends a bit on what OS you have on your local machine. You could also solve it by copy/pasting directly into a file on Tegner, i.e.:
1. Open Tegner
2. module add nano <filename>
3. <ctrl><shift>+v, which will paste it
4. <ctrl>+o, then <ctrl>+x

nano is quite a small, easy editor on clusters, and much easier to use than vi or emacs for simple things.
UPD: Sorry, my mistake. We do not have a nano module, as nano is installed by default on Tegner (not the case on Beskow). Just skip step 2.

---

### Question
For the Pi example, it seems that lapply is faster than sapply?
### Answer
I would expect the other way around, but the code is quite simple and usually runs so fast that you get no performance gain. If you run it multiple times, you see that the time seems to vary.
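A quick way to check the lapply/sapply comparison yourself is to time both on the same inputs. The sketch below is illustrative (the function and variable names are my own, not from the course material):

```r
# Monte Carlo estimate of Pi over fixed-size vectors (no growing vectors)
calcpi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * sum(x^2 + y^2 <= 1) / n
}

reps <- rep(1e6, 20)  # 20 independent estimates of 1e6 samples each
t_sapply <- system.time(res_s <- sapply(reps, calcpi))["elapsed"]
t_lapply <- system.time(res_l <- unlist(lapply(reps, calcpi)))["elapsed"]

cat("sapply:", t_sapply, "s  lapply:", t_lapply, "s\n")
cat("estimate:", mean(res_s), "\n")  # close to pi
```

On such a small workload the two timings usually differ by only a few milliseconds, and the winner can flip between runs, which matches the run-to-run variation described in the answer.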
---

### Question
I'm trying to run calcpi using 100 different values of n, but when I use sapply I get only a single value, while the for loop gives all 100 different values of Pi. I am wondering what could be the problem with my code:
```
calcpi <- function(no) {
  y <- runif(no)
  x <- runif(no)
  z <- sqrt(x^2 + y^2)
  length(which(z<=1))*4 / length(z)
}
no <- list(seq(10, 1000, 10))
print(system.time({
  sa1 <- sapply(no, FUN=calcpi)
}))
```
I want sa1 to be of length 100, not only 1 value. This is how I did it with a for loop:
```
nn <- seq(10, 1000, 10)
montca <- NULL
print(system.time({
  for(i in 1:length(nn)) {
    y <- runif(nn[i])
    x <- runif(nn[i])
    z <- sqrt(x^2 + y^2)
    montca[i] <- length(which(z<=1))*4 / length(z)
  }
}))
```
### Answer
You want sapply to return a list of values for Pi with increasing values of **no**, right? sapply and similar functions do not work that way: if you give them an array or vector, they treat those as the inputs to calculate on, one per element (as with mean, where the mean is calculated for every entry in the input). Here the vector is wrapped in a list of one element, so calcpi is called only once. See the exercise folder for how it has been implemented.

---

### Question
Will you support ThinLinc on the new Dardel?
### Answer
Yes. Perhaps even Apache Guacamole.
### Reply
What is the difference between the two?
### Answer
ThinLinc is paid software, and we have to pay for the number of concurrent licenses, whereas Guacamole is open source. From a functionality perspective they are more or less similar.

---

### Question
At PDC, is the BLAS library that is loaded with R already the optimal one? As an aside, how do you fix the number of threads?
### Answer
It should be, but that depends on the installation. You do not fix the number of threads; it is automatically calculated.

---

### Question
I have a 100k by 100k matrix from which I want to extract the top 10 PCs, but normal base R takes over 1 hour to compute the eigenvalues and eigenvectors. Moreover, I ran into memory problems since this big matrix has to be in R's memory.
Could OpenBLAS help me speed up this eigendecomposition compared to base R?
### Answer
Yes, I am quite certain it will speed up your code. Regarding memory, there may be limitations if you run on your laptop, so it could be advantageous to run on a node at PDC, as we have nodes with up to 2 TB of RAM.

---

### Question
If you run on HPC, is it recommended to use detectCores()-1 (subtracting 1)?
### Answer
That depends. If you run interactively, then yes, please subtract 1, but if you submit a job, do not do that, as you want to use all available cores.

---

### Question
Why did you load the foreach package on slide 11?
### Answer
That was an error on the slide. I will correct that. It should be the parallel package.

---

### Question
Does the doParallel package already contain foreach, so that it doesn't need to be loaded? Otherwise, the last foreach is masked?
### Answer
Yes, you are right: doParallel loads foreach. I wanted to include foreach to show you that it is needed as well. Actually, doParallel loads the parallel library too.

---

### Question
It seems that the Pi code in my case doesn't scale beyond 4 cores for the vector x <- rep(1E6, 100). I obtained:

| nr. cores | time (sec.) |
| --- | --- |
| 1 | 6.2 |
| 2 | 3.7 |
| 3 | 2.6 |
| 4 | 1.9 |
| 5 | 2.1 |
### Answer
Yes, that is a typical example of how, with an increasing number of processes, the overhead of dividing the data for parallelisation eventually becomes too time-consuming, so you get no further speedup; it can even increase the runtime. A good example of why scaling is important.

---

### Question
I am trying to allocate a job in R:
```
[oleksil@tegner-login-2 ~]$ salloc -t 5:00 -N 1 -A edu21.hpcr mpirun -n 1 ./test.R
```
but it says:
```
salloc: Granted job allocation 960176
salloc: error: _fork_command: Unable to find command "mpirun"
salloc: Relinquishing job allocation 960176
```
### Answer
mpirun is not installed by default, so you must load the openmpi module that I mentioned.
Also, you have an error in your command, as you are running the R script right away, whereas you need the R module. Please look at the examples of running R serially in my presentation.

---

### Question
Do you need mpirun if you are running doParallel?
### Answer
Yes, mpirun sees to it that your code is run on the node that you have acquired. Otherwise it will run on the login node.

---

### Question
mclapply is not available for Windows. Do you have a suggestion for what to use instead when working in Windows?
### Answer
I am sorry, but I do not. A couple of suggestions: perhaps the parallel package is not installed by default and you need to install it? Have you tried the foreach package I mentioned? You could do the lab with that one instead.
### Reply
I have the parallel package installed. I will try foreach, thank you! (Sorry for the formatting.)

---

### Question
On slide 12 of Distributed memory computing in R, snow and doParallel are both loaded. Is that a requirement?
### Answer
Yes, otherwise you cannot use MPI and distributed computing.
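To tie the foreach/doParallel answers above together, here is a minimal sketch of the pattern. The calcpi function is my own illustration, not code from the slides, and it uses a local socket cluster; per the last answer, a real MPI run would build the cluster via snow instead:

```r
# foreach with a doParallel backend; loading doParallel also pulls in
# foreach and parallel, as discussed in the Q&A above.
library(doParallel)

cl <- makeCluster(2)    # 2 local workers; on the cluster, size this from
registerDoParallel(cl)  # the allocation rather than hard-coding it

calcpi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * sum(x^2 + y^2 <= 1) / n
}

# Each iteration runs on a worker; .combine = c collects a plain vector
res <- foreach(n = rep(1e5, 8), .combine = c) %dopar% calcpi(n)

stopCluster(cl)
cat("estimate:", mean(res), "\n")  # close to pi
```

For true distributed (MPI) runs, a snow MPI cluster (e.g. makeCluster(..., type = "MPI"), which requires Rmpi) would replace the socket cluster above, launched under mpirun as described in the answers.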