# Cluster training

# 1. Logging in

## 1.1. Using VPN

* Connect to the VPN;
* `ssh you@codon-login`

## 1.2. Using an ssh proxy

* This is my favourite way, as you don't need to connect to the VPN. It is also faster, since only your interaction with the terminal goes to EBI servers, and more private;
* Unfortunately, EBI will disallow this, as it makes it easier for people to gain unauthorized access to your account... I think by the end of the year we will all need to use the VPN;
* This is a multi-step approach; EBI explains how to do it here: https://intranet.ebi.ac.uk/article/remote-access-using-ssh
* Once you are set up following EBI's guide, you can add this section to your local `~/.ssh/config` to make things easier:

```
Host ebi-gate
    Hostname ligate.ebi.ac.uk
    # Hostname mitigate.ebi.ac.uk  # you can use mitigate if ligate has issues
    User <your_username>
    IdentityFile ~/.ssh/id_rsa.pub

Host codon-ext
    Hostname codon-login.ebi.ac.uk
    User <your_username>
    IdentityFile ~/.ssh/id_rsa.pub
    ProxyJump ebi-gate
    ForwardX11Trusted yes
```

* Once this is all set up, you can simply do `ssh codon-ext` to connect to the codon cluster.

Your ssh connection might be closed after a few minutes of inactivity. To avoid this, add this to your `~/.ssh/config`:

```
# keeping connections alive
Host *
    ServerAliveInterval 300
    ServerAliveCountMax 2
```

## 1.3. Post login stuff

### Sourcing shell initialization files

I had some issues with sourcing shell initialization files in the codon cluster only. It seems that when we log into the cluster, `~/.profile` is loaded, but when we start a new terminal (e.g. submit an interactive job, or start a screen), `~/.bashrc` is loaded. So I needed both files to always initialize the shell with my configs. The easiest way to deal with this is to link one to the other (e.g. I did `ln -s ~/.profile ~/.bashrc`). This way, whenever you change one of them, the change is reflected in the other, and the issue of sometimes one file being used and sometimes the other is solved.

# 2. Basic filesystems

## 2.1. Software filesystem

* `/hps/software/users/iqbal`
* Fastest filesystem in the codon cluster;
* Should be used just to store software you use frequently;
* ~~Very small filesystem: 20GB/Group;~~ The quota for each group is actually considerable: 200GB/Group. When you log into the codon cluster, you are welcomed with this message, but there is a typo:

```
-- _____________________________________________________________________   --
-- |                 |                        |              |   Data    | --
-- |   File System   |       Description      |    Quotas    | Retention | --
-- |_________________|________________________|______________|___________| --
-- |                 |  User/Group Software   |              |           | --
-- |  /hps/software  |  and Conda envs        |  20GB/Group  |           | --
-- |_________________|________________________|______________|___________| --
```

The typo is that the quota for `/hps/software` is **NOT** 20GB/Group, but **200GB/Group**. Although it is fairly large, we should store just software, conda environments, etc. here. Don't store containers here, as these can be quite heavy.

* Writable from login and worker nodes;

## 2.2. HPS

* `/hps/nobackup/iqbal/`
* Large and fast non-backed-up filesystem;
* Should be used as the workdir of your pipelines and scripts (i.e. your pipelines and scripts should create intermediate and temporary files here);
* It is the filesystem you will normally be using;
* Not writable from login nodes, writable from worker nodes (see the quick check below);
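If you are unsure whether a given path is writable from the node you are currently on, a quick `touch` test settles it. A minimal sketch, assuming you have a directory under `/hps/nobackup/iqbal/<your_user>`:

```sh
# from a login node this should fail; from a worker node it should succeed
test_file=/hps/nobackup/iqbal/<your_user>/write_test
if touch "$test_file" 2>/dev/null; then
    echo "writable from this node"
    rm "$test_file"
else
    echo "not writable from this node"
fi
```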
## 2.3. NFS

* `/nfs/research/zi` (will soon be changed to `/nfs/research/iqbal`);
* Slowest filesystem of the 3, but it is backed up weekly;
* Should be used to store raw data, or important data that you cannot afford to lose. Should not be used to store temporary data or data that you can easily reproduce;
* Can be used as input to pipelines, but not as a workdir (i.e. pipelines should not be creating files in `/nfs`)... although some people have reported pipelines that had issues running on `/hps` but worked just fine on `/nfs`;
* Writable from login and worker nodes;

## 2.4. Troubleshooting slow read and write speed in filesystems

Sometimes a filesystem gets slow because it is overloaded with users, or because it is close to full capacity. What usually happens then is that your pipelines and jobs start to run a lot slower, since in bioinformatics almost everything relies heavily on reading and writing data to disk. If you suspect a filesystem you are using is overloaded and slow, you can use this script to verify it: https://github.com/leoisl/test_disk_speed_in_cluster . Unfortunately, this only helps you confirm there is an issue, it does not solve it. The solution is to wait for the slowdown to be fixed, or to migrate to another filesystem that is not overloaded.

# 3. Submitting jobs

## 3.1. Hello world

`bsub -o hello_world.o -e hello_world.e echo Hello world!`

* `hello_world.o`: output stream (stdout) contents will be written to this file;
* `hello_world.e`: error stream (stderr) contents will be written to this file;

An annotated look at `hello_world.o`:

```
Hello world!    <------- Output of your command

<What follows is a footer - a job summary automatically added by LSF about your job execution>
------------------------------------------------------------
Sender: LSF System <lsf@hl-codon-38-03>
Subject: Job 1819497: <echo Hello world!> in cluster <codon> Done

Job <echo Hello world!> was submitted from host <hl-codon-06-02> by user <leandro> in cluster <codon> at Thu Apr 29 20:33:02 2021
Job was executed on host(s) <hl-codon-38-03>, in queue <standard>, as user <leandro> in cluster <codon> at Thu Apr 29 20:33:02 2021
</homes/leandro> was used as the home directory.
</hps/nobackup/iqbal/leandro/cluster_training> was used as the working directory.
Started at Thu Apr 29 20:33:02 2021
Terminated at Thu Apr 29 20:33:04 2021
Results reported at Thu Apr 29 20:33:04 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
echo Hello world!
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time : 0.02 sec.
    Max Memory : -
    Average Memory : -
    Total Requested Memory : -
    Delta Memory : -
    Max Swap : -
    Max Processes : -
    Max Threads : -
    Run time : 2 sec.
    Turnaround time : 2 sec.

The output (if any) is above this job summary.

PS: Read file <hello_world.e> for stderr output of this job.
```

## 3.2. Improving Hello world

1. Use scripts:

```
echo "sleep 30; echo Hello World!" > hello_world.sh
```

2. Give your job a name: `-J <jobname>`

3. Ask for an amount of memory for your job (there is a default of `4GB`, I think?): `-M <amount_of_RAM_in_MB>`

```
bsub -o hello_world.o -e hello_world.e -J hello_world -M 1000 bash hello_world.sh
```

4. See your job running: `bjobs` (or `bjobs -w`):

```
bjobs -w
JOBID   USER    STAT  QUEUE     FROM_HOST       EXEC_HOST       JOB_NAME     SUBMIT_TIME
1819503 leandro RUN   standard  hl-codon-06-02  hl-codon-17-01  hello_world  Apr 29 20:43
```
5. ...and when it has finished:

```
$ bjobs -wa
JOBID   USER    STAT  QUEUE     FROM_HOST       EXEC_HOST       JOB_NAME     SUBMIT_TIME
1819503 leandro DONE  standard  hl-codon-06-02  hl-codon-17-01  hello_world  Apr 29 20:43
```

6. Now we can see some different run stats:

```
Resource usage summary:

    CPU time : 0.13 sec.
    Max Memory : 8 MB
    Average Memory : 8.00 MB
    Total Requested Memory : 1000.00 MB
    Delta Memory : 992.00 MB
    Max Swap : -
    Max Processes : 4
    Max Threads : 5
    Run time : 43 sec.
    Turnaround time : 32 sec.
```

## 3.3. Asking for multiple cores or threads

1. Use `-n <number_of_CPUs>`: `bsub -o hello_world.o -e hello_world.e -J hello_world -M 1000 -n 8 bash hello_world.sh`

**ATTENTION!!** Submitting a job asking for 1 CPU and then running your tool with more than 1 CPU (2, 4, 8, etc.) is **evil**. You are telling the job scheduler you need 1 CPU, but you use 8. If everyone does this, we will have workers with 50 CPUs trying to do the work of 100 or more CPUs, and everyone's jobs will run very slowly. There is no way for the job scheduler to guess how many CPUs your job will use, and if you use more CPUs than you asked for, the job scheduler won't kill your job (unlike RAM, where it will kill your job).

2. Better host selection: `-R "select[mem>1000] rusage[mem=1000]"`;

3. Testing for filesystem errors: `-E 'test -e /homes/<your_username>'`

AHHH this is too much!

## 3.4. Simplifying our lives: bsub.py

This is a wrapper script created by Martin that greatly simplifies submitting jobs to the cluster.

Installation: `pip install git+https://github.com/sanger-pathogens/Farmpy`

Usage:

```
$ bsub.py --help
usage: bsub.py <memory> <name> <command>

Wrapper script for running jobs using LSF

positional arguments:
  memory           Memory in GB to reserve for the job
  name             Name of the job
  command          Command to be bsubbed

optional arguments:
  -h, --help       show this help message and exit
.
.
.
  --threads int    Number of threads to request [1]
```

### 3.4.1. Submitting a job with bsub.py:

```
bsub.py 1 hello_world bash hello_world.sh
```

### 3.4.2. Submitting a job with bsub.py asking for several cores:

```
bsub.py --threads 8 1 hello_world bash hello_world.sh
```

# 4. Monitoring jobs

Use `bjobs`. Most used parameters:

* Wide formatting: `bjobs -w`
* Details about a job: `bjobs -l <job_id>`
* Seeing all jobs: `bjobs -a`

# 5. Troubleshooting

This is probably the most important section. When the cluster runs well and you are submitting jobs correctly, everything runs fine. When things break (i.e. the cluster is overloaded, you have stuck jobs, you submitted to a bad node, you submitted jobs incorrectly, etc.), you start investing a lot of time in understanding and fixing what is happening.

## 5.1. Out of RAM jobs

Let's simulate a job that uses ~6 GB of RAM, but ask LSF for only 2 GB (in reality, heavy RAM jobs are jobs > 100 GB, but it is hard to control memory allocation in `python`...):

```
bsub.py 2 heavy_RAM_job python -c \""bytearray(10_000_000_000)"\"
```

Checking `heavy_RAM_job.o`:

```
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 137.

Resource usage summary:

    CPU time : 0.48 sec.
    Max Memory : 2000 MB
    Average Memory : 2000.00 MB
    Total Requested Memory : 2000.00 MB
    Delta Memory : 0.00 MB
    Max Swap : -
    Max Processes : -
    Max Threads : -
    Run time : 9 sec.
    Turnaround time : 4 sec.
```

Once we see `TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.`, it means our job tried to use more memory than we requested, so LSF killed it.
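If you don't know how much memory to ask for, one option is to run a representative case under GNU time, which reports the peak memory used. A minimal sketch, assuming GNU time is installed at `/usr/bin/time` (run it inside an interactive job, not on a login node):

```sh
# -v makes GNU time print a detailed report, including
# "Maximum resident set size (kbytes)", i.e. the peak RAM usage
/usr/bin/time -v python -c "bytearray(10_000_000_000)"
```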
Let's now ask for 10 GB (a 4 GB margin):

```
bsub.py 10 heavy_RAM_job python -c \""bytearray(10_000_000_000)"\"
```

It runs fine!

```
Successfully completed.

Resource usage summary:

    CPU time : 2.11 sec.
    Max Memory : 6459 MB
    Average Memory : 6459.00 MB
    Total Requested Memory : 10000.00 MB
    Delta Memory : 3541.00 MB
    Max Swap : -
    Max Processes : 3
    Max Threads : 4
    Run time : 3 sec.
    Turnaround time : 6 sec.
```

`Successfully completed.` -> your job ran fine!

## 5.2. Stuck or slow jobs

Suppose you have a job that is supposed to do a lot of computation (e.g. running on 20 threads for several days). After a full day of work, you do a `bjobs -l <job_id>` and you see this:

```
...
Thu Apr 29 22:19:01: Resource usage collected.
                     The CPU time used is 6 seconds.
                     MEM: 7 Mbytes; SWAP: 0 Mbytes; NTHREAD: 4
                     PGID: 3124158; PIDs: 3124158
                     PGID: 3124159; PIDs: 3124159
                     PGID: 3124161; PIDs: 3124161
...
```

This means that your job was "running" for a whole day, but actually executed on a CPU for just 6 seconds. In this case, your job is stuck. There are several reasons for jobs to get stuck. The most common one is getting stuck while mounting unresponsive filesystems when starting a container (see [11.3. Solving singularity stuck issue](#113-solving-singularity-stuck-issue)), but there are others. Your job could also be very slow if the worker node is overloaded, but this rarely happens. These are very unusual situations, but in any case, if you need to debug why your job is stuck, I would recommend logging into the worker node and checking what is happening. You can do this by:

1. Checking which worker node your job is running on (`EXEC_HOST` is the worker node):

```
bjobs -w
JOBID   USER    STAT  QUEUE     FROM_HOST       EXEC_HOST       JOB_NAME             SUBMIT_TIME
1819493 leandro RUN   standard  hl-codon-21-01  hl-codon-06-02  /usr/local/bin/bash  Apr 29 20:29
```

2. Logging into that worker node:

```
bsub -Is -m hl-codon-06-02 "$SHELL"
```

Here, you are in an interactive job. You can run `top`, etc., to monitor what is going on in the worker node. More on interactive jobs in section [8. Interactive jobs](#8-interactive-jobs).

## 5.3. Jobs that never run, always pending

When you submit a job, it first goes to the `PEND` state - it is waiting for LSF to find a suitable host to run it. Depending on how loaded the cluster is, your job might take a long time to be scheduled. You might want to investigate why it is not being scheduled, and maybe change some submission parameters to get things going. You can do this by inspecting the pending job details with `bjobs -l <job_id>`. For example, let's submit a job that requires 200 CPUs (there are no worker nodes capable of running this job in `codon`):

```
bsub.py --threads 200 1 pending_job ls
```

We can see that our job is pending:

```
$ bjobs -w
JOBID   USER    STAT  QUEUE     FROM_HOST       EXEC_HOST  JOB_NAME     SUBMIT_TIME
1819560 leandro PEND  standard  hl-codon-06-02  -          pending_job  Apr 29 22:37
```

Let's find out why:

```
bjobs -l 1819560
...
 PENDING REASONS:
 Not enough job slot(s): 187 hosts;
...
```

This says that none of the 187 available hosts has enough job slots (CPUs) for the 200 we are asking for. There may be several reasons why your job is pending, and they are all listed in the `PENDING REASONS` section.
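If the job is pending because of a resource request you can relax, you don't necessarily have to kill and resubmit it: LSF's `bmod` can modify the requests of a pending job in place. A minimal sketch (option support depends on the LSF configuration, so check `bmod -h` on the cluster first):

```sh
# lower the CPU request of the pending job so a host can actually satisfy it
bmod -n 8 1819560

# memory requests can be changed the same way
bmod -M 1000 -R "select[mem>1000] rusage[mem=1000]" 1819560
```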
# 6. Killing jobs

## 6.1. Killing a single job

If you need to kill a job, use `bkill <job_id>`. For example, to kill the previously pending job (or a job that you submitted incorrectly):

```
bkill 1819560
Job <1819560> is being terminated
```

If we run `bjobs -wa`, we will see that the job was killed:

```
bjobs -wa
JOBID   USER    STAT  QUEUE     FROM_HOST       EXEC_HOST  JOB_NAME     SUBMIT_TIME
1819560 leandro EXIT  standard  hl-codon-06-02  -          pending_job  Apr 29 22:37
```

## 6.2. Killing all your jobs

`bkill 0`

## 6.3. Killing specific jobs

To kill some specific jobs, but not all of them, you can use this handy bash function (add it to your `~/.bashrc` to always have it available):

```
# print the bkill command to kill the jobs whose name grep-matches the given argument
grep_bkill() {
  echo "To kill jobs grep-matching $1, run:"
  bjobs -w -noheader | grep "$1" | awk '{print $1}' | xargs echo bkill
}
```

For example, I have 3 short and 3 long jobs running (the job name identifies the long and the short jobs):

```
$ bjobs -w
JOBID   USER    STAT  QUEUE     FROM_HOST       EXEC_HOST       JOB_NAME  SUBMIT_TIME
1820171 leandro RUN   standard  hl-codon-15-02  hl-codon-22-02  long_1    May  5 00:12
1820172 leandro RUN   standard  hl-codon-15-02  hl-codon-18-03  long_2    May  5 00:12
1820173 leandro RUN   standard  hl-codon-15-02  hl-codon-18-03  long_3    May  5 00:12
1820174 leandro RUN   standard  hl-codon-15-02  hl-codon-18-03  short_1   May  5 00:12
1820175 leandro RUN   standard  hl-codon-15-02  hl-codon-18-03  short_2   May  5 00:12
1820176 leandro RUN   standard  hl-codon-15-02  hl-codon-17-01  short_3   May  5 00:12
```

If I find an error with the long jobs, I can kill them all by running:

```
$ grep_bkill long
To kill jobs grep-matching long, run:
bkill 1820171 1820172 1820173
```

Note that `grep_bkill` doesn't actually kill your jobs; it just gives you the `bkill` command to kill them. This allows you to re-check that everything is fine before actually killing them. Now we can proceed and kill the jobs:

```
$ bkill 1820171 1820172 1820173
Job <1820171> is being terminated
Job <1820172> is being terminated
Job <1820173> is being terminated
```

# 7. Queues, worker nodes and submitting jobs to the bigmem queue

We can check the queues that we can submit jobs to using `bqueues`:

```
$ bqueues
QUEUE_NAME       PRIO STATUS          MAX  JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
production       100  Open:Active       -    -    -    -     0     0     0     0
debug            100  Open:Active       -    1    -    -     0     0     0     0
gpu              100  Open:Active       -    -    -    -    40     0    40     0
standard          50  Open:Active       -    -    -    -   496     0   496     0
research          50  Open:Active       -    -    -    -     0     0     0     0
bigmem            50  Open:Active       -    -    -    -    12    12     0     0
datamover_debug   50  Open:Active       -    1    -    -     0     0     0     0
datamover         40  Open:Active       -   10    -    -     0     0     0     0
mpi               40  Open:Active       -    -    -    -     0     0     0     0
long              30  Open:Active     500   20    -    -     2     0     2     0
```

Jobs are submitted by default to the `standard` queue.
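To see the limits a queue imposes (run-time limit, per-job memory limit, which hosts it dispatches to, etc.), you can ask for the long listing. A quick example (the exact fields shown depend on how the queue is configured):

```sh
# detailed configuration of the standard queue, including its hard limits
bqueues -l standard
```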
We can check the stats (number of CPUs, max memory, etc.) of each host (login and worker hosts) with `lshosts`:

```
$ lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
codon-maste  X86_64 XeonE526  65.0    16 125.5G   1.9G    Yes (mg)
codon-maste  X86_64 XeonE526  65.0    16 125.5G   1.9G    Yes (mg)
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-11  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-21  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
...
hl-codon-bm  X86_64 XeonE526  65.0    16 755.6G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    16 755.6G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    16 755.6G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    24 755.5G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    24 755.5G   1.9G    Yes ()
...
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-03  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-03  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-03  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
```

Here we can see that we have several big-memory nodes (`hl-codon-bm`) with 755GB or 1.4TB of RAM. You might want to cherry-pick these worker nodes to run your jobs. We can also see that the normal (non-big-mem) nodes (`hl-codon-01`, ...) have only 376GB of RAM, but a lot more cores (96). Big-mem nodes are in the `bigmem` queue, while most of the other nodes are in the `standard` queue.

Let's say we want to run a 500 GB job. We won't even manage to submit it to the `standard` queue:

```
$ bsub -M500000 ls
MEMLIMIT:
    Cannot exceed queue's hard limit(s). Job not submitted.
```

But we can submit it to the `bigmem` queue (now using `bsub.py`):

```
$ bsub.py -q bigmem 500 heavy_job ls
bsub -q bigmem -E 'test -e /homes/leandro' -R "select[mem>500000] rusage[mem=500000]" -M500000 -o heavy_job.o -e heavy_job.e -J heavy_job ls
1819565 submitted
```

And it will run on one of the big-mem nodes.

# 8. Interactive jobs

Interactive jobs are nice for testing stuff out on a worker node before putting everything into a script and running it. It is like a terminal session, but on a worker node. To run interactive jobs, I highly recommend adding this function to your `~/.bashrc` (credits to Michael):

```
# function to start an interactive job with a given amount of memory and threads
# usage: bsub_i <mem_in_gb> <threads>
bsub_i() {
  mem="${1:-1}"
  mem=$((mem * 1000))
  threads="${2:-1}"
  bsub -Is -n "$threads" -R "span[hosts=1] select[mem>${mem}] rusage[mem=${mem}]" -M"$mem" "$SHELL"
}
```

Then, to start an interactive job where we can use up to 16 GB and 4 threads, we run:

```
bsub_i 16 4
```

This will start a job on one of the worker nodes:

```
Job <1819566> is submitted to default queue <standard>.
<<Waiting for dispatch ...>>
<<Starting on hl-codon-08-03>>
(base) hl-codon-08-03:/hps/nobackup/iqbal/leandro/cluster_training
$
```

# 9. Screens

Screens are interactive sessions that live on login nodes and that you can detach from and reattach to later. In a nutshell, you can use an interactive job inside a screen to do some work, and then, when you have to go home, for example, you can simply detach from the screen and turn off your laptop. Your screen still exists (it was not killed), and your interactive job lives inside the screen, so it is not killed either. The next day, you can simply reattach to your screen and resume working on your interactive job. Without screen, as soon as you lose the connection to the server (e.g. you turn off the laptop) or exit the session, the interactive job is killed. Then on the next day you need to resubmit your interactive job and remember what you were doing to continue your work. You might also not have your bash history available. Screens solve this problem.

## 9.1. Creating a screen (with name hello_world)

```
$ screen -S hello_world
```

As soon as you create a screen, you are in it (i.e. attached to it) and can type commands.
Let's do an `ls`:

```
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ ls
090rc1  cluster_training  make_prg_0.2.0_prototype  new_precompiled_test  pandora_0.9.0_test  pandora_paper  pandora_paper_review  pandora_versions  pdrv  singularity_cache  temp
```

## 9.2. Detaching from the screen

Let's say I finished my work for today, but want to continue tomorrow. I can simply detach from the screen. To do this, type `<ctrl> a` and then `d`. You will see:

```
[detached from 4093722.hello_world]
```

## 9.3. Listing available screens

When you come back to work, you want to resume your screens, but you have forgotten their names and how many you have. Just list your available screens:

```
$ screen -ls
There are screens on:
        4094952.test            (Detached)
        4093722.hello_world     (Detached)
2 Sockets in /run/screen/S-leandro.
```

The `(Detached)` means that the screen is not attached and can be attached to.

## 9.4. Attaching to a screen (with name hello_world)

```
screen -r hello_world
```

And we are back:

```
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ ls
090rc1  cluster_training  make_prg_0.2.0_prototype  new_precompiled_test  pandora_0.9.0_test  pandora_paper  pandora_paper_review  pandora_versions  pdrv  singularity_cache  temp
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$
```

## 9.5. Attaching to an already attached screen (with name hello_world)

Sometimes, for example when you close your laptop without detaching the screen, it will not detach itself, and then you will have trouble resuming it. Let's see an example:

```
$ screen -ls
There are screens on:
        4093722.hello_world     (Attached)
2 Sockets in /run/screen/S-leandro.
```

Here, the `hello_world` screen is attached to another terminal, so I can't attach to it from here:

```
$ screen -r hello_world
There is a screen on:
        4093722.hello_world     (Attached)
There is no screen to be resumed matching hello_world.
```

What we can do is tell `screen` to detach the screen and resume it for us here, by adding the parameter `-d`:

```
$ screen -d -r hello_world
```

And we are back in the screen:

```
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ ls
090rc1  cluster_training  make_prg_0.2.0_prototype  new_precompiled_test  pandora_0.9.0_test  pandora_paper  pandora_paper_review  pandora_versions  pdrv  singularity_cache  temp
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$
```

And from here you can do whatever you want. There is much more to `screen` than what is written here; Google is a really good resource to learn more.
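Putting sections 8 and 9 together, a typical day-to-day workflow looks like the sketch below (the session name `analysis` is just an example). It also shows how to get rid of a screen you no longer need, which is not covered above:

```sh
screen -S analysis        # create a screen on the login node
bsub_i 16 4               # inside the screen, start an interactive job (see section 8)
# ... do some work ...
# press <ctrl> a then d to detach; the interactive job keeps running

# next day, from the same login node:
screen -r analysis        # reattach and carry on

# when you are done, exit the interactive job and the screen,
# or kill the screen by name from outside:
screen -X -S analysis quit
```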
# 10. Modules

The Systems team manages a bunch of software through modules. If what you need is already available as a module, just use it! There is plenty of documentation on the web explaining how to use modules, but here we will just see the most used commands, so that you can get going with things:

## 10.1. List all available modules

```
module avail
------------------------------ /hps/software/spack/share/spack/modules/linux-centos8-sandybridge ------------------------------
ant-1.10.0-gcc-9.3.0-xzxbcc6        hpctoolkit-2020.08.03-gcc-9.3.0-7e7d77v   likwid-5.1.0-gcc-9.3.0-5r3pmf3            py-markupsafe-1.1.1-gcc-9.3.0-3hoepzr
ant-1.10.7-gcc-9.3.0-o4ilc7d        hwloc-1.11.11-gcc-9.3.0-6whrzxh           llvm-11.0.0-clang-11.0.0-trut6so          py-memory-profiler-0.57.0-gcc-9.3.0-wi4elxy
autoconf-2.69-gcc-9.3.0-bpocjpe     hwloc-1.11.11-gcc-9.3.0-udp27qs           llvm-12.0.0-gcc-9.3.0-4wfwsth             py-numexpr-2.7.0-gcc-9.3.0-xt5ndff
automake-1.16.3-gcc-9.3.0-j5atga2   hwloc-2.2.0-clang-11.0.0-2r2csuw          llvm-openmp-9.0.0-clang-11.0.0-exvkysy    py-numpy-1.19.4-gcc-9.3.0-x2neh6p
```

## 10.2. Searching for the module that has what you need (e.g. singularity)

```
module keyword singularity
--------------------------------------------------------------------------------------------------
The following modules match your search criteria: "singularity"
--------------------------------------------------------------------------------------------------
  singularity-3.5.3-gcc-9.3.0-o6v53jz: singularity-3.5.3-gcc-9.3.0-o6v53jz
  singularity-3.6.4-gcc-9.3.0-yvkwp5n: singularity-3.6.4-gcc-9.3.0-yvkwp5n
  singularity-3.7.0-gcc-9.3.0-dp5ffrp: singularity-3.7.0-gcc-9.3.0-dp5ffrp
```

## 10.3. Loading a module

I don't have `nextflow` installed:

```
$ nextflow
bash: nextflow: command not found
```

Search for the module and load it:

```
$ module load nextflow-20.07.1-gcc-9.3.0-mqfchke
$ nextflow -v
nextflow version 20.07.1.5412
```

## 10.4. List all loaded modules

```
$ module list
Currently Loaded Modules:
  1) gcc-9.3.0-gcc-9.3.0-lnsweiq   2) cmake-3.19.5-gcc-9.3.0-z5ntmum   3) singularity-3.7.0-gcc-9.3.0-dp5ffrp   4) nextflow-20.07.1-gcc-9.3.0-mqfchke
```

## 10.5. Remove all loaded modules

```
$ module purge
```

Nothing is loaded:

```
$ module list
No modules loaded
```

## 10.6. Loading modules automatically

If you use some tools frequently, load them automatically by adding a `module load` line to your `~/.bashrc`. For example, this is mine:

```
# load some default modules
module load gcc-9.3.0-gcc-9.3.0-lnsweiq cmake-3.19.5-gcc-9.3.0-z5ntmum singularity-3.7.0-gcc-9.3.0-dp5ffrp
```

# 11. Tips and misc

## 11.1. Running hundreds or thousands of jobs

The codon cluster is capable of running even hundreds of thousands of jobs, but keeping track of which job failed and has to be resubmitted, which failed due to RAM and has to be resubmitted with more RAM, which succeeded, etc. is a mess and a complicated task. For complicated workflows, we should use workflow managers. The most commonly used ones in bioinformatics are [snakemake](https://snakemake.readthedocs.io/en/stable/) (in Python) and [nextflow](https://www.nextflow.io/) (in Groovy, a JVM language). You can manage workflows similarly with both. For `snakemake`, we have an [LSF profile](https://github.com/Snakemake-Profiles/lsf/) that takes care of many LSF-specific details and issues automatically for us. This profile was started by Michael and is maintained by him, Brice, me and others. Most importantly, we have been using it for a couple of years already on EBI clusters, including `codon`, so it works well on our cluster.
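As a rough idea of what using the profile looks like in practice (a minimal sketch - the profile name and options depend on how you installed and configured it, see the profile's README):

```sh
# run the workflow, letting the LSF profile translate each rule into a bsub
# submission and keep at most 100 jobs in flight
snakemake --profile lsf --jobs 100
```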
If you need a workflow manager and choose `snakemake`, I can help you in case of any issues (as can Michael, Brice, etc.). For Nextflow, I think Martin or other people can help.

## 11.2. Isolating your environment

You can try stuff out in whatever environment you like, but we strongly suggest that when you run something important, for a project or a paper, you do it in an isolated environment. An easy isolated environment to set up is a conda environment. If you want even more isolation, reproducibility, and ease for external people to reproduce your results or rerun your pipeline, go for containers (`Docker` or `Singularity`). On the cluster we only have `singularity`, as `docker` needs special permissions to run. But `singularity` can run `docker` containers, so you can create recipes for either, and you will be able to run both on the cluster through `singularity`. Using containers will totally remove the but-it-worked-on-my-machine-or-cluster issue.
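For example, a project-specific conda environment could live on the software filesystem (see section 2.1). A minimal sketch - the path and environment name are just illustrative, assuming you manage your own conda install:

```sh
# keep conda environments on the fast /hps/software filesystem, as suggested in section 2.1
conda create --prefix /hps/software/users/iqbal/<your_user>/envs/myproject python=3.10
conda activate /hps/software/users/iqbal/<your_user>/envs/myproject
```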
## 11.3. Solving singularity stuck issue

If you use `singularity` to run containers, you might have some stuck jobs from time to time. By stuck I mean that the job has been running for, let's say, 1 day, but when you query its CPU time usage (with `bjobs -l <job_id>`) you see that it actually executed for just a few seconds. There is a high chance that singularity is just stuck trying to mount several paths predefined by Systems. In summary, Systems wants to simplify the life of everyone using singularity, so their configuration mounts almost every filesystem available on the codon cluster when a singularity container starts. It takes just one of these filesystems being overloaded or otherwise very slow for all your singularity jobs to get stuck. We, as a research group, don't use any special filesystems, just `/nfs,/hps,/homes`. Thus, you can add this to your `~/.bashrc`:

```
# avoids singularity getting stuck
export SINGULARITY_CONTAIN=TRUE
export SINGULARITY_BINDPATH=/nfs,/hps,/homes,/tmp
```

This tells `singularity` to **not** mount any of the predefined paths specified by Systems, and to mount just `/nfs,/hps,/homes,/tmp`, which is all we normally need.

## 11.4. Solving issues with not enough space in the temp directory

Many tools use the `/tmp` directory to create temporary files during their execution. In the `codon` cluster, `/tmp` is a very fast filesystem (it is a local filesystem of the worker node), but it is also relatively small. There is no control over how much `tmp` space a job can use, so we could have 96 jobs running on a worker node and a single one of them could fill up the `tmp` space, making every job that needs `tmp` fail. Worse, some tools don't clean up their `tmp` files if they hit an error during execution. So, in summary, the `/tmp` filesystem is super fast, but unreliable. If you don't want to run into these issues anymore (they look like this: `Not enough free disk space on /tmp`), you can change the temporary directory by adding this to your `~/.bashrc`:

```
# avoids using the node's tmp dir (it can fill up)
custom_temp_dir="/hps/nobackup/iqbal/<your_user>/temp"
if touch "${custom_temp_dir}/touch_test_to_change_temp_dir" 2>/dev/null ; then
  export TMPDIR="${custom_temp_dir}"
fi
```

Notes:

1. For the `custom_temp_dir`, change it to one of your own paths. It has to be a `/hps` path and the directory has to exist. It should not be a `/nfs` path (too slow, and no need for it to be backed up), or a `/hps/software` path (it would just fill up your `/hps/software` quota and you won't be able to install anything else);
2. Your jobs will get slightly slower, as it is much faster to read and write from the worker node's `/tmp` dir than from `/hps`. If your job does a lot of I/O to the temp dir, it might actually get noticeably slower; otherwise the slowdown is negligible;
3. Your jobs can still fail if `/hps` is full, but in that case the whole cluster goes down anyway;
4. The `if` checks whether the custom temp dir is writable. If it is, we change the default temp dir to it; otherwise, no change is made. This is required because the `/hps` filesystem is not writable from login nodes, just from worker nodes, but we need access to a writable temp dir from login nodes for several reasons - one of them being job submission: `bsub` creates temporary files when submitting jobs, and if the temp dir is not writable, you won't be able to submit any jobs.

## 11.5. Some handy bash functions

```
# show full path in the terminal prompt
PS1="\[\`if [[ \$? = "0" ]]; then echo '\e[32m\h\e[0m'; else echo '\e[31m\h\e[0m' ; fi\`:\$PWD\n\$ "

# function to start an interactive job with a given amount of memory and threads
# usage: bsub_i <mem_in_gb> <threads>
bsub_i() {
  mem="${1:-1}"
  mem=$((mem * 1000))
  threads="${2:-1}"
  bsub -Is -n "$threads" -R "span[hosts=1] select[mem>${mem}] rusage[mem=${mem}]" -M"$mem" "$SHELL"
}

# function to start an interactive debug job on a given host
# usage: bsub_debug <host>
bsub_debug() {
  host=$1
  bsub -Is -m "$host" "$SHELL"
}

# print the bkill command to kill the jobs whose name grep-matches the given argument
grep_bkill() {
  echo "To kill jobs grep-matching $1, run:"
  bjobs -w -noheader | grep "$1" | awk '{print $1}' | xargs echo bkill
}

# simple job monitoring command
alias monitor_jobs="watch -n 30 \"bjobs -w | awk '{print \\\$3}' | sort | uniq -c\""
```

# 12. Running jupyter notebooks from the cluster

This is a copy-paste from Michael Hall's instructions. All credits go to him.

## Jupyter on the cluster

First we define some port-number variables that we will use in different places, to make them easy to control:

```
local_port=9000
middle_port=8888
jupyter_port=8080
```

1. Start an interactive job on the cluster with whatever parameters you need:

```
bsub -Is "$SHELL"
```

2. Note the hostname the interactive job is running on and start jupyter:

```sh
echo "$HOSTNAME"

local_port=9000
middle_port=8888
jupyter_port=8080

# start jupyter
# export XDG_RUNTIME_DIR=""
jupyter notebook --no-browser --port="$jupyter_port"
# NOTE DOWN THE URL THE NOTEBOOK IS RUNNING ON
# http://localhost:8080/?token=
```

If you get a permission error when trying to run `jupyter notebook`, then try running `export XDG_RUNTIME_DIR=""`.

3. In a new terminal window (from your local machine):

```sh
local_port=9000
middle_port=8888
jupyter_port=8080

# change codon-login to however you log in into codon
ssh -L "$local_port":localhost:"$middle_port" codon-login

# log into the interactive job you have running
# change XX-YY to whatever numbers your interactive job has
interactive_host="XX-YY"
bsub -Is -m "$interactive_host" "$SHELL"

# forward the port from the interactive job back to codon login
local_port=9000
middle_port=8888
jupyter_port=8080
ssh -R "$middle_port":localhost:"$jupyter_port" codon-login-XXX
```

4. Open a new tab in your web browser and enter the URL noted down in Step 2.
**Make sure you replace the `localhost:<PORT_NUM>` with the value of the `local_port` variable you set in the beginning.** For example, in the above we would change `http://localhost:8080/?token=<TOKEN>` to `http://localhost:9000/?token=<TOKEN>`.

# 13. Conclusion

I have surely missed stuff about using the cluster here. There are more things, but we could have these meetings from time to time, even if it is just 10 minutes to talk about something new. Using the cluster is like riding a bike: it is not much use to read and re-read this document - now that you know the basics, when the time comes that you need to run some job on the cluster, try it yourself. If it does not work, try to debug it yourself, or check if this document helps. The EBI intranet (https://intranet.ebi.ac.uk/) might help, or googling your issue. You can also always contact us on the `#codon-cluster` channel on Slack or, if you feel more comfortable, just message me directly with your issue and we can discuss it. I really like solving cluster issues, as I think it is really important to keep the group's work flowing, and to make others aware of potential issues, with a solution already in hand, before they face them.

# TODO

Incorporate EBI guides (search email for "EBI cluster guides")