Cluster training

1. Logging in

1.1. Using VPN

  • Connect to VPN;
  • ssh you@codon-login

1.2. Using ssh proxy

  • This is my favourite way, as you don't need to connect to the VPN. It is also faster, since only your terminal interaction goes through EBI servers, and more private;
  • Unfortunately, EBI will disallow this, as it makes it easier for people to gain unauthorized access to your account. I think by the end of the year we will all need to use the VPN;
  • This is a multi-step approach, EBI explains how to do this here: https://intranet.ebi.ac.uk/article/remote-access-using-ssh
  • Once you are set up following EBI's guide, you can add this section to your local ~/.ssh/config to make things easier:
Host ebi-gate
    Hostname ligate.ebi.ac.uk
#    Hostname mitigate.ebi.ac.uk  # you can use mitigate if ligate has issues
    User <your_username>
    IdentityFile ~/.ssh/id_rsa.pub

Host codon-ext
    Hostname codon-login.ebi.ac.uk
    User <your_username>
    IdentityFile ~/.ssh/id_rsa.pub
    ProxyJump ebi-gate
    ForwardX11Trusted yes
  • Once this is all set up, you can simply do ssh codon-ext to connect to the codon cluster;

Your ssh connection might be closed after a few minutes of inactivity. To avoid this, add this to your ~/.ssh/config:

# keeping connections alive
Host *
    ServerAliveInterval 300
    ServerAliveCountMax 2

1.3. Post login stuff

Sourcing shell initialization files

I had some issues with sourcing shell initialization files on the codon cluster only. It seems that when we log in to the cluster, ~/.profile is loaded, but when we start a new terminal (e.g. submit an interactive job, or start a screen), ~/.bashrc is loaded. So I needed both files to always initialize the shell with my configs. The easiest way to deal with this is to link one to the other (e.g. I did ln -s ~/.profile ~/.bashrc). That way, whenever you change one of them, the change is reflected in the other, and the issue of sometimes one file being used and sometimes the other is solved.
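For reference, a minimal sketch of the linking step (the backup filename is just an example):

# if a real ~/.bashrc already exists, keep a backup of it first
[ -f ~/.bashrc ] && mv ~/.bashrc ~/.bashrc.bak
# make ~/.bashrc a symlink to ~/.profile, so both always load the same config
ln -s ~/.profile ~/.bashrc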

2. Basic filesystems

2.1. Software filesystem

  • /hps/software/users/iqbal
  • Fastest filesystem in the codon cluster;
  • Should be used just to store software you use frequently;
  • Relatively small filesystem: the quota for each group is 200GB/Group, even though the login message says 20GB/Group. When you log into the codon cluster, you are welcomed with this message, which contains a typo:
--   _____________________________________________________________________   --
--  |                 |                        |              |   Data    |  --
--  |   File System   |       Description      |    Quotas    | Retention |  -- 
--  |_________________|________________________|______________|___________|  --
--  |                 | User/Group Software    |              |           |  --
--  | /hps/software   | and Conda envs         |  20GB/Group  |           |  --
--  |_________________|________________________|______________|__________ |  --

The typo is that the quota for /hps/software is NOT 20GB/Group, but 200GB/Group. Although the quota is fairly large, we should only store software, conda environments, etc. here. Don't store containers, as these can be quite heavy. A quick way to check your own usage is shown at the end of this subsection.

  • Writable from login and worker nodes;
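A quick way to see how much of the group's quota your own directory on the software filesystem is using (a rough check; <your_user> is a placeholder):

# total size of your directory on the software filesystem
du -sh /hps/software/users/iqbal/<your_user>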

2.2. HPS

  • /hps/nobackup/iqbal/
  • Large and fast non-backed up filesystem;
  • Should be used as the workdir of your pipelines and scripts (i.e. your pipelines and scripts should create intermediary and temporary files here);
  • It is the filesystem you will normally be using;
  • Not writable from login nodes, writable from worker nodes;

2.3. NFS

  • /nfs/research/zi (will soon be changed to /nfs/research/iqbal);
  • Slowest filesystem of the 3, but it is backed up weekly;
  • Should be used to store raw data, or important data that you cannot afford to lose. It should not be used to store temporary files or data that you can easily reproduce;
  • Can be used as input to pipelines, but not as a workdir (i.e. pipelines should not be creating files in /nfs), although some people have reported pipelines that had issues running on /hps but worked just fine on /nfs;
  • Writable from login and worker nodes;

2.4. Troubleshooting slow read and write speed in filesystems

Sometimes a filesystem gets slow because it is overloaded with users, or because it is close to full capacity. When that happens, your pipelines and jobs start to run a lot slower, as in bioinformatics almost everything relies on reading and writing data to disk. If you suspect a filesystem you are using is overloaded and slow, you can use this script to verify it: https://github.com/leoisl/test_disk_speed_in_cluster . Unfortunately, this only helps you confirm there is an issue, it does not solve it. The solution is to wait for the slowdown to be fixed, or to migrate to another filesystem that is not overloaded.
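If you just want a rough number without the script, a minimal manual check could look like the following (a sketch: run it on a worker node, on the filesystem you suspect is slow, and make sure you have ~1GB of free space; the path is a placeholder):

# write ~1GB and let dd report the throughput when it finishes
cd /hps/nobackup/iqbal/<your_user>
dd if=/dev/zero of=dd_speed_test bs=1M count=1024 conv=fsync
rm dd_speed_test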

3. Submitting jobs

3.1. Hello world

bsub -o hello_world.o -e hello_world.e echo Hello world!

  • hello_world.o: output stream (stdout) contents will be written to this file;
  • hello_world.e: error stream (stderr) contents will be written to this file;

An annotated look at hello_world.o:

Hello world!  <------- Output of your command

<This is a footer (a job summary) automatically added by LSF about your job execution>
------------------------------------------------------------
Sender: LSF System <lsf@hl-codon-38-03>
Subject: Job 1819497: <echo Hello world!> in cluster <codon> Done

Job <echo Hello world!> was submitted from host <hl-codon-06-02> by user <leandro> in cluster <codon> at Thu Apr 29 20:33:02 2021
Job was executed on host(s) <hl-codon-38-03>, in queue <standard>, as user <leandro> in cluster <codon> at Thu Apr 29 20:33:02 2021
</homes/leandro> was used as the home directory.
</hps/nobackup/iqbal/leandro/cluster_training> was used as the working directory.
Started at Thu Apr 29 20:33:02 2021
Terminated at Thu Apr 29 20:33:04 2021
Results reported at Thu Apr 29 20:33:04 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
echo Hello world!
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   0.02 sec.
    Max Memory :                                 -
    Average Memory :                             -
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   2 sec.
    Turnaround time :                            2 sec.

The output (if any) is above this job summary.



PS:

Read file <hello_world.e> for stderr output of this job.


3.2. Improving Hello world

  1. Use scripts:
echo "sleep 30; echo Hello World!" > hello_world.sh
  2. Give your job a name: -J <jobname>
  3. Ask for an amount of memory for your job (there is a default of 4GB, I think): -M <amount_of_RAM_in_MB>
bsub -o hello_world.o -e hello_world.e -J hello_world -M 1000 bash hello_world.sh
  4. See your job running: bjobs (or bjobs -w):
bjobs -w
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1819503 leandro RUN   standard   hl-codon-06-02 hl-codon-17-01 hello_world Apr 29 20:43
  5. ...and finished:
$ bjobs -wa
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1819503 leandro DONE  standard   hl-codon-06-02 hl-codon-17-01 hello_world Apr 29 20:43
  6. Now we can see some different run stats:
Resource usage summary:

    CPU time :                                   0.13 sec.
    Max Memory :                                 8 MB
    Average Memory :                             8.00 MB
    Total Requested Memory :                     1000.00 MB
    Delta Memory :                               992.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                5
    Run time :                                   43 sec.
    Turnaround time :                            32 sec.

3.3. Asking for multiple cores or threads

  1. Use -n <number_of_CPUs>: bsub -o hello_world.o -e hello_world.e -J hello_world -M 1000 -n 8 bash hello_world.sh

ATTENTION!! Submitting a job asking for 1 CPU and then running your tool with more than 1 CPU (2, 4, 8, etc.) is evil. You are telling the job scheduler you need 1 CPU, but you use 8. If everyone does this, we will have workers with 50 CPUs trying to do the work of 100 or more CPUs, and everyone's jobs will run very slowly. There is no way for the job scheduler to guess how many CPUs your job will use, and if you use more CPUs than you asked for, the job scheduler won't kill your job (unlike RAM, where it will).

  2. Better host selection: -R "select[mem>1000] rusage[mem=1000]";

  3. Testing for filesystem errors: -E 'test -e /homes/<your_username>' (a combined example is shown below);
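Putting the flags from this section together, a full submission could look something like this (a sketch reusing the memory value, job name and script from the earlier examples):

# 8 CPUs on a single host, ~1GB of RAM, and a pre-check that /homes is reachable
bsub -o hello_world.o -e hello_world.e -J hello_world \
     -n 8 -M 1000 -R "select[mem>1000] rusage[mem=1000] span[hosts=1]" \
     -E 'test -e /homes/<your_username>' \
     bash hello_world.sh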

AHHH this is too much!

3.4. Simplifying our lives: bsub.py

This is a wrapper script created by Martin that greatly simplifies submitting jobs to the cluster.

Installation: pip install git+https://github.com/sanger-pathogens/Farmpy

Usage:

$ bsub.py --help
usage: bsub.py <memory> <name> <command>

Wrapper script for running jobs using LSF

positional arguments:
  memory                Memory in GB to reserve for the job
  name                  Name of the job
  command               Command to be bsubbed
optional arguments:
  -h, --help            show this help message and exit
  .
  .
  .
  --threads int         Number of threads to request [1]

3.4.1. Submitting a job with bsub.py:

bsub.py 1 hello_world bash hello_world.sh

3.4.2. Submitting a job with bsub.py asking several cores:

bsub.py --threads 8 1 hello_world bash hello_world.sh

4. Monitoring jobs

Use bjobs. Most used parameters:

  • Wide formatting: bjobs -w
  • Details about a job: bjobs -l <job_id>
  • Seeing all jobs: bjobs -a

5. Troubleshooting

This is probably the most important section. When the cluster runs well and you are submitting jobs correctly, everything runs fine. When things break (e.g. the cluster is overloaded, you have stuck jobs, you were scheduled on a bad node, or you submitted jobs incorrectly), you start spending a lot of time understanding and fixing what is happening.

5.1. Out of RAM jobs

Let's simulate a job that uses ~6 GB of RAM, but only ask LSF for 2 GB (in reality, heavy-RAM jobs are jobs using > 100 GB, but it is hard to control memory allocation precisely in python):

bsub.py 2 heavy_RAM_job python -c \""bytearray(10_000_000_000)"\"

Checking heavy_RAM_job.o:

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 137.

Resource usage summary:

    CPU time :                                   0.48 sec.
    Max Memory :                                 2000 MB
    Average Memory :                             2000.00 MB
    Total Requested Memory :                     2000.00 MB
    Delta Memory :                               0.00 MB
    Max Swap :                                   -
    Max Processes :                              -
    Max Threads :                                -
    Run time :                                   9 sec.
    Turnaround time :                            4 sec.

Once we see TERM_MEMLIMIT: job killed after reaching LSF memory usage limit., it means that our job used more memory than we asked for, so LSF killed it.

Let's now ask for 10 GB (4 GB of margin):

bsub.py 10 heavy_RAM_job python -c \""bytearray(10_000_000_000)"\"

It runs fine!

Successfully completed.

Resource usage summary:

    CPU time :                                   2.11 sec.
    Max Memory :                                 6459 MB
    Average Memory :                             6459.00 MB
    Total Requested Memory :                     10000.00 MB
    Delta Memory :                               3541.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                4
    Run time :                                   3 sec.
    Turnaround time :                            6 sec.

Successfully completed. -> your job ran fine!

5.2 Stuck or slow jobs

Suppose you have a job that is supposed to do a lot of computation (e.g. running on 20 threads for several days). After a full day of work, you do a bjobs -l <job_id> and you see this:

...
Thu Apr 29 22:19:01: Resource usage collected.
                     The CPU time used is 6 seconds.
                     MEM: 7 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 4
                     PGID: 3124158;  PIDs: 3124158 
                     PGID: 3124159;  PIDs: 3124159 
                     PGID: 3124161;  PIDs: 3124161 
...

This means that your job was "running" for a whole day, but actually executed on a CPU for just 6 seconds. In this case, your job is stuck. There are several reasons for jobs to get stuck. The most common one is getting stuck mounting unresponsive filesystems when starting a container (see 11.3. Solving singularity stuck issue), but there are others. Your job could also be very slow if the worker node is overloaded, but this rarely happens. These are unusual situations, but in any case, if you need to debug why your job is stuck, I would recommend logging into the worker node and checking what is happening. You can do this by:

  1. Checking which worker node your job is running on:
bjobs -w
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1819493 leandro RUN   standard   hl-codon-21-01 hl-codon-06-02 /usr/local/bin/bash Apr 29 20:29

EXEC_HOST is the worker node. Then you can log into the worker node:

bsub -Is -m hl-codon-06-02 "$SHELL"

Here, you are in an interactive job (more on interactive jobs in section 8. Interactive jobs). You can run top, etc. to monitor what is going on on the worker node; a few quick checks are sketched below.
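For example (a sketch; <your_tool> is a placeholder for whatever your job runs):

top -u "$USER"              # is your process actually using CPU?
ps aux | grep <your_tool>   # is the process alive? a "D" state usually means it is stuck on I/O
df -h /tmp                  # is the node's local /tmp full?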

5.3. Jobs that never run, always pending

When you submit a job, it first goes to the PEND state: it is waiting for LSF to find a suitable host to run it on. Depending on how overloaded the cluster is, your job might take a long time to get scheduled. You might want to investigate why it is not being scheduled, and maybe change some submission parameters to get things going. You can do this by inspecting the pending job's details with bjobs -l <job_id>. For example, let's submit a job that requires 200 CPUs (there are no worker nodes capable of running this job in codon):

bsub.py --threads 200 1 pending_job ls

We can see that our job is pending:

$ bjobs -w
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1819560 leandro PEND  standard   hl-codon-06-02    -        pending_job Apr 29 22:37

Let's find out why:

bjobs -l 1819560
...
 PENDING REASONS:
 Not enough job slot(s): 187 hosts;
...

This is saying that none of the 187 available hosts has enough job slots (CPUs) for the 200 we are asking for. There might be several reasons why your job is pending, and they are all listed in the PENDING REASONS section.

6. Killing jobs

6.1. Killing a single job

If you need to kill a job, use bkill <job_id>. For example, to kill the previously pending job (or a job that you submitted incorrectly):

bkill 1819560
Job <1819560> is being terminated

If we run bjobs -wa, we will see that the job was killed:

bjobs -wa
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1819560 leandro EXIT  standard   hl-codon-06-02    -        pending_job Apr 29 22:37

6.2. Killing all your jobs

bkill 0

6.3. Killing specific jobs

To kill some specific jobs, but not all, you can use this handy bash function (add it to your ~/.bashrc to always have it available):

# print the bkill command to kill the jobs whose name grep-matches the given argument
grep_bkill() {
    echo "To kill jobs grep-matchin $1, run:"
    bjobs -w -noheader | grep "$1" | awk '{print $1}' | xargs echo bkill
}

For example, I have 3 short and 3 long jobs running (the job name identifies the long and the short jobs):

$ bjobs -w
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1820171 leandro RUN   standard   hl-codon-15-02 hl-codon-22-02 long_1     May  5 00:12
1820172 leandro RUN   standard   hl-codon-15-02 hl-codon-18-03 long_2     May  5 00:12
1820173 leandro RUN   standard   hl-codon-15-02 hl-codon-18-03 long_3     May  5 00:12
1820174 leandro RUN   standard   hl-codon-15-02 hl-codon-18-03 short_1    May  5 00:12
1820175 leandro RUN   standard   hl-codon-15-02 hl-codon-18-03 short_2    May  5 00:12
1820176 leandro RUN   standard   hl-codon-15-02 hl-codon-17-01 short_3    May  5 00:12

If I find an error with the long jobs, I can kill them all by running:

$ grep_bkill long
To kill jobs grep-matching long, run:
bkill 1820171 1820172 1820173

Note that grep_bkill doesn't actually kill your jobs, it just gives you the bkill command to kill them. This allows you to double-check that everything is fine before actually killing them. Now we can proceed and kill the jobs:

$ bkill 1820171 1820172 1820173
Job <1820171> is being terminated
Job <1820172> is being terminated
Job <1820173> is being terminated

7. Queues, worker nodes and submitting jobs to the bigmem queue

We can check the queues that we submit jobs to using bqueues:

$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
production      100  Open:Active       -    -    -    -     0     0     0     0
debug           100  Open:Active       -    1    -    -     0     0     0     0
gpu             100  Open:Active       -    -    -    -    40     0    40     0
standard         50  Open:Active       -    -    -    -   496     0   496     0
research         50  Open:Active       -    -    -    -     0     0     0     0
bigmem           50  Open:Active       -    -    -    -    12    12     0     0
datamover_debug  50  Open:Active       -    1    -    -     0     0     0     0
datamover        40  Open:Active       -   10    -    -     0     0     0     0
mpi              40  Open:Active       -    -    -    -     0     0     0     0
long             30  Open:Active     500   20    -    -     2     0     2     0

Jobs are submitted by default to the standard queue.

We can check stats (number of CPUs, max mem, etc) of each host (login and worker hosts) with lshosts:

$ lshosts 
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
codon-maste  X86_64 XeonE526  65.0    16 125.5G   1.9G    Yes (mg)
codon-maste  X86_64 XeonE526  65.0    16 125.5G   1.9G    Yes (mg)
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-11  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-21  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
...
hl-codon-bm  X86_64 XeonE526  65.0    16 755.6G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    16 755.6G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    16 755.6G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    24 755.5G   1.9G    Yes ()
hl-codon-bm  X86_64 XeonE526  65.0    24 755.5G   1.9G    Yes ()
...
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-01  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-02  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-03  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-03  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()
hl-codon-03  X86_64 XeonGold  65.0    96 376.3G   1.9G    Yes ()

Here we can see that we have several big-memory nodes (hl-codon-bm) with 755GB or 1.4TB of RAM. You might want to cherry-pick these worker nodes to run your jobs. We can also see that the normal (non-big-mem) nodes (hl-codon-01, etc.) have only 376GB of RAM, but a lot more cores (96). Big-mem nodes are in the bigmem queue, while most of the other nodes are in the standard queue.
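If you want to have a look at the big-memory hosts and the bigmem queue before cherry-picking a node, something like this works (bhosts and bqueues -l are standard LSF commands; the grep pattern just matches the host names shown above):

bhosts | grep "hl-codon-bm"   # status and slot usage of the big-memory hosts
bqueues -l bigmem             # detailed configuration and limits of the bigmem queue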

Let's say we want to run a job that needs 500GB of RAM. We won't even manage to submit it to the standard queue:

$ bsub -M500000 ls
MEMLIMIT: Cannot exceed queue's hard limit(s). Job not submitted.

But we can submit it to the bigmem queue (now using bsub.py):

$ bsub.py -q bigmem 500 heavy_job ls
bsub -q bigmem -E 'test -e /homes/leandro' -R "select[mem>500000] rusage[mem=500000]" -M500000 -o heavy_job.o -e heavy_job.e -J heavy_job ls
1819565 submitted

And it will run on one of the big-mem nodes.

8. Interactive jobs

Interactive jobs are nice for testing stuff out on a worker node before putting everything into a script and running it. It is like a terminal session, but on a worker node. To run interactive jobs, I highly recommend adding this function to your ~/.bashrc (credits to Michael):

# function to start an interactive job with given memory and threads
# usage: bsub_i <mem_in_gb> <threads>
bsub_i() {
    mem="${1:-1}"
    mem=$((mem * 1000))
    threads="${2:-1}"
    bsub -Is -n "$threads" -R "span[hosts=1] select[mem>${mem}] rusage[mem=${mem}]" -M"$mem" "$SHELL"
}

Then, to start an interactive job where we can use up to 16 GB and 4 threads, we run:

bsub_i 16 4

This will start a job in one of the worker nodes:

Job <1819566> is submitted to default queue <standard>.
<<Waiting for dispatch ...>>
<<Starting on hl-codon-08-03>>
(base) hl-codon-08-03:/hps/nobackup/iqbal/leandro/cluster_training
$ 

9. Screens

Screens are interactive sessions that live on login nodes, and that you can detach from and reattach to later. In a nutshell, you can use an interactive job inside a screen to do some work, and then, when you have to go home, for example, you can simply detach from the screen and turn off your laptop. Your screen still exists (it was not killed), and your interactive job lives inside the screen, so it was not killed either. The next day, you can simply reattach to your screen and resume working in your interactive job. Without screen, as soon as you lose the connection to the server (e.g. turn off the laptop) or exit the session, the interactive job is killed. Then on the next day you need to resubmit your interactive job and remember what you were doing to continue your work; you might also not have your bash history available. Screens solve this problem.

9.1. Creating a screen (with name hello_world)

$ screen -S hello_world

As soon as you create a screen, you are in it (i.e. attached to it) and can type commands. Let's do an ls:

(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ ls
090rc1  cluster_training  make_prg_0.2.0_prototype  new_precompiled_test  pandora_0.9.0_test  pandora_paper  pandora_paper_review  pandora_versions  pdrv  singularity_cache  temp

9.2. Detaching from the screen

Let's say I finished my work today, but want to continue tomorrow. I can simply detach from the screen. To do this, type <ctrl> a and then d. You will see:

[detached from 4093722.hello_world]

9.3. Listing available screens

When you come back to work, you want to resume your screens, but you have forgotten their names and how many you have. Just list your available screens:

$ screen -ls
There are screens on:
	4094952.test	(Detached)
	4093722.hello_world	(Detached)
2 Sockets in /run/screen/S-leandro.

The (Detached) means that the screen is not attached and can be attached to.

9.4. Attaching to a screen (with name hello_world)

screen -r hello_world

And we are back:

(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ ls
090rc1  cluster_training  make_prg_0.2.0_prototype  new_precompiled_test  pandora_0.9.0_test  pandora_paper  pandora_paper_review  pandora_versions  pdrv  singularity_cache  temp
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ 

9.5. Attaching to an already attached screen (with name hello_world)

Sometimes, for example when you close your laptop without detaching from the screen, it will not detach itself, and then you will have trouble resuming it. Let's see an example:

$ screen -ls
There are screens on:
	4093722.hello_world	(Attached)
2 Sockets in /run/screen/S-leandro.

Here, the hello_world screen is attached to a terminal, so I can't attach to it right now:

$ screen -r hello_world
There is a screen on:
	4093722.hello_world	(Attached)
There is no screen to be resumed matching hello_world.

What we can do is tell screen to detach it from wherever it is attached and resume it for us here, by adding the parameter -d:

$ screen -d -r hello_world

And we are back in the screen:

(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ ls
090rc1  cluster_training  make_prg_0.2.0_prototype  new_precompiled_test  pandora_0.9.0_test  pandora_paper  pandora_paper_review  pandora_versions  pdrv  singularity_cache  temp
(base) hl-codon-21-01:/hps/nobackup/iqbal/leandro
$ 

And from here you can do whatever you want.

There is much more to screens than what is written here; Google is a really good resource for learning more.

10. Modules

Systems manages a bunch of software through modules. If what you need is already available as a module, then just use it! There is plenty of documentation on the web explaining how to use modules, but here we will just see the most used commands, so that you can get going:

10.1. List all available modules

module avail

------------------------------------------------------------------------ /hps/software/spack/share/spack/modules/linux-centos8-sandybridge ------------------------------------------------------------------------
   ant-1.10.0-gcc-9.3.0-xzxbcc6                          hpctoolkit-2020.08.03-gcc-9.3.0-7e7d77v      likwid-5.1.0-gcc-9.3.0-5r3pmf3                 py-markupsafe-1.1.1-gcc-9.3.0-3hoepzr
   ant-1.10.7-gcc-9.3.0-o4ilc7d                          hwloc-1.11.11-gcc-9.3.0-6whrzxh              llvm-11.0.0-clang-11.0.0-trut6so               py-memory-profiler-0.57.0-gcc-9.3.0-wi4elxy
   autoconf-2.69-gcc-9.3.0-bpocjpe                       hwloc-1.11.11-gcc-9.3.0-udp27qs              llvm-12.0.0-gcc-9.3.0-4wfwsth                  py-numexpr-2.7.0-gcc-9.3.0-xt5ndff
   automake-1.16.3-gcc-9.3.0-j5atga2                     hwloc-2.2.0-clang-11.0.0-2r2csuw             llvm-openmp-9.0.0-clang-11.0.0-exvkysy         py-numpy-1.19.4-gcc-9.3.0-x2neh6p

10.2. Searching for the module that has what you need (e.g. singularity)

module keyword singularity

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The following modules match your search criteria: "singularity"
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  singularity-3.5.3-gcc-9.3.0-o6v53jz: singularity-3.5.3-gcc-9.3.0-o6v53jz

  singularity-3.6.4-gcc-9.3.0-yvkwp5n: singularity-3.6.4-gcc-9.3.0-yvkwp5n

  singularity-3.7.0-gcc-9.3.0-dp5ffrp: singularity-3.7.0-gcc-9.3.0-dp5ffrp

10.3. Loading a module

I don't have nextflow installed:

$ nextflow
bash: nextflow: command not found

Search for the module and load it:

$ module load nextflow-20.07.1-gcc-9.3.0-mqfchke
$ nextflow -v
nextflow version 20.07.1.5412

10.4. List all loaded modules

$ module list

Currently Loaded Modules:
  1) gcc-9.3.0-gcc-9.3.0-lnsweiq   2) cmake-3.19.5-gcc-9.3.0-z5ntmum   3) singularity-3.7.0-gcc-9.3.0-dp5ffrp   4) nextflow-20.07.1-gcc-9.3.0-mqfchke

10.5. Remove all loaded modules

$ module purge

Nothing is loaded:

$ module list
No modules loaded

10.6. Loading modules automatically

If you use some tools frequently, load them automatically by adding a module load line to your ~/.bashrc. For example, this is mine:

# load some default modules
module load gcc-9.3.0-gcc-9.3.0-lnsweiq cmake-3.19.5-gcc-9.3.0-z5ntmum singularity-3.7.0-gcc-9.3.0-dp5ffrp

11. Tips and misc

11.1. Running hundreds or thousands of jobs

The codon cluster is capable of running even hundreds of thousands of jobs, but keeping track of which jobs failed and have to be resubmitted, which failed due to RAM and have to be resubmitted with more RAM, which succeeded, etc. is messy and complicated. For complicated workflows, we should use workflow managers. The most commonly used in bioinformatics are snakemake (in python) and nextflow (in Groovy, a JVM language). You can manage workflows similarly with both. For snakemake, we have an LSF profile that automatically takes care of many LSF-cluster quirks and issues for us. This profile was started by Michael and is maintained by him, Brice, me and others. Most importantly, we have been using it for a couple of years already on EBI clusters, including codon, so it works well on our cluster. If you need a workflow manager and choose snakemake, I can help you in case of any issues (as well as Michael, Brice, etc.). For Nextflow, I think Martin or other people can help.
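For reference, a minimal sketch of what running a snakemake workflow through an LSF profile looks like (assuming snakemake is available, e.g. from a conda environment, and that the LSF profile is installed under ~/.config/snakemake/lsf with the name lsf):

# snakemake submits up to 100 jobs at a time to LSF via the profile and handles
# job dependencies, resources and resubmissions for you
snakemake --profile lsf --jobs 100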

11.2. Isolating your environment

You can try stuff out in whatever environment you like, but we strongly suggest that when you run something important, for a project or a paper, you do it in an isolated environment. An easy isolated environment to set up is a conda environment. If you want even more isolation, reproducibility, and ease for external people to reproduce your results or rerun your pipeline, go for containers (Docker or Singularity). On the cluster we only have singularity, as docker needs special permissions to run. But singularity can run docker containers, so you can write recipes for either, and you will be able to run both on the cluster through singularity. Using containers completely removes the but-it-worked-on-my-machine-or-cluster issue.
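As an illustration (the environment name and tools are just examples; the conda command assumes the bioconda/conda-forge channels are configured):

# an isolated conda environment for a project
conda create -n myproject python=3.9 samtools
conda activate myproject

# running a docker image from Docker Hub through singularity (no special recipe needed)
singularity exec docker://ubuntu:20.04 cat /etc/os-release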

11.3. Solving singularity stuck issue

If you use singularity to run containers, you might have some stuck jobs from time to time. By stuck I mean that the job has been running for, let's say, 1 day, but when you query its CPU time usage (with bjobs -l <job_id>), you see that it actually executed for just a few seconds. There is a high chance that singularity is just stuck trying to mount several paths predefined by systems. In summary, systems wants to simplify the life of everyone using singularity, so they ship a configuration in which almost every filesystem available in the codon cluster is mounted when a singularity container starts. It takes just one of these filesystems being overloaded or very slow for some other reason for all your jobs using singularity to get stuck. We, as a research group, don't use any special filesystems, just /nfs, /hps and /homes. Thus, you can add this to your ~/.bashrc:

# avoids singularity getting stuck
export SINGULARITY_CONTAIN=TRUE
export SINGULARITY_BINDPATH=/nfs,/hps,/homes,/tmp

This tells singularity not to mount the predefined paths specified by systems, and to mount only /nfs, /hps, /homes and /tmp, which is all we normally need.

11.4. Solving issues with not enough space in temp directory

Many tools use the /tmp directory to create temporary files during their execution. In the codon cluster, /tmp is a very fast filesystem (it is a local filesystem of the worker node), but it is also relatively small. There is no control over how much tmp space a job can use, so we could have 96 jobs running on a worker node and a single one of them could fill up the tmp space, and then every job that needs tmp will fail. More worrying is that some tools don't clean up their tmp files if they hit an error during execution. So, in summary, the /tmp filesystem is super fast, but unreliable. If you don't want to encounter these issues anymore (they look like this: Not enough free disk space on /tmp), you can change the temporary directory by adding this to your ~/.bashrc:

# avoids using node's tmp dir (it can fill up)
custom_temp_dir="/hps/nobackup/iqbal/<your_user>/temp"
if touch "${custom_temp_dir}/touch_test_to_change_temp_dir" 2>/dev/null ; then
    export TMPDIR="${custom_temp_dir}"
fi

Notes:

  1. For custom_temp_dir, change it to one of your own paths. It has to be a /hps path and the directory has to exist. It should not be a /nfs path (too slow, and temp files don't need to be backed up), nor a /hps/software path (it would just fill up your /hps/software quota and you wouldn't be able to install anything else);
  2. Your jobs might get slightly slower, as it is much faster to read and write from the worker node's /tmp dir than from /hps. If your job does a lot of I/O to the temp dir, the slowdown might be noticeable; otherwise it is negligible;
  3. Your jobs can still fail if /hps is full, but in that case the whole cluster goes down anyway;
  4. The if checks whether the custom temp dir is writable. If it is, we change the default temp dir to it; otherwise, no change is made. This is required because the /hps filesystem is not writable from login nodes, just from worker nodes, but we need a writable temp dir on login nodes for several reasons, one of them being job submission: bsub creates temporary files when submitting jobs, and if the temp dir is not writable, you won't be able to submit any jobs (see the sanity check below);
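A quick sanity check that the change took effect (run it from a worker node, e.g. inside an interactive job, since /hps is not writable from login nodes):

# should print /hps/nobackup/iqbal/<your_user>/temp on a worker node
echo "$TMPDIR"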

11.5. Some handy bash functions

#show full path in terminal
PS1="\[\`if [[ \$? = "0" ]]; then echo '\e[32m\h\e[0m'; else echo '\e[31m\h\e[0m' ; fi\`:\$PWD\n\$ "

# function to start an interactive job with given memory and threads
# usage: bsub_i <mem_in_gb> <threads>
bsub_i() {
    mem="${1:-1}"
    mem=$((mem * 1000))
    threads="${2:-1}"
    bsub -Is -n "$threads" -R "span[hosts=1] select[mem>${mem}] rusage[mem=${mem}]" -M"$mem" "$SHELL"
}

# function to start an interactive debug job in a host
# usage: bsub_debug <host>
bsub_debug() {
    host=$1
    bsub -Is -m "$host"  "$SHELL"
}

# print the bkill command to kill the jobs whose name grep-matches the given argument
grep_bkill() {
    echo "To kill jobs grep-matchin $1, run:"
    bjobs -w -noheader | grep "$1" | awk '{print $1}' | xargs echo bkill
}

# simple job monitoring command
alias monitor_jobs="watch -n 30 \"bjobs -w | awk '{print \\\$3}' | sort | uniq -c\""

12. Running jupyter notebooks from the cluster

This is copy-pasted from Michael Hall's instructions. All credit goes to him.

Jupyter on the cluster

The first thing we will define is some port number variables that we use in different places, to make them easy to change:

local_port=9000
middle_port=8889
jupyter_port=8080 
  1. Start an interactive job on the cluster with whatever parameters you need:
bsub -Is "$SHELL" 
  2. Note the hostname the interactive job is running on and start jupyter:
echo "$HOSTNAME"
local_port=9000
middle_port=8889
jupyter_port=8080

# start jupyter
# export XDG_RUNTIME_DIR=""
jupyter notebook --no-browser --port="$jupyter_port"

# NOTE DOWN URL NOTEBOOK IS RUNNING ON
# http://localhost:8080/?token= 

If you get a permission error when trying to run jupyter notebook, then try running export XDG_RUNTIME_DIR="".

  3. In a new terminal window (from your local machine):
local_port=9000
middle_port=8889
jupyter_port=8080

# change codon-login to however you log in into codon
ssh -L "$local_port":localhost:"$middle_port" codon-login

# log into the interactive job you have running
# change XX-YY to whatever numbers your interactive job has
interactive_host="XX-YY"
bsub -Is -m "$interactive_host" "$SHELL"

# forward the port from the interactive job back to codon login
local_port=9000
middle_port=8889
jupyter_port=8080
ssh -R "$middle_port":localhost:"$jupyter_port" codon-login-XXX
  4. Open a new tab in your web browser and enter the URL noted down in Step 2. Make sure you replace the port in localhost:<PORT_NUM> with the value of the local_port variable you set at the beginning. For example, in the above we would change http://localhost:8080/?token=<TOKEN> to http://localhost:9000/?token=<TOKEN>

13. Conclusion

I have surely missed stuff about using the cluster here. There are more things, but we could have these meetings from time to time, even if it is just 10 minutes to talk about something new. Using the cluster is like riding a bike: it is not much use to read and re-read this document. Now that you know the basics, when the time comes that you need to run some job on the cluster, try it yourself. If it does not work, try to debug it yourself, or check if this document helps. The EBI intranet (https://intranet.ebi.ac.uk/) might help, or googling your issue. You can also always contact us on the #codon-cluster channel on Slack or, if you feel more comfortable, just message me directly with your issue and we can discuss it. I really like solving cluster issues, as I think it is really important to keep the group's work flowing, and to make others aware of potential issues and already have a solution ready when they face them.

TODO

Incorporate EBI guides (search email for "EBI cluster guides")