---
title: Farm Cluster Notes
---
[toc]
# Log in
`ssh username@farm.cse.ucdavis.edu`
## ssh key
When signing up, you upload a public RSA key. Every time you log in with `ssh`, your system presents the matching private key, which the server checks against the public key it has on record. This is how the server identifies you when you log in.
See below on how to generate a key pair if you haven't done this yet.
### Generate an ssh key
1. Open a terminal
2. Type in the command below:
`ssh-keygen -t rsa`
You should see this:
`> Generating public/private rsa key pair.`
3. It will then ask you to put in a location to save the key file:
```
> Enter a file in which to save the key (/home/you/.ssh/id_rsa):
```
You can either press Enter to use the default location (`~/.ssh/id_rsa`) or type in a location you want (I use `~/.ssh/farm` so I know it's for farm and farm only).
4. Next it will ask you for a passphrase to protect the key file:
```
> Enter passphrase (empty for no passphrase): [Type a passphrase]
> Enter same passphrase again: [Type passphrase again]
```
Alternatively, you can just hit Enter without typing a passphrase if you don't want to protect your key with a password (this means you won't need to type a password to log in, but it's also less secure. Be advised.)
5. Once you finish creating your key file, you then need to add your key to your system's key manager (`ssh-agent`) so it knows to use this key to authenticate you whenever you try to log in. Type the commands below to do so:
```
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
```
Now you're all set!
## log in with less typing
1. Open (or create, if it doesn't exist) the ssh config file in the terminal with nano:
```{bash}
nano ~/.ssh/config
```
2. A text editor screen should pop up. In that screen, add in the text below:
```
Host farm
HostName farm.cse.ucdavis.edu
User username
```
Replace `username` with your username on farm.
3. Now you can log in to farm by simply typing:
```
ssh farm
```
4. And transfer data:
```
scp -P 2022 farm:~/path ./local/path
```
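If you saved your key under a custom name (e.g. `~/.ssh/farm` from the key-generation step above), you can also point the config entry at it with `IdentityFile`, so `ssh farm` knows which key to offer. A minimal sketch of the same entry:
```
Host farm
HostName farm.cse.ucdavis.edu
User username
IdentityFile ~/.ssh/farm
```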
## log in from another machine
Remember that every machine you're trying to log in from needs that same pair of keys. You can simply copy the key pair to the new machine and add it to the key manager (last step in [creating an ssh key](#Generate-an-ssh-key)). You may need to change the permissions on the key files so they are not accessible by anyone but you.
Use `chmod 600` to change the permissions on your key files.
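A minimal sketch of the whole process, assuming your key pair is named `~/.ssh/farm` and `newmachine` stands in for whatever address your new machine has:
```
# on your old machine: copy the key pair over ("newmachine" is a placeholder)
scp ~/.ssh/farm ~/.ssh/farm.pub newmachine:~/.ssh/
# on the new machine: restrict permissions and add the key to the key manager
chmod 600 ~/.ssh/farm
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/farm
```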
# Data Transfer
## Transfer between farm and local computer
Use `scp -P 2022` to transfer data to and from farm:
From farm to local:
`scp -P 2022 farm:~/path/file /local/dest/file`
From local to farm:
`scp -P 2022 /local/file farm:~/path/dest/file`
Note that if you didn't follow [this step](#log-in-with-less-typing), then you have to spell out `farm` as `username@farm.cse.ucdavis.edu`.
## Download data from web to farm
Normally you can use `wget`, `curl`, or `ftp` to download data from a web server (NCBI, ENA, sequencing center, etc.). But doing this directly on the head node will massively slow things down for everyone working on that node if your data is large.
To avoid this, log in through port 2022 to download:
`ssh -p 2022 farm` will log you in as normal, but now you can download data using the aforementioned commands without making everybody hate you :)
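For example, grabbing the sample file used in the `srun` section below:
```
# log in through port 2022 first, then download as usual
ssh -p 2022 farm
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
```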
## Transfer data between different servers (e.g. between farm and kaiden)
This is essentially the same as [Transfer between farm and local computer](#Transfer-between-farm-and-local-computer). Just treat the one you're currently logged on as local and the other one as remote.
Some considerations (a short sketch follows this list):
1. If you were to use kaiden as local:
    a. You need the key pair for the farm cluster in kaiden's `~/.ssh` directory.
    b. You need to use port 2022 (`scp -P 2022`) to transfer data to and from farm.
2. If you were to use farm as local:
    a. You can access kaiden with your password like you normally would.
    b. You need to first log in to farm through port 2022 (`ssh -p 2022 farm`).
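Putting those together, here is a rough sketch of pulling a file from kaiden while logged in to farm (the kaiden address and paths are placeholders; use whatever you normally `ssh` to):
```
# on your own computer: log in to farm through port 2022
ssh -p 2022 farm
# on farm: pull the file from kaiden, authenticating with your kaiden password
scp username@kaiden:/path/on/kaiden/file ~/destination/on/farm/
```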
# Submit a Single Command
## srun and parameters
Use `srun` to submit a single command to execute. In slurm language, this is called submitting a `job`. In order for the workload manager to allocate resources to you properly, you need to tell it:
1. Approximately how long this job will take
2. How many threads you need for this job
3. How much RAM you need for this job
4. Which partition you wish to use
More discussion on these parameters will follow.
Here is an example:
1. Download sample data:
```
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
```
2. Load the fastqc module:
`module load fastqc`
3. Submit a fastqc job:
```
srun -t 10 --mem=2g -p high -n 1 -N 1 -c 1 fastqc ERR458493.fastq.gz
```
4. You should then see the messages below:
```
srun: job 12946788 queued and waiting for resources
srun: job 12946788 has been allocated resources
```
Followed by output from fastqc:
```
Started analysis of ERR458493.fastq.gz
Approx 5% complete for ERR458493.fastq.gz
Approx 10% complete for ERR458493.fastq.gz
...
Approx 95% complete for ERR458493.fastq.gz
Analysis complete for ERR458493.fastq.gz
```
Note that during the fastqc run, you can't close your terminal window or the job will be killed.
## srun in the background
If you're running a job that takes a long time, you probably want to use something like `screen`, `tmux`, or `nohup`. This will allow you to log out and close the terminal while your job is still running:
1. Open up a session in screen:
```
screen -S fastqc
```
`-S fastqc` gives the screen session the name `fastqc`
2. Submit a job with srun:
```
srun -t 10 --mem=2g -p high -n 1 -N 1 -c 1 fastqc ERR458493.fastq.gz
```
3. Then you can press `Ctrl`+`A`, then `D`, to detach from this screen session
4. To get back in that session:
```
screen -r fastqc
```
This will bring back the session named `fastqc`.
Note that the standard output in a screen session can be hard to read. To get around this, add `-o stdout.txt` to redirect the standard output of your srun job to a file named `stdout.txt`, which you can then read at any time.
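For example, the same fastqc job with its output captured to a file:
```
srun -t 10 --mem=2g -p high -n 1 -N 1 -c 1 -o stdout.txt fastqc ERR458493.fastq.gz
```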
You can also use `tmux` or `nohup`. To each their own...
Do consider a [batch script](#Submit-a-Script-Job) though. I personally almost never use `srun` except in the scenario below:
## Run an interactive session
If you're like me and miss being able to test-run some commands, see the output, and terminate immediately if modifications are needed, then you want to start an interactive session on a compute node to do so.
(**Remember cluster rule #1: never run any serious computing job on head node. Don't make everybody hate you!**)
To start an interactive session:
```
srun -p high -t 24:00:00 --mem=20000 --pty bash
```
This session will be closed when either you log out or it exceeds 24 hours, whichever comes first.
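Inside that session you can work as if it were your own machine, e.g. reusing the fastqc example from above:
```
# you're now on a compute node, not the head node
module load fastqc
fastqc ERR458493.fastq.gz
# leave the interactive session (and release the allocation) when done
exit
```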
## common srun options
Below is a list of common srun options for your reference:
(For a complete documentation, see [here](https://slurm.schedmd.com/srun.html))
- `-t` time limit of the job, default is in minutes. You can also use format: "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". If the run time of a job exceeds this limit, it will be killed.
- `-n` number of tasks to run.
- `-N` number of nodes to run tasks on.
- `-c` number of cpus to use per task.
- `--mem` Specify the real memory required for each node.
- `-p` [partition](#Partition) to use.
Note: each partition has limits on job run time, number of nodes, memory size, and number of CPUs. If you ask for more than the partition allows, your job will be left pending indefinitely.
- `--mail-type`: used with `--mail-user`; sends emails to the address specified by `--mail-user` when the events specified by `--mail-type` happen.
Allowed values for `--mail-type` are: NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50.
The `TIME_LIMIT` events notify you when a job reaches 100%, 90%, 80%, or 50% of its time limit, respectively. This is useful for monitoring your job and requesting an extension (by contacting the admins) before it gets killed.
- `-J` Give the job a name. This is useful when submitting multiple jobs.
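Putting a few of these together (the email address is a placeholder):
```
srun -J fastqc -t 1-0 -n 1 -N 1 -c 1 --mem=2g -p high \
    --mail-type=END,FAIL --mail-user=user@email.com \
    fastqc ERR458493.fastq.gz
```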
# Submit a Script Job
A better way to utilize the farm cluster is to use a batch script instead of running commands with `srun`.
## batch script
To do this, you first need a working bash script. A minimal example may look like this:
`qc.sh`:
```
#! /bin/bash -login
module load fastqc
fastqc ERR458493.fastq.gz
```
Then below the first line of the script, add in parameters:
```
#! /bin/bash -login
#SBATCH -J fastqc
#SBATCH -t 20
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=5gb
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@email.com
module load fastqc
fastqc ERR458493.fastq.gz
```
## Submit batch script
Once you have the script ready, run the following command to submit this script:
```
sbatch qc.sh
```
Now you should see a message like this:
>Submitted batch job 12950981
This means your script has been submitted and is queued to run. Once the system can allocate the resources you requested, it will start executing the script.
## Check job state
To check the status of your job, run:
```
squeue -u username
```
Replace `username` with your username on farm.
You should see something like this:
```
JOBID PARTITION NAME USER ST TIME NODES CPU MIN_ME NODELIST(REASON)
```
`ST` refers to the job state: `R` means it's running and `PD` means it's pending. For a full list of state codes, see [here](https://slurm.schedmd.com/squeue.html#lbAG).
## Cancel a job
To cancel a specific job, do:
```
scancel jobid
```
Replace `jobid` with the job id of the job you're trying to cancel.
To cancel all jobs of a particular user:
```
scancel -u username
```
This will kill all jobs submitted by that user. You can, of course, only cancel your own jobs.
## Script output
Batch script output is by default written to a file named `slurm-<jobid>.out`.
You can redirect it to a file of your choice with the `-o` option to `sbatch`, or in the `#SBATCH` parameter section of your batch script. Be careful when putting a fixed filename in your script, as the same file might be written to by multiple jobs if it's not changed between runs.
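One way to avoid that is to use slurm's filename patterns, where `%j` expands to the job id; a sketch for the fastqc script above:
```
#SBATCH -o fastqc_%j.out
#SBATCH -e fastqc_%j.err
```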
# Best Practice With Conda
Conda is a package manager that handles most package installation and dependency resolution for you. The more you work in bioinformatics, the more you appreciate what it offers!
## Install conda
To install a minimal version of conda:
```
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Then run below code to config conda at start up:
```
conda init bash
echo "source ~/.bashrc" >> ~/.bash_profile
```
Now log out and log back in. You should be able to run:
```
conda
```
And see below message:
```
usage: conda [-h] [-V] command ...
conda is a tool for managing and deploying applications, environments and packages.
Options:
...
```
You should also see `(base)` before your username at the command prompt. This means you're currently in conda's base environment!
## Config conda channels
Packages like `bwa`, `samtools`, etc. are hosted on publicly available conda channels. Whenever you try to install a particular package, conda goes to those channels to look for it. Now we need to tell conda where to look for the packages we are interested in (bioinformatics programs):
```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```
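To double-check the channel order conda ended up with:
```
conda config --show channels
```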
## Create conda environment
A conda environment (env) contains a set of packages that are separated from those of other environments and from system packages. That is, you can have different versions of `samtools`, `python`, etc. without them interfering with each other. This mainly solves software dependency issues: two bioinformatics programs may depend on different versions of `python`, and without separate environments you're guaranteed to have a problem running both of them!
Use following code to create a conda env:
```
conda create -n fastqc
```
This will create a conda env named "fastqc"
Then you can "get into" the env by activating it with conda:
```
conda activate fastqc
```
Now the `(base)` before your username should have changed to `(fastqc)`. That means you're now in the "fastqc" env!
You can then install fastqc ***only*** in this env:
```
conda install fastqc
```
Now try to run fastqc:
```
fastqc --help
```
Or you can do it in one step:
```
conda create -n fastqc fastqc
```
You can leave the current env by either activating another env or deactivating the current one:
```
conda deactivate
```
I like to have one conda env for each tool I'm using, e.g. a `bwa` env for bwa, a `fastqc` env for fastqc, a `samtools` env for samtools, etc.
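With the one-step form above, that looks like:
```
conda create -n bwa bwa
conda create -n samtools samtools
conda create -n fastqc fastqc
```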
## Use conda with srun
You can simply activate a conda env and use the `srun` command to submit a job. There is no need for `module load` anymore if the tool you use is available in your conda env.
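For example, the earlier fastqc job without `module load`:
```
conda activate fastqc
srun -t 10 --mem=2g -p high -n 1 -N 1 -c 1 fastqc ERR458493.fastq.gz
```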
## Use conda with batch script
Once you have all your conda envs set up, you can replace the `module load` lines in your batch scripts with `conda activate`.
This is recommended, as you get tighter control of the environment, and it is much more reproducible and portable.
**And perhaps most importantly, you can install and use most tools you want instantly without waiting for sys admin to install for you.**
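A minimal sketch of the earlier `qc.sh`, assuming the `fastqc` env from above (the `-login` shell should pick up conda's setup from your profile):
```
#! /bin/bash -login
#SBATCH -J fastqc
#SBATCH -t 20
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=5gb

conda activate fastqc
fastqc ERR458493.fastq.gz
```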
## Using conda with snakemake
See [use conda in snakemake](#Use-conda-in-snakemake) below.
# Unlock True Power of HPC with Snakemake
## Use conda in snakemake
# More Notes
## Common settings I use
|Job|CPUs|Memory|Time|Partition|
|---|---|---|---|---|
|fastqc|1|2G|1h|bml|
|trim|6|5G|4h|bmm|
|bwa|12|30G|2d|bmm|
|samtools sort|4|5G|4h|bml|
|sambamba dedup|4|8G|4h|bmm|
|sambamba merge|5|10G|5h|bml|
|sambamba flagstat|4|4G|2h|bml|
|freebayes|36|70G|3d|bmm|
Notes: `sambamba` consistently outperforms `samtools` except in sorting. `sambamba` consumes a large amount of memory when sorting, and it is not clear to me how well it scales. So I'm using `samtools` for sorting and `sambamba` for everything else at the moment.
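As an example, the `bwa` row above translates to roughly this `#SBATCH` header (a sketch; adjust to your own data):
```
#SBATCH -J bwa
#SBATCH -c 12
#SBATCH --mem=30gb
#SBATCH -t 2-0
#SBATCH -p bmm
```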
## Threads and CPU
Strictly speaking, a CPU often refers to a physical computing core: an octa-core CPU has 8 computing cores, or 8 "CPUs". A thread refers to an actual computing thread. Most modern CPUs support a technology called Simultaneous Multithreading (SMT), and as a result each physical core provides two threads. So an octa-core CPU, while it has 8 cores, has 16 threads.
However, in HPC terminology and in the bioinformatics domain, "CPUs" and "threads" are often used interchangeably and usually refer to threads rather than cores.
We have access to a total of 96 CPUs (threads) as a group. For general usage guidelines, see [here](#Using-Farm-responsibly).
## RAM (memory)
Random-Access Memory (RAM), commonly called "memory", is like the short-term memory of a human brain. The information stored in RAM disappears once the system is turned off, but it can be accessed by CPUs very quickly. CPUs read data from RAM, do the computing, and put the results back into RAM. A sufficient amount of RAM is essential for scientific computing.
Some programs like `bwa-mem`, `STAR`, and `freebayes` are very RAM-hungry, while others like `samtools sort`, `samtools flagstat`, and `plink` require very little RAM. Requesting insufficient RAM for RAM-hungry jobs will result in jobs being killed, while requesting too much RAM will prolong your queue time, as the resources are less likely to be readily available.
We have priority access to 1TB of RAM on Farm. For general usage guidelines, see [here](#Using-Farm-responsibly).
## Storage
We have a total of 100TB of storage as a group. Currently there is no quota or per-person limit, but that may change as we approach the limit. As a general rule, use common sense and move data off to local drives if it hasn't been used in a reasonable amount of time.
## Time
## Partition
Partitions we have access to on farm are:
1. `bmh`: big memory, high priority. Jobs start as soon as resources are available and run uninterrupted until they finish (or fail).
2. `bmm`: big memory, medium priority. Jobs start as soon as resources are available. They can be paused if the resources are requested by a high-priority job and will resume as soon as possible.
3. `bml`: big memory, low priority. Jobs start as soon as resources are available. They can be terminated if the resources are requested by a higher-priority job and will restart as soon as possible.
## Using Farm responsibly
Unlike Kaiden, we do not have exclusive access to a set of computing resources; we share them with other researchers. Therefore it is essential that every user acts responsibly. Aside from not running heavy jobs on the login node, below are some general guidelines that we all agree to abide by, as outlined [here](https://github.com/dib-lab/farm-notes/blob/master/getting-started.md#Using-shared-resources):
> Users in `ctbrowngrp` collectively share resources. Currently, this group has priority access to 1 TB of ram and 96 CPUs. These resources are accessed using the big mem partition, bm*. As of February 2020, there are 31 researches who share these resources. To manage and share these resources equitably, we have created a set of rules for resource usage. When submitting jobs, if you submit to a bm* partition, please follow these rules:
> - bmh: use for 1. small-ish interactive testing 2. single-core snakemake jobs that submit other jobs. 3. only if really needed: one job that uses a reasonable amount of resources of “things that I really need to not get bumped.” Things that fall into group 3 might be very long running jobs that would otherwise always be interupted on bmm or bml (e.g. > 5 days), or single jobs that need to be completed in time for a grant or presentation. If your single job on bmh will exceed 1/3 of the groups resources for either RAM or CPU, please notify the group prior to submitting this job.
> - bmm: don’t submit more than 1/3 of resources at once. This counts for cpu (96 total, so max 32) and ram (1TB total, so max 333 GB).
> - bml: free for all! Go hog wild! Submit to your hearts content!
For my personal use, I generally submit all my jobs to `bml` except:
1. Mapping reads with `bwa` or `STAR`
2. Calling Variants with `freebayes`
3. Peak calling with `MACS2`
4. Identifying duplicates with `Picard` or `Sambamba`
These jobs usually run longer, and I therefore cannot afford to have them restart constantly. For these jobs I submit to `bmm`, and 99% of the time they start within 1 hour of queue time and run uninterrupted.
The `bml` partition is really good for a large number of small jobs like `fastqc`, `samtools flagstat`, etc., as they make use of "crumbs" of resources sitting idle on Farm that can't be utilized by larger jobs.