# What might be useful to know when working with the MPIPZ HPC cluster
I have tried to summarize what you will likely be doing often on the MPIPZ HPC cluster, and to provide links for further help where possible. Sometimes there are more specialized "pro-tips" in quote blocks.
## What is the MPIPZ HPC cluster?
A few powerful computers sharing common file storage. The most powerful of them (`hpc001..006`) are only usable through the `Slurm` job submission system, which distributes compute time fairly. The rest (`dell-node-1..12`) have no such restrictions.
A helpful summary by Saurabh is [here](https://rucola.mpipz.mpg.de/p/9834/) (intranet, so with VPN or from the institute). The part about LSF is not true anymore though.
## Where can I get help?
- ask around! We all sit at these computers all day long anyway
- rucola.mpipz.mpg.de -- A forum with questions answered by Saurabh, our bioinformatics expert. He is often helpful in figuring out why some errors occur.
- it@mpipz.mpg.de -- The IT team can help you set up your PC, install new software etc.
- for coding questions, you can also ask Google or ChatGPT.
## Daily routines
### Connecting to the cluster
We connect to the cluster using the so-called "SSH protocol". It allows us to use the command line of the remote server from the command line of a local PC.
At the institute or on VPN, open the command-line prompt of your computer and type:
```
ssh -Y username@dell-node-11.mpipz.mpg.de
```
Replace `username`, and you may replace `dell-node-11` with any other server (e.g. `hpc001` if you want to submit a `Slurm` job).
If you can't get VPN access and want to work from outside the institute, ask the IT team for access to the `cucumber` server. Then you can start an interactive SSH session on the cluster as follows:
```
ssh -Y -oHostKeyAlgorithms=+ssh-dss -t username@cucumber.mpipz.mpg.de -t ssh dell-node-11
```
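If the `ssh` on your machine is recent enough, the same two-hop connection can also be written with the `-J` (ProxyJump) flag. A minimal sketch, assuming your username is the same on both machines (if cucumber still needs the `-oHostKeyAlgorithms=+ssh-dss` option, keep the two-`ssh` form above or put that option into your `~/.ssh/config`):
```
# hop through cucumber straight to a dell-node in one command
ssh -Y -J username@cucumber.mpipz.mpg.de username@dell-node-11.mpipz.mpg.de
```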
> **PRO-TIP**: the `-Y` flag enables X11 forwarding, which lets the remote server display graphics on your machine in addition to the command-line text. You can test whether it works by running `xclock`; if it does, you can e.g. do `display image.png`. It can be tricky to set up on Windows; there you can use the `MobaXterm` terminal application, which handles this for you.
### Navigating the filesystem, organizing files
Once you have connected to the server, you are using the **Linux command line** of the cluster. Specifically, we use what is called the "`bash` shell". It has many commands for working with files and directories.
If you are new to the Linux command line, [this](https://ryanstutorials.net/linuxtutorial/) looks like a good tutorial to start with (all chapters are relevant for us!), and a cheatsheet with most of the useful commands is [here](https://bioinformaticsworkbook.org/Appendix/Unix/UnixCheatSheet.html#gsc.tab=0).
If you want to know a bit more context, check out the (overly) comprehensive ["Linux for bioinformatics" course material](https://drive.google.com/drive/folders/1J8C2olv2yQsBk-vi6VuyzXWE4870yu8E).
As you will see, there are four main filesystem partitions accessible to you:
- `/netscratch/dep_mercier/grp_novikova/` or "netscratch" -- the partition where we mostly work. We have more space for experiments there, but the files are not backed up.
- `/biodata/dep_mercier/grp_novikova/` or "biodata" -- here we mostly store raw data (sequencing, microscope images etc.) and some crucial results (VCF files, genome assemblies, BAM files). This is backed up regularly, but because of that we should not move things around there (otherwise multiple copies of a file will end up in the backups). If something on "biodata" is no longer needed but should stay in the backups for good, we can *archive* it to free up space -- Anna (aglushkevich@mpipz.mpg.de) does that.
- `/groups/dep_mercier/grp_novikova/` or "groups" -- not much happens here, but this is normally used as space for photographs from some events etc. -- and occasionally for some experimental data.
- `/home/$USER` where `$USER` is your username -- this is your *home directory*, where you find yourself when you connect to the server. You cannot store more than 60G there, and storing files there is generally not advisable because *only you* will have access to them. The exceptions are configuration files (e.g. `.bashrc` and `.bash_profile` -- lists of `bash` commands to be executed on every login, setting up custom functions etc.)
And here are a few useful places on netscratch:
- `/netscratch/dep_mercier/grp_novikova/software` -- some programs that were not available for everyone so we installed them for ourselves
- `/netscratch/dep_mercier/grp_novikova/Scripts` -- the programs we write, with subfolders by username
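A few basic commands for looking around these locations (the folder name in the `du` line is just a hypothetical example):
```
cd /netscratch/dep_mercier/grp_novikova/   # go to our netscratch folder
ls -lh                                     # list its contents with human-readable sizes
du -sh my_project/                         # total size of one folder (hypothetical name)
df -h .                                    # how much space is left on the current partition
```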
### Looking into files
Besides the "normal" text files best handled by native Linux tools, we often work with some special file formats, it's good to know the most common of them:
- FASTA (.fasta, .fa) -- simplest sequence/alignment storage format, we use it e.g. for assemblies
- FASTQ (.fastq, .fq) -- raw sequencing data format (sequences + quality values)
- BAM (.bam) -- read mapping format. Pretty complex, see [specifications](http://samtools.github.io/hts-specs/SAMv1.pdf).
- VCF (.vcf, more often compressed .vcf.gz) -- variant calling format. Very complex, you will most likely need to consult [specifications](http://samtools.github.io/hts-specs/VCFv4.4.pdf) if you work with it.
- GFF (.gff3, .gff) -- genomic feature annotation format (basically type of feature + coordinates).
- BED (.bed) -- genomic region format (in the minimal form -- just the coordinates).
Although these are all text-based formats and we sometimes process them like normal text, there are dedicated software tools for each of them that make life easier (a few example invocations are shown after this list). Consider the following tools:
- `seqkit` for FASTA and FASTQ
- `samtools` for BAM
- `bcftools` for VCF
- `agat` for GFF
- `bedtools` for BED
- there is also the very nice `csvtk` tool for tables (CSV and TSV)
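For example, a few quick ways to peek into such files (the file names are hypothetical; check each tool's `--help` for the full options):
```
seqkit stats assembly.fa              # basic statistics of a FASTA/FASTQ file
samtools view mapped.bam | head       # first alignment records of a BAM
bcftools view calls.vcf.gz | less -S  # browse a compressed VCF without unpacking it
zcat reads.fastq.gz | head -n 8       # first two reads of a gzipped FASTQ
```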
### Running and writing scripts
You can look for examples of our programs in the `Scripts/` folder on netscratch. A script suitable for execution on an HPC node would look like this:
```
#!/bin/bash
# The SBATCH comment lines specify run parameters for the submission system.
#SBATCH --job-name=example
#SBATCH --nodes=1
var1="value1"
var2=5
command1 ${var1}
command2 ${var2}
# Other people might use your script, so keep it clean and
# comment difficult places
command3
```
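Depending on the job, you will usually also want to request resources explicitly with additional `#SBATCH` lines. The values below are just hypothetical examples; see `man sbatch` for the full list of options:
```
#SBATCH --cpus-per-task=8        # number of CPU cores for the job
#SBATCH --mem=16G                # memory for the whole job
#SBATCH --time=12:00:00          # wall-clock time limit
#SBATCH --output=example.%j.log  # file for stdout/stderr (%j is replaced by the job ID)
```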
On dell-nodes, you would simply run this script (say, `script.sh`) like this:
```
bash script.sh
# or
/path/to/script.sh # or even ./script.sh if you are in the script's directory
```
The `#SBATCH` lines will then simply be ignored as they are no different from other comment lines.
On HPC nodes you would submit it through `Slurm`:
```
sbatch script.sh
```
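Once submitted, you can keep track of your jobs with the standard `Slurm` commands, roughly like this (replace `<jobid>` with the ID that `sbatch` prints):
```
squeue -u $USER     # list your queued and running jobs
scancel <jobid>     # cancel a job
sacct -j <jobid>    # accounting info (e.g. state, runtime) of a finished job
```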
Some scripts contain lines like `module load foo/bar`. Check the [Making yourself comfortable](#making-yourself-comfortable) section to see how to set up the module system.
> **PRO-TIP**: most likely at some point you will need to do something that is hard to accomplish with existing tools. We use simple scripting languages for these cases. You will likely need some knowledge of the most popular ones, `R` and `python`; look for tutorials and feel free to ask us for help. If you wonder which one to learn first: `R` is a bit harder to install and is really only a good fit for statistics and graphics on smaller datasets (e.g. <100M rows), but the code for these purposes is shorter; `python` is a general-purpose language, but the code for our tasks will be a bit more convoluted and more third-party packages will be needed.
> **PRO-TIP** for dell-nodes: if you want to run a long program without keeping your PC and the SSH session open, run it from within a `tmux` session and detach. You will then be able to reattach to it from a new SSH session. Read more about `tmux` [here](https://hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/). I play it safe and only use `tmux` sessions on dell-nodes.
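A minimal `tmux` workflow, in case you don't want to read the full guide right away (the session name is a hypothetical example):
```
tmux new -s analysis     # start a named session, then launch your program inside it
# detach with Ctrl-b followed by d; the program keeps running and you can log out
tmux ls                  # after reconnecting, list existing sessions
tmux attach -t analysis  # reattach to the session
```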
### Installing software
Often you don't find the right tool on the cluster and have to install it. There are a few ways to do it.
**The easiest** way is to use `conda` when possible. Set it up once with `conda init bash` or by loading a module (`module load mambaforge/self-managed/v23*`), and you will be able to create new environments in which the right version of a program is kept isolated to avoid conflicts:
```
conda create -n myenv softwarename
```
where `myenv` is the environment name (you will activate it with `conda activate myenv`) and `softwarename` is the name of the program (make sure it is available in the `conda` channels).
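A typical session with such an environment might then look like this (the tool and environment names are hypothetical):
```
conda activate myenv               # switch into the environment
softwarename --help                # the program is now on your PATH
conda install -n myenv othertool   # add another program to the same environment later
conda env list                     # list all your environments
conda deactivate                   # leave the environment when done
```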
**Second**, you can compile the software yourself ("build" it from the source code following the author's instructions), but success is not guaranteed in that case. It is easier to do on the build server (i.e. do `ssh build-stretch` first and then compile). It is best to put self-compiled software into `/netscratch/dep_mercier/grp_novikova/software`.
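The exact steps depend entirely on the tool, but a typical "configure and make" build could look roughly like this (the repository URL and paths are hypothetical; always follow the author's README):
```
ssh build-stretch                 # compile on the build server
# then, on the build server:
cd /netscratch/dep_mercier/grp_novikova/software
git clone https://github.com/someauthor/sometool.git   # hypothetical repository
cd sometool
./configure --prefix=/netscratch/dep_mercier/grp_novikova/software/sometool
make && make install
```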
### Making yourself comfortable
A lot of daily routines can be simplified with the right commands and setup, and all of us accumulate our favorite tricks in our own configuration files (e.g. `.bashrc`). Do it too!
Make yourself a folder in `/netscratch/dep_mercier/grp_novikova` and keep your day-to-day, less critical files there. Also, make yourself a scripts folder there so other people have access to your scripts.
Ask someone for their `.bashrc` and copy it to your home directory. E.g. take mine, `/netscratch/dep_mercier/grp_novikova/nikita/.bashrc`, but be aware that it's quite busy. Or start from scratch:
```
echo "source ~/.bashrc" > ~/.bash_profile
echo "source /opt/share/software/scs/appStore/modules/init/profile.sh" > ~/.bashrc
echo "export PATH=/opt/share/software/packages/miniconda3-4.9.2/bin:/opt/share/software/packages/miniconda3-4.9.2/condabin:/opt/lsf/8.3/linux2.6-glibc2.3-x86_64/etc:/opt/lsf/8.3/linux2.6-glibc2.3-x86_64/bin:$PATH" >> ~/.bashrc
```
Now, if you reload the shell or do `source ~/.bashrc`, you will get access to the *module system* with some programs that are preinstalled on the cluster in loadable environments. See `module avail` for a full list and do `module load name/version` to load one.
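For example (the module name and version here are hypothetical; copy the exact string from the `module avail` output):
```
module avail                 # list everything that can be loaded
module load samtools/1.9     # load one module (hypothetical name/version)
module list                  # see what is currently loaded
module unload samtools/1.9   # unload it again
```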
After some weeks or months, think about which long commands you type too often. They can be shortened with aliases or functions in your `.bashrc` like these:
```
alias ll="ls -lah --color" # aliases are used for single long commands
rebash() { source ~/.bashrc; } # functions are for chains of commands or commands with arguments
# Functions can have arguments, specified by $1, $2 etc. or altogether with $@:
tssh() { ssh -t -o User=$USER -Y "$@" 'tmux attach || tmux new'; } # connect to the specified node and attach to a tmux session there; create a new one if none exists.
```
### Backing things up
There are a couple of options to track changes in your files:
- Copy the most important things to biodata once in a while. We have a `scripts_backup` folder there; use it. E.g. I have the following alias in my `.bashrc` that I run once in a while:
```
alias sync.scripts="rsync -avhr --delete /netscratch/dep_mercier/grp_novikova/$USER/scripts/* /biodata/dep_mercier/grp_novikova/scripts_backup/$USER/"
```
- for small files like scripts, you can maintain a `git` repository. `git` is a version control system that saves all intermediate stages of your files. The most important `git` repos of our lab are kept at https://github.com/novikovalab (you will need a GitHub account; ask Polina to connect it to our organization). A minimal workflow is sketched after this list.
- Make sure to keep your most precious data (e.g. raw data) on biodata where it is backed up. Make sure it is archived before deleting it from there.
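A minimal `git` workflow for a scripts folder could look like this (the file name and remote URL are hypothetical; see a proper `git` tutorial for the details):
```
cd /netscratch/dep_mercier/grp_novikova/$USER/scripts
git init                              # start tracking this folder
git add my_script.sh                  # stage a file (hypothetical name)
git commit -m "add mapping script"    # record a snapshot
git remote add origin git@github.com:novikovalab/my-scripts.git   # hypothetical remote
git push -u origin main               # upload to GitHub (the branch may be called master)
```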
### Monitoring the cluster
You can see how busy the cluster is in a few ways:
- `Slurm` has tools for tracking the load on the HPC nodes (e.g. `squeue` and `sinfo`), although I have not explored them much yet. In principle they let you see exactly who is running what and how heavy it is; a few commands to start with are sketched below this list.
- There are monitoring websites for [dell-nodes](http://dell-head.mpipz.mpg.de/ganglia/) and [hpc-nodes](http://hpc-head.mpipz.mpg.de/ganglia/) (intranet only).
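A few commands to start with (these are standard tools, not cluster-specific):
```
sinfo     # Slurm: state of the HPC nodes and partitions
squeue    # Slurm: all currently queued and running jobs (add -u $USER for only yours)
htop      # dell-nodes: interactive view of CPU and memory usage (or plain top)
```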