For information about working on the cluster in general, see the [University of Massachusetts Unity Cluster Documentation](https://docs.unity.rc.umass.edu).
---
# Before You Get Started In The Cluster
Review our lab's guide to working in Unity: [Tips and tricks for working in Unity](https://hackmd.io/U4iKTg8CToWq-ZkymFqiEg)
### Skills to demonstrate prior to granting access to Unity:
1. Be able to proficiently navigate the cluster (e.g. using `ls`, `cd`, `cp`, and `mv`)
2. Make your own directories in your personal space and change their permissions so only you have access
3. Understand what the different queues are and when to use them
4. Be able to write a script (e.g. in nano, Emacs, or Atom)
5. Be able to direct input and output correctly (this helps to avoid overwriting issues and streamlines file organization)
6. Be able to run scripts on the interactive queue (unity-compute) and submit scripts to the long queue
7. Know the key rules we have for working on shared cluster space
8. Be able to look at FastQC reports and understand the different metrics
# Basic Rules and Considerations of Working in the Cluster
## Monitor your space usage
Because the project space is shared, the jobs of others in the lab could fail or be delayed if one person is using a disproportionate amount of space. Be sure to clean out your directory frequently (for example, checking on it once per week).
The command `du -sh [directory]` reports how much space a directory uses.
The command `du --max-depth=1 -h /project/uma_lisa_komoroske/` reports space usage for everyone in the project space, which can help you understand how much space is being used and how much is available in total.
Be **EXTREMELY** careful with the `rm` and `mv` commands.
You **should only use these commands within your own folder**, and even then *with caution*. As an extra precaution, it can be helpful to remove files through Unity OnDemand, a GUI application.
Use similar caution with `mv` when moving or renaming files.
## Do not navigate into other people’s directories within the project space.
You should only be moving within your own folders or within project-specific folders that you are working with.
This is different, of course, if you have a prior arrangement and/or are collaborating.
Compare md5sums when downloading or transferring files to and from the cluster (see the [MD5 checksums section](#MD5-checksums)).
## If running any commands that take time (e.g. >30s)
You should submit a request for an interactive job, so as not to pull resources from others on the head node.
`sbatch` is a non-blocking command, meaning there is no circumstance in which running it will cause it to hang: even if the requested resources are not available, the job is placed in the queue and starts running once resources become available. You can check a job's status with `squeue` while it is pending or running, and with `sacct` at any time.
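A minimal sketch of that workflow (the script name and job ID are illustrative):
```
sbatch my_job.sh   # returns immediately, printing something like "Submitted batch job 12345"
squeue --me        # show your pending and running jobs
sacct -j 12345     # show accounting info for a job, even after it has finished
```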
# Working In The Cluster
## Getting Started
The most traditional method of connecting to Unity is an SSH connection. SSH stands for "secure shell"; a shell is what you type commands into, and the most common shell on Linux is bash, which is what you will likely be using on Unity.
### Setting Up Your SSH Key
Using the [SSH protocol](https://docs.unity.rc.umass.edu/connecting/ssh.html), you can connect and authenticate to remote servers and services such as Unity, a high-performance computing cluster located at the Massachusetts Green High Performance Computing Center.
> Sign up for a Unity account at https://unity.rc.umass.edu/. You need a PI to endorse your account.
>
> If you are having trouble using the university generated SSH keys, you can generate your own using Tanya Lama's protocol located here:
> https://hackmd.io/@tlama/unity
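If you generate your own key, the general pattern looks like the sketch below (the host alias, username, and key filename are placeholders; check the Unity docs for the correct login hostname):
```
ssh-keygen -t ed25519 -f ~/.ssh/unity_key   # generate a key pair; register the .pub file with your Unity account
# Then add an entry like this to ~/.ssh/config:
#   Host unity
#       HostName unity.rc.umass.edu
#       User your_username
#       IdentityFile ~/.ssh/unity_key
ssh unity   # connect using the alias defined above
```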
---
## Managing Files on The Cluster
For more information on managing files on Unity visit the [Unity Cluster Documentation](https://docs.unity.rc.umass.edu/managing-files/intro.html)
### Globus Connect Personal
[Globus](https://app.globus.org/) allows for transfers between Globus collections. Visit the [Unity Cluster Documentation](https://docs.unity.rc.umass.edu/documentation/managing-files/globus/) to see the features of Globus and how to install it and set up an account.
### Unity OnDemand
[Unity OnDemand](https://ood.unity.rc.umass.edu/) is the easiest way to work with the Unity filesystem.
Visit the [Unity Cluster Documentation](https://docs.unity.rc.umass.edu/documentation/managing-files/ondemand/) to see all of the OnDemand features.
If you want to use RStudio with Unity OnDemand but have issues installing packages or dependencies, you can create a conda environment with R and all the packages/dependencies you need, then point OnDemand at that environment with an extra Slurm argument:
`--export=RSTUDIO_WHICH_R=<path/to/conda/env/r/executable>`
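A hedged sketch of setting up such an environment (the environment name and package list are illustrative):
```
conda create -n rstudio_env -c conda-forge r-base r-tidyverse   # build an R environment from conda-forge
conda activate rstudio_env
which R   # prints the path to pass to RSTUDIO_WHICH_R
```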
### MD5 checksums
#### What is an MD5 Checksum?
A checksum is a string of characters and numbers generated by running a cryptographic hash function against a file.
Checksums are often used to verify the integrity of files that are downloaded or to ensure that files have not been corrupted during storage or transmission.
It is incredibly important to compare md5sums when downloading or transferring files to and from the cluster. If two copies of a file are identical, they will produce the same string of characters/numbers; if one is corrupted, incomplete, or different, the two outputs will differ.
#### How to check the MD5 checksum of a downloaded file
1. Open a terminal window.
2. Type the following command: `md5sum [path to file]`
> Note: You can also drag the file to the terminal window instead of typing the full path.
3. Hit the Enter key.
4. You’ll see the MD5 sum of the file. *Match it against the original value.*
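For example (the filenames are illustrative):
```
md5sum sample_R1.fastq.gz   # prints the checksum followed by the filename
md5sum -c checksums.md5     # if the provider supplied a checksum file, verify every entry at once
```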
#### Troubleshooting
##### Why does my checksum not match the original value?
###### Wrong File
You may be checksumming a different file than the one the original value was generated for (for example, an older download or a similarly named file).
###### Corrupt Download
Files may become corrupted during download or transfer. *This can be a very big problem if data are lost and we aren't aware of it.*
##### Possible routes of correcting a checksum that does not match
1. Note all of your steps from downloading the software to troubleshooting – *URL, file name and size, etc.*
2. Try downloading the file again
## Submitting jobs to Unity
### Slurm, The Job Scheduler
[Slurm](https://docs.unity.rc.umass.edu/slurm/index.html) is the job scheduler we use on Unity. Once the scheduler has started your job, it runs on a node in the cluster, using the resources defined by your parameters.
### Partitions/Queues
Our cluster has a number of Slurm partitions (also known as queues) defined. As you may have guessed, you request a specific partition based on what resources your job needs. Find out which partition is best for your job [here](https://docs.unity.rc.umass.edu/technical/partitionlist.html).
### Preventing Loss of Work
**When your job reaches its time limit, it will be killed**, even if it's 99% of the way through its task. Without checkpointing, all those CPU hours will be for nothing and you will have to schedule the job all over again.
#### Time Limit Email
One way to prevent this is to check on your job's output as it approaches its time limit. You can specify `--mail-type=TIME_LIMIT_80`, and Slurm will email you if 80% of the time limit has passed and your job is still running. Then you can check on the job's output and determine if it will finish in time.
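In an SBATCH header, that looks like the following (the email address is a placeholder):
```
#SBATCH --mail-type=TIME_LIMIT_80         # email when 80% of the time limit has passed
#SBATCH --mail-user=your_email@umass.edu  # where to send the notification
```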
### Using SBATCH to Submit Jobs
##### Generic example of a script with common SBATCH parameters:
```
#!/bin/bash
#SBATCH -c 4 # Number of Cores per Task
#SBATCH --mem=8192 # Requested Memory
#SBATCH -p gpu # Partition
#SBATCH -G 1 # Number of GPUs
#SBATCH -t 01:00:00 # Job time limit
#SBATCH -o slurm-%j.out # %j = job ID
module load cuda/10  # load the CUDA module
/modules/apps/cuda/10.1.243/samples/bin/x86_64/linux/release/deviceQuery  # run an example GPU program
```
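To submit a script like the one above (the filename is illustrative):
```
sbatch gpu_example.sh
```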
### Modules
You can load and unload modules as you please, enabling and disabling different software. You can list currently active modules with `module list`, search for modules with `module avail`, and unload all active modules with `module purge`.
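For example (the module name and version are illustrative; check `module avail` for what is actually installed):
```
module avail samtools      # search for available samtools modules
module load samtools/1.14  # load a specific version
module list                # show currently loaded modules
module purge               # unload everything
```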
### Job Resources
#### Ensure the resources you request on the Unity cluster match the job you are doing.
For example, if you are asking for a job to use 15 cores on Unity, you need to make sure the program can actually thread across that many cores. If you request 15 cores but the program's manual says it cannot multithread, the job will still use only a single core while reserving all 15. Be mindful of this to ensure we are using resources appropriately.
If you are running a program that allows parallelization and encourages the use of multiple threads, specify this via the `-c` option in the SBATCH header, e.g. `#SBATCH -c 4` if you want to use 4 threads. Ideally, the `#SBATCH --threads-per-core` setting should always be set to 1 on Unity, per Georgia Stuart, as sketched below.
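A minimal header sketch for a multithreaded job:
```
#SBATCH -c 4                   # request 4 CPU cores for a multithreaded program
#SBATCH --threads-per-core=1   # one software thread per physical core
```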

#### Checking on your job
You can check your job with `squeue --me`.
If you submit a job and want to check the expected wait time before your script is run, use `squeue --me --start`
Here's a note from Unity support on how to interpret outputs from the above command:
<font size=2>"As it gets closer to the top of the priority list it will be able to provide a value. Note that even when it does, it's very pessimistic about it in that it assumes that the jobs on the nodes it is considering for the job will run until their timelimit, which is rarely the case."</font>
---
### Using scratch space
<font size=2>Notes largely from meeting on March 6, 2024 with Lisa, Jess, John and Cecile Cres (from Unity)</font>
Our goal is to make better use of [scratch space](https://docs.unity.rc.umass.edu/documentation/managing-files/hpc-workspace/), since we are about to drop from ~50 TB to ~20 TB. However, one issue is that we often have a lot of files that we need to keep around for the medium term.
Each scratch workspace keeps files for up to 30 days, and each workspace can be extended three additional times for 30 days each, so a workspace can last up to 120 days. Use `ws_list` to see the number of extensions you have remaining.
The following code samples show how to create a workspace and how to extend it for 30 days.
**create scratch workspace**
```
username@login2:~$ ws_allocate simple 14
Info: creating workspace.
/scratch/workspace/username-simple
remaining extensions : 3
remaining time in days: 14
```
**extend scratch workspace**
```
username@login2:~$ ws_extend simple 30
Info: extending workspace.
/scratch/workspace/username-simple
remaining extensions : 2
remaining time in days: 30
```
You can check on your active scratch workspaces and how much time remains in each using `ws_list`.
Cecile recommends that we use Globus for moving files between `/scratch` and `/nese`, since it's a "local" transfer: `/scratch` and `/nese` can communicate with one another, no problem. So we can keep the raw/trimmed files on `/nese` and reference them from `/scratch`.
We can receive an email when our scratch space purge deadline approaches -- see Unity documentation for how to set this up.
**RAD workflow**
- `/scratch` should work seamlessly with Stacks and our RAD workflow
- the main part of the workflow that we need more space for is parameter optimization, and there's no reason this should take longer than 90-120 days.
**WGR workflow:**
- trimmed files can easily work with `/scratch` (because they're generated and then used to make a .bam file)
- .bam files might be more challenging because they're needed for longer
---
# Unix Command Cheat Sheet
## Basic navigation
`ssh` Secure shell, log into a machine
`clear` Clears the terminal of previous commands
`bash` Enters the bash shell (some systems default to a different shell, such as tcsh)
`echo` Echo typed words, or a variable (which has $ before it)
`declare` Defines a variable, e.g. `declare myvar=hello`. No spaces are permitted unless escaped or quoted.
`pwd` Displays the absolute path to the current directory from the root (`/`)
`ls` Listing or "let's see"; lists the files in the current directory.
`man COMMAND` Displays the manual pages for the command
`cd` Change current directory
`mkdir` Make a new directory
`less FILE` Opens a readable file, does not allow you to edit it.
`nano` Opens a file (or creates a new one) in a simple terminal text editor
`mv FILE DESTINATION` Moves a file from one point to another
`mv FILE NEWNAME` Renames a file
`cp FILE DESTINATION` Copies a file from one point to another
`rm FILE` - removes a file (-r flag needed to remove a directory)
`scp FILE COMPUTER:DESTINATION` - copies a file from the local computer to another computer, into the location designated after the `:`
`scp COMPUTER:FILE DESTINATION` - copies a file from another computer to the current computer.
`head` - prints the first ten lines of a file to the terminal
`tail` - prints the last ten lines of a file to the terminal
`info COMMAND` - more extensive info than on the man pages
`date` - outputs date and time
`which COMMAND` - tells you where a program exists if it is within your PATH variable
`ln -s TARGET LINKNAME` - creates a symbolic link named LINKNAME that points to TARGET (e.g. linking a shared dropbox directory into your own directory).
Deleting a symbolic link does not affect the directory it points to (unless you use `rm -r`).
`gzip FILE` - compress a file
`gzip -d FILE.gz` - decompress a file; the file must end in .gz
`tar cf ARCHIVE.tar FILES` - tar files together into ARCHIVE.tar
`tar xfz ARCHIVE.tar.gz` - untar a gzipped archive: e(x)tract, (f)ile, g(z)ip
#### Hints:
`Tab` will complete names to the next point of ambiguity
Hit the `up key` to go back through your previous commands
You can chain together multiple directories for commands. Ex. `cd ../../..` - goes up 3 directories
`Ctrl-c` kills a running program
Add directories to your PATH with `declare PATH=$PATH:/path/to/directory`
`.bashrc` is the setup file for your bash terminal.
You can add any code to this file that you wish to be executed upon entering bash (such as adding directories to your PATH, aliases, variable declarations, etc.).
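An illustrative `~/.bashrc` addition (the directory and alias are placeholders):
```
export PATH=$PATH:$HOME/bin   # make scripts saved in ~/bin runnable from anywhere
alias ll='ls -lh'             # shortcut for a detailed, human-readable listing
```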
---
## Keyboard Shortcuts
`Control-A`: Move the cursor to the beginning of the line.
`Control-E`: Move the cursor to the end of the line.
`Control-K`: Delete everything from the cursor to the end of the line.
`Control-W`: Delete everything from the cursor backwards to the start of the preceding word.
`Control-U`: Delete everything on the line.
---
## Special Characters
`/` Separates directory names in a file path (ex: `MyDocuments/MyMovies/movie.mov`). Also refers to the "root" directory (`/`)
`\` The "escape character"; causes the single character following it to be read literally (ex: ``echo \` `` echoes a single backtick)
`$` Signals a variable. Inside double quotes (`""`), variables are still expanded (e.g. `echo "$HOME is my home"` prints your home directory); use single quotes or `\$` if you need a literal `$`.
`;` Used to signal the end of a command on a single line.
`#` Designates a comment in bash. Anything following it on the line will not be read by the shell.
`#!` Hash bang or shebang, tells the computer which program to use to interpret the code you are providing, e.g. `#!/bin/bash`
`:` Used to separate directory locations in PATH variable.
`>` Redirects a command's output into a file instead of printing it to the screen.
`&&` Used to execute commands sequentially, one after the other (Example: `command1 && command2`)
---
## Wildcards
`*` is a wildcard that matches zero or more characters, e.g. `test*.txt`, `test*`, `*`
`?` is also a wildcard, but it matches exactly one character, e.g. `test?.txt`, `test1?.txt`
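For example (the filenames are illustrative):
```
ls *.fastq.gz     # every gzipped FASTQ file in the current directory
ls sample_?.txt   # matches sample_1.txt or sample_A.txt, but not sample_10.txt
```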
---
## Pipes and Filters
`wc` counts lines, words, and characters in its inputs.
`cat` displays the contents of its inputs.
`sort` sorts its inputs.
`head` displays the first 10 lines of its input.
`tail` displays the last 10 lines of its input.
`command > [file]` redirects a command’s output to a file (overwriting any existing content).
`command >> [file]` appends a command’s output to a file.
`[first] | [second]` is a pipeline: the output of the first command is used as the input to the second.
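For example, a classic pipeline combining these (the filenames are illustrative):
```
wc -l *.txt | sort -n | head   # line counts of all .txt files, sorted numerically, smallest ten shown
```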
---
## How To Search Your Command History
Press `Control-R` to enter history search mode. In zsh, the command prompt then looks like this:
`bck-i-search: _`
(In bash, which you will likely be using on Unity, it looks like `` (reverse-i-search)`': `` instead.) Type a few characters that you remember being in an earlier command. If the right command appears, press Return to run it again; if not, keep pressing `Control-R` to look for additional matches.
---
## Absolute vs Relative Paths
An absolute path specifies a location from the root of the file system.
A relative path specifies a location starting from the current location.
```
cd .        # current working directory (stays put)
cd /        # root directory
cd /home/   # the /home directory, which contains user home directories
cd ..       # go up one level
cd ../..    # go up two levels
cd ~        # your home directory
cd          # with no argument, also returns to your home directory
```
---
## File System/System Information
`free -g` - gives you the amount of RAM on the current machine in GB
`du -h` - gives you disk usage information (in human readable format)
`df -h` - gives you info about amount of disk free (in human readable format)
`top` - shows all processes running and who owns them, their CPU usage, Memory usage, and other things (q to quit)
---
## Obtaining Data and Software
`wget http://...` Downloads file at http link (can also use with ftp link)
`gzip -d file.gz` - decompresses file.gz (can be a tar.gz file)
`bunzip2 file.bz2` - decompresses file.bz2 (can be tarred)
`unzip file.zip` - decompresses file.zip (can be tarred)
`tar -xf file.tar` - unpacks file.tar
`tar -zxf file.tar.gz` - decompresses and unpacks a tar.gz file
`tar -zcvf file.tar.gz targetdir` - makes `file.tar.gz` out of targetdir (`z` = gzip, `c` = create, `v` = verbose, `f` = file)
## Installing Software
`./configure --prefix=/path/to/install` Configures your source to install. Leave off the prefix if you have root and want to install to the system default (usually `/usr/local`)
`make` Creates binaries
`make install` Installs to the path specified in prefix
`arch` Check to see if we’re running a 32- or 64-bit machine
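Putting those together (the tool name and install path are illustrative):
```
./configure --prefix=$HOME/software/mytool   # configure with a user-writable install location
make                                         # compile the binaries
make install                                 # copy them into the prefix
export PATH=$HOME/software/mytool/bin:$PATH  # make the new binaries findable
```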
---
## Shell Scripts
> MEC Lab Scripts can be accessed in `MEC_lab_shared_resources/Data analysis_Resources/MEC Bioinformatics/Scripts`
Save commands in files (usually called shell scripts) for re-use.
`bash [filename]` runs the commands saved in a file.
`$@` refers to all of a shell script’s command-line arguments.
`$1`, `$2`, etc., refer to the first command-line argument, the second command-line argument, etc.
Place variables in quotes if the values might have spaces in them.
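A minimal sketch of a script using these (save it as, e.g., `process.sh` and run it with `bash process.sh file1 file2`):
```
#!/bin/bash
echo "First argument: $1"   # the first command-line argument
for f in "$@"; do           # loop over every argument, quoted in case of spaces
    echo "Processing $f"
done
```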
---
# Useful Resources
#### MEC Lab Data Management Procedures
> Can be accessed in `MEC_lab_shared_resources/Data_Management_group_docs`
---
#### [Harvard FAS Tutorials and Training](https://informatics.fas.harvard.edu/category/tutorials.html)
##### FAS Informatics provides a number of training sessions on everything from basic Linux to transcript assembly.
---
#### [Reference-guided RAD analysis workflow](https://hackmd.io/@jlastoll/rJ5zJ-xTL)
##### This workflow should be used for RAD-sequencing data where there is a reference genome available for the species of interest.
---
#### [SNP Genotyping with Freebayes](https://hackmd.io/jwZH8rn4SxCLVji5U4qncQ)
##### The goal of this workflow is to identify single nucleotide polymorphisms (SNPs) in genomic data and genotype individuals using these sequence variants.
---
#### [SNP discovery and genotyping](https://gist.github.com/MolEcolConsLab/10f3ea633a55dbc877d6ecade30c9f8c) (de novo-based RAD data):
##### This is a pipeline for processing RAD-Seq data and determining optimal parameters for assembling loci, calling SNPs, and calculating population statistics using the program Stacks.
---
#### [SNP filtering](https://github.com/MolEcolConsLab/RAD_SNP_filtering) (reference-based and de novo workflows):
##### This repository contains scripts, RMarkdown files, and associated documentation for SNP filtering downstream of [de novo (Stacks)](https://github.com/MolEcolConsLab/Stacks) and [reference based (BWA & Freebayes)](https://github.com/MolEcolConsLab/Reference-guided-RAD-data-analysis) RAD analysis workflows.
---
#### [Practical Computing and Bioinformatics for Conservation and Evolutionary Genomics](https://eriqande.github.io/eca-bioinf-handbook/erics-notes-of-what-he-might-do.html):
##### This is a collection of notes on 'useful things to know for bioinformatics'. It includes intros to Linux, shell programming, file formats, etc.
---
#### [Sandbox.bio](https://sandbox.bio/)
##### An interactive environment for practicing command-line work, with tutorials on Linux commands, bedtools, samtools, etc.
---
#### Conda vs. Mamba
> Dependency Resolution:
> Conda: Conda uses a SAT (satisfiability) solver for resolving dependencies. While effective, it can be slow in certain situations.
> Mamba: Mamba is a third-party, drop-in replacement for Conda, built on the same core technology but with a focus on speed. It uses a different solver, "libsolv", which makes **Mamba significantly faster than Conda** at dependency resolution, especially in complex environments.