# Introduction to using the MGHPCC computing cluster @UMass
The Massachusetts Green High Performance Computing Cluster (MGHPCC) is a shared computing resource available to UMass Amherst students, faculty, and staff. In our lab, we primarily use the cluster for analyzing high-throughput sequence data, but it can be useful for other large data sets as well.
- <font size=2>See the [UMass cluster user wiki](http://wiki.umassrc.org/wiki/index.php/Main_Page#Welcome_to_the_University_of_Massachusetts_Green_High_Performance_Computing_Cluster) for additional and up-to-date info.
- See the [MEC Lab Guidelines for working on the MGHPCC](https://hackmd.io/DkKnjNHAQmyGI98exzyFlg?view) for a summary of expectations, tips, and tricks.
- For questions related to the UMass GHPCC cluster, please contact: hpcc-support@umassmed.edu</font>
## Connecting to the cluster for the first time
**Step 1:** Request access to the cluster (must be approved by your PI): http://wiki.umassrc.org/wiki/index.php/Requesting_Access
**Step 2:** You will receive an email with your user name and a temporary password.
**Step 3:** If working on Windows, download and install an SSH client or Linux-like environment, then log in to the cluster. There are a few different options listed below, with instructions for how to log in to the MGHPCC:
- **Option 1:** [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/)
1. Open PuTTY
2. Type `ghpcc06.umassrc.org` into the "Host Name (or IP address)" box, make sure the port is set to 22 and the connection type SSH is selected, then click Open.
3. Type your user name at the `login as:` prompt.
4. Type your given password. **NOTE:** you will not be able to see anything appearing as you type your password. Press enter when you are done.
5. Type your given password again.
6. Type your new password (your password must contain at least 9 characters, at least one digit, and at least one non-alphanumeric character).
7. Type your new password again.
8. Open PuTTY again, type `ghpcc06.umassrc.org` into the "Host Name (or IP address)" box, and click Open.
9. Type your user name.
10. Type the new password you just set.
11. You're ready to go!
<font size=2>To paste copied text (from anywhere) into PuTTY, just right click. Clicking the right mouse button twice acts as an enter and submits your line of code. If the text you copied includes an "enter" (a newline), the pasted command will run automatically even if you only click once.</font>
- **Option 2:** [Cygwin](https://www.cygwin.com/)
1. After installing Cygwin, open the Cygwin terminal.
2. Type `ssh [your_user_name]@ghpcc06.umassrc.org`
3. Follow the prompts to enter your given password twice, and then set a new password.
4. Note: you will not be able to see what you are typing when entering your password.
- **Option 3:** Blair's recommendation
When assigned an account on the MGHPCC, you will be given a personal work space in `/home/[user_name]`. You have a total of 50 GB of storage space to work with in this directory.
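To check how much of that space you are using, one option is the standard `du` utility (a minimal sketch using a generic Linux command, not an MGHPCC-specific tool):
```
# Summarize the total disk usage of your home directory in human-readable units
du -sh ~
```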
## Uploading to the cluster and sharing files
There are a variety of ways to transfer files to and from the cluster, and different techniques are convenient for different purposes. Each user has a home folder with a limited amount of space (50 GB). Lab groups can request project folders with more space. We are currently doing most of our work in `/project/uma_lisa_komoroske/`. However, we don't have permission to upload things from our own computers directly to the project space, so we have to upload files to our home space and then move them to the project space.
### Uploading files from your own computer via the command line
This is most useful for Mac (or Linux) users, who have a command line terminal built into their computer.
**rsync** allows you to upload folders containing multiple files. The `-a` flag preserves file attributes, `-v` prints files as they transfer, and `-e ssh` tells rsync to connect over SSH. The first time you connect, you may have to type yes to accept the cluster's host key before files transfer.
```
rsync -av -e ssh ./test.file user1@ghpcc06.umassrc.org:/home/user1/testfolder/
```
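The same command also works in the other direction for downloading from the cluster; just swap the source and destination (the paths here are hypothetical):
```
# Copy a folder from your cluster home space into the current local directory
rsync -av -e ssh user1@ghpcc06.umassrc.org:/home/user1/testfolder/ ./testfolder/
```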
You can also use `scp` to copy files from your computer to the cluster:
```
scp ./test.file user1@ghpcc06.umassrc.org:/home/user1/testfolder/
```
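To copy an entire folder with `scp`, add the recursive `-r` flag (the folder name here is hypothetical):
```
# Recursively copy a local folder and its contents to your cluster home space
scp -r ./testfolder user1@ghpcc06.umassrc.org:/home/user1/
```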
### Uploading files from your own computer using an SFTP client
<font size=2>This method allows you to use a GUI and browse for files on your computer the way you normally do.
Three commonly used SFTP clients are [FileZilla](https://filezilla-project.org/), [Cyberduck](https://cyberduck.io/), and [WinSCP](https://winscp.net/eng/index.php).
These programs allow you to log in to the cluster (using the same host name, port, and login info you use to access the cluster) and drag and drop files from your computer to your home folder on the cluster.</font>
Regardless of which SFTP client you're using, enter the information below to connect to the cluster and transfer files:
- Host: `ghpcc06.umassrc.org`
- Username: `[your_user_name]`
- Password: `[your_password]`
- Port: `22`
Once you've transferred the file(s) of interest to the cluster with your SFTP client, you can log in to the cluster using PuTTY/Cygwin and use the `mv` command to move files to the project space.
```
mv ~/file.to.move /project/uma_lisa_komoroske/folder
```
<font size=2>It is very important not to make a typo in the destination part of this command: if you move a file to a folder that does not exist, `mv` will simply rename the file to the mistyped path and it can be hard to find again. It is best to use tab complete to type out paths. You can also use the `cp` command to copy files instead of moving them (the original stays in your home space).</font>
```
cp ~/file.to.move /project/uma_lisa_komoroske/folder
```
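One way to guard against the missing-folder problem is to create the destination before moving (the folder name here is just an example):
```
# -p creates the folder if it doesn't exist (and does nothing if it does)
mkdir -p /project/uma_lisa_komoroske/folder
mv ~/file.to.move /project/uma_lisa_komoroske/folder/
```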
#### Downloading files from the internet
As always, use caution when downloading files from the internet.
You can use the `wget` or `curl` commands to download files from the web. You will need the file's URL. See this [blog post](https://www.thegeekstuff.com/2012/07/wget-curl/) about when to use each command; for the most part, they are interchangeable.
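For example (the URL here is hypothetical):
```
# wget saves the file under its remote name by default
wget https://example.com/genome.fasta.gz
# curl prints to the screen unless you pass -O to save under the remote name
curl -O https://example.com/genome.fasta.gz
```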
#### Transferring files between users (only necessary if one user doesn't have access to the shared project space)
In general, this is not very useful: the user you want to send something to will have to type their password into your computer before the file will send, so you have to be in the same room as them.
```
rsync -av -e ssh /home/user1/testfolder/send.file user2@ghpcc06.umassrc.org:/home/user2/folder
```
**NOTE about rsync:**
<font size=2>This creates copies; we are not sharing one file. This is good because we cannot accidentally delete a file from all of our accounts, but we need to be aware that one person's work on a file in their account will not update the file in other accounts.</font>
## Writing scripts
<font size=2>Scripts are text files that contain commands that run in order when the script is run. Writing scripts allows you to submit a job that includes multiple commands, and helps you keep track of your code to facilitate troubleshooting and remind others (and your future self) what you did.
In general, you should write your script in a text editor or HackMD rather than a word processor like MS Word. Word processors can add hidden characters to your code that become tricky to troubleshoot.</font>
Popular text editors include
- [atom](https://atom.io/)
- [notepad++](https://notepad-plus-plus.org/downloads/)
- [emacs](https://www.gnu.org/software/emacs/)
#### Tips for writing functional scripts:
- You can create and edit scripts by calling the text editor on the cluster using e.g. `nano xxxscript.sh`
- Bash script file names typically end with `.sh` so your script would be called `xxxscript.sh`
- The first line of your script should be `#!/bin/bash`.
- You can read your script in the console using the command `less xxxscript.sh`
- To make your script executable (give yourself permission to run it) use `chmod u+x xxxscript.sh`
- To submit a script as a job to the cluster, run `bsub < ./path/xxxscript.sh`
- <font size=2>See below for more info</font>
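Putting these tips together, a minimal script might look like this (the directory and command are just placeholders):
```
#!/bin/bash
# xxxscript.sh - a minimal example script; commands run in order
cd /project/uma_lisa_komoroske/folder
echo "starting analysis"
#command1 input.file > output.file
```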
## Submitting a job
<font size=2>To run jobs that will take a large amount of memory and/or time, you will need to request resources and submit a job to the cluster. To submit a job to the MGHPCC you must use the `bsub` command. See the [wiki job submission page](http://wiki.umassrc.org/wiki/index.php/Submitting_Cluster_Jobs) for additional and up-to-date info.
Jobs run from the directory/environment they are submitted from, and unless specified otherwise, output files will be written into that directory. For your script to work, you will need to call files by their paths relative to the directory you submit your job from. Alternatively, you can use a file's full path (use `pwd` to get the full path).</font>
This is an example job submission for a command
```
#cd your.working.directory
bsub -q long -W 1:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] #command# #FILENAME#
```
This is an example job submission for a script
```
#cd your.working.directory
bsub -q long -W 1:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] < ./path/xxxscript.sh
```
The default resources requested by `bsub` are 1 hour, 1 GB of memory, and 1 core. You can adjust your resources using the `#BSUB` header lines below, placed at the top of your script (each has an equivalent command-line flag).
```
#BSUB -n 4                  # Number of cores to request
#BSUB -J Bowtie_job         # Job name
#BSUB -o myjob.out          # Append to output log file
#BSUB -e myjob.err          # Append to error log file
#BSUB -oo myjob.log         # Overwrite output log file
#BSUB -eo myjob.err         # Overwrite error log file
#BSUB -q short              # Which queue to use {short, long, parallel, GPU, interactive}
#BSUB -W 0:15               # How much time your job needs (HH:MM)
#BSUB -L /bin/sh            # Shell to use
#BSUB -R span[hosts=1]      # Use cores on the same host (no backslashes needed inside a script)
#BSUB -R rusage[mem=16000]  # Max amount of memory your job will use (in megabytes! so 16000 = 16 GB)
```
<font size=2>*Note: any time you request more than one core you should use the flag `-R span\[hosts=1\]` to avoid issues with running the job in parallel across hosts.*
Currently the smallest-memory nodes in the long queue have 128 GB of memory and the largest have 512 GB, so you can specify larger amounts of memory, though the tradeoff tends to be a longer time pending for resources before the job can be dispatched.</font>
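Putting the header lines together, a complete submission script might look like this (the queue, resources, and command are placeholders you would adjust for your own job):
```
#!/bin/bash
#BSUB -J Bowtie_job         # Job name
#BSUB -q long               # Queue
#BSUB -W 4:00               # Wall time (HH:MM)
#BSUB -n 4                  # Cores
#BSUB -R span[hosts=1]      # All cores on one host
#BSUB -R rusage[mem=16000]  # Memory in MB (16 GB)
#BSUB -o myjob.out          # Output log
#BSUB -e myjob.err          # Error log

#command# #FILENAME#
```
Submit it with `bsub < ./path/xxxscript.sh` as described above.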
To check the job status and job number of all jobs you currently have running, use `bjobs`.
If you notice a typo, or need to cancel a job for some reason, you can use `bkill xxxx` (where `xxxx` is your job number).
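For example (the job number here is hypothetical):
```
bjobs           # list your current jobs with their job numbers and statuses
bkill 1234567   # cancel the job with job number 1234567
```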
#### Errors
- `Exit code 1` means there is an issue with your script/command code.
- All other exit codes I've seen are related to job submission (i.e. the job hit its wall time, hit its memory limit, etc.).
- If you get an error saying something like `lsbatch: command not found`, try adding the program's path to the command; if the program is a Python or Java program, you may need to invoke it explicitly (e.g. with `python` or `java -jar`).
- If you get an error saying something like `lsbatch: xxx is/is not a directory or file not found`, check for typos and confirm that you submitted your job from the correct directory.
## Using and installing programs
There are three main ways we can install/use programs on the cluster:
1) `module load`,
2) downloading files, and
3) using bioconda.
<font size=2>Using `module load` seems easiest and also doesn't use up our shared project space.
While at first glance using bioconda to install programs looks more challenging than just downloading executable files, bioconda is usually easier because you don't have to worry about dependencies, it works the same way for all the programs it's compatible with, and you can easily update programs (though updating is not automatic).</font>
#### module load
There are a variety of programs already installed and available for use on the cluster. You can also email `hpcc-support@umassmed.edu` to ask if they can make a new program available through `module load`.
To see what programs are available, use the command `module avail` to print a list of programs to your console.
To load a program, use the command `module load program.v.1`. The program is then ready to use, but this command does not necessarily load the program's dependencies. Any program available through `module load` will have its dependencies available through `module load` as well, so you just need to look at the program documentation (Google it) or error messages to figure out what else to load.
When you run `module load program.v.1`, the program's location will print to your console. This path is needed for some programs.
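A typical session might look like this (the program name and version are hypothetical; check `module avail` for what is actually installed):
```
module avail                  # list the programs available on the cluster
module load samtools/1.4.1    # load a specific program and version
module list                   # confirm which modules are currently loaded
module unload samtools/1.4.1  # unload the program when you're done
```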
#### Installing and using programs downloaded from files
You can also install programs by downloading zipped files from the internet using `wget`. You can unzip tar files using the command
`tar xzf program.v.1.tar.gz`
You can then remove the zipped file. We typically download programs to `/project/uma_lisa_komoroske/bin` so everyone can use them. You will likely need the path to the program to use it. For example:
```
/project/uma_lisa_komoroske/bin/program.v.1/command1 file.txt
```
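End to end, installing a downloaded program might look like this (the URL and program name are hypothetical):
```
cd /project/uma_lisa_komoroske/bin
wget https://example.com/program.v.1.tar.gz   # download the zipped program
tar xzf program.v.1.tar.gz                    # unzip it
rm program.v.1.tar.gz                         # remove the zipped file
```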
#### Installing and using programs with bioconda
[bioconda](https://bioconda.github.io/) is a conda channel (conda is available through `module load`) that makes installing programs that have a lot of dependencies a lot easier. You can see what programs are available to install and use via bioconda [here](https://bioconda.github.io/recipes.html#recipes).
To install a program to our shared bin folder use the following commands:
```
cd /project/uma_lisa_komoroske/bin
module load anaconda2/4.4.0
conda create --prefix /project/uma_lisa_komoroske/bin/#PROGRAM#
source activate /project/uma_lisa_komoroske/bin/#PROGRAM#
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install #PROGRAM#
source deactivate
```
To use the program in the future, use the commands:
```
cd /working.directory
module load anaconda2/4.4.0
source activate /project/uma_lisa_komoroske/bin/#PROGRAM#
#commands
source deactivate
```
You can include these commands in a script to submit as a job.
**Note:** if you are using a Java-based program, you usually have to start your command with `java -jar /path/to/program/program.jar`.
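For example, calling a Java program installed in our shared bin might look like this (the program name and input file are hypothetical):
```
java -jar /project/uma_lisa_komoroske/bin/program.v.1/program.jar input.txt
```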