---
tags: computing
---

Colella Lab - Computing set-up
===

### KU Center for Research Computing (CRC) cluster access
To get an account on the KU high-performance computing (HPC) cluster, you will need a KU ID and password.

Fill out [this form](https://deptsec.ku.edu/~crc/forms/form/2) three times, once for each owner group:
1. "EEB" owner group
2. "Biodiversity Institute" owner group
3. "Colella" owner group

**Other information needed for the form:**
- Academic Department: EEB
- Cluster: KU Community Cluster
- Research Group or Project: Colella bioinformatics

### Logging in to the HPC
- **Mac & Linux**: Open a terminal window and type: ```ssh <username>@hpc.crc.ku.edu```
- **Windows**: Install [MobaXterm](https://mobaxterm.mobatek.net/download.html). Open MobaXterm and type: ```ssh <username>@hpc.crc.ku.edu```
- Enter your KU password and press Enter (*Note: you will not be able to see anything when you type in your password*)

*Replace `<username>` in the above commands with your KU ID.*

*You may need to install an OpenSSH client on your computer for the above commands to work. On Debian/Ubuntu Linux:* ```sudo apt-get install openssh-client```

### Virtual Private Network (VPN) - for off-campus HPC access
The HPC can only be accessed through the KU VPN when off-campus (or outside of KU wifi).

1. Download and set up [DUO Authentication](https://technology.ku.edu/catalog/duo-multi-factor-authentication) with KU
   *It's easiest to connect this to your phone.*
2. Download and install the KU Anywhere Cisco AnyConnect Secure Mobility Client
3. Open the Cisco VPN app and add ```kuanywhere.ku.edu``` as the server you wish to connect to. Click 'Connect'
4. Select the following:
   - Group: DuoAuthenication
   - Username: `<KUID>`
   - Password: `<KU password>`
   - Second password: ```PUSH```

   Click 'OK'
5. Step #4 will prompt a DUO Authentication notice to your phone.
Accept it and you will be logged in to the VPN and can ssh into the HPC as above.

The first time you use the VPN, you may have to change your VPN "entitlement" to have ssh access. To do that, go to https://myidentity.ku.edu/services/vpn-entitlement and select Ecology & Evolutionary Biology (KU Anywhere).

### File Transfer Systems
An easy way to transfer large data files to the HPC is through [Globus](https://www.globus.org/). The HPC is automatically available as an endpoint called "KU Data Transfer Node". Follow the instructions at the website to set up your laptop as an endpoint, which will allow you to securely transfer large files between your laptop and the HPC. *Note that a VPN is needed to do this off campus, which makes the transfer much slower.*

Alternatively, if you are downloading data from a password-protected Dropbox link, use the following steps to open a Firefox window through the HPC and download directly there.

- Windows: ```ssh -X <username>@hpc.crc.ku.edu```
- Mac: ```ssh -Y <username>@hpc.crc.ku.edu```

Then run ```firefox``` (this will open a very slow browser window).

Navigate to the password-protected link. If that looks right, click download on the files you want, or change the end of the link from dl=0 to dl=1 to download everything.

### Tips for getting started on the HPC

**Download, install, & use a plain text editor**
Word processors like Microsoft Word insert special characters (for example, curly quotes and non-standard whitespace) that break code. If you want to edit and/or copy and paste code, you need to use a plain text editor.
- Mac: [BBEdit](https://www.barebones.com/products/bbedit/)
- Windows: [Notepad++](https://notepad-plus-plus.org/)

**Choose a command line text editor (nano, vi/vim, emacs)**
A command line text editor allows you to edit code and plain text inside a terminal window (as opposed to toggling back and forth between a plain text app and your terminal window). Nano is the easiest/most intuitive. Vim is the most powerful and probably most used.
Computer scientists are very picky about which text editor they use.

- [nano cheatsheet](https://www.nano-editor.org/dist/latest/cheatsheet.html)
- [vim cheatsheet](https://vim.rtorr.com/)
- [emacs cheatsheet](http://cs.hamilton.edu/misc/EmacsCheatSheet_iupui.pdf)

![](https://i.imgur.com/qzqES7H.png)

**Set up a 'nickname' for the cluster.**
Typing your username + the HPC hostname over and over again can be cumbersome and prone to spelling errors. To circumvent this, you can set up an 'alias' or 'nickname' on your computer:

Edit your ssh configuration file ```~/.ssh/config``` to contain:
```
Host kuhpc
    HostName hpc.crc.ku.edu
    User <your_username>
    Port 22
```
Save and exit. When you open a new terminal, you should be able to ```ssh kuhpc``` (or whatever you named the KU HPC) instead of typing out your username and the hostname.

**Set up password-less entry**
Typing your password over and over again is also cumbersome. ssh keys allow secure password-less access to a remote system.

For Mac:
1. Open a terminal and type ```ssh-keygen -t rsa```
2. Press ENTER to accept the default save location.
3. You will be prompted to enter a passphrase. Setting one is more secure, but it defeats the purpose of this exercise. Press ENTER twice to skip it.
4. Then copy the public RSA key you just created to the KU HPC by running: ```ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@hpc.crc.ku.edu```
5. You should now be able to log in password free!

NOTE: these processes may be slightly different for different operating systems - google it, I've provided the key words to get you started!
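The nickname step above can also be scripted so that it is safe to run more than once. This is only a minimal sketch: the function name `add_host_alias`, the variable `demo_config`, and the username `jdoe` are made up for illustration, and the demo writes to a scratch file rather than your real ```~/.ssh/config```.

```shell
#!/bin/bash
# Sketch: append an ssh alias block to a config file only if it is not there yet.
# add_host_alias, demo_config and the user "jdoe" are hypothetical names.

add_host_alias() {
    local config_file="$1"   # e.g. ~/.ssh/config
    local alias_name="$2"    # e.g. kuhpc
    local host_name="$3"     # e.g. hpc.crc.ku.edu
    local user_name="$4"     # your KU ID

    # Make sure the file exists and is readable only by you
    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"
    chmod 600 "$config_file"

    # Do nothing if an entry for this alias already exists
    if grep -qE "^Host[[:space:]]+${alias_name}\$" "$config_file"; then
        echo "Alias ${alias_name} already present in ${config_file}"
        return 0
    fi

    printf 'Host %s\n    HostName %s\n    User %s\n    Port 22\n' \
        "$alias_name" "$host_name" "$user_name" >> "$config_file"
    echo "Added alias ${alias_name} to ${config_file}"
}

# Demo against a scratch file so nothing in ~/.ssh is touched
demo_config="$(mktemp -d)/config"
add_host_alias "$demo_config" kuhpc hpc.crc.ku.edu jdoe
add_host_alias "$demo_config" kuhpc hpc.crc.ku.edu jdoe   # second call changes nothing
cat "$demo_config"
```

To use it for real, you would pass ```~/.ssh/config``` and your own KU ID instead of the demo values, then test with ```ssh kuhpc```.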
**Visit the KU HPC Home page:** https://docs.crc.ku.edu/

**See what software is already installed**: https://docs.crc.ku.edu/software/list-software/
You can also see the list of available software by running: ```module avail```

Use the [KU CRC Software Request form](https://docs.crc.ku.edu/software/modules/) to ask for new software to be installed on the cluster.

For cluster questions, email: crchelp@ku.edu

**Hit a space limit?**
Check out this R Shiny app (made by a KU grad student!) to see if you're using too much space (if you're over the red line, that's bad): https://devonderaad.shinyapps.io/cluster/

### General HPC Workflow

**Storage Groups**
Space on the HPC is limited. Below are the storage limits for the various storage groups.

- **\$HOME** - 100GB and 100,000 file limit
- **\$WORK** - Work directory. Size based on owner group allocation (for the BI it's 9TB, but that's almost always consumed)
- **/panfs/pfs.local/work/colella/** - extra storage for the lab, currently 5TB shared
- **\$SCRATCH** - Temp space. ~75TB total. No personal limit. Files >60 days old will be removed, but you will receive notice about a week before this happens.
- **\$TEMP** - 164TB. No personal limits. Files >30 days old will be deleted every day, without notice. *Note: \$TEMP appears to no longer exist on the cluster.*

Each of these storage groups is best used for different purposes. HOME has very limited space, but can be used for small projects or classes. We have access to two WORK directories, one shared by the BI and one shared by the lab. While there is more space in BI WORK, it is shared between everyone in the BI, so in practice there is less space there than in lab WORK. The WORK directories are best used for storing large files that you are routinely working with. There is much more space in SCRATCH, but files cannot be stored there long term. It is best used as the space for writing large output.

Each of these storage spaces has a soft quota and a hard quota.
When a soft quota is hit, everyone in that group will receive an email. When a hard quota is hit, all scripts writing output to that group will be cancelled, and nothing can be done until enough files are removed. Try not to hit a hard quota.

A general workflow on the HPC looks like this:
1. Set up directories for scripts and important data files in WORK
2. Set up directories for large and/or intermediate data files in SCRATCH
3. Upload raw data to WORK or SCRATCH, depending on size
4. Back up raw data to midden
5. Execute scripts from WORK, but write large output to SCRATCH
6. Process large data files in SCRATCH, writing large output to SCRATCH
7. Once you have smaller and/or less temporary data files, which you plan to work with routinely, copy them to WORK using rsync

**Node Partitions**
Just as for storage space, the computing nodes are split into different groups.

- **sixhour** - Six hour time limit (at which point jobs will be cancelled), largest node has 1024GB, most available nodes.
- **eeb** - No time limit, largest node has 512GB, relatively few available nodes.
- **bi** - No time limit, largest node has 768GB, many nodes.
- **colella** - No time limit, one node with 1024GB, lab has priority.

Choosing which node partition to submit your job to depends on how much time and memory you need. If a job will complete in less than six hours, it is fastest to submit to sixhour. If more time is needed, you can submit to one of the other partitions, but you may have to wait in line depending on how many of the nodes are already in use. The eeb and bi partitions are available only to members of those groups. Colella lab members are always given first priority on the colella partition, but when its node is idle it is given to sixhour jobs, so you may still have to wait if other jobs are currently running.
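The partition limits above can be folded into a quick check before you write a submission script. The helper below is a rough sketch, not a CRC tool: the name `fitting_partitions` is made up, the limits are copied by hand from the list above (so they may go out of date), and it says nothing about how busy each partition currently is.

```shell
#!/bin/bash
# Sketch only: limits are copied from the partition list above and may go stale.
# fitting_partitions is a hypothetical helper, not a CRC-provided command.

fitting_partitions() {
    local hours="$1"     # expected run time, in whole hours
    local mem_gb="$2"    # memory needed on a single node, in GB
    local fits=""

    # sixhour: jobs are cancelled at six hours; largest node has 1024GB
    if [ "$hours" -lt 6 ] && [ "$mem_gb" -le 1024 ]; then
        fits="$fits sixhour"
    fi
    # eeb, bi, colella: no time limit, so only memory matters here
    if [ "$mem_gb" -le 512 ]; then fits="$fits eeb"; fi
    if [ "$mem_gb" -le 768 ]; then fits="$fits bi"; fi
    if [ "$mem_gb" -le 1024 ]; then fits="$fits colella"; fi

    # Strip the leading space before printing
    echo "${fits# }"
}

fitting_partitions 2 100     # short, small job
fitting_partitions 48 600    # long job that needs a large-memory node
```

The first call prints `sixhour eeb bi colella` and the second prints `bi colella`. Which of the fitting partitions will start your job soonest still depends on the queue.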
SLURM script template:
```
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=sixhour
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<username>@ku.edu
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --output=test.log

# <load modules>

echo "Running bash script"

# <code goes here>
```

**Helpful KU-specific commands**

- ```crctool``` - view info about storage groups and partitions that you have access to
- ```crctool -p <partition>``` - see all jobs and the queue for a partition, and how many nodes are in use and idle

****

## GitLab
The lab manages shared code in GitLab, a KU-funded, web-based platform that helps teams develop, secure, and deploy software. Follow these steps to create and start working in a GitLab account:

1. Create/open a KU GitLab account: https://technology.ku.edu/catalog/gitlab-ku
   Or go to https://gitlab.ku.edu/ to open GitLab

   ***Let Jocelyn know once you have done this and she will add you to the Colella Lab GitLab Group.***

2. Add an ssh key to your GitLab profile:
   - Click **profile logo** on the right side of the bar on the top left
   - Select **Preferences**
   - Select **SSH Keys** in the menu bar on the left-hand side
   - Select **Add new key**
   - Open a terminal/CLI window on your laptop and type: `ssh-keygen -t rsa`
   - Press ENTER to accept the default save location
   - You will be prompted to enter a passphrase. Setting one is more secure, but it defeats the purpose of this exercise. Press ENTER twice to skip it.
   - Type `more ~/.ssh/id_rsa.pub`
   - Copy the output and paste it into the **Key** dialogue box in GitLab
   - The key will be automatically assigned a **Title** (e.g., the name of your computer)
   - Leave **Usage Type** as **Authentication & Signing**
   - Delete **Expiration date**
   - Select the **Add key** button

3. Verify that it worked by copying/cloning a directory from the Colella GitLab onto your local computer (or the HPC):
   Run `git clone git@gitlab.ku.edu:colella-lab/example-workflows.git`
   Output of that command will look similar to:
   ```
   Cloning into 'example-workflows'...
   The authenticity of host 'gitlab.ku.edu (123.456.78.910)' can't be established.
   ED25519 key fingerprint is …
   This key is not known by any other names.
   Are you sure you want to continue connecting (yes/no/[fingerprint])?
   # Enter yes to trust the GitLab host and continue.
   ```

### Managing & organizing code within GitLab
GitLab is a 'homebase' for our lab to develop and share code.

#### There are 2 ways to add code to GitLab:

(1) **Web/GUI directions:**
- Navigate to the repo and/or directory that you wish to add to in GitLab
- Click the **+** at the end of the path, then either
  a. Choose to **Upload file** and select the script you wish to upload, OR
  b. Choose **New file**, enter a **Filename**, and paste your text
- Enter a **commit message** and click **Commit changes**

(2) **Command Line directions:**
This requires that you have git installed on your computer. Inside your local clone of the repo, navigate to the directory with the script you would like to add to the Colella GitLab, then:
```
git add <scriptname>
git commit -m "<changes_i_made>"
git push
# If you don't enter a message, git opens an editor (often Vim).
# To save and exit Vim press: esc, then type :wq and press Enter
```

****

### Other computing Accounts
1. Create a GitHub account with your personal email account
   - https://github.com/
   - Code and scripts are published via GitHub, which can also be used for version control
   - Tutorial on how to use GitHub: [HERE](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners)
2. Download and install [git](https://git-scm.com/downloads)
   - For Windows users, you may choose to use one of these alternative tools: git within [MobaXterm](https://mobaxterm.mobatek.net/), [GitKraken](https://www.gitkraken.com/) or [TortoiseGit](https://tortoisegit.org/)
3. Create a Discord account
   - https://discord.com/channels/@me
   - This is for regular inter-lab and inter-divisional communication (social & professional)
   - Ask Jocelyn to add you to Lab Servers
   - Android/Apple apps are available for your phone, you can download a stand-alone app for your laptop, or you can access it online. Your call!
4. Midden
   The Lab has a >96TB network-attached storage array (NAS) for raw-data storage and backups. All raw data should be backed up here upon receipt.
   - Ask Jocelyn to set up a username for you
   - Adjust your VPN entitlement at [this link](https://myidentity.ku.edu/services/vpn-entitlement). Log in and select "Ecology & Evolutionary Biology (KU Anywhere)" or "Biodiversity Institute (KU Anywhere)"
   - Log into the KU VPN (even if you're on campus)
   - Log into midden as: ```ssh <username>@midden.ku.edu``` and enter the password provided by JPC
   - Immediately change your password using the command ```passwd```

****

### HELPFUL TUTORIALS

#### All around basics
- [Software Carpentry](https://software-carpentry.org/lessons/)
- [Analysis of Next-Gen Sequencing Data](https://angus.readthedocs.io/en/2019/)
- [OSU primer for computational biology](https://open.oregonstate.education/computationalbiology/front-matter/preface/)
- General bio programming practice problems (in any language): [Rosalind](http://rosalind.info/problems/locations/)

#### Linux/Unix
- Basics: https://linuxsurvival.com/
- Linux-fu: https://linuxjourney.com/

#### Python
- [Python for Kids](https://nostarch.com/pythonforkids) *recommended by the Editor of Systematic Biology*
  ![](https://i.imgur.com/k7ZY1tM.jpg)
- General python: [Automate the boring stuff](https://automatetheboringstuff.com/)
- [Python Programming for Biologists](https://swcarpentry.github.io/python-novice-inflammation/index.html)
- [NumPy](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html)

#### Conda (Anaconda)
- Use Conda with Python to manage computing environments and dependencies
- [Conda commands cheat sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf)

#### R
- [Swirl tutorial](https://swirlstats.com/)

#### Regex (regular expressions)
- Basics and practice: https://regexone.com/

#### Other tools:
- [Rclone](https://rclone.org/)
- [Rsync](https://linux.die.net/man/1/rsync)

### IN-PERSON WORKSHOPS
- **Linux/unix**: [Software Carpentry](https://software-carpentry.org/)
- **Molecular Evolution**: [MBL Workshop](https://www.mbl.edu/education/courses/workshop-on-molecular-evolution/)