GCB Academy - SciComp - Oct 2018

--- tags: course notes --- # GCB Academy - SciComp - Oct 2018 ## Automating tasks with the Unix Shell October 29, 2018 ### Introducing the Shell - `cd` to change directories and move around "folders" in the shell - `ls` to list files in a directory. - `ls -F` to show which ones are directories. They end with a `/` - `ls -l` for **long** format (owner, size, permissions, etc) - `ls -a` to list including hidden files/directories - `pwd` to ask the shell _Where am I now?_ - Use the `TAB` key on your keyboard to complete commands and files ### Navigating Files and Directories - `cd ..` to go up one level of directories. (e.g. from `dc_sample_data/untrimmed_fastq` to `dc_sample_data`) - `cd` works with absolute (beginning with a `/` like `cd /Users/dan/Desktop/dc_sample_data`) and with relative paths (like `cd Desktop/dc_sample_data`) - `cd ~` to go to your HOME directory, `cd -` to go to the directory you were last in ### Working with Files - Wildcards: `ls *.fastq` to list all files ending with **.fastq**. `*` matches any text of any length. `?` matches a single character, but you can use multiple `???` together **Exercise**: Do each of the following tasks from your current directory using a single ls command for each. - List all of the files in /usr/bin that start with the letter ‘c’. - `ls /usr/bin/c*` - List all of the files in /usr/bin that contain the letter ‘a’. - `ls /usr/bin/*a*` - List all of the files in /usr/bin that end with the letter ‘o’. - `ls /usr/bin/*o` **Bonus**: List all of the files in ‘usr/bin’ that contain the letter ‘a’ or the letter ‘c’. `ls /usr/bin/*[ac]*` Command History shortcuts: - `Ctrl-C` to cancel the current command and get a fresh prompt - `Ctrl-R` to begin a search through your command history - `Ctrl-L` to clear the screen - `history` command will list out everything you've typed, each line with a number - Use `!260` to repeat the command listed on line 260 above File Content commands: - `cat SRR097977.fastq` to show the entire contents of a file, but this will fill the screen quickly, so try... - `less SRR097977.fastq` to show the file one screen at a time (paging). Use the `Space` key to advance one screen. Up and down arrows (or `j`/`k`) to move back/forward one line. - `head` and `tail` to show the beginning and end of a file respectively Moving/copying files, Creating directories: - `cp SRR097977.fastq SRR097977-copy.fastq` to make a copy of a file - `mkdir backup` to make a directory called **backup** - `mv SRR097977-copy.fastq backup/` to move the copied file into the **backup** directory - `rm backup/SRR097977-copy.fastq` to remove (delete) a file. - **But there is no trash bin, `rm` means gone** - Make files read-only with `chmod -w` - _Minus W_ subtracts write permissions - `chmod -w SRR097977-copy.fastq` Removing directories - `rm` will not remove a directory by default. `rmdir` will remove a directory, but not an empty one! - `rm -r` to **recursively** remove a directory, by removing the files it contains - Or, remove the files first and then remove the empty directory with `rmdir` Break: Return at 10:35 ### Redirection Searching files - `grep` can search through files: - `grep NNNNNNNNNN SRR098026.fastq` to find _low-quality_ reads (having 10 consecutive `N`s) in the fastq file - `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq` to include 1 line **before** and 2 lines **after** the matched line. The read name is the line before the reads, and the quality score is 2 lines after. Exercise: 1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. - `grep -B1 GNATNACCACTTCC SRR098026.fastq` 2. Search for the sequence `AAGTT` in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. - `grep -B1 AAGTT *.fastq` Redirecting Output When we run a command, output usually just _spills_ to the terminal. Often we want to save that output for further steps. - Use the `>` command to send output from one command to a file. Always overwrites the destination file. - `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt` - Then, `less bad_reads.txt` to look at the produced file, or `wc -l bad_reads.txt` to see word/line counts - Use the `>>` command to **append** output from one command to a file. - `grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt` - Use the `|` (pipe) command to send output from one command to the **input** of another command. - Connect them with a *pipe* - `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less` - Can connect more than 2 commands with pipes too! Exercise: How many reads in the `SRR098026.fastq` file contain at least two regions of 5 unknown nucleotides (`N`s) in a row, separated by anything (known or unknown). Hint: Use a wildcard for the pattern and a pipe character `|` to get the count without creating an intermediate file. File Manipulation and Pipe Practice - Looking at tabular data, use `head -1` to see the column names - `head -n 1 SraRunTable.txt` - Use `cut` to extract columns from tabular data - `cut -f3 SraRunTable.txt | grep PAIRED | wc -l` Exercise: How many single-end libraries are in our samples? `cut -f3 SraRunTable.txt | grep SINGLE | wc -l` > 35 - `uniq` will reduce a list to just the unique items, but usually you want to `sort` first - `cut -f3 SraRunTable.txt | sort | uniq -c` - Use `grep -v PATTERN` to filter out things matching the PATTERN. - `cut -f3 SraRunTable.txt | grep -v LibraryLayout | sort | uniq -c` Exercise: 1. How many different sample load dates are there? 2. How many samples were loaded on each date? Hint: You already have these commands, but may need to explore to find the column with the load dates. `cut -f5 SraRunTable.txt | grep -v LoadDate | sort | uniq -c` >` 6 25-Jul-12` >` 31 29-May-14` Sorting an entire file - Use `sort -k` (k for **key**) to sort a file by a column number - `sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt` #### Writing Scripts - Use a text editor, like the easy, simple `nano` to put commands into a file - `nano count-load-dates.sh` opens the nano editor with a new file - Add your commands to run - `Control-O` to save (write out) the file, `Control-X` to exit `nano` and return to shell - Your script is just a file. You can `cat` it! - Run your script with `bash count-load-dates.sh` - Make scripts executable by the owning user with `chmod u+x count-load-dates.sh`. Executable scripts can be run without the `bash` command. `./count-load-dates.sh` ### Shell Tips: - If your desktop is using OneDrive, `cd Desktop` may not work as expected. Right-click the folder you want to access and select **Git Bash Here** - `man` not working on windows under **Git Bash**? [Internet search for `man ls`](http://google.com/?q=man+ls) or whatever command you're looking for. The man pages are all online. ## Programming with Python October 30, 2018 Download [python-fasta.zip](https://github.com/Duke-GCB/SciComp-Nov-2017/releases/download/1.0/python-fasta.zip) To launch Jupyter Notebook from the command line (do this from your home directory, or the directory to which python-fasta.zip was uncompressed): ``` $ jupyter notebook ``` ### Variables: How to define and use ```python= # strings: greeting = 'Aloha' print(greeting, 'World') # integer numbers age = 5 + 1 print('My daughter is', age, 'years old') # floating point numbers lb = 50.0 kg = 2.2 * lb print(kg) ``` ### Indexing into a string ```python= name = 'Python' # uses 0-based indexing! print(name[0]) # negative index means go from the end print(name[-1]) # can use sequence of indices too (start to end+1) print(name[0:3], name[3:6]) # is a letter in a string? 'P' in name ``` ### Variables point to values ```python= x = y = 'pencil' print(x, y) y = 'apple' # the value x points to remains the same print(x, y) ``` ### Strings that are sequences ```python= seq = 'ACTTGCATGC' ``` #### Computing the reverse complement To tackle a programming problem, it's always useful to break it down: * Reverse the sequence, then * Complement the sequence ```python= # a naïve way: rev = seq[9] + seq[8] + seq[7] + seq[6] + seq[5] + seq[4] + seq[3] + seq[2] + seq[1] + seq[0] # won't work whenever the sequence string isn't exactly 10 characters long # # use a loop instead rev = '' for s in seq: rev = s + rev print(rev) print(seq) ``` Defining and accessing dictionaries: ```python= counts = {'A': 2, 'C': 4, 'G': 2, 'T': 2} # index by the key: print('There are', counts['A'], 'As') # can also change values: counts['A'] = 3 # can use a variable for indexing: base = 'A' print(base, 'is there', counts[base], 'times') # can loop over all keys of a dictionary: for b in counts: print('Seen', b, count[b], 'times') ``` Complement using a dictionary: ```python= comps = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'} comp = '' for s in seq: c = comps[s] comp = comp + c print(seq, comp) ``` :::info Exercise: Combine reverse and complement ::: ```python= comps = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'} revcomp = '' for s in seq: # complement the base c = comps[s] # build up sequence in reverse revcomp = c + revcomp print(seq, revcomp) ``` ### Lists and arrays ```python= seqs = [seq, revcomp, rev] # also zero-indexed: print(seqs[0]) # can delete elements from a list: del seqs[0] print(seqs) ``` ### Conditional blocks ```python= if 'GC' in seq: print('Found GC in', seq) ``` #### Calculate %GC ```python= # initialize count values gc = 0 total = 0 # loop over each base for base in seq: # if it's a G or C increment gc by 1 if base == 'G' or base == 'C': gc = gc + 1 # increment total by one total = total + 1 pct = (gc * 100.0) / total print('GC-content of', seq, 'is', pct) ``` ### Functions Some functions are global - always available ```python= print('Hello world') # calling a function len(seq) x = [5, 6, 1, 9, 2] x_sorted = sorted(x, reverse=TRUE) print(x_sorted) # can be combined: print(sorted(x, reverse=TRUE)) ``` Some functions are "methods" on a data type, uses the `object.methodname()` notation: ```python= seq_lower = seq.lower() print(seq, seq_lower) ``` Functions are defined using `def fname(arguments)` notation: ```python= def double(x): result = x * 2 return result print(double(5)) # the variable x in the function is local to the function # and insulated from the surrounding code x = 10 print(double(20)) ``` ```python= # Reverse function, reverses the sequence provided def reverse(seq): rev = '' for s in seq: rev = s + rev return rev print(reverse(seq), reverse(reverse(seq)), seq) # Complement function, complements the sequence provided def complement(seq): # much better to define this locally here # than to rely on a global variable comps = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'} comp = '' for s in seq: c = comps[s] comp = comp + c return comp print(seq, complement(seq)) def revcomp(seq): return reverse(complement(seq)) print(seq, revcomp(seq)) ``` ### Reading lines from a file ```python= def readfasta(filename): f = open(filename, 'r') seq = '' for line in f: line = line.strip() if '>' not in line: seq = seq + line f.close() return seq print(readfasta('ae.fa')) ``` ### Importing code that we didn't write ```python= import sys # now we can use variables and functions # defined in module `sys` print(sys.argv) # if in a Python script (as opposed to in the Jupyter notebook), # the first element of the list is the name of the script ``` ## Version control with Git If you didn't attend the Python lesson, or lost the directory from that lesson, download materials in the following way: ```bash cd # or wherever else you want the materials folder mkdir read-fasta git clone https://github.com/Duke-GCB/scicomp-python.git cp scicomp-python cp *.py *.fa ../read-fasta cd ../read-fasta ``` ### Asking "What's going on?" ```bash git status ``` Always tells you the status of your repository ### Configuring your author info and editor ```bash git config --global user.name "Dan Leehr" git config --global user.email "dan.leehr@duke.edu" git config --global core.editor "nano -w" ``` These configuration commands tell git what information to record when you commit code, and what editor to use for writing commit messages. ### Creating a git repository Locally, in an existing directory: ```bash git init ``` This creates a _hidden_ `.git` directory, which is where our changes are tracked. ### Committing changes Two steps: 1. Prepare a set of changes to commit (to staging area) - `git add read_fasta.py` - `git status` (optional, but will show you what files are staged for commit) 2. Commit the staged changes with a message (committing) - `git commit` (will pop open your editor to provide a message) - Write a brief message in nano explaining the change, and `Control-O` to save ### Looking at changes before committing ```bash git diff ``` Shows changes to the local files that have not been staged or committed yet. ```bash git diff --staged ``` Shows changes to the **staged** files that have not been committed yet. ### Viewing history of changes ```bash git log # Shows commit authors, emails, dates, IDs and full commit messages git log --oneline # One line for each commit, its ID, and first line of message git log -p # Commit info, plus the changes to each file for each commit (+/-) ``` ### Restoring from backup Unstaging a staged change: ```bash git reset HEAD read_fasta.py git status # Shows there's still a change but it's not staged ``` Restoring a locally changed file to the committed version ```bash git checkout -- read_fasta.py git status # Shows that there are no changes ``` ### Restoring an earlier committed version ```bash git log # To find the commit ID you want to revert to git checkout 3db84d5f -- read_fasta.py # Restore read_fasta.py from how it was in commit 3db84d5f git status # Shows that above already staged the file, ready to commit git commit -m "Backs out arg checking" # Commit the staged change with a message right on the command line ``` ### Undoing an entire commit ```bash git log # To find the commit ID you want to undo git revert 1196dc9 # Reverse the changes made in 1196dc9 and prepare a new commit # Edit the commit message, but it will be pre-filled with info about revert git log # Shows that 1196dc9 is still in the history, but a new commit reverses it ``` ### Creating a branch We've so far been using the default `master` branch, but we may want to try some experimental changes or work on a set of changes that belong together. ```bash git checkout -b ignore-data # Creates a branch called ignore-data and switches to it git status # shows On branch ignore-data ``` ### Ignoring files Some files we don't want to commit with our code (like data files), but we also don't want git to constantly tell us they're untracked. Here, we can **ignore** them by listing them in a `.gitignore` file ```bash nano .gitignore # Opens editor *.fa # List the files and patterns to ignore *.fasta # Save and exit git status # Doesn't show the data files but does mention an unstaged .gitignore git add .gitignore # Stage the file git commit -m "Ignore data files" # Commit our staged changes ``` ### Fixing your last commit To make changes to the message or files in the commit you just made: ```bash git commit --amend # Opens editor, prompting you to revise your commit message ``` You can also stage additional changes before amending the commit, and they will be added to the commit. ### Switching branches ```bash git branch # Lists your branches git checkout master # Switches to your 'master' branch from 'ignore-data' git status # shows 'On branch master' and my ignore files are back! ``` ### Staging parts of a file ```bash git add -p read_fasta.py # Prompts you to stage individual changes (hunks) of a file ``` ### Incorporating changes from other branches (merges) ```bash git status # On branch 'master' git merge ignore-data # Adds the commits from 'ignore-data' onto the end of master git status # on branch master git log # Now includes the commits we made on 'ignore-data' ``` ### Creating a GitHub repository GitHub can also host copies of your git repositories and serve as a backup or point of collaboration See http://swcarpentry.github.io/git-novice/07-github/index.html for screenshots and step-by-step 1. Visit github.com 2. Click **+**, **New Repository** 3. Type a name, leave Public, and click **Create repository** 4. Copy the .git address provided (e.g. https://github.com/dleehr/scicomp-course-2018.git) 5. Add this address as a remote to your local git repository ```bash git remote add origin https://github.com/dleehr/scicomp-course-2018.git git push --all origin # Send all of our branches and commits to the repo on GitHub ``` ### Pushing new changes After making and committing more changes, you can synchronize your local work to GitHub by pushing your branch ```bash git push --set-upstream origin master # Just the first time to tell Git to push local master to origin master git push # pushes local master to origin master ``` ### Collaborating http://swcarpentry.github.io/git-novice/08-collab/index.html Get a partner's github id and view their repo https://github.com/partner-name/scicomp-readfasta The **Clone or Download** button will provide the **Clone** url. Then, clone their repo (it'll be separate from your local copy) ```bash git clone https://github.com/partner-name/scicomp-readfasta.git cd scicomp-readfasta git log # Hey, these are Partner's commits, not yours ``` Notice that the **clone**d repo will have an `origin` remote already set, and branch `master` Create a new branch and make some changes ```bash git checkout -b more-docs nano read_fasta.py # Edit the file, add more docs git add read_fasta.py git commit # Describe what you're adding git push --set-upstream origin more-docs ``` Then, visit their repo on GitHub.com. You'll likely see that your pushed branch can be made into a **Pull Request**. A pull request asks the owner to incorporate the changes you sent. Write an explanation of why you're sending these. ```bash git checkout master # Switch back to master git pull # Pull in new changes from GitHub into local copy ``` ## Running on High-Performance Computing Cluster **Before Friday** Please verify you can access the cluster. From terminal or git-bash: ```bash # Replace <netid> with your DukeNet ID ssh <netid>@dcc-slogin-01.oit.duke.edu ``` - When prompted for a password, use your Duke NetID password. - If prompted about verifying the authenticity of the host, say `yes` ### Download material to the cluster ```bash git clone https://github.com/Duke-GCB/scicomp-hpc.git ``` ### Run a command on the cluster To run a command on one of the cluster's _compute nodes_ (as opposed to the _login node_): ```bash srun hostname ``` Can also obtain an interactive shell on one of the compute nodes: ```bash srun --pty bash ``` Run Python script for computation ```bash cd scicomp-hpc # if you haven't already srun python fasta_gc.py data/E2f1_dna.fasta ``` Create a shell script to have more than one command: ```bash= #!/bin/bash echo "Starting GC content computation" python fasta_gc.py data/E2f1_dna.fasta echo "Done with GC content computation" ``` Save as `countgc.sh`. Then submit the script: ```bash sbatch countgc.sh ``` Once the job is done, there'll be a file `slurm-<JOB_ID>.out` with the output from the job, where `<JOB_ID>` is the job ID it received from the scheduler. ### Inquire about your jobs To show the current status of _your_ jobs: ```bash $ squeue -u <NetID> ``` Report of jobs run since midnight: ```bash $ sacct # hlapp@dcc-slogin-01 ~/scicomp-hpc $ sacct # JobID JobName Partition Account AllocCPUS State ExitCode # ------------ ---------- ---------- ---------- ---------- ---------- -------- # 34802706 fasta_gc.+ common gcb 1 FAILED 13:0 # 34802707 python common gcb 1 COMPLETED 0:0 ``` Use `--starttime` argument to include older jobs, for example `--starttime 2018-01-01`. ### Control which resources are being requested Inquire how much memory a job actually used (it's in the MaxRSS column, which is not included in the default report output format): ```bash sacct -o JobName,State,MaxRSS,ReqMem ``` Modify the job script to control the amount of memory being requested: ```bash= #!/bin/bash #SBATCH --mem=400M echo "Starting GC content computation" python fasta_gc.py data/E2f1_dna.fasta echo "Done with GC content computation" ``` Requesting less memory may allow the job to be scheduled sooner. Note, however, that if your job ends up using more memory than requested, it will get killed by the scheduler. ### Aid granular job status tracking within a script Prefix every command in the script with `srun` for which you want separate status reporting: ```bash= #!/bin/bash #SBATCH --mem=400M srun echo "Starting GC content computation" srun python fasta_gc.py data/E2f1_dna.fasta srun echo "Done with GC content computation" ``` These are called "steps" by slurm. Each one will have a separate entry in `sacct`. ### Array jobs: Control how many jobs you run simultaneously ```bash= #!/bin/bash #SBATCH --mem=400M #SBATCH --array=0-4%2 # ^^^^^ means 5 jobs (0-4), 2 at a time # Slurm sets this variable for each invocation of the script echo "SLURM_ARRAY_TASK_ID = $SLURM_ARRAY_TASK_ID" # Turn the TASK_ID into a subset of files to operate on LINENUM=$SLURM_ARRAY_TASK_ID FILENAME=$(ls data/*.fasta | ./get_line.sh $LINENUM) # which file did we extract for operating on? echo $FILENAME # do the actual computation on that filename python fasta_gc.py $FILENAME ``` Notes: * Shell variables are set using the `VAR=value` construct. (There must not be spaces surrounding the `=`.) * The value of a shell variable is referenced using the `$VAR` construct. (Prefix the variable name with a `$`. No space after the `$`.) * `VAR=$(command)` asks the shell to run `command` and to capture the output of that in the variable `VAR`. `command` can use arguments, pipes etc. You can test the result by running `echo $(command)`. * `./get-line.sh` uses the `sed` command to extract one line from its input. You can also combine `head` and `tail` to achieve the same: ```bash=11 FILENAME=$(ls data/*.fasta | head -n $(($LINENUM + 1)) | tail -n 1) ``` This uses the shell (bash) construct `$((expression))` to do simple arithmetic, in this case adding 1 to the zero-based `SLURM_ARRAY_TASK_ID`, because the `-n` argument to `head` requires a 1-based line number. Hence, another (and perhaps better) way would be to do this at the point of defining the line number (because line numbers are typically interpreted as 1-based): ```bash=10 LINENUM=$(($SLURM_ARRAY_TASK_ID + 1)) FILENAME=$(ls data/*.fasta | head -n $LINENUM | tail -n 1) ``` ### Request to be notified upon job completion ```bash= #!/bin/bash #SBATCH --mem=400M #SBATCH --mail-type=END #SBATCH --mail-user=<your email address> #SBATCH --array=0-4%2 # ^^^^^ means 5 jobs (0-4), 2 at a time # Slurm sets this variable for each invocation of the script echo "SLURM_ARRAY_TASK_ID = $SLURM_ARRAY_TASK_ID" # Turn the TASK_ID into a subset of files to operate on LINENUM=$SLURM_ARRAY_TASK_ID FILENAME=$(ls data/*.fasta | ./get_line.sh $LINENUM) # which file did we extract for operating on? echo $FILENAME # do the actual computation on that filename python fasta_gc.py $FILENAME ``` ### What kind of resources does the cluster have? Which nodes are part of a "partition" (a defined set of nodes), and are they occupied? ```bash sinfo -p common ``` ### Getting data files onto the cluster #### Secure copy -- `scp` Use `scp <src> <dest>` to copy files from a local directory to a remote server, or vice versa, or from one remote system to another. The location (`<src>` or `<dest>`) that is remote takes the following form: `<userID>@hostname:directory`. For the DCC, `<userID>` is your NetID, and hostname is the login node: `<netID>@dcc-slogin-01.oit.duke.edu:scicomp-hpc`. (Note that if given as a relative path, the remote directory is relative to your home directory _on the remote system_.) #### Downloading from a URL -- `wget` On the cluster, use `wget` to download files from a URL: ```bash srun wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/chrI.fa.gz ``` If you want to run multiple downloads, you might first open an interactive session (`srun --pty bash`) and then run `wget` commands (and then you can omit the `srun` prefix). ### Making use of installed analysis tools Obtain an interactive shell session (`srun --pty bash`) before running the following! ```bash $ RepeatMasker -bash: RepeatMasker: command not found # lots of additional software is available as "modules": $ module avail ### long list of available modules # # make a software installed as a module available: $ module load RepeatMasker/4.0.5 # now RepeatMasker is found RepeatMasker -species "Saccharomyces cerevisiae" chrI.fa.gz ``` Necessary `module load` commands need to be included in your `sbatch` script. ### Slides for HPC https://github.com/Duke-GCB/SciComp-Oct-2018/releases/download/v1.0/HPC.2018.pdf

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.