owned this note
owned this note
Published
Linked with GitHub
---
tags: course notes
---
# GCB Academy - SciComp - Oct 2018
## Automating tasks with the Unix Shell
October 29, 2018
### Introducing the Shell
- `cd` to change directories and move around "folders" in the shell
- `ls` to list files in a directory.
- `ls -F` to show which ones are directories. They end with a `/`
- `ls -l` for **long** format (owner, size, permissions, etc)
- `ls -a` to list including hidden files/directories
- `pwd` to ask the shell _Where am I now?_
- Use the `TAB` key on your keyboard to complete commands and files
### Navigating Files and Directories
- `cd ..` to go up one level of directories. (e.g. from `dc_sample_data/untrimmed_fastq` to `dc_sample_data`)
- `cd` works with absolute (beginning with a `/` like `cd /Users/dan/Desktop/dc_sample_data`) and with relative paths (like `cd Desktop/dc_sample_data`)
- `cd ~` to go to your HOME directory, `cd -` to go to the directory you were last in
### Working with Files
- Wildcards: `ls *.fastq` to list all files ending with **.fastq**. `*` matches any text of any length. `?` matches a single character, but you can use multiple `???` together
**Exercise**:
Do each of the following tasks from your current directory using a single ls command for each.
- List all of the files in /usr/bin that start with the letter ‘c’.
- `ls /usr/bin/c*`
- List all of the files in /usr/bin that contain the letter ‘a’.
- `ls /usr/bin/*a*`
- List all of the files in /usr/bin that end with the letter ‘o’.
- `ls /usr/bin/*o`
**Bonus**: List all of the files in ‘usr/bin’ that contain the letter ‘a’ or the letter ‘c’.
`ls /usr/bin/*[ac]*`
Command History shortcuts:
- `Ctrl-C` to cancel the current command and get a fresh prompt
- `Ctrl-R` to begin a search through your command history
- `Ctrl-L` to clear the screen
- `history` command will list out everything you've typed, each line with a number
- Use `!260` to repeat the command listed on line 260 above
File Content commands:
- `cat SRR097977.fastq` to show the entire contents of a file, but this will fill the screen quickly, so try...
- `less SRR097977.fastq` to show the file one screen at a time (paging). Use the `Space` key to advance one screen. Up and down arrows (or `j`/`k`) to move back/forward one line.
- `head` and `tail` to show the beginning and end of a file respectively
Moving/copying files, Creating directories:
- `cp SRR097977.fastq SRR097977-copy.fastq` to make a copy of a file
- `mkdir backup` to make a directory called **backup**
- `mv SRR097977-copy.fastq backup/` to move the copied file into the **backup** directory
- `rm backup/SRR097977-copy.fastq` to remove (delete) a file.
- **But there is no trash bin, `rm` means gone**
- Make files read-only with `chmod -w`
- _Minus W_ subtracts write permissions
- `chmod -w SRR097977-copy.fastq`
Removing directories
- `rm` will not remove a directory by default. `rmdir` will remove a directory, but not an empty one!
- `rm -r` to **recursively** remove a directory, by removing the files it contains
- Or, remove the files first and then remove the empty directory with `rmdir`
Break: Return at 10:35
### Redirection
Searching files
- `grep` can search through files:
- `grep NNNNNNNNNN SRR098026.fastq` to find _low-quality_ reads (having 10 consecutive `N`s) in the fastq file
- `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq` to include 1 line **before** and 2 lines **after** the matched line. The read name is the line before the reads, and the quality score is 2 lines after.
Exercise:
1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
- `grep -B1 GNATNACCACTTCC SRR098026.fastq`
2. Search for the sequence `AAGTT` in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
- `grep -B1 AAGTT *.fastq`
Redirecting Output
When we run a command, output usually just _spills_ to the terminal. Often we want to save that output for further steps.
- Use the `>` command to send output from one command to a file. Always overwrites the destination file.
- `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt`
- Then, `less bad_reads.txt` to look at the produced file, or `wc -l bad_reads.txt` to see word/line counts
- Use the `>>` command to **append** output from one command to a file.
- `grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt`
- Use the `|` (pipe) command to send output from one command to the **input** of another command.
- Connect them with a *pipe*
- `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | less`
- Can connect more than 2 commands with pipes too!
Exercise:
How many reads in the `SRR098026.fastq` file contain at least two regions of 5 unknown nucleotides (`N`s) in a row, separated by anything (known or unknown).
Hint: Use a wildcard for the pattern and a pipe character `|` to get the count without creating an intermediate file.
File Manipulation and Pipe Practice
- Looking at tabular data, use `head -1` to see the column names
- `head -n 1 SraRunTable.txt`
- Use `cut` to extract columns from tabular data
- `cut -f3 SraRunTable.txt | grep PAIRED | wc -l`
Exercise:
How many single-end libraries are in our samples?
`cut -f3 SraRunTable.txt | grep SINGLE | wc -l`
> 35
- `uniq` will reduce a list to just the unique items, but usually you want to `sort` first
- `cut -f3 SraRunTable.txt | sort | uniq -c`
- Use `grep -v PATTERN` to filter out things matching the PATTERN.
- `cut -f3 SraRunTable.txt | grep -v LibraryLayout | sort | uniq -c`
Exercise:
1. How many different sample load dates are there?
2. How many samples were loaded on each date?
Hint: You already have these commands, but may need to explore to find the column with the load dates.
`cut -f5 SraRunTable.txt | grep -v LoadDate | sort | uniq -c`
>` 6 25-Jul-12`
>` 31 29-May-14`
Sorting an entire file
- Use `sort -k` (k for **key**) to sort a file by a column number
- `sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt`
#### Writing Scripts
- Use a text editor, like the easy, simple `nano` to put commands into a file
- `nano count-load-dates.sh` opens the nano editor with a new file
- Add your commands to run
- `Control-O` to save (write out) the file, `Control-X` to exit `nano` and return to shell
- Your script is just a file. You can `cat` it!
- Run your script with `bash count-load-dates.sh`
- Make scripts executable by the owning user with `chmod u+x count-load-dates.sh`. Executable scripts can be run without the `bash` command. `./count-load-dates.sh`
### Shell Tips:
- If your desktop is using OneDrive, `cd Desktop` may not work as expected. Right-click the folder you want to access and select **Git Bash Here**
- `man` not working on windows under **Git Bash**? [Internet search for `man ls`](http://google.com/?q=man+ls) or whatever command you're looking for. The man pages are all online.
## Programming with Python
October 30, 2018
Download [python-fasta.zip](https://github.com/Duke-GCB/SciComp-Nov-2017/releases/download/1.0/python-fasta.zip)
To launch Jupyter Notebook from the command line (do this from your home directory, or the directory to which python-fasta.zip was uncompressed):
```
$ jupyter notebook
```
### Variables: How to define and use
```python=
# strings:
greeting = 'Aloha'
print(greeting, 'World')
# integer numbers
age = 5 + 1
print('My daughter is', age, 'years old')
# floating point numbers
lb = 50.0
kg = 2.2 * lb
print(kg)
```
### Indexing into a string
```python=
name = 'Python'
# uses 0-based indexing!
print(name[0])
# negative index means go from the end
print(name[-1])
# can use sequence of indices too (start to end+1)
print(name[0:3], name[3:6])
# is a letter in a string?
'P' in name
```
### Variables point to values
```python=
x = y = 'pencil'
print(x, y)
y = 'apple'
# the value x points to remains the same
print(x, y)
```
### Strings that are sequences
```python=
seq = 'ACTTGCATGC'
```
#### Computing the reverse complement
To tackle a programming problem, it's always useful to break it down:
* Reverse the sequence, then
* Complement the sequence
```python=
# a naïve way:
rev = seq[9] + seq[8] + seq[7] + seq[6] + seq[5] + seq[4] + seq[3] + seq[2] + seq[1] + seq[0]
# won't work whenever the sequence string isn't exactly 10 characters long
#
# use a loop instead
rev = ''
for s in seq:
rev = s + rev
print(rev)
print(seq)
```
Defining and accessing dictionaries:
```python=
counts = {'A': 2, 'C': 4, 'G': 2, 'T': 2}
# index by the key:
print('There are', counts['A'], 'As')
# can also change values:
counts['A'] = 3
# can use a variable for indexing:
base = 'A'
print(base, 'is there', counts[base], 'times')
# can loop over all keys of a dictionary:
for b in counts:
print('Seen', b, count[b], 'times')
```
Complement using a dictionary:
```python=
comps = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
comp = ''
for s in seq:
c = comps[s]
comp = comp + c
print(seq, comp)
```
:::info
Exercise: Combine reverse and complement
:::
```python=
comps = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
revcomp = ''
for s in seq:
# complement the base
c = comps[s]
# build up sequence in reverse
revcomp = c + revcomp
print(seq, revcomp)
```
### Lists and arrays
```python=
seqs = [seq, revcomp, rev]
# also zero-indexed:
print(seqs[0])
# can delete elements from a list:
del seqs[0]
print(seqs)
```
### Conditional blocks
```python=
if 'GC' in seq:
print('Found GC in', seq)
```
#### Calculate %GC
```python=
# initialize count values
gc = 0
total = 0
# loop over each base
for base in seq:
# if it's a G or C increment gc by 1
if base == 'G' or base == 'C':
gc = gc + 1
# increment total by one
total = total + 1
pct = (gc * 100.0) / total
print('GC-content of', seq, 'is', pct)
```
### Functions
Some functions are global - always available
```python=
print('Hello world') # calling a function
len(seq)
x = [5, 6, 1, 9, 2]
x_sorted = sorted(x, reverse=TRUE)
print(x_sorted)
# can be combined:
print(sorted(x, reverse=TRUE))
```
Some functions are "methods" on a data type, uses the `object.methodname()` notation:
```python=
seq_lower = seq.lower()
print(seq, seq_lower)
```
Functions are defined using `def fname(arguments)` notation:
```python=
def double(x):
result = x * 2
return result
print(double(5))
# the variable x in the function is local to the function
# and insulated from the surrounding code
x = 10
print(double(20))
```
```python=
# Reverse function, reverses the sequence provided
def reverse(seq):
rev = ''
for s in seq:
rev = s + rev
return rev
print(reverse(seq), reverse(reverse(seq)), seq)
# Complement function, complements the sequence provided
def complement(seq):
# much better to define this locally here
# than to rely on a global variable
comps = {'A': 'T',
'T': 'A',
'G': 'C',
'C': 'G'}
comp = ''
for s in seq:
c = comps[s]
comp = comp + c
return comp
print(seq, complement(seq))
def revcomp(seq):
return reverse(complement(seq))
print(seq, revcomp(seq))
```
### Reading lines from a file
```python=
def readfasta(filename):
f = open(filename, 'r')
seq = ''
for line in f:
line = line.strip()
if '>' not in line:
seq = seq + line
f.close()
return seq
print(readfasta('ae.fa'))
```
### Importing code that we didn't write
```python=
import sys
# now we can use variables and functions
# defined in module `sys`
print(sys.argv)
# if in a Python script (as opposed to in the Jupyter notebook),
# the first element of the list is the name of the script
```
## Version control with Git
If you didn't attend the Python lesson, or lost the directory from that lesson, download materials in the following way:
```bash
cd # or wherever else you want the materials folder
mkdir read-fasta
git clone https://github.com/Duke-GCB/scicomp-python.git
cp scicomp-python
cp *.py *.fa ../read-fasta
cd ../read-fasta
```
### Asking "What's going on?"
```bash
git status
```
Always tells you the status of your repository
### Configuring your author info and editor
```bash
git config --global user.name "Dan Leehr"
git config --global user.email "dan.leehr@duke.edu"
git config --global core.editor "nano -w"
```
These configuration commands tell git what information to record when you commit code, and what editor to use for writing commit messages.
### Creating a git repository
Locally, in an existing directory:
```bash
git init
```
This creates a _hidden_ `.git` directory, which is where our changes are tracked.
### Committing changes
Two steps:
1. Prepare a set of changes to commit (to staging area)
- `git add read_fasta.py`
- `git status` (optional, but will show you what files are staged for commit)
2. Commit the staged changes with a message (committing)
- `git commit` (will pop open your editor to provide a message)
- Write a brief message in nano explaining the change, and `Control-O` to save
### Looking at changes before committing
```bash
git diff
```
Shows changes to the local files that have not been staged or committed yet.
```bash
git diff --staged
```
Shows changes to the **staged** files that have not been committed yet.
### Viewing history of changes
```bash
git log # Shows commit authors, emails, dates, IDs and full commit messages
git log --oneline # One line for each commit, its ID, and first line of message
git log -p # Commit info, plus the changes to each file for each commit (+/-)
```
### Restoring from backup
Unstaging a staged change:
```bash
git reset HEAD read_fasta.py
git status # Shows there's still a change but it's not staged
```
Restoring a locally changed file to the committed version
```bash
git checkout -- read_fasta.py
git status # Shows that there are no changes
```
### Restoring an earlier committed version
```bash
git log # To find the commit ID you want to revert to
git checkout 3db84d5f -- read_fasta.py # Restore read_fasta.py from how it was in commit 3db84d5f
git status # Shows that above already staged the file, ready to commit
git commit -m "Backs out arg checking" # Commit the staged change with a message right on the command line
```
### Undoing an entire commit
```bash
git log # To find the commit ID you want to undo
git revert 1196dc9 # Reverse the changes made in 1196dc9 and prepare a new commit
# Edit the commit message, but it will be pre-filled with info about revert
git log # Shows that 1196dc9 is still in the history, but a new commit reverses it
```
### Creating a branch
We've so far been using the default `master` branch, but we may want to try some experimental changes or work on a set of changes that belong together.
```bash
git checkout -b ignore-data # Creates a branch called ignore-data and switches to it
git status # shows On branch ignore-data
```
### Ignoring files
Some files we don't want to commit with our code (like data files), but we also don't want git to constantly tell us they're untracked. Here, we can **ignore** them by listing them in a `.gitignore` file
```bash
nano .gitignore # Opens editor
*.fa # List the files and patterns to ignore
*.fasta # Save and exit
git status # Doesn't show the data files but does mention an unstaged .gitignore
git add .gitignore # Stage the file
git commit -m "Ignore data files" # Commit our staged changes
```
### Fixing your last commit
To make changes to the message or files in the commit you just made:
```bash
git commit --amend # Opens editor, prompting you to revise your commit message
```
You can also stage additional changes before amending the commit, and they will be added to the commit.
### Switching branches
```bash
git branch # Lists your branches
git checkout master # Switches to your 'master' branch from 'ignore-data'
git status # shows 'On branch master' and my ignore files are back!
```
### Staging parts of a file
```bash
git add -p read_fasta.py # Prompts you to stage individual changes (hunks) of a file
```
### Incorporating changes from other branches (merges)
```bash
git status # On branch 'master'
git merge ignore-data # Adds the commits from 'ignore-data' onto the end of master
git status # on branch master
git log # Now includes the commits we made on 'ignore-data'
```
### Creating a GitHub repository
GitHub can also host copies of your git repositories and serve as a backup or point of collaboration
See http://swcarpentry.github.io/git-novice/07-github/index.html for screenshots and step-by-step
1. Visit github.com
2. Click **+**, **New Repository**
3. Type a name, leave Public, and click **Create repository**
4. Copy the .git address provided (e.g. https://github.com/dleehr/scicomp-course-2018.git)
5. Add this address as a remote to your local git repository
```bash
git remote add origin https://github.com/dleehr/scicomp-course-2018.git
git push --all origin # Send all of our branches and commits to the repo on GitHub
```
### Pushing new changes
After making and committing more changes, you can synchronize your local work to GitHub by pushing your branch
```bash
git push --set-upstream origin master # Just the first time to tell Git to push local master to origin master
git push # pushes local master to origin master
```
### Collaborating
http://swcarpentry.github.io/git-novice/08-collab/index.html
Get a partner's github id and view their repo
https://github.com/partner-name/scicomp-readfasta
The **Clone or Download** button will provide the **Clone** url.
Then, clone their repo (it'll be separate from your local copy)
```bash
git clone https://github.com/partner-name/scicomp-readfasta.git
cd scicomp-readfasta
git log # Hey, these are Partner's commits, not yours
```
Notice that the **clone**d repo will have an `origin` remote already set, and branch `master`
Create a new branch and make some changes
```bash
git checkout -b more-docs
nano read_fasta.py # Edit the file, add more docs
git add read_fasta.py
git commit # Describe what you're adding
git push --set-upstream origin more-docs
```
Then, visit their repo on GitHub.com. You'll likely see that your pushed branch can be made into a **Pull Request**. A pull request asks the owner to incorporate the changes you sent. Write an explanation of why you're sending these.
```bash
git checkout master # Switch back to master
git pull # Pull in new changes from GitHub into local copy
```
## Running on High-Performance Computing Cluster
**Before Friday**
Please verify you can access the cluster. From terminal or git-bash:
```bash
# Replace <netid> with your DukeNet ID
ssh <netid>@dcc-slogin-01.oit.duke.edu
```
- When prompted for a password, use your Duke NetID password.
- If prompted about verifying the authenticity of the host, say `yes`
### Download material to the cluster
```bash
git clone https://github.com/Duke-GCB/scicomp-hpc.git
```
### Run a command on the cluster
To run a command on one of the cluster's _compute nodes_ (as opposed to the _login node_):
```bash
srun hostname
```
Can also obtain an interactive shell on one of the compute nodes:
```bash
srun --pty bash
```
Run Python script for computation
```bash
cd scicomp-hpc # if you haven't already
srun python fasta_gc.py data/E2f1_dna.fasta
```
Create a shell script to have more than one command:
```bash=
#!/bin/bash
echo "Starting GC content computation"
python fasta_gc.py data/E2f1_dna.fasta
echo "Done with GC content computation"
```
Save as `countgc.sh`. Then submit the script:
```bash
sbatch countgc.sh
```
Once the job is done, there'll be a file `slurm-<JOB_ID>.out` with the output from the job, where `<JOB_ID>` is the job ID it received from the scheduler.
### Inquire about your jobs
To show the current status of _your_ jobs:
```bash
$ squeue -u <NetID>
```
Report of jobs run since midnight:
```bash
$ sacct
# hlapp@dcc-slogin-01 ~/scicomp-hpc $ sacct
# JobID JobName Partition Account AllocCPUS State ExitCode
# ------------ ---------- ---------- ---------- ---------- ---------- --------
# 34802706 fasta_gc.+ common gcb 1 FAILED 13:0
# 34802707 python common gcb 1 COMPLETED 0:0
```
Use `--starttime` argument to include older jobs, for example `--starttime 2018-01-01`.
### Control which resources are being requested
Inquire how much memory a job actually used (it's in the MaxRSS column, which is not included in the default report output format):
```bash
sacct -o JobName,State,MaxRSS,ReqMem
```
Modify the job script to control the amount of memory being requested:
```bash=
#!/bin/bash
#SBATCH --mem=400M
echo "Starting GC content computation"
python fasta_gc.py data/E2f1_dna.fasta
echo "Done with GC content computation"
```
Requesting less memory may allow the job to be scheduled sooner. Note, however, that if your job ends up using more memory than requested, it will get killed by the scheduler.
### Aid granular job status tracking within a script
Prefix every command in the script with `srun` for which you want separate status reporting:
```bash=
#!/bin/bash
#SBATCH --mem=400M
srun echo "Starting GC content computation"
srun python fasta_gc.py data/E2f1_dna.fasta
srun echo "Done with GC content computation"
```
These are called "steps" by slurm. Each one will have a separate entry in `sacct`.
### Array jobs: Control how many jobs you run simultaneously
```bash=
#!/bin/bash
#SBATCH --mem=400M
#SBATCH --array=0-4%2
# ^^^^^ means 5 jobs (0-4), 2 at a time
# Slurm sets this variable for each invocation of the script
echo "SLURM_ARRAY_TASK_ID = $SLURM_ARRAY_TASK_ID"
# Turn the TASK_ID into a subset of files to operate on
LINENUM=$SLURM_ARRAY_TASK_ID
FILENAME=$(ls data/*.fasta | ./get_line.sh $LINENUM)
# which file did we extract for operating on?
echo $FILENAME
# do the actual computation on that filename
python fasta_gc.py $FILENAME
```
Notes:
* Shell variables are set using the `VAR=value` construct. (There must not be spaces surrounding the `=`.)
* The value of a shell variable is referenced using the `$VAR` construct. (Prefix the variable name with a `$`. No space after the `$`.)
* `VAR=$(command)` asks the shell to run `command` and to capture the output of that in the variable `VAR`. `command` can use arguments, pipes etc. You can test the result by running `echo $(command)`.
* `./get-line.sh` uses the `sed` command to extract one line from its input. You can also combine `head` and `tail` to achieve the same:
```bash=11
FILENAME=$(ls data/*.fasta | head -n $(($LINENUM + 1)) | tail -n 1)
```
This uses the shell (bash) construct `$((expression))` to do simple arithmetic, in this case adding 1 to the zero-based `SLURM_ARRAY_TASK_ID`, because the `-n` argument to `head` requires a 1-based line number. Hence, another (and perhaps better) way would be to do this at the point of defining the line number (because line numbers are typically interpreted as 1-based):
```bash=10
LINENUM=$(($SLURM_ARRAY_TASK_ID + 1))
FILENAME=$(ls data/*.fasta | head -n $LINENUM | tail -n 1)
```
### Request to be notified upon job completion
```bash=
#!/bin/bash
#SBATCH --mem=400M
#SBATCH --mail-type=END
#SBATCH --mail-user=<your email address>
#SBATCH --array=0-4%2
# ^^^^^ means 5 jobs (0-4), 2 at a time
# Slurm sets this variable for each invocation of the script
echo "SLURM_ARRAY_TASK_ID = $SLURM_ARRAY_TASK_ID"
# Turn the TASK_ID into a subset of files to operate on
LINENUM=$SLURM_ARRAY_TASK_ID
FILENAME=$(ls data/*.fasta | ./get_line.sh $LINENUM)
# which file did we extract for operating on?
echo $FILENAME
# do the actual computation on that filename
python fasta_gc.py $FILENAME
```
### What kind of resources does the cluster have?
Which nodes are part of a "partition" (a defined set of nodes), and are they occupied?
```bash
sinfo -p common
```
### Getting data files onto the cluster
#### Secure copy -- `scp`
Use `scp <src> <dest>` to copy files from a local directory to a remote server, or vice versa, or from one remote system to another.
The location (`<src>` or `<dest>`) that is remote takes the following form: `<userID>@hostname:directory`. For the DCC, `<userID>` is your NetID, and hostname is the login node: `<netID>@dcc-slogin-01.oit.duke.edu:scicomp-hpc`. (Note that if given as a relative path, the remote directory is relative to your home directory _on the remote system_.)
#### Downloading from a URL -- `wget`
On the cluster, use `wget` to download files from a URL:
```bash
srun wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/chrI.fa.gz
```
If you want to run multiple downloads, you might first open an interactive session (`srun --pty bash`) and then run `wget` commands (and then you can omit the `srun` prefix).
### Making use of installed analysis tools
Obtain an interactive shell session (`srun --pty bash`) before running the following!
```bash
$ RepeatMasker
-bash: RepeatMasker: command not found
# lots of additional software is available as "modules":
$ module avail
### long list of available modules
#
# make a software installed as a module available:
$ module load RepeatMasker/4.0.5
# now RepeatMasker is found
RepeatMasker -species "Saccharomyces cerevisiae" chrI.fa.gz
```
Necessary `module load` commands need to be included in your `sbatch` script.
### Slides for HPC
https://github.com/Duke-GCB/SciComp-Oct-2018/releases/download/v1.0/HPC.2018.pdf