owned this note
owned this note
Published
Linked with GitHub
# GCB Academy: Intro to Scientific Computing for Genomics.
November 4-5, 2019
Course website: https://duke-gcb.github.io/SciComp-Nov-2019/ (includes links to materials)
Instructors: Hilmar Lapp, Dan Leehr
Stuck? Put your pink sticky note on the top of your computer.
## Monday Morning - Shell
Launching the shell: **Git Bash** on Windows, **Terminal** on macOS.
### Commands:
- `pwd` - print working directory, where am I?
- `ls` - list files in current directory
- `ls -F` - format directories by trailing `/`
- `ls -lh` - show human-readable sizes
- `ls -a` - include hidden files/directories
- `cd` - change directories
- `man ls` - manual page for a command, e.g. `ls`
Note: to exit the manual page, press the <kbd>q</kbd> key on your keyboard
_Exercise_: What additional information do you learn with `ls -l`?
_Answer_: Date modified, file size, owner, permissions, directory or not
### Tab completion:
While typing a command or file name, hit the <kbd>tab</kbd> key and the shell will finish the word (or help as much as it can!)
- `cd un<tab>` becomes `cd untrimmed_fastq/`
Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you.
Multiple matches? Start typing and hit <kbd>tab</kbd>. The shell will sound a ding, and list possible completions. Keep typing and hit <kbd>tab</kbd> again once it can pick a single result.
### Navigation shortcuts
- `cd` _(by itself, no directory name)_ - changes to your home directory (`cd ~` has the same result)
- `cd Desktop/shell_data/untrimmed_fastq` - go directly into `untrimmed_fastq` with one command. Remember <kbd>tab</kbd> completion!
- `cd ..` - go one level up in the directory tree (parent directory)
- `cd ~/Desktop/shell_data` - from anywhere on the filesystem since we begin with `~` (home directory)
- `cd -` - Go back to where I came from (the previous directory you were in)
_Exercise_: Find the hidden directory
### Relative and Absolute paths
**Relative Path**: Any path that doesn't begin with a slash. (e.g. `Desktop/untrimmed_fastq`) Relative paths are relative to where you are.
**Absolute Path**: Any path that begins with a slash (e.g. `/Users/sam/Desktop/untrimmed_fastq`)
- `/` is the root directory
### Files and Directories
`cd ~/Desktop/untrimmed_fastq`
- `ls *.fastq` - list all files ending with `.fastq`
- `ls *977.fastq` - list all files ending with `977.fastq`
- `ls /bin/*sh` - list all the commands that are shells
- `echo $SHELL` - which shell am I running?
- `history` - command history with number
- `! {number, n}` - execute the nth command
- `!!` is replaced with the contents of the previous command
- `~` is an absolute path to the home directory
### Working with Files
- `cat SRR098026.fastq` - outputs **ALL** contents of the file to the screen (good for small files)
- `less SRR098026.fastq` - lets you navigate and scroll through the file
- `space` - move forward in the file
- `/` + `[search term]` + `[enter]` to search within the document
- `b` - move backwards in the file
- `g` - go to the beginning of the file
- `G` - go to the end of the file
- `q` - to exit 'less'
#### Head and Tail
- `head` - prints the first 10 lines of a file to the screen (default is 10 lines)
- `tail` - prints the last 10 lines of a file to the screen (default is 10 lines)
- add `-n 100` to instead print the first (`head`) or last (`tail`) 100 lines of the file to the screen
example:
`head -n 1 SRR098026.fastq`
will output the first line of the file:
`@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35`
### FASTQ File Format
[A FASTQ file normally uses four lines per sequence.](https://en.wikipedia.org/wiki/FASTQ_format)
- Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters.
- Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
`@SEQ_ID`
`GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT`
`+`
`!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65`
### Creating, moving, copying, and removing files
- `cp` to copy a file
- `cp /original_path/file.txt /new_path/file-copy.txt`
- `mkdir` to make a new directory (folder)
- `mkdir backup` makes a new directory called "backup"
- `mv` can both move a file and/or rename that file
- `mv SRR098026-copy.fastq backup/` **moves** `SRR098026-copy.fastq` to the `backup` directory
- `mv SRR098026-copy.fastq SRR098026-backup.fastq` **renames* `SRR098026-copy.fastq` to `SRR098026-backup.fastq` in the same directory
- **NOTE** moving and renaming files are effectively the same thing
- to move a file "down" a directory into your current directory (1 level up), you can do: `mv /backup/file.txt .`
- the `.` indicates your current directory
- adding `-i` to the `mv` command makes sure you don't accidentally overwrite files `mv -i file1.txt file2.txt`
### File Permissions
- `r` read
- `w` write
- `x` execute
`ls -l` shows you the permissions of your files:
`-rwxr-xr--`
- the first `-` character is the file type
- the next three `rwx` characters are permissions for the file's owner
- the next three `r-x` characters are permissions for the group
- the final three `r--` characters are permissions for the current user
`chmod` change mode of the file (change permission settings)
- `chmod -w file.txt` removes write permissions
- if you try to remove this file without having permission, it will not allow you do delete e.g. `rm file.txt` will return a "permission denied" message
- `chmod +w file.txt` adds write permissions
- `chmod g+w file.txt` adds group write permission
- `chmod u+w file.txt` gives owner write permission
More examples here: https://www.washington.edu/computing/unix/permissions.html
### Removing a file or directory
-`rm` **permanently** deletes the file (i.e. does **not** go into your trash or recycle bin)
- `rm -r` remove recursive - to delete a directory and everything in it **NOTE** this command should be used with care, as it can **permanently** delete a lot of data
*Exercise:* Make a backup of your FASTQ files and move them to a new `backup/` directory. Remove 'write' permission for the backup files.
*Answer:*
- `rm -r backup`
- `cp SRR098026.fastq SRR098026-backup.fastq` and `cp SRR097977.fastq SRR097977-backup.fastq`
- `mkdir backup` and `mv *-backup.fastq backup`
- `chmod -w backup/*-backup.fastq`
### Redirection and searching with grep
`grep` is a program that allows you to search within a file without having to open it.
- `grep NNNNNNNNNN SRR098026.fastq` will return every single line in the SRR098026 file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.
Search for a term within a specific context:
- `grep -B 1 -A 2 NNNNNNNNNN SRR098026.fastq`
- This will return the line before the search term and the two lines after the search term (in addition to the instance of 10 Ns)
- `-B` before
- `-A` after
*Exercise*
1. Search for the sequence GNATNACCACTTCC in the SRR098026.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
- *Answer:* `grep -B1 GNATNACCACTTCC SRR098026.fastq`
**returns**
`@SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35
GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG`
2. Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
- *Answer:* `grep -B1 AAGTT *.fastq`
**returns**
`SRR097977.fastq-@SRR097977.11 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36
SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG
--
SRR097977.fastq-@SRR097977.67 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36
SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT
--
SRR097977.fastq-@SRR097977.68 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36
SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA
--
SRR097977.fastq-@SRR097977.80 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36
SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT
--
SRR097977.fastq-@SRR097977.92 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36
SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT
--
SRR097977.fastq-@SRR097977.139 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36
SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA
--
SRR097977.fastq-@SRR097977.238 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36
SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT
--
SRR098026.fastq-@SRR098026.158 HWUSI-EAS1599_1:2:1:1:1505 length=35
SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG`
### Redirection to capture the output
- `>` this is the command to redirect output
e.g. `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt` will redirect the results of this grep search to the `bad_reads.txt` file
- Note: this command also created the `bad_reads.txt` file
- `wc` is short for "word count" - returns information about your file:
- first number is the number of lines
- second number is number of words
- third number is number of characters
- adding `-l` returns only the number of lines e.g. `wc -l file.txt`
*Exercise:* How many sequences in SRR098026.fastq contain at least 3 consecutive Ns?
*Answer:* `wc -l bad_reads.txt` returns `249` sequences in the file have at least 3 consecutive Ns
- `>>` **appends** to an existing file e.g. `grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt`
- Using the single `>` to redirect output to an existing file will **overwrite** contents of the file
- You can also use a wildcard to redirect output of multiple files to the same output files at once e.g. `grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt`
**NOTE** on file extensions:
- Mac & Windows care (sometimes) about what file extension you use
- Unix doesn't care at all what you use, so they're entirely for your (the user's) benefit in understanding the contents of the file
The **pipe** command is `|`
- `grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l` will perform your grep search then "pipe" the output to the `wc` program, so your output will be the number of lines returned from this grep search.
### For Loops
Performs a command over a set of parameters or conditions
**To define a variable:**
variable_name = information
ex: `f = Hello`
To access a variable:
$variable_name
ex: `$f`
**Loop structure template:**
```
for iterable_name in list_of_things:
do
stuff
done
```
ex1, print the names of fastq files in the present working directory:
```
for f in *fastq
do
echo $f
done
```
ex2, make a backup of fastq files in the present working directory:
`for f in $fastq; do echo cp $f $f.backup ; done`
The iterable_name is a variable defined in the loop
For each loop, the iterable_name will contain the next item in the list_of_things
You can write the loop in one line, using a semicolon to deliminate each step (ex2)
## Testing your Cluster Account
From your shell prompt, enter:
```
ssh NETID@dcc-slogin.oit.duke.edu
```
Replace NETID with your Duke NetID, and enter your NetID password when prompted.
If prompted with
```
The authenticity of host 'dscr-slogin-01.oit.duke.edu (152.3.100.80)' can't be established.
RSA key fingerprint is SHA256:7RrLpUJew15D1Iij0i9io6XF3RY2LEFxZi67Ep1SlbY.
Are you sure you want to continue connecting (yes/no)?
```
Type `yes`.
If you are greeted with a bash prompt, your account is working correctly.:
```
dcl9@dcc-slogin-03 ~ $
```
If you are unable to login, please add your NetID below:
- avd16
- sa372
- bd111 (verified login at duke.edu/online but permission denied for ssh)
- sl375
## Monday Afternoon - Programming with Python
File link: https://github.com/Duke-GCB/SciComp-Nov-2017/releases/download/1.0/python-fasta.zip
Running Jupyter Notebook
Method 1: Use Anaconda Navigator and run jupyter
Method 2: Type `jupyter notebook` into the terminal
### Printing and Strings
Functions in python are used with the following syntax: `function()`
The parentheses contain parameters you want to give the function
To print, use the `print()` function.
Ex: `print('Hello')`
A string is a datatype that indicates a word or sequence. In the example above, hello is in quotes, indicating it is a string
Variables are set in python in the following way `variable_name = information`
To call a variable, you type the variable_name, unlike in the terminal where you add a '$' beforehand
Ex, prints the information stored in the variable called name:
```python
name = 'Sam'
print(name)
```
You can use numbers (and other datatypes) in print as well
```python
weight = 55.5
print(weight)
```
### Indexing
You can access parts of a string (or list) with indexing.
General format:
```
string[start:end:countby]
```
Example, print the first letter in Python:
```python
name = 'Python'
print(name[0])
```
Indicies in python start at 0, so the first letter or item in a list will be index 0
Indexing starts at the position you indicate and ends at one position ahead. The end value is exclusive, so it is up to that position rather than including that position
Example, print 'ytho' from 'Python':
```python
name = 'Python'
print(name[1:5])
```
### Check if a letter is in a string
General format:
```python
letter in string
letter not in string
```
Example, returns true:
```python
'a' in 'apple'
```
Example, returns false:
```python
'b' not in 'bee'
```
### Working with sequences
Given the sequence:
`seq = 'ACCTGCATGC'`
Ex1, reverse the first 3 letters
```python
base1 = seq[0]
base2 = seq[1]
base3 = seq[2]
r3 = base3 + base2 + base1
```
For strings, you cannot edit them in the following way:
`string[3] = 'a'`
Strings are immutable datatypes, which cannot be directly changed.
You can combine parts of a string together to make a new string (Ex1).
#### Reversing a sequence
You can reverse a sequence with indexing. Recall that the general format is
```
string[start:end:countby]
```
If you make the last parameter -1, it will count backwards and reverse the string
Ex2, reverse a string via indexing
```python
name = 'Python'
print(name[::-1])
```
#### Looping through a DNA sequence
A for loop can iterate through a string, each iteration will pick the next letter in the string
Ex1, print each letter in a sequence separately:
```python
for letter in 'CACGTAGCA':
print(letter)
```
Loop structure is a bit different than bash in the terminal.
General format of a loop in python:
```
for each_item in collection_of_things:
do stuff
```
A loop in python requires a colon at the end of the for statement and each line of code within the loop to be indented
Ex2, reverse a string via for loop:
```python
rev = ''
seq = 'ACCTGCATGC'
for letter in seq:
rev = letter + rev
print(rev)
```
#### Dictionaries
Creating a dictionary:
Empty dictionary: `dictionary_name = {}`
Dictionary with keys and values:
```python
dictionary_name = {'key1':'value1', 'key2':'value2'}
```
Adding a key:value pair to a dictionary you already made
```python
dictionary_name['new_key'] = 'new_value'
```
Ex, a dictionary to give the complement letter
```python
complement_letter = {'A':'T', 'T':'A', 'C':'G','G':'C'}
```
Getting the value from a dictionary using a key
```python
dictionary_name['key1']
```
Will return the value for key1, which we defined before as 'value1'
For keys, if the key is a string you will have to use the same exact string, including any capitalization.
#### Lists
List are containers that can contain items of any datatype (even other lists)
To make a list:
```python
list_name = []
```
A list can contain different datatypes
Ex, a list with a string, integer (whole number), float (number with decimal points), and another list
```python
list_name = ['CACGTA', 52, 48.082, ['ACTG', 5]]
```
You can loop (and index) through a list, just like you can loop through a DNA sequence or string
#### Conditionals, if, elif, else statements
Python has several statements to determine if something is true or false, conditional statements. Some common ones are:
- `==` equals
- `>` greater than
- `<` less than
- `in` item in a list, string or iterable sequence
- `not in` item not in a list or iterable sequence
You can also use `and, or` statements to evaluate multiple statements. Or means one of the statements must be true while and means that both have to be true
Ex, all return true:
```python
'a' in 'apple' and 5 < 100
'a' not in 'apple' or 5 > 2
```
If, elif, and else can be used to perform a task, given a certain condition
General format:
```
if something == True:
do something
elif something_else == True:
do something
else:
do something
```
Python will evaluate the if statement first. If the statement returns false, it will then evaluate the elif (else if) statement. If the elif statement returns false, it will then do the else statement If the first if statement is true, it won't evaluate the elif or else statement.
#### Functions in Python
There are many built in functions available in python (such as the `print()` function). See the documentation for a list: https://docs.python.org/3/library/functions.html
`len` gives you the length of an object
```python
print(len(seq))
```
10
Some commonly used functions:
```python
abs() # finds the absolute value
int() # converts the input to the integer type
type() # tells you the type of the input (e.g. int, float, etc.)
sorted() # sorts the intput
```
`sort()` will sort a list
```python
x = [7, 4, 3, 2, 8]
y = sorted(x)
```
output: [2, 3, 4, 7, 8]
You can also reverse the way `sorted()` sorts the list:
```python
r = sorted(x, reverse = True)
```
You can run `help(sorted)` to learn about what other built-in options are available for a particular function.
Jupyter notebook also allows you to view this in an alternate way:
- `shift` + `tab` while your cursor is in the function call
- hit `shift` + `tab` again to get the pop-up to stay visible, and then hit these keys one more time (3 times total) to get the menu to appear separately at the bottom of your page.
`split()` will split the contents of your list based on a specified separator (such as '.' or ' ' (space)) into multiple list items. The default is white space between words.
e.g.
```python
seqs = 'rna dna protein'
seqs_list = seqs.split()
print(seqs_list, len(seqs_list))
```
returns: `['rna', 'dna', 'protein'] 3` (a list with 3 items instead of just 1)
```python
print(type(seqs))
```
returns:
```python
<class 'str'>
```
*Whereas*:
```python
print(type(seqs_list))
```
returns:
```python
<class 'list'>
```
*Exercise:* Functions
bases = 'adenine cytosine guanine thymine'
Write some code that will:
- Make a **list** of these bases from the string
- **Uppercases** the names (e.g. ['ADENINE', ...])
- **Reverses** the order (e.g. ['THYMINE', ...])
hint: use `help(str)` and `help(list)` to see what functions are available for strings and lists
**Bonus** write a for loop to print the first letter of each (e.g. 'A, C, ...')
*One solution:*
```python
bases = 'adenine cytosine guanine thymine'
# upper() can only be used on strings, so do this before converting to a list using split()
upper_list = bases.upper().split()
# print(upper_list)
# returns:
# ['ADENINE', 'CYTOSINE', 'GUANINE', 'THYMINE']
# reverse the order of the list
# help(list.reverse) -> "reverse in place"
upper_list.reverse() #the list is now reversed and is still stored in upper_list
# print(upper_list)
# will now return:
# ['THYMINE', 'GUANINE', 'CYTOSINE', 'ADENINE']
```
### Writing your own function
Start by defining the function using `def` and indenting inside the newly defined functions just like we did in for loops and if statements:
```python
def double(x):
result = 2 * x
return result
```
Now we can use this function by passing in a value for the 'x' argument:
```python
print(double(10))
```
returns:
`20`
This function can also multiply floats and even strings!
```python
print(double('Eggs'))
```
will return:
`EggsEggs`
Now if we define a new variable:
```python
x = 100
```
Note that this 'x' will not be confused by the 'x' *inside* the `double()` function.
**Making a reverse function:**
```python=
def reverse(seq): # 'seq' is used in a global context here
rev = ''
for s in seq():
rev = s + rev
return rev
```
Now we can use this new function:
```python=
print(reverse(seq))
```
will give you the reverse of`seq`
and
```python=
print(reverse(reverse(seq)))
```
will reverse `seq` and then reverse it again back to its original state
**Making a complement function:**
```python=
def complement(seq):
comp = ''
for s in seq:
comp = comp + comp[s]
return comp
```
Now we can use this function:
```python
print(reverse(complement(seq)))
```
will give you the reverse complement of `seq`
**NOTE** the `comp()` function will only work if we have defined the dictionary earlier in the script. It is a much better to idea to include the `comps` dictionary **inside** the definition of the function:
```python
def complement(seq):
comp = ''
comps = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
for s in seq:
comp = comp + comp[s]
return comp
```
We can combine these two functions into one function that does both in one step:
```python
def rc(seq):
return reverse(complement(seq))
```
### Working with files in Python
In python, you want to open a file and store it in a variable so it can be used in your program
```python
f = open('ae.fa')
```
Now the data from `ae.fa` is stored in the variable f.
We can loop over the data in f:
```python
for line in f:
print(line)
f.close() # It's a good idea to close your file when you are done working with it
```
but this will just do the same thing as the `cat` function on the command line!
Let's make a function to interpret the fasta file in a way that makes the different entries easier to work with. To start, let's convert our code from above into a function:
Step 1:
```python
def read_fasta(filename):
"""
Read in a file in FASTA format
"""
f = open(filename)
for line in f:
line = line.strip()
print(line)
f.close()
read_fasta('ae.fa')
```
Step 2: add to list instead of printing and return that list
```python
def read_fasta(filename):
"""
Read in a file in FASTA format
"""
seq = ''
f = open(filename)
for line in f:
line = line.strip()
seq = seq + line
f.close()
return seq
read_fasta('ae.fa')
```
Step 3: is the line a sequence line or identifier line?
```python
def read_fasta(filename):
"""
Read in a file in FASTA format
"""
seq = ''
f = open(filename)
for line in f:
line = line.strip()
if not '>' in line:
seq = seq + line #only keep the line if it's not an identifier line that begins with '>'
f.close()
return seq
print(read_fasta('ae.fa'))
```
Once you have a function working, you can download it from Jupyter Notebooks as a python script (document with a `.py` file extension) and use it on your computer.
Rename the file to something more readable if it was given an automated name:
`mv download.html read_fasta.py`
and you can run the script with:
`python read_fasta.py`
which can be piped to `wc -l`
`python read_fasta.py | wc -l`
or used with any other command line tools.
## Tuesday - Version Control
Git Lesson Online Reference: http://swcarpentry.github.io/git-novice/
https://github.com/Duke-GCB/scicomp-python/blob/master/intro-programming-python.ipynb
Python script to use for today. Save as a .py file.
```python=
def read_fasta(filename):
sequence = ''
f = open(filename)
for line in f:
line = line.strip()
if not '>' in line:
# Append to the last sequence
sequence = sequence + line
f.close()
return sequence
print(read_fasta('ae.fa'))
```
General format of git commands
```
git <verb>
```
Git commands
- `git init` Initialize a git repository
- `git status` Status of your git repository
- `git config` Sets configurations for git
- `git add` Lets git know that you want to make updates to a file
- `git commit` Makes a snapshot of the update
- `git diff` Shows what has changed
- `git log` Gives a log of the commits
- `git checkout` Lets you navigate to a branch
- `git reset` Undo changes to local git
- `git revert` Undo changes by making a commit reversing the previous commit
Within the python-fasta folder, use `git init` to initialize a repository and `git status` to check the status
Next, use the following git config commands
```
git config --global user.name "user_name"
git config --global user.email "email"
git config --global core.editor "nano -w"
```
Nano is a terminal text editor.
```
git add python-fasta.py
git commit
```
After git commit, the editor will open. Then type in the comments for the commit. Use `git status` to check that the python-fasta.py file is no longer untracked. After you make changes to the file, running `git status` should place the file under `modified: python-fasta.py`
Use `git diff` to show the changes made. Then `git diff --staged` will show the change that has been staged.
Use `git log`, `git log --oneline`, `git checkout 'first_7_chars_in_commit_checksum' -- python-fasta.py`
About git reset, restore, and revert: https://git-scm.com/docs/git#_reset_restore_and_revert
General workflow for making commits:
1. Make a change to the file you are working on, ie. python-fasta.py
2. Use `git add filename`
3. Use `git commit`, then make comments on the commit
### More with Git (Tuesday Afternoon session)
**Branching**
Making a new branch allows you to maintain different versions of your work silmutaneously.
`git branch` shows that you are on the master branch
To make a new branch AND move to the new branch in one step:
`git checkout -b file-as-param`
- `-b` means create a new branch
- `git checkout` means to go this branch
Now when you run `git status`, it will say: `On brach file-as-param` at the top of the output message.
On this new branch, we will modify our python-fasta.py script to take an input file as an argument.
Open python-fasta.py with nano and add `import sys` to the top of the file, then add `sys.argv[1]` in place of the file name to indicate that you want the first input from the command line arguments to be used here.
```python=
import sys
def read_fasta(filename):
sequence = ''
f = open(filename)
for line in f:
line = line.strip()
if not '>' in line:
# Append to the last sequence
sequence = sequence + line
f.close()
return sequence
print(read_fasta(sys.argv[1]))
```
Now to run this script, you will have to enter:
```
python python-fasta.py ae.fa
```
And the script will run like it did before (check that it works before updating in git).
Save your file and exit nano, then again stage your changes:
`git add python-fasta-py`
and then commit using a quicker way just using the command line argument `-m` (stands for "message"):
`git commit -m 'Parameterizes name of the file'`
`git log` will show this new commit, and will also show that it is on the branch `file-as-param`.
Next we will make one more change to our script to give a nicer error message if we forget to give a filename as an argument when running the script:
```python=
import sys
def read_fasta(filename):
sequence = ''
f = open(filename)
for line in f:
line = line.strip()
if not '>' in line:
# Append to the last sequence
sequence = sequence + line
f.close()
return sequence
if len(sys.argv) < 2:
print("Need to provide filename as argument")
exit(1)
print(read_fasta(sys.argv[1]))
```
Now commmit your updates:
`git add`
`git commit -m 'Catches case of missing filename'`
Next we want to go back to the "master" branch, which has the version of python-fasta.py that does not take the filename as an argument.
Check which branch you are currently on using `git branch`. The branch with the '\*' next it is the branch you are currently on.
Note: if you try to run `git checkout -b master` won't work because the `-b` argument means "make a new branch", but this branch already exists.
Instead, run:
`git checkout master` to get back to the master branch.
We can check that this worked by again running `git log --oneline` which shows that HEAD is pointing to 'master' (HEAD -> master) in the output.
What is a branch name in git?
- the branch name is essentially a label for a node. Each label/branch name points to a particular commit.
- HEAD is also just a label. It points to the tip (end) of whichever branch you are on.
- if you change to a different branch, then the pointer for the HEAD label changes to point to this different branch.
- `git log --online` shows you the commits that came before the HEAD node e.g.
```
332eea2 (HEAD -> master) Revert "Add python as interpreter"
0cbaa3c Adds python as interpreter
f8585b4 Removes cruft from Jupyter Notebook
```
### Working with branches
Make a new branch:
`git checkout -b ignore-data`
And double check that you successfully switched using
`git branch` (you should have a '\*' next to a branch named 'ignore-data')
We will make a new hidden file so git will ignore certain data files that you don't want to back up.
`nano .gitignore` to open a new file.
Type into your new document:
```
*.fa
```
And save and exit.
Now when you run `git status` you won't see any files ending with '.fa' in your "Untracked files" section.
You can open .gitignore again with nano and add:
```
*.fa
*.fasta
*.ipynb
```
Run `git add .gitignore`
and `git commit -m 'Ignores data and notebook files'` to save your changes with git.
### Merging
To merge a branch with another one, you first have to checkout the branch you want to merge *into*. We want to merge our ignore-data branch with the master branch, so first:
`git checkout master`
then you an merge the branch you want with the branch you are currently on with the command:
`git merge ignore-data`
This has created a "fast forward" in git, where the pointer for "master" gets moved "forward" to a later commit (ignore-data) node. The HEAD pointer also is fast forwarded to the ignore-data node.
Now, `git log --online` will show:
```
069564e (HEAD -> master, ignore-data) Ignores data and notebook files
```
Fast forwarding happens when you didn't update anything on the master branch so you're just "fast forwarding" it to a later commit node.
What if we want to do something more complicated and merge two branches with different changes?
`git merge file-as-param` creates a *merge commit* by opening a new file in nano. Save and close this file to commit the merge.
Git can usually figure out how to merge two different branches, but sometimes there will be conflicts to be resolved.
### Adding your repository to the cloud
Make a new repository on the Github website called **python-fasta**. You can choose to make it either public or private, but since we want to work with collaborators easily, we will make it public today.
**NOTE** Do NOT click the option to initialize this repository, because it already exists on our local computer. Only do this if you don't already have a repository created on your computer.
Github will take you to a helpful instructions page. We want to push an existing repository from the command line:
`git remote add origin https://github.com/username/python-fasta.git`
Git is what is known as a **distributed** version control system, meaning the repository exists fully on your local computer, but can also be mirrored elsewhere (like your github profile online, on a server, or on a different computer).
- Naming your main remote repository *origin* is a commonly used git convention, but is not required.
If you enter the command
`git remote`
or
`git remote -v` (for more verbose options)
it will tell you your two command options:
To move your changes from a local computer TO online (remote): `push`
To move changes from the online version TO your local machine: `fetch` (sometimes `pull`)
Enter `git push -u origin master` to add your changes to your online github repository.
- Note: the `-u` option here means "upstream". We will discuss this more later.
Enter your Github username and password when it prompts you to.
If you view this repository on the Github website, we can see it is no longer empty, however it doesn't yet have all of the commits and branches we've made today.
`git push --all origin` will add the other branches and their commits to the online repository.
### Collaborating with others
You can collaborate with others by looking at their code on Github and modifying it. If you'd like to submit your changes to another person's repository, you can submit a *pull request*. The other person may or may not accept your changes, but this is one avenue through with you can collaborate with others. More on pull requests here: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests
To pull from someone else's repsitory, use the command:
`git clone <github-url>` where `<github-url>` is just the url of the repository you want to clone. This will "clone" their repository from github (online) onto your local computer.
It is important to work on your own branch of someone else's repository so you don't create conflicts.
`git checkout -b add-docs`
Now we will edit their copy of `python-fasta.py` by adding some documentation.
```python=
import sys
def read_fasta(filename):
# Reads fasta file from disk and returns the sequence
sequence = ''
f = open(filename)
for line in f:
line = line.strip()
if not '>' in line:
# Append to the last sequence
sequence = sequence + line
f.close()
return sequence
if len(sys.argv) < 2:
print("Need to provide filename as argument")
exit(1)
print(read_fasta(sys.argv[1]))
```
`git add python-fasta.py` to stage your changes.
`git commit` to open in nano and add the message:
```
Adds docstring for function
Documentation of the function will be good to have for our future self in a year from now
```
Using `git branch` still works to see how this collaborator's repository is structured (what commits have been added in the past), but you need to add a `-a` argument to see everything in the remote branch:
`git branch -a`
We have staged and committed our changes, but we still have to *push* to the remote repository online.
`git push -u origin add-docs`
- adding the `-u` sets up to track the remote branch, which is very helpful when you want to continue collaborating (i.e. if the other user has accepted your pull request)
- adding `-u` also allows you to just enter `git push` in the future instead of having to add "origin" and the branch name every time.
Let's say this other user accepted our pull request. Now we need to get up to date with their master branch:
`git fetch origin`
So now it's on our computer, but we have to merge:
`git merge origin/master`
- note: this is a fast forward merge
We can see our pull request merged with the master branch if we use:
`git log --oneline`
**NOTE** If we had used `git pull`, it would have done the same thing as the *two* steps we did to update from the online repository (`git fetch` and `git merge`). It can be useful to do this in two stages to make sure it's doing the right thing, but usually `git pull` works fine.
### Running programs on HPC
Login to the cluster
ssh netid@dcc-slogin-duke.oit.duke.edu
Enter password, then you should see a prompt to run additional commands
`hostname` gives the name of the host computer
`srun hostname` Connects to a remote server in the cluster and gives the hostname.
`srun` submits a job, you are assigned an ID and it is put into a que.
Interactive job
`srun --pty bash` Give the bash shell (bash) and let you interact with it (--pty)
Some commands to run in the interactive session
- `pwd` present working directory
- `free` available memory
- `exit` ends the interactive job
Exit and go to the login node:
```bash
git clone https://github.com/Duke-GCB/scicomp-hpc.git
```
Then cd into the new folder created, scicomp-hpc. Check out the read_fasta.py file with `less fasta_gc.py`
To run the file on the cluster:
```bash
srun python fasta_gc.py data/E2f1_dna.fasta
```
To run a set of commands, use `sbatch`. Use `man sbatch` to get more information
Write a script to use with sbatch with nano, `nano countgc.sh`
In the editor type:
```bash
#!/bin/batch
echo "Starting GC Counter"
python fasta_gc.py data/E2f1_dna.fasta
echo "Finished GC Counter"
sleep 60
echo "Done"
```
Save the file. To submit the file run `sbatch countgc.sh`
By default, it will make a file with the job id in the current directory. Ex: slurm-4272224.out
Results will be placed in the .out file. That file is updated as the job runs.
#### Checking job status in HPC
- `squeue` Inidcates the que of jobs
- `sacct` When the job is done, placed on the accounting list
`squeue -u netid` will give the jobs for just the net id selected
sacct options
- `sacct --mem` amount of memory the job used
- `sacct -e` all options sacct can show
By default the cluster requests 2GB of memory for each job.
You can edit the script to request a different amount of memory with `#SBATCH --mem=requested_amount`
Ex, requesting 100MB of memory:
`nano countgc.sh`
```bash
#!/bin/batch
#SBATCH --mem=100M
echo "Starting GC Counter"
python fasta_gc.py data/E2f1_dna.fasta
echo "Finished GC Counter"
sleep 60
echo "Done"
```
Other options to specify in the script: <https://slurm.schedmd.com/sbatch.html>
Add the following to have the HPC email you when the job is done
```bash
#SBATCH --mail-type=END
#SBATCH --mail-user=email_address
```
Placing `srun` in front of the commands in the bash script will indicate if specific parts of the script fail via `sacct`
`#SBATCH --array=1-5%2` will have a batch of items run at a time.
- 1-5 indicates to run the script 5 times
- %2 indicates to run 2 at a time
Adding variables to the bash script
```bash
#!/bin/bash
#SBATCH --mem=400M
#SBATCH --array=1-5%2
echo $SLURM_ARRAY_TASK_ID
FILENAME=$(ls data/*.fasta | head -n $SLURM_ARRAY_TASK_ID | tail -n 1)
echo $FILENAME
```
### Moving files in/out of the cluster (Afternoon session)
**NOTE**: Github should be used primarily for small text/script files, CSV files, etc, NOT for backing up your data, as there is a ~1 GB storage limit.
Reminder, you can redirect command line output with `>`:
`echo "Data Staging > datafile.txt"`
`scp` (stands for "secure copy") allows you to securely copy data files to the server:
`scp datafile.txt username@dcc-slogin.oit.duke.edu:datafile.txt`
* You should be prompted for your password (for the server, not the one for your local computer)
* Note: if you don't specify a path after the `:`, the copied file will end up in your home directory
Another way to download data to the server is with the command `wget` if you have a url for the data you want.
`srun wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/chrI.fa.gz`
Other files are available in an online databases: e.g. http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/
- main page with other organisms: https://hgdownload.soe.ucsc.edu/downloads.html
It's a good idea to move data files out of your home directory on the server, as these usually have limited storage space.
### Finding software on the HPC
**Modules**
Check what software is available with:
`module avail`
To use software use the command: `module load <X>`
e.g.
`module load RepeatMasker/4.0.5`
Check that it is working with `module list`, which should list the pieces of software you have loaded
You can load multiple modules at one time. Load another program: `module load Perl/5.30.0` and run `module list` again to see that now there are 2 items.
Now you can run the software you have loaded. e.g. `RepeatMasker -species "Saccharomyces cerevisiae" chrI.fa.gz`
*Reminder: use command `srun --pty bash` to start an interactive session.*
`module purge` will close any software you had loaded. You can double check that this worked by again entering `module list` to see that it is now empty.
HARDAC is another computing cluster at Duke: https://wiki.duke.edu/x/QgSsAw
**Note:** saving your scripts (and using version control) is crucial for being able to reproduce your own results and allowing others to repeat your experiments.