owned this note changed 5 years ago
Linked with GitHub

GCB Academy: Intro to Scientific Computing for Genomics.

November 4-5, 2019

Course website: https://duke-gcb.github.io/SciComp-Nov-2019/ (includes links to materials)

Instructors: Hilmar Lapp, Dan Leehr

Stuck? Put your pink sticky note on the top of your computer.

Monday Morning - Shell

Launching the shell: Git Bash on Windows, Terminal on macOS.

Commands:

  • pwd - print working directory, where am I?
  • ls - list files in current directory
    • ls -F - format directories by trailing /
    • ls -lh - show human-readable sizes
    • ls -a - include hidden files/directories
  • cd - change directories
  • man ls - manual page for a command, e.g. ls

Note: to exit the manual page, press the q key on your keyboard

Exercise: What additional information do you learn with ls -l?

Answer: Date modified, file size, owner, permissions, directory or not

Tab completion:

While typing a command or file name, hit the tab key and the shell will finish the word (or help as much as it can!)

  • cd un<tab> becomes cd untrimmed_fastq/

Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you.

Multiple matches? Start typing and hit tab. The shell will sound a ding, and list possible completions. Keep typing and hit tab again once it can pick a single result.

  • cd (by itself, no directory name) - changes to your home directory (cd ~ has the same result)
  • cd Desktop/shell_data/untrimmed_fastq - go directly into untrimmed_fastq with one command. Remember tab completion!
  • cd .. - go one level up in the directory tree (parent directory)
  • cd ~/Desktop/shell_data - from anywhere on the filesystem since we begin with ~ (home directory)
  • cd - - Go back to where I came from (the previous directory you were in)

Exercise: Find the hidden directory

Relative and Absolute paths

Relative Path: Any path that doesn't begin with a slash. (e.g. Desktop/untrimmed_fastq) Relative paths are relative to where you are.

Absolute Path: Any path that begins with a slash (e.g. /Users/sam/Desktop/untrimmed_fastq)

  • / is the root directory

Files and Directories

cd ~/Desktop/untrimmed_fastq

  • ls *.fastq - list all files ending with .fastq

  • ls *977.fastq - list all files ending with 977.fastq

  • ls /bin/*sh - list all the commands that are shells

  • echo $SHELL - which shell am I running?

  • history - command history with number

    • ! {number, n} - execute the nth command
    • !! is replaced with the contents of the previous command
  • ~ is an absolute path to the home directory

Working with Files

  • cat SRR098026.fastq - outputs ALL contents of the file to the screen (good for small files)
  • less SRR098026.fastq - lets you navigate and scroll through the file
    • space - move forward in the file
    • / + [search term] + [enter] to search within the document
    • b - move backwards in the file
    • g - go to the beginning of the file
    • G - go to the end of the file
    • q - to exit 'less'

Head and Tail

  • head - prints the first 10 lines of a file to the screen (default is 10 lines)
  • tail - prints the last 10 lines of a file to the screen (default is 10 lines)
  • add -n 100 to instead print the first (head) or last (tail) 100 lines of the file to the screen
    example:
    head -n 1 SRR098026.fastq
    will output the first line of the file:
    @SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35

FASTQ File Format

A FASTQ file normally uses four lines per sequence.

  • Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Creating, moving, copying, and removing files

  • cp to copy a file
    • cp /original_path/file.txt /new_path/file-copy.txt
  • mkdir to make a new directory (folder)
    • mkdir backup makes a new directory called "backup"
  • mv can both move a file and/or rename that file
    • mv SRR098026-copy.fastq backup/ moves SRR098026-copy.fastq to the backup directory
    • mv SRR098026-copy.fastq SRR098026-backup.fastq *renames SRR098026-copy.fastq to SRR098026-backup.fastq in the same directory
    • NOTE moving and renaming files are effectively the same thing
    • to move a file "down" a directory into your current directory (1 level up), you can do: mv /backup/file.txt .
      • the . indicates your current directory
    • adding -i to the mv command makes sure you don't accidentally overwrite files mv -i file1.txt file2.txt

File Permissions

  • r read
  • w write
  • x execute

ls -l shows you the permissions of your files:
-rwxr-xr--

  • the first - character is the file type
  • the next three rwx characters are permissions for the file's owner
  • the next three r-x characters are permissions for the group
  • the final three r-- characters are permissions for the current user

chmod change mode of the file (change permission settings)

  • chmod -w file.txt removes write permissions
    • if you try to remove this file without having permission, it will not allow you do delete e.g. rm file.txt will return a "permission denied" message
  • chmod +w file.txt adds write permissions

Removing a file or directory

-rm permanently deletes the file (i.e. does not go into your trash or recycle bin)
- rm -r remove recursive - to delete a directory and everything in it NOTE this command should be used with care, as it can permanently delete a lot of data

Exercise: Make a backup of your FASTQ files and move them to a new backup/ directory. Remove 'write' permission for the backup files.
Answer:

  • rm -r backup
  • cp SRR098026.fastq SRR098026-backup.fastq and cp SRR097977.fastq SRR097977-backup.fastq
  • mkdir backup and mv *-backup.fastq backup
  • chmod -w backup/*-backup.fastq

Redirection and searching with grep

grep is a program that allows you to search within a file without having to open it.

  • grep NNNNNNNNNN SRR098026.fastq will return every single line in the SRR098026 file that contains at least 10 consecutive Ns is printed to the terminal, regardless of how long or short the file is.

Search for a term within a specific context:

  • grep -B 1 -A 2 NNNNNNNNNN SRR098026.fastq
    • This will return the line before the search term and the two lines after the search term (in addition to the instance of 10 Ns)
    • -B before
    • -A after

Exercise

  1. Search for the sequence GNATNACCACTTCC in the SRR098026.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
    • Answer: grep -B1 GNATNACCACTTCC SRR098026.fastq
      returns
      @SRR098026.245 HWUSI-EAS1599_1:2:1:2:801 length=35 GNATNACCACTTCCAGTGCTGANNNNNNNGGGATG
  2. Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match.
    • Answer: grep -B1 AAGTT *.fastq
      returns
      SRR097977.fastq-@SRR097977.11 209DTAAXX_Lenski2_1_7:8:3:247:351 length=36 SRR097977.fastq:GATTGCTTTAATGAAAAAGTCATATAAGTTGCCATG -- SRR097977.fastq-@SRR097977.67 209DTAAXX_Lenski2_1_7:8:3:544:566 length=36 SRR097977.fastq:TTGTCCACGCTTTTCTATGTAAAGTTTATTTGCTTT -- SRR097977.fastq-@SRR097977.68 209DTAAXX_Lenski2_1_7:8:3:724:110 length=36 SRR097977.fastq:TGAAGCCTGCTTTTTTATACTAAGTTTGCATTATAA -- SRR097977.fastq-@SRR097977.80 209DTAAXX_Lenski2_1_7:8:3:258:281 length=36 SRR097977.fastq:GTGGCGCTGCTGCATAAGTTGGGTTATCAGGTCGTT -- SRR097977.fastq-@SRR097977.92 209DTAAXX_Lenski2_1_7:8:3:353:318 length=36 SRR097977.fastq:GGCAAAATGGTCCTCCAGCCAGGCCAGAAGCAAGTT -- SRR097977.fastq-@SRR097977.139 209DTAAXX_Lenski2_1_7:8:3:703:655 length=36 SRR097977.fastq:TTTATTTGTAAAGTTTTGTTGAAATAAGGGTTGTAA -- SRR097977.fastq-@SRR097977.238 209DTAAXX_Lenski2_1_7:8:3:592:919 length=36 SRR097977.fastq:TTCTTACCATCCTGAAGTTTTTTCATCTTCCCTGAT -- SRR098026.fastq-@SRR098026.158 HWUSI-EAS1599_1:2:1:1:1505 length=35 SRR098026.fastq:GNNNNNNNNCAAAGTTGATCNNNNNNNNNTGTGCG

Redirection to capture the output

  • > this is the command to redirect output
    e.g. grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt will redirect the results of this grep search to the bad_reads.txt file

    • Note: this command also created the bad_reads.txt file
  • wc is short for "word count" - returns information about your file:

    • first number is the number of lines
    • second number is number of words
    • third number is number of characters
  • adding -l returns only the number of lines e.g. wc -l file.txt

Exercise: How many sequences in SRR098026.fastq contain at least 3 consecutive Ns?
Answer: wc -l bad_reads.txt returns 249 sequences in the file have at least 3 consecutive Ns

  • >> appends to an existing file e.g. grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt

    • Using the single > to redirect output to an existing file will overwrite contents of the file
  • You can also use a wildcard to redirect output of multiple files to the same output files at once e.g. grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt

NOTE on file extensions:

  • Mac & Windows care (sometimes) about what file extension you use
  • Unix doesn't care at all what you use, so they're entirely for your (the user's) benefit in understanding the contents of the file

The pipe command is |

  • grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l will perform your grep search then "pipe" the output to the wc program, so your output will be the number of lines returned from this grep search.

For Loops

Performs a command over a set of parameters or conditions

To define a variable:
variable_name = information
ex: f = Hello
To access a variable:
\(variable_name ex: `\)f`

Loop structure template:

for iterable_name in list_of_things:
do
stuff
done

ex1, print the names of fastq files in the present working directory:

for f in *fastq
do
echo $f
done

ex2, make a backup of fastq files in the present working directory:
for f in $fastq; do echo cp $f $f.backup ; done

The iterable_name is a variable defined in the loop
For each loop, the iterable_name will contain the next item in the list_of_things
You can write the loop in one line, using a semicolon to deliminate each step (ex2)

Testing your Cluster Account

From your shell prompt, enter:

ssh NETID@dcc-slogin.oit.duke.edu

Replace NETID with your Duke NetID, and enter your NetID password when prompted.

If prompted with

The authenticity of host 'dscr-slogin-01.oit.duke.edu (152.3.100.80)' can't be established.
RSA key fingerprint is SHA256:7RrLpUJew15D1Iij0i9io6XF3RY2LEFxZi67Ep1SlbY.
Are you sure you want to continue connecting (yes/no)? 

Type yes.

If you are greeted with a bash prompt, your account is working correctly.:

dcl9@dcc-slogin-03  ~ $ 

If you are unable to login, please add your NetID below:

  • avd16
  • sa372
  • bd111 (verified login at duke.edu/online but permission denied for ssh)
  • sl375

Monday Afternoon - Programming with Python

File link: https://github.com/Duke-GCB/SciComp-Nov-2017/releases/download/1.0/python-fasta.zip

Running Jupyter Notebook

Method 1: Use Anaconda Navigator and run jupyter
Method 2: Type jupyter notebook into the terminal

Printing and Strings

Functions in python are used with the following syntax: function()
The parentheses contain parameters you want to give the function

To print, use the print() function.

Ex: print('Hello')

A string is a datatype that indicates a word or sequence. In the example above, hello is in quotes, indicating it is a string

Variables are set in python in the following way variable_name = information
To call a variable, you type the variable_name, unlike in the terminal where you add a '$' beforehand

Ex, prints the information stored in the variable called name:

name = 'Sam'
print(name)

You can use numbers (and other datatypes) in print as well

weight = 55.5
print(weight)

Indexing

You can access parts of a string (or list) with indexing.

General format:

string[start:end:countby]

Example, print the first letter in Python:

name = 'Python'
print(name[0])

Indicies in python start at 0, so the first letter or item in a list will be index 0

Indexing starts at the position you indicate and ends at one position ahead. The end value is exclusive, so it is up to that position rather than including that position

Example, print 'ytho' from 'Python':

name = 'Python'
print(name[1:5])

Check if a letter is in a string

General format:

letter in string

letter not in string

Example, returns true:

'a' in 'apple'

Example, returns false:

'b' not in 'bee'

Working with sequences

Given the sequence:
seq = 'ACCTGCATGC'

Ex1, reverse the first 3 letters

base1 = seq[0]
base2 = seq[1]
base3 = seq[2]
r3 = base3 + base2 + base1

For strings, you cannot edit them in the following way:
string[3] = 'a'

Strings are immutable datatypes, which cannot be directly changed.
You can combine parts of a string together to make a new string (Ex1).

Reversing a sequence

You can reverse a sequence with indexing. Recall that the general format is

string[start:end:countby]

If you make the last parameter -1, it will count backwards and reverse the string
Ex2, reverse a string via indexing

name = 'Python'
print(name[::-1])

Looping through a DNA sequence

A for loop can iterate through a string, each iteration will pick the next letter in the string

Ex1, print each letter in a sequence separately:

for letter in 'CACGTAGCA':
    print(letter)

Loop structure is a bit different than bash in the terminal.

General format of a loop in python:

for each_item in collection_of_things:
    do stuff

A loop in python requires a colon at the end of the for statement and each line of code within the loop to be indented

Ex2, reverse a string via for loop:

rev = ''
seq = 'ACCTGCATGC'
for letter in seq:
    rev = letter + rev
print(rev)

Dictionaries

Creating a dictionary:

Empty dictionary: dictionary_name = {}

Dictionary with keys and values:

dictionary_name = {'key1':'value1', 'key2':'value2'}

Adding a key:value pair to a dictionary you already made

dictionary_name['new_key'] = 'new_value'

Ex, a dictionary to give the complement letter

complement_letter = {'A':'T', 'T':'A', 'C':'G','G':'C'}

Getting the value from a dictionary using a key

dictionary_name['key1']

Will return the value for key1, which we defined before as 'value1'

For keys, if the key is a string you will have to use the same exact string, including any capitalization.

Lists

List are containers that can contain items of any datatype (even other lists)

To make a list:

list_name = []

A list can contain different datatypes
Ex, a list with a string, integer (whole number), float (number with decimal points), and another list

list_name = ['CACGTA', 52, 48.082, ['ACTG', 5]]

You can loop (and index) through a list, just like you can loop through a DNA sequence or string

Conditionals, if, elif, else statements

Python has several statements to determine if something is true or false, conditional statements. Some common ones are:

  • == equals
  • > greater than
  • < less than
  • in item in a list, string or iterable sequence
  • not in item not in a list or iterable sequence

You can also use and, or statements to evaluate multiple statements. Or means one of the statements must be true while and means that both have to be true
Ex, all return true:

'a' in 'apple' and 5 < 100
'a' not in 'apple' or 5 > 2

If, elif, and else can be used to perform a task, given a certain condition
General format:

if something == True:
    do something
elif something_else == True:
    do something
else:
    do something

Python will evaluate the if statement first. If the statement returns false, it will then evaluate the elif (else if) statement. If the elif statement returns false, it will then do the else statement If the first if statement is true, it won't evaluate the elif or else statement.

Functions in Python

There are many built in functions available in python (such as the print() function). See the documentation for a list: https://docs.python.org/3/library/functions.html

len gives you the length of an object

print(len(seq))

10

Some commonly used functions:

abs() # finds the absolute value
int() # converts the input to the integer type
type() # tells you the type of the input (e.g. int, float, etc.)
sorted() # sorts the intput

sort() will sort a list

x = [7, 4, 3, 2, 8]
y = sorted(x)

output: [2, 3, 4, 7, 8]

You can also reverse the way sorted() sorts the list:

r = sorted(x, reverse = True)

You can run help(sorted) to learn about what other built-in options are available for a particular function.

Jupyter notebook also allows you to view this in an alternate way:

  • shift + tab while your cursor is in the function call
  • hit shift + tab again to get the pop-up to stay visible, and then hit these keys one more time (3 times total) to get the menu to appear separately at the bottom of your page.

split() will split the contents of your list based on a specified separator (such as '.' or ' ' (space)) into multiple list items. The default is white space between words.
e.g.

seqs = 'rna dna protein'
seqs_list = seqs.split()
print(seqs_list, len(seqs_list))

returns: ['rna', 'dna', 'protein'] 3 (a list with 3 items instead of just 1)

print(type(seqs))

returns:

<class 'str'>

Whereas:

print(type(seqs_list))

returns:

<class 'list'>

Exercise: Functions
bases = 'adenine cytosine guanine thymine'
Write some code that will:

  • Make a list of these bases from the string
  • Uppercases the names (e.g. ['ADENINE', ])
  • Reverses the order (e.g. ['THYMINE', ])

hint: use help(str) and help(list) to see what functions are available for strings and lists
Bonus write a for loop to print the first letter of each (e.g. 'A, C, ')

One solution:

bases = 'adenine cytosine guanine thymine'

# upper() can only be used on strings, so do this before converting to a list using split()
upper_list = bases.upper().split()
# print(upper_list) 
# returns:
# ['ADENINE', 'CYTOSINE', 'GUANINE', 'THYMINE']


# reverse the order of the list
# help(list.reverse) -> "reverse in place"
upper_list.reverse() #the list is now reversed and is still stored in upper_list
# print(upper_list)
# will now return:
# ['THYMINE', 'GUANINE', 'CYTOSINE', 'ADENINE']

Writing your own function

Start by defining the function using def and indenting inside the newly defined functions just like we did in for loops and if statements:

def double(x):
    result = 2 * x
    return result

Now we can use this function by passing in a value for the 'x' argument:

print(double(10))

returns:
20

This function can also multiply floats and even strings!

print(double('Eggs'))

will return:
EggsEggs

Now if we define a new variable:

x = 100

Note that this 'x' will not be confused by the 'x' inside the double() function.

Making a reverse function:

def reverse(seq): # 'seq' is used in a global context here rev = '' for s in seq(): rev = s + rev return rev

Now we can use this new function:

print(reverse(seq))

will give you the reverse ofseq

and

print(reverse(reverse(seq)))

will reverse seq and then reverse it again back to its original state

Making a complement function:

def complement(seq): comp = '' for s in seq: comp = comp + comp[s] return comp

Now we can use this function:

print(reverse(complement(seq)))

will give you the reverse complement of seq

NOTE the comp() function will only work if we have defined the dictionary earlier in the script. It is a much better to idea to include the comps dictionary inside the definition of the function:

def complement(seq):
    comp = ''
    comps = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    for s in seq:
        comp = comp + comp[s]
    return comp

We can combine these two functions into one function that does both in one step:

def rc(seq):
    return reverse(complement(seq))

Working with files in Python

In python, you want to open a file and store it in a variable so it can be used in your program

f = open('ae.fa')

Now the data from ae.fa is stored in the variable f.

We can loop over the data in f:

for line in f:
    print(line)
f.close() # It's a good idea to close your file when you are done working with it

but this will just do the same thing as the cat function on the command line!

Let's make a function to interpret the fasta file in a way that makes the different entries easier to work with. To start, let's convert our code from above into a function:
Step 1:

def read_fasta(filename):
    """
    Read in a file in FASTA format
    """
    f = open(filename)
    for line in f:
        line = line.strip()
        print(line)
    f.close()
read_fasta('ae.fa')

Step 2: add to list instead of printing and return that list

def read_fasta(filename):
    """
    Read in a file in FASTA format    
    """
    seq = ''
    f = open(filename)
    for line in f:
        line = line.strip()
        seq = seq + line
    f.close()
    return seq
read_fasta('ae.fa')

Step 3: is the line a sequence line or identifier line?

def read_fasta(filename):
    """
    Read in a file in FASTA format
    """
    seq = ''
    f = open(filename)
    for line in f:
        line = line.strip()
        if not '>' in line:
            seq = seq + line #only keep the line if it's not an identifier line that begins with '>'
    f.close()
    return seq
print(read_fasta('ae.fa'))

Once you have a function working, you can download it from Jupyter Notebooks as a python script (document with a .py file extension) and use it on your computer.

Rename the file to something more readable if it was given an automated name:
mv download.html read_fasta.py

and you can run the script with:
python read_fasta.py

which can be piped to wc -l
python read_fasta.py | wc -l
or used with any other command line tools.

Tuesday - Version Control

Git Lesson Online Reference: http://swcarpentry.github.io/git-novice/

https://github.com/Duke-GCB/scicomp-python/blob/master/intro-programming-python.ipynb

Python script to use for today. Save as a .py file.

def read_fasta(filename): sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence print(read_fasta('ae.fa'))

General format of git commands

git <verb>

Git commands

  • git init Initialize a git repository
  • git status Status of your git repository
  • git config Sets configurations for git
  • git add Lets git know that you want to make updates to a file
  • git commit Makes a snapshot of the update
  • git diff Shows what has changed
  • git log Gives a log of the commits
  • git checkout Lets you navigate to a branch
  • git reset Undo changes to local git
  • git revert Undo changes by making a commit reversing the previous commit

Within the python-fasta folder, use git init to initialize a repository and git status to check the status

Next, use the following git config commands

git config --global user.name "user_name"
git config --global user.email "email"
git config --global core.editor "nano -w"

Nano is a terminal text editor.

git add python-fasta.py
git commit 

After git commit, the editor will open. Then type in the comments for the commit. Use git status to check that the python-fasta.py file is no longer untracked. After you make changes to the file, running git status should place the file under modified: python-fasta.py

Use git diff to show the changes made. Then git diff --staged will show the change that has been staged.

Use git log, git log --oneline, git checkout 'first_7_chars_in_commit_checksum' -- python-fasta.py

About git reset, restore, and revert: https://git-scm.com/docs/git#_reset_restore_and_revert

General workflow for making commits:

  1. Make a change to the file you are working on, ie. python-fasta.py
  2. Use git add filename
  3. Use git commit, then make comments on the commit

More with Git (Tuesday Afternoon session)

Branching
Making a new branch allows you to maintain different versions of your work silmutaneously.
git branch shows that you are on the master branch

To make a new branch AND move to the new branch in one step:
git checkout -b file-as-param

  • -b means create a new branch
  • git checkout means to go this branch

Now when you run git status, it will say: On brach file-as-param at the top of the output message.

On this new branch, we will modify our python-fasta.py script to take an input file as an argument.

Open python-fasta.py with nano and add import sys to the top of the file, then add sys.argv[1] in place of the file name to indicate that you want the first input from the command line arguments to be used here.

import sys def read_fasta(filename): sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence print(read_fasta(sys.argv[1]))

Now to run this script, you will have to enter:

python python-fasta.py ae.fa

And the script will run like it did before (check that it works before updating in git).

Save your file and exit nano, then again stage your changes:
git add python-fasta-py
and then commit using a quicker way just using the command line argument -m (stands for "message"):
git commit -m 'Parameterizes name of the file'

git log will show this new commit, and will also show that it is on the branch file-as-param.

Next we will make one more change to our script to give a nicer error message if we forget to give a filename as an argument when running the script:

import sys def read_fasta(filename): sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence if len(sys.argv) < 2: print("Need to provide filename as argument") exit(1) print(read_fasta(sys.argv[1]))

Now commmit your updates:
git add
git commit -m 'Catches case of missing filename'

Next we want to go back to the "master" branch, which has the version of python-fasta.py that does not take the filename as an argument.

Check which branch you are currently on using git branch. The branch with the '*' next it is the branch you are currently on.

Note: if you try to run git checkout -b master won't work because the -b argument means "make a new branch", but this branch already exists.

Instead, run:
git checkout master to get back to the master branch.

We can check that this worked by again running git log --oneline which shows that HEAD is pointing to 'master' (HEAD -> master) in the output.

What is a branch name in git?

  • the branch name is essentially a label for a node. Each label/branch name points to a particular commit.
  • HEAD is also just a label. It points to the tip (end) of whichever branch you are on.
  • if you change to a different branch, then the pointer for the HEAD label changes to point to this different branch.
  • git log --online shows you the commits that came before the HEAD node e.g.
332eea2 (HEAD -> master) Revert "Add python as interpreter"
0cbaa3c Adds python as interpreter
f8585b4 Removes cruft from Jupyter Notebook

Working with branches

Make a new branch:
git checkout -b ignore-data
And double check that you successfully switched using
git branch (you should have a '*' next to a branch named 'ignore-data')

We will make a new hidden file so git will ignore certain data files that you don't want to back up.
nano .gitignore to open a new file.

Type into your new document:

*.fa

And save and exit.

Now when you run git status you won't see any files ending with '.fa' in your "Untracked files" section.

You can open .gitignore again with nano and add:

*.fa
*.fasta
*.ipynb 

Run git add .gitignore
and git commit -m 'Ignores data and notebook files' to save your changes with git.

Merging

To merge a branch with another one, you first have to checkout the branch you want to merge into. We want to merge our ignore-data branch with the master branch, so first:
git checkout master
then you an merge the branch you want with the branch you are currently on with the command:
git merge ignore-data

This has created a "fast forward" in git, where the pointer for "master" gets moved "forward" to a later commit (ignore-data) node. The HEAD pointer also is fast forwarded to the ignore-data node.

Now, git log --online will show:

069564e (HEAD -> master, ignore-data) Ignores data and notebook files

Fast forwarding happens when you didn't update anything on the master branch so you're just "fast forwarding" it to a later commit node.

What if we want to do something more complicated and merge two branches with different changes?
git merge file-as-param creates a merge commit by opening a new file in nano. Save and close this file to commit the merge.

Git can usually figure out how to merge two different branches, but sometimes there will be conflicts to be resolved.

Adding your repository to the cloud

Make a new repository on the Github website called python-fasta. You can choose to make it either public or private, but since we want to work with collaborators easily, we will make it public today.

NOTE Do NOT click the option to initialize this repository, because it already exists on our local computer. Only do this if you don't already have a repository created on your computer.

Github will take you to a helpful instructions page. We want to push an existing repository from the command line:
git remote add origin https://github.com/username/python-fasta.git

Git is what is known as a distributed version control system, meaning the repository exists fully on your local computer, but can also be mirrored elsewhere (like your github profile online, on a server, or on a different computer).

  • Naming your main remote repository origin is a commonly used git convention, but is not required.

If you enter the command
git remote
or
git remote -v (for more verbose options)
it will tell you your two command options:

To move your changes from a local computer TO online (remote): push
To move changes from the online version TO your local machine: fetch (sometimes pull)

Enter git push -u origin master to add your changes to your online github repository.

  • Note: the -u option here means "upstream". We will discuss this more later.

Enter your Github username and password when it prompts you to.

If you view this repository on the Github website, we can see it is no longer empty, however it doesn't yet have all of the commits and branches we've made today.

git push --all origin will add the other branches and their commits to the online repository.

Collaborating with others

You can collaborate with others by looking at their code on Github and modifying it. If you'd like to submit your changes to another person's repository, you can submit a pull request. The other person may or may not accept your changes, but this is one avenue through with you can collaborate with others. More on pull requests here: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests

To pull from someone else's repsitory, use the command:
git clone <github-url> where <github-url> is just the url of the repository you want to clone. This will "clone" their repository from github (online) onto your local computer.

It is important to work on your own branch of someone else's repository so you don't create conflicts.
git checkout -b add-docs

Now we will edit their copy of python-fasta.py by adding some documentation.

import sys def read_fasta(filename): # Reads fasta file from disk and returns the sequence sequence = '' f = open(filename) for line in f: line = line.strip() if not '>' in line: # Append to the last sequence sequence = sequence + line f.close() return sequence if len(sys.argv) < 2: print("Need to provide filename as argument") exit(1) print(read_fasta(sys.argv[1]))

git add python-fasta.py to stage your changes.
git commit to open in nano and add the message:

Adds docstring for function

Documentation of the function will be good to have for our future self in a year from now

Using git branch still works to see how this collaborator's repository is structured (what commits have been added in the past), but you need to add a -a argument to see everything in the remote branch:
git branch -a

We have staged and committed our changes, but we still have to push to the remote repository online.
git push -u origin add-docs

  • adding the -u sets up to track the remote branch, which is very helpful when you want to continue collaborating (i.e. if the other user has accepted your pull request)
  • adding -u also allows you to just enter git push in the future instead of having to add "origin" and the branch name every time.

Let's say this other user accepted our pull request. Now we need to get up to date with their master branch:
git fetch origin

So now it's on our computer, but we have to merge:
git merge origin/master

  • note: this is a fast forward merge

We can see our pull request merged with the master branch if we use:
git log --oneline

NOTE If we had used git pull, it would have done the same thing as the two steps we did to update from the online repository (git fetch and git merge). It can be useful to do this in two stages to make sure it's doing the right thing, but usually git pull works fine.

Running programs on HPC

Login to the cluster

ssh netid@dcc-slogin-duke.oit.duke.edu

Enter password, then you should see a prompt to run additional commands

hostname gives the name of the host computer

srun hostname Connects to a remote server in the cluster and gives the hostname.

srun submits a job, you are assigned an ID and it is put into a que.

Interactive job

srun --pty bash Give the bash shell (bash) and let you interact with it (pty)

Some commands to run in the interactive session

  • pwd present working directory
  • free available memory
  • exit ends the interactive job

Exit and go to the login node:

git clone https://github.com/Duke-GCB/scicomp-hpc.git

Then cd into the new folder created, scicomp-hpc. Check out the read_fasta.py file with less fasta_gc.py

To run the file on the cluster:

srun python fasta_gc.py data/E2f1_dna.fasta

To run a set of commands, use sbatch. Use man sbatch to get more information

Write a script to use with sbatch with nano, nano countgc.sh

In the editor type:

#!/bin/batch

echo "Starting GC Counter"
python fasta_gc.py data/E2f1_dna.fasta
echo "Finished GC Counter"
sleep 60
echo "Done"

Save the file. To submit the file run sbatch countgc.sh

By default, it will make a file with the job id in the current directory. Ex: slurm-4272224.out

Results will be placed in the .out file. That file is updated as the job runs.

Checking job status in HPC

  • squeue Inidcates the que of jobs
  • sacct When the job is done, placed on the accounting list

squeue -u netid will give the jobs for just the net id selected

sacct options

  • sacct --mem amount of memory the job used
  • sacct -e all options sacct can show

By default the cluster requests 2GB of memory for each job.

You can edit the script to request a different amount of memory with #SBATCH --mem=requested_amount

Ex, requesting 100MB of memory:
nano countgc.sh

#!/bin/batch
#SBATCH --mem=100M

echo "Starting GC Counter"
python fasta_gc.py data/E2f1_dna.fasta
echo "Finished GC Counter"
sleep 60
echo "Done"

Other options to specify in the script: https://slurm.schedmd.com/sbatch.html

Add the following to have the HPC email you when the job is done

#SBATCH --mail-type=END
#SBATCH --mail-user=email_address

Placing srun in front of the commands in the bash script will indicate if specific parts of the script fail via sacct

#SBATCH --array=1-5%2 will have a batch of items run at a time.

  • 1-5 indicates to run the script 5 times
  • %2 indicates to run 2 at a time

Adding variables to the bash script

#!/bin/bash
#SBATCH --mem=400M
#SBATCH --array=1-5%2

echo $SLURM_ARRAY_TASK_ID
FILENAME=$(ls data/*.fasta | head -n $SLURM_ARRAY_TASK_ID | tail -n 1)
echo $FILENAME

Moving files in/out of the cluster (Afternoon session)

NOTE: Github should be used primarily for small text/script files, CSV files, etc, NOT for backing up your data, as there is a ~1 GB storage limit.

Reminder, you can redirect command line output with >:
echo "Data Staging > datafile.txt"

scp (stands for "secure copy") allows you to securely copy data files to the server:
scp datafile.txt username@dcc-slogin.oit.duke.edu:datafile.txt

  • You should be prompted for your password (for the server, not the one for your local computer)
  • Note: if you don't specify a path after the :, the copied file will end up in your home directory

Another way to download data to the server is with the command wget if you have a url for the data you want.
srun wget http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/chrI.fa.gz

Other files are available in an online databases: e.g. http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/chromosomes/

It's a good idea to move data files out of your home directory on the server, as these usually have limited storage space.

Finding software on the HPC

Modules
Check what software is available with:
module avail

To use software use the command: module load <X>
e.g.
module load RepeatMasker/4.0.5

Check that it is working with module list, which should list the pieces of software you have loaded

You can load multiple modules at one time. Load another program: module load Perl/5.30.0 and run module list again to see that now there are 2 items.

Now you can run the software you have loaded. e.g. RepeatMasker -species "Saccharomyces cerevisiae" chrI.fa.gz

Reminder: use command srun --pty bash to start an interactive session.

module purge will close any software you had loaded. You can double check that this worked by again entering module list to see that it is now empty.

HARDAC is another computing cluster at Duke: https://wiki.duke.edu/x/QgSsAw

Note: saving your scripts (and using version control) is crucial for being able to reproduce your own results and allowing others to repeat your experiments.

Select a repo