Duke UPGG Scientific Computing Bootcamp

Course website: https://duke-gcb.github.io/2019-08-12-Duke/ (includes links to materials)

Instructors: Hilmar Lapp, Dan Leehr

Stuck? Put your pink sticky note on the top of your computer.

Monday Morning - Shell

Launching the shell: Git Bash on Windows, Terminal on macOS.

Commands:

pwd - print working directory, where am I?
ls - list files in current directory
- ls -F - format directories by trailing /
- ls -lh - show human-readable sizes
- ls -a - include hidden files/directories
cd - change directories
man ls - manual page for a command, e.g. ls
Note: to exit the manual page, press the q key on your keyboard

Exercise: What additional information do you learn with ls -l?

Answer: Date modified, file size, owner, permissions, directory or not

Tab completion:

While typing a command or file name, hit the tab key and the shell will finish the word (or help as much as it can!)

cd un<tab> becomes cd untrimmed_fastq/

Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you.

Multiple matches? Start typing and hit tab. The shell will sound a ding, and list possible completions. Keep typing and hit tab again once it can pick a single result.

cd (by itself, no directory name) - changes to your home directory (cd ~ has the same result)
cd Desktop/shell_data/untrimmed_fastq - go directly into untrimmed_fastq with one command. Remember tab completion!
cd .. - go one level up in the directory tree (parent directory)
cd ~/Desktop/shell_data - from anywhere on the filesystem since we begin with ~ (home directory)
cd - - Go back to where I came from (the previous directory you were in)

Exercise: Find the hidden directory

Relative and Absolute paths

Relative Path: Any path that doesn't begin with a slash. (e.g. Desktop/untrimmed_fastq) Relative paths are relative to where you are.

Absolute Path: Any path that begins with a slash (e.g. /Users/sam/Desktop/untrimmed_fastq)

/ is the root directory

Files and Directories

cd ~/Desktop/untrimmed_fastq

ls *.fastq - list all files ending with .fastq
ls *977.fastq - list all files ending with 977.fastq
ls /bin/*sh - list all the commands that are shells
echo $SHELL - which shell am I running?
history - command history with number
- ! {number, n} - execute the nth command
- !! is replaced with the contents of the previous command
~ is an absolute path to the home directory
file display using head and tail
- head gives the first 10 lines of the file
- tail gives the last 10 lines of the file
- use -n to specify the number of line to display
  - e.g. head -n 4 {filename} displays the first 4 lines

Match filenames using wildcard

wildcard * - match any (0 or more) character
application: prefix, suffix
put in double quotes to interpret as-is, not as the wildcard special character
- e.g. ls "*fastq" - returns *fastq, instead of weldcard expansion

Display file content

cat {filename} - concatenate the file content and display on the terminal
less {filename} - allows you to page through the content using Space; interactive
- G – go to the end of file
- g – go to the start of file
- q – exit the less pager
- /{pattern} - search for a pattern which will take you to the next occurrence
  - n – for next match in forward
  - N – for previous match in backward
command > file redirects a command’s output to a file (overwriting any existing content).
command >> file appends a command’s output to a file.
pipe | - inputs the output from one command to another command
wc {filename} gives four-columnar output: total number of lines, words, characters, filename
- -l returns the number of lines in file only
grep [option] {pattern} {filename} search for exact pattern or by regular expression
- -B and -A: before and after the match, respectively
  - e.g. -B1 -A2 gives one line before and two lines after the match
- -v include everything except for the pattern
- -- not specifying any flag

macOS users: The -- characters that appear are called a "Group Separator". There is no flag in macOS grep that prevents these, but you can pipe grep's output to remove them:

grep NNNNNNNNNN -A 2 -B 1 SRR098026.fastq | grep -v -- --

Moving files around

mkdir {dir_name} make a directory in the current directory
cp copies files to a target named by the last argument on its command line. If the target is an existing file, cp overwrites it; if it does not exist, cp creates it.
- cp {source_file(s)} {destination_dir} - copy file(s) to a directory
- cp {source_file} {destination_file}- copy a file to another file
- -i interactive: prompt to ask whether want to overwrite an existing file

Move & rename files

mv {old_filename} {new_filename}
mv {filename} {dest-dir}
- -i prompt for overwrite
- -n make it not overwrite
  Deleting a file/directory
use rm {file} to remove/delete a file
use rm -r {dir} to recursively remove all the files in a directory
Note: There is NO trash bin on Unix system. Anything that is deleted is gone forever.
need permission from the owner of the directory

Permissions

chmod -w {filename} - take away the write permission from everyone
- good practice: don't want to modify the raw data files
- the - and + in front of w does not indicate flag

FASTQ file format (4 lines per read)

A sequence identifier with information about the sequencing run and the cluster.
The sequence (the base calls; A, C, T, G and N).
A separator, which is simply a plus (+) sign.
The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

Intro to Programming

The For-Loop
for x in *.fastq ; do [echo] cp $x $x.backup ; done

echo reflects the command, remove it to get the following effect:
- for every fastq file in this directory, create a backup

Monday Afternoon - Programming in Python

Jupyter notebook

RUN shortcut: shift + enter

Python

assignment operator: =, assigns value without returning anything
comparison operator: ==, returns a boolean value
print() can take multiple, comma-separated arguments, such as strings, int, etc.
zero-indexed

String operations

str_var[0] gives the first character in the string
negative index counts from the end, starting from -1
slicing: str_var[start : end : step], with the ending index being exclusive
in a boolean operator that allows you to check if a character is in a string
- e.g. name = "Python" then P in name would return True
Python strings are immutable: every time you do operations with strings, e.g. slice, concatenate, you are building up a separate new string

The for-loop

for item in [iterable]:
    # do something with item

Indentation matters in Python! Make sure the body of your loop is properly indented.

Reversing a string














seq = 'ABC'
rev = ''
# initialize an empty string to be built-on later

for s in seq:
    rev = rev + s 
    # append s to rev and re-assign it to rev
    # s is a variable that is visible only WITHIN the loop (the concept of 'scope'); \
    # it could be named however you like.

print(rev) 
# Returns 'CBA'
# Note: the print() statement is outside of the loop, \
# so it prints AFTER the loop is done

Dictionary

a collection which is unordered, changeable and indexed.
defined using curly brackets and it contains key-value pairs
dict = {key: value}
it is mutable (unlike string, which is immutable)
used in for-loop

Example 1: get the reverse complement of a DNA sequence







seq = 'AAGTC'
comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'}
revcomp = ''
for b in seq:
    revcomp = comps[b] + revcomp
print(revcomp)
# returns 'GACTT'

Example 2: calculate the GC content of a DNA sequence









gc = 0
total = 0
for s in seq:
    if s in 'GC': 
    # if s=='G' or s=='C':
        gc = gc + 1
    total += 1
percent = (gc *100) / total
print(percent)

Example 3: if-else statement

once a condition is evaluated to True, the other cases that follow will be skipped over
if you want both conditions to be evaluated, use two if statements instead of hooking them up in one if-else statement.






if percent < 35:
    print('low')
elif percent < 55:
    print('normal')
else:
    print('high')

Functions for iterabls

len(iterable)
sorted(iterable, reverse=False)




seqs = "rna dna protein"
seqslist = seqs.split() # split by space
print(seqslist) # returns ['rna', 'dna', 'protein']
print(len(seqslist)) # returns 3

Writing functions







def double(x):
    # the x only defined within the scope of this function
    result = x * 2
    return result
    
print(double(100)) # 200
print(double("Hello")) # HelloHello

Re-visiting examples using functions:





















def reverse(seq):
    rev = ''
    for s in seq:
        rev = s + rev
    return rev
    
def complement(seq):
    comp = ''
    comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'}
    # if comps is not defined within the function, \\
    # the function will look outside for a global varrable --> more vulnerable
    for s in seq:
        c = comps[s]
        comp = comp + c
    return comp
    
def reverse_complement(seq):
    return reverse(complement(seq))
    
test_seq = "AGCTCCA"
test_seq == reverse_complement(reverse_complement(test_seq)) #True

Reversing a string using slice

new_str = old_str[::-1]
Remember, the args for the slice notation are: [start, end, step]
So the notation is saying: build new_str that includes the entire old_str, and do it by back-stepping, one character at a time

Working with Files













def read_fasta(filename):
    '''
    Extract sequence data from FASTA file in filename
    '''
    seq = ''
    f = open(filename)
    for line in f:
        line = line.strip() # remove leading and trailing white spaces and newline characters
        # check if not empty line and doesn't start with >
        if len(line)>0 and not line[0] == '>':
            seq = seq + line 
    f.close() 
    return seq

Note:

if A and B: if A is evaluated to False, it will stop and return without checking B, because False "and" with anything is False

Nano: the Linux command line text-editor

nano {filename}: create new file if filename doesn't exist, else open it for editing
take input from command line using the sys module



# within file named test.py
import sys
sys.argv[1] # use it somewhere in the script...

Run test.py with the command

python test.py {input_file}

Note:

argv[0] is the name of the script; argv[1] is the input filename
when path is not specified, the interpreter looks for a file within the directory where the script is. If the input file is in a different directory, make sure to specify the path (absolvute and relative paths both work).

Tuesday morning - Data & Project organization

Use standardized and consistent notation for naming files and labeling columns
Keep your code and data in seperate folders
Avoid manual data cleaning and adjustments, having it in code keeps your modifications documented
Make sure to do regular backups

Tuesday morning - Data Exploration

Importing the pandas module

Allows you to work with dateframes and large datasets

import pandas as pd

Jupyter notebook, magic commands

Start with a '%'
Specific to a notebook enviornment
%matplotlib inline will produce plots within the notebook

%matplotlib inline

Initial setup




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Reading a table with pandas

gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")

Outline of the read_table command used:

pd.read_table(x,y)
- x is the file location as a string
- y is how the data in the file is delimited. i.e. ',' or '\t'

Commands to explore a dataframe, where gapminder is the variable name for the dataframe read through pd.read_table()






gapminder.head() # returns the first 5 rows
gapminder.tail() # returns the last 5 rows
gapminder.columns.values # returns the column names
gapminder.shape # returns the number of rows and columns
gapminder.info() # returns information on variable types
gapminder.describe() # returns descriptive statistics

How to subset a dataframe by columns

Use square brackets containing a list of the column names that you want


gapminder[['pop', 'life Exp', 'gdpPercap']]

You can also select a specific column via the following format:


gapminder.pop # but only works for column names w/o spaces
gapminder['pop'] # the same, but works for any column name

This returns a subset of the dataframe. You can also use some of the commands given above on subsets




# Get descriptive statistics for specific columns
gapminder[['pop', 'life Exp', 'gdpPercap']].describe()
# Get the mean for one column
gapminder['life Exp'].mean()

Getting unique values in pandas via pd.unique()


pd.unique(gapminder['region'])

For each unique value, getting how many times it occurs with .value_counts()

Example 1: Unique values for region

# How many unique regions are in the data?
print(len(gapminder['region'].unique()))

# How many times does each unique region occur?
gapminder['region'].value_counts()

Example 2: Unique values for years

# How many unique regions are in the data?
print(len(gapminder['year'].unique()))

# How many times does each unique region occur?
gapminder['year'].value_counts()

Important note on dataframes and copies

If you just subset a dataframe and work on it, or set another variable to equal a dataframe then modifications of those will modify the original dataframe.

The solution is to use the .copy() command



gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
gapminder_copy = gapminder.copy()
gapminder_copy.head()

Now, if you work with gapminder_copy, it won't affect gapminder

If your data table has missing data points, you can remove them with the .dropna() command


gapminder_copy = gapminder_copy.dropna()
gapminder_copy.head()

You can change the data type of a column via the .astype() command

Example: changing year and population from floating points to integers



gapminder_copy['year'] = gapminder_copy['year'].astype(int)
gapminder_copy['pop'] = gapminder_copy['pop'].astype(int)
gapminder_copy.info()

Removing duplicates with .drop_duplicates()


gapminder_copy = gapminder_copy.drop_duplicates()
gapminder_copy.head()

Reindexing a dataframe with .reset_index()




# Resets the index, but saves the old index as a new column
gapminder_copy = gapminder_copy.reset_index()
# Resets the index and does not save the old index
gapminder_copy = gapminder_copy.reset_index(drop=True)

Regular expressions and the .replace() command

Regular expressions allow you to specifically define a search pattern

Example: Replace 'congo' with 'africa_dem rep congo'



gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].value_counts()

Basic regular expressions:

A dot ' . ' character is a wildchard for any character
An asterisk ' * ' indicates any number of times (0 or more occurrences)
A caret ' ^ ' means at the beginning of the string

Tidy data

Tidy datasets have two main properties

every variable has its own column
every observation has its own row

The gapminder dataset used one column for both the continent and country, which can be fixed with the .split() command



gapminder_copy['country']=gapminder_copy['region'].str.split('_', 1).str[1]
gapminder_copy['continent']=gapminder_copy['region'].str.split('_', 1).str[0]
gapminder_copy.head()

In the example, the .split() command takes two arguments.

The first argument is what deliniates the string. In this case, continent and country were seperated with an underscore
The second argument indicates that it should seperate at the first underscore

Removing a column from a dataframe with the .drop() command


gapminder_copy = gapminder_copy.drop('region', 1) #1 stands for column
gapminder_copy.head()

Concatenating two dataframes with pd.concat()


gapminder_comb = pd.concat([gapminder_copy, PRB])

Slicing dataframes

Slicing dataframes to select rows works in much the same way as selecting characters in strings

Example


gapminder_copy[0:15] # first 15 rows
gapminder_copy[-10:] # last 10 rows

Wednesday morning - data exploration cont'd

Exploring data

Summaries

.mean() : mean
.median() : median
.groupby(by='continent').mean() : mean by categorical groups
.count(): count the number of entries
pro-tip: sorting can be done by using .sort_value()

SubsetDataFrame to Series for a specific column


gdpcap = gapminder_comb[['continent', 'country']].groupby(by='continent').size()
gdpcap = gapminder_comb[['continent', 'country']].groupby(by='country').mean()

Creating new DataFrame of summaries









continent_mean_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder_comb[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1)
gapminder_summs = gapminder_summs.rename(columns = {'y':'year'})
gapminder_summs

Visualization

libraries to load:

numpy
matplotlib
seaborn




import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

Histogram




# import numpy as np
plt.hist(gapminder_copy['lifeexp'])
plt.xlabel('lifeexp')
plt.ylabel('count')

Boxplot



sns.boxplot(x='year', y='lifeexp', data=gapminder_copy)
plt.xlabel('year')
plt.ylabel('count')

Scatterplot










plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')


# let's try plotting the log of x
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xscale('log')
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')

Wednesday Morning - Automation with Python

Download gapminder_cleaned.csv

Setting up Directories





mkdir automation
mkdir automation/code
mkdir automation/data
mv gapminder_cleaned.csv automation/data/
ls automation/data

If your file downloaded as .txt or .tsv, rename using the mv command



mv gapminder_cleaned.txt gapminder_cleaned.csv
#or move it and rename it
mv gapminder_cleaned.txt automation/data/gapminder_cleaned.csv

Layout of python notebook

load libraries
establish variables (DataFrames, file names)
exploratory analysis
visualization

Defensive coding

create a try-catch block to 'catch' your errors










# set VERBOSE True if you want to head the data, false otherwise
VERBOSE = True 

cleaned_data_location = 'data/gapminder_cleaned.csv'
try:
    df = pd.read_csv(cleaned_data_location)
    if VERBOSE:
    print(df.head())
except: FileNotFoundError:
    print('Could not find data file, check path? You tried', cleaned_data_location)

Use assertion errors to test assumptions




years = df['year'].unique()
years.sort()
assert years[-1] == 2007
assert years[0] == 1952

DRY: dont repeat yourself!

Calculate mean life expectancy of Asia in 1952

















mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]

mask_1952 = df_asia['year'] == 1952
df_1952 = df_asia[mask_1952]
value = np.mean(df_1952['lifeexp'])

if VERBOSE:
    print('mean for 1952 is ', value)
# create an empty list to store life expectancies over time
result = []

# append a row to list with a tuple containing your result
result.append(('asia', '1952', value))
    
# Turn the summary into a dataframe so that we can visualize easily
result = pd.DataFrame(result, columns=['continent', 'year', 'lifeexp'])

Calculate mean life expectancy of Asia for each year


















mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]

# Pull out all the years from the dataset
years = df_asia['year'].unique()

# create an empty list to store the summary values
summary = []

# pull out the mean life expectancy at each year
for year in years:
    mask_year = df_asia['year'] == year
    df_year = df_asia[mask_year]
    value = np.mean(df_year['lifeexp'])
    summary.append(('asia', year, value))
    
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', 'lifeexp'])

Change code to calculate mean life expectancy for any continent




















# Define which continent / category we will use
category = 'lifeexp'
continent = 'africa'

# Create a mask that selects the continent of choice
mask_continent = df['continent'] == continent
df_continent = df[mask_continent]

# Loop through years and calculate the statistic of interest
years = df_continent['year'].unique()
summary = []

for year in years:
    mask_year = df_continent['year'] == year
    df_year = df_continent[mask_year]
    value = np.mean(df_year[category])
    summary.append((continent, year, value))
    
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])

Refactor code into a function to calculate mean/median/mode/etc.. over time of any user defined category and continent

def [function]: defines a function
your function can have arguments which you can set defaults (i.e. the default for verbose is false)
- def calculate_mean_over_time(data, category, continent, func, verbose=False):
including verbose conditional statements are great for debugging code
when passing a function as an argument, do not include the parenthesis
- calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)
It is good programming to include docstrings in functions so that help() can be used on your function
add assertions to test that the arguments the user inputs are valid (checking types is not reccomendend, instead one can check the behavior or an object)








































def calculate_func_over_time(data, category, continent, func, verbose=False):
    """ calculate values of a statistic through time

    Args:
        data: a data frame
        category: one of the column headers of the data frame (e.g. 'lifeexp')
        continent: possible value of the continent column of that data frame (e.g. 'asia')
        func: the funtion to apply to data values (e.g. np.mean)
        verbose: show debugging output (optional, defaults to False)
        
    Returns:
        a summary table of value per year.

    """
    # check the values provided
    assert type(df) == pd.core.frame.DataFrame
    assert category in data.columns.value 
    assert 'continent' in data.columns.value
    assert continent in data['continent'].unique()
    
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]

    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        if verbose:
            print(year)
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        if verbose:
            print('adding ', continent , value)
        summary.append((continent, year, value))

    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary

Run function


VERBOSE = False
calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)

Visualize








#use this function to plot life expectancy| through time
continents = df['continent'].unique()
VERBOSE = False
fig, ax = plt.subplots()

for continent in continents:
    output = calculate_statistic_over_time(df,"lifeexp", continent, np.median)
    output.plot.line('year', "lifeexp", ax=ax, label=continent)

Putting functions into modules

create a new file to hold functions ex. automation.py
place function in file
import the necessary libraries for function
in the main file, import function at the top of the file





import numpy as np
import pandas as pd
import pylab as plt
import matplotlib
from automation import calculate_func_over_time

Bonus tips:

dir(df) : output all attributes of df
- i.e. .shape, .info
interactive text editors (i.e. Jupyter Notebook) are reccomended for code that has an output where as functions can be put in a regular text editor
parsing error messages in python – the most helpful part is usually at the end

share work using GitHub
Code shared on GitHub is static
Using binder, you can share interactive versions of notebooks on GitHub
- need to specify all of the library dependecies for the notebook in requirements.txt

Thursday morning - version control

https://github.com/Duke-GCB/scicomp-python/archive/master.zip

Git commands:

git status: gives information about repository
git init: initialize git repository
git config: configure git settings
- --globabl: changes settings on a global level. ex git config --global user.email "e@mail.com"
- --list: list all the configurations ex git config --global --list
git add: adds a file to a staging area
- git add . to add all files
git commit: commits file to repo and will prompt the user to leave a message
- -a or --all: will commit everything in repo
git diff: prints out the changes of files in repo
- git diff --staged to view all the staged changes
git checkout: restore prior version of file
- -b [branch name] : create a new branch
git log: shows the commit logs
- --name-only: shows file names with commit messages
git merge [branch name]: makes any branch the master branch, aka a fast-forward. When used after fetching the global repo, the merge will make both the local and remote branch parents.
git branch: list branches of repo
- -a: to see all (local and remote) repo

Tell git to ignore files

use nano .gitignore to edit file
then add the files you want to ignore. ex *.fasta *.fa
git add .gitignore
git commit -m "add files to be ignored"

GitHub Remotes

git remote : manages the set of remote repos
git push: pushes local repo content to a remote repo
- --set-upstream: set default orgin master
git fetch [branch]: fetches content from a remote, by default from origin
git pull = git fetch and git merge

Merge local and remote repositories



git fetch [repo]
git merge origin/master
git push --set-upstream origin master


git pull [repo]
git push --set-upstream origin master

Collaborting

Add collaborator on GitHub
Clone repository locally git clone [link to repository]
Make contribution in isolated branch git checkout -b [branch name]
Add changes to staging area git add [filename]
Commit changes git commit
Propose contribution
- git push --set-upstream origin [branch name]
- Open pull request on GitHub (will be populated by your local commit) and create the pull request
Owner accepts or denies contributions

Open questions

How do we export to an excel document or other document (i.e. csv, txt) from python?

See .to_csv on Pandas API