Try   HackMD

Duke UPGG Scientific Computing Bootcamp

Course website: https://duke-gcb.github.io/2019-08-12-Duke/ (includes links to materials)

Instructors: Hilmar Lapp, Dan Leehr

Stuck? Put your pink sticky note on the top of your computer.

Monday Morning - Shell

Launching the shell: Git Bash on Windows, Terminal on macOS.

Commands:

  • pwd - print working directory, where am I?
  • ls - list files in current directory
    • ls -F - format directories by trailing /
    • ls -lh - show human-readable sizes
    • ls -a - include hidden files/directories
  • cd - change directories
  • man ls - manual page for a command, e.g. ls
    Note: to exit the manual page, press the q key on your keyboard

Exercise: What additional information do you learn with ls -l?

Answer: Date modified, file size, owner, permissions, directory or not

Tab completion:

While typing a command or file name, hit the tab key and the shell will finish the word (or help as much as it can!)

  • cd un<tab> becomes cd untrimmed_fastq/

Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you.

Multiple matches? Start typing and hit tab. The shell will sound a ding, and list possible completions. Keep typing and hit tab again once it can pick a single result.

  • cd (by itself, no directory name) - changes to your home directory (cd ~ has the same result)
  • cd Desktop/shell_data/untrimmed_fastq - go directly into untrimmed_fastq with one command. Remember tab completion!
  • cd .. - go one level up in the directory tree (parent directory)
  • cd ~/Desktop/shell_data - from anywhere on the filesystem since we begin with ~ (home directory)
  • cd - - Go back to where I came from (the previous directory you were in)

Exercise: Find the hidden directory

Relative and Absolute paths

Relative Path: Any path that doesn't begin with a slash. (e.g. Desktop/untrimmed_fastq) Relative paths are relative to where you are.

Absolute Path: Any path that begins with a slash (e.g. /Users/sam/Desktop/untrimmed_fastq)

  • / is the root directory

Files and Directories

cd ~/Desktop/untrimmed_fastq

  • ls *.fastq - list all files ending with .fastq

  • ls *977.fastq - list all files ending with 977.fastq

  • ls /bin/*sh - list all the commands that are shells

  • echo $SHELL - which shell am I running?

  • history - command history with number

    • ! {number, n} - execute the nth command
    • !! is replaced with the contents of the previous command
  • ~ is an absolute path to the home directory

  • file display using head and tail

    • head gives the first 10 lines of the file
    • tail gives the last 10 lines of the file
    • use -n to specify the number of line to display
      • e.g. head -n 4 {filename} displays the first 4 lines

Match filenames using wildcard

  • wildcard * - match any (0 or more) character
  • application: prefix, suffix
  • put in double quotes to interpret as-is, not as the wildcard special character
    • e.g. ls "*fastq" - returns *fastq, instead of weldcard expansion

Display file content

  • cat {filename} - concatenate the file content and display on the terminal

  • less {filename} - allows you to page through the content using Space; interactive

    • G – go to the end of file
    • g – go to the start of file
    • q – exit the less pager
    • /{pattern} - search for a pattern which will take you to the next occurrence
      • n – for next match in forward
      • N – for previous match in backward
  • command > file redirects a command’s output to a file (overwriting any existing content).

  • command >> file appends a command’s output to a file.

  • pipe | - inputs the output from one command to another command

  • wc {filename} gives four-columnar output: total number of lines, words, characters, filename

    • -l returns the number of lines in file only
  • grep [option] {pattern} {filename} search for exact pattern or by regular expression

    • -B and -A: before and after the match, respectively

      • e.g. -B1 -A2 gives one line before and two lines after the match
    • -v include everything except for the pattern

    • -- not specifying any flag

macOS users: The -- characters that appear are called a "Group Separator". There is no flag in macOS grep that prevents these, but you can pipe grep's output to remove them:

grep NNNNNNNNNN -A 2 -B 1 SRR098026.fastq | grep -v -- --

Moving files around

  • mkdir {dir_name} make a directory in the current directory
  • cp copies files to a target named by the last argument on its command line. If the target is an existing file, cp overwrites it; if it does not exist, cp creates it.
    • cp {source_file(s)} {destination_dir} - copy file(s) to a directory
    • cp {source_file} {destination_file}- copy a file to another file
    • -i interactive: prompt to ask whether want to overwrite an existing file

Move & rename files

  • mv {old_filename} {new_filename}
  • mv {filename} {dest-dir}
    • -i prompt for overwrite
    • -n make it not overwrite
      Deleting a file/directory
  • use rm {file} to remove/delete a file
  • use rm -r {dir} to recursively remove all the files in a directory
  • Note: There is NO trash bin on Unix system. Anything that is deleted is gone forever.
  • need permission from the owner of the directory

Permissions

  • chmod -w {filename} - take away the write permission from everyone
    • good practice: don't want to modify the raw data files
    • the - and + in front of w does not indicate flag

FASTQ file format (4 lines per read)

  1. A sequence identifier with information about the sequencing run and the cluster.
  2. The sequence (the base calls; A, C, T, G and N).
  3. A separator, which is simply a plus (+) sign.
  4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

Intro to Programming

The For-Loop
for x in *.fastq ; do [echo] cp $x $x.backup ; done

  • echo reflects the command, remove it to get the following effect:
    • for every fastq file in this directory, create a backup

Monday Afternoon - Programming in Python

Jupyter notebook

  • RUN shortcut: shift + enter

Python

  • assignment operator: =, assigns value without returning anything
  • comparison operator: ==, returns a boolean value
  • print() can take multiple, comma-separated arguments, such as strings, int, etc.
  • zero-indexed

String operations

  • str_var[0] gives the first character in the string
  • negative index counts from the end, starting from -1
  • slicing: str_var[start : end : step], with the ending index being exclusive
  • in a boolean operator that allows you to check if a character is in a string
    • e.g. name = "Python" then P in name would return True
  • Python strings are immutable: every time you do operations with strings, e.g. slice, concatenate, you are building up a separate new string

The for-loop

for item in [iterable]:
    # do something with item
  • Indentation matters in Python! Make sure the body of your loop is properly indented.

Reversing a string

seq = 'ABC' rev = '' # initialize an empty string to be built-on later for s in seq: rev = rev + s # append s to rev and re-assign it to rev # s is a variable that is visible only WITHIN the loop (the concept of 'scope'); \ # it could be named however you like. print(rev) # Returns 'CBA' # Note: the print() statement is outside of the loop, \ # so it prints AFTER the loop is done

Dictionary

  • a collection which is unordered, changeable and indexed.
  • defined using curly brackets and it contains key-value pairs
    dict = {key: value}
  • it is mutable (unlike string, which is immutable)
  • used in for-loop

Example 1: get the reverse complement of a DNA sequence

seq = 'AAGTC' comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'} revcomp = '' for b in seq: revcomp = comps[b] + revcomp print(revcomp) # returns 'GACTT'

Example 2: calculate the GC content of a DNA sequence

gc = 0 total = 0 for s in seq: if s in 'GC': # if s=='G' or s=='C': gc = gc + 1 total += 1 percent = (gc *100) / total print(percent)

Example 3: if-else statement

  • once a condition is evaluated to True, the other cases that follow will be skipped over
  • if you want both conditions to be evaluated, use two if statements instead of hooking them up in one if-else statement.
if percent < 35: print('low') elif percent < 55: print('normal') else: print('high')

Functions for iterabls

  • len(iterable)
  • sorted(iterable, reverse=False)
seqs = "rna dna protein" seqslist = seqs.split() # split by space print(seqslist) # returns ['rna', 'dna', 'protein'] print(len(seqslist)) # returns 3

Writing functions

def double(x): # the x only defined within the scope of this function result = x * 2 return result print(double(100)) # 200 print(double("Hello")) # HelloHello

Re-visiting examples using functions:

def reverse(seq): rev = '' for s in seq: rev = s + rev return rev def complement(seq): comp = '' comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'} # if comps is not defined within the function, \\ # the function will look outside for a global varrable --> more vulnerable for s in seq: c = comps[s] comp = comp + c return comp def reverse_complement(seq): return reverse(complement(seq)) test_seq = "AGCTCCA" test_seq == reverse_complement(reverse_complement(test_seq)) #True

Reversing a string using slice

  • new_str = old_str[::-1]
  • Remember, the args for the slice notation are: [start, end, step]
  • So the notation is saying: build new_str that includes the entire old_str, and do it by back-stepping, one character at a time

Working with Files

def read_fasta(filename): ''' Extract sequence data from FASTA file in filename ''' seq = '' f = open(filename) for line in f: line = line.strip() # remove leading and trailing white spaces and newline characters # check if not empty line and doesn't start with > if len(line)>0 and not line[0] == '>': seq = seq + line f.close() return seq

Note:

  • if A and B: if A is evaluated to False, it will stop and return without checking B, because False "and" with anything is False

Nano: the Linux command line text-editor

  • nano {filename}: create new file if filename doesn't exist, else open it for editing
  • take input from command line using the sys module
# within file named test.py import sys sys.argv[1] # use it somewhere in the script...

Run test.py with the command

python test.py {input_file}

Note:

  • argv[0] is the name of the script; argv[1] is the input filename
  • when path is not specified, the interpreter looks for a file within the directory where the script is. If the input file is in a different directory, make sure to specify the path (absolvute and relative paths both work).

Tuesday morning - Data & Project organization

  • Use standardized and consistent notation for naming files and labeling columns
  • Keep your code and data in seperate folders
  • Avoid manual data cleaning and adjustments, having it in code keeps your modifications documented
  • Make sure to do regular backups

Tuesday morning - Data Exploration

Importing the pandas module

  • Allows you to work with dateframes and large datasets
import pandas as pd

Jupyter notebook, magic commands

  • Start with a '%'
  • Specific to a notebook enviornment
  • %matplotlib inline will produce plots within the notebook
%matplotlib inline

Initial setup

import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline

Reading a table with pandas

gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")

Outline of the read_table command used:

  • pd.read_table(x,y)
    • x is the file location as a string
    • y is how the data in the file is delimited. i.e. ',' or '\t'

Commands to explore a dataframe, where gapminder is the variable name for the dataframe read through pd.read_table()

gapminder.head() # returns the first 5 rows gapminder.tail() # returns the last 5 rows gapminder.columns.values # returns the column names gapminder.shape # returns the number of rows and columns gapminder.info() # returns information on variable types gapminder.describe() # returns descriptive statistics

How to subset a dataframe by columns

Use square brackets containing a list of the column names that you want

gapminder[['pop', 'life Exp', 'gdpPercap']]

You can also select a specific column via the following format:

gapminder.pop # but only works for column names w/o spaces gapminder['pop'] # the same, but works for any column name

This returns a subset of the dataframe. You can also use some of the commands given above on subsets

# Get descriptive statistics for specific columns gapminder[['pop', 'life Exp', 'gdpPercap']].describe() # Get the mean for one column gapminder['life Exp'].mean()

Getting unique values in pandas via pd.unique()

pd.unique(gapminder['region'])

For each unique value, getting how many times it occurs with .value_counts()

Example 1: Unique values for region

# How many unique regions are in the data?
print(len(gapminder['region'].unique()))

# How many times does each unique region occur?
gapminder['region'].value_counts()

Example 2: Unique values for years

# How many unique regions are in the data?
print(len(gapminder['year'].unique()))

# How many times does each unique region occur?
gapminder['year'].value_counts()

Important note on dataframes and copies

If you just subset a dataframe and work on it, or set another variable to equal a dataframe then modifications of those will modify the original dataframe.

The solution is to use the .copy() command

gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t") gapminder_copy = gapminder.copy() gapminder_copy.head()

Now, if you work with gapminder_copy, it won't affect gapminder

If your data table has missing data points, you can remove them with the .dropna() command

gapminder_copy = gapminder_copy.dropna() gapminder_copy.head()

You can change the data type of a column via the .astype() command

Example: changing year and population from floating points to integers

gapminder_copy['year'] = gapminder_copy['year'].astype(int) gapminder_copy['pop'] = gapminder_copy['pop'].astype(int) gapminder_copy.info()

Removing duplicates with .drop_duplicates()

gapminder_copy = gapminder_copy.drop_duplicates() gapminder_copy.head()

Reindexing a dataframe with .reset_index()

# Resets the index, but saves the old index as a new column gapminder_copy = gapminder_copy.reset_index() # Resets the index and does not save the old index gapminder_copy = gapminder_copy.reset_index(drop=True)

Regular expressions and the .replace() command

Regular expressions allow you to specifically define a search pattern

Example: Replace 'congo' with 'africa_dem rep congo'

gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True) gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True) gapminder_copy['region'].value_counts()

Basic regular expressions:

  • A dot ' . ' character is a wildchard for any character
  • An asterisk ' * ' indicates any number of times (0 or more occurrences)
  • A caret ' ^ ' means at the beginning of the string

Tidy data

Tidy datasets have two main properties

  • every variable has its own column
  • every observation has its own row

The gapminder dataset used one column for both the continent and country, which can be fixed with the .split() command

gapminder_copy['country']=gapminder_copy['region'].str.split('_', 1).str[1] gapminder_copy['continent']=gapminder_copy['region'].str.split('_', 1).str[0] gapminder_copy.head()

In the example, the .split() command takes two arguments.

  • The first argument is what deliniates the string. In this case, continent and country were seperated with an underscore
  • The second argument indicates that it should seperate at the first underscore

Removing a column from a dataframe with the .drop() command

gapminder_copy = gapminder_copy.drop('region', 1) #1 stands for column gapminder_copy.head()

Concatenating two dataframes with pd.concat()

gapminder_comb = pd.concat([gapminder_copy, PRB])

Slicing dataframes

Slicing dataframes to select rows works in much the same way as selecting characters in strings

Example

gapminder_copy[0:15] # first 15 rows gapminder_copy[-10:] # last 10 rows

Wednesday morning - data exploration cont'd

Exploring data

Summaries

  • .mean() : mean
  • .median() : median
  • .groupby(by='continent').mean() : mean by categorical groups
  • .count(): count the number of entries
  • pro-tip: sorting can be done by using .sort_value()

SubsetDataFrame to Series for a specific column

gdpcap = gapminder_comb[['continent', 'country']].groupby(by='continent').size() gdpcap = gapminder_comb[['continent', 'country']].groupby(by='country').mean()

Creating new DataFrame of summaries

continent_mean_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').mean() continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'}) continent_row_ct = gapminder_comb[['continent', 'country']].groupby(by='continent').count() continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'}) continent_median_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').median() continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'}) gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1) gapminder_summs = gapminder_summs.rename(columns = {'y':'year'}) gapminder_summs

Visualization

libraries to load:

  • numpy
  • matplotlib
  • seaborn
import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns sns.set()

Histogram

# import numpy as np plt.hist(gapminder_copy['lifeexp']) plt.xlabel('lifeexp') plt.ylabel('count')

Boxplot

sns.boxplot(x='year', y='lifeexp', data=gapminder_copy) plt.xlabel('year') plt.ylabel('count')

Scatterplot

plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp']) plt.xlabel('gdppercap') plt.ylabel('lifeexp') # let's try plotting the log of x plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp']) plt.xscale('log') plt.xlabel('gdppercap') plt.ylabel('lifeexp')

Wednesday Morning - Automation with Python

Download gapminder_cleaned.csv

Setting up Directories

mkdir automation mkdir automation/code mkdir automation/data mv gapminder_cleaned.csv automation/data/ ls automation/data

If your file downloaded as .txt or .tsv, rename using the mv command

mv gapminder_cleaned.txt gapminder_cleaned.csv #or move it and rename it mv gapminder_cleaned.txt automation/data/gapminder_cleaned.csv

Layout of python notebook

  1. load libraries
  2. establish variables (DataFrames, file names)
  3. exploratory analysis
  4. visualization

Defensive coding

create a try-catch block to 'catch' your errors

# set VERBOSE True if you want to head the data, false otherwise VERBOSE = True cleaned_data_location = 'data/gapminder_cleaned.csv' try: df = pd.read_csv(cleaned_data_location) if VERBOSE: print(df.head()) except: FileNotFoundError: print('Could not find data file, check path? You tried', cleaned_data_location)

Use assertion errors to test assumptions

years = df['year'].unique() years.sort() assert years[-1] == 2007 assert years[0] == 1952

DRY: dont repeat yourself!

Calculate mean life expectancy of Asia in 1952

mask_asia = df['continent'] == 'asia' df_asia = df[mask_asia] mask_1952 = df_asia['year'] == 1952 df_1952 = df_asia[mask_1952] value = np.mean(df_1952['lifeexp']) if VERBOSE: print('mean for 1952 is ', value) # create an empty list to store life expectancies over time result = [] # append a row to list with a tuple containing your result result.append(('asia', '1952', value)) # Turn the summary into a dataframe so that we can visualize easily result = pd.DataFrame(result, columns=['continent', 'year', 'lifeexp'])

Calculate mean life expectancy of Asia for each year

mask_asia = df['continent'] == 'asia' df_asia = df[mask_asia] # Pull out all the years from the dataset years = df_asia['year'].unique() # create an empty list to store the summary values summary = [] # pull out the mean life expectancy at each year for year in years: mask_year = df_asia['year'] == year df_year = df_asia[mask_year] value = np.mean(df_year['lifeexp']) summary.append(('asia', year, value)) # Turn the summary into a dataframe so that we can visualize easily summary = pd.DataFrame(summary, columns=['continent', 'year', 'lifeexp'])

Change code to calculate mean life expectancy for any continent

# Define which continent / category we will use category = 'lifeexp' continent = 'africa' # Create a mask that selects the continent of choice mask_continent = df['continent'] == continent df_continent = df[mask_continent] # Loop through years and calculate the statistic of interest years = df_continent['year'].unique() summary = [] for year in years: mask_year = df_continent['year'] == year df_year = df_continent[mask_year] value = np.mean(df_year[category]) summary.append((continent, year, value)) # Turn the summary into a dataframe so that we can visualize easily summary = pd.DataFrame(summary, columns=['continent', 'year', category])

Refactor code into a function to calculate mean/median/mode/etc.. over time of any user defined category and continent

  • def [function]: defines a function
  • your function can have arguments which you can set defaults (i.e. the default for verbose is false)
    • def calculate_mean_over_time(data, category, continent, func, verbose=False):
  • including verbose conditional statements are great for debugging code
  • when passing a function as an argument, do not include the parenthesis
    • calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)
  • It is good programming to include docstrings in functions so that help() can be used on your function
  • add assertions to test that the arguments the user inputs are valid (checking types is not reccomendend, instead one can check the behavior or an object)
def calculate_func_over_time(data, category, continent, func, verbose=False): """ calculate values of a statistic through time Args: data: a data frame category: one of the column headers of the data frame (e.g. 'lifeexp') continent: possible value of the continent column of that data frame (e.g. 'asia') func: the funtion to apply to data values (e.g. np.mean) verbose: show debugging output (optional, defaults to False) Returns: a summary table of value per year. """ # check the values provided assert type(df) == pd.core.frame.DataFrame assert category in data.columns.value assert 'continent' in data.columns.value assert continent in data['continent'].unique() # Create a mask that selects the continent of choice mask_continent = data['continent'] == continent data_continent = data[mask_continent] # Loop through years and calculate the statistic of interest years = data_continent['year'].unique() summary = [] for year in years: if verbose: print(year) mask_year = data_continent['year'] == year data_year = data_continent[mask_year] value = func(data_year[category]) if verbose: print('adding ', continent , value) summary.append((continent, year, value)) # Turn the summary into a dataframe so that we can visualize easily summary = pd.DataFrame(summary, columns=['continent', 'year', category]) return summary

Run function

VERBOSE = False calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)

Visualize

#use this function to plot life expectancy| through time continents = df['continent'].unique() VERBOSE = False fig, ax = plt.subplots() for continent in continents: output = calculate_statistic_over_time(df,"lifeexp", continent, np.median) output.plot.line('year', "lifeexp", ax=ax, label=continent)

Putting functions into modules

  • create a new file to hold functions ex. automation.py
  • place function in file
  • import the necessary libraries for function
  • in the main file, import function at the top of the file
import numpy as np import pandas as pd import pylab as plt import matplotlib from automation import calculate_func_over_time

Bonus tips:

  • dir(df) : output all attributes of df
    • i.e. .shape, .info
  • interactive text editors (i.e. Jupyter Notebook) are reccomended for code that has an output where as functions can be put in a regular text editor
  • parsing error messages in python the most helpful part is usually at the end

Wednesday afternoon - Sharing your computational work

  • share work using GitHub
  • Code shared on GitHub is static
  • Using binder, you can share interactive versions of notebooks on GitHub
    • need to specify all of the library dependecies for the notebook in requirements.txt

Thursday morning - version control

https://github.com/Duke-GCB/scicomp-python/archive/master.zip

Git commands:

  • git status: gives information about repository
  • git init: initialize git repository
  • git config: configure git settings
    • --globabl: changes settings on a global level. ex git config --global user.email "e@mail.com"
    • --list: list all the configurations ex git config --global --list
  • git add: adds a file to a staging area
    • git add . to add all files
  • git commit: commits file to repo and will prompt the user to leave a message
    • -a or --all: will commit everything in repo
  • git diff: prints out the changes of files in repo
    • git diff --staged to view all the staged changes
  • git checkout: restore prior version of file
    • -b [branch name] : create a new branch
  • git log: shows the commit logs
    • --name-only: shows file names with commit messages
  • git merge [branch name]: makes any branch the master branch, aka a fast-forward. When used after fetching the global repo, the merge will make both the local and remote branch parents.
  • git branch: list branches of repo
    • -a: to see all (local and remote) repo

Tell git to ignore files

  • use nano .gitignore to edit file
  • then add the files you want to ignore. ex *.fasta *.fa
  • git add .gitignore
  • git commit -m "add files to be ignored"

GitHub Remotes

  • git remote : manages the set of remote repos
  • git push: pushes local repo content to a remote repo
    • --set-upstream: set default orgin master
  • git fetch [branch]: fetches content from a remote, by default from origin
  • git pull = git fetch and git merge

Merge local and remote repositories

git fetch [repo] git merge origin/master git push --set-upstream origin master

or

git pull [repo] git push --set-upstream origin master

Collaborting

  1. Add collaborator on GitHub
  2. Clone repository locally git clone [link to repository]
  3. Make contribution in isolated branch git checkout -b [branch name]
  4. Add changes to staging area git add [filename]
  5. Commit changes git commit
  6. Propose contribution
    • git push --set-upstream origin [branch name]
    • Open pull request on GitHub (will be populated by your local commit) and create the pull request
  7. Owner accepts or denies contributions

Open questions

How do we export to an excel document or other document (i.e. csv, txt) from python?