Course website: https://duke-gcb.github.io/2019-08-12-Duke/ (includes links to materials)
Instructors: Hilmar Lapp, Dan Leehr
Stuck? Put your pink sticky note on the top of your computer.
Launching the shell: Git Bash on Windows, Terminal on macOS.
pwd
- print working directory, where am I?ls
- list files in current directory
ls -F
- format directories by trailing /
ls -lh
- show human-readable sizesls -a
- include hidden files/directoriescd
- change directoriesman ls
- manual page for a command, e.g. ls
Exercise: What additional information do you learn with ls -l
?
Answer: Date modified, file size, owner, permissions, directory or not
While typing a command or file name, hit the tab key and the shell will finish the word (or help as much as it can!)
cd un<tab>
becomes cd untrimmed_fastq/
Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you.
Multiple matches? Start typing and hit tab. The shell will sound a ding, and list possible completions. Keep typing and hit tab again once it can pick a single result.
cd
(by itself, no directory name) - changes to your home directory (cd ~
has the same result)cd Desktop/shell_data/untrimmed_fastq
- go directly into untrimmed_fastq
with one command. Remember tab completion!cd ..
- go one level up in the directory tree (parent directory)cd ~/Desktop/shell_data
- from anywhere on the filesystem since we begin with ~
(home directory)cd -
- Go back to where I came from (the previous directory you were in)Exercise: Find the hidden directory
Relative Path: Any path that doesn't begin with a slash. (e.g. Desktop/untrimmed_fastq
) Relative paths are relative to where you are.
Absolute Path: Any path that begins with a slash (e.g. /Users/sam/Desktop/untrimmed_fastq
)
/
is the root directorycd ~/Desktop/untrimmed_fastq
ls *.fastq
- list all files ending with .fastq
ls *977.fastq
- list all files ending with 977.fastq
ls /bin/*sh
- list all the commands that are shells
echo $SHELL
- which shell am I running?
history
- command history with number
! {number, n}
- execute the nth command!!
is replaced with the contents of the previous command~
is an absolute path to the home directory
file display using head
and tail
head
gives the first 10 lines of the filetail
gives the last 10 lines of the file-n
to specify the number of line to display
head -n 4 {filename}
displays the first 4 linesMatch filenames using wildcard
ls "*fastq"
- returns *fastq
, instead of weldcard expansionDisplay file content
cat {filename}
- concatenate the file content and display on the terminal
less {filename}
- allows you to page through the content using Space; interactive
/{pattern}
- search for a pattern which will take you to the next occurrence
command > file
redirects a command’s output to a file (overwriting any existing content).
command >> file
appends a command’s output to a file.
pipe |
- inputs the output from one command to another command
wc {filename}
gives four-columnar output: total number of lines, words, characters, filename
-l
returns the number of lines in file onlygrep [option] {pattern} {filename}
search for exact pattern or by regular expression
-B
and -A
: before and after the match, respectively
-B1 -A2
gives one line before and two lines after the match-v
include everything except for the pattern
--
not specifying any flag
macOS users: The --
characters that appear are called a "Group Separator". There is no flag in macOS grep
that prevents these, but you can pipe grep's output to remove them:
grep NNNNNNNNNN -A 2 -B 1 SRR098026.fastq | grep -v -- --
Moving files around
mkdir {dir_name}
make a directory in the current directorycp
copies files to a target named by the last argument on its command line. If the target is an existing file, cp overwrites it; if it does not exist, cp creates it.
cp {source_file(s)} {destination_dir}
- copy file(s) to a directorycp {source_file} {destination_file}
- copy a file to another file-i
interactive: prompt to ask whether want to overwrite an existing fileMove & rename files
mv {old_filename} {new_filename}
mv {filename} {dest-dir}
-i
prompt for overwrite-n
make it not overwriterm {file}
to remove/delete a filerm -r {dir}
to recursively remove all the files in a directoryPermissions
chmod -w {filename}
- take away the write permission from everyone
-
and +
in front of w
does not indicate flagFASTQ file format (4 lines per read)
The For-Loop
for x in *.fastq ; do [echo] cp $x $x.backup ; done
echo
reflects the command, remove it to get the following effect:
Jupyter notebook
Python
=
, assigns value without returning anything==
, returns a boolean valueprint()
can take multiple, comma-separated arguments, such as strings, int, etc.String operations
str_var[0]
gives the first character in the stringstr_var[start : end : step]
, with the ending index being exclusivein
a boolean operator that allows you to check if a character is in a string
name = "Python"
then P in name
would return TrueThe for-loop
for item in [iterable]:
# do something with item
Reversing a string
seq = 'ABC'
rev = ''
# initialize an empty string to be built-on later
for s in seq:
rev = rev + s
# append s to rev and re-assign it to rev
# s is a variable that is visible only WITHIN the loop (the concept of 'scope'); \
# it could be named however you like.
print(rev)
# Returns 'CBA'
# Note: the print() statement is outside of the loop, \
# so it prints AFTER the loop is done
Dictionary
dict = {key: value}
Example 1: get the reverse complement of a DNA sequence
seq = 'AAGTC'
comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'}
revcomp = ''
for b in seq:
revcomp = comps[b] + revcomp
print(revcomp)
# returns 'GACTT'
Example 2: calculate the GC content of a DNA sequence
gc = 0
total = 0
for s in seq:
if s in 'GC':
# if s=='G' or s=='C':
gc = gc + 1
total += 1
percent = (gc *100) / total
print(percent)
Example 3: if-else statement
if
statements instead of hooking them up in one if-else
statement.
if percent < 35:
print('low')
elif percent < 55:
print('normal')
else:
print('high')
Functions for iterabls
len(iterable)
sorted(iterable, reverse=False)
seqs = "rna dna protein"
seqslist = seqs.split() # split by space
print(seqslist) # returns ['rna', 'dna', 'protein']
print(len(seqslist)) # returns 3
Writing functions
def double(x):
# the x only defined within the scope of this function
result = x * 2
return result
print(double(100)) # 200
print(double("Hello")) # HelloHello
Re-visiting examples using functions:
def reverse(seq):
rev = ''
for s in seq:
rev = s + rev
return rev
def complement(seq):
comp = ''
comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'}
# if comps is not defined within the function, \\
# the function will look outside for a global varrable --> more vulnerable
for s in seq:
c = comps[s]
comp = comp + c
return comp
def reverse_complement(seq):
return reverse(complement(seq))
test_seq = "AGCTCCA"
test_seq == reverse_complement(reverse_complement(test_seq)) #True
Reversing a string using slice
new_str = old_str[::-1]
def read_fasta(filename):
'''
Extract sequence data from FASTA file in filename
'''
seq = ''
f = open(filename)
for line in f:
line = line.strip() # remove leading and trailing white spaces and newline characters
# check if not empty line and doesn't start with >
if len(line)>0 and not line[0] == '>':
seq = seq + line
f.close()
return seq
Note:
if A and B
: if A is evaluated to False, it will stop and return without checking B, because False "and" with anything is Falsenano {filename}
: create new file if filename doesn't exist, else open it for editingsys
module
# within file named test.py
import sys
sys.argv[1] # use it somewhere in the script...
Run test.py with the command
python test.py {input_file}
Note:
argv[0]
is the name of the script; argv[1]
is the input filenameImporting the pandas module
import pandas as pd
Jupyter notebook, magic commands
%matplotlib inline
will produce plots within the notebook%matplotlib inline
Initial setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Reading a table with pandas
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
Outline of the read_table command used:
pd.read_table(x,y)
Commands to explore a dataframe, where gapminder is the variable name for the dataframe read through pd.read_table()
gapminder.head() # returns the first 5 rows
gapminder.tail() # returns the last 5 rows
gapminder.columns.values # returns the column names
gapminder.shape # returns the number of rows and columns
gapminder.info() # returns information on variable types
gapminder.describe() # returns descriptive statistics
How to subset a dataframe by columns
Use square brackets containing a list of the column names that you want
gapminder[['pop', 'life Exp', 'gdpPercap']]
You can also select a specific column via the following format:
gapminder.pop # but only works for column names w/o spaces
gapminder['pop'] # the same, but works for any column name
This returns a subset of the dataframe. You can also use some of the commands given above on subsets
# Get descriptive statistics for specific columns
gapminder[['pop', 'life Exp', 'gdpPercap']].describe()
# Get the mean for one column
gapminder['life Exp'].mean()
Getting unique values in pandas via pd.unique()
pd.unique(gapminder['region'])
For each unique value, getting how many times it occurs with .value_counts()
Example 1: Unique values for region
# How many unique regions are in the data?
print(len(gapminder['region'].unique()))
# How many times does each unique region occur?
gapminder['region'].value_counts()
Example 2: Unique values for years
# How many unique regions are in the data?
print(len(gapminder['year'].unique()))
# How many times does each unique region occur?
gapminder['year'].value_counts()
If you just subset a dataframe and work on it, or set another variable to equal a dataframe then modifications of those will modify the original dataframe.
The solution is to use the .copy()
command
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
gapminder_copy = gapminder.copy()
gapminder_copy.head()
Now, if you work with gapminder_copy, it won't affect gapminder
If your data table has missing data points, you can remove them with the .dropna()
command
gapminder_copy = gapminder_copy.dropna()
gapminder_copy.head()
You can change the data type of a column via the .astype()
command
Example: changing year and population from floating points to integers
gapminder_copy['year'] = gapminder_copy['year'].astype(int)
gapminder_copy['pop'] = gapminder_copy['pop'].astype(int)
gapminder_copy.info()
Removing duplicates with .drop_duplicates()
gapminder_copy = gapminder_copy.drop_duplicates()
gapminder_copy.head()
Reindexing a dataframe with .reset_index()
# Resets the index, but saves the old index as a new column
gapminder_copy = gapminder_copy.reset_index()
# Resets the index and does not save the old index
gapminder_copy = gapminder_copy.reset_index(drop=True)
Regular expressions and the .replace()
command
Regular expressions allow you to specifically define a search pattern
Example: Replace 'congo' with 'africa_dem rep congo'
gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].value_counts()
Basic regular expressions:
Tidy datasets have two main properties
The gapminder dataset used one column for both the continent and country, which can be fixed with the .split()
command
gapminder_copy['country']=gapminder_copy['region'].str.split('_', 1).str[1]
gapminder_copy['continent']=gapminder_copy['region'].str.split('_', 1).str[0]
gapminder_copy.head()
In the example, the .split() command takes two arguments.
Removing a column from a dataframe with the .drop()
command
gapminder_copy = gapminder_copy.drop('region', 1) #1 stands for column
gapminder_copy.head()
Concatenating two dataframes with pd.concat()
gapminder_comb = pd.concat([gapminder_copy, PRB])
Slicing dataframes
Slicing dataframes to select rows works in much the same way as selecting characters in strings
Example
gapminder_copy[0:15] # first 15 rows
gapminder_copy[-10:] # last 10 rows
Summaries
.mean()
: mean.median()
: median.groupby(by='continent').mean()
: mean by categorical groups.count()
: count the number of entries.sort_value()
SubsetDataFrame
to Series
for a specific column
gdpcap = gapminder_comb[['continent', 'country']].groupby(by='continent').size()
gdpcap = gapminder_comb[['continent', 'country']].groupby(by='country').mean()
Creating new DataFrame
of summaries
continent_mean_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder_comb[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1)
gapminder_summs = gapminder_summs.rename(columns = {'y':'year'})
gapminder_summs
libraries to load:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
Histogram
# import numpy as np
plt.hist(gapminder_copy['lifeexp'])
plt.xlabel('lifeexp')
plt.ylabel('count')
Boxplot
sns.boxplot(x='year', y='lifeexp', data=gapminder_copy)
plt.xlabel('year')
plt.ylabel('count')
Scatterplot
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')
# let's try plotting the log of x
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xscale('log')
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')
Download gapminder_cleaned.csv
Setting up Directories
mkdir automation
mkdir automation/code
mkdir automation/data
mv gapminder_cleaned.csv automation/data/
ls automation/data
If your file downloaded as .txt or .tsv, rename using the mv
command
mv gapminder_cleaned.txt gapminder_cleaned.csv
#or move it and rename it
mv gapminder_cleaned.txt automation/data/gapminder_cleaned.csv
Layout of python notebook
DataFrames
, file names)create a try-catch block to 'catch' your errors
# set VERBOSE True if you want to head the data, false otherwise
VERBOSE = True
cleaned_data_location = 'data/gapminder_cleaned.csv'
try:
df = pd.read_csv(cleaned_data_location)
if VERBOSE:
print(df.head())
except: FileNotFoundError:
print('Could not find data file, check path? You tried', cleaned_data_location)
Use assertion errors to test assumptions
years = df['year'].unique()
years.sort()
assert years[-1] == 2007
assert years[0] == 1952
DRY: dont repeat yourself!
Calculate mean life expectancy of Asia in 1952
mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]
mask_1952 = df_asia['year'] == 1952
df_1952 = df_asia[mask_1952]
value = np.mean(df_1952['lifeexp'])
if VERBOSE:
print('mean for 1952 is ', value)
# create an empty list to store life expectancies over time
result = []
# append a row to list with a tuple containing your result
result.append(('asia', '1952', value))
# Turn the summary into a dataframe so that we can visualize easily
result = pd.DataFrame(result, columns=['continent', 'year', 'lifeexp'])
Calculate mean life expectancy of Asia for each year
mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]
# Pull out all the years from the dataset
years = df_asia['year'].unique()
# create an empty list to store the summary values
summary = []
# pull out the mean life expectancy at each year
for year in years:
mask_year = df_asia['year'] == year
df_year = df_asia[mask_year]
value = np.mean(df_year['lifeexp'])
summary.append(('asia', year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', 'lifeexp'])
Change code to calculate mean life expectancy for any continent
# Define which continent / category we will use
category = 'lifeexp'
continent = 'africa'
# Create a mask that selects the continent of choice
mask_continent = df['continent'] == continent
df_continent = df[mask_continent]
# Loop through years and calculate the statistic of interest
years = df_continent['year'].unique()
summary = []
for year in years:
mask_year = df_continent['year'] == year
df_year = df_continent[mask_year]
value = np.mean(df_year[category])
summary.append((continent, year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
Refactor code into a function to calculate mean/median/mode/etc.. over time of any user defined category and continent
def [function]
: defines a functiondef calculate_mean_over_time(data, category, continent, func, verbose=False):
calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)
help()
can be used on your function
def calculate_func_over_time(data, category, continent, func, verbose=False):
""" calculate values of a statistic through time
Args:
data: a data frame
category: one of the column headers of the data frame (e.g. 'lifeexp')
continent: possible value of the continent column of that data frame (e.g. 'asia')
func: the funtion to apply to data values (e.g. np.mean)
verbose: show debugging output (optional, defaults to False)
Returns:
a summary table of value per year.
"""
# check the values provided
assert type(df) == pd.core.frame.DataFrame
assert category in data.columns.value
assert 'continent' in data.columns.value
assert continent in data['continent'].unique()
# Create a mask that selects the continent of choice
mask_continent = data['continent'] == continent
data_continent = data[mask_continent]
# Loop through years and calculate the statistic of interest
years = data_continent['year'].unique()
summary = []
for year in years:
if verbose:
print(year)
mask_year = data_continent['year'] == year
data_year = data_continent[mask_year]
value = func(data_year[category])
if verbose:
print('adding ', continent , value)
summary.append((continent, year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
return summary
Run function
VERBOSE = False
calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)
Visualize
#use this function to plot life expectancy| through time
continents = df['continent'].unique()
VERBOSE = False
fig, ax = plt.subplots()
for continent in continents:
output = calculate_statistic_over_time(df,"lifeexp", continent, np.median)
output.plot.line('year', "lifeexp", ax=ax, label=continent)
Putting functions into modules
automation.py
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib
from automation import calculate_func_over_time
Bonus tips:
dir(df)
: output all attributes of df
.shape
, .info
requirements.txt
https://github.com/Duke-GCB/scicomp-python/archive/master.zip
git status
: gives information about repositorygit init
: initialize git repositorygit config
: configure git settings
--globabl
: changes settings on a global level. ex git config --global user.email "e@mail.com"
--list
: list all the configurations ex git config --global --list
git add
: adds a file to a staging area
git add .
to add all filesgit commit
: commits file to repo and will prompt the user to leave a message
-a
or --all
: will commit everything in repogit diff
: prints out the changes of files in repo
git diff --staged
to view all the staged changesgit checkout
: restore prior version of file
-b [branch name]
: create a new branchgit log
: shows the commit logs
--name-only
: shows file names with commit messagesgit merge [branch name]
: makes any branch the master branch, aka a fast-forward. When used after fetching the global repo, the merge will make both the local and remote branch parents.git branch
: list branches of repo
-a
: to see all (local and remote) repoTell git to ignore files
nano .gitignore
to edit file*.fasta *.fa
git add .gitignore
git commit -m "add files to be ignored"
git remote
: manages the set of remote reposgit push
: pushes local repo content to a remote repo
--set-upstream
: set default orgin mastergit fetch [branch]
: fetches content from a remote, by default from origingit pull
= git fetch
and git merge
Merge local and remote repositories
git fetch [repo]
git merge origin/master
git push --set-upstream origin master
or
git pull [repo]
git push --set-upstream origin master
git clone [link to repository]
git checkout -b [branch name]
git add [filename]
git commit
git push --set-upstream origin [branch name]
How do we export to an excel document or other document (i.e. csv, txt) from python?