---
tags: course notes, archive
---
Duke UPGG Scientific Computing Bootcamp
=======================================
Course website: https://duke-gcb.github.io/2019-08-12-Duke/ (includes links to materials)
Instructors: Hilmar Lapp, Dan Leehr
Stuck? Put your pink sticky note on the top of your computer.
## Monday Morning - Shell
Launching the shell: **Git Bash** on Windows, **Terminal** on macOS.
### Commands:
- `pwd` - print working directory, where am I?
- `ls` - list files in current directory
- `ls -F` - format directories by trailing `/`
- `ls -lh` - show human-readable sizes
- `ls -a` - include hidden files/directories
- `cd` - change directories
- `man ls` - manual page for a command, e.g. `ls`
Note: to exit the manual page, press the <kbd>q</kbd> key on your keyboard
_Exercise_: What additional information do you learn with `ls -l`?
_Answer_: Date modified, file size, owner, permissions, directory or not
### Tab completion:
While typing a command or file name, hit the <kbd>tab</kbd> key and the shell will finish the word (or help as much as it can!)
- `cd un<tab>` becomes `cd untrimmed_fastq/`
Tab completion also saves you from simple mistakes that are hard for us to spot but easy for computers. The computer knows its filenames, let it finish them for you.
Multiple matches? Start typing and hit <kbd>tab</kbd>. The shell will sound a ding, and list possible completions. Keep typing and hit <kbd>tab</kbd> again once it can pick a single result.
### Navigation shortcuts
- `cd` _(by itself, no directory name)_ - changes to your home directory (`cd ~` has the same result)
- `cd Desktop/shell_data/untrimmed_fastq` - go directly into `untrimmed_fastq` with one command. Remember <kbd>tab</kbd> completion!
- `cd ..` - go one level up in the directory tree (parent directory)
- `cd ~/Desktop/shell_data` - from anywhere on the filesystem since we begin with `~` (home directory)
- `cd -` - Go back to where I came from (the previous directory you were in)
_Exercise_: Find the hidden directory
### Relative and Absolute paths
**Relative Path**: Any path that doesn't begin with a slash. (e.g. `Desktop/untrimmed_fastq`) Relative paths are relative to where you are.
**Absolute Path**: Any path that begins with a slash (e.g. `/Users/sam/Desktop/untrimmed_fastq`)
- `/` is the root directory
### Files and Directories
`cd ~/Desktop/untrimmed_fastq`
- `ls *.fastq` - list all files ending with `.fastq`
- `ls *977.fastq` - list all files ending with `977.fastq`
- `ls /bin/*sh` - list all the commands that are shells
- `echo $SHELL` - which shell am I running?
- `history` - command history with number
- `! {number, n}` - execute the nth command
- `!!` is replaced with the contents of the previous command
- `~` is an absolute path to the home directory
- file display using `head` and `tail`
- `head` gives the first 10 lines of the file
- `tail` gives the last 10 lines of the file
- use `-n` to specify the number of line to display
- e.g. `head -n 4 {filename}` displays the first 4 lines
Match filenames using wildcard
- wildcard * - match any (0 or more) character
- application: prefix, suffix
- put in double quotes to interpret as-is, not as the wildcard special character
- e.g. `ls "*fastq"` - returns `*fastq`, instead of weldcard expansion
Display file content
- `cat {filename}` - concatenate the file content and display on the terminal
- `less {filename}` - allows you to page through the content using Space; interactive
- G – go to the end of file
- g – go to the start of file
- q – exit the less pager
- `/{pattern}` - search for a pattern which will take you to the next occurrence
- n – for next match in forward
- N – for previous match in backward
- `command > file` redirects a command’s output to a file (overwriting any existing content).
- `command >> file` appends a command’s output to a file.
- pipe `|` - inputs the output from one command to another command
- `wc {filename}` gives four-columnar output: total number of lines, words, characters, filename
- `-l` returns the number of lines in file only
- `grep [option] {pattern} {filename}` search for exact pattern or by regular expression
- `-B` and `-A`: before and after the match, respectively
- e.g. `-B1 -A2` gives one line before and two lines after the match
- `-v` include everything except for the pattern
- `--` not specifying any flag
_macOS users_: The `--` characters that appear are called a "Group Separator". There is no flag in macOS `grep` that prevents these, but you can pipe grep's output to remove them:
`grep NNNNNNNNNN -A 2 -B 1 SRR098026.fastq | grep -v -- --`
Moving files around
- `mkdir {dir_name}` make a directory in the current directory
- `cp` copies files to a target named by the last argument on its command line. If the target is an existing file, cp overwrites it; if it does not exist, cp creates it.
- `cp {source_file(s)} {destination_dir}` - copy file(s) to a directory
- `cp {source_file} {destination_file}`- copy a file to another file
- `-i` interactive: prompt to ask whether want to overwrite an existing file
Move & rename files
- `mv {old_filename} {new_filename}`
- `mv {filename} {dest-dir}`
- `-i` prompt for overwrite
- `-n` make it not overwrite
Deleting a file/directory
- use `rm {file}` to remove/delete a file
- use `rm -r {dir}` to recursively remove all the files in a directory
- Note: There is NO trash bin on Unix system. Anything that is deleted is gone forever.
- need permission from the owner of the directory
Permissions
- `chmod -w {filename}` - take away the write permission from everyone
- good practice: don't want to modify the raw data files
- the `-` and `+` in front of `w` does not indicate flag
FASTQ file format (4 lines per read)
1. A sequence identifier with information about the sequencing run and the cluster.
2. The sequence (the base calls; A, C, T, G and N).
3. A separator, which is simply a plus (+) sign.
4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
### Intro to Programming
The For-Loop
`for x in *.fastq ; do [echo] cp $x $x.backup ; done`
- `echo` reflects the command, remove it to get the following effect:
- for every fastq file in this directory, create a backup
## Monday Afternoon - Programming in Python
Jupyter notebook
- RUN shortcut: <kbd>shift</kbd> + <kbd>enter</kbd>
Python
- assignment operator: `=`, assigns value without returning anything
- comparison operator: `==`, returns a boolean value
- `print()` can take multiple, comma-separated arguments, such as strings, int, etc.
- zero-indexed
String operations
- `str_var[0]` gives the first character in the string
- negative index counts from the end, starting from -1
- slicing: `str_var[start : end : step]`, with the ending index being exclusive
- `in` a boolean operator that allows you to check if a character is in a string
- e.g. `name = "Python"` then `P in name` would return True
- Python strings are immutable: every time you do operations with strings, e.g. slice, concatenate, you are building up a separate new string
The for-loop
```
for item in [iterable]:
# do something with item
```
- Indentation matters in Python! Make sure the body of your loop is properly indented.
Reversing a string
```python=
seq = 'ABC'
rev = ''
# initialize an empty string to be built-on later
for s in seq:
rev = rev + s
# append s to rev and re-assign it to rev
# s is a variable that is visible only WITHIN the loop (the concept of 'scope'); \
# it could be named however you like.
print(rev)
# Returns 'CBA'
# Note: the print() statement is outside of the loop, \
# so it prints AFTER the loop is done
```
Dictionary
- a collection which is unordered, changeable and indexed.
- defined using curly brackets and it contains key-value pairs
`dict = {key: value}`
- it is mutable (unlike string, which is immutable)
- used in for-loop
Example 1: get the reverse complement of a DNA sequence
```python=
seq = 'AAGTC'
comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'}
revcomp = ''
for b in seq:
revcomp = comps[b] + revcomp
print(revcomp)
# returns 'GACTT'
```
Example 2: calculate the GC content of a DNA sequence
```python=
gc = 0
total = 0
for s in seq:
if s in 'GC':
# if s=='G' or s=='C':
gc = gc + 1
total += 1
percent = (gc *100) / total
print(percent)
```
Example 3: if-else statement
- once a condition is evaluated to True, the other cases that follow will be skipped over
- if you want both conditions to be evaluated, use two `if` statements instead of hooking them up in one `if-else` statement.
```python=
if percent < 35:
print('low')
elif percent < 55:
print('normal')
else:
print('high')
```
Functions for iterabls
- `len(iterable)`
- `sorted(iterable, reverse=False)`
```python=
seqs = "rna dna protein"
seqslist = seqs.split() # split by space
print(seqslist) # returns ['rna', 'dna', 'protein']
print(len(seqslist)) # returns 3
```
Writing functions
```python=
def double(x):
# the x only defined within the scope of this function
result = x * 2
return result
print(double(100)) # 200
print(double("Hello")) # HelloHello
```
Re-visiting examples using functions:
```python=
def reverse(seq):
rev = ''
for s in seq:
rev = s + rev
return rev
def complement(seq):
comp = ''
comps = {'A':'T', 'C': 'G', 'G':'C', 'T':'A'}
# if comps is not defined within the function, \\
# the function will look outside for a global varrable --> more vulnerable
for s in seq:
c = comps[s]
comp = comp + c
return comp
def reverse_complement(seq):
return reverse(complement(seq))
test_seq = "AGCTCCA"
test_seq == reverse_complement(reverse_complement(test_seq)) #True
```
Reversing a string using slice
- `new_str = old_str[::-1]`
- Remember, the args for the slice notation are: [start, end, step]
- So the notation is saying: build new_str that includes the entire old_str, and do it by back-stepping, one character at a time
### Working with Files
```python=
def read_fasta(filename):
'''
Extract sequence data from FASTA file in filename
'''
seq = ''
f = open(filename)
for line in f:
line = line.strip() # remove leading and trailing white spaces and newline characters
# check if not empty line and doesn't start with >
if len(line)>0 and not line[0] == '>':
seq = seq + line
f.close()
return seq
```
Note:
- `if A and B`: if A is evaluated to False, it will stop and return without checking B, because False "and" with anything is False
### Nano: the Linux command line text-editor
- `nano {filename}`: create new file if filename doesn't exist, else open it for editing
- take input from command line using the `sys` module
```python=
# within file named test.py
import sys
sys.argv[1] # use it somewhere in the script...
```
Run test.py with the command
```
python test.py {input_file}
```
Note:
- `argv[0]` is the name of the script; `argv[1]` is the input filename
- when path is not specified, the interpreter looks for a file within the directory where the script is. If the input file is in a different directory, make sure to specify the path (absolvute and relative paths both work).
## Tuesday morning - Data & Project organization
- Use standardized and consistent notation for naming files and labeling columns
- Keep your code and data in seperate folders
- Avoid manual data cleaning and adjustments, having it in code keeps your modifications documented
- Make sure to do regular backups
## Tuesday morning - Data Exploration
Importing the pandas module
- Allows you to work with dateframes and large datasets
```
import pandas as pd
```
Jupyter notebook, magic commands
- Start with a '%'
- Specific to a notebook enviornment
- `%matplotlib inline` will produce plots within the notebook
```
%matplotlib inline
```
Initial setup
```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
Reading a table with pandas
```
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
```
Outline of the read_table command used:
- `pd.read_table(x,y)`
- x is the file location as a string
- y is how the data in the file is delimited. i.e. ',' or '\t'
Commands to explore a dataframe, where gapminder is the variable name for the dataframe read through pd.read_table()
```python=
gapminder.head() # returns the first 5 rows
gapminder.tail() # returns the last 5 rows
gapminder.columns.values # returns the column names
gapminder.shape # returns the number of rows and columns
gapminder.info() # returns information on variable types
gapminder.describe() # returns descriptive statistics
```
How to subset a dataframe by columns
Use square brackets containing a list of the column names that you want
```python=
gapminder[['pop', 'life Exp', 'gdpPercap']]
```
You can also select a specific column via the following format:
```python=
gapminder.pop # but only works for column names w/o spaces
gapminder['pop'] # the same, but works for any column name
```
This returns a subset of the dataframe. You can also use some of the commands given above on subsets
```python=
# Get descriptive statistics for specific columns
gapminder[['pop', 'life Exp', 'gdpPercap']].describe()
# Get the mean for one column
gapminder['life Exp'].mean()
```
Getting unique values in pandas via `pd.unique()`
```python=
pd.unique(gapminder['region'])
```
For each unique value, getting how many times it occurs with `.value_counts()`
Example 1: Unique values for region
```python
# How many unique regions are in the data?
print(len(gapminder['region'].unique()))
# How many times does each unique region occur?
gapminder['region'].value_counts()
```
Example 2: Unique values for years
```python
# How many unique regions are in the data?
print(len(gapminder['year'].unique()))
# How many times does each unique region occur?
gapminder['year'].value_counts()
```
### Important note on dataframes and copies
If you just subset a dataframe and work on it, or set another variable to equal a dataframe then modifications of those will modify the original dataframe.
The solution is to use the `.copy()` command
```python=
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
gapminder_copy = gapminder.copy()
gapminder_copy.head()
```
Now, if you work with gapminder_copy, it won't affect gapminder
If your data table has missing data points, you can remove them with the `.dropna()` command
```python=
gapminder_copy = gapminder_copy.dropna()
gapminder_copy.head()
```
You can change the data type of a column via the `.astype()` command
Example: changing year and population from floating points to integers
```python=
gapminder_copy['year'] = gapminder_copy['year'].astype(int)
gapminder_copy['pop'] = gapminder_copy['pop'].astype(int)
gapminder_copy.info()
```
Removing duplicates with `.drop_duplicates()`
```python=
gapminder_copy = gapminder_copy.drop_duplicates()
gapminder_copy.head()
```
Reindexing a dataframe with .reset_index()
```python=
# Resets the index, but saves the old index as a new column
gapminder_copy = gapminder_copy.reset_index()
# Resets the index and does not save the old index
gapminder_copy = gapminder_copy.reset_index(drop=True)
```
Regular expressions and the `.replace()` command
Regular expressions allow you to specifically define a search pattern
Example: Replace 'congo' with 'africa_dem rep congo'
```python=
gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].value_counts()
```
Basic regular expressions:
- A dot ' . ' character is a wildchard for any character
- An asterisk ' * ' indicates any number of times (0 or more occurrences)
- A caret ' ^ ' means at the beginning of the string
### Tidy data
Tidy datasets have two main properties
- every variable has its own column
- every observation has its own row
The gapminder dataset used one column for both the continent and country, which can be fixed with the `.split()` command
```python=
gapminder_copy['country']=gapminder_copy['region'].str.split('_', 1).str[1]
gapminder_copy['continent']=gapminder_copy['region'].str.split('_', 1).str[0]
gapminder_copy.head()
```
In the example, the .split() command takes two arguments.
- The first argument is what deliniates the string. In this case, continent and country were seperated with an underscore
- The second argument indicates that it should seperate at the first underscore
Removing a column from a dataframe with the `.drop()` command
```python=
gapminder_copy = gapminder_copy.drop('region', 1) #1 stands for column
gapminder_copy.head()
```
Concatenating two dataframes with `pd.concat()`
```python=
gapminder_comb = pd.concat([gapminder_copy, PRB])
```
Slicing dataframes
Slicing dataframes to select rows works in much the same way as selecting characters in strings
Example
```python=
gapminder_copy[0:15] # first 15 rows
gapminder_copy[-10:] # last 10 rows
```
## Wednesday morning - data exploration cont'd
### Exploring data
Summaries
* `.mean()` : mean
* `.median()` : median
* `.groupby(by='continent').mean()` : mean by categorical groups
* `.count()`: count the number of entries
* *pro-tip: sorting can be done by using `.sort_value()`*
Subset`DataFrame` to `Series` for a specific column
```python=
gdpcap = gapminder_comb[['continent', 'country']].groupby(by='continent').size()
gdpcap = gapminder_comb[['continent', 'country']].groupby(by='country').mean()
```
Creating new `DataFrame` of summaries
```python=
continent_mean_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder_comb[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1)
gapminder_summs = gapminder_summs.rename(columns = {'y':'year'})
gapminder_summs
```
### Visualization
libraries to load:
* numpy
* matplotlib
* seaborn
```python=
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
```
Histogram
```python=
# import numpy as np
plt.hist(gapminder_copy['lifeexp'])
plt.xlabel('lifeexp')
plt.ylabel('count')
```
Boxplot
```python=
sns.boxplot(x='year', y='lifeexp', data=gapminder_copy)
plt.xlabel('year')
plt.ylabel('count')
```
Scatterplot
```python=
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')
# let's try plotting the log of x
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xscale('log')
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')
```
## Wednesday Morning - Automation with Python
Download [gapminder_cleaned.csv](https://raw.githubusercontent.com/Reproducible-Science-Curriculum/automation-RR-Jupyter/gh-pages/notebooks/gapminder_cleaned.csv)
Setting up Directories
```bash=
mkdir automation
mkdir automation/code
mkdir automation/data
mv gapminder_cleaned.csv automation/data/
ls automation/data
```
If your file downloaded as .txt or .tsv, rename using the `mv` command
```bash=
mv gapminder_cleaned.txt gapminder_cleaned.csv
#or move it and rename it
mv gapminder_cleaned.txt automation/data/gapminder_cleaned.csv
```
Layout of python notebook
1. load libraries
2. establish variables (`DataFrames`, file names)
3. exploratory analysis
4. visualization
### Defensive coding
create a **try-catch** block to 'catch' your errors
```python=
# set VERBOSE True if you want to head the data, false otherwise
VERBOSE = True
cleaned_data_location = 'data/gapminder_cleaned.csv'
try:
df = pd.read_csv(cleaned_data_location)
if VERBOSE:
print(df.head())
except: FileNotFoundError:
print('Could not find data file, check path? You tried', cleaned_data_location)
```
Use **assertion errors** to test assumptions
```python=
years = df['year'].unique()
years.sort()
assert years[-1] == 2007
assert years[0] == 1952
```
**DRY**: dont repeat yourself!
Calculate mean life expectancy of Asia in 1952
```python=
mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]
mask_1952 = df_asia['year'] == 1952
df_1952 = df_asia[mask_1952]
value = np.mean(df_1952['lifeexp'])
if VERBOSE:
print('mean for 1952 is ', value)
# create an empty list to store life expectancies over time
result = []
# append a row to list with a tuple containing your result
result.append(('asia', '1952', value))
# Turn the summary into a dataframe so that we can visualize easily
result = pd.DataFrame(result, columns=['continent', 'year', 'lifeexp'])
```
Calculate mean life expectancy of Asia for each year
```python=
mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]
# Pull out all the years from the dataset
years = df_asia['year'].unique()
# create an empty list to store the summary values
summary = []
# pull out the mean life expectancy at each year
for year in years:
mask_year = df_asia['year'] == year
df_year = df_asia[mask_year]
value = np.mean(df_year['lifeexp'])
summary.append(('asia', year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', 'lifeexp'])
```
Change code to calculate mean life expectancy for any continent
```python=
# Define which continent / category we will use
category = 'lifeexp'
continent = 'africa'
# Create a mask that selects the continent of choice
mask_continent = df['continent'] == continent
df_continent = df[mask_continent]
# Loop through years and calculate the statistic of interest
years = df_continent['year'].unique()
summary = []
for year in years:
mask_year = df_continent['year'] == year
df_year = df_continent[mask_year]
value = np.mean(df_year[category])
summary.append((continent, year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
```
**Refactor** code into a function to calculate mean/median/mode/etc.. over time of any user defined category and continent
* `def [function]`: defines a function
* your function can have arguments which you can set defaults (i.e. the default for verbose is false)
* `def calculate_mean_over_time(data, category, continent, func, verbose=False):`
* including verbose conditional statements are great for debugging code
* when passing a function as an argument, do not include the parenthesis
* `calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)`
* It is good programming to include docstrings in functions so that `help()` can be used on your function
* add assertions to test that the arguments the user inputs are valid (checking types is not reccomendend, instead one can check the behavior or an object)
```python=
def calculate_func_over_time(data, category, continent, func, verbose=False):
""" calculate values of a statistic through time
Args:
data: a data frame
category: one of the column headers of the data frame (e.g. 'lifeexp')
continent: possible value of the continent column of that data frame (e.g. 'asia')
func: the funtion to apply to data values (e.g. np.mean)
verbose: show debugging output (optional, defaults to False)
Returns:
a summary table of value per year.
"""
# check the values provided
assert type(df) == pd.core.frame.DataFrame
assert category in data.columns.value
assert 'continent' in data.columns.value
assert continent in data['continent'].unique()
# Create a mask that selects the continent of choice
mask_continent = data['continent'] == continent
data_continent = data[mask_continent]
# Loop through years and calculate the statistic of interest
years = data_continent['year'].unique()
summary = []
for year in years:
if verbose:
print(year)
mask_year = data_continent['year'] == year
data_year = data_continent[mask_year]
value = func(data_year[category])
if verbose:
print('adding ', continent , value)
summary.append((continent, year, value))
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
return summary
```
Run function
```python=
VERBOSE = False
calculate_mean_over_time(df, "lifeexp", "asia", np.mean, VERBOSE)
```
Visualize
```python=
#use this function to plot life expectancy| through time
continents = df['continent'].unique()
VERBOSE = False
fig, ax = plt.subplots()
for continent in continents:
output = calculate_statistic_over_time(df,"lifeexp", continent, np.median)
output.plot.line('year', "lifeexp", ax=ax, label=continent)
```
Putting functions into **modules**
* create a new file to hold functions ex. `automation.py`
* place function in file
* import the necessary libraries for function
* in the main file, import function at the top of the file
```python=
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib
from automation import calculate_func_over_time
```
***Bonus tips:***
* `dir(df)` : output all attributes of df
* i.e. `.shape`, `.info`
* interactive text editors (i.e. Jupyter Notebook) are reccomended for code that has an output where as functions can be put in a regular text editor
* parsing error messages in python -- the most helpful part is usually at the end
## Wednesday afternoon - Sharing your computational work
* share work using [GitHub](https://github.com/github)
* Code shared on GitHub is static
* Using [binder](https://mybinder.readthedocs.io/en/latest/#), you can share interactive versions of notebooks on GitHub
* need to specify all of the library dependecies for the notebook in `requirements.txt`
## Thursday morning - version control
https://github.com/Duke-GCB/scicomp-python/archive/master.zip
### Git commands:
* `git status`: gives information about repository
* `git init`: initialize git repository
* `git config`: configure git settings
* `--globabl`: changes settings on a global level. ex `git config --global user.email "e@mail.com"`
* `--list`: list all the configurations ex `git config --global --list`
* `git add`: adds a file to a staging area
* `git add .` to add all files
* `git commit`: commits file to repo and will prompt the user to leave a message
* `-a` or `--all`: will commit everything in repo
* `git diff`: prints out the changes of files in repo
* `git diff --staged` to view all the staged changes
* `git checkout`: restore prior version of file
* `-b [branch name]` : create a new branch
* `git log`: shows the commit logs
* `--name-only`: shows file names with commit messages
* `git merge [branch name]`: makes any branch the master branch, aka a fast-forward. When used after fetching the global repo, the merge will make both the local and remote branch parents.
* `git branch`: list branches of repo
* `-a`: to see all (local and remote) repo
Tell git to ignore files
* use `nano .gitignore` to edit file
* then add the files you want to ignore. ex `*.fasta *.fa`
* `git add .gitignore`
* `git commit -m "add files to be ignored"`
### GitHub Remotes
* `git remote `: manages the set of remote repos
* `git push`: pushes local repo content to a remote repo
* `--set-upstream`: set default orgin master
* `git fetch [branch]`: fetches content from a remote, by default from origin
* `git pull` = `git fetch` and `git merge`
Merge local and remote repositories
```bash=
git fetch [repo]
git merge origin/master
git push --set-upstream origin master
```
or
```bash=
git pull [repo]
git push --set-upstream origin master
```
### Collaborting
1. Add collaborator on GitHub
2. Clone repository locally `git clone [link to repository]`
3. Make contribution in isolated branch `git checkout -b [branch name]`
4. Add changes to staging area `git add [filename]`
5. Commit changes `git commit`
6. Propose contribution
* ` git push --set-upstream origin [branch name]`
* Open pull request on GitHub (will be populated by your local commit) and create the pull request
7. Owner accepts or denies contributions
## Open questions
How do we export to an excel document or other document (i.e. csv, txt) from python?
* See .to_csv on [Pandas API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)