---
tags: course notes
---
# UPGG Informatics Orientation Bootcamp 2022
:::warning
The [notes for Monday and Tuesday](https://hackmd.io/mtdAqaA_Qb-9WeeuZvkKTA) have been moved to a separate document.
:::
## Wednesday Morning
## Defensive programming and automation
### Overview of the lesson
- Documentation
- D.R.Y.: Don't repeat yourself
- Defensive programming tactics in Python
- File input/output
- Combining bash and python
- Constructing a computational pipeline
### Style guides
Naming conventions for variables.
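For example (not from the exercise itself, just a common Python convention), descriptive lowercase snake_case names are easier to read than single letters:
```python
# Hard to follow
d = 10
t = 2
v = d / t

# Clearer: descriptive snake_case names
total_distance = 10
time_elapsed = 2
velocity = total_distance / time_elapsed
```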
Submitted answer for exercise 1:
```python
def velocity(total_distance, time):
    "This calculates the distance over time"
    velocity_result = total_distance / time
    return(velocity_result)

velocity(10, 2)
```
Dangers in variable naming.
One danger of variable naming is overwriting built-in python functions. Example given:
```python=
sum = 10 + 5
```
This assigns 15 to `sum`, shadowing the built-in `sum()` function.
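A quick sketch of the consequence (illustrative, not from the lesson):
```python
sum = 10 + 5           # the name 'sum' now shadows the built-in function
# sum([1, 2, 3])       # this would now raise TypeError: 'int' object is not callable
del sum                # deleting the variable restores access to the built-in
print(sum([1, 2, 3]))  # 6
```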
Comments are specified with a `#`. Anything written after a `#` is not executed when the code runs.
Docstrings are specified using triple quotes (`"""`), which can be used to specify a multi-line string.
```python
def function():
    """
    This is a multi-line string.
    The documentation can be written within the boundaries of the triple quotes.
    """
```
### Exercise 2.1: Importance of naming and documentation
Post answers here!
```python
"""
calculate percentage of each value of dictionary (a list of ints/floats) in words (an int/float) for every value of dictionary.
for each percentage, if percentage is greater than 0, then save, and finally return largest percentage.
"""
```
```python
# Well, we don't have a lot of information and even the information we have is suspect
# The inputs are the variables "number", which is probably a list with integers or floats within, and "words", which appears to likely be an integer or float
# The function searches for the largest (the word smallest is a misnomer/just incorrect!) item in number and finds the percentage of "dictionary" in "words"
# It then returns the largest of such percentages

# number: list of numbers
# words: single number
# returns percentage of largest number
# function([1,2,3],4) would return (3/4)*100, (max([1,2,3])/4)*100
```
### Exercise 2.2: Refactoring a function.
Refactor (re-write the code without changing the task it performs) the previous function to be clearer using appropriate variable names and a docstring.
Post answers here!
```python
def find_largest_percent(values_list, denominator):
    largest_percentage = 0
    for value in values_list:
        percent = (value / denominator) * 100
        if percent > largest_percentage:
            largest_percentage = percent
    return largest_percentage
```
```python
def largest_percent(num_list, denom):
    """Takes a list and returns a float value for the largest percentage
    Args:
        num_list (list): list of int or float values
        denom (int/float): single int or float value
    Returns:
        largest_value: int or float largest value as percentage
    """
    largest_value = 0
    for value in num_list:
        loop_val = (value / denom) * 100
        if loop_val > largest_value:
            largest_value = loop_val
    return largest_value
```
### Defensive programming: Don't repeat yourself (D.R.Y.)
Exercise: Refactor an analysis into a function
Write a function that performs the repeated task shown in the jupyter notebook.
Post answers here!
```python
def protein_search(protein_data_list, search_list):
    '''Take a protein list and returns matches in search list'''
    match_list = []
    for protein in protein_data_list:
        if protein in search_list:
            match_list.append(protein)
    return match_list


def match_prot_int(protein_data, proteins_of_interest):
    """given a list of proteins, identify those within the list that match a master list of interest"""
    match_list = []
    for protein in protein_data:
        if protein in proteins_of_interest:
            match_list.append(protein)
    return match_list


def compare_lists(primary_list, secondary_list):
    '''Take two lists of values and returns matches in between lists'''
    match_list = []
    for secondary_value in secondary_list:
        if secondary_value in primary_list:
            match_list.append(secondary_value)
    return match_list


protein_data_lists = [["CREG1", "ELK1", "SF1", "GATA1", "GATA3", "CREB1"],
                      ["ATF1", "GATA1", "STAT3", "P53", "CREG1"],
                      ["RELA", "MYC", "SF1", "CREG1", "GATA3", "ELK1"]]
proteins_of_interest = ["ELK1", "MITF", "KAL1", "CREG1"]

protein_match_list = []
for protein_test_list in protein_data_lists:
    protein_match_list.append(compare_lists(proteins_of_interest, protein_test_list))
```
### A helpful way to format strings: f-strings
You cannot directly concatenate or embed non-string data types in strings.
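For example (a minimal illustration, not from the lesson notebook):
```python
my_int = 5
# "The integer is: " + my_int        # raises TypeError: cannot concatenate str and int
print("The integer is: " + str(my_int))  # works, but explicit str() calls get tedious
```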
Using an f-string, specified by placing an `f` character before the quotation marks, you can embed variables directly by placing them in curly braces:
```python=
my_int = 5
example_fstring = f"The integer in my_int is: {my_int}"
```
### Errors in python
Python has many different built-in error types, including:
- NameError
- ZeroDivisionError
- TypeError
- IndexError
Errors raised in Python are known as exceptions. You can use the `raise` keyword to trigger an error on purpose.
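For example, a minimal sketch (not from the lesson notebook) of raising an error on purpose:
```python
sequence = "ACGU"
if "U" in sequence:
    raise ValueError("Found 'U': this looks like RNA, not DNA")
```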
Run the example `reverse_complement` function, then run the following code block with lowercase letters to see the resulting error.
### Try and except keywords
The `try` keyword runs the code indented after it. If an error occurs that matches the error type named after a following `except` keyword, the code under that `except` block runs instead. If no error type is specified after `except`, any error will cause that block to run.
If an error is raised inside `try` and the code under `except` runs, the rest of the program will still continue. Using the `raise` keyword inside the `except` block re-raises the error so the traceback message is still shown.
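A minimal sketch of the behaviour described above (not from the lesson notebook):
```python
complements = {"A": "T", "T": "A", "C": "G", "G": "C"}
try:
    print(complements["a"])   # lowercase key is missing from the dictionary
except KeyError:
    print("Base not found in the complements dictionary")
print("This line still runs because the error was handled")
```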
Why use try, except, and raise keywords?
- To "fail fast"
- Avoid returning incorrect or unexpected results
### Printing error messages
Two output sources to consider are `stdout` and `stderr`. It's important to send error messages to `stderr` instead of `stdout`, because otherwise error messages and output data can get mixed together in workflows.
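A minimal sketch of sending a message to `stderr` (not from the lesson notebook):
```python
import sys

print("This is regular output")                          # goes to stdout
print("Warning: something looks off", file=sys.stderr)   # goes to stderr
```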
### Exercise: adding try, except, and raise to a function
Take the reverse complement function and add error handling functionality to it. Consider the previous example where a lower case letter (or letter not in the keys of the dictionary) returned a potentially cryptic error message.
Post your answers here!
```python=
import sys  # needed for sys.stderr below

def reverse_complement(dna_sequence):
    """Reverses the complement of a dna sequence"""
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    # Add try - except - raise statements
    try:
        for letter in reverse:
            result = result + complements[letter]
        return(result)
    except KeyError:
        print("DNA input sequence has lowercase bases.", file=sys.stderr)
        raise

help(reverse_complement)
print(reverse_complement("CAAg"))


def reverse_complement(dna_sequence):
    """Reverses the complement of a dna sequence"""
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    # Add try - except - raise statements
    try:
        for letter in reverse:
            result = result + complements[letter]
        return(result)
    except KeyError as letter:
        raise KeyError(f"Please check the formatting of your input sequence (i.e. capitalization)! {letter} is incorrectly formatted.")

help(reverse_complement)
print(reverse_complement("CAAg"))
```
### Exercise: Sanitize input
Refactor the kmers_from_sequence function to check that the input of k is:
- A positive number
- Not longer than the length of `dna_sequence`
If there is a problem, `raise` a `ValueError` with an appropriate message.
Post your answers here!
```python=
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Write code to check input here!
    if k <= 0:
        raise ValueError(f"{k} is not a positive number.")
    elif k >= len(dna_sequence):
        raise ValueError(f"{k} value is longer than input sequence length.")
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)


def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    if k >= len(dna_sequence) or k <= 0:
        raise ValueError(f"Invalid kmer: the kmer of length {k} is either not a positive number, or greater than/equal to the input sequence length.")
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)
```
### Syntactical shortcut - Separate code with line breaks
If you have lines that are long and hard to read, putting in line breaks can help. In Python, you can have line breaks inside parentheses. Let's demonstrate this on a piece of code we wrote yesterday:
```python=
# import data from yesterday
import pandas as pd
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
gapminder['region'] = gapminder['region'].astype(str)
gapminder_copy = gapminder.copy()  # work on a copy, as in yesterday's lesson
# Method 1 for formatting the 'region' column:
gapminder_copy['region'] = gapminder_copy['region'].str.lstrip() # Strip white space on left
gapminder_copy['region'] = gapminder_copy['region'].str.rstrip() # Strip white space on right
gapminder_copy['region'] = gapminder_copy['region'].str.lower() # Convert to lowercase
# Method 2 for formatting the 'region' column:
gapminder_copy['region'] = gapminder['region'].str.lstrip().str.rstrip().str.lower() # Strip white space on left and right, and convert to lowercase
print(gapminder_copy['region'])
```
There are three different transformations happening above: removing whitespace on the left, removing whitespace on the right, and converting the text to lowercase. We can make this one line more intuitive by breaking it up into three:
```python=
# New method of chaining functions
gapminder_copy['region'] = (
    gapminder['region']
    .str.lstrip()   # Strip white space on left
    .str.rstrip()   # Strip white space on right
    .str.lower()    # Convert to lowercase
)
print(gapminder_copy['region'])
```
We get the same output as above! This code is functionally the same as methods 1 and 2. We benefit from explicitly delineating each step like in method 1, and we also get the nicer syntax of applying all cleaning steps at the same time with method 2.
### Outlining your code
Thinking about what you want your future code to do for you before coding anything reduces the time you spend physically coding. It forces you to think about the big pieces that go into solving your problem and how they'll fit together, revealing potential problems much earlier. Let's take an example from day 1 to illustrate:
```python=
percent = 20
if percent < 38:
    print('Low')
elif percent < 47:
    print('Normal')
else:
    print('High')
```
To make this code more relevant to biology, let's introduce a real biological variable: hematocrit. Hematocrit is the volume percentage of red blood cells in blood. The normal values for humans are:
- Males: 41% - 50%
- Females: 36% - 44%
- Average: 39% - 47% - these numbers used in the code above
Now let's think about our code:
1. We first check if the percent is less than 38: if so, then label as "Low"
2. We then check if the percent is less than 47: if so, then label as "Normal"
3. Otherwise, label as "High"
This code seems intuitive from a first glance, but there's a conceptual oversight: the second case here isn't actually correct. Values less than 47% are normal only if the value isn't already less than 38%. The logic presented in this code works by virtue of checking the 38% case before the 47% case. Suppose we unintentionally coded in case 2 before case 1. Now we first check for values less than 47, and any values that fulfill that condition are "Normal", even if they're also less than 38. This is an easy accident that can lead to erroneous results.
Let's think a bit more about the biological meaning of HCT percentage: it's more accurate to explicitly state the values that fall under each category:
- Values between 0% - 38% are "Low"
- Values between 39% - 47% are "Normal"
- Values between 48% - 100% are "High"
To follow this biological meaning, we would rewrite our code as follows:
1. Check if the percent is between 0 and 38: if so, then label as "Low"
2. Then, check if the percent is between 39 and 47: if so, then label as "Normal"
3. Then, check if the percent is between 48 and 100: if so, then label as "High"
4. **New condition!** If it's any other integer, raise an error
This is better! We've done a few defensive programming concepts here:
- Written up **pseudo-code**: this is not actual code, but an outline of how you want your code to be structured
- Sanitized our input and guarded against a potential error
- More explicitly stated the biological meaning of our code
- Defended against the concept of a "wrong order" for our if/else statements - now it doesn't matter how the three conditions are ordered
So now this code would look like:
```python=
percent = 20
if 0 <= percent <= 38:
    print('Low')
elif 39 <= percent <= 47:
    print('Normal')
elif 48 <= percent <= 100:
    print('High')
else:
    raise ValueError("Percent value must be between 0 and 100")
```
### Working with Files
When working with real-world data, it will typically be in a file, and **not** in your code. Fortunately, Python has functions to read files. These work with simple text files, and if you need to handle images or other binary formats, there are libraries that can help with that.
Sequence data is typically stored in text files, like fasta. So let's walk through reading one of those.
### Open and Close
When you use a word processor or spreadsheet, you open files, work with them, and then close them when you're done. In Python, you do the same thing.
```python=
f = open('ae.fa')
for line in f:
    print(line)
f.close()
```
Let's go through the steps we just did.
1. We used the `open()` function on a string that represents a path to a file.
- The result of that function was saved to the variable `f`. This value is called a file object.
2. We wrote a for loop. When you write a for loop for a file object, each loop variable represents a line in the file.
3. We printed the loop variable for each loop.
4. We used the method `.close()` to close the file.
### Reading lines
When we work with text files, we can read them line by line in a loop. Each line of text from the file is assigned to our loop variable, and we print it out from the loop.
You'll probably notice that we have a blank line in between each line from the file. Text files and programs indicate that there is a new line using a special newline character, `\n`. In our previous example, each line in the file includes the `\n` newline character at the end.
Let's look into this by adding each line to a single string, all_lines, and compare printing the string vs the raw data.
```
# Example of text file with new line characters
This is line number 1\n
This is line number 2\n
\n
This is line number 4\n
```
```python=
f = open('ae.fa')
all_lines = ''
for line in f:
    all_lines = all_lines + line
f.close()
print(all_lines) # First output is the print result
all_lines # Second is what the string data looks like
```
If we know we have the whole line, we can strip off the newline character with the `.strip()` method.
```python=
string_with_newline = "There is a newline at the end\n"
string_with_newline
```
```python=
string_with_newline.strip()
```
Let's try this with our original code to print each line in a file.
```python=
f = open('ae.fa')
for line in f:
    line = line.strip()
    print(line, end='')
f.close()
```
Now the trailing newlines are removed, so everything prints on a single line.
### Reading a fasta file
Write a function called `read_fasta(filename)` that takes in the input filename and returns a single string for all of the DNA sequences in the file.
**Paste your code here:**
```python=
read_fasta = open('ls_orchid.fasta')
for line in read_fasta:
    line = line.strip()
    print(line, end='\n')
read_fasta.close()
```
```python=
def read_fasta(input_file):
    """Takes the input file and converts all DNA information into a single string"""
    f = open(input_file)
    all_lines = ''
    for line in f:
        if line.startswith(">"):
            continue
        all_lines = all_lines + line.strip()
    f.close()
    return all_lines
```
```python=
def read_fasta(file_name):
    '''Takes in a fasta file and returns single string for all DNA sequences.'''
    DNA_strand = ''
    if ".fa" in file_name:
        loop_file = open(file_name)
        for line in loop_file:
            if ">gi" in line:
                print('New read...')
            else:
                line = line.strip()
                DNA_strand += line
        loop_file.close()
        return DNA_strand
    else:
        raise ValueError("File not fasta file type.")
```
### Scripts
We've done a good job of organizing our code into functions here, but we've only been running them from this notebook. So next, we're going to take our code and put it in a script - starting with the `read_fasta` function.
Let's start with a script that reads the ae.fa file specifically and prints it.
Notice that the first line contains a `%%` operator followed by the command `writefile` and a file name. This operator, specific to Jupyter notebooks and called a "cell magic command", copies the code written in the cell into that file.
```python=
%%writefile read_fasta_v1.py
def read_fasta(filename):
    """Reads a fasta file and returns all sequences concatenated"""
    sequence = ''
    f = open(filename)
    for line in f:
        line = line.strip()
        if '>' not in line:
            # Append to the last sequence
            sequence = sequence + line
    f.close()
    return sequence

print(read_fasta('ae.fa'))
```
Our script reads our `ae.fa` file every time we run it, but we know most programs don't work that way. The programs we used in bash expected a data file as an argument, and that's a good convention for programs we write too.
In Python, our program can get these arguments, but we have to load a module called `sys` from the standard library, a collection of modules that ship with Python but are not loaded until you import them. The documentation for these is part of the documentation for python: [https://docs.python.org/3/library/sys.html](https://docs.python.org/3/library/sys.html)
Libraries are incredibly useful - there are libraries for working with numeric and scientific data, generating plots, fetching data from the web, working with image and document files, databases, etc. And of course, there's a library for getting things like your script's command-line arguments.
So, let's change our `read_fasta.py` program slightly.
```python=
%%writefile read_fasta_v2.py
import sys
def read_fasta(filename):
    """Reads a fasta file and returns all sequences concatenated"""
    sequence = ''
    f = open(filename)
    for line in f:
        line = line.strip()
        if '>' not in line:
            # Append to the last sequence
            sequence = sequence + line
    f.close()
    return sequence

print(read_fasta(sys.argv[1]))
```
But what happens if we don't have an input file name? According to the documentation, `sys.argv` returns a list where the first item, `sys.argv[0]`, is the name of the script, and each additional item in the list is a command-line argument. If no argument was passed, `sys.argv` should be a list containing just the script name.
```python=
%%writefile read_fasta_v3.py
import sys
def read_fasta(filename):
    """Reads a fasta file and returns all sequences concatenated"""
    sequence = ''
    f = open(filename)
    for line in f:
        line = line.strip()
        if '>' not in line:
            # Append to the last sequence
            sequence = sequence + line
    f.close()
    return sequence

if len(sys.argv) < 2:
    print('Usage:', sys.argv[0], '<sequence.fa>')
    sys.exit(1)

print(read_fasta(sys.argv[1]))
```
### Making scripts you can import
So far, we have used modules to help us work on our analyses such as:
Standard Library:
- sys
Third Party:
- pandas
- numpy
- matplotlib
These are imported using the import keyword and we can use functions from them. We also write functions for use in our own code. Having these available to import into other scripts gives the benefit of:
1. Letting us reuse code over multiple analyses (DRY)
2. Letting others use our code in their own scripts without copy/pasting (DRY)
While it may seem like extra work to write both a module and a script for an analysis, a single Python file can act as both: you can import it as a module and also run it from the command line to perform a task.
In your Jupyter notebook server (in your browser):
- New -> Python3 Notebook
- name: **demo_for_imports**
In **demo_for_imports.ipynb**:
```python=
import read_fasta_v3
```
But why does it print some output when we import it into our Jupyter notebook?
This is because we have a `print()` statement at the bottom of `read_fasta_v3.py`! This is not good practice. How do we separate the print statement from the useful functions in our `.py` file?
Answer: Add the statement `if __name__ == "__main__":` before the print statement (after the function definition, still in our `automation_python_2022.ipynb` document):
```python=
%%writefile read_fasta_v3.py
import sys
def read_fasta(filename):
    """Reads a fasta file and returns all sequences concatenated"""
    sequence = ''
    f = open(filename)
    for line in f:
        line = line.strip()
        if '>' not in line:
            # Append to the last sequence
            sequence = sequence + line
    f.close()
    return sequence

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print('Usage:', sys.argv[0], '<sequence.fa>')
        sys.exit(1)
    print(read_fasta(sys.argv[1]))
```
Now the script will import without running immediately.
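For example, back in `demo_for_imports.ipynb` you should now be able to do something like this (a sketch; it assumes `read_fasta_v3.py` and `ae.fa` are in the notebook's directory):
```python
import read_fasta_v3                          # no output is printed on import anymore

sequence = read_fasta_v3.read_fasta('ae.fa')  # call the function explicitly instead
print(len(sequence))
```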
### Exercise: Extending your Python scripts to do more
In your `demo_for_imports.ipynb` notebook, you've now imported `read_fasta_v3`, which returns a single DNA string from a fasta file. Building off of this module, how would you get a count of unique dinucleotides ("AA", "AT", "AC", "AG", "TA", etc.) for each fasta file? Writing pseudocode may help if you're not sure how to approach it.
**Put your answer here:**
- pseudocode:
- read fasta
- have a container to store unique dinucleotides
- go through nucleotide positions in the length of the DNA string
- check that we have not reached the end of the DNA string
- don't include N
- append unique dinucleotides to the container
```python=
import read_fasta_v3 as rf
def count_dinucleotides(in_fasta_str):
    """Inputs a string of DNA and outputs the number of unique dinucleotides"""
    unique_dinucleotides = []
    for nucleotide_position in range(len(in_fasta_str)):
        try:
            dinucleotide_to_check = in_fasta_str[nucleotide_position] + in_fasta_str[nucleotide_position + 1]
        except IndexError:
            print("Reached end of string")
            break  # no dinucleotide starts at the final position
        if dinucleotide_to_check not in unique_dinucleotides and "N" not in dinucleotide_to_check:
            unique_dinucleotides.append(dinucleotide_to_check)
    return len(unique_dinucleotides)

count_dinucleotides(rf.read_fasta('ae.fa'))
```
- pseudocode:
- create dictionary of all possible dinucleotide combinations, initialize count as 0
- go through DNA string
- by matching dinucleotide combinations in the dictionary to the dinucleotides in the DNA string, add counts to the dictionary
```python=
def dinucleotide_count(DNA_strand):
    '''Takes in a DNA sequence and returns dictionary with number dinucleotides.'''
    bases = ["A", "T", "C", "G"]
    dinuc_dict = {}
    for b1 in bases:
        for b2 in bases:
            dinuc = b1 + b2
            dinuc_dict[dinuc] = 0
    for dn in dinuc_dict:
        dinuc_dict[dn] = DNA_strand.count(dn)
    return dinuc_dict

dinucleotide_count(rf.read_fasta('ae.fa'))
```
### The next level of automation: combining Python and bash
Suppose you're given a few hundred fasta files you need to concatenate. You could type them all into a list in your Jupyter notebook and run it, or you could have bash automate the Python script for you!
First, a technical check: in your Git Bash or Terminal window, run `python --version`. Hopefully, you get some info about the version of Python you're running.
Recall our bash lesson on the first day:
```shell
for filename in *.fa*
do
echo $filename
done
```
Using the `python` command in the terminal, we can also run Python files without needing to open up Jupyter notebook! You use this python command just like any other bash command:
```bash
for filename in *.fa*
do
python read_fasta_v3.py $filename
done
```
You should get a lot of letters output to your window. Better yet, let's redirect it to another file to save for later:
```bash
for filename in *.fa*
do
python read_fasta_v3.py $filename >> output_fastas.txt
done
```
You may have seen the `>` bash symbol before, which redirects the output of a command into another file. Here, we use `>>`: it's similar in that it also redirects output to a file, but it appends the result to whatever is already in the file instead of overwriting the whole file with the new results.
Now we can open up and check `output_fastas.txt`. There's one line for every file processed.
Okay, time for one more level of automation: it's great that we can get bash to loop over our Python file and our input files, but what if we don't want to type out the `for` loop in bash every single time we run an analysis? We can store that command in a `bash` script, similar to how we stored our Python code in a Python script.
In either `nano` or `vim`, copy and paste the above code into a new file called `script.sh`. Remember that we can create a new file by directly typing `nano script.sh` or `vim script.sh` on the command line.
Once you've created your `script.sh` file, run it on the command line with:
```bash
./script.sh
```
It should work as intended (i.e. not output anything to the terminal), and you can open up your `output_fastas.txt` file to see the results.
### Exercise - optimizing your script
How would you modify this `script.sh` so that it empties the contents of `output_fastas.txt` before running your program? Hint: there are multiple bash commands you could potentially use.
#### Put your new script here!
```bash
```
### We can go even further: walking away from your computer
Congrats, you have a bash script that automates a Python file over hundreds of fasta files! What if you were dealing with gigabytes (or even terabytes) of data? You'd probably be waiting forever for your script to finish, but you probably don't want to sit around that long. What can you do?
```bash
nohup ./script.sh &
```
- `nohup` - Stands for `no hangup` - even if you close your bash terminal, the program will continue to run in the background of your computer. Just make sure you don't shut down your computer before it finishes! On a compute cluster, this isn't really a problem since compute clusters generally stay online 24/7.
- `&` - Run this program in the background of your terminal. This frees up your terminal so you can work on other things and run more commands. Note that this does not keep it running if the terminal is closed; you'll still need `nohup` for that.
- You can look into `nohup` documentation for useful options e.g. letting it email you when the job is done etc.
---
## Wednesday Afternoon
## Sharing Your Computational Work
It doesn't matter whether you are working with a collaborator or not; you have at least one collaborator by default: your future self. Your future self cannot communicate with you, unlike your present, physical collaborators.
### Sharing using GitHub
#### First step, make a GitHub account.
If you are concerned about privacy, there is a way to hide your email once you are on GitHub, but you still need to use an email address to sign up.
It is also suggested to use 2-factor authentication with GitHub, as well as a password manager. Duke provides a premium subscription to 1Password at no cost!
#### Create a new repository
- Make sure it's public.
- Check the box for "Add a README file"
- There is a way to access GitHub repository through command line, we'll go over that on Thursday.
- You can do a lot without touching the command line
#### Upload a file
- You can create a file from scratch, or you can just upload a preexisting file from your local computer.
- Add a commit message describing the nature of the change you are making to the repository. For example, "added a jupyter notebook."
- Click "Commit changes"
- GitHub recognizes many file formats. A Jupyter notebook file is stored as JSON, and GitHub renders it for better viewing. You can look at the raw version of the file by pressing the `Raw` button close to the top-right border of the file.
- This is useful because it will render the Markdown part of the notebook for you as well.
- You cannot make changes to the rendered view of the file, as there is no engine running under the hood. You can make changes in the raw file and commit those changes.
- Jupyter saves a snapshot of the notebook, i.e. whatever was run in the notebook as well as any warnings, errors, and results. Make sure to save the notebook before committing it.
### Introduction to Binder
#### Running code is more complicated than displaying code
To run code, you need (at least)
- Hardware
    - Hardware for running the programming language
- Software
    - Something to run your code (e.g., python)
    - Packages needed to run the code (e.g., pandas)
- The code
#### Binder provides these things
- Containers hold the software environment that allows a program to run
- Containers are lighter-weight than virtual machines that ship a full operating system
Go to the Jupyter notebook file
- Edit -> clear all output
- Now, you can rerun the cells and reproduce outputs again.
- This notebook is interactive because it doesn't require any packages other than Jupyter Notebook itself
- This is different from the GitHub repository, where you can't rerun the cells or change the code.
Try adding more files to the GitHub repository
- It's a one-way dynamic interaction: the Jupyter notebook environment is separate from the repository.
- Rebuilding Binder will take less time because it caches the build from the first time.
- If you import additional dependencies inside the Jupyter notebook, it will run into an error. You also need to include the dependencies and tell Binder's infrastructure where to look for them.
- For `python`, dependencies are listed inside `requirements.txt`
- Cannot make typos here
- List packages you will import in your Jupyter Notebook here
- There is a syntax to specify a version for each package if you need a specific version of them
- Often dependencies contain their own dependencies
- Rerun Binder again, and this time you should be able to run the cell that imports the packages you need
- Double check the path to the data file: it should be in the same directory as the notebook file (i.e., it doesn't have `../` etc.)
- Keep in mind that Binder is not the only infrastructure out there for sharing your work. Explore different options!
## Thursday Morning
### Version Control with Git
*Instructor*: Yanting "Raven" Luo
Version control is a great way to keep track of your work. You can "hit rewind" to go back and see previous edits.
Git also allows for unlimited undo and is great for collaborative repositories.
#### Setting up Git
When we use Git on a new computer for the first time, we need to configure a few things. Below are a few examples of configurations we will set as we get started with Git:
* our name and email address,
* what our preferred text editor is,
* and that we want to use these settings globally (i.e. for every project).
```bash
git config --global user.name "Vlad Dracula"
git config --global user.email "vlad@tran.sylvan.ia"
git config --list # shows currently configured git options
```
Windows and Macs handle line endings a bit differently, but we can standardize that with specific commands
Windows:
```bash
git config --global core.autocrlf true
```
MacOS/Linux:
```bash
git config --global core.autocrlf input
```
Set default text editor to nano:
```bash
git config --global core.editor "nano -w"
```
Set branch name to "main":
```bash
git config --global init.defaultBranch main
```
Access the help information for git:
```bash
git help
```
#### Create a repository
First make a directory that we want to back up with Git:
```bash
cd ~/Desktop
mkdir planets
cd planets
```
Then we tell Git to make `planets` a repository – a place where Git can store versions of our files:
```bash
git init
```
Next, we will change the default branch to be called main. This might be the default branch depending on your settings and version of git. See the setup episode for more information on this change.
```bash
git checkout -b main
```
Double check that everything is correct with `status`
```bash
git status
```
If you created an undesired repository in a directory, you can use the following command to remove the `.git` directory. But be careful! Running this command in the wrong directory will remove the entire Git history of a project you might want to keep. Therefore, always check your current directory using the command `pwd`.
```bash
rm -rf moons/.git
```
#### Tracking changes
Let's make a text file inside the `planets` directory
```bash
nano mars.txt
```
Add some text to the file in nano such as:
> Cold and dry, but everything is my favorite color
Now we can commit those changes with git:
```bash
git status # check status of project
git add mars.txt # add text file to staging area
git status # check that the above command worked
git commit -m "Start notes on Mars as a base" # commit changes and provide a short message
```
If we check our status and log, we can see some updates
```bash
git status # should show no files to commit
git log # shows history of commits
```
We can make changes to our `mars.txt` file and use git to see what's different
```
git diff
```
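The terminal output itself isn't captured in these notes; reconstructed from the version labels discussed below, it would look roughly like this:
```
diff --git a/mars.txt b/mars.txt
index df0654a..315bf3a 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,2 @@
 Cold and dry, but everything is my favorite color
+The two moons may be a problem for Wolfman
```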
The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:
1. The first line tells us that Git is producing output similar to the Unix `diff` command comparing the old and new versions of the file.
2. The second line tells exactly which versions of the file Git is comparing; `df0654a` and `315bf3a` are unique computer-generated labels for those versions.
3. The third and fourth lines once again show the name of the file being changed.
4. The remaining lines are the most interesting: they show us the actual differences and the lines on which they occur. In particular, the `+` marker in the first column shows where we added a line.
After reviewing our change, it’s time to commit it:
```bash
git add mars.txt # make sure to run this command before committing
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
```
We can repeat this process after adding another line to the file:
```bash
nano mars.txt
cat mars.txt
git diff
git add mars.txt
git diff --staged
git commit -m "Discuss concerns about Mars' climate for Mummy"
git status # check status
git log # check commit log
```
You can limit your git log if it's getting long:
```bash
git log -1 # only see the last commit
git log --oneline # reduce quantity of info displayed
git log --oneline --graph # displays info as a text-based graph
```
We can commit all the files in a directory as follows:
```bash
mkdir spaceships
git status
git add spaceships
git status ## notice that spaceships doesn't show up. That's because git does not track empty directories
touch spaceships/apollo-11 spaceships/sputnik-1 # add files to the spaceships directory
git status
git add spaceships
git status
git commit -m "Added some thoughts about spaceships"
```
##### Exercise
Which of the following commit messages would be most appropriate for the last commit made to `mars.txt`?
1. “Changes”
2. “Added line ‘But the Mummy will appreciate the lack of humidity’ to mars.txt”
3. “Discuss effects of Mars’ climate on the Mummy”
Aim for a message that describes the change without spelling out every detail; (3) is a great option.
##### Exercise
Make a change to your file, commit those changes, and display the changes through something like `git diff`
```bash
# if you want to post code, go for it
```
#### Exploring History
`HEAD` refers to the most recent commit on your current branch, i.e. the most recently committed version of your files.
Let's add a new line of text to our file
```bash
nano mars.txt
cat mars.txt
```
Now we can use `diff` to see how these changes compare to committed files
```bash
git diff HEAD mars.txt
```
If we want to see the differences between older commits we can use `git diff` again, but with the notation `HEAD~1`, `HEAD~2`, and so on, to refer to them:
```bash
git diff HEAD~3 mars.txt
```
We could also use `git show` which shows us what changes we made at an older commit as well as the commit message, rather than the differences between a commit and our working directory that we see by using `git diff`.
```bash
git show HEAD~3 mars.txt
```
In this way, we can build up a chain of commits. The most recent end of the chain is referred to as HEAD; we can refer to previous commits using the ~ notation, so `HEAD~1` means “the previous commit”, while `HEAD~123` goes back 123 commits from where we are now.
We can also refer to commits using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and “unique” really does mean unique: every change to any set of files on any computer has a unique 40-character identifier. Our first commit was given the ID `f22b25e3233b4645dabd0d81e651fe074bd8e73b`, so let’s try this:
```bash
git diff f22b25e3233b4645dabd0d81e651fe074bd8e73b mars.txt
```
We can shorten that ID, as long as it's not ambiguous:
```bash
git diff f22b25e mars.txt
```
We can put things back the way they were by using `git checkout`:
```bash
git checkout HEAD mars.txt # reverts file back to most recent commit
cat mars.txt
```
If we want to go back even further, we can use commit identifiers:
```bash
git checkout f22b25e mars.txt
```
The command `checkout` has other important functionalities, and Git will misunderstand your intentions if you are not accurate with the typing. For example, if you forget `mars.txt` in the previous command, you will end up in a state called a "detached HEAD":
```bash
git checkout f22b25e
```
##### Exercise: reverting a commit
The command `git revert` is different from `git checkout [commit ID]` because `git checkout` returns the files not yet committed within the local repository to a previous state, whereas `git revert` reverses changes committed to the local and project repositories.
Below are the right steps and explanations for Jennifer to use `git revert`, what is the missing command?
1. ________ # Look at the git history of the project to find the commit ID
2. Copy the ID (the first few characters of the ID, e.g. 0b1d055).
3. `git revert [commit ID]`
4. Type in the new commit message.
5. Save and close
*Answer*: `git log`
`git stash` allows you to "stash" some changes to commit later.
`git show` prints whatever you want to standard output, which you can then redirect into a file.
- To print the contents of a file from a certain version, use `git show <version>:<filename>`, with `<version>` being a commit ID or branch (or tag) name.
#### Ignoring things
What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis? Let’s create a few dummy files:
```bash
mkdir results
touch a.dat b.dat c.dat results/a.out results/b.out
git status # see what Git says
```
Putting these files under version control would be a waste of disk space. What’s worse, having them all listed could distract us from changes that actually matter, so let’s tell Git to ignore them.
We do this by creating a file in the root directory of our project called `.gitignore`:
```bash
nano .gitignore # add what files you want to ignore to this file
```
```
*.dat       # ignore any files ending in '.dat'
results/    # ignore any files in the results directory
```
Once we have created this file, the output of `git status` is much cleaner. The only thing Git notices now is the newly-created `.gitignore` file. You might think we wouldn’t want to track it, but everyone we’re sharing our repository with will probably want to ignore the same things that we’re ignoring. Let’s add and commit `.gitignore`:
```bash
git add .gitignore
git commit -m "Ignore data files and the results folder."
git status
```
As a bonus, using `.gitignore` helps us avoid accidentally adding files to the repository that we don’t want to track:
```bash
git add a.dat # will raise flag that a.dat is supposed to be ignored
```
If we really want to override our ignore settings, we can use `git add -f` to force Git to add something. For example, `git add -f a.dat`. We can also always see the status of ignored files if we want:
```bash
git status --ignored
```
You can include specific files by editing your `.gitignore` file with the `!` operator:
```
*.dat # ignore all data files
!final.dat # except final.dat
```
The exception must be listed after the pattern it overrides.
### Connect local to remote repository
We use SSH here because, while it requires some additional configuration, it is a security protocol widely used by many applications. The steps below describe SSH at a minimum level for GitHub. A supplemental episode to this lesson discusses advanced setup and concepts of SSH and key pairs, and other material supplemental to git related SSH.
Copy that URL from the browser, go into the local planets repository, and run this command:
```bash=
git remote add origin git@github.com:vlad/planets.git
```
Make sure to use the URL for your repository rather than Vlad’s: the only difference should be your username instead of `vlad`.
`origin` is a local name used to refer to the remote repository. It could be called anything, but `origin` is a convention that is often used by default in git and GitHub, so it’s helpful to stick with this unless there’s a reason not to.
We can check that the command has worked by running `git remote -v`:
```bash=
git remote -v
```
### SSH Background and Setup
Before Dracula can connect to a remote repository, he needs to set up a way for his computer to authenticate with GitHub so it knows it’s him trying to connect to his remote repository.
We are going to set up the method that is commonly used by many different services to authenticate access on the command line. This method is called Secure Shell Protocol (SSH). SSH is a cryptographic network protocol that allows secure communication between computers using an otherwise insecure network.
SSH uses what is called a key pair. This is two keys that work together to validate access. One key is publicly known and called the public key, and the other key called the private key is kept private. Very descriptive names.
You can think of the public key as a padlock, and only you have the key (the private key) to open it. You use the public key where you want a secure method of communication, such as your GitHub account. You give this padlock, or public key, to GitHub and say “lock the communications to my account with this so that only computers that have my private key can unlock communications and send git commands as my GitHub account.”
What we will do now is the minimum required to set up the SSH keys and add the public key to a GitHub account.
The first thing we are going to do is check if this has already been done on the computer you’re on. Because generally speaking, this setup only needs to happen once and then you can forget about it.
We will run the list command to check what key pairs already exist on your computer.
```bash=
ls -al ~/.ssh
```
Your output is going to look a little different depending on whether or not SSH has ever been set up on the computer you are using.
If SSH has been set up on the computer you’re using, the public and private key pairs will be listed. The file names are either `id_ed25519`/`id_ed25519.pub` or `id_rsa`/`id_rsa.pub` depending on how the key pairs were set up.
Since they don’t exist on Dracula’s computer, he uses this command to create them.
### Create an SSH key pair
To create an SSH key pair Vlad uses this command, where the `-t` option specifies which type of algorithm to use and `-C` attaches a comment to the key (here, Vlad’s email):
```bash=
ssh-keygen -t ed25519 -C "vlad@tran.sylvan.ia"
```
If you are using a legacy system that doesn’t support the Ed25519 algorithm, use:
```bash=
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
```
We want to use the default file, so just press ENTER.
Now, it is prompting Dracula for a passphrase. Since he is using his lab’s laptop that other people sometimes have access to, he wants to create a passphrase. Be sure to use something memorable or save your passphrase somewhere, as there is no “reset my password” option.
After entering the same passphrase a second time, we receive the confirmation.
The “identification” is actually the private key. You should never share it. The public key is appropriately named. The “key fingerprint” is a shorter version of a public key.
Now that we have generated the SSH keys, we will find the SSH files when we check.
```bash=
ls -al ~/.ssh
```
### Copy the public key to GitHub
Now we have a SSH key pair and we can run this command to check if GitHub can read our authentication.
```bash=
ssh -T git@github.com
```
Right, we forgot that we need to give GitHub our public key!
First, we need to copy the public key. Be sure to include the `.pub` at the end, otherwise you’re looking at the private key.
```bash=
cat ~/.ssh/id_ed25519.pub
```
Now, going to GitHub.com, click on your profile icon in the top right corner to get the drop-down menu. Click “Settings,” then on the settings page, click “SSH and GPG keys,” on the left side “Account settings” menu. Click the “New SSH key” button on the right side. Now, you can add the title (Dracula uses the title “Vlad’s Lab Laptop” so he can remember where the original key pair files are located), paste your SSH key into the field, and click the “Add SSH key” to complete the setup.
Now that we’ve set that up, let’s check our authentication again from the command line.
```bash=
ssh -T git@github.com
```
Good! This output confirms that the SSH key works as intended. We are now ready to push our work to the remote repository.
### Push local changes to a remote
Now that authentication is set up, we can return to the remote. This command will push the changes from our local repository to the repository on GitHub:
```bash=
git push origin main
```
Since Dracula set up a passphrase, it will prompt him for it. If you completed advanced settings for your authentication, it will not prompt for a passphrase.
#### Proxy
If the network you are connected to uses a proxy, there is a chance that your last command failed with “Could not resolve hostname” as the error message. To solve this issue, you need to tell Git about the proxy:
```bash=
git config --global http.proxy http://user:password@proxy.url
git config --global https.proxy https://user:password@proxy.url
```
When you connect to another network that doesn’t use a proxy, you will need to tell Git to disable the proxy using:
```bash=
git config --global --unset http.proxy
git config --global --unset https.proxy
```
#### Password Managers
If your operating system has a password manager configured, `git push` will try to use it when it needs your username and password. For example, this is the default behavior for Git Bash on Windows. If you want to type your username and password at the terminal instead of using a password manager, type:
```bash=
unset SSH_ASKPASS
```
in the terminal, before you run `git push`. Despite the name, Git uses `SSH_ASKPASS` for all credential entry, so you may want to unset `SSH_ASKPASS` whether you are using Git via SSH or https.
You may also want to add `unset SSH_ASKPASS` at the end of your `~/.bashrc` to make Git default to using the terminal for usernames and passwords.
#### The '-u' Flag
You may see a `-u` option used with `git push` in some documentation. This option is synonymous with the `--set-upstream-to` option for the `git branch` command, and is used to associate the current branch with a remote branch so that the `git pull` command can be used without any arguments. To do this, simply use `git push -u origin main` once the remote has been set up.
We can pull changes from the remote repository to the local one as well:
```bash=
git pull origin main
```
Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.
### Collaborating
For the next step, get into pairs. One person will be the “Owner” and the other will be the “Collaborator”. The goal is for the Collaborator to add changes to the Owner’s repository. We will switch roles at the end, so both people will play Owner and Collaborator.
The Owner needs to give the Collaborator access. On GitHub, click the “Settings” button on the right, select “Collaborators”, click “Add people”, and then enter your partner’s username.
To accept access to the Owner’s repo, the Collaborator needs to go to [https://github.com/notifications](https://github.com/notifications) or check for email notification. Once there she can accept access to the Owner’s repo.
Next, the Collaborator needs to download a copy of the Owner’s repository to her machine. This is called “cloning a repo”.
The Collaborator doesn’t want to overwrite her own version of `planets.git`, so needs to clone the Owner’s repository to a different location than her own repository with the same name.
To clone the Owner’s repo into her `Desktop` folder, the Collaborator enters:
```bash=
git clone git@github.com:vlad/planets.git ~/Desktop/vlad-planets
```
Replace ‘vlad’ with the Owner’s username.
If you choose to clone without the clone path (`~/Desktop/vlad-planets`) specified at the end, you will clone inside your own planets folder! Make sure to navigate to the `Desktop` folder first.
The Collaborator can now make a change in her clone of the Owner’s repository, exactly the same way as we’ve been doing before:
```bash=
cd ~/Desktop/vlad-planets
nano pluto.txt
cat pluto.txt
```
```bash=
git add pluto.txt
git commit -m "Add notes about Pluto"
```
Then push the change to the Owner’s repository on GitHub:
```bash=
git push origin main
```
Note that we didn’t have to create a remote called `origin`: Git uses this name by default when we clone a repository. (This is why `origin` was a sensible choice earlier when we were setting up remotes by hand.)
Take a look at the Owner’s repository on GitHub again, and you should be able to see the new commit made by the Collaborator. You may need to refresh your browser to see the new commit.
#### Some more about remotes
In this episode and the previous one, our local repository has had a single “remote”, called `origin`. A remote is a copy of the repository that is hosted somewhere else, that we can push to and pull from, and there’s no reason that you have to work with only one. For example, on some large projects you might have your own copy in your own GitHub account (you’d probably call this `origin`) and also the main “upstream” project repository (let’s call this `upstream` for the sake of examples). You would pull from `upstream` from time to time to get the latest updates that other people have committed.
Remember that the name you give to a remote only exists locally. It’s an alias that you choose - whether `origin`, or `upstream`, or `fred` - and not something intrinsic to the remote repository.
The `git remote` family of commands is used to set up and alter the remotes associated with a repository. Here are some of the most useful ones:
- `git remote -v` lists all the remotes that are configured (we already used this in the last episode)
- `git remote add [name] [url]` is used to add a new remote
- `git remote remove [name]` removes a remote. Note that it doesn’t affect the remote repository at all - it just removes the link to it from the local repo.
- `git remote set-url [name] [newurl]` changes the URL that is associated with the remote. This is useful if it has moved, e.g. to a different GitHub account, or from GitHub to a different hosting service. Or, if we made a typo when adding it!
- `git remote rename [oldname] [newname]` changes the local alias by which a remote is known - its name. For example, one could use this to change `upstream` to `fred`.
To download the Collaborator’s changes from GitHub, the Owner now enters:
```bash=
git pull origin main
```
Now the three repositories (Owner’s local, Collaborator’s local, and Owner’s on GitHub) are back in sync.
#### A Basic Collaborative Workflow
In practice, it is good to be sure that you have an updated version of the repository you are collaborating on, so you should `git pull` before making our changes. The basic collaborative workflow would be:
- update your local repo with `git pull origin main`
- make your changes and stage them with `git add`
- commit your changes with `git commit -m`
- upload the changes to GitHub with `git push origin main`
It is better to make many commits with smaller changes rather than one commit with massive changes: small commits are easier to read and review.
#### Exercise
The Owner pushed commits to the repository without giving any information to the Collaborator. How can the Collaborator find out what has changed with command line? And on GitHub?
On the command line, the Collaborator can use `git fetch origin main` to get the remote changes into the local repository, but without merging them. Then by running `git diff main origin/main` the Collaborator will see the changes output in the terminal.
On GitHub, the Collaborator can go to the repository and click on “commits” to view the most recent commits pushed to the repository.
### Conflicts
As soon as people can work in parallel, they’ll likely step on each other’s toes. This will even happen with a single person: if we are working on a piece of software on both our laptop and a server in the lab, we could make different changes to each copy. Version control helps us manage these conflicts by giving us tools to resolve overlapping changes.
To see how we can resolve conflicts, we must first create one. The file `mars.txt` currently looks like this in both partners’ copies of our `planets` repository:
```bash=
cat mars.txt
```
Let’s add a line to the collaborator’s copy only:
```bash=
nano mars.txt
cat mars.txt
```
and then push the change to GitHub:
```bash=
git add mars.txt
git commit -m "Add a line in our home copy"
```
```bash=
git push origin main
```
Now let’s have the owner make a different change to their copy without updating from GitHub:
```bash=
nano mars.txt
cat mars.txt
```
We can commit the change locally:
```bash=
git add mars.txt
git commit -m "Add a line in my copy"
```
but Git won’t let us push it to GitHub:
```bash=
git push origin main
```
Git rejects the push because it detects that the remote repository has new updates that have not been incorporated into the local branch. What we have to do is pull the changes from GitHub, merge them into the copy we’re currently working in, and then push that. Let’s start by pulling:
```bash=
git pull origin main
```
The `git pull` command updates the local repository to include those changes already included in the remote repository. After the changes from the remote branch have been fetched, Git detects that changes made to the local copy overlap with those made to the remote repository, and therefore refuses to merge the two versions to stop us from trampling on our previous work. The conflict is marked in the affected file:
```bash=
cat mars.txt
```
Our change is preceded by `<<<<<<< HEAD`. Git has then inserted `=======` as a separator between the conflicting changes and marked the end of the content downloaded from GitHub with `>>>>>>>`. (The string of letters and digits after that marker identifies the commit we’ve just downloaded.)
It is now up to us to edit this file to remove these markers and reconcile the changes. We can do anything we want: keep the change made in the local repository, keep the change made in the remote repository, write something new to replace both, or get rid of the change entirely. Let’s replace both so that the file looks like this:
```bash=
cat mars.txt
```
```
Cold and dry, but everything is my favorite color
The two moons may be a problem for Wolfman
But the Mummy will appreciate the lack of humidity
We removed the conflict on this line
```
To finish merging, we add mars.txt to the changes being made by the merge and then commit:
```bash=
git add mars.txt
git status
```
```
On branch main
All conflicts fixed but you are still merging.
(use "git commit" to conclude merge)
Changes to be committed:
modified: mars.txt
```
```bash=
git commit -m "Merge changes from GitHub"
```
Now we can push our changes to GitHub:
```bash=
git push origin main
```
Git keeps track of what we’ve merged with what, so we don’t have to fix things by hand again when the collaborator who made the first change pulls again:
```bash=
git pull origin main
```
We get the merged file:
```bash=
cat mars.txt
```
We don’t need to merge again because Git knows someone has already done that.
Git’s ability to resolve conflicts is very useful, but conflict resolution costs time and effort, and can introduce errors if conflicts are not resolved correctly. If you find yourself resolving a lot of conflicts in a project, consider these technical approaches to reducing them:
- Pull from upstream more frequently, especially before starting new work
- Use topic branches to segregate work, merging to main when complete
- Make smaller, more atomic commits
- Where logically appropriate, break large files into smaller ones so that it is less likely that two authors will alter the same file simultaneously
Conflicts can also be minimized with project management strategies:
- Clarify who is responsible for what areas with your collaborators
- Discuss what order tasks should be carried out in with your collaborators so that tasks expected to change the same lines won’t be worked on simultaneously
- If the conflicts are stylistic churn (e.g. tabs vs. spaces), establish a project convention that is governing and use code style tools (e.g. `htmltidy`, `perltidy`, `rubocop`, etc.) to enforce, if necessary