UPGG Informatics Orientation Bootcamp 2021

## Day 1 & 2 Notes Link

Notes from Monday & Tuesday can be accessed here: https://hackmd.io/3NHjDb1MRxWUke7CiXqm2A

## Day 3 (August 11): Automation through Programming

Instructor: Zach M.

Notebook with materials for today at: https://github.com/zmielko/scicomp-python/blob/master/automation_python.ipynb

Happy Birthday Brandon!!!! 🎉🎉🎉

### Recap of what we've learned so far in Python:

* fundamentals of programming (Monday)
    * data types
    * for loops
    * conditionals (if/elif/else)
    * functions
* data exploration with Pandas & Numpy (Tuesday)

#### Reminder:

* This is a LOT of material if you're new to programming/Python!!!

For this lesson, we will be going back to using some more fundamental Python concepts.

## Writing Professional Code

#### Learning Objectives

1. Formatting your code: style guides (PEPs and docstrings)
2. DRY: Don't Repeat Yourself
3. Defensive programming & errors
4. Writing your own scripts and modules: sharable code

#### PEP: Python Enhancement Proposal

* Coding standards adopted by Python maintainers & developers
* Different languages have different standards
    * Many standards are shared among languages
    * **Concepts are fundamental**, though implementation varies
* Standards allow for:
    * other professionals to easily read/debug your code
    * tools that interact with your code to parse it correctly

### Style Guides

* The quasi-official guide for Python is called [PEP 8](https://www.python.org/dev/peps/pep-0008/)
* We will be learning and using a subset of these conventions today

### Naming conventions for variables

1. **snake_case**: All words are lower case with underscores between them
2. **CamelCase**: Words start with capital letters and are not separated
3. **mixedCase**: Like CamelCase but the first word is lowercase
4. **UPPERCASE_WITH_UNDERSCORES**: All letters are uppercase, separated by underscores

Different styles are used in different contexts in Python:

1. Variables: **snake_case**
    * variable_name, dna_sequence
2. Functions: **snake_case**
    * combine_replicates()
3. Errors: **CamelCase**
    * ValueError, SyntaxError

#### **Exercise 1**

Edit the code block below to conform to PEP 8 naming conventions.

```python=
def Velocity(TOTALDISTANCE, time):
    "This calculates the distance over time"
    Velocity_Result = TOTALDISTANCE / time
    return(Velocity_Result)

Velocity(10, 2)
```

#### POST ANSWERS HERE:

Zach's edits:

```python=
def velocity(total_distance, time):
    "This calculates the distance over time"
    velocity_result = total_distance / time
    return(velocity_result)

velocity(10, 2)
```

#### Dangers in variable naming

Never give a variable the same name as a built-in function in Python!

```python=
sum_of_two_numbers = sum([5, 4])
print("Sum data type", type(sum))
sum = 10 + 5
print("Sum data type", type(sum))
print(sum_of_two_numbers)
print(sum)
```

Now the name `sum` refers to the integer 15 instead of the built-in function. If we try to call `sum()` now, it will no longer calculate the sum for us; Python raises a `TypeError` because an integer is not callable.

To undo this overwriting of the `sum()` function, go to the top of your notebook and select "restart" from the Kernel drop down menu (running `del sum` in a cell also removes the overwriting variable).

#### Comments

Adding comments to your code is very important: they are notes to yourself & others explaining what your code is doing, and they let you comment out code you don't want to run.

Comments in Python are everything after the `#` symbol (pound sign or "hash tag") on a line.

```python=
def my_function(x):
    "Docstring for my_function"
    # This is a comment.
    # print(x + x)
    # The print function above will not run due to the '#'
    print(x)

my_function(1)
```

Python has no dedicated multi-line comment syntax, but a triple-quoted string ` """ x """ ` that is not assigned to anything is evaluated and then discarded, so it is often used as a multi-line comment:

```python=
# Example
"""
Multi-line comments...
can extend...
across multiple lines...
"""

multi_line_string = """
Hello this is a string
this is another line
"""
print(multi_line_string)
```

#### PEP guidelines on docstrings

Python PEP guidelines suggest the following format:

    """One line description

    More details about your function, in triple-quotes
    """

or

    """Only a single line description in triple-quotes"""

#### Different communities will use different conventions

```python=
# Google format
"""Takes a string and returns a list of letters

Args:
    string (str): A string to parse for letters
    upper (bool): The letters are returned uppercase (default is False)

Returns:
    list: A list of each letter in the string
"""

# Numpy format
"""Takes a string and returns a list of letters

Parameters
----------
string : str
    A string to parse for letters
upper : bool, optional
    The letters are returned uppercase (default is False)

Returns
-------
list
    A list of each letter in the string
"""

# reStructuredText format
"""Takes a string and returns a list of letters

:param string: A string to parse for letters
:type string: str
:param upper: The letters are returned uppercase (default is False)
:type upper: bool
:returns: A list of each letter in the string
:rtype: list
"""
```

#### Exercise 2.1: importance of naming & documentation

Given the following function with poor naming and no documentation, determine:

* What are the 2 inputs
* What does it return

```python=
def FUNCTION(number, words):
    Smallest = 0
    for dictionary in number:
        LETTER = (dictionary / words) * 100
        if LETTER > Smallest:
            Smallest = LETTER
    return Smallest
```

Answers:

1. What are the two inputs:
    * number = list of int
    * words = int
2. What does the function return:
    * The largest element of the list "number", expressed as a percentage of the total "words"

#### Exercise 2.2: Refactor a function

**Refactoring** is a term that means re-writing code without changing the task it performs.

Refactor the following function with poor naming and no documentation.
You will want to:

* Rename variables to an appropriate name
* Write a docstring explaining what the function does using one of the example formats (Google, Numpy, reStructuredText)

Test your function by running it after you refactor to see if it still produces the same output.

Function to refactor:

```python=
def FUNCTION(n, w):
    S = 0
    for d in n:
        L = (d / w) * 100
        if L > S:
            S = L
    return S
```

#### Share your code here:

```python=
def max_percent(n, total):
    """The function returns the largest percent of each number divided by total

    Args:
        n (list of int): a list containing int
        total (int): the denominator for percent calculation; the total

    Returns:
        float: largest percent of each integer from n divided by total
    """
    largest = 0
    for i in n:
        percent = (i / total) * 100
        if percent > largest:
            largest = percent
    return largest
```

### D.R.Y. Don't Repeat Yourself

Avoid redundancy wherever possible! e.g.

```python=
protein_data1 = ["CREG1", "ELK1", "SF1", "GATA1", "GATA3", "CREB1"]
protein_data2 = ["ATF1", "GATA1", "STAT3", "P53", "CREG1"]
protein_data3 = ["RELA", "MYC", "SF1", "CREG1", "GATA3", "ELK1"]
proteins_of_interest = ["ELK1", "MITF", "KAL1", "CREG1"]

# Are there any matches in the first list?
match_list1 = []
for protein in protein_data1:
    if protein in proteins_of_interest:
        match_list1.append(protein)

# Are there any matches in the second?
match_list2 = []
for protein in protein_data2:
    if protein in proteins_of_interest:
        match_list2.append(protein)

# Are there any matches in the third?
match_list3 = []
for protein in protein_data3:
    if protein in proteins_of_interest:
        match_list3.append(protein)
```

It would be much better to write one abstracted/general function to do this matching task for us, so we don't have to make edits for every comparison we want to make!
#### Exercise 3: Refactoring redundant code into a function

* Refactor the code chunk above to turn the duplicated code into a function that you can call each time you want to perform the comparison

#### Your answers here:

Student answer #1:

```python=
# Your function here
protein_data1 = ["CREG1", "ELK1", "SF1", "GATA1", "GATA3", "CREB1"]
protein_data2 = ["ATF1", "GATA1", "STAT3", "P53", "CREG1"]
protein_data3 = ["RELA", "MYC", "SF1", "CREG1", "GATA3", "ELK1"]
proteins_of_interest = ["ELK1", "MITF", "KAL1", "CREG1"]

def match_protein(protein):
    """
    Takes a list of str and returns a list of str that contains
    proteins that match the proteins of interest

    Args:
        protein (list): list of str containing protein data

    Returns:
        list: list of str containing proteins that match proteins of interest
    """
    match_list = []
    for p in protein:
        if p in proteins_of_interest:
            match_list.append(p)
    return match_list

match_list1 = match_protein(protein_data1)
match_list2 = match_protein(protein_data2)
match_list3 = match_protein(protein_data3)
print(match_list1)
print(match_list2)
print(match_list3)
```

* This function works (good job!) but refers to a variable (`proteins_of_interest`) defined outside the scope of the function, which can be risky business!
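To make the risk concrete, here is a minimal sketch of what can go wrong when a function depends on a variable defined outside of it. The function name `match_protein_global` and the short example lists are made up for illustration and are not part of the lesson notebook; the only thing being demonstrated is that the function looks up the outside variable each time it is called.

```python
proteins_of_interest = ["ELK1", "MITF", "KAL1", "CREG1"]

def match_protein_global(proteins):
    """Return the proteins found in the *global* proteins_of_interest list."""
    match_list = []
    for p in proteins:
        # proteins_of_interest is looked up outside the function at call time
        if p in proteins_of_interest:
            match_list.append(p)
    return match_list

protein_data = ["CREG1", "ELK1", "SF1"]
first_result = match_protein_global(protein_data)

# Later in the notebook, the outside variable is reused for something else...
proteins_of_interest = ["SF1"]

# ...and the same call with the same argument now gives a different answer.
second_result = match_protein_global(protein_data)

print(first_result)
print(second_result)
```

Nothing in the two calls hints that their results should differ, which is what makes this kind of bug hard to spot in a long notebook.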
Student answer #2:

```python=
def protein_match(proteins, proteins_of_interest):
    """Takes a list of proteins, compares them against a list of interest, and returns any matches

    Parameters
    ----------
    proteins : list
        A list of proteins to compare against the list of interest
    proteins_of_interest : list
        A list of proteins of interest

    Returns
    -------
    match : list
        The proteins from the list of interest found in our list of proteins
    """
    match = []
    for protein in proteins:
        if protein in proteins_of_interest:
            match.append(protein)
    return match
```

* This function instead takes the "proteins_of_interest" list as a second argument to the function, which makes this function more generalizable (and much less at risk of returning errors)

Another answer:

```python=
def protein_match(n, pattern):
    """This function takes a protein list & checks if there are any matches in a list of proteins of interest.

    Args:
        n (list): A list of protein names (str)
        pattern (list): A list of proteins of interest (str)

    Returns:
        list: The protein names that matched
    """
    match = []
    for i in n:
        if i in pattern:
            match.append(i)
    return match
```

To set a default value for an argument of your function:

```python=
def function_with_defaults(needs_input, default = 5):
    result = needs_input + default
    return result

function_with_defaults(10)
# or
function_with_defaults(10, default = 10)
```

* Now there will be a default value of 5 for the second argument of the function (though we can give a different value too if we want)

### Break until 10:38 am!

### A helpful way to format strings

```python=
# Example
my_int = 5
message = "This number does not work:"
message + my_int  # raises a TypeError
```

* The '+' operator will work with two strings or two numbers, but it cannot concatenate a number AND a string

There are a number of ways to format this, but here's one way:

Formatting with f strings example:

```python=
m = f"This number does not work {my_int}. The number is lower than 10."
print(m)
```

More on f strings here: https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python

## Errors and Defensive programming in Python

Troubleshooting errors is a critical skill in Python programming. Being able to write informative errors will help tremendously when troubleshooting your code.

#### There are a number of built-in error types in Python

* NameError
* ZeroDivisionError
* TypeError
* IndexError
* ... and more

These errors raise **exceptions**. Read more on exceptions here: https://docs.python.org/3/library/exceptions.html

We may want to intentionally **raise** an error in some cases (to make sure our code works only as intended)

```python=
# Example
raise NameError("My custom error message")
```

#### Try and Except keywords

These keywords will help catch errors and provide useful error messages

```python=
# Example
try:
    print(hello)
except NameError:
    print("A custom error message")
```

If I try to do something like this (`print(hello)`) and it fails with a `NameError`, I want to print a specific error message (`"A custom error message"`)

Note that the above code will still continue running even though there was an error:

```python=
# Example with print() before and after
print("Before")
try:
    print(hello)
except NameError:
    print("A custom error message")
print("After")
```

*output:*

    Before
    A custom error message
    After

We might want the code to stop running if it encounters an error. We can do this by adding the **raise** keyword to our statement.

```python=
# Example with raise added within the except code
print("Before")
try:
    print(hello)
except NameError:
    print("A custom error message")
    raise
print("After")
```

Stopping the code when it encounters an error is important because:

* you want your code to "fail fast" so you can quickly address the problem
* you want to avoid returning **incorrect** or **unexpected** results

#### Printing Error Messages

There are two places where printed errors can go:

1. `stdout` or **standard output**
2. `stderr` or **standard error** (dedicated place for errors to go to)

We'll take a look at the help page for `print()` to learn how to do this

```python=
help(print)
```

```python=
# Import sys, compare stdout vs stderr
import sys
print("Hello this is going to stdout", file = sys.stdout)
print("Hello this is going to stderr", file = sys.stderr)
```

* the second statement will appear in your Jupyter notebook output with a red background (denotes standard error)

Note: Python Traceback messages don't use the red background in Jupyter Notebook (even though they technically go to standard error) just to improve readability of the error message.

Remember that `sys.stderr` is only available once the `sys` library has been imported:

```python=
import sys
```

Let's try adding try & except statements to familiar code:

```python=
def reverse_complement(dna_sequence):
    """Returns the reverse complement of a DNA sequence"""
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    try:
        for letter in reverse:
            result = result + complements[letter]
    except KeyError:
        #print("All letters must be capitalized (A, C, G, or T)", file=sys.stderr)
        raise ValueError("Input needs to be a capital A, C, G, or T")
    return(result)

print(reverse_complement("CAAg"))  # raises ValueError because of the lowercase "g"
```

#### Another example of when to use **try**, **except**, and/or **raise**:

Another problem is when programs produce incorrect results instead of producing an error.
Suppose we have a function that prints all kmers of a given k from a sequence:

```python=
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Formula for number of kmers
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

help(kmers_from_sequence)
kmers_from_sequence("CACGTGACTAG", 3)
print("After the function")
```

output as expected

```python=
kmers_from_sequence("CACGTGACTAG", -3)
print("After the function")
```

output is wrong but code did not tell us there was an error :(

We can sanitize the inputs to solve this. The value, k, should be a number greater than 0 and no larger than the length of the sequence.

#### Exercise 4: Sanitize Input

Refactor the following function to check that the value of k is:

* A positive number
* Not longer than the length of dna_sequence

If there is a problem, raise a ValueError with an appropriate message.

```python=
# Example:
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Write code to check input here!
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

kmers_from_sequence("CAATCGACGTA", 12) # Should raise an error
```

Share Answers Here:

Answer 1:

```python=
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Formula for number of kmers
    positions = len(dna_sequence) - k + 1
    if k <= 0:
        print("Length of kmer must be greater than 0. Please input a positive integer for k.")
        raise ValueError("k value must be > 0")
    if k > len(dna_sequence):
        print("Length of kmer cannot be greater than length of DNA sequence. Please input a smaller value for k.")
        raise ValueError("k value cannot be > len(dna_sequence)")
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

kmers_from_sequence("CAATCGACGTA", -5)
```

Answer 2:

```python=
# Example:
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Write code to check input here!
    if k <= len(dna_sequence) and k > 0:
        positions = len(dna_sequence) - k + 1
        for i in range(positions):
            kmer = dna_sequence[i:i + k]
            print(kmer)
    else:
        raise ValueError("'k' should be a number less than or equal to the length of the DNA sequence and more than 0.")

kmers_from_sequence("CAATCGACGT", 12) # Should raise an error
```

Answer 3:

```python=
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    if k <= 0 or k > len(dna_sequence):
        raise ValueError("k must be greater than 0 and no greater than the length of the DNA sequence")
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

kmers_from_sequence("CAATCGACGTA", 12)
```

Answer 4:

```python=
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Write code to check input here!
    if k < 1 or k > len(dna_sequence):
        raise ValueError("inappropriate k value")
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

kmers_from_sequence("CAATCGACGTA", 12) # Should raise an error
```

Answer 5:

```python=
def kmers_from_sequence(dna_sequence, k):
    """Prints all kmers from a sequence"""
    # Write code to check input here!
    if k <= 0 or k > len(dna_sequence):
        raise ValueError("k must be a positive integer no greater than the length of dna_sequence")
    positions = len(dna_sequence) - k + 1
    for i in range(positions):
        kmer = dna_sequence[i:i + k]
        print(kmer)

kmers_from_sequence("CAATCGACGTA", 12) # Should raise an error
```

### Making scripts you can import

We will start new .ipynb and .py files for this demo.
In your Jupyter notebook server (in your browser):

* New -> Python3 Notebook
    * name: **demo_for_imports**
* New -> Text File
    * we can save as a .py file by renaming the document: **cool_functions.py**

in **cool_functions.py**:

```python=
import sys  # needed for sys.stderr below

dna_sequence = "CACGTGATT"
complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
reverse = dna_sequence[::-1]
result = ""
# Add try - except - raise statements
for letter in reverse:
    try:
        result = result + complements[letter]
    except KeyError:
        print("All letters must be capital A, C, G, or T", file= sys.stderr)
        raise
print(result)
```

in **demo_for_imports.ipynb**:

```python=
import cool_functions
```

* as is, this import statement alone will spit out some output

To find out what's going on, let's use the `help()` function:

```python=
help(cool_functions)
```

* this will tell us a little bit about the code we imported

Let's add some more documentation to `cool_functions.py`:

```python=
"""
Demo function

This is a demo for importing from a script
"""
import sys  # needed for sys.stderr below

dna_sequence = "CACGTGATT"
complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
reverse = dna_sequence[::-1]
result = ""
# Add try - except - raise statements
for letter in reverse:
    try:
        result = result + complements[letter]
    except KeyError:
        print("All letters must be capital A, C, G, or T", file= sys.stderr)
        raise
print(result)
```

Let's turn this script back into a function:

```python=
"""
Demo function

This is a demo for importing from a script
"""
import sys  # needed for sys.stderr below

def reverse_complement(dna_sequence):
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    # Add try - except - raise statements
    for letter in reverse:
        try:
            result = result + complements[letter]
        except KeyError:
            print("All letters must be capital A, C, G, or T", file= sys.stderr)
            raise
    return(result)

reverse_result = reverse_complement("CACATTT")
print(reverse_result)
```

Now if we go back to **demo_for_imports.ipynb** and restart our kernel (and clear outputs), it still prints some output unexpectedly after we import, but now running `help(cool_functions)` has a new section to describe the function in `cool_functions.py`

If we try to use that function in our **demo_for_imports.ipynb** jupyter notebook:

```python=
# Example
cool_functions.reverse_complement("AAATGTG")
```

* We can use this function! :)

But, why does it print some output when we import into our jupyter notebook? This is because we have a `print()` statement at the bottom of `cool_functions.py`, and importing a file runs all of the code in it! This is not good practice.

How do we separate the print statement from the useful functions in our `.py` document???

Answer: Add this weird statement `if __name__ == "__main__":` before the data & print statement (after the function, still in our `.py` document):

```python=
if __name__ == "__main__":
    reverse_result = reverse_complement("CACATTT")
    print(reverse_result)
```

* `__name__` is a variable that Python sets behind the scenes: it is `"__main__"` when a file is run directly, and the module's name when the file is imported. The code under the `if` therefore only runs when the file is run directly.

in **demo_for_imports.ipynb**:

```python=
print(__name__)
```

will return: `__main__`

* the notebook itself is the top-level program being run, so its `__name__` is `__main__`

Now if we run the code chunk `import cool_functions` and then print:

```python=
print(cool_functions.__name__)
```

it will output `cool_functions`

Now on the command line, let's try running our `cool_functions.py` script outside the jupyter notebook.
```shell=
$ python cool_functions.py
```

output:

    AAATGTG

* this script prints the output of the function call because now, `__name__ == "__main__"` is True (the file is being run directly rather than imported by a different document)

### Editing scripts to take arguments from the command line

Two ways to get Python scripts to take arguments using built-in packages

* `import sys`
* `import argparse`
    - read more about argparse here: https://docs.python.org/3/library/argparse.html

Today we're going to use `sys`, so make sure `import sys` is at the top of our script

```python=
"""
Demo function

This is a demo for importing from a script
"""
import sys

def reverse_complement(dna_sequence):
    complements = {"T":"A", "A":"T", "C":"G", "G":"C"}
    reverse = dna_sequence[::-1]
    result = ""
    # Add try - except - raise statements
    for letter in reverse:
        try:
            result = result + complements[letter]
        except KeyError:
            print("All letters must be capital A, C, G, or T", file= sys.stderr)
            raise
    return(result)

if __name__ == "__main__":
    print(sys.argv[0]) #what is at index 0?
    print(sys.argv[1]) #what is at index 1?
    input_sequence = sys.argv[1] #this is our first argument from the command line
    reverse_result = reverse_complement(input_sequence)
    print(reverse_result)
```

Now we can add an argument on the command line:

```shell=
python cool_functions.py CACAAA
```

**Output:**

    cool_functions.py
    CACAAA
    TTTGTG

* The first output line is the result of `print(sys.argv[0])`
* The second output line is the result of `print(sys.argv[1])`
* The third output line is the result of `print(reverse_result)`

### More on defensive programming

What if we don't give this script any arguments? It will raise an `IndexError` because there's nothing to access for `sys.argv[1]`.

We have a built-in error message for the case where we type in a number instead of a string of DNA bases, but no error message for missing arguments. How might you build in safeguards and informative error messages for this kind of mistake?
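One possible answer to the question above, offered as a rough sketch rather than the solution used in class: check how many items are in the argument list before indexing into it, and raise a `ValueError` with a usage hint if the count is wrong. The helper name `get_sequence_argument` is hypothetical; it takes a list shaped like `sys.argv` so that it can be tried out inside a notebook without a real command line.

```python
import sys

def get_sequence_argument(argv):
    """Return the DNA sequence from a sys.argv-style list.

    argv is expected to look like [script_name, sequence].
    Raises ValueError with a usage hint if the number of arguments is wrong.
    """
    if len(argv) != 2:
        raise ValueError(
            f"Expected exactly 1 argument (a DNA sequence) but got {len(argv) - 1}. "
            "Usage: python cool_functions.py SEQUENCE"
        )
    return argv[1]

# Try both cases without needing a real command line
print(get_sequence_argument(["cool_functions.py", "CACAAA"]))

try:
    get_sequence_argument(["cool_functions.py"])
except ValueError as error:
    # Send the informative message to standard error
    print(error, file=sys.stderr)
```

In the script itself, the call would be `get_sequence_argument(sys.argv)` inside the `if __name__ == "__main__":` block. For anything beyond a single argument, `argparse` (linked above) produces usage messages like this automatically.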
## Wed morning summary:

* Python style guide, DRY, error messages, Importing custom modules
* anatomy of writing a python script
* ways to style naming variables, functions, etc.
* refactoring code
* docstrings
* help function `help()`
* `if __name__ == "__main__"`
    * more reading on this: https://stackoverflow.com/questions/419163/what-does-if-name-main-do

## Break - resume at 2:15pm

## Wednesday Afternoon : Sharing Jupyter Notebooks

Instructor : Hilmar Lapp

### Overview

In this session we'll discuss ways to share your work, whether it be with labmates, PIs, collaborators, or anyone looking to run your code. It focuses on setting up a GitHub account and putting your code on a public repository, as well as adding dependency files to your repository so that you can share links to an interactive version of your code using Binder. This will lay the groundwork for tomorrow's discussion on version control.

### Sharing Jupyter Notebooks using GitHub

How can you share your computational work that's in the form of a Jupyter notebook? This can be done via GitHub. GitHub is a development platform where we “can host and review code, manage projects, and build software.” It has several features that make sharing and collaborating on projects easier.

* To get started, we'll first need a GitHub account. To register, we need to:
    * Open a web browser
    * Navigate to github.com
* On the GitHub homepage enter:
    * a username
    * an email address
    * a password
* Click the green Sign up for GitHub button.

![](https://i.imgur.com/BIt3Yzz.png)

Once logged in, we'll **create a repository**.

* Click on Start a project as shown below.

![](https://i.imgur.com/7Yvibzj.png)

* Add a repository name. We choose to name our repository `sharing-jupyter`
* Keep the repository public (the public Binder service used later in this session needs to be able to read it).
* Check the Initialize this repository with a README option.
* Click the green Create repository button.
![](https://i.imgur.com/8ks5Ooe.png)

The repository is now created, and should be visible on your GitHub homepage.

### Uploading an example file (notebook) to the repository

* Open Anaconda Navigator and Launch a new Jupyter Notebook.
* Click New > Python 3.
* In the first cell of the notebook, enter the following Python statement: `print('hello world')`
* Run the cell to confirm that your code is free of syntax and other errors.
* Save the notebook

Now, we will upload this notebook to our repository.

* Go to GitHub in our browser.
* Click on the `sharing-jupyter` repository
* Click on the Upload file button
* Drag and drop the sample notebook or click the choose your files link to select the sample notebook.
* We will see any files that we have uploaded at the bottom of the drag and drop area.
* Add a message describing the change we are about to make.
    * Type “Adds sample notebook” in the subject field.
    * We can either add the same message below in the extended description or leave it blank.
* Click on Commit changes button to complete the upload.

![](https://i.imgur.com/4TbUwAm.png)

Under the hood, the notebook file is simply a JSON file, but GitHub can natively render it the way it's displayed in a Jupyter Notebook.

NOTE: You can see that GitHub also shows the output cells. This is because we've run the notebook locally, and saved it. Everything that's displayed in GitHub is what was present in the file - it doesn't do computations on its own! Jupyter Notebooks shared through GitHub are rendered, but are static. GitHub does not run the notebook(s) in a repository.

![](https://i.imgur.com/5tFBfxp.png)

TASK: Upload the notebook from the `data-exploration` lesson.

You can see that the repository will also render markdown text, tables, and any graphical content, such as the graphs we plotted using `matplotlib` and `seaborn`.
Note that the output of a cell will be displayed on GitHub only if the notebook is saved *after* the cell was run.

### Sharing Interactive repositories

#### Overview

Many genomic analyses will involve the tweaking of several parameters, and repetition of certain analyses. While one way to work collaboratively might be running analyses locally and sharing each iteration as a jupyter notebook, this is not necessarily ideal. We will now discuss how to share **interactive repositories** which will allow your PIs/collaborators to interact with your jupyter notebook and run analyses themselves using just the browser.

Slides used during the lecture can be found [here](https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/slides/02-intro_to_binder.slides.html#/)

#### Sharing Jupyter Notebooks using Binder

Running code is much more complicated than simply displaying it. It requires:

* Hardware resources necessary to perform the relevant computations
* Any dependencies, languages, compilers and other software that the code needs.

Binder is a service that provides both, and allows you to create a link to an interactive version of your code.

Here's an example of a Binder link: https://mybinder.org/v2/gh/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages?filepath=notebooks%2FData_exploration_run.ipynb

Clicking it will create a live version of the Data Exploration notebook from the earlier lesson, which is maintained in this GitHub repository: https://github.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter

### Creating a Binder Link

To build a Binder go to mybinder.org and paste in the URL of your repository, like so:

![](https://i.imgur.com/HNNxfvE.png)

Binder will now build the environment necessary to run the notebook. This is done using Docker images, which allow Binder to package the dependencies necessary to run the code.
You will now see a structure similar to what is seen in a Jupyter environment, with interactive notebooks that can be run by the end user.

Try running the `hello world` repository.

Now try running the `data-exploration` repository. You likely saw the following error

![](https://i.imgur.com/rJ6KK6s.png)

This is because Binder isn't aware of all the dependencies our code assumes are installed - such as the library pandas. We need to tell Binder to install specific software that the code depends upon.

To do this, create a file listing the dependencies of our notebook. This file **must** be named `requirements.txt`. Upload this file to your GitHub repo.

What should this file contain? First, we will list all the libraries used to run the code in the notebook.

![](https://i.imgur.com/Imm2Bd9.png)

Note that our notebook also reads in a file at some point. We also need to provide this file!

![](https://i.imgur.com/yCClvhW.png)

This can be achieved via the `Upload File` option in the repository, which we can use to upload the `gapminder` text file.

Your repository should now look something like this:

![](https://i.imgur.com/sb4FI0j.png)

Because Binder is installing dependencies as well, the build will take a little longer this time.

Once the notebook is up and running, make sure the file path to the data file is correct. In this case, there's no `data` subfolder, so amend the original code from:

```python=
gapminder=pd.read_table("../data/gapminderDataFiveYear_superDirty.txt", sep = "\t")
```

to

```python=
gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t")
```

Since Binder has an interactive notebook, we could easily make this change! You can similarly change the other cells in the notebook to play around with the data, or add new plots and analyses.

Question: If you make changes to a Binder link, do they get saved?

- No. Any changes made to a notebook accessed via a Binder link will be lost once that instance/browser window is closed. Only the original shared repository remains on GitHub. To make permanent changes, push them to the original repository. The other option is to make changes on a notebook accessed via a Binder link, and then download it to save it locally.

### Key Points

In this session, we discussed how to share our jupyter notebooks in an interactive fashion. To do this, we:

* Created a file called requirements.txt that specified which software was needed to run our code.
* Used the mybinder.org interface to build a Binder from this repository.
* Created a Binder link in order to share our interactive repository with others.

Often, our data might be too large to share on our repository. In these cases, the data can be loaded from a URL pointing to the server where it's stored, instead of being stored locally in the repository.

Duke investigators also have preferential access to resources such as https://codeocean.com, which builds on this code-in-a-capsule idea to facilitate computational research sharing.

## Thursday Morning

## Version Control with `git`

Instructor: Katharine

This lesson is loosely based on the Software Carpentry lesson here: http://swcarpentry.github.io/git-novice/

### Automated Version Control

What is version control? Good version control records each version of your work as a set of files in order to refer back to previous versions later. As academics, we are very used to making many versions of written manuscripts, but this is unsustainable for scripts and files where the content of the project far exceeds the length of a manuscript. How can we protect rigor, reproducibility, and robustness in our code with version control?

Version control also enables collaboration in code. When we write manuscripts, we typically start with a base version and build on that base.

![](https://i.imgur.com/9PHw2b8.png)

However, when multiple people are working on a project, or you're working on the same project in a few different places, conflicting changes can arise that need to be reconciled.

![](https://i.imgur.com/fJ6cKXG.png)

Effective version control makes it very hard to lose work. It also makes it difficult to accidentally overlook changes to the code. `git` enables thoughtful merging of changes. Even if you're working alone, this system enables documentation of what changes you've made and why you made them.

One way to save yourself a lot of grief at the time of publication is to maintain your work on GitHub throughout the course of your project.

A version control system is a tool that keeps track of these changes for us, effectively creating different versions of our files. It allows us to decide which changes will be made to the next version (each record of these changes is called a **commit**), and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a **repository**. Repositories can be kept in sync across different computers, facilitating collaboration among different people.

Some people refer to your GitHub as the "lab notebook of the modern world". However, this may not be the best way to use GitHub.

**Technical Note:** `git` is the underlying tool for GitHub. `git` is the language of GitHub. GitHub is just a nice interface through which to use `git`.

### Setting Up `git`

When we use `git` on a new computer for the first time, we need to configure a few things. Below are a few examples of configurations we will set as we get started with `git`:
* our name and email address (*it helps to use the email address associated with your GitHub account*),
* what our preferred text editor is,
* and that we want to use these settings globally (i.e. for every project).

On a command line, `git` commands are written as `git verb options`, where `verb` is what we actually want to do and `options` is additional optional information which may be needed for the verb. So here is how Dracula sets up his new laptop:

```bash=
$ git config --global user.name "Vlad Dracula"
$ git config --global user.email "vlad@tran.sylvan.ia"
```

Please use your own name and email address instead of Dracula’s. This user name and email will be associated with your subsequent `git` activity, which means that any changes pushed to GitHub, BitBucket, GitLab or another `git` host server after this lesson will include this information.

For this lesson, we will be interacting with GitHub and so the email address used should be the same as the one used when setting up your GitHub account. If you are concerned about privacy, please review GitHub’s instructions for keeping your email address private.

If you're confused about how config works, use:

```bash=
$ git config --help
```

Type <kbd>q</kbd> to exit the help page.

To see all of your global configuration settings, use:

```bash=
$ git config --list
```

You can change the way `git` recognizes and encodes line endings by setting `core.autocrlf` with `git config`.
The following settings are recommended:

```bash=
$ git config --global core.autocrlf input # for MacOS or Linux
$ git config --global core.autocrlf true # for Windows
```

To set your preferred text editor, use one of these:

```bash=
$ git config --global core.editor "atom --wait" #Atom
$ git config --global core.editor "nano -w" #nano
$ git config --global core.editor "bbedit -w" #BBEdit (Mac, with command line tools)
$ git config --global core.editor "/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -n -w" #Sublime Text (Mac)
$ git config --global core.editor "'c:/program files (x86)/sublime text 3/sublime_text.exe' -w" #Sublime Text (Win, 32-bit install)
$ git config --global core.editor "'c:/program files/sublime text 3/sublime_text.exe' -w" #Sublime Text (Win, 64-bit install)
$ git config --global core.editor "c:/Windows/System32/notepad.exe" #Notepad (Win)
$ git config --global core.editor "'c:/program files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin" #Notepad++ (Win, 32-bit install)
$ git config --global core.editor "'c:/program files/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin" #Notepad++ (Win, 64-bit install)
$ git config --global core.editor "kate" #Kate (Linux)
$ git config --global core.editor "gedit --wait --new-window" #Gedit (Linux)
$ git config --global core.editor "scratch-text-editor" #Scratch (Linux)
$ git config --global core.editor "emacs" #Emacs
$ git config --global core.editor "vim" #Vim
$ git config --global core.editor "code --wait" #VS Code
```

`nano` is a good starting editor. It is possible to reconfigure the text editor for `git` whenever you want to change it.

Git (2.28+) allows configuration of the name of the branch created when you initialize any new repository. Dracula decides to use that feature to set it to main so it matches the cloud service he will eventually use.
```bash=
$ git config --global init.defaultBranch main
```

### Creating a Repository

First, let’s create a directory in the Desktop folder for our work and then move into that directory:

```bash=
$ cd Desktop/
$ mkdir Project_GenomeAssembly
$ cd Project_GenomeAssembly
$ ls -l
```

You should now be in an empty folder called "Project_GenomeAssembly".

Now to initialize git.

```bash=
$ git init
$ ls -a
```

When we look at the hidden contents of a folder using the `ls` argument `-a`, we can see that `git` has created a hidden directory in which it will keep all versions of our work. If you delete this hidden directory, you will delete the project history. So maybe don't do that...

Each change we record in the project is called a `commit`, as if you have committed to a change being made to the main version of the project.

First, we will use git status to see what changes have been made in this directory.

```bash=
$ git status
```

```
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)
```

If you are using a different version of `git`, the exact wording of the output might be slightly different.

Let's make a `README`! Here, use the command specific to the text editor you have selected.

```bash=
$ vim README.txt
```

In your text editor, write something to the effect of:

```
Project: Genome Assembly
```

Exit `vim` with `:wq` to save your work. To confirm that the work was saved, use:

```bash=
$ cat README.txt
```

```
Project: Genome Assembly
```

If we run `git status`, we see:

```bash=
$ git status
```

```
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	README.txt

nothing added to commit but untracked files present (use "git add" to track)
```

To tell git what to track, we use `git add`.

```bash=
$ git add README.txt
$ git status
```

```
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   README.txt
```

Now that the README is in the staging area, let's commit this change with a note about what we changed.

```bash=
$ git commit -m "Start a readme for new genome assembly project"
```

```
Start a readme for new genome assembly project
 1 file changed, 1 insertion(+)
 create mode 100644 README.txt
```

Now what is the status?

```bash=
$ git status
```

```
nothing to commit, working tree clean
```

To see the history of changes, we use `git log`, which shows each commit's unique ID, time, user, and notes about what changes were made.

```bash=
$ git log
```

`git log` lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the `git commit` command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.

In real life, you will be committing entire functions or adding an entire file to the project, not just one line.

Let's change the file again:

```bash=
$ vim README.txt
```

We're adding maintainer info so that `README.txt` now says

```
Project: Genome Assembly
Maintainer: Katharine Korunes
```

Let's add and commit this change.

```bash=
$ git add README.txt
$ git commit -m "Add maintainer info"
```

**If you do not `git add` first, you cannot commit!**

If you think of `git` as taking snapshots of changes over the life of a project, `git add` specifies what will go in a snapshot (putting things in the staging area), and `git commit` then actually takes the snapshot, and makes a permanent record of it (as a commit). If you don’t have anything staged when you type `git commit`, Git will prompt you to use `git commit -a` or `git commit --all`, which is kind of like gathering everyone to take a group photo!
However, it’s almost always better to explicitly add things to the staging area, because you might commit changes you forgot you made. (Going back to the group photo simile, you might get an extra with incomplete makeup walking on the stage for the picture because you used `-a`!) Try to stage things manually, or you might find yourself searching for “git undo commit” more than you would like!

![](https://i.imgur.com/MN0MmfS.png)

Let's do this again.

```bash=
$ vim README.txt
```

```
Project: Genome Assembly
Maintainer: Katharine Korunes
Date Started: 12 Aug 2021
```

We have changed this file, but we haven’t told `git` we will want to save those changes (which we do with `git add`) nor have we saved them (which we do with `git commit`). So let’s do that now. It is good practice to always review our changes before saving them. We do this using `git diff`. This shows us the differences between the current state of the file and the most recently saved version:

```bash=
$ git diff
```

We can see the newly inserted line listed in green.

Let's commit this change.

```bash=
$ git add README.txt
$ git commit -m "Add project start date"
```

Use `git log` to see how your log of changes has grown.

```bash=
$ git log
```

If you're dealing with a project with a long log, use the `-N` option (where `N` is a number) to show only the `N` most recent commits.

```bash=
$ git log -1
```

As written, this will show only the single most recent commit.

Let's make a new directory.

```bash=
$ mkdir scripts
$ cd scripts/
```

Let's use `touch` to create an empty file without opening it (which differentiates `touch` from commands that start a text editor, like `vim`).

```bash=
$ touch markAdapters.sh
$ touch renameReadGroups.sh
```

To add all of the files in a directory to the staging area at once, we can use `git add` on the directory itself.

```bash=
$ cd ..
$ git add scripts
```

`git` now tracks all of the changes inside of the `scripts` directory.
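
The add-then-commit cycle above can be tried safely in a throwaway repository. This is a minimal sketch, not part of the lesson materials: the temporary directory, the demo identity, and the commit message are assumptions chosen for illustration, and it uses `git status --short` and `git rev-list --count` to summarize the result.

```bash
# Work in a throwaway directory so no real project is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

# Identity for this demo repository only (hypothetical values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create two empty scripts, then stage the whole directory at once.
mkdir scripts
touch scripts/markAdapters.sh scripts/renameReadGroups.sh
git add scripts

# List what is staged: one line per file, prefixed with "A" for "added".
staged=$(git status --short)
echo "$staged"

# Commit the staged snapshot, then count the commits in the history.
git commit -q -m "Add empty pipeline scripts"
n_commits=$(git rev-list --count HEAD)
echo "commits so far: $n_commits"
```

Running `git status` again afterwards should report a clean working tree, because everything that was staged is now part of the commit.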

**Technical Note:** `git` doesn't track empty directories, so people will sometimes add hidden files to a directory to allow `git` to track the directory until they put something inside of it.

To recap, when we want to add changes to our repository, we first need to add the changed files to the staging area (`git add`) and then commit the staged changes to the repository (`git commit`):

![](https://i.imgur.com/VoHkaKj.png)

### Break (Regroup @ 10:35a)

### Exploring History

Don't put raw data or output under version control. Output should be readily recreatable from your scripts. Raw data should undergo NO CHANGES.

The power of version control is that we can seamlessly revert our document back to any previously committed version.

As we saw earlier, we can refer to commits by their identifiers. You can refer to the most recent commit of the working directory by using the identifier `HEAD`.

Let's make another change to README.txt

```bash=
$ vim README.txt
```

```
Project: Genome Assembly
Maintainer: Katharine Korunes
Date Started: 12 Aug 2021
Contact Email: kkorunes@gmail.com
```

Using `git diff` with `HEAD~N`, we can show the differences between the version from `N` commits ago and what we have right now. For now, let's look 2 commits back.

```bash=
$ git diff HEAD~2 README.txt
```

Play with this command a bit to understand how to refer to specific commits.

You can also use the commit identifier to refer to commits that are not as recent (you may not want to count how many commits have happened between now and your previous version!).

```bash=
$ git diff <commit ID> README.txt
```

**Note:** Commit IDs can be found in `git log`.

Notice that the latest change has not been committed yet. We can put things back the way they were in the last commit by using `git checkout`:

```bash=
$ git checkout HEAD README.txt
```

To change back to a specific version, we can use the commit ID.
```bash=
$ git checkout <commit ID> README.txt
```

#### Recovering Older Versions of a File

Jennifer has made changes to the Python script that she has been working on for weeks, and the modifications she made this morning “broke” the script and it no longer runs. She has spent ~1 hr trying to fix it, with no luck…

Luckily, she has been keeping track of her project’s versions using `git`! Which commands below will let her recover the last committed version of her Python script called data_cruncher.py?

1. `$ git checkout HEAD`
2. `$ git checkout HEAD data_cruncher.py`
3. `$ git checkout HEAD~1 data_cruncher.py`
4. `$ git checkout <unique ID of last commit> data_cruncher.py`
5. **Both 2 and 4**

### Ignoring Things

What if we have files that we do not want `git` to track for us, like backup files created by our editor or intermediate files created during data analysis? Let’s create a few dummy files:

```bash=
$ mkdir results
$ cd results/
$ touch a.out
$ touch b.out
$ cd ..
$ touch a.dat
$ touch b.dat
$ nano .gitignore
```

In nano (comments go on their own lines, because `git` only treats `#` as a comment marker at the start of a line):

```
# Ignore any file that ends in .dat
*.dat
# Ignore anything in the results directory
results/
```

When we use `git status`, `git` will notice that `.gitignore` has been added, but it will not notice the addition of `a.out`, `b.out`, `a.dat`, and `b.dat`.

We are going to commit `.gitignore` because anyone we share this repository with will probably want to ignore the same files.

```bash=
$ git add .gitignore
$ git commit -m "Adding a .gitignore file"
```

If you try to `git add` something that is ignored by `.gitignore`, `git` will refuse and tell you that the path is ignored (you can override this with `git add -f`).

### Remotes in GitHub

Version control really comes into its own when we begin to collaborate with other people. We already have most of the machinery we need to do this; the only thing missing is to copy changes from one repository to another.

Systems like `git` allow us to move work between any two repositories.
In practice, though, it’s easiest to use one copy as a central hub, and to keep it on the web rather than on someone’s laptop. Most programmers use hosting services like GitHub, Bitbucket or GitLab to hold those main copies.

Log into GitHub and find the plus sign in the top right.

![](https://i.imgur.com/k5UG2qk.png)

Select "Create repository"

Note: Since this repository will be connected to a local repository, it needs to be empty. Leave “Initialize this repository with a README” unchecked, and keep “None” as options for both “Add .gitignore” and “Add a license.” See the “GitHub License and README files” exercise in the Software Carpentry lesson for a full explanation of why the repository needs to be empty.

![](https://i.imgur.com/1LhTQnU.png)

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:

![](https://i.imgur.com/K4saYNg.png)

Pick "SSH", copy the URL (it starts with `git@github.com:`), and run on the command line:

```bash=
$ git remote add origin <SSH URL>
```

`origin` is a local name used to refer to the remote repository. It could be called anything, but `origin` is a convention that is often used by default in git and GitHub, so it’s helpful to stick with this unless there’s a reason not to.

We can check that the command has worked by running `git remote -v`:

```bash=
$ git remote -v
```

#### SSH Background & Setup

Before we can connect to a remote repository, we need to set up a way for our computer to authenticate with GitHub so it knows it’s us trying to connect to our remote repository.

We are going to set up the method that is commonly used by many different services to authenticate access on the command line. This method is called Secure Shell Protocol (SSH). SSH is a cryptographic network protocol that allows secure communication between computers using an otherwise insecure network.

SSH uses what is called a key pair.
This is a pair of keys that work together to validate access. One key is publicly known and called the **public key**, and the other key, called the **private key**, is kept private. Very descriptive names.

You can think of the public key as a padlock, and only you have the key (the private key) to open it. You use the public key where you want a secure method of communication, such as your GitHub account. You give this padlock, or public key, to GitHub and say “lock the communications to my account with this so that only computers that have my private key can unlock communications and send git commands as my GitHub account.”

What we will do now is the minimum required to set up the SSH keys and add the public key to a GitHub account.

We will run the list command to check what key pairs already exist on your computer.

```bash=
$ ls -al ~/.ssh
```

Your output is going to look a little different depending on whether or not SSH has ever been set up on the computer you are using.

Let's create a key:

```bash=
$ ssh-keygen -t ed25519 -C "your_email@example.com"
```

If you are using a legacy system that doesn’t support the Ed25519 algorithm, use: `$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"`

You will see:

```
Generating public/private ed25519 key pair.
Enter file in which to save the key (/c/Users/DukeID/.ssh/id_ed25519):
```

We want to use the default file, so just press <kbd>Enter</kbd>.

```
Created directory '/c/Users/DukeID/.ssh'.
Enter passphrase (empty for no passphrase):
```

Now, it will prompt you for a passphrase. Be sure to use something memorable or save your passphrase somewhere, as **there is no “reset my password” option**.
```
Enter same passphrase again:
```

After entering the same passphrase a second time, we receive the confirmation:

```
Your identification has been saved in /c/Users/DukeID/.ssh/id_ed25519
Your public key has been saved in /c/Users/DukeID/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:SMSPIStNyA00KPxuYu94KpZgRAYjgt9g4BA4kFy3g1o your_email@example.com
The key's randomart image is:
+--[ED25519 256]--+
|^B== o.          |
|%*=.*.+          |
|+=.E =.+         |
| .=.+.o..        |
|.... . S         |
|.+ o             |
|+ =              |
|.o.o             |
|oo+.             |
+----[SHA256]-----+
```

The “identification” is actually the private key. You should never share it. The public key is appropriately named. The “key fingerprint” is a shorter version of a public key.

Now that we have generated the SSH keys, we will find the SSH files when we check.

```bash=
$ ls -al ~/.ssh
```

Now we run the command to check if GitHub can read our authentication.

```bash=
$ ssh -T git@github.com
```

Right, the connection is refused because we forgot that we need to give GitHub our public key!

First, we need to copy the public key. Be sure to include the .pub at the end, otherwise you’re looking at the private key.

```bash=
$ cat ~/.ssh/id_ed25519.pub
```

Now, going to GitHub.com, click on your profile icon in the top right corner to get the drop-down menu. Click “Settings,” then on the settings page, click “SSH and GPG keys” in the “Account settings” menu on the left side. Click the “New SSH key” button on the right side. Now, you can add the title, paste your SSH key into the field, and click “Add SSH key” to complete the setup.

Now that we’ve set that up, let’s check our authentication again from the command line.

```bash=
$ ssh -T git@github.com
```

```
Hi <User>! You've successfully authenticated, but GitHub does not provide shell access.
```

Now everything should be in the GitHub GUI. We can make changes in GitHub, and we understand what `git` is doing under the hood!

#### Push local changes to a remote

Now that authentication is set up, we can return to the remote.
This command will push the changes from our local repository to the repository on GitHub:

```bash=
$ git push origin main
```

Our local and remote repositories are now in this state:

![](https://i.imgur.com/xpGOUTd.png)

We can pull changes from the remote repository to the local one as well:

```bash=
$ git pull origin main
```

Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, though, this command would download them to our local repository.

**Key Points:**
* A local Git repository can be connected to one or more remote repositories.
* Use the SSH protocol to connect to remote repositories.
* `git push` copies changes from a local repository to a remote repository.
* `git pull` copies changes from a remote repository to a local repository.

### Conflicts

As soon as people can work in parallel, they’ll likely step on each other’s toes. This will even happen with a single person: if we are working on a piece of software on both our laptop and a server in the lab, we could make different changes to each copy. Version control helps us manage these conflicts by giving us tools to resolve overlapping changes.

Let's say our README file on GitHub says:

```
Project: Genome Assembly
Maintainer: Katharine Korunes
Date Started: 12 Aug 2021
Contact Email: kkorunes@gmail.com
```

And the version that you have committed locally and are attempting to push says:

```
Project: Genome Assembly
Maintainer: Katharine Korunes
Date Started: 12 Aug 2021
```

When you try to push this file from your local copy to the copy on GitHub, the push will be rejected. To resolve this, you will need to pull first.

```bash=
$ git push origin main
```

Git rejects the push because it detects that the remote repository has new updates that have not been incorporated into the local branch. What we have to do is pull the changes from GitHub, merge them into the copy we’re currently working in, and then push that.

![](https://i.imgur.com/RPc3V9u.png)

Let’s start by pulling:

```bash=
$ git pull origin main
```

The `git pull` command updates the local repository to include those changes already included in the remote repository. After the changes from the remote branch have been fetched, Git detects that changes made to the local copy overlap with those made to the remote repository, and therefore refuses to merge the two versions to stop us from trampling on our previous work. The conflict is marked in the affected file:

```bash=
$ cat README.txt
```

```
Project: Genome Assembly
Maintainer: Katharine Korunes
Date Started: 12 Aug 2021
<<<<<<< HEAD
Contact Email: kkorunes@gmail.com
=======
This line added to remote copy
>>>>>>> dabb4c8c450e8475aee9b14b4383acc99f42af1d
```

Our change is preceded by `<<<<<<< HEAD`. Git has then inserted `=======` as a separator between the conflicting changes and marked the end of the content downloaded from GitHub with `>>>>>>>`. (The string of letters and digits after that marker identifies the commit we’ve just downloaded.)

It is now up to us to edit this file to remove these markers and reconcile the changes. We can do anything we want: keep the change made in the local repository, keep the change made in the remote repository, write something new to replace both, or get rid of the change entirely. Once the file looks the way we want, we stage it with `git add`, commit the merge with `git commit`, and push again.

In general, it is good practice to `git pull` before starting work so that you know you're working from the most current form of a file.

### Open Science

Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:

* A scientist collects some data and stores it on a machine that is occasionally backed up by her department.
* She then writes or modifies a few small programs (which also reside on her machine) to analyze that data.
* Once she has some results, she writes them up and submits her paper. She might include her data – a growing number of journals require this – but she probably doesn’t include her code.
* Time passes.
* The journal sends her reviews written anonymously by a handful of other people in her field. She revises her paper to satisfy them, during which time she might also modify the scripts she wrote earlier, and resubmits.
* More time passes.
* The paper is eventually published. It might include a link to an online copy of her data, but the paper itself will be behind a paywall: only people who have personal or institutional access will be able to read it.

For a growing number of scientists, though, the process looks like this:

* The data that the scientist collects is stored in an open access repository like figshare or Zenodo, possibly as soon as it’s collected, and given its own Digital Object Identifier (DOI). Or the data was already published and is stored in Dryad.
* The scientist creates a new repository on GitHub to hold her work.
* As she does her analysis, she pushes changes to her scripts (and possibly some output files) to that repository. She also uses the repository for her paper; that repository is then the hub for collaboration with her colleagues.
* When she’s happy with the state of her paper, she posts a version to arXiv or some other preprint server to invite feedback from peers.
* Based on that feedback, she may post several revisions before finally submitting her paper to a journal.
* The published paper includes links to her preprint and to her code and data repositories, which makes it much easier for other scientists to use her work as a starting point for their own research.

This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. You can find more on the different aspects of Open Science in this book.
This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:

* The conceptual stages of your work are documented, including who did what and when. Every step is stamped with an identifier (the commit ID) that is for most intents and purposes unique.
* You can tie documentation of rationale, ideas, and other intellectual work directly to the changes that spring from them.
* You can refer to what you used in your research to obtain your computational results in a way that is unique and recoverable.
* With a version control system such as Git, the entire history of the repository is easy to archive in perpetuity.

As more and more of the journals you will publish in require reproducible code, you WILL have to do this work. You can make your life easier by starting now.
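
One way to make results "unique and recoverable", as described in the list above, is to write the commit ID of the code next to the output it produced. The following is only a sketch under stated assumptions: the throwaway repository, the demo identity, and the file names `run_assembly.sh` and `results/assembly.log` are hypothetical stand-ins, and `git rev-parse HEAD` is used to print the ID of the current commit.

```bash
# Throwaway repository so the sketch is self-contained.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q
git config user.name "Demo User"          # hypothetical identity
git config user.email "demo@example.com"  # hypothetical identity

# A stand-in analysis script (hypothetical name and content).
echo 'echo "assembly finished"' > run_assembly.sh
git add run_assembly.sh
git commit -q -m "Add assembly script"

# Ask git for the full ID of the commit that is currently checked out.
commit_id=$(git rev-parse HEAD)

# Run the analysis and record the code version alongside its output.
# The results directory is deliberately left out of version control.
mkdir results
bash run_assembly.sh > results/assembly.log
echo "code version: $commit_id" >> results/assembly.log
cat results/assembly.log
```

With the ID saved in the log, `git checkout <commit ID> run_assembly.sh` can later bring back the exact version of the script that produced that output.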