UPGG_InformaticsBootcamp2023_Aug23-24

--- tags: course notes --- # UPGG Informatics Orientation Bootcamp 2023 ## Wednesday Morning **Intro from Hilmar** - Welcome to the 5th iteration of this workshop! - Keep in mind that [The Carpentries](https://carpentries.org/volunteer/) offers lots of opportunities to get involved. - Adhere to the Code of Conduct, and please report code of contact violations. - Use the sticky notes to indicate if you're stuck (red/pink sticky) or ready to move on (green sticky.) - All these lessons are very hand-on with live programming. We encourage you to actually type along to help get the best learning experience. - Everyone is encouraged to pair up-- we're all coming in with different experiences and levels of familiarity, let's use this to help each other. - Please edit this document as we go! ### Introduction to the Unix command line #### Getting started - **Instructor: Odmaa** First, download the file `shell_data.zip` from the course website, place it in your `home` directory. Note that the `home` directory can have different names depending on your machine and username (e.g. /Users/dan). In this lesson, we'll **pretend** to be rotating in different labs that want you to do certain tasks that are relevant to command line usage. #### 1. Loney Lab You often to access big data files that are usually on the huge server (i.e. DCC - Duke Computer Clusters). These clusters are powerful for carrying out extensive analyses and have big data storage. Furthermore, many bioinformatics tools can only be used through a command line interface. Automation will make your life easier and your work less error-prone. This is important as reproducibility is crucial in scientific research. Boot up your `Terminal` for Mac users or `Git Bash` for Window users (instead of command prompt). Note: `Terminal`'s look can be changed in `Preferences`!! ##### Introducing the Shell Basic commands in Shell that are very useful. - `PS1=$` sets the terminal prompt to `$` - `pwd` prints out the path to the current directory - `ls` lists all files and directories inside your current directory - `cd <destination>` navigates to the destination directory - `man <command>` opens up the manual page of the command - **Tab completion**: press `Tab` to complete a command or keyword automatically. This is helpful for avoiding typos. For `ls`, you can start adding **flags** on top of the command to modify the behavior of `ls`. You can run `man ls` to see the manual page and all the flags you can set for `ls`. For example, `ls -F` would add a trailing slash at the end of the items listed. **Note:** `Git bash` and few other Unix installations may not come with `man` pages by default. A work around is to type `man <command>` in a search engine, and the first entry is almost always the exact page of what you would have gotten from doing it in a `Terminal`. **Exercise 1** Use the `-l` option for `ls` to display more information for each item in the directory. What is one piece of additional information this long format gives you that you don't see with the bare `ls` command. Permission, owner of the file, created date etc. These are information that you might not otherwise get from just the user interface. --- Run `clear` to clean up the previous commands and outputs on your `Terminal`. ##### Shortcut: Tab Completion Sometimes directory names can get unruly--especially in a collaborative directory--you might not want to type out the full names yourself. If you run `cd`, it automatically brings you to your `home` directory. If you type `cd she<tab>`, `Terminal` can complete the full name of the directory (`cd shell_data`) or file for you, as long as the first few letters are sufficient for it to find a unique name to complete. If there are 2 files that are named similarly, for example `directory1` and `directory2`, if you use tab completion after typing `dir`, it will complete up to the point that it cannot anymore, i.e. `directory`, and you will have to fill the rest. Alternatively, if you press `tab` twice, it will show you all the possible options it could complete, and you can choose from them. #### 2. Wallace Lab Dr. Wallce wants you to organize raw data files. #### `Home` vs `Root` `Root` is the overarching directory that contains all files and directories on your machine. But you are not going to work inside this directory much, but you will spend most time inside the `home` directory. If the machine or cluster is shared by multiple users, you might not even have access to the `root` directory. And everybody might have their own `home` directory that they have access to. Different shorthands for `cd` destination. - `/` will bring you to the `root` directory - `~` will bring you to the `home` directory - `..` will bring you to the parent directory - `.` will bring you to the current directory - `.hidden_filename`: hidden files and directories have names begin with `.` - Using `ls -a` to reveal hidden files and directories You can chain these shorthands together as well. For example, `cd ../../` will bring you **2 levels** up the parent directories. **Note:** `ls <directory>` lists out all files and directories under said directory, BUT it doesn't bring you to that directory. You are still in the same working directory where you ran the command. **Exercise 2** What flag do you use for `ls` to reveal hidden files and directories? **Why hidden files/directories?** - Some directories and files are necessary for the system to function!! But you might not want the users to edit these files directly, so they are hidden! **Exercise 3** Navigate to `home` directory, list contents of the `untrimmed_fastq` directory. ##### Full vs relative paths Intuition: a full path is like telling someone the **full address** (or better yet, GPS coordinates) of the destination, a relative path is like telling someone a **direction** to the destination FROM your current location. When you run `pwd`, your Terminal will prints out the **full path** to your current directory. To distinguish between full and relative paths, you can look at how the path begins. Full paths always begin with a `/` which denotes the `root` directory. In a way, a full path is equivalent to a "relative path" of the same directory **from the `root` directory**. ##### Navigational shortcuts `ls ~` will list out the content of your `home` directory. Is there any difference between `cd` and `cd ~`? No. #### 3. Nakano Lab Dr. Nakano wants you to have a closer look into the reads sent from the sequencing core. We'll be working with the `fastq` file format. ##### Wildcards - `*` matches any zero or more character - `?` matches any ONE character - `[]` matches one character given in the bracket ##### Command history - `Ctrl + C` cancels the command - `Ctrl + R` reverse-searches command history - `!<number>` copies and pastes that command into your current prompt - `Ctrl + L` clears your screen - Same as `clear` command ##### Examnining files - `cat` prints the content of the file to screen (stdout) - `less` opens the file in a read-only view without printing it out - `<Space>` to go forward - `<b>`to go backward - `head` prints the first few lines of the file - `tail` prints the last few lines of the file ##### Fastq format 1. Begins with `@` and information about the read 2. Actual DNA sequence 3. Begins with `+` and info 4. String of characters representing quality score (PHRED score) PHRED score is log-scaled. One digit (number or letter) representing quality of each nucleotide. ##### Creating, moving, copying, and removing - `touch` generates an empty file - `mkdir` makes a directory - `cp` copies a file or directory - `mv` moves a file or directory - Can also be used to rename file or directory - `rm` removes a file - `rm -r` removes a directory (recursive) **Use with caution** - `ls -l` prints the list of files and directories in long format - `chmod [+-] rwx` adds or remove reading/writing/executing permission - Stands for "change modes" --**Break**-- Plays around with different wildcards. ```bash ls *.fastq # list all files that end with .fastq (all fastq files) ls *7.fastq # list all files that end with 7.fastq ls SRR* # list all direcotires/files that start with SRR ls *9* # list all directories/files that contain 9 ls [a-z].fastq # list all files that are named 1 letter.fastq ls [0-9].fastq # list all files that are named 1 number.fastq ls *[0-9].fastq # list all files that end with 1 number.fastq ls *.fast[aq] # list all files that end with fasta or fastq ls *.fast? # list all files that end with fast + 1 character ``` ##### Exercises 1. List all files that start with letter 'c' ```bash ls c* ``` 2. List all files that contain letter 'a' ```bash ls *a* ``` 3. List all files that end with letter 'o' ```bash ls *o ``` 4. Bonus: list all files that contain letter 'a' or letter 'c' ``` bash ls *[ac]* ``` **Note:** wildcards are interpreted by Shell, not the `ls` program. That means, if you run `ls D*` and that wildcard matches with a directory (e.g. `Downloads`), `ls` will be applied to that directory and list eveyrthing under that directory. If you want to see how Shell interprets the wildcard, you can use `echo D*` and see what it expands to. ```bash echo Hello World # print Hello World ``` Command history is useful if you want to trace back what command you ran in the past. ```bash history # print command history ``` Each command is assigned to a number. Using `!<number>` will **re-run the command** directly. ##### Examining files ```bash cat SRR097977.fastq # print the contents of the file SRR097977.fastq to standard out ``` ##### Exercises 1. Print out the contents of the file. What is the last line of the file? ```bash cat SRR097977.fastq tail -n 1 SRR097977.fastq ## prints the last line of the file ``` 2. From `home` directory, use one short command to print the contents of all the files in the `untrimmed_fastq` directory. ```bash cat ~/shell_data/untrimmed_fastq/* ``` `cat` is amazing, but when the file is HUGE, it is annoying to use. `less` is useful for examining large files. ```bash less SRR097977.fastq # print the contents of the file SRR097977.fastq to a reader ``` Inside the `less` program, you can use `/` (forward slash) to look up a certain pattern. Press `/` again to move to the next hit, press `?` to move to the previous hit. If you want to look for **special characters** that `less` might interpret as commands, you have to use an **escape character** which is the backslash (`\`). For example, looking for the "?" literally, type `/\?` inside `less`. `head` and `tail` are useful programs for quickly looking at the top and bottom content of a file. ```bash head SRR097977.fastq # print the beginning of the file SRR097977.fastq tail -n 1 SRR097977.fastq # print the last 1 line of the file SRR097977.fastq ``` If you run `head -n4 SRR097977.fastq`, you get the first four lines of the fastq file. Fastq files are for storing DNA sequence data (2nd line), but they also have information on quality (on the 4th line). This is called the [PHRED quality score](https://en.wikipedia.org/wiki/Phred_quality_score). ##### Make, copy, rename directories and files To make a new directory, run `mkdir <directory_name>`. The directory will be created under your working directory. ```bash mkdir directory_name ``` Use `cp <to_be_copied_file> <destination>` to copy the file to the destination. Similarly, use `mv <to_be_moved_file> <destination>` to move the file to the destination. `mv` can also be used to rename a file or directory. For example, `mv filename.fastq new_filename.fastq` will change the name of the file. ##### File permission Remember, `ls -l` lists files in a long format which contains information about permission. We can adjust the permissions using `chmod` command. To remove the write permission to everyone: ```bash chmod -w SRR098026.fastq # remove writing permission to SRR098026.fastq ``` ##### Exercise Starting in the shell_data/untrimmed_fastq/ directory, do the following: 1. Make sure that you have deleted your backup directory and all files it contains. 2. Create a backup of each of your FASTQ files using cp. (Note: You’ll need to do this individually for each of the two FASTQ files. We haven’t learned yet how to do this with a wildcard.) 3. Use a wildcard to move all of your backup files to a new backup directory. 4. Change the permissions on all of your backup files to be write-protected. ```bash rm -r backup cp SRR098026.fastq SRR098026-backup.fastq cp SRR097977.fastq SRR097977-backup.fastq mkdir backup mv *-backup.fastq backup chmod -w backup/*-backup.fastq ``` It’s always a good idea to check your work with `ls -l` backup. #### 4. Belato Lab Dr. Belato has 100s of TBs of data for you to analyze, but there's limited server memory to work with. ##### Grep, redirection, and pipe - `grep string file`: search for exact match of string within the file - Be aware of what you are searching for. E.g. if you want to find all sequence lines in a fastq file that contain AAAA, keep in mind that quality score lines can also contain AAAA - `-B`: argument for printing a number of lines before each matching line - `-A`: argument for printing a number of lines after each matching line - `>`: redirect the output that would normally be printed to stdout to a file - If the file already exists, `>` will overwrite it - `>>`: append the output to a file - `wc -l`: word count (number of lines) - `|`: the pipe command treats the output of the command before as the input of the command that comes after ##### For loops and basename *For Loops* - Variable: a variable has a name and contains some information - Use `name=value` to define the variable - Use `$name` to call the variable - Use `${name}string` to distinguish variable name from other strings immediately following the variable name - a string is a bunch of text - For Loop: executes some code on each item in a list: ```bash for <variable> in <group to iterate over> do <some command> $<variable> done ``` *Basename* - `basename`: removes the suffix from the string Say, we want to see which fastq files have bad segments containing 10 consecutive unknown nucleotides (Ns). We can use `grep NNNNNNNNNN SRR098026.fastq` and it will print all lines that have the matched pattern within that line. But this might not be exactly useful without the additional information that the fastq format provides. This line of code will print an additional line **BEFORE** and 2 additional lines **AFTER** each matched line from `grep`. ```bash grep -B1 -A2 NNNNNNNNNN SRR098026.fastq ``` Now, you can identify which sequence it is that matches the pattern, as well as the quality of the sequence. ##### Exercise 1. Search for the sequence `GNATNACCACTTCC` in the `SRR098026.fastq` file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. ```bash grep -B1 GNATNACCACTTCC SRR098026.fastq ``` 2. Search for the sequence AAGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. ```bash grep -B1 AAGTT *.fastq ``` ##### Redirecting outputs Using `>` to redirect the output into a file. ```bash grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt ``` If you want to check how many bad reads there are: ```bash wc -l bad_reads.txt ``` ##### Exercise How many sequences in SRR098026.fastq contain at least 3 consecutive Ns? ```bash grep NNN SRR098026.fastq > bad_reads.txt wc -l bad_reads.txt ``` Or using a **pipe** (`|`), redirect an output to be an input of the other command. ```bash grep NNN SRR098026.fastq | wc -l ``` If you use `>` and the file already exists, the output will rewrite everything in that file. If you just want to append the new output to that preexisting file, you have to use `>>` instead. ##### Pipe If you don't want to write a new file every time you want to chain output of one command to the other, you can use **pipe** (`|`). ##### For loops Let's understand variables first. ```shell foo=abc # set a variable with the name foo. The variable foo has the value abc echo foo # this prints foo echo $foo # this prints abc foo=atg # set the value in the variable foo to atg echo $foo # this prints atg echo $fooATG # this prints nothing, because the computer doesn't recognize a variable named fooATG echo ${foo}CGC # this prints ATGCGC, because the {} separates the variable name foo from the string CGC ``` Using `{}` like `${variable}` is a good practice when you call a variable. You can use `;` to string together commands that you would have otherwise written in different lines. You can redirect outputs to a file as well (`>>`). `basename <string> <suffix>` removes the suffix from the name. ##### Exercise Print the file prefix of all of the `.txt` files in our current directory. ```bash $ for filename in *.txt > do > name=$(basename ${filename} .txt) > echo ${name} > done ``` #### 5. Scott Lab Dr. Scott gives you code scripts for analyses, where are you taking the scripts? - Many options are available. They differ in their controls and user interfaces - nano - Can directly access from command line - vim - Can directly access from command line - Visual Studio Code - Need to download from website - Atom - Need to download from website - etc. *Writing a Script* - Create the shell script and open it ```shell nano bad-read-script.sh ``` - Write this in your shell script ```shell grep -B1 -A2 NNNNNNNNNN *.fastq > scripted_bad_reads.txt ``` - Save the shell script - Exit the shell script - Run the shell script If you run the code above without specifying where the files are, it will run within your working directory. ```shell bash bad-read-script.sh ``` - Make the shell script an executable ```shell chmod +x bad-read-script.sh # give yourself executing permission to the shell script ./bad-read-script.sh # execute the shell script. If the file is not in your working directory, you must add the path to that file ``` *Downloading Data* - `wget`: ‘world wide web get’; download web pages or data at web address - `Curl -O`: ‘see URL’; display webpages or data at the web address #### 6. RP Lab RP lab wants to collaborate with another lab at Brown and is sharing some codes with them. How do you make your files accessible to them? *Project Organization* - Save a copy of your raw data - Remove write permission of the raw data - Utilize compartments - Doc, data, code, and result directories etc. - Keep records of your codes - `history | tail -n 7 > history.txt` - Or better yet, run your commands in shell script files - Comment your script - use `#` to indicate a line of comments ## Wednesday Afternoon ### Introduction to programming in Python using the Jupyter Notebook #### Getting started - **Instructor: RP** #### Setup - Get materials: Course website > Syllabus > Programming in Python > Download Materials (right click, save link as, to directly manage where to save it without having to move it after downloading) - Launch Jupyter Notebook: Anaconda Navigator (as part of the setup for bootcamp, the course website outlines how to download Anaconda Navigator. You can launch it from your computer's applications menu and can usually search for it in your computer's search bar) > Launch "Jupyter Notebook" (this will launch terminal and browser windows, go to your browser window) > navigate the folder where your materials are > top right corner "New" button, select Python3 > this is your Python script! #### Trying out running code - In your Python script, code is organized into cells, and each cell can be run on its own. Type your code `5+3` in your cell, then run it (top menu Run button, or command+return on Mac, or ctrl+enter on Windows), to see output `8`. Below are some other examples - Arithmetics - `5-3` - `5*2` - `10/5` - Print - `print(5+3)` - `print("Hello World")` - *Strings (aka text like "Hello World") need to be enclosed in quotes, otherwise Python thinks you are trying to call variables that happen to be named Hello and World (variables may carry values. For example, you can create a variable called Hello and assign the number 3 to it, and then create a variable called World and assign the number 5 to it), and Python will throw an error because we've never created variables named Hello and World* - To create another cell, top menu + button #### Types - Some common types include string, integer (whole number), float (number with decimal point) - `type('Hello World')` - `type(1)` - `type(20.5)` - `type(5/3)` #### Variables - You can use variables to store values, like strings or numbers - When printing multiple variables, Python seems to automatically separate them with a space - You can do arithmetic with variables - Variables are saved within the whole Python script, not just the cell you define them in - Python is able to process different types of variables, like arithmetics that involve both floats and integers - You can nestle commands within each other, e.g. print(type(variable_name)) - You can concatenate strings together with +. In this case, you need to + a ' ' if you want spaces in between those concatenated strings ```python= greeting = 'Hola, Mundo' name = 'Derek' print(greeting, name) ``` ```python= age = 20 print('My daughter is', age, 'years old.') print('Next year, she will be', age + 1, 'years old.') ``` ```python= lb = 50.0 kg = lb * 2.2 print(kg) ``` ```python= my_float = 23.5 my_integer = 10 print(my_float + my_integer) print(type(my_float + my_integer)) ``` ```python= my_string = 'Hello' my_string_2 = 'World' print(my_string + ' ' + my_string_2) ``` - Naming variables - You cannot start variable name with number - You cannot have space or dash in your variable name - Names are case sensitive - In Python, the convention is to put _ between parts of a name, e.g. my_float, my_string_2 (Other coding languages have other conventions like camelCase, e.g. myFloat, myString2) - You cannot give a variable the same name as a Python variable/function, e.g. `float = 20.0`. This will break Python because you will have redefined what the core type "float" means! - Variables can be defined together, but Python does not link the 2 variables' values together, so you can give one of the variables a new value, and it does not affect the other variables it was defined together with ```python= x = y = 'pencil' print(x, y) x = 5 print(x, y) ``` ```python= a = 3 b = a+1 a = 'word' print(a, b) ``` #### Indexing - Indexing allows you to subset/index elements in a slice (a slice is a collection of elements. For example, you can think of a word as a slice of letters) - Python starts counting at 0 (not 1)! - Indexing a string gives you its letters ```python= name = 'Python' print(name[0], name[4]) ``` - If you try to index an element that does not exist, you will get an "index out of range" error ```python= name = 'Python' print(name[7]) ``` - In Python, indexing a negative number is not "index out of range", but counting backwards ```python= name = 'Python' print(name[-1]) ``` - Index a range using `start_index:end_index`. The start is "closed" (included), and the end is "open" (not included). You can omit the start and/or end indices if you want to start at the start of the slice or end at the end of the slice ```python= print(name[0:3]) print(name[:4]) print(name[3:]) print(name[:]) ``` - You can index with a pattern. For example, `print(name[1:5:2])` means print elements from the variable "name", print from index 1 to index 5, but only print every 2 indices, aka every other letter, aka print indices 1,3,5 #### Check existence - Use `in` to check if an element is in a slice. You will get an answer, either true/false, and this answer (true/false) is a value of the type boolean ```python= name = 'Python' 'a' in name 'Py' in name ``` - You can also check for "not in a string" ```python= 'pt' not in name ``` #### Working with a DNA Sequence - Get your original sequence here! ```python= seq = 'ACCTGCATGC' ``` - Practice manipulating sequence elements ```python= b1 = seq[0] b2 = seq[1] b3 = seq[2] print(b1, b2, b3) seq[0:3] r3 = b3 + b2 + b1 print(r3) ``` - Mutating vs Reassigning a String - You can call elements within the string, but you cannnot mutate the elements within it. You can only reassign. ```python= r3[2] = 'G' ``` This is not allowed ```python= old_r3 = r3 r3 = b3 + b2 + 'G' print(old_r3, r3) ``` This is allowed - Goal: Reverse Complement - Break down the problem into parts - Make use of for loop ```python= for s in seq: print (s) ``` ```python= print('Before') for b in seq: print('Loop:', b) print('After') ``` ```python= count = 0 for s in seq: count = count + 1 print(count) ``` - Exercise 1: Reverse ```python= rev = '' for s in seq: rev = s + rev print(rev) ``` - Dictionaries - A type in Python - Each element in a dictionary is a key-value pair - In Python, you can mix types in dictionaries - Elements in a dictionary are not ordered. Index them not with an index number but with the key - You can mutate the value of attached to any key (unlike how strings cannot but mutated) - Keys must be unique. Values don't have to be unique - When ranging through a dictionary, you are ranging through the key, and you can use the key to access the value indirectly ```python= my_dict = {'key1':'value1', 'key2':2} print(my_dict) my_dict['key1'] ``` ```python= counts = {'A':2, 'C':4, 'G':2, 'T':2} counts['A'] = 3 print(counts['A']) base = 'C' print(counts[base]) ``` ```python= for c in counts: print(c) ``` ```python= for b in counts: qty = counts[b] print(b, qty) ``` - Exercise 2: Counting quantity ```python= counts = {'A':0, 'C':0, 'T':0, 'G':0} for letter in seq: counts[letter] +=1 print(counts) ``` ```python= #get counts of each base in a sequence, alternative method #example: when you don't have prior knowledge of all of the bases that you will encounter #or amino acids #or whatever you're looping through! count=0 counts = {} for letter in seq: try: counts[letter]=counts[letter]+1 except: key_val={letter:(count+1)} counts.update(key_val) #this is the same thing as doing: count=0 counts = {} for letter in seq: if letter in counts: counts[letter]=counts[letter]+1 else: key_val={letter:(count+1)} counts.update(key_val) print(counts) ``` - Reverse Complement - You use the `rev` variable from before ```python= comps = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'} revcomp = '' for s in rev: c = comps[s] revcomp = revcomp + c print(seq, rev, revcomp) ``` ```python= comps = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'} revcomp = " " for base in rev: revcomp += comps[base] print(revcomp) ``` #### Lists - In contrast to a dictionary, a list is an ordered collection of elements. Lists use the [] square brackets, and elments are separated by commas. ```python= seqs = [seq, rev, revcomp] print(seqs, type(seqs)) ``` - Just like with strings, we can use a for loop to iterate through each element in the list and perform commands. Let's use a for loop to print each element in the list. ```python= for s in seqs: print(s) ``` - Lists, unlike strings, are mutable. So we can add or remove to a list, reorder it, or swap items out. ```python= seqs[0] = 'AAAAAGGGGG' print(seqs) ``` - Both dictionaries and lists let you delete things with the del keyword - again because they're mutable. ```python= del seqs[0] print(seqs) ``` #### Conditionals - If/else - The output of conditionals is booleans, aka true/false - Once a condition is satisfied (if/else), can execute some code ```python= if 'GC' in seq: print('Seq contains GC') else: print('Seq does not contain GC') print('done') ``` - Practice calculating the GC% in sequence - AND vs OR (mathematical AND, mathematical OR. Imagine a Venn diagram. AND is the space where 2 circles intersect. OR is the entire area covered by both 2 circles) - = vs == (= is to assign value to variable. == is to evaluate whether the left-hand-side and the right-hand-side are equal) - other comparisons/evaluations are also possible, such as <= and >= - To the best of your ability, make your code succint and avoid typing the same code over and over ```python= gc = 0 atgc = 0 for s in seq: if s == 'G' or s == 'C': gc = gc + 1 atgc = atgc + 1 # outside the ifs # Outside the loop percent = (gc * 100) / atgc print('GC-content percentage:', percent, '%') ``` - If/elif/else - elif stands for else-if - Note that 1 and only 1 condition out of all the if/elif/else will be satisfied. The conditions should be mutually exclusive. This implies that the order of checking matters. If "if" is satisfied, "elif" or "else" will not be checked, and those associated code will not be executed ```python= percent = 20 if percent < 35: print('Low') else: if percent < 56: print('Normal') else: print('High') ``` #### Functions - Python has built-in functions ```python= print(len(seq)) ``` - You can get help on how to use built-in functions ```python= help(sorted) ``` - Exercise 3: Using functions ```python= bases = 'adenine cytosine guanine thymine' base_list = bases.upper().split() base_list.reverse() print(base_list) # Bonus for base in base_list: print(base[0]) ``` - You can also write your own functions - Define function - function name - function inputs/arguments - any help/usage messages - a good practice is to define any needed variables within the function, so that that variable is a "local variable" (exists within the function) rather than a "global variable" (exists throughout the script) - code to execute - return any outputs - Use function - Once you define your function, you can use it later ```python= def double(x): "Takes a value and doubles it" result = x * 2 return result doubled_value = double(100) print(doubled_value) help(double) ``` ```python= def reverse(seq): "Returns the reverse of a sequence" rev = '' for s in seq: rev = s + rev return rev def complement(seq): "Switches all A and T characters. Switches all G and C characters" comps = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'} comp = '' for s in seq: c = comps[s] comp = comp + c return comp def reverse_complement(seq): "Reverses the complement of a sequence" return reverse(complement(seq)) reverse(reverse(seq)) == seq complement('ACC') reverse_complement('CCAA') ``` - Exercise 4: Update the reverse function - Update the reverse() function to use splicing instead of a loop. With splicing, this can be done by making the step argument -1. Recall that the format for splicing is `[start:end:step]`. ```python= def reverse(seq): "Returns the reverse of a sequence" rev = '' for s in seq: rev = s + rev return rev def reverse(seq): "Returns the reverse of a sequence" return seq[::-1] reverse_complement("CCAA") ``` #### Working with Files - Will be covered tomorrow! ## Thursday morning ### Data exploration using Jupyter Notebook #### Instructor: RP #### Getting started Download materials: - Jupyter notebook: https://github.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/blob/gh-pages/notebooks/Data_exploration.ipynb - Files: https://raw.githubusercontent.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages/data/gapminderDataFiveYear_superDirty.txt https://raw.githubusercontent.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/gh-pages/data/PRB_data.txt Launch jupyter notebook and open the Data_exploration_student notebook. #### Overview Today we will learn how to work with tabular data and how to clean it up. #### Setting up the notebook You can write markdown in jupyter notebook cells to display as text and annotate your analysis. ##### Libraries We need a library to use functions not available in base python. If a function is like a recipe a library is like a recipe book. - pandas In particular the pandas library implements "DataFrames", a data structure for tabular data. ````python= import pandas as pd ```` - matplotlib Is a library to plot graphics. pyplot is a sublibrary within matplotlib, useful to make plots contained within the notebook. - numpy Useful to work with numbers. Introduces arrays. Importing all libraries we'll use today: ```python= import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline ``` #### Getting data into the notebook ##### What's a DataFrame? Are matrix-like, but with the difference that each column in a DataFrame can contain different data types (e.g. integers, strings). Index is zero based. You can also think about DataFrames as stacked lists. Read table using the pandas read_table function: ```python= gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t") ``` The table is separated by tabs ("\t"). Inspect the table: ````python= gapminder.head() ```` Prints out first five rows of the gapminder DataFrame, including column names. ````python= gapminder.tail() ```` Prints out last five rows. You can change the number of lines. ````python= gapminder.tail(10) ```` What if we have more than five columns? Print the names of all columns: ````python= gapminder.columns.values ```` There is no parenthesis after .columns.values. This is because these are not built in functions, but rather "attributes", which are properties of the DataFrame. #### Assess the structure and cleanliness ##### How many rows and columns are in the data? Shape of data: ````python= gapminder.shape ```` The shape is the dimensions (number of rows, number of columns). Summary of data types stored in the DataFrame: ````python= gapminder.info() ```` The class is the type of object (a DataFrame from the pandas library). RangeIndex: the range of rows (0-based) Non-Null Count: The number of not empty cells in the DataFrame. *Question:* why does printing info vs info() show different results? - info without parentesis accesses the "info" attribute (data about the object), while info() accesses the info method (a function is called on the DataFrame to produce the summary of data types). The describe method: ````python= gapminder.describe() ```` Computes summary statistics for each column. Only those columns that are numeric. You can also specify which columns you want to describe. ````python= gapminder['year'] ```` Select more than one column: ````python= gapminder[['year','region']] ```` You need to use an extra set of square brackets to contain the full list of columns that you want to call. ````python= gapminder[['pop','life','gdpPercap']].describe() ```` Extract a specific summary statistic: ````python= print(gapminder['pop'].describe().min()) print(gapminder['pop'].describe().max()) print(gapminder['pop'].describe().mean()) print(gapminder['pop'].describe().sd()) ```` You can also summarize categorical columns. Maybe you want to find the unique regions that are present in the DataFrame: ````python= pd.unique(gapminder['region']) ```` Also type it as: ````python= gapminder['region'].unique() ```` How many unique regions are in the data? ````python= print(len(gapminder['region'].unique())) ```` How many times does each unique region occur? ````python= gapminder['region'].value_counts() ```` ##### Exercises What are number of unique categories of another categorical column? ````python= years=gapminder['year'].unique() print(len(years)) ```` What is the outcome when you run value_counts()? ````python= gapminder['year'].value_counts() ```` #### Data cleaning ##### Referencing vs copying objects Before starting modifying your dataset, you want to make a copy. Make a copy of data before processing it: ````python= gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t") gapminder_copy = gapminder.copy() gapminder_copy.head() ```` ##### Handling missing data Methods to handle missing data The first approach is to just drop rows with missing values ````python= print(gapminder_copy.head()) # Compare before and after dropping na values gapminder_copy = gapminder_copy.dropna() print(gapminder_copy.head()) ```` Compare shape before and after dropping na ````python= print(gapminder.shape()) print(gapminder_copy.shape()) ```` ##### Changing Data Types Sometimes the way values are encoded don't make sense. We'll change the type of year from float to int (years should be integers). ````python= gapminder_copy['year'] = gapminder_copy['year'].astype(int) gapminder_copy['pop'] = gapminder_copy['pop'].astype(int) gapminder_copy.info() ```` ##### Handling (Unwanted) Repetitive Data We can use the duplicated() method that finds values of which row is duplicated (True or False) ````python= gapminder_copy.duplicated.head() ```` We can also drop those duplicated values ````python= gapminder_copy = gapminder_copy.drop_duplicates() gapminder_copy.head() ```` Now the indexes don't make sense because we dropped some rows (i.e. they go 0,1,5 etc). We need to re-set the indexes (to 0,1,2, ...). ````python= gapminder_copy = gapminder_copy.reset_index(drop=True) gapminder_copy.head() ```` ##### Handling inconsistent data The idea is to make data in columns as consistent as possible. One common issue is different number of spaces in strings representing the same word. Think "space" vs " space " would not be the same string. ````python= gapminder_copy['region'] = gapminder_copy['region'].str.lstrip() # Strip white space on left gapminder_copy['region'] = gapminder_copy['region'].str.rstrip() # Strip white space on right gapminder_copy['region'] = gapminder_copy['region'].str.lower() # Convert to lowercase gapminder_copy['region'].value_counts() # How many times does each unique region occur? ```` After doing this, some of the previous inconsistencies are removed. For example now most countries have at most 12 data points, one for each year of data collection. If you try the one-liner: ````python= gapminder_copy['region'] = gapminder_copy['region'].str.lstrip().str.rstrip().strlower() ```` ##### regex + replace() - "*" matches preceding character 0 or more times - "+" one or more times - . matches any character once - x|y matches either x or y Replace some regions that we know are the same country, but are not the same string yet. ````python= temp = gapminder_copy['region'].replace(".*congo.*", "africa_dem rep congo", regex=True) temp.value_counts() ```` This gives the issue that congo now has 24 entries, not 12. That's because there are two countries with the word congo. So we need to be more specific: The strings referring to the democratic republic of congo contain the "congo,dem" string, while the other country doesn't. ````python= gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True) gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True) gapminder_copy['region'].value_counts() # Now it's fixed. ```` The inplace=True option will directly replace the values in the DataFrame without needing to assign it again. ##### Excercise (regex): Start by looking at regions that have the word canada ````python= gapminder_copy[gapminder_copy["region"].str.contains('canada')] ```` ````python= gapminder_copy['region'].replace(".*ivore*", "africa_cote d'ivoire ", regex=True, inplace=True) gapminder_copy['region'].replace("canada", "americas_canada", regex=True, inplace=True) gapminder_copy['region'].value_counts() ```` ##### Tidy data The region column includes both country and continent. This might be an issue if we want to analyze country and continent separately. We want to split into to columns. ````python= gapminder_copy['country']=gapminder_copy['region'].str.split('_', 1).str[1] gapminder_copy['continent']=gapminder_copy['region'].str.split('_', 1).str[0] gapminder_copy.head() ```` The result of the split is a list of two words. We access the country and continent separately using indexes str[1] for country, str[0] for continent. ###### Removing and renaming columns Now that we split the region column into country and continent columns, let's remove the region column: ````python= gapminder_copy = gapminder_copy.drop('region', 1) #1 stands for column gapminder_copy.head() ```` To make column names more uniform, we'll change names to lower case. ````python= gapminder_copy.columns = gapminder_copy.columns.str.lower() gapminder_copy.head() ```` We also want to remove unnecessary white space in column names: ````python= gapminder_copy = gapminder_copy.rename(columns={'life exp' : 'lifeexp'}) gapminder_copy.head() ```` We use a dictionary as syntax to access the previous name of the column as key, and the new name as value: {'previous':'new'}. ##### Merging data If we have our dataset split into more than one table or file, we want to merge them. First, read the table with new data: ````python= PRB = pd.read_table("PRB_data.txt", sep = "\t") PRB.head() ```` Before merging two DataFrames by rows, we need to make sure that the columns are actually the same and that the names match. Here we are cleaning the dataset in the same way as the previous batch before merging: ````python= PRB['country']=PRB['region'].str.split('_', 1).str[1].str.lower() PRB['continent']=PRB['region'].str.split('_', 1).str[0].str.lower() PRB = PRB.drop('region', 1) PRB.columns = PRB.columns.str.lower() PRB = PRB.rename(columns={'life exp' : 'lifeexp'}) PRB.head() ```` Now we'll merge with the pd.concat function: ````python= gapminder_comb = pd.concat([gapminder_copy, PRB]) gapminder_comb.tail(15) ```` concat will only work if the column names match. The index doesn't re-set automatically, so let's re-set them. ##### Exercise: fix the index ````python= gapminder_comb = gapminder_comb.reset_index(drop=True) gapminder_comb.head() ```` ##### Subsetting and sorting We can also use slices with DataFrames. Select first 15 rows: ````python= gapminder_copy[0:15] ```` Select last 15 rows: ````python= gapminder_copy[-10:] ```` ###### Sorting ````python= gapminder_copy.sort_values(by = 'year', ascending = False) ```` Change the ascending argument to change the direction of sort (e.g. lower to higher is ascending = True, higher to lower is ascending = False). Summarize data: Here you take the mean or other summary stats as before, but the groupby method indicates groupings to get e.g. the mean by continent. ````python= continent_mean_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').mean() continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'}) continent_row_ct = gapminder_comb[['continent', 'country']].groupby(by='continent').count() continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'}) continent_median_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').median() continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'}) gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1) gapminder_summs = gapminder_summs.rename(columns = {'y':'year'}) gapminder_summs ```` ##### Visualization with matplotlib Code to run a histogram: ```python plt.hist(gapminder_copy['lifeexp']) plt.xlabel('lifeexp') plt.ylabel('count') ``` hist only plots one variable. Boxplot: ```python sns.boxplot(x='year', y='lifeexp', data = gapminder_copy) plt.xlabel('year') plt.ylabel('lifeexp') ``` To plot a boxplot, you need two variables. Note slight differences in syntax between seaborn and matplotlib (calling columns directly, vs referring to them by name). Scatterplot: ```python plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp']) plt.xlabel('gdppercap') plt.ylabel('lifeexp') ``` You can save your plot into file with plt.savefig: ```python plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp']) plt.xlabel('gdppercap') plt.ylabel('lifeexp') plt.savefig("my_figure.png") ``` ### Defensive Programming and Automation #### Instructor: Brandon #### Getting started Download data: https://github.com/Duke-GCB/scicomp-python/releases/download/2023-08-24/defense_automation.zip #### Defensive Programming ##### Style guides These are standard guides for how to format your code. This makes it easier to read. ###### Naming conventions Different styles to name your variables: snake_case CamelCase mixedCase UPPERCASE_WITH_UNDERSCORES In python: - Variables: snake_case - Function: snake_case - Errors: CamelCase ###### Exercise: Using naming conventions Answer below - ```python def velocity(total_distance, time): "This calculates the distance over time" velocity_result = total_distance / time return(velocity_result) velocity(20, 2) ``` In addition to following style guidelines, make sure that variable names are meaningful. ###### Dangers in variable naming Avoid naming variables with function names. For example: what happens when we name a variable "sum": ```python sum_of_two_numbers = sum([5, 4]) print(sum_of_two_numbers) sum = 10 + 5 print(sum) print(sum([5,4])) ``` This will throw an error. You'll need to re-start your python kernel. Hit re-start python kernel. ##### Comments Commenting your code will help people follow how it works. Example: ````python def my_function(x): # This is a comment. # print(x + x) # The print function above will not run due to the '#' print(x) ```` ##### Docstrings When you write a function, you can add a special comment at the beginning that will show up when looking for the help of the function. ````python def my_function(x): """This function returns the argument you gave it. """ # This is a comment. # print(x + x) # The print function above will not run due to the '#' print(x) my_function(1) ```` There are many different formats for structuring docstrings. You can choose a specific format, but it's best to be consistent (e.g. using the same format for all the functions in your project). ##### Exercise: Importance of naming and documentation ```python= def FUNCTION(number, words): Smallest = 0 for dictionary in number: LETTER = (dictionary / words) * 100 if LETTER > Smallest: Smallest = LETTER return Smallest ``` What are the 2 inputs? number: probably a list of numbers words: another number of sorts What does it return? The biggest percentage of an element of `number` / `words`. ##### Exercise: Refactoring a Function The above function can be refactored with more meaningful variable names. ```python= def max_percentage(list_num, num): """Returns maximum percentage of numbers in list_num, after dividing by num. Arguments * num = A random integer * list_num - a list of numbers that you want to find the maximum of """ to_return = 0 for i in list_num: percentage = (i / num) * 100 if percentage > to_return: to_return = percentage return to_return ``` ##### Don't Repeat Yourself - DRY Don't copy-paste individual chunks of code over and over. - If you need to update the code, you'd need to change every single chunk. - It doesn't scale. What if you have 1000s of inputs? The following code repeats itself for each protein dataset: ````python protein_data1 = ["CREG1", "ELK1", "SF1", "GATA1", "GATA3", "CREB1"] protein_data2 = ["ATF1", "GATA1", "STAT3", "P53", "CREG1"] protein_data3 = ["RELA", "MYC", "SF1", "CREG1", "GATA3", "ELK1"] proteins_of_interest = ["ELK1", "MITF", "KAL1", "CREG1"] # Are there any matches in the first list? match_list1 = [] for protein in protein_data1: if protein in proteins_of_interest: match_list1.append(protein) # Are there any matches in the second? match_list2 = [] for protein in protein_data2: if protein in proteins_of_interest: match_list2.append(protein) # Are there any matches in the third? match_list3 = [] for protein in protein_data3: if protein in proteins_of_interest: match_list3.append(protein) ```` ##### Exercise: Refactor the analysis into a function RP vvv ```python= def match_protein(data, interest): """This function returns a list of proteins within `data` that match the proteins within `interest` Args: data (list): A list contains protein names interest (list): A list contains proteins of interest """ answer = [] for protein in data: if protein in interest: answer.append(protein) return answer ``` ##### A Helpful Way to Format Strings How to make our own error messages? ```python my_int = 5 message = "This number does not work." message + my_int ``` This will fail because you can't concatenate strings and integers. However, there is a way to format strings to include numeric variables with f strings: ```python= ## my_int can be anything (str, int, float, etc) my_f_string = f"This number does not work: {my_int}. The number is lower than 10." ``` refer to to your variable with curly braces ##### Error Handling in Python How can we handle errors that arise in code? Python has different types of errors (NameError, ZeroDivisionError, etc.) e Use raise keyword and NameError function to indicate errors in your code: ```python raise NameError("My custom error message") ``` The traceback is a report that python generates on what happened that gave rise to an error. Your custom message will be printed in the traceback. Define the reverse_complement function: ```python def reverse_complement(dna_sequence): """Reverses the complement of a dna sequence""" complements = {"T":"A", "A":"T", "C":"G", "G":"C"} reverse = dna_sequence[::-1] result = "" for letter in reverse: result = result + complements[letter] return result help(reverse_complement) print(reverse_complement("CAAT")) ``` Remember that in programming, letters are case-sensitive. A lower-case letter and its associated upper-case are two totally different character to computers. ```python reverse_complement("CACGtgcatggTGAAA") ``` Will return an error. ##### Try and Except Keywords ```python try: print(hello) except NameError: print("You ran into a NameError") raise ## Will stop the program from running further print("This is run after the try-except block!") ## No longer runs if an error is raised ``` The `try-except` block works as follows: 1. The code under `try` is run. 2. If no exception occurs after running all of the code under `try`, the code under `except` is skipped. 3. If an exception occurs, *and* it matches the one in the `except` line, then any remaining code under `try` is skipped and the code under `except` is now run. 4. If an exception occurs, but it *doesn't* match the exception we gave it, this exception is **unhandled** and the program stops. Note that, even though we **caught** the `NameError`, the program actually keeps going. This is because the `try-except` block only **controls the flow** of the program by specifying what code should be run in the case an error is encountered; it doesn't `raise` an exception by itself. We can visualize this by adding a `print()` statement after the whole `try-except` block. ##### Printing Error Messages When we print something in Jupyter or in the command line, by default the text goes to a destination called **standard output**, or `stdout`. Simultaneously, errors and warnings go to a destination called **standard error**, or `stderr`. It is dangerous to print error messages in `stdout` because some workflows utilize it for data and the error messages can get mixed in (eg, when using pipes `|` on the command line). We can set a destination for out `print()` command using the `file` argument. The default destination is `sys.stdout`. To send to `sys.stderr`, we will need to import the `sys` module. **A brief digression on modules** * `sys` is from the **standard library**, a collection of modules included in Python but not available by default. It deals with low-level tasks inside your computer, like interacting with keyboards, stdout and stderr, files, etc. * Libraries are incredibly useful: there are libraries for working with numeric and scientific data, generating plots, fetching data from the web, working with image and document files, databases, and more. * The full documentation for `sys` is part of the general Python documentation: https://docs.python.org/3/library/sys.html. ```python import sys print("Hello this is going to stdout", file = sys.stdout) print("Hello this is going to stderr", file = sys.stderr) ``` Back to our `reverse_complement()` function, let's add `try` and `except` to catch lower-case inputs. ```python def reverse_complement(dna_sequence): """Reverses the complement of a dna sequence""" complements = {"T":"A", "A":"T", "C":"G", "G":"C"} reverse = dna_sequence[::-1] result = "" # Add try - except - raise statements try: for letter in reverse: result = result + complements[letter] except KeyError: #print("All letters must be capitalized (A, C, G, or T)", file=sys.stderr) raise ValueError("Input needs to be a capital A, C, G, or T") return(result) print(reverse_complement("CAAg")) ``` It was a deliberate choice to catch for KeyError but raise a ValueError which indicates that there's something wrong with the value of the input, as opposed to something dictionary-specific. **Why bother using `try-except` when we can just use `if-else`?** We just demonstrated two different ways to handle potential errors: `if-else` blocks and `try-catch` blocks. Both of them protected against invalid input, so what's the difference? * In the `if-else` block, we checked the value of `letter` *before* trying to use it in the `complements` dictionary. In the `try-catch` block, we immediately called `complements[letter]`, which threw an error. * If you already know what your inputs should or shouldn't contain, it's *probably* a good idea to code in that information beforehand througn an `if-else`. * `try-catch` immediately runs code and stops when an error happens. This means the things done in `try` aren't reversed when you hit an error. * If you don't know *how exactly* something might break, but know it *can* happen, use a `try-catch`. For example, throw a `KeyError` or `ValueError` for keys that don't exist in your dictionary, but you can't guess what keys the user might try to use. * For some more discussion on this, see this StackOverflow forum post: https://stackoverflow.com/questions/7604636/better-to-try-something-and-catch-the-exception-or-test-if-its-possible-first. ##### Sanitizing Input Make a function that prints all k-mers of a given k from a sequence: ```python def kmers_from_sequence(dna_sequence, k): positions = len(dna_sequence) - k + 1 for i in range(positions): kmer = dna_sequence[i:i + k] print(kmer) ``` **Note**: If you replace `print(kmer)` with `return kmer`, the function will stop at the firt iteration of the loop and print just the first kmer. This is because `return` keyword ends the function on top of return a certain value. If you want the function to return all the kmers, you can append kmers into a list, and return the list **outside** the for loop. Let's refactor the following function to check the validity of value `k` as the input. - `k` has to be a positive number - `k` cannot be longer than the length of `dna_sequence` ```python def kmers_from_sequence(dna_sequence, k): if k < 1 or k > len(dna_sequence): raise ValueError("k needs to be at least 1 and no longer than the length of dna_sequence") positions = len(dna_sequence) - k + 1 for i in range(positions): kmer = dna_sequence[i:i + k] print(kmer) ``` ##### Separate code with line breaks Loading the data from this morning. ```python import pandas as pd gapminder = pd.read_table("gapminderDataFiveYear_superDirty.txt", sep = "\t") gapminder['region'] = gapminder['region'].astype(str) # Method 1 for formatting the 'region' column: gapminder['region'] = gapminder['region'].str.lstrip() # Strip white space on left gapminder['region'] = gapminder['region'].str.rstrip() # Strip white space on right gapminder['region'] = gapminder['region'].str.lower() # Convert to lowercase # Method 2 for formatting the 'region' column: gapminder['region'] = gapminder['region'].str.lstrip().str.rstrip().str.lower() # Strip white space on left and right, and convert to lowercase print(gapminder['region']) ``` If a line is too long to intuititvely visualize your code, you can separate functions into different lines inside parentheses. Make sure you have indentation to set boundary of what is inside the parentheses. ```python gapminder_copy['region'] = ( gapminder['region'] .str.lstrip() .str.rstrip() .str.lower() ) print(gapminder_copy['region']) ``` ##### Outlinging your code - Pseudocode ```python percent = 20 if percent < 39: print('Low') elif percent < 47: print('Normal') else: print('High') ``` This code may work, but the second comparison will relies on the fact that the first comparison isn't true to be checked at all. If you swap the order of the first and second comparisons, this code will break down. We can reorganize it using a new logic into: ``` Answer to above prompt: 1. Check if the percent is between 0 and 38: if so, then label as "Low". 2. Then, check if the percent is between 39 and 47: if so, then label as "Normal". 3. Then, check if the percent is between 48 and 100: if so, then label as "High". 4. (new condition!) If it's any other integer, raise an error. ``` which is more robust to check the range. This is called a **pseudocode**!! It will serve as a guide to how we will implement the logic into a code. The implementation would look like this: ```python percent = 20 if 0 <= percent <= 38: print('Low') elif 39 <= percent <= 47: print('Normal') elif 48 <= percent <= 100: print('High') else: raise ValueError("percent value must be between 0 and 100") ``` Python can combine two comparisons that would otherwise have required 2 statements in other languages, e.g. `0 <= percent <= 38` checks if `percent >= 0 and percent <= 38` #### Working with files ##### Open and close ```python f = open('ae.fa') for line in f: print(line) f.close() ``` Let's go through the steps we just did: 1. We used the `open()` function on a string that represents a path to a file. * The result of that function was saved to the variable `f`. This value is called a *file object*. 2. We wrote a `for` loop. When you write a `for` loop for a file object, each time a loop happens, one line in the file is read and stored in the loop variable `line`. 3. We printed the loop variable `line` for each loop. 4. We used the `close()` function to close the file. ##### Reading lines Notice the blank line in between each line from the file. Text files and programs indicate that there is a new line using a special newline character, `\n`. In our previous example, each line in the file includes the `\n` newline character at the end. Let's visualize this by adding each line to a single string, `all_lines`, and compare printing the string versus the raw data. ```python # Visualizing newline characters f = open('ae.fa') all_lines = '' # Initializes a new variable to store all the strings for line in f: print(line) all_lines = all_lines + line # Adds each successive line to the variable f.close() all_lines # Prints the variable ``` We can strip off the newline character with `str.strip()` method. ```python string_with_newline = "There is a newline at the end\n" string_with_newline string_with_newline = string_with_newline.strip() string_with_newline ``` **Note** not using `print()`, just calling the variable will show the `\n` character. `print()` is able to interpret the newline character. Try this with the original code. ```python f = open('ae.fa') for line in f: line = line.strip() print(line) f.close() ``` ##### Exercise: processing a fasta file Write a function called `read_fasta` that takes a filename and returns a `list` of all the DNA sequences in the file. - Each element in the list should correspond to one FASTA entry. - Make sure to remove the header rows and only include the DNA sequences. - Test your code with the `ls_orchid.fasta` file. Some hints: - Look at `ls_orchid.fasta` before coding to know how your input data is formatted. - There are multiple FASTA sequences in this file - you'll need to know when to stop reading the previous sequence and start reading the next one. - Remember to strip the newlines off each FASTA sequence. - It's very helpful to write pseudocode first, to go through the logic of your function before writing it. ```python def read_fasta(filename): sequences = [] temp = '' f = open(filename) for line in f: if line[0] in ['A','T','C','G','N']: temp = temp + line.strip() elif line[0] == '\n': sequences.append(temp) temp = '' f.close() return sequences ``` #### Scripts Now that you have your function (or even a program!), you might want to be able to run these functions straight without having to boot up a jupyter notebook every time. Let's add these codes into a file that you can run outside of the notebook!! ```python %%writefile cool_functions.py def read_fasta(filename): sequences = [] temp = '' f = open(filename) for line in f: if line[0] in ['A','T','C','G','N']: temp = temp + line.strip() elif line[0] == '\n': sequences.append(temp) temp = '' f.close() return sequences print(read_fasta('ls_orchid.fasta')) ``` Notice that the first line contains a `%%` operator followed by the command `writefile` and a file name. This operator is specific to Jupyter notebooks, called a "Cell Magic Command", and copies the code written in a cell into a file. Our script reads the `ls_orchid.fa` file (and only the `ls_orchid.fa` file) every time it's run, an example of **hard-coding** an argument. Most programs let you specify the arguments you want to use, as opposed to being inflexible and pre-determined every time you run it. Let's change our script again to achieve this goal. In Python, our program can get these arguments, but we have to load the `sys` module. Remember from earlier when we imported `sys` to call `sys.stderr`? Well, `sys` can also grab arguments from the command line. Let's change our `read_fasta` function slightly to allow it to read in *any* file by using the `sys` module: ```python %%writefile cool_functions.py import sys def read_fasta(filename): sequences = [] temp = '' f = open(filename) for line in f: if line[0] in ['A','T','C','G','N']: temp = temp + line.strip() elif line[0] == '\n': sequences.append(temp) temp = '' f.close() return sequences print(read_fasta(sys.argv[1])) ``` But what happens if we don't have an input file name? According to the `sys` documentation, `sys.argv` returns a list where the first item `sys.argv[0]` is the name of the script by default, and any additional items in the list are the command line arguments. If no arguments were passed, `sys.argv` should be a list with exactly one element, the script name. Let's add some more code to our script to check whether the user passed in a filename: ```python %%writefile cool_functions.py import sys def read_fasta(filename): sequences = [] temp = '' f = open(filename) for line in f: if line[0] in ['A','T','C','G','N']: temp = temp + line.strip() elif line[0] == '\n': sequences.append(temp) temp = '' f.close() return sequences if len(sys.argv) < 2: print('You must give a <sequence.fasta> file to this program!') sys.exit(1) print(read_fasta(sys.argv[1])) ``` Here, we use yet another function of `sys`, `sys.exit`, which lets us stop a script or program before the whole script is used. This is a little bit different from an exception. It's not critical, you just didn't run it correctly. #### Importing your own scripts ```python import cool_functions ``` Why is this giving an error when we import it into our Jupyter notebook? This is because we're using `sys.argv` to read arguments from the command line, but we're *not* on the command line right now, so there are no arguments/files to read in. We're attempting to do method #1 (import as a module within another Python file) with code that's meant for method #2 (run as a standalone script on the command line). So, how do we separate the `print` statement containing `sys.argv` such that it's only run when the script is run from the command line? Answer: Add this `if` statement before the `print` statement: `if __name__ == "__main__":`. ```python %%writefile cool_functions.py import sys def read_fasta(filename): sequences = [] temp = '' f = open(filename) for line in f: if line[0] in ['A','T','C','G','N']: temp = temp + line.strip() elif line[0] == '\n': sequences.append(temp) temp = '' f.close() return sequences if len(sys.argv) < 2: print('Usage:', sys.argv[0], '<sequence.fa>') sys.exit(1) if __name__ == "__main__": print(read_fasta(sys.argv[1])) ``` ##### Automation: combining Python and Bash We can also run this on the command line!! In Bash, you the syntax for a for loop is: ```bash for filename in *.fa* do echo $filename done ``` And to call your python program using command line: ```bash python cool_functions.py ls_orchid.fasta ``` Together, we can run our python program in a loop: ```bash for filename in *.fa* do python cool_functions.py $filename done ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.