# **DIBSI Intro Python Workshop**

Welcome! We will use this document to collaboratively take notes and share resources (in a more permanent way than Slack). After the workshop, this document will go on GitHub so that you can all access it for posterity. If you know markdown, feel free to use it, but you can also just type.

Link to this document: https://goo.gl/eaJVuY

# Reference:

## Built-in functions

You can always run these from anywhere within Python.

**range(lower=0, upper, step=1)** - Makes a sequence of numbers from the lower bound (first argument, default 0) up to but not including the upper bound (second argument), counting by the step (third argument, default 1). Example: `range(4)` -> `0, 1, 2, 3`. `range(4, 9, 2)` -> `4, 6, 8`

**len(sequence)** - Takes a sequence (like a list or string) and returns the length, as an integer. Example: `len("hello")` -> `5`. `len([1,2,3])` -> `3`.

**print()** - Print takes as many arguments as you want to give it and prints them all to the screen.

**help(function_name)** - Provides more information about a function.

**type(variable)** - Returns the type of `variable`. Example: `type(5)` -> `int`.

**isinstance(variable, type)** - Returns True if `variable` is of type `type`. Otherwise returns False. Example: `isinstance(2, int)` -> `True`.

**sum(sequence)** - Adds together all of the values in the sequence. All values must be numeric. Example: `sum([1,2,3,4])` -> `10`.

**open(path, mode)** - Returns a file object, a pointer to the location of the first line of the file. Example: `open('Desktop/game_theory.csv', 'r')`

**next(file)** - Returns the current line of the file, from the current pointer up to the `\n` character, and sets the file object to point to the next line (the text after the `\n`).
Example:

`next(game_theory_file)` -> `'time,num_coop,num_defect\n'`

`next(game_theory_file)` -> `'0,4838,1562\n'`

## Pandas functions

You can only run these after importing the `pandas` library at the top of your file by writing `import pandas`.

**pandas.read_csv(path)** - Reads a CSV file and returns it as a dataframe.

**pandas.DataFrame(data)** - Makes a new dataframe based on the data in `data`. We talked about making `data` a dictionary, where the keys are strings that become the column names in the dataframe and the values are lists containing the data that goes in each column. Alternatively, we could pass in a list of lists, but then pandas wouldn't know how to label the columns. Example: `pandas.DataFrame({"A":[1,2,3], "B":[4,5,6]})` returns a dataframe with the columns A and B.

**pandas.concat(list_of_dataframes)** - Takes a list of dataframe objects, glues them together into a single dataframe, and returns that dataframe. By default, it assumes that each dataframe contains additional rows that should be added to the bottom of the other dataframes (i.e. it uses axis 0). If you want to stitch the dataframes together side-by-side, so each additional dataframe adds more columns, you can call it with the keyword argument `axis=1`. Example:

`data_1 = pandas.DataFrame({"A":[1,2,3], "B":[4,5,6]})`

`data_2 = pandas.DataFrame({"C":[7,8,9], "D":[10,11,12]})`

`data_frame_list = [data_1, data_2]`

`pandas.concat(data_frame_list, axis=1)`

This will return a dataframe with the columns A, B, C, and D.

### Pandas dataframe methods

You can call these with the syntax `dataframe_name.method()`, where `dataframe_name` is a Pandas dataframe object. (`iloc` and `loc` are used with square brackets rather than parentheses, and the row comes first, then the column.)

**iloc[row_num, col_num]**

**loc[row_label, col_label]**

**max()**

**min()**

**median()**

**mean()**

**to_csv()**

**reset_index()**

## Casting

Converting variables of one type to variables of a different type.

**int(thing)** - Tries to convert `thing` to an integer. Will throw an error if it isn't possible. Example: `int("5")` -> `5`.
**float(thing)** - Tries to convert `thing` to a float. Will throw an error if it isn't possible. Example: `float("5.4")` -> `5.4`.

**str(thing)** - Converts `thing` to a string. This should be possible for just about any variable. Example: `str(5)` -> `"5"`.

**bool(thing)** - Converts `thing` to a bool (True/False). This should be possible for just about any variable. Example: `bool(1)` -> `True`

**list(thing)** - Converts `thing` to a list. `thing` must be some sort of sequence. Example: `list("abcd")` -> `["a", "b", "c", "d"]`

## String methods

You call these with the syntax `string_name.method()`. They are all operations that are specific to strings.

**strip()** - Removes characters from the ends of a string. By default, removes whitespace (spaces, tabs, and newlines). Example: `" hi there ".strip()` -> `"hi there"`. You can also provide a string as an argument. If you do so, all the characters in that string will be stripped from the ends of the main string. Example: `"sea otters".strip("sea")` -> `" otter"` (note that the trailing "s" gets stripped too).

**split()** - Converts a string to a list by breaking it into chunks. By default, it breaks at whitespace. Example: `"hello there".split()` -> `["hello", "there"]`. You can also provide it an argument, telling it what to split on. Example: `"here,are,some,words".split(",")` -> `["here", "are", "some", "words"]`.

**startswith(otherstring)** - Returns True if the string this method is called on starts with `otherstring`. Returns False if it doesn't. Example: `"hello".startswith("he")` -> `True`. `"hello".startswith("e")` -> `False`

**endswith(otherstring)** - Returns True if the string this method is called on ends with `otherstring`. Returns False if it doesn't. Example: `"hello".endswith("lo")` -> `True`. `"hello".endswith("l")` -> `False`.

**isnumeric()** - Returns True if the string this method is called on contains only numeric characters (digits). Returns False if it doesn't. Example: `"412843".isnumeric()` -> `True`. `"test123".isnumeric()` -> `False`.
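As a rough sketch of how several of these string methods fit together, the snippet below cleans up one made-up line of text. The sample string and the variable names `raw_line`, `clean`, and `fields` are invented for illustration and are not from the workshop data.

```python
# Hypothetical sample line, not from the workshop files
raw_line = "  gene_a,12,0.75\n"

clean = raw_line.strip()      # remove surrounding whitespace, including "\n"
fields = clean.split(",")     # break the line at each comma

print(fields)                         # ['gene_a', '12', '0.75']
print(fields[0].startswith("gene"))   # True
print(fields[1].isnumeric())          # True
print(fields[2].isnumeric())          # False, because "." is not a digit
```

Note that `isnumeric()` says `False` for `"0.75"`, so it is only a reliable check for whole, non-negative numbers.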
## List methods

You call these with the syntax `list_name.method()`. They are all operations that are specific to lists.

**append(thing)** - Adds `thing` to the end of the list. Example:

`my_list = [1,2,3]`

`my_list.append(4)`

`print(my_list)` will print `[1, 2, 3, 4]`

**remove(thing)** - Removes the first occurrence of `thing`. Example:

`my_list = ["a", "a", "g", "t", "c"]`

`my_list.remove("a")`

`print(my_list)` will print `['a', 'g', 't', 'c']`

**pop(index)** - Removes whatever is at `list_name[index]` and returns it. Example:

`my_list = [2,7,19]`

`my_list.pop(1)`

`print(my_list)` will print `[2, 19]`

## File methods

You can call these with the syntax `opened_file.method()`.

**close()** - Closes an open file object so it can't be read or written anymore. Example: `data_file.close()`.

==================================================

# Resources:

Want more depth on a topic? Here are some good resources to check out.

### Python as a whole

* [**learnpython.org**](https://www.learnpython.org/) - This is a website with explanations of various Python concepts, accompanied by example code you can run in your browser. It also provides practice problems, which you can run in your browser, and it will tell you if you got the right answer.
* [**Codecademy Python Lessons**](https://www.codecademy.com/learn/python) - Similar to learnpython.org, but with a little more depth on each topic and more practice problems (I also like the interface more). You have to pay to get access to some parts, but the free version has a lot of great material.
* [**The Hitchhiker's Guide to Python**](http://python-guide-pt-br.readthedocs.io/en/latest/) - This is slightly more advanced. Once you're comfortable with the basic idea of writing Python code and are looking for more advice on how to continue improving, this is a great resource. It also illustrates a variety of things you can do with Python.
### Pandas

* [**Data analysis in Python with pandas video series**](https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y) - This is a really detailed video series stepping through all components of pandas. The format is similar to the style of the workshop (although with fewer practice problems). It gets into some pretty advanced stuff, but it starts with the basics and builds up.
* [**Pandas documentation quick start guide**](https://pandas.pydata.org/pandas-docs/stable/10min.html) - This is a nice refresher on how pandas works. Also, the rest of the site is (literally) the definitive guide on how to do things in pandas.
* [**Chris Fonnesbeck's Intro to Pandas notebooks**](https://github.com/fonnesbeck/statistical-analysis-python-tutorial) - Chris is another Software Carpentry instructor, and these are the materials that he uses to teach Pandas. They include a lot of examples, and some practice problems.

### Seaborn?

==================================================

# **Notes:**

# Monday

**Jupyter Notebook** can be used to run Python scripts. Jupyter's format is helpful for keeping track of which scripts produced which outputs. Essentially, it can be used as a high-tech lab notebook.

Note that in Jupyter Notebook, when you run a cell that *returns* something, the returned value pops up beneath the cell as a numbered `Out[x]`. Printing to the screen or terminal (e.g. with `print("hello")`) produces an output that is not numbered.

## Variable types

Variable type matters to Python; Python has built-in ways of dealing with certain types of variables.

### Strings

* A string variable is assigned using quotation marks, e.g. `my_string = "hello"`
* `my_string.upper()` returns the string in uppercase
* `my_string[0]` returns the first character of the string
* `my_string[0:3]` returns the first, second, and third characters of the string, but not the fourth character (which would be `my_string[3]`).
* `my_string[0:1]` simply returns the first character.

TypeErrors happen when Python can't perform an operation on a certain type of variable, e.g. `5 + "5"` will raise a TypeError.

### Lists

A list contains a sequence of things, e.g. `my_list = [0, "hello", 3, "last"]`. Any type can go in a list.

The `range()` function can be used to create lists of numbers, if you "cast" it as a list. For instance,

* `list(range(0, 7))` returns `[0, 1, 2, 3, 4, 5, 6]`
* You can also skip numbers: `list(range(0, 7, 2))` returns `[0, 2, 4, 6]`, and `list(range(0, 7, 3))` returns `[0, 3, 6]`

### How to see all defined variables

**In Jupyter notebook:** `%whos`

This will print out the names of all the variables that have been defined within the notebook, the type of each one (int, function, str), and a Data/Info column that describes what the variable contains.

**In Python interpreter:** `print(locals())`

In a Python interpreter, you can use this command to see local variables, but it isn't as nice as `%whos` in the Jupyter notebook.

## For loops

These are useful for repetitive tasks. Use the following format--note that the colon at the end of the `for` line and the indent on the lines after it are both important!

```
for letter in "hello":
    print("Letter is", letter)
```

This script will print the following to the screen:

```
Letter is h
Letter is e
Letter is l
Letter is l
Letter is o
```

The `range` function is useful for specifying the number of times a for loop should run.

```
for x in range(0, 3):
    print("hello")
```

will print "hello" once for every number in the range:

```
hello
hello
hello
```

So here are two ways to do the same thing:

```
for number in range(0, 6):
    print("helium"[number])

for letter in "helium":
    print(letter)
```

Both scripts print:

```
h
e
l
i
u
m
```

**Accumulator pattern**

We use this pattern to reuse the result of the previous iteration of a loop.
You need to do the following:

* Define a variable before the loop begins
* Use the variable within the loop
* Re-assign the variable within the loop

For instance, calculate 5! (i.e. 5\*4\*3\*2\*1):

```
current = 1
for number in range(1,6):
    current = number*current
print(current)
```

A less mathy example: turning a string into an uppercase string, with some fancy printing.

```
upper_string = ""
for letter in "helium":
    print("Current string: " + upper_string + " - Letter to be capitalized: " + letter)
    upper_string = upper_string + letter.upper()
print("Final string: "+upper_string)
```

Output:

```
Current string:  - Letter to be capitalized: h
Current string: H - Letter to be capitalized: e
Current string: HE - Letter to be capitalized: l
Current string: HEL - Letter to be capitalized: i
Current string: HELI - Letter to be capitalized: u
Current string: HELIU - Letter to be capitalized: m
Final string: HELIUM
```

## Conditionals

The following are examples of comparison operators. Expressions that use them evaluate to boolean values, i.e. they are either True or False:

* `==` equals; `!=` not equals
* `>` greater than; `<=` less than or equal to

Note that `==`, the comparison operator, is different from `=`, which is used to assign a value to a variable. Also useful are `in` and `.startswith()`.

Examples:

* `"this" == "that"` --> output: `False`
* `5 != 4` --> output: `True`
* `"a" in "that"` --> output: `True`
* `"that".startswith("tha")` --> output: `True`

One use for these operators is conditionals: `if`, `elif`, and `else`. Using these, you can specify that something happen only if something is True (or False)! These have formatting similar to `for` loops.

```
if 4 < 5:
    print("less than")
```

Output: `less than`.

You can string them together to make more complex decisions. When the condition in an `if` statement is not true, whatever is within an `else` statement after the `if` will be executed.
For instance, the following program will output `not less than`:

```
if 1 < 1:
    print("less than")
else:
    print("not less than")
```

Any string of conditionals requires an `if` at the beginning, and can have at most one `else` at the end. You can put as many `elif`s as you want between the `if` and the `else`. `elif` stands for "else if": its block runs only if every condition before it was False and the condition following the `elif` is True. For instance, to see if a number is negative, positive, or zero:

```
number = -1
if number > 0:
    print("Number is positive")
elif number == 0:
    print("Number is zero")
else:
    print("Number is negative")
```

# Tuesday

Putting together loops, conditionals, and so on to count the number of vowels in a string (in this case, 5):

```
vowels = "aeiou"
number = 0
for letter in "squeegee":
    if letter in vowels:
        number += 1
print(number)
```

To count the number of unique vowels (in this case, 2), just switch "squeegee" and "vowels" in the loop:

```
vowels = "aeiou"
number = 0
for letter in vowels:
    if letter in "squeegee":
        number += 1
print(number)
```

## Logical operators & `bool()`

Logical operators can be used to string together statements to make more complex decisions.

```
if 5 < 6 or 5 < 0:
    print("true")
```

will give `true`. Because we used `or`, only one of the two statements needs to be true (here, `5 < 6`), but both could also be true.

#### Types of logical operators:

* `or` - requires at least one of the conditions to be true
* `and` - requires both conditions be true
* `not` - negates a Boolean; you can also use `!=` to negate `==`

For instance, the value of this statement is `True`:

```
4==4 and 0 != 1
```

#### Casting data types to Booleans: `bool()`

Python interprets certain values of each variable type as true or false, so values of other types can be converted to Booleans.
Examples:

```
bool("Forest")
```

returns `True`.

One application of this is using `0` as a sentinel value in a function: `0` translates to `False`, while non-empty strings evaluate to `True`.

* Strings: `bool("")` returns `False`, casting any non-empty string returns `True`
* Integers: `bool(0)` returns `False`, casting any other integer returns `True`

## Lists

We touched on them on Monday. Let's talk about them more in-depth. A list is defined as follows:

```
my_list = [6, 2.3, "tree", "pony"]
```

Just like strings, you can index into a list to see what is in that element, and display a subset of the list:

`my_list[3]` returns `'pony'`

`my_list[0:3]` returns `[6, 2.3, 'tree']`

#### Differences between strings and lists

Lists can be modified by indexing, but strings cannot. The following code raises a TypeError:

```
my_string = "pony"
my_string[3] = "b"
```

But this is just fine:

```
my_list[3] = "horse"
print(my_list)
```

prints the new list: `[6, 2.3, 'tree', 'horse']`

Some helpful things you can do with lists:

* Append: `my_list.append(15)` will put the integer 15 at the end of a list
* Sort: `sorted(number_list)` will return a copy of a list of numbers sorted by size, or of a list of strings sorted alphabetically (`number_list.sort()` sorts the list in place instead)

You can also sum lists of numbers: `sum([1,1,5.5])` returns 7.5.

## Functions

To use a function, first you have to define it. Everything indented under the `def` line is part of the function. The function can take in inputs. A `return` ends the function and gives an output from the function (if an output is provided). The placement of `return` matters because as soon as the return statement is run, the function will be exited.

Here is code that counts the number of vowels in the input, `word`, then returns the output `count`.

```
def count_vowels(word):
    count = 0
    vowels = "aeiou"
    for letter in word:
        if letter in vowels:
            count += 1
    return count
```

This code, beginning with `def`, creates the function. However, it does not execute the function.
To execute, or **call**, the function, provide an input as follows:

```
count_vowels("helium")
```

This snippet executes the function with input `"helium"`, and returns the output `3`. In Jupyter Notebook, the returned value is shown beneath the cell as a numbered output.

### Docstrings: the "help" comment in a function

It is best practice to write a description of your function, its inputs, and its outputs. We have done this below between the `"""` marks:

```
def count_vowels(word):
    """
    This function counts the number of vowels in a word.
    Input: a string containing a word to count vowels in.
    Output: an integer representing the number of vowels in the word
    """
    count = 0
    vowels = "aeiou"
    for letter in word:
        if letter in vowels:
            count += 1
    return count
```

This type of comment is called a **docstring**. If you are confused about what a function does, just run

```
help(count_vowels)
```

which will show the function's docstring. The output will say that the function is in `module __main__:`, which indicates that the function was defined in your current session rather than imported from another module.

You can also use `#` to add comments--anything on the same line after the `#` will be "commented out," not looked at by Python. The `#` symbol is best for adding comments within your function, especially for longer functions. Only comments written at the beginning of the function within `'''` or `"""` marks will be displayed when you call `help()`.

### Uses of `return`

Using `return` allows you to use the output of the function somewhere else in your code. For example, you could run the function on the word `"helium"` and then define a variable using the output:

```
num_vowels = count_vowels("helium")
```

So `num_vowels` equals 3.
If you want to add up the vowels in two different words, you could run

```
total_vowels = count_vowels("helium") + count_vowels("squeegee")
```

The above code runs the function `count_vowels()` with input `helium`, runs `count_vowels()` with input `squeegee`, and then adds the two outputs together. The result is stored as the variable `total_vowels`.

If you wanted, you could also print the result of the function before the `return`--it just depends on what your goals are.

### Returns & `if` statements

You can have a function return a different value in different scenarios by using `if`, `elif`, and `else`.

```
def pos_neg_or_zero(number):
    if number > 0:
        return "Number is positive"
    elif number == 0:
        return "Number is zero"
    elif number < 0:
        return "Number is negative"
    else:
        return "Number is something else."

print(pos_neg_or_zero(-5))
```

Calling this function with input `-5` within the print will print the result `Number is negative`. It's good to have an `else` at the end of the series of elifs as a catch-all in case something unexpected happens.

### Sentinel values

Returns can also be used to provide helpful information to the user.

```
def biome_to_indicator(biome):
    """
    This function takes a biome and returns an indicator variable version.
    Input: a string containing the name of a biome
    Output: an integer corresponding to that biome
    """
    if biome == "forest":
        return 0
    elif biome == "desert":
        return 1
    elif biome == "taiga":
        return 2
    elif biome == "ocean":
        return 3
    else:
        return -1
```

The `return` within the `else` is set to `-1` to indicate that the input was not one of our pre-determined biomes. We don't expect -1 to be the output of this function for any valid biome, so it will work as what is known as a **sentinel value**.
Running this function with an input we expected:

```
biome_to_indicator("forest")
```

returns the output:

```
0
```

But running it with a different input:

```
biome_to_indicator("outer_space")
```

returns the sentinel value:

```
-1
```

Note that as written, this function will only work if the function is given the name in all lowercase. This code:

```
biome_to_indicator("Forest")
```

returns the sentinel value `-1` as well.

#### Be careful with sentinel values

We used `-1` instead of a string like `"NA"`, because if we later wanted to use this value, an error might occur if we had a string when an integer was expected. An example of this is below.

```
def biome_to_indicator_na(biome):
    """
    This function takes a biome and returns an indicator variable version.
    Input: a string containing the name of a biome
    Output: an integer corresponding to that biome, or "NA" if not recognized
    """
    biome = biome.lower()
    if biome == "forest":
        return 0
    elif biome == "desert":
        return 1
    elif biome == "taiga":
        return 2
    elif biome == "ocean":
        return 3
    else:
        return "NA"
```

```
biome_to_indicator_na("high desert")
```

will yield `'NA'`. However, if the output of this function needs to be manipulated as an integer, this will yield an error. For the sake of argument, let's pretend we want to sum the values of this function.

```
sum([2,3])
```

will yield `5`.

```
sum([biome_to_indicator_na("forest"), biome_to_indicator_na("boreal forest")])
```

will yield an error that an integer cannot be added to a string. The `"forest"` input gives the output `0`, while `"boreal forest"` is not recognized, so it yields `"NA"`. The `sum()` function is trying to add `0` and `"NA"`.

Instead, it may be more useful to make 0 the sentinel value--that way, if we cast the function's return to a Boolean, it will yield `False`, giving us a heads-up that our input was not understood.

```
def biome_to_indicator(biome):
    """
    This function takes a biome and returns an indicator variable version.
    Input: a string containing the name of a biome
    Output: an integer corresponding to that biome
    """
    biome = biome.lower()
    if biome == "forest":
        return 1
    elif biome == "desert":
        return 2
    elif biome == "taiga":
        return 3
    elif biome == "ocean":
        return 4
    else:
        return 0
```

Then run it on some bad input:

```
result = biome_to_indicator("high desert")
if not bool(result):
    print("biome not recognized")
```

This little script will print `biome not recognized`.

### Input-checking in functions

The first version of the biome function only accepted lowercase input. We could utilize logical operators, which are described above, e.g. we could start the cascade of elifs with:

```
if biome == "forest" or biome == "FOREST" or biome == "Forest":
```

However, this gets tiresome. Instead, we can make the input more uniform using the `.lower()` method:

```
def biome_to_indicator(biome):
    """
    This function takes a biome and returns an indicator variable version.
    Input: a string containing the name of a biome
    Output: an integer corresponding to that biome
    """
    biome = biome.lower()
    if biome == "forest":
        return 0
    elif biome == "desert":
        return 1
    elif biome == "taiga":
        return 2
    elif biome == "ocean":
        return 3
    else:
        return -1
```

The line `biome = biome.lower()` converts all input to lowercase. As long as the input data is spelled correctly, the capitalization won't matter because it is converted to lowercase before being evaluated by the if-else statements.

```
biome_to_indicator("OceaN")
```

will yield `3` because it is converted to `"ocean"` before the if-else statements.

### Function challenge: outer()

Write a function called `outer()` that takes a string as input and returns the first and last letters of that string. For example, `outer("oxygen")` will yield `on`.
```
# Write a function called outer() that takes a string as input and returns the first and last letters of the string
def outer(word):
    ''' This function takes input string and returns a string containing
    the first and last letter of the input'''
    return word[0] + word[-1]
```

### Function scope

What does the following piece of code display when run, and why?

```
f = 0
k = 0

def f2k(f):
    k = ((f-32)*(5.0/9.0)) + 273.15
    return k

f2k(32)
print(k)
```

Interestingly, `print(k)` prints `0`, whereas `f2k(32)` returns `273.15`. This is because the `k` assigned inside the function is a local variable that exists only within the function's scope. The `k = 0` defined outside the function is a different variable, and it is never modified by the function.

### Creating a sum function

The basic function just uses an accumulator pattern.

```
def sum_function(input_list):
    ''' A function that returns the sum of a list of numbers. '''
    total = 0
    for number in input_list:
        total += number
    return total
```

But what if you get some bad input? Let's check to make sure everything in the input list is a number using the `.isnumeric()` method. This method tells you whether a string contains only digits, or non-numeric characters as well. For instance,

* `"523242".isnumeric()` returns `True`
* `"5asfassafa23242".isnumeric()` returns `False`.

So, we will check if the string-cast version of every element in the list is numeric. If not, we print a warning, so that if the function raises an error, the user will know why. (Note that this check also flags floats and negative numbers, because `"."` and `"-"` are not digits; `isinstance(number, (int, float))` is a stricter alternative.)

```
def sum_function(input_list):
    ''' A function that returns the sum of a list of numbers.
    Returns error if input is not a list containing ints or floats '''
    total = 0
    for number in input_list:
        if not str(number).isnumeric():
            print("Warning: skipping non-numeric in list passed to sum_function:", number)
        total += number
    return total
```

### Using `continue` & `break`

We could, in fact, avoid errors by using `continue` or `break`. We can use these in loops in two different ways.

* `continue` skips the rest of the current pass through the loop and moves on to the next item.
* `break` stops the loop entirely.
To see these in action, consider these two pieces of code where we don't want to print the number 2:

```
for i in range(4):
    if i == 2:
        continue
    print(i)
```

or,

```
for i in range(4):
    if i == 2:
        break
    print(i)
```

The first skips the part of the loop that would print 2, but continues to the rest of the list:

```
0
1
3
```

The second stops going through the loop once it hits 2:

```
0
1
```

An implementation using `break` would look like this:

```
def sum_function(input_list):
    ''' A function that returns the sum of a list of numbers.
    Returns error if input is not a list containing ints or floats '''
    total = 0
    for number in input_list:
        if not str(number).isnumeric():
            print("Warning: skipping non-numeric in list passed to sum_function:", number)
            break
        total += number
    return total
```

When run with input `[1, "a", 2]`, this returns 1. If we put `continue` in this code instead of `break`, this would return 3.

## The `enumerate()` function

To provide the most thorough error-checking, it would be nice to tell the user which elements of the list were not numbers. To do this, we can use the `enumerate()` function to construct a loop. This function is a good way to loop through a list while keeping track of the position in the list.

To use `enumerate()`, format the loop as follows. The `for` line names two variables separated by a comma: first the position in the list (we'll call it `i`), then the element at that position (`element`). The list itself (`my_list`) is passed to `enumerate()`. The following code:

```
my_list = [5,2,7]
for i,element in enumerate(my_list):
    print("i: ", i, "- List element:", element)
```

prints this:

```
i:  0 - List element: 5
i:  1 - List element: 2
i:  2 - List element: 7
```

An actual implementation would look like this:

```
def sum_function(input_list):
    ''' A function that returns the sum of a list of numbers.
    Returns error if input is not a list containing ints or floats '''
    total = 0
    for i,number in enumerate(input_list):
        if not str(number).isnumeric():
            print("Warning: skipping non-numeric in list passed to sum_function:", number, "at index", i)
            continue
        total += number
    return total
```

and when given the input `["abc", 1, 5]`, prints the following:

```
Warning: skipping non-numeric in list passed to sum_function: abc at index 0
```

but ultimately returns the correct sum, `6`.

## Files

Doing data-intensive biology in Python requires the use of files: these are where the raw data is kept, and where the results of analyses are stored.

### Working with file systems on Jupyter notebook

Anything preceded with `%` on Jupyter is a special Jupyter notebook command.

* `%pwd` tells you where the Jupyter notebook is located--it **p**rints the **w**orking **d**irectory
* `%ls` lists all the files within the current working directory
* You can list what is in a different directory by giving `%ls` a path

### Opening and closing a file

Opening a file requires two components: a string containing the path of the file, and a flag that indicates what you want to do with the file. For instance, to *read* the contents of a file, use `"r"`. It's best to store the open file as a variable. Here's an example path, but the path may be different on your computer.

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")
```

After opening and reading a file, it is usually best to close it before moving on. To do so, use `data_file.close()`.

### Reading from a file

We can print the first 10 lines of a file using a loop.

```
line_count = 0
for line in data_file:
    line_count += 1
    if line_count > 10:
        break
    print(line)
```

This prints a huge amount of text, which I won't write here. Each line is very long, but Python only breaks a line when it sees a **new line** character, `\n`.

If you run this loop again, it prints lines from further down the file instead of starting over from the first line.
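The same behavior can be reproduced without the workshop's data file. The sketch below uses `io.StringIO` from the standard library as an in-memory stand-in for an opened file; the five-line contents and the names `demo_file`, `first_pass`, and `second_pass` are made up for illustration.

```python
import io

# In-memory stand-in for an opened file; the contents are invented
demo_file = io.StringIO("line 1\nline 2\nline 3\nline 4\nline 5\n")

first_pass = []
for line in demo_file:
    first_pass.append(line.strip())
    if len(first_pass) == 2:
        break                     # stop after two lines

second_pass = []
for line in demo_file:            # continues from where the first loop stopped
    second_pass.append(line.strip())

print(first_pass)    # ['line 1', 'line 2']
print(second_pass)   # ['line 3', 'line 4', 'line 5']
```

The second loop never sees the first two lines again, even though it is written exactly like a loop over a freshly opened file.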
This behavior is because the file variable, `data_file`, is actually a pointer to a line in a file. When you first open the file, the pointer is at the first line. But when you run a loop over `data_file`, you're moving the pointer; when the pointer gets to the end of the file, there is nothing left to read. One way to reset this pointer is by reopening the file.

#### Searching for a certain line

Let's say we want to look for a certain line of text in this file, `"Nei & Gojobori 1986. dN/dS (dN, dS)"`. First, unsuccessfully try this:

```
for line in data_file:
    if line == "Nei & Gojobori 1986. dN/dS (dN, dS)":
        print("Found it!", line)
```

This does not print "Found it!", because every line also contains a newline character, so no line is perfectly equal to "Nei & Gojobori 1986. dN/dS (dN, dS)". To get around this, we can use the `.strip()` method to remove whitespace characters from the ends of the line. Overall:

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")
for line in data_file:
    if line.strip() == "Nei & Gojobori 1986. dN/dS (dN, dS)":
        print("Found it!", line)
data_file.close()
```

Prints:

```
Found it! Nei & Gojobori 1986. dN/dS (dN, dS)
```

An alternative method without using `.strip()` checks whether the phrase is `in` the line:

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")
for line in data_file:
    if "Nei & Gojobori 1986. dN/dS (dN, dS)" in line:
        print("Found it!", line)
data_file.close()
```

#### Printing lines after a line of interest

Now, let's say that we want the data that follows the phrase `"Nei & Gojobori 1986. dN/dS (dN, dS)"`. We know that the data we want start on the 4th line after that phrase and end before the 9th line after that phrase.

```
data_lines = []
for i, line in enumerate(data_file):
    if "Nei & Gojobori 1986. dN/dS (dN, dS)" in line:
        data_lines = list(range(i+4, i+9))
    if i in data_lines:
        print(line)
```

When the first `if` finds a line containing the phrase, the phrase is on line i. The `if` stores the values from `i+4` up to (but not including) `i+9` in the list `data_lines`; these are the line numbers of the lines we desire. The next `if` only prints a line if its number is one of the line numbers we stored in `data_lines`.

This would fail to work if the lines we wanted to read were *before* the line that we were searching for.

### The .split() method

Often, data are separated by spaces, commas, or other characters. Given such data, you can separate them into a list using the `.split()` method. Use whatever the separator is as the argument for split.

```
"1,2,3,4,5".split(",")
```

returns

```
['1', '2', '3', '4', '5']
```

Without an argument, `.split()` will split the string based on white space. The following code:

```
"1 2 3 4 5".split()
```

returns

```
['1', '2', '3', '4', '5']
```

## General data processing example

Consider one of the lines we found above, `"A18886 0.5516 (0.1125 0.2039)"`. Say we just want a list with the relevant float data from this line, excluding the part that starts with `A`. So our ideal output is `[0.5516, 0.1125, 0.2039]`.

#### Method 1:

Break this into steps:

* Make the line into a list using `.split()`
* Get rid of the extraneous string `"A18886"`
* Strip away extraneous parentheses using `.strip()`
* Make the clean strings into floats using `float()`

First, split the data into a list:

```
line = "A18886 0.5516 (0.1125 0.2039)"
line = line.split()
```

This gives us `line = ['A18886', '0.5516', '(0.1125', '0.2039)']`

Then, slice the 0th element off of the list:

```
data_only = line[1:]
```

Now we're working with `data_only = ['0.5516', '(0.1125', '0.2039)']`

Then, loop through every element of `data_only`, removing extraneous parentheses on the outside of each string with `.strip("()")` and then converting the clean string to a float with `float()`.
```
for i, datum in enumerate(data_only):
    data_only[i] = data_only[i].strip("()")
    data_only[i] = float(data_only[i])
print(data_only)
```

This prints `[0.5516, 0.1125, 0.2039]`.

There are many ways of doing the same thing:

#### Method 2: using appends

```
line = "A18886 0.5516 (0.1125 0.2039)".split()
data_list = line[1:]
new_list = []
for item in data_list:
    new_item = float(item.strip("()"))
    new_list.append(new_item)
print(new_list)
```

#### Method 3: individual treatment

```
data_string = "A18886 0.5516 (0.1125 0.2039)"
#.strip("A18886") removes any of the characters A, 1, 8, 6 from both ends
data_string = data_string.strip("A18886")
data_list = data_string.split()
data_list[0] = float(data_list[0])
data_list[1] = float(data_list[1].strip("("))
data_list[2] = float(data_list[2][0:-1]) #slice off the trailing ")"
print(data_list)
```

#### Method 4: using a function

```
def clean_line(line):
    line = line.split()
    line = line[1:]
    for i, val in enumerate(line):
        line[i] = float(val.strip("()"))
    return line

clean_line("A18886 0.5516 (0.1125 0.2039)")
```

#### Method 5: gettin' slick

```
line = "A18886 0.5516 (0.1125 0.2039)"
my_list = []
for entry in line.split():
    if not entry.startswith("A"):
        my_list.append(float(entry.strip(" ()")))
print(my_list)
```

### Putting it all together

Note that this is the matrix following our line of interest:

```
A50449
A18886   0.5516 (0.1125 0.2039)
A91165   0.0795 (0.1445 1.8175) 0.0805 (0.1795 2.2292)
A125421  0.0672 (0.1723 2.5644) 0.0512 (0.1204 2.3497) 0.6589 (0.1065 0.1617)
```

It contains dn and ds values within the parentheses, and dn/ds values outside of the parentheses, all labeled with phrases that start with A.

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")
data_lines = []
dn_over_ds = []
dn = []
ds = []

for i, line in enumerate(data_file):
    if line.strip() == "Nei & Gojobori 1986. dN/dS (dN, dS)":
        data_lines = list(range(i+4, i+9))

    #if it's in the lines we care about
    if i in data_lines:
        print(line)
        line = line.split()
        line = line[1:] #get rid of the label on the line

        #convert each value in the line to a float
        for j in range(len(line)):
            line[j] = float(line[j].strip("()"))

        #iterate over each value in the line
        for j in range(0, len(line), 3): #dn/ds is the first in the set of 3 values
            dn_over_ds.append(line[j])
        for j in range(1, len(line), 3):
            dn.append(line[j]) #dn is the second in the set of 3 values
        for j in range(2, len(line), 3):
            ds.append(line[j]) #ds is the third in the set

data_file.close()
```

### Note

It is simplest to open and close the file in its own code box every time we do an example, like the above scripts, but that will be omitted from examples from now on.

# Wednesday

## The next() function

An easier way to work with files is to use the `next()` function, which returns the current line of the file, up to and including the newline character `\n`, and then moves the pointer to the character after the newline. E.g.

`game_theory_file = open('game_theory.csv', 'r')`

`next(game_theory_file)` -> `'time,num_coop,num_defect\n'`

`next(game_theory_file)` -> `'0,4838,1562\n'`

`next(game_theory_file)` -> `'10,5745,655\n'`

and so on.

## Parsing .csv files

We are working with `game_theory.csv`, a comma-separated values file. The name is self-explanatory: each entry is separated by a comma. It can be opened in Pandas, Excel, etc.

* It's good to work with standard file formats--makes life easier for your collaborators and your future self.

### Single lines

As we see above, to work with the data we have to deal with both the commas and the newline character, neither of which belongs in our data.
We can either strip off the newline, then split by commas:

```
stripped_line = '0,4838,1562\n'.strip()
split_line = stripped_line.split(",")
```

Or the reverse:

```
split_line = '0,4838,1562\n'.split(",") #turns it into a list
split_line[2] = split_line[2].strip()
```

Here is a function that takes a string containing three numbers and returns a list of those three numbers as ints:

```
def list_of_ints(input_string):
    '''Takes as input a comma-separated list of integers,
    e.g. "10,5745,655\n", and returns it as a list of ints,
    e.g. [10, 5745, 655].'''

    #Strip newline from end of string and split by commas
    string_list = input_string.strip().split(",")

    #Turn each entry in split line into an integer
    values_as_ints = []
    for entry in string_list:
        values_as_ints.append(int(entry))

    return values_as_ints
```

### Entire files

We can run the function we defined above on every line of `game_theory.csv`. Iterate through the file within a for loop, for instance.

```
for line in game_theory_file:
    values_list = list_of_ints(line.strip())
    print(values_list)
```

We have to be careful, though, since this file contains a header that we have to skip. A few ways to do this are as follows.

1. Use `next()` before entering the loop. This skips the first line so that when we enter our for loop, `game_theory_file` points to the second line of the file.

```
next(game_theory_file)
for line in game_theory_file:
    values_list = list_of_ints(line.strip())
    print(values_list)
```

2. Use `continue` within the loop to skip a line that starts with something that's non-numeric. We'll look at the first character in that line and use the `.isnumeric()` method to determine whether to skip running our function on that line. Alternatively, we could use something like `if line.startswith("time")`

```
for line in game_theory_file:
    if not line[0].isnumeric():
        continue
    values_list = list_of_ints(line.strip())
    print(values_list)
```

3. Use `enumerate()` and skip the first run of the loop.
```
for i, line in enumerate(game_theory_file):
    if i == 0:
        continue
    values_list = list_of_ints(line.strip())
    print(values_list)
```

Each of these prints

```
[0, 4838, 1562]
[10, 5745, 655]
[20, 6172, 228]
[30, 6345, 55]
[40, 6390, 10]
[50, 6393, 7]
[60, 6393, 7]
[70, 6393, 7]
[80, 6393, 7]
[90, 6393, 7]
[100, 6393, 7]
```

#### A word on code efficiency

The `next()` approach above is probably the best for this particular type of file, because we only need to skip the first line, and it doesn't involve making a comparison during every run of the loop (which can slow you down). However, if you were working with a file where you had to skip multiple lines throughout the file, e.g. a .FASTA file with headings every few lines, the second approach (using `continue`) would probably be better.

### Extracting a set of values

What if we wanted to extract just the number of cooperators at each time step? In each list of integers `values_list` above, the number of cooperators is `values_list[1]` (recall--the second column has index 1).

Call the list of cooperator numbers `cooperators`. During each step when we convert the line to a list of ints, we will `.append()` the second element in the list to the cooperators list. So, the full script looks like this:

```
cooperators = []
next(game_theory_file)
for line in game_theory_file:
    values_list = list_of_ints(line.strip())
    cooperators.append(values_list[1])
print(cooperators)
```

and prints:

```
[4838, 5745, 6172, 6345, 6390, 6393, 6393, 6393, 6393, 6393, 6393]
```

### Lists of lists

We can create a list of lists by appending lists to a list. For instance, here is a list of lists: `[[0,129], [1,401], [2,819]]`

The code used to create a list of lists from `game_theory.csv` is nearly identical to that used above to create the cooperators list.
But instead of appending one item from the list (`values_list[1]`) we append the whole list (`values_list`):

```
list_of_lists = []
next(game_theory_file)
for line in game_theory_file:
    values_list = list_of_ints(line.strip())
    list_of_lists.append(values_list)
print(list_of_lists)
```

Which prints: `[[0, 4838, 1562], [10, 5745, 655], [20, 6172, 228], [30, 6345, 55], [40, 6390, 10], [50, 6393, 7], [60, 6393, 7], [70, 6393, 7], [80, 6393, 7], [90, 6393, 7], [100, 6393, 7]]`.

To access the cooperators from this list and append them to a new list, run a loop:

```
cooperators = []
for single_list in list_of_lists:
    cooperators.append(single_list[1])
print(cooperators)
```

This will print the cooperators list as above:

```
[4838, 5745, 6172, 6345, 6390, 6393, 6393, 6393, 6393, 6393, 6393]
```

## Pandas!

Pandas is a Python library that provides easy-to-use data structures and data analysis tools. It is so widely used that many other libraries depend on it.

For instance, Pandas can turn a .csv file into a dataframe named `data` with the following code: `data = pandas.read_csv('game_theory.csv')`

### Importing Pandas

But wait! To use Pandas, you must import it.

```
import pandas
```

Then you can use Pandas functions, like `read_csv()`, with the syntax `pandas.read_csv(path)`.

If you don't want to type out "pandas" every time you use a Pandas function, you can import Pandas with a nickname: `import pandas as pd`. Then it's easier to call functions: `pd.read_csv(path)`.

If someone imported Pandas with `from pandas import *`, they could use all the Pandas functions without prepending the function with anything, e.g. they could just write `read_csv(path)`. But this is confusing, and it's probably best not to.

### Accessing subsets of dataframes

Two ways to do this: `loc` and `iloc`

Let's say our data is stored in a dataframe named `game_data`.
If we print `game_data` we see

```
    time  num_coop  num_defect
0      0      4838        1562
1     10      5745         655
2     20      6172         228
3     30      6345          55
4     40      6390          10
5     50      6393           7
6     60      6393           7
7     70      6393           7
8     80      6393           7
9     90      6393           7
10   100      6393           7
```

Here the names of the columns were given in the .csv file, and the names of the rows were automatically generated by Pandas when it read in the .csv file. See how the following commands select parts of this table.

### iloc[row_numbers, column_numbers]

This method lets us specify locations by integer position (hence the name **i**nteger **loc**ation). Indexing works similarly to how it does in Python.

If we want a single location in the data, we simply type:

`game_data.iloc[9, 1]` -> `6393`

If we want to get the data from the first 5 timesteps, we use the following script. Just like in Python, putting a `:` on its own gets the entire set of columns, and putting a `:5` gets the first five rows (indices 0 through 4).

`game_data.iloc[:5, :]`

When we print this command we get:

```
   time  num_coop  num_defect
0     0      4838        1562
1    10      5745         655
2    20      6172         228
3    30      6345          55
4    40      6390          10
```

Note that, as with typical Python slicing, the rows included start at the first index and stop just before the second index.

### loc[rows,col_names]

Alternatively, this one lets us specify locations by names. Note that if there was no header provided, the name of the column/row is automatically the same as its index.

So, the exact same "indices" as above:

`game_data.loc[:5, :]`

return a different dataframe--one that includes the row named 5.

```
   time  num_coop  num_defect
0     0      4838        1562
1    10      5745         655
2    20      6172         228
3    30      6345          55
4    40      6390          10
5    50      6393           7
```

This is because, unlike `iloc`, `loc` is **inclusive**.

The real strength of `loc` is that you can specify column names instead of numbers.
`game_data.loc[3:7,"num_coop":"num_defect"]` ->

```
   num_coop  num_defect
3      6345          55
4      6390          10
5      6393           7
6      6393           7
7      6393           7
```

#### Examples

You can use iloc to print a column, which Pandas thinks of as a "series":

`game_data.iloc[:,1]` ->

```
0     4838
1     5745
2     6172
3     6345
4     6390
5     6393
6     6393
7     6393
8     6393
9     6393
10    6393
Name: num_coop, dtype: int64
```

Notice that Jupyter doesn't print this like it does a dataframe. The column name is printed at the bottom along with some gobbledygook (the data type of the values). Don't worry. You can still interact with a series much like a dataframe.

We could use `loc` to print the first few timesteps in the cooperators column without worrying about the columns' indices:

`game_data.loc[0:3,"num_coop"]` ->

```
0    4838
1    5745
2    6172
3    6345
Name: num_coop, dtype: int64
```

### Intro to keyword arguments

Unlike positional arguments, which you have to put in a certain order--e.g. `open()` always expects a path first, and a mode second--keyword arguments allow you to specify by name which argument you are setting.

In this example we specify that the "header" argument is "None", i.e. we read in the CSV file as if it does not have a header.

```
data_noheader = pandas.read_csv("game_theory.csv", header=None)
data_noheader.loc[:,1:2]
```

See how the output columns are labeled:

```
           1           2
0   num_coop  num_defect
1       4838        1562
2       5745         655
3       6172         228
4       6345          55
5       6390          10
6       6393           7
7       6393           7
8       6393           7
9       6393           7
10      6393           7
11      6393           7
```

### Basic descriptive statistics

Pandas can show us things like the maximum, mean, and so on. These are methods called on a certain dataframe or series. For instance, if we want the minimum number of defectors:

```
defect = game_data.loc[:, "num_defect"]
defect.min()
```

which returns `7`.

The ones we learned include `min`, `max`, `median`, and `mean`.

These methods can be used on multiple columns at once.
Say we wanted to get the mean cooperators and defectors:

```
coop_and_defect = game_data.iloc[:, 1:]
coop_and_defect.mean()
```

This returns the series:

```
num_coop      6168.0
num_defect     232.0
dtype: float64
```

We can also use keyword arguments with these descriptive statistics. We can specify whether we want the column maximum (axis=0) or the row maximum (axis=1).

`game_data.max(axis=0)` ->

```
time           100
num_coop      6393
num_defect    1562
dtype: int64
```

`game_data.max(axis=1)` ->

```
0     4838
1     5745
2     6172
3     6345
4     6390
5     6393
6     6393
7     6393
8     6393
9     6393
10    6393
dtype: int64
```

### Pandas handles N/As

Pandas makes it easy to clean improbable data using **masks**; e.g. you can find negative or impossibly high values. You could also use it to mask out any sentinel values you might decide to use while manually cleaning data. Values that fail the condition are replaced with `NaN`.

`game_data[game_data > 6000]` ->

```
    time  num_coop  num_defect
0    NaN       NaN         NaN
1    NaN       NaN         NaN
2    NaN    6172.0         NaN
3    NaN    6345.0         NaN
4    NaN    6390.0         NaN
5    NaN    6393.0         NaN
6    NaN    6393.0         NaN
7    NaN    6393.0         NaN
8    NaN    6393.0         NaN
9    NaN    6393.0         NaN
10   NaN    6393.0         NaN
```

Even better--it can handle all of the NaNs this can produce without "breaking."

```
masked_data = game_data[game_data > 6000]
masked_data.mean()
```

returns:

```
time                  NaN
num_coop      6362.777778
num_defect            NaN
dtype: float64
```

#### Use a mask to turn all of the 7s in the dataframe to n/as:

`game_data[game_data!=7]`

```
    time  num_coop  num_defect
0      0      4838      1562.0
1     10      5745       655.0
2     20      6172       228.0
3     30      6345        55.0
4     40      6390        10.0
5     50      6393         NaN
6     60      6393         NaN
7     70      6393         NaN
8     80      6393         NaN
9     90      6393         NaN
10   100      6393         NaN
```

### Plotting

Before plotting using Pandas in Jupyter Notebook, you'll want to make sure the plots don't pop up in a new window. Do this by running `%matplotlib inline` in the notebook.

Now, you're ready to plot! (I can't paste plots...)
# Thursday

Working with the following code from Tuesday:

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")
data_lines = []
dn_over_ds = []
dn = []
ds = []

for i, line in enumerate(data_file):
    if line.strip() == "Nei & Gojobori 1986. dN/dS (dN, dS)":
        data_lines = list(range(i+4, i+9))

    #if it's in the lines we care about
    if i in data_lines:
        print(line)
        line = line.split()
        line = line[1:] #get rid of the label on the line

        #convert each value in the line to a float
        for j in range(len(line)):
            line[j] = float(line[j].strip("()"))

        #iterate over each value in the line
        for j in range(0, len(line), 3): #dn/ds is the first in the set of 3 values
            dn_over_ds.append(line[j])
        for j in range(1, len(line), 3):
            dn.append(line[j]) #dn is the second in the set of 3 values
        for j in range(2, len(line), 3):
            ds.append(line[j]) #ds is the third in the set

data_file.close()
```

## Example: setting up column labels

We want to write code that associates the dn/ds values with their row and column labels in a diagonally symmetric table as follows:

```
A50449
A18886   0.5516 (0.1125 0.2039)
A91165   0.0795 (0.1445 1.8175) 0.0805 (0.1795 2.2292)
A125421  0.0672 (0.1723 2.5644) 0.0512 (0.1204 2.3497) 0.6589 (0.1065 0.1617)
```

For instance, `0.0672 (0.1723 2.5644)` is associated with row label `A125421` and column label `A50449`. To make the labeling more clear:

```
Data:                    Row label:    Column label:
0.5516 (0.1125 0.2039)   B: A18886     A: A50449
0.0795 (0.1445 1.8175)   C: A91165     A: A50449
0.0805 (0.1795 2.2292)   C: A91165     B: A18886
0.0672 (0.1723 2.5644)   D: A125421    A: A50449
0.0512 (0.1204 2.3497)   D: A125421    B: A18886
0.6589 (0.1065 0.1617)   D: A125421    C: A91165
```

### Row labels

As seen above, the order of the row labels is `[B, C, C, D, D, D]`. If we have a list of labels `[A, B, C, D]`, notice A has index 0, B has index 1, C has index 2, and D has index 3. Thus, each label is added to the list the same number of times as its index.

Let's make this into a function.
```
def get_row_labels(labels):
    row_labels = []
    for i, label in enumerate(labels):
        for _ in range(i):
            row_labels.append(label)
    return row_labels
```

The first for loop runs through every label in `labels`. Because we used enumerate, we also have the index `i` of each label ready to use as a variable in the loop.

The second for loop runs as many times as the index `i`. Each time it runs, it appends the label to `row_labels`. Thus, we end up with the correct row labels list:

`get_row_labels(['A50449', 'A18886', 'A91165', 'A125421'])` -> `['A18886', 'A91165', 'A91165', 'A125421', 'A125421', 'A125421']`

#### for _ :

The loop above used `_` instead of a loop variable. This is a convention that lets anyone examining your code know that you will not be using the loop variable within the loop. This happens when you just want to do an action a certain number of times.

### Column labels

We can use a similar nested for loop for the column labels. The order of column labels is `[A, A, B, A, B, C]`. Thinking of this in terms of the order we append things to an empty list:

* At first we append only the first label.
* Then we add the first and second labels.
* Then, finally, we add the first, second, and third labels.

So, this can be accomplished with a nested for loop.

```
def get_column_labels(labels):
    column_labels = []
    for i in range(len(labels) - 1):
        for label in labels[0:i+1]:
            column_labels.append(label)
    return column_labels
```

The outer loop runs three times (length of the labels list minus 1). The inner loop is over `labels[0:i+1]` and appends every element within `labels[0:i+1]` to the list of labels.

```
labels[0:1] - [A]
labels[0:2] - [A, B]
labels[0:3] - [A, B, C]
```

Alternatively, instead of the nested for loop, we can use a cool method of lists: `extend(my_list)`, which adds every element of `my_list` to the end of whichever list you're calling the method on.
Used here, it looks like this:

```
def get_column_labels(labels):
    column_labels = []
    for i in range(len(labels) - 1):
        column_labels.extend(labels[0:i+1])
    return column_labels
```

## Dictionaries

Dictionaries are an object type that associates "keys" with "values." Every key must be unique, but values don't have to be. Keys have to be immutable (e.g. they could be strings or numbers, but not lists).

Dictionaries are created using curly brackets and supplying these key/value pairs: `dict_name = {key1:value1, key2:value2}`. A real-world example:

```
phone_numbers = {"emily":"831-428-2464", "emergency":"911"}
```

A value is accessed by typing in the key: `phone_numbers["emergency"]` -> `911`

We can add values, as well. `phone_numbers["call_before_you_dig"] = "811"`.

A value can be a variety of things--a list, or even a dictionary itself.

`phone_numbers["emily"] = ['555-555-5555', '831-428-2464']`

`phone_numbers["emily"] = {"secret": "555-555-5555", "cell":'831-428-2464'}`

In this latter case, we can access Emily's secret number as follows: `phone_numbers["emily"]["secret"]` -> `555-555-5555`.

### Challenge problem

Let's say we have this dictionary.

`animals = {"cats":6, "ferrets":25, "dogs":14}`

Level 1: Retrieve the number of dogs: `animals["dogs"]` -> 14

Level 2: Add 10 to the number of cats: `animals["cats"] += 10`

Level 3: Try writing a loop over animals. What is this loop doing?

Notice that this loop

```
for thing in animals:
    print(thing)
```

prints the animals:

```
cats
ferrets
dogs
```

Looping over a dictionary gives you the keys, not the values, because the keys are the only things that have to be unique. From the keys you can then recover the values.

```
for thing in animals:
    print(thing)
    print(animals[thing])
```

This will print each key followed by its value.

### Why do we care?

One reason to care is that dictionaries are a convenient way to hand columns of data to a Pandas dataframe!
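As a small sketch of the loops above: the dictionary method `.items()` hands back each key together with its value, so the separate lookup `animals[thing]` isn't needed. The names `animal`, `count`, `keys_seen`, and `total` are made up for illustration.

```python
animals = {"cats": 6, "ferrets": 25, "dogs": 14}

# Looping over the dictionary itself yields only the keys
keys_seen = []
for thing in animals:
    keys_seen.append(thing)

# .items() yields (key, value) pairs, which we can unpack in the for statement
total = 0
for animal, count in animals.items():
    print(animal, count)
    total += count

print(keys_seen)  # ['cats', 'ferrets', 'dogs']
print(total)      # 45
```

Either style works; `.items()` mostly saves typing when you need both the key and the value inside the loop.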
### A word on efficiency

You could build something similar to a dictionary out of a list of lists--but a dictionary is much faster. In general, dictionaries are more efficient than lists if you are frequently using the object to look up a specific value.

## Creating a DataFrame

Pandas DataFrames can be created from dictionaries. Each key is the name of a column, and the value associated with it should be a list of column values. For instance:

```
data_dict = {"A":[1,2,3], "C":[4,5,6], "B":[7,8,9]}
my_data = pandas.DataFrame(data_dict)
```

yields this dataframe:

```
   A  B  C
0  1  7  4
1  2  8  5
2  3  9  6
```

Some things to note: the version of pandas we used alphabetizes columns (newer versions keep the order of the dictionary's keys instead), and for now, all lists must be equal length. There are ways to add columns with missing values, and to rearrange columns, however. To rearrange columns, just specify the column labels within double brackets: `rearranged_data = my_data[["C","B","A"]]`

### Our example

Using all the code we have already written, it is easy to create a dictionary that works for our example:

```
data_dict = {"dn":dn, "ds":ds, "dn/ds":dn_over_ds, "row_label":row_labels, "col_label":col_labels}
dn_ds_dataframe = pandas.DataFrame(data_dict)
```

and put this into a function called `parse_file(path)` which parses the file at the given path.
The result:

`parse_file("C:/Users/tessa/Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2")` ->

```
  col_label      dn   dn/ds      ds row_label
0    A50449  0.1125  0.5516  0.2039    A18886
1    A50449  0.1445  0.0795  1.8175    A91165
2    A18886  0.1795  0.0805  2.2292    A91165
3    A50449  0.1723  0.0672  2.5644   A125421
4    A18886  0.1204  0.0512  2.3497   A125421
5    A91165  0.1065  0.6589  0.1617   A125421
```

Running the `describe()` method on this dataframe yields descriptive stats for the numeric columns:

```
             dn     dn/ds        ds
count  6.000000  6.000000  6.000000
mean   0.139283  0.248150  1.554400
std    0.031252  0.278883  1.090013
min    0.106500  0.051200  0.161700
25%    0.114475  0.070275  0.607300
50%    0.132450  0.080000  2.023350
75%    0.165350  0.433825  2.319575
max    0.179500  0.658900  2.564400
```

### Multiple files

To run `parse_file(path)` on multiple files, we could loop through a list of file paths. Such a list can be produced using the `glob` function from the `glob` module. To use it, import glob:

```
import glob
```

We can use it to create a list of one filename:

`glob.glob("python-novice-inflammation-data/data/inflammation-01.csv")` -> `['python-novice-inflammation-data/data/inflammation-01.csv']`

But even better, we can use the `*` wildcard to create a list of many filenames.
`glob.glob("python-novice-inflammation-data/data/inflammation-*.csv")` ->

```
['python-novice-inflammation-data/data/inflammation-01.csv',
 'python-novice-inflammation-data/data/inflammation-02.csv',
 'python-novice-inflammation-data/data/inflammation-03.csv',
 'python-novice-inflammation-data/data/inflammation-04.csv',
 'python-novice-inflammation-data/data/inflammation-05.csv',
 'python-novice-inflammation-data/data/inflammation-06.csv',
 'python-novice-inflammation-data/data/inflammation-07.csv',
 'python-novice-inflammation-data/data/inflammation-08.csv',
 'python-novice-inflammation-data/data/inflammation-09.csv',
 'python-novice-inflammation-data/data/inflammation-10.csv',
 'python-novice-inflammation-data/data/inflammation-11.csv',
 'python-novice-inflammation-data/data/inflammation-12.csv']
```

This can be used in our current example as follows:

`glob.glob("intermediatefiles/*/*.cluster.aa.fa.aln.nt.phy.trim.paml.model2")` ->

```
['intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2',
 'intermediatefiles/0009296/0009296.cluster.aa.fa.aln.nt.phy.trim.paml.model2',
 'intermediatefiles/0009857/0009857.cluster.aa.fa.aln.nt.phy.trim.paml.model2',
 'intermediatefiles/0009926/0009926.cluster.aa.fa.aln.nt.phy.trim.paml.model2']
```

Now, with that list stored in a variable called `file_list`, we just loop through our files, appending the dataframes the function creates to a list.

```
data_frames = []
for filename in file_list:
    data_frames.append(parse_file(filename))
```

Aaaandd we get a ValueError. Oh dear.

#### Try/Except

We can see which files are problematic using the try/except construction.

```
data_frames = []
for filename in file_list:
    try:
        data_frames.append(parse_file(filename))
    except:
        print("There's something wrong with", filename)
```

The way try/except works is that Python attempts to execute whatever is indented beneath `try`. If that code raises an error, the rest of the `try` block is skipped and the code within the `except` portion executes instead, so the program keeps going rather than stopping with the error.
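Here is a minimal sketch of that flow that doesn't need the workshop's data files. The helper `to_float` and the sample strings are made up for illustration. Naming the error type after `except` (here `ValueError`) is one option for handling only the error we expect, so that anything unexpected still shows up.

```python
def to_float(text):
    '''Hypothetical helper: converts text to a float,
    or returns None if the text can't be converted.'''
    try:
        return float(text.strip("()"))
    except ValueError:
        # This only runs if float() raised a ValueError
        print("There's something wrong with", repr(text))
        return None

results = []
for entry in ["0.5516", "(0.1125", "oops", "0.2039)"]:
    results.append(to_float(entry))

print(results)  # [0.5516, 0.1125, None, 0.2039]
```

The loop keeps going after the bad entry, which is the same reason the file loop above can report a problem file and still parse the rest.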
We find that there are problems with two filenames:

```
There's something wrong with intermediatefiles\0009857\0009857.cluster.aa.fa.aln.nt.phy.trim.paml.model2
There's something wrong with intermediatefiles\0009926\0009926.cluster.aa.fa.aln.nt.phy.trim.paml.model2
```

### Concatenating dataframes

Now we have two dataframes contained in `data_frames`. To concatenate them, use the Pandas function `concat()`:

`all_data = pandas.concat(data_frames)` ->

```
  col_label      dn   dn/ds      ds row_label
0    A50449  0.1125  0.5516  0.2039    A18886
1    A50449  0.1445  0.0795  1.8175    A91165
2    A18886  0.1795  0.0805  2.2292    A91165
3    A50449  0.1723  0.0672  2.5644   A125421
4    A18886  0.1204  0.0512  2.3497   A125421
5    A91165  0.1065  0.6589  0.1617   A125421
0    A21380  0.0392  0.1081  0.3629    A58613
1    A21380  0.1538  0.0724  2.1253    A92845
2    A58613  0.1684  0.0708  2.3779    A92845
3    A21380  0.1726  0.0777  2.2221   A127548
4    A58613  0.1572  0.0686  2.2905   A127548
5    A92845  0.0173  0.2538  0.0681   A127548
```

Sadly, the row names are not unique. To fix this, just run the `reset_index()` method on the concatenated dataframe. The old row names are kept in a new column called `index`.

`all_data.reset_index()` ->

```
    index col_label      dn   dn/ds      ds row_label
0       0    A50449  0.1125  0.5516  0.2039    A18886
1       1    A50449  0.1445  0.0795  1.8175    A91165
2       2    A18886  0.1795  0.0805  2.2292    A91165
3       3    A50449  0.1723  0.0672  2.5644   A125421
4       4    A18886  0.1204  0.0512  2.3497   A125421
5       5    A91165  0.1065  0.6589  0.1617   A125421
6       0    A21380  0.0392  0.1081  0.3629    A58613
7       1    A21380  0.1538  0.0724  2.1253    A92845
8       2    A58613  0.1684  0.0708  2.3779    A92845
9       3    A21380  0.1726  0.0777  2.2221   A127548
10      4    A58613  0.1572  0.0686  2.2905   A127548
11      5    A92845  0.0173  0.2538  0.0681   A127548
```

We can also concatenate columns (side by side) instead of rows with the keyword argument `axis=1`.
`pandas.concat(data_frames, axis=1)` ->

```
  col_label      dn   dn/ds      ds row_label col_label      dn   dn/ds  \
0    A50449  0.1125  0.5516  0.2039    A18886    A21380  0.0392  0.1081
1    A50449  0.1445  0.0795  1.8175    A91165    A21380  0.1538  0.0724
2    A18886  0.1795  0.0805  2.2292    A91165    A58613  0.1684  0.0708
3    A50449  0.1723  0.0672  2.5644   A125421    A21380  0.1726  0.0777
4    A18886  0.1204  0.0512  2.3497   A125421    A58613  0.1572  0.0686
5    A91165  0.1065  0.6589  0.1617   A125421    A92845  0.0173  0.2538

       ds row_label
0  0.3629    A58613
1  2.1253    A92845
2  2.3779    A92845
3  2.2221   A127548
4  2.2905   A127548
5  0.0681   A127548
```

## Running Python elsewhere

Save your code as a .py file. When running on the command line, simply type:

```
python file_name.py
```

and your program will execute. Beware that, unlike in a Jupyter notebook, the value of the last expression will not be printed to the terminal automatically; use `print()` for anything you want to see.

If you want to use the functions that you defined in a .py file, you can import them just like we imported other modules, assuming you are in the same directory as the .py file: `import file_name`

# Friday

Today we'll be looking at numpy and scipy, tools that many data scientists use.

## Numpy

Numpy is a library that is typically imported with the nickname `np`:

```
import numpy as np
```

### Arrays in numpy

The reason people use numpy is the n-dimensional array object it provides. To make an array, give numpy's `array` constructor a list (or, for more dimensions, a list of lists).

```
my_array = np.array([1, 2, 3])
```

Because this is a one-dimensional array, it's not that different from a list:

* Index into array: `my_array[0]` -> `1`
* Modify array: `my_array[0] = 6`
* Print array: `print(my_array)` -> `[6 2 3]`
* Find length and type of array: `len(my_array)` -> `3`, `type(my_array)` -> `numpy.ndarray`

Notice that the type is **nd**array: n-dimensional array. People use arrays in numpy instead of dataframes in pandas because numpy can handle arrays with as many dimensions as you want, whereas pandas dataframes are two-dimensional.
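One place where a one-dimensional array does behave differently from a list is arithmetic. A quick sketch, assuming numpy is installed and imported as `np` as above; the names `doubled_list` and `doubled_array` are just for illustration:

```python
import numpy as np

my_list = [1, 2, 3]
my_array = np.array(my_list)

# Multiplying a list by 2 repeats the list...
doubled_list = my_list * 2
# ...while multiplying an array by 2 multiplies each element
doubled_array = my_array * 2

print(doubled_list)   # [1, 2, 3, 1, 2, 3]
print(doubled_array)  # [2 4 6]
print(my_array.ndim)  # 1 (the number of dimensions)
```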
### Multi-dimensional arrays

Multi-dimensional arrays are created using lists of lists. Here is a 2x3 array:

```
my_2d_list = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
```

You can put line breaks within parentheses, but even with the line breaks, these start to become difficult to read.

Some useful functions:

* Indexing into multi-dimensional arrays: two ways

  `my_2d_list[0,1]` -> `2`

  `my_2d_list[0][1]` -> `2`
* Indexing with ranges (can also be done two ways)

  `my_2d_list[1, :2]` -> `array([4, 5])`
* Getting the shape (outermost dimension to innermost dimension)

  `my_2d_list.shape` -> `(2,3)`

Notice that `.shape` doesn't have parentheses after it. As we know, types of objects can have methods, e.g. `my_dataframe.mean()`. These methods run code on the object. In contrast, no code needs to be run for `.shape`, as it's just a stored attribute associated with the variable.

The `.shape` is not meaningful for ragged arrays like this, where the inner lists have different lengths:

```
new_array = np.array([
    [1, 2],
    [3, 4, 5]
])
```

`new_array.shape` -> `(2,)`

(Newer versions of numpy raise an error when asked to build an array like this unless you also pass `dtype=object`.)

### Why use a many-dimensional array?

Example: a 3-dimensional array could be used if you had a table of values that changed over time.

```
my_3d_array = np.array([
    [
        [1.0, 2.0, 3.0],
        [4, 5.0, 6]
    ],
    [
        [1, 2.7, 3.9],
        [5, 5.5, 5.9]
    ]
])
```

Now `my_3d_array.shape` -> `(2,2,3)`. Two 2x3 arrays.

### Functions that return arrays: arange, linspace

Similar in usage to the `range` function, `arange` is a numpy function that returns a range of numbers. The difference between the two is that `arange` returns the range in the form of an array (i.e. **a**rray **range**).

`np.arange(5, 10)` -> `array([5, 6, 7, 8, 9])`

`np.arange(0, 10, 2)` -> `array([0, 2, 4, 6, 8])`

The `linspace` function gives an array of evenly-spaced floats between and including the two endpoints. Its third argument is the number of values you want.
`np.linspace(11, 20, 10)` -> `array([ 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.])`

### Using arrays as matrices

The `asmatrix` function tells numpy to interpret an array as a matrix:

```
my_mtx = np.asmatrix(np.array([[1,2],[3,4]]))
```

### The math library

Most common math functions aside from the standard addition, multiplication, etc. can be found in the math library.

`import math`

For instance:

* `math.sqrt(25)` -> `5.0`
* `math.log(4)` -> `1.3862943611198906`

### Math on arrays and matrices

Multiplying by a number:

```
my_array = np.array([[1,2],[3,4]])
my_array * 2
```

->

```
array([[2, 4],
       [6, 8]])
```

The result of multiplying two objects depends on whether those are arrays or matrices. Multiplying an array by an array is elementwise.

```
my_array = np.array([[1,2],[3,4]])
my_array * my_array
```

prints

```
array([[ 1,  4],
       [ 9, 16]])
```

Whereas multiplication on matrix objects is matrix multiplication:

```
my_mtx = np.asmatrix(my_array)
my_mtx*my_mtx
```

results in

```
matrix([[ 7, 10],
        [15, 22]])
```

## Deterministic modeling

Our goal is to use computers to improve our science. We've been talking a lot about data. Modeling, however, can test your assumptions about the way the world works, or predict the future of your system.

### R packages used for modeling

`dismo`: distribution modeling - give it where species live now, and it will give you where they may live in the future.

### Using Python functions to do math

Think of the function `f(x) = x^2`, which squares its input. We can have Python do the same thing by defining a function:

```
def f_x(x):
    return x**2

f_x(2)
```

returns `4`.

### More complex equations

Now consider the equation for growth in a population. Let `N` represent the population size at a certain time step, and `r` represent the population growth rate within each time step.
Then the number of new individuals added during a time step can be written as

```
dN/dt = rN
```

or, in a Python model,

```
def pop_growth(N, r):
    '''
    Calculates the increase in population for a given timestep
    Input:
        N - the current size of the population
        r - the growth rate
    Returns: The number of new individuals in a time step
    '''
    return r * N
```

This is a simple model that lumps birth rate and death rate together.

#### Note on timesteps

When we're modeling, we typically abstract time to "timesteps." How long a timestep is depends on the system you're working with, e.g. large animals may have single-year timesteps, but bacteria may have hourly timesteps.

### Modeling population growth over many time steps

We can use a for loop with our `pop_growth` function to model population growth over many time steps. For 100 time steps, starting our population with 200 individuals, and a growth rate of 0.05, that loop looks like this:

```
pop_size = 200
for i in range(100):
    pop_size += pop_growth(pop_size, 0.05)
    print("Time:", i, "Population size:", pop_size)
```

This prints a lot of output, the first few lines of which are:

```
Time: 0 Population size: 210.0
Time: 1 Population size: 220.5
Time: 2 Population size: 231.525
Time: 3 Population size: 243.10125
Time: 4 Population size: 255.25631249999998
Time: 5 Population size: 268.01912812499995
```

Notice how at time 0, our population has already grown. If you would prefer that time 0 has the initial population size, put a print statement before the for loop and adjust the range of the for loop to start at 1.

```
pop_size = 200
print("Time:", 0, "Population size:", pop_size)
for i in range(1, 100):
    pop_size += pop_growth(pop_size, 0.05)
    print("Time:", i, "Population size:", pop_size)
```

### Making a pandas dataframe from our data

Recall that to make a dataframe, we use a dictionary, e.g. `pandas.DataFrame({"key1":value1,"key2":value2})`.
If we use lists as our values, we can create a dataframe of the timesteps of this model.

Initialize two lists, one for times and one for population sizes. I included one element in each list at first: time 0 and the initial population size.

```
initial_size = 200
steps = 50
rate = 0.05
time_list = [0]
pop_list = [initial_size]
```

Then we use the same for loop as above, but instead of printing values, we append them to the lists:

```
pop_size = initial_size
for i in range(1, steps+1):
    pop_size += pop_growth(pop_size, rate)
    time_list.append(i)
    pop_list.append(pop_size)
```

Lastly, we create a dataframe from these data, with column headers "time" and "population" as keys.

```
data = pandas.DataFrame({"time":time_list, "population":pop_list})
```

To be sure our data are in a readable format, we select the columns in the order we want, so that "time" is first.

```
data = data[["time", "population"]]
```

When we print the top of the resulting dataframe using `print(data.head())`, we have a lovely formatted table:

```
   time  population
0     0   200.00000
1     1   210.00000
2     2   220.50000
3     3   231.52500
4     4   243.10125
```

### Plotting the data

Plotting using code makes graphs easy to re-generate and modify. Recall that to make graphs inline, we first have to run `%matplotlib inline` within the notebook.

Create a plot from the dataframe using `fig = data.plot(x="time")`, ensuring that "time" is interpreted as the x axis. (Despite the variable name, `data.plot` returns a matplotlib axes object, which is why we call `get_figure()` before saving.) Then set the labels and save the figure at 300 dpi.

```
fig.set_xlabel("Time")
fig.set_ylabel("Population")
fig.get_figure().savefig("myplot.png", dpi=300)
```

Make sure the path you give points to somewhere you can find, and that it ends in a valid filetype. Above we used .png, but you could use .pdf, .jpg, etc.

If we had several columns of data that we wanted to plot in separate panels, we could use the following format: `data.plot(x="time", subplots=True)`. Beware: this returns an array of axes objects, one per panel, rather than a single one.
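
The loop that fills `time_list` and `pop_list` above can also be wrapped in a function, so the same simulation is easy to rerun with different parameters. The following is a minimal sketch rather than code from the workshop: the name `simulate_growth` and its parameter names are made up for illustration, and `pop_growth` is repeated so that the block runs on its own.

```python
def pop_growth(N, r):
    '''Returns the number of new individuals in one time step.'''
    return r * N


def simulate_growth(initial_size, rate, steps):
    '''
    Hypothetical helper (not from the workshop).
    Runs the growth model for the given number of steps and
    returns two lists: the times and the population sizes.
    '''
    time_list = [0]
    pop_list = [initial_size]
    pop_size = initial_size
    for i in range(1, steps + 1):
        pop_size += pop_growth(pop_size, rate)
        time_list.append(i)
        pop_list.append(pop_size)
    return time_list, pop_list


# Same parameters as in the example above
times, sizes = simulate_growth(200, 0.05, 50)
print(times[:3], sizes[:3])
```

The two returned lists can be passed to `pandas.DataFrame({"time": times, "population": sizes})` exactly as before. Because each step multiplies the population by `1 + rate`, the final size should be very close to `200 * 1.05**50`, which is a handy sanity check on the loop.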
### Logistic growth example

The mathematical equation governing logistic growth is `dN/dt = rN(K-N)/K`, where K represents the population carrying capacity. In this instance, we simply swap the `pop_growth` function for a logistic version:

```
def pop_growth_logistic(N, r, K):
    '''
    Calculates the logistic increase in population size for a given timestep
    Input:
        N - the current size of the population
        r - the growth rate
        K - the carrying capacity
    Returns: The number of new individuals in a time step
    '''
    return r * N * (K - N) / K
```

## Stochastic modeling

Stochastic modeling involves randomness.

# Next Steps

Now that you've seen all of the basic building blocks of Python, one of the main challenges is likely to be figuring out which concepts to use where. Here are some hints:

* If you have a lot of sections of code where you are tempted to copy and paste them and just change a value, that is a hint that you might want to put that code in either a function or a loop (or maybe a function that you call from inside a loop).
* If you have some data that you want to evaluate one way, and some data that you want to evaluate in a different way, you might want to use an if-statement.
* If you want to do the same thing a bunch of times, you might want to use a loop.
* If you want to produce some sort of aggregate value for a sequence, you probably want to use an accumulator pattern (initialize a variable outside of a loop and update it from within the loop).
* If you want to run code even though it might produce an error, use try/except.
* If you know that code is definitely going to produce an error and you want to avoid that, think about using an if-statement. If you're in a loop, you might want to pair this with continue or break.
* If you want to store data associated with a specific value, think about using a dictionary.

Obviously, we can't cover all of Python in a week.
Here are some common problems that we didn't talk about solving and some pointers on how you might go about solving them:

* **I need to write a loop, but I won't know beforehand how many times its body should run** - Check out `while` loops. They keep running the code in their body until a specified condition is no longer true.
* **I need to write a script (i.e. a .py file) that takes input from the user** - You might have noticed from using bash that a lot of programs take "command-line arguments" to provide additional guidance on how they should run. For instance, `cd` takes an argument telling it which directory to change to (e.g. `cd Desktop`). If you want to write a script that takes information in this way, check out Python's `sys` module (specifically `sys.argv`). Another option for providing user input to Python scripts is the `input()` function.
* **I need to write a script that makes new directories** - Check out the `os` module.
* **I want to do machine learning** - Check out [scikit-learn](http://scikit-learn.org/stable/).
* **I want to deal with image data** - Check out [scikit-image](http://scikit-image.org/). Also check out [this tutorial](https://github.com/DataLucence/images).
* **I want to deal with spatial data** - Check out [pysal](http://pysal.readthedocs.io/en/latest/).
* **I need to make more advanced plots than pandas supports and/or I want to use really great color palettes for them** - Check out [Seaborn](https://seaborn.pydata.org/). And if you want to make thoughtful choices about color (I promise it will make your plots way better), I cannot recommend their [color tutorial](https://seaborn.pydata.org/tutorial/color_palettes.html) enough.
* **I need to make more customized data visualizations** - Pandas and Seaborn are great if you want to make the specific plots that they provide, but sometimes you need to build your own. In that case, check out [matplotlib](https://matplotlib.org/), which is the library that Seaborn and Pandas visualizations are built with.
It gives you simpler graphical objects to work with, like shapes.
* **I want to do really complicated things to strings** - Google "regular expressions."
* **My code is taking forever to run** - Writing fast code is a whole huge topic to which entire sequences of semester-long courses are devoted. Also, for the most part, your time is more valuable than the computer's, so it's not worth stressing over. Sometimes, though, it does matter. If so, here are a couple of high-level things to look out for.
    * Are you printing out a lot of unnecessary information? Often, while you're writing code, it's good to print out a lot of stuff to make sure it's working the way you think it is. However, printing takes a lot of time. Once you're ready to run your code for real, consider removing calls to `print()` that you don't absolutely need. The same goes for reading from and writing to files.
    * Are you repeatedly searching for values in a list? If you are frequently using commands like `my_variable in [1,2,3,4]` (i.e. using the `in` operator to search for the value `my_variable` in a list), think about whether you actually care about the order of the things in the list, especially if that list is long. If you're just using it as a way to store a collection of values, think about using a dictionary or a set instead (we didn't cover sets, but you can google them). Looking for something in a list means Python may have to compare that value to every single item in the list. Dictionaries and sets let Python find the value in roughly a single step, no matter how many items they hold.
    * Check for any unnecessary lines of code, especially inside loops (where they will get run many times). Are you checking a condition in an if statement that will always be true or false? Are you calculating the same value in multiple places instead of calculating it once and storing it in a variable?
    * Are you adding a lot of strings together? If you're doing something like `"how" + "are" + "you" + "today"` many times over (for example, inside a loop), it can actually take a pretty long time.
Since strings are "immutable" (they can't be changed), each one of those plus signs requires creating an entirely new string. Instead, try using the `join()` string method to do it all in one step. Example: `" ".join(["how", "are", "you", "today"])`.
    * Want to dive deeper into this topic? The first step I'd recommend is reading about something called "profiling your code." This is a way to figure out what parts of your code are taking the longest, so that you can try to make them faster. For Python, I recommend checking out the program [SnakeViz](https://jiffyclub.github.io/snakeviz/).
* **My program is using too much memory** - This one is tougher. Sometimes you just have a lot of data that you need to work with at the same time. If you don't actually need to work with it all at the same time, though, take a look at a feature called "generators."
* **I'm reading someone else's code and they're doing something weird that I don't know how to google** - Often, not knowing the term for something makes it hard to understand what's going on. Here are some common hard-to-google shortcuts that more advanced programmers take in their code:
    * `[i for i in range(x)]` - Sometimes, you'll see what looks sort of like the beginning of a for loop inside square brackets. This is called a [list comprehension](http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/). In this workshop, when we needed to make a list of values that we calculate inside a loop, we used the accumulator pattern, where we initialize an empty list outside of the for loop, and then append to it inside the for loop. List comprehensions are a shortcut for doing that. You might also see variants with curly brackets (these are dictionary or set comprehensions) or parentheses (these are called generator expressions).
    * `lambda x : x*x` - This is called a lambda function. It's a shortcut for defining very simple functions without giving them a name.
The thing after the colon is the calculation the function does and returns.
    * Something with the @ sign above a function definition - These are called decorators. Fair warning, they're pretty confusing the first 20 times you try to understand them.
    * `a if condition else b` - This is called a conditional expression (you'll often hear it called a ternary operator). It's basically like an if statement that fits on one line.
* **I'm getting into more complex coding, and I'd like to make my own types of data (like how pandas defines DataFrames)** - This is called writing new classes. You'll use the `class` keyword.
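
To make the hard-to-google shortcuts in the list above a bit more concrete, here is a small sketch that pairs three of them with the longer form we used in this workshop. All of the variable and function names are made up for illustration.

```python
# Accumulator pattern, as used in this workshop
squares_loop = []
for i in range(5):
    squares_loop.append(i * i)

# The same list built with a list comprehension
squares_comp = [i * i for i in range(5)]

# A named function...
def square(x):
    return x * x

# ...and the equivalent lambda function
square_lambda = lambda x: x * x

# An if-statement...
n = 7
if n % 2 == 0:
    parity_long = "even"
else:
    parity_long = "odd"

# ...and the equivalent one-line conditional expression
parity_short = "even" if n % 2 == 0 else "odd"

# Each shortcut gives the same result as its longer form
print(squares_loop == squares_comp)
print(square(4) == square_lambda(4))
print(parity_long == parity_short)
```

All three lines print `True`. The shortcuts don't do anything the longer forms can't, so it's fine to stick with whichever version you find easier to read.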