Working with files
==================

Let's take a step back and work with a simpler file: `game_theory.csv`

In a little bit, I'm going to show you a better way to do this. But first I'm going to show you the hard way, so that you understand how to handle more complex files.

```
game_theory_data = open("game_theory.csv", "r")

for line in game_theory_data:
    print(line)

game_theory_data.close()
```

Challenge problem:

Level 1: Write a function that takes a string containing three numbers and returns a list of those three numbers as ints. e.g. `list_of_ints("10,5745,655")` should return `[10, 5745, 655]`.

Level 2: Write a loop that calls that function on each line of the file except the first (you can print each line if you want to verify that it's working).

Level 3: Modify your loop from Level 2 so that it creates a list of the number of cooperators at each point in time (the value in column 1 on each line).

Fortunately, there's a better way. That way is to use a library called pandas. In order to use pandas, you might need to install it first. You can do so by opening Anaconda Navigator, clicking the Environments tab on the left, searching for the "pandas" package, and clicking "Apply".

Now that we have pandas installed, we need to import it into our workspace, so that Python knows about all of the functions inside it:

`import pandas`

Pandas has a function for reading CSV data, since it's such a common format:

`data = pandas.read_csv("game_theory.csv")`

This loads all of the data in game_theory.csv into an object called a dataframe. Dataframes are designed to give you a lot of useful features for working with data, and function similarly to data frames in R.

We can access specific locations in the dataframe like this:

`data.iloc[0, 0]`

Note that the bracket operator is just like the one we used to grab specific values from lists and strings, except dataframes have two dimensions, so we need to specify a value for each: the first value is the row, the second value is the column.
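To make that concrete, here's a minimal sketch of `iloc` indexing on a small hand-built dataframe. The column names match the ones we'll use from game_theory.csv, but the values here are made up for illustration:

```python
import pandas as pd

# A tiny stand-in for game_theory.csv -- these values are invented
data = pd.DataFrame({"time": [0, 1, 2],
                     "num_coop": [50, 48, 47],
                     "num_defect": [50, 52, 53]})

print(data.iloc[0, 0])  # row 0, column 0: the first "time" value, 0
print(data.iloc[2, 1])  # row 2, column 1: the third "num_coop" value, 47
```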
We can also use slices here:

`data.iloc[3:6, 0:2]`

If we want to use the column names rather than their indices, the syntax is similar:

`data.loc[:, "time"]`

We can even get slices by name:

`data.loc[:, "num_coop":"num_defect"]`

Pandas dataframes have a bunch of handy methods:

`data.max()` returns the maximum value. By default, it returns the max within each column (axis=0). You can tell it to calculate the maximum along the other axis (within each row) by passing in 1 as an argument: `data.max(1)`. You can do this for most of the following methods.

`data.min()` returns the minimum value

`data.count()` counts the number of values in each column or row

`data.mean()` returns the average

`data.median()` returns the median

`data.describe()` prints summary statistics

# Masking data

Sometimes, we only want to work with a subset of our data, based on some criterion. We can use logical operators on a dataframe to create what's called a mask dataframe. This is a dataframe of the same dimensions that contains only Trues and Falses: Trues correspond to values in the dataframe that met the logical criterion, and Falses correspond to values that didn't.

For instance, it looks like the experiment stalls at the end, and there are always 6393 cooperators. Let's make a mask that excludes those values:

`data != 6393`

We can use this mask to replace all values corresponding to False with NaN ("not a number"):

`data[data != 6393]`

This is useful because pandas is smart about ignoring NaNs when we compute statistics like the mean and the median.

# Plotting

Pandas also has built-in plotting:

`data.plot.line()`

You can make other kinds of plots too! Some need additional information. For instance, scatter needs to know what you want on the x and y axes:

`data.plot.scatter("num_coop", "num_defect")`

Pandas plotting is all well and good, but it doesn't support all types of plots, so it's good to supplement pandas with Seaborn.

# Subsetting

You can also grab subsets of dataframes in pandas based on the value of a column or row.
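Row subsetting uses a boolean mask under the hood, just like the masking we did above. Here's a minimal sketch on a small hand-built dataframe (the values are invented; 6393 stands in for the stalled rows described earlier):

```python
import pandas as pd

# Made-up stand-in data: the experiment "stalls" at 6393 cooperators
data = pd.DataFrame({"time": [0, 25, 75, 100],
                     "num_coop": [10, 25, 6393, 6393]})

late = data[data["time"] > 50]        # keep only the rows where time > 50
print(len(late))                      # 2 rows survive

masked = data[data != 6393]           # element-wise mask: failing values become NaN
print(masked["num_coop"].mean())      # NaNs are ignored: (10 + 25) / 2 = 17.5
```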
`data[data.time > 50]`

# Putting it all together

Now, let's clean up some of the code we wrote yesterday. At the end of yesterday, we had:

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")

data_lines = []
dn_over_ds = []
dn = []
ds = []

for i, line in enumerate(data_file):
    if line.strip() == "Nei & Gojobori 1986. dN/dS (dN, dS)":
        data_lines = list(range(i+4, i+9))
    if i in data_lines:
        line = line.split()
        line = line[1:]
        for j in range(len(line)):
            line[j] = float(line[j].strip("()"))
        for j in range(0, len(line), 3):
            dn_over_ds.append(line[j])
        for j in range(1, len(line), 3):
            dn.append(line[j])
        for j in range(2, len(line), 3):
            ds.append(line[j])

print(dn_over_ds)
print(dn)
print(ds)

data_file.close()
```

There is a lot going on here. We can make it less overwhelming by breaking it into functions. First, let's split out the part that cleans up the lines into its own function:

```
def cleanup(line):
    line = line[1:]
    for j in range(len(line)):
        line[j] = float(line[j].strip("()"))
    return line
```

Now we can simplify the original code:

```
data_file = open("Desktop/intermediatefiles/0009021/0009021.cluster.aa.fa.aln.nt.phy.trim.paml.model2", "r")

data_lines = []
dn_over_ds = []
dn = []
ds = []

for i, line in enumerate(data_file):
    if line.strip() == "Nei & Gojobori 1986. dN/dS (dN, dS)":
        data_lines = list(range(i+4, i+9))
    if i in data_lines:
        line = line.split()
        line = cleanup(line)
        for j in range(0, len(line), 3):
            dn_over_ds.append(line[j])
        for j in range(1, len(line), 3):
            dn.append(line[j])
        for j in range(2, len(line), 3):
            ds.append(line[j])

print(dn_over_ds)
print(dn)
print(ds)

data_file.close()
```

Now let's write functions that extract the relevant pieces of information.
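One possible implementation of the extraction functions, sketched directly from the striding loops above (every cleaned line holds repeating triples of dN/dS, dN, dS, so each function grabs every third item starting from a different offset):

```python
def get_dn_over_ds(line):
    # dN/dS values are every third item, starting at index 0
    return [line[j] for j in range(0, len(line), 3)]

def get_dn(line):
    # dN values are every third item, starting at index 1
    return [line[j] for j in range(1, len(line), 3)]

def get_ds(line):
    # dS values are every third item, starting at index 2
    return [line[j] for j in range(2, len(line), 3)]
```

Each function returns a list, which is why the code below concatenates with `+=` instead of calling `append`.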
```
import pandas as pd

def process_data_file(filename):
    # Open the data file in read mode
    data_file = open(filename, "r")

    # Initialize empty lists to store things in later
    data_lines = []
    dn_over_ds = []
    dn = []
    ds = []
    labels = []

    for i, line in enumerate(data_file):
        # We know that the following line is always 4 lines before the lines we care about.
        # We also know that there are 4 lines we care about.
        if line.strip() == "Nei & Gojobori 1986. dN/dS (dN, dS)":
            data_lines = list(range(i+4, i+8))  # record the indices of the lines that we're looking for
        if i in data_lines:  # if the current line is one of the ones we're looking for
            line = line.split()  # convert the line (a string) into a list, broken at spaces
            labels.append(line[0])  # the label is the first item in the list; we want to keep track of it
            line = cleanup(line)  # remove extra parentheses, convert strings to floats
            dn_over_ds += get_dn_over_ds(line)  # the get_dn_over_ds function returns a list of all dN/dS values
            dn += get_dn(line)  # the get_dn function returns a list of all dN values
            ds += get_ds(line)  # the get_ds function returns a list of all dS values
            # Because all of the above functions return lists, and we are adding them to
            # preexisting lists, we use += rather than append. This is called concatenation.

    # We want to have lists of labels in the same order as the lists of dNs and dSs
    col_labels = get_col_labels(labels)
    row_labels = get_row_labels(labels)

    data_file.close()

    data = pd.DataFrame([col_labels, row_labels, dn_over_ds, dn, ds]).T
    data.columns = ["col_label", "row_label", "dn/ds", "dn", "ds"]
    return data
```