Pandas - HackMD

--- title: Pandas tags: Templates, Talk description: View the slide with "Slide Mode". --- # Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean: By: Carlos Oliver (aka. @dondraper :man_in_business_suit_levitating:) --- ### Video Walkthrough {%youtube 5iu6HRuvZ4M %} --- ### Motivation :zap: * Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences :microscope: * The most widely used software for dealing with this kind of data is Microsoft Excel. --- ### Why use Python? :snake: * I'll give three (four) reasons why you might want to try working with this kind of data in Python instead 1. It's free. :money_with_wings: 2. Automation and complex manipulations is much easier :pie: 3. Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.) :chart_with_upwards_trend: 4. Clicking on buttons hurts :hospital: :mouse: :no_entry_sign: --- ### Objectives :goal_net: 1. Loading a dataset in Pandas 2. Inspecting it 3. Basic manipulations 4. Visualization :::danger * These libraries are super extensive. * This is not an exhaustive coverage of pandas, this is just to get a small taste and become accustomed to learning from [documentation](https://pandas.pydata.org/) to best suit your needs. ::: --- ### Getting a Dataset * I browsed [Kaggle](www.kaggle.com) for some interesting COVID-related datasets (of course) * We'll be using the [COVID by US county](https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset) dataset. ::: danger **Disclaimer**: This is a real dataset, but the following visualizations and statistics are done purely for illustration purposes, not as a real analysis of COVID. ::: --- :::info 1. Download the **CSV** file (Comma Separated Values). A CSV is basically a text version of an Excel Table. 2. Place the file in a folder `my_project/data/` (we'll be working in the `my_project` folder) ::: --- ### Loading the file in Python :::info Create a file in `my_project/` called `covid.py` ::: * If the source file is a CSV. [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv) ```python import pandas as pd df = pd.read_csv('../data/covid.csv') ``` --- * If the source file is an Excel File. [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) ```python import pandas as pd df = pd.read_excel('../data/covid.xls') ``` --- ### DataFrames * The core object that Pandas uses is called a DataFrame. * We loaded our data into a DataFrame called `df` * This is object stores the table data and supports most of the needed functionality. * Let's take a look, using the `head` method which prints the first few rows of the table. --- ```python print(df.head()) ``` ::: warning ``` date county state fips cases deaths 0 2020-01-21 Snohomish Washington 53061.0 1 0 1 2020-01-22 Snohomish Washington 53061.0 1 0 2 2020-01-23 Snohomish Washington 53061.0 1 0 3 2020-01-24 Cook Illinois 17031.0 1 0 4 2020-01-24 Snohomish Washington 53061.0 1 0 ``` ::: * Each row is a day with number of deaths and cases for each US county. --- ### Selections: Columns * To get a list of columns ```python print(df.columns) ``` ::: warning ``` Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object') ``` ::: * To get a single column ```python dates = df['date'] print(dates) ``` Output: ``` 0 2020-01-21 1 2020-01-22 2 2020-01-23 3 2020-01-24 4 2020-01-24 5 2020-01-25 ... ``` * To get multiple columns ```python print(df[['date', 'county']]) ``` --- ### Selections: Row * To select a specific row, we use the `iloc` attribute. [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) ```python print(df.iloc[2]) ``` ::: warning ``` date 2020-01-23 county Snohomish state Washington fips 53061 cases 1 deaths 0 Name: 2, dtype: object ``` ::: --- * You can then select columns within a row ```python print(df.iloc[2]['date']) ``` ::: warning ``` 2020-01-23 ``` ::: --- ### Advanced Selection * Row and column selection methods are very extensive and powerful. You should really read the [docs]() for a full understanding. * Here is an example of getting the rows where deaths exceed cases by at least a factor of 2 * `.loc` is a powerful attribute which can take a condition on the column values to filter the rows. ```python print(df.loc[df['deaths'] > df['cases'] * 2 ]) ``` ::: warning ``` date county state fips cases deaths 26292 2020-04-02 Unknown California NaN 0 1 28473 2020-04-03 Unknown Arizona NaN 0 4 29475 2020-04-03 Unknown Minnesota NaN 0 3 29803 2020-04-03 Unknown New York NaN 37 608 30135 2020-04-03 Unknown Pennsylvania NaN 0 1 30475 2020-04-03 Unknown Utah NaN 0 3 ``` ::: --- ### Adding columns * Maybe we want to flag certain counties with lots of cases. ```python df['hot'] = df['cases']> 100 print(df.head()) ``` ::: warning ``` date county state fips cases deaths hot 0 2020-01-21 Snohomish Washington 53061.0 1 0 False 1 2020-01-22 Snohomish Washington 53061.0 1 0 False 2 2020-01-23 Snohomish Washington 53061.0 1 0 False 3 2020-01-24 Cook Illinois 17031.0 1 0 False 4 2020-01-24 Snohomish Washington 53061.0 1 0 False ``` ::: --- ### Writing DataFrames * Let's say we want to publish our DataFrame with the `hot` column. * We simply write it to a new CSV file which can be loaded later. ```python df.to_csv('new_data.csv') #or to excel format df.to_excel('new_data.xls') ``` --- ### Grouping * Maybe we want to know averages for the columns **per state** * the `groupby` method comes in handy here. ```python #groupby object state_grouped = df.groupby(['state']) #new dataframe means_df = state_grouped.mean() ``` ::: warning ``` state fips cases deaths hot 0 Alabama 1067.804565 95.768957 3.544569 0.222011 1 Alaska 2148.606250 23.179455 0.465347 0.059406 2 Arizona 4014.058216 411.876394 18.223048 0.372677 3 Arkansas 5075.446098 37.349700 0.763608 0.077983 4 California 6058.734863 671.394359 25.821298 0.382446 5 Colorado 8062.313048 206.642279 10.232297 0.233342 ``` ::: :::info The logic behind grouping is a bit involved so I suggest reading the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). ::: --- ### Sorting * Which states have the most daily deaths on average? --- ```python print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']]) ``` ::: warning ``` state deaths 33 New York 274.444252 31 New Jersey 212.262184 6 Connecticut 171.599364 22 Massachusetts 152.209790 8 District of Columbia 140.775000 41 Puerto Rico 58.486486 ``` --- ::: * I used the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method with `ascending=False` for a decreasing sort by `deaths` * Then I print the `state` and `deaths` columns. --- ### Plotting with Seaborn :ocean: * [Seaborn](https://seaborn.pydata.org/) is a _wrapper_ on top of matplotlib which works nicely with DataFrames. --- ### Deaths per Day * Let's compare deaths over time between three states. * Notice the use of [`isin()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) to filter rows whose column values are within a list of accepted values. ```python import seaborn as sns import matplotlib.pyplot as plt g = sns.lineplot(x="date", y="deaths", hue="state" data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])]) plt.show() ``` ::: spoiler ```python #I used this to get rid of some x-ticks for easier reading. for ind, label in enumerate(g.get_xticklabels()): if ind % 30 == 0: # every 30th label is kept label.set_visible(True) else: label.set_visible(False) ``` ::: --- ### :hatching_chick: ![](https://i.imgur.com/XHkGRi4.png) --- ### Mortality by State * Let's do one more. * Let's group by state again and instead of averaging take totals (sum) * We also add a new column called `mortality` ```python tot = df.groupby(['state'], as_index=False).sum() tot['mortality'] = tot['deaths'] / tot['cases'] tot = tot.sort_values(by='mortality', ascending=False) sns.barplot(x='state', y='mortality', data=tot) ``` --- ### :hatching_chick: ![](https://i.imgur.com/L2ltpOg.png) ---