---
title: Pandas
tags: Templates, Talk
description: View the slide with "Slide Mode".
---
# Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean:
By: Carlos Oliver (aka. @dondraper :man_in_business_suit_levitating:)
---
### Video Walkthrough
{%youtube 5iu6HRuvZ4M %}
---
### Motivation :zap:
* Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences :microscope:
* The most widely used software for dealing with this kind of data is Microsoft Excel.
---
### Why use Python? :snake:
* I'll give three (four) reasons why you might want to try working with this kind of data in Python instead
1. It's free. :money_with_wings:
2. Automation and complex manipulations is much easier :pie:
3. Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.) :chart_with_upwards_trend:
4. Clicking on buttons hurts :hospital: :mouse: :no_entry_sign:
---
### Objectives :goal_net:
1. Loading a dataset in Pandas
2. Inspecting it
3. Basic manipulations
4. Visualization
:::danger
* These libraries are super extensive.
* This is not an exhaustive coverage of pandas, this is just to get a small taste and become accustomed to learning from [documentation](https://pandas.pydata.org/) to best suit your needs.
:::
---
### Getting a Dataset
* I browsed [Kaggle](www.kaggle.com) for some interesting COVID-related datasets (of course)
* We'll be using the [COVID by US county](https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset) dataset.
::: danger
**Disclaimer**: This is a real dataset, but the following visualizations and statistics are done purely for illustration purposes, not as a real analysis of COVID.
:::
---
:::info
1. Download the **CSV** file (Comma Separated Values). A CSV is basically a text version of an Excel Table.
2. Place the file in a folder `my_project/data/` (we'll be working in the `my_project` folder)
:::
---
### Loading the file in Python
:::info
Create a file in `my_project/` called `covid.py`
:::
* If the source file is a CSV. [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv)
```python
import pandas as pd
df = pd.read_csv('../data/covid.csv')
```
---
* If the source file is an Excel File. [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
```python
import pandas as pd
df = pd.read_excel('../data/covid.xls')
```
---
### DataFrames
* The core object that Pandas uses is called a DataFrame.
* We loaded our data into a DataFrame called `df`
* This is object stores the table data and supports most of the needed functionality.
* Let's take a look, using the `head` method which prints the first few rows of the table.
---
```python
print(df.head())
```
::: warning
```
date county state fips cases deaths
0 2020-01-21 Snohomish Washington 53061.0 1 0
1 2020-01-22 Snohomish Washington 53061.0 1 0
2 2020-01-23 Snohomish Washington 53061.0 1 0
3 2020-01-24 Cook Illinois 17031.0 1 0
4 2020-01-24 Snohomish Washington 53061.0 1 0
```
:::
* Each row is a day with number of deaths and cases for each US county.
---
### Selections: Columns
* To get a list of columns
```python
print(df.columns)
```
::: warning
```
Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')
```
:::
* To get a single column
```python
dates = df['date']
print(dates)
```
Output:
```
0 2020-01-21
1 2020-01-22
2 2020-01-23
3 2020-01-24
4 2020-01-24
5 2020-01-25
...
```
* To get multiple columns
```python
print(df[['date', 'county']])
```
---
### Selections: Row
* To select a specific row, we use the `iloc` attribute. [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
```python
print(df.iloc[2])
```
::: warning
```
date 2020-01-23
county Snohomish
state Washington
fips 53061
cases 1
deaths 0
Name: 2, dtype: object
```
:::
---
* You can then select columns within a row
```python
print(df.iloc[2]['date'])
```
::: warning
```
2020-01-23
```
:::
---
### Advanced Selection
* Row and column selection methods are very extensive and powerful. You should really read the [docs]() for a full understanding.
* Here is an example of getting the rows where deaths exceed cases by at least a factor of 2
* `.loc` is a powerful attribute which can take a condition on the column values to filter the rows.
```python
print(df.loc[df['deaths'] > df['cases'] * 2 ])
```
::: warning
```
date county state fips cases deaths
26292 2020-04-02 Unknown California NaN 0 1
28473 2020-04-03 Unknown Arizona NaN 0 4
29475 2020-04-03 Unknown Minnesota NaN 0 3
29803 2020-04-03 Unknown New York NaN 37 608
30135 2020-04-03 Unknown Pennsylvania NaN 0 1
30475 2020-04-03 Unknown Utah NaN 0 3
```
:::
---
### Adding columns
* Maybe we want to flag certain counties with lots of cases.
```python
df['hot'] = df['cases']> 100
print(df.head())
```
::: warning
```
date county state fips cases deaths hot
0 2020-01-21 Snohomish Washington 53061.0 1 0 False
1 2020-01-22 Snohomish Washington 53061.0 1 0 False
2 2020-01-23 Snohomish Washington 53061.0 1 0 False
3 2020-01-24 Cook Illinois 17031.0 1 0 False
4 2020-01-24 Snohomish Washington 53061.0 1 0 False
```
:::
---
### Writing DataFrames
* Let's say we want to publish our DataFrame with the `hot` column.
* We simply write it to a new CSV file which can be loaded later.
```python
df.to_csv('new_data.csv')
#or to excel format
df.to_excel('new_data.xls')
```
---
### Grouping
* Maybe we want to know averages for the columns **per state**
* the `groupby` method comes in handy here.
```python
#groupby object
state_grouped = df.groupby(['state'])
#new dataframe
means_df = state_grouped.mean()
```
::: warning
```
state fips cases deaths hot
0 Alabama 1067.804565 95.768957 3.544569 0.222011
1 Alaska 2148.606250 23.179455 0.465347 0.059406
2 Arizona 4014.058216 411.876394 18.223048 0.372677
3 Arkansas 5075.446098 37.349700 0.763608 0.077983
4 California 6058.734863 671.394359 25.821298 0.382446
5 Colorado 8062.313048 206.642279 10.232297 0.233342
```
:::
:::info
The logic behind grouping is a bit involved so I suggest reading the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).
:::
---
### Sorting
* Which states have the most daily deaths on average?
---
```python
print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])
```
::: warning
```
state deaths
33 New York 274.444252
31 New Jersey 212.262184
6 Connecticut 171.599364
22 Massachusetts 152.209790
8 District of Columbia 140.775000
41 Puerto Rico 58.486486
```
---
:::
* I used the [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method with `ascending=False` for a decreasing sort by `deaths`
* Then I print the `state` and `deaths` columns.
---
### Plotting with Seaborn :ocean:
* [Seaborn](https://seaborn.pydata.org/) is a _wrapper_ on top of matplotlib which works nicely with DataFrames.
---
### Deaths per Day
* Let's compare deaths over time between three states.
* Notice the use of [`isin()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) to filter rows whose column values are within a list of accepted values.
```python
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.lineplot(x="date", y="deaths", hue="state" data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])
plt.show()
```
::: spoiler
```python
#I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
if ind % 30 == 0: # every 30th label is kept
label.set_visible(True)
else:
label.set_visible(False)
```
:::
---
### :hatching_chick:
![](https://i.imgur.com/XHkGRi4.png)
---
### Mortality by State
* Let's do one more.
* Let's group by state again and instead of averaging take totals (sum)
* We also add a new column called `mortality`
```python
tot = df.groupby(['state'], as_index=False).sum()
tot['mortality'] = tot['deaths'] / tot['cases']
tot = tot.sort_values(by='mortality', ascending=False)
sns.barplot(x='state', y='mortality', data=tot)
```
---
### :hatching_chick:
![](https://i.imgur.com/L2ltpOg.png)
---