Data in Python: Intro to Pandas & a bit of Seaborn
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

By: Carlos Oliver (aka. @dondraper

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’
)


Video Walkthrough

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’


Motivation
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

  • Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More โ†’
  • The most widely used software for dealing with this kind of data is Microsoft Excel.

Why use Python?
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

  • I'll give three (four) reasons why you might want to try working with this kind of data in Python instead
    1. It's free.
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More โ†’
    2. Automation and complex manipulations is much easier
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More โ†’
    3. Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.)
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More โ†’
    4. Clicking on buttons hurts
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More โ†’
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More โ†’
      Image Not Showing Possible Reasons
      • The image file may be corrupted
      • The server hosting the image is unavailable
      • The image path is incorrect
      • The image format is not supported
      Learn More โ†’

Objectives
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

  1. Loading a dataset in Pandas
  2. Inspecting it
  3. Basic manipulations
  4. Visualization
  • These libraries are super extensive.
  • This is not an exhaustive coverage of pandas, this is just to get a small taste and become accustomed to learning from documentation to best suit your needs.

Getting a Dataset

  • I browsed Kaggle for some interesting COVID-related datasets (of course)
  • We'll be using the COVID by US county dataset.

Disclaimer: This is a real dataset, but the following visualizations and statistics are done purely for illustration purposes, not as a real analysis of COVID.


  1. Download the CSV file (Comma Separated Values). A CSV is basically a text version of an Excel Table.
  2. Place the file in a folder my_project/data/ (we'll be working in the my_project folder)

Loading the file in Python

Create a file in my_project/ called covid.py

  • If the source file is a CSV. doc
import pandas as pd
df = pd.read_csv('../data/covid.csv')

  • If the source file is an Excel File. doc
import pandas as pd
df = pd.read_excel('../data/covid.xls')

DataFrames

  • The core object that Pandas uses is called a DataFrame.
  • We loaded our data into a DataFrame called df
  • This is object stores the table data and supports most of the needed functionality.
  • Let's take a look, using the head method which prints the first few rows of the table.

print(df.head())
         date     county       state     fips  cases  deaths
0  2020-01-21  Snohomish  Washington  53061.0      1       0
1  2020-01-22  Snohomish  Washington  53061.0      1       0
2  2020-01-23  Snohomish  Washington  53061.0      1       0
3  2020-01-24       Cook    Illinois  17031.0      1       0
4  2020-01-24  Snohomish  Washington  53061.0      1       0
  • Each row is a day with number of deaths and cases for each US county.

Selections: Columns

  • To get a list of columns
print(df.columns)
Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')
  • To get a single column
dates = df['date']
print(dates)

Output:

0         2020-01-21
1         2020-01-22
2         2020-01-23
3         2020-01-24
4         2020-01-24
5         2020-01-25
...
  • To get multiple columns
print(df[['date', 'county']])

Selections: Row

  • To select a specific row, we use the iloc attribute. docs
print(df.iloc[2])
date      2020-01-23
county     Snohomish
state     Washington
fips           53061
cases              1
deaths             0
Name: 2, dtype: object

  • You can then select columns within a row
print(df.iloc[2]['date'])
2020-01-23

Advanced Selection

  • Row and column selection methods are very extensive and powerful. You should really read the docs for a full understanding.
  • Here is an example of getting the rows where deaths exceed cases by at least a factor of 2
  • .loc is a powerful attribute which can take a condition on the column values to filter the rows.
print(df.loc[df['deaths'] > df['cases'] * 2 ])
              date   county         state  fips  cases  deaths
26292   2020-04-02  Unknown    California   NaN      0       1
28473   2020-04-03  Unknown       Arizona   NaN      0       4
29475   2020-04-03  Unknown     Minnesota   NaN      0       3
29803   2020-04-03  Unknown      New York   NaN     37     608
30135   2020-04-03  Unknown  Pennsylvania   NaN      0       1
30475   2020-04-03  Unknown          Utah   NaN      0       3

Adding columns

  • Maybe we want to flag certain counties with lots of cases.
df['hot'] = df['cases']> 100
print(df.head())
         date     county       state     fips  cases  deaths    hot
0  2020-01-21  Snohomish  Washington  53061.0      1       0  False
1  2020-01-22  Snohomish  Washington  53061.0      1       0  False
2  2020-01-23  Snohomish  Washington  53061.0      1       0  False
3  2020-01-24       Cook    Illinois  17031.0      1       0  False
4  2020-01-24  Snohomish  Washington  53061.0      1       0  False

Writing DataFrames

  • Let's say we want to publish our DataFrame with the hot column.
  • We simply write it to a new CSV file which can be loaded later.
df.to_csv('new_data.csv')
#or to excel format
df.to_excel('new_data.xls')


Grouping

  • Maybe we want to know averages for the columns per state
  • the groupby method comes in handy here.
#groupby object
state_grouped = df.groupby(['state'])
#new dataframe
means_df = state_grouped.mean()
                       state          fips        cases      deaths       hot
0                    Alabama   1067.804565    95.768957    3.544569  0.222011
1                     Alaska   2148.606250    23.179455    0.465347  0.059406
2                    Arizona   4014.058216   411.876394   18.223048  0.372677
3                   Arkansas   5075.446098    37.349700    0.763608  0.077983
4                 California   6058.734863   671.394359   25.821298  0.382446
5                   Colorado   8062.313048   206.642279   10.232297  0.233342

The logic behind grouping is a bit involved so I suggest reading the docs.


Sorting

  • Which states have the most daily deaths on average?

print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])
                       state      deaths
33                  New York  274.444252
31                New Jersey  212.262184
6                Connecticut  171.599364
22             Massachusetts  152.209790
8       District of Columbia  140.775000
41               Puerto Rico   58.486486

  • I used the sort_values method with ascending=False for a decreasing sort by deaths
  • Then I print the state and deaths columns.

Plotting with Seaborn
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’

  • Seaborn is a wrapper on top of matplotlib which works nicely with DataFrames.

Deaths per Day

  • Let's compare deaths over time between three states.
  • Notice the use of isin() to filter rows whose column values are within a list of accepted values.
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.lineplot(x="date", y="deaths", hue="state" data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])
plt.show()
#I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
    if ind % 30 == 0:  # every 30th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’


Mortality by State

  • Let's do one more.
  • Let's group by state again and instead of averaging take totals (sum)
  • We also add a new column called mortality
tot = df.groupby(['state'], as_index=False).sum()
tot['mortality'] = tot['deaths'] / tot['cases']
tot = tot.sort_values(by='mortality', ascending=False)
sns.barplot(x='state', y='mortality', data=tot)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More โ†’