Data in Python: Intro to Pandas & a bit of Seaborn
By: Carlos Oliver (aka. @dondraper)
Video Walkthrough
Motivation
- Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences
- The most widely used software for dealing with this kind of data is Microsoft Excel.
Why use Python?
- I'll give three (four) reasons why you might want to try working with this kind of data in Python instead
- It's free.
- Automation and complex manipulations are much easier
- Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.)
- Clicking on buttons hurts
Objectives
- Loading a dataset in Pandas
- Inspecting it
- Basic manipulations
- Visualization
- These libraries are super extensive.
- This is not exhaustive coverage of pandas. The goal is to get a small taste and to get used to learning from the documentation to suit your own needs.
Getting a Dataset
Disclaimer: This is a real dataset, but the following visualizations and statistics are done purely for illustration purposes, not as a real analysis of COVID.
- Download the CSV file (Comma Separated Values). A CSV is basically a text version of an Excel Table.
- Place the file in a folder `my_project/data/` (we'll be working in the `my_project` folder)
Loading the file in Python
Create a file in `my_project/` called `covid.py`
- If the source file is a CSV, load it with `pandas.read_csv` (see its doc page).
- If the source file is an Excel file, load it with `pandas.read_excel` (see its doc page).
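A minimal sketch of the loading step. To keep it runnable anywhere, it reads a tiny made-up CSV from a string instead of the downloaded file; the column names (`date`, `county`, `state`, `cases`, `deaths`) and the path mentioned in the comment are assumptions for illustration, not taken from the real dataset.

```python
import io

import pandas as pd

# Toy CSV text standing in for the downloaded file (values are made up).
csv_text = """date,county,state,cases,deaths
2020-03-01,Snohomish,Washington,2,0
2020-03-02,Snohomish,Washington,4,1
2020-03-02,Cook,Illinois,3,0
"""

# read_csv accepts a file path or any file-like object.
# With the real file you would pass its path instead,
# e.g. a hypothetical "data/covid.csv".
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (number of rows, number of columns)

# For an Excel workbook the analogous call is pd.read_excel(path),
# which needs an Excel engine such as openpyxl installed.
```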
DataFrames
- The core object that Pandas uses is called a DataFrame.
- We loaded our data into a DataFrame called `df`
- This object stores the table data and supports most of the needed functionality.
- Let's take a look, using the `head` method, which returns the first few rows of the table.
- Each row is one day for one US county, with the number of deaths and cases.
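Here is a small sketch of that first look. The DataFrame is built by hand with made-up rows so the snippet runs on its own; the column names are assumed to mirror the dataset.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02",
             "2020-03-02", "2020-03-03", "2020-03-03"],
    "county": ["Snohomish", "Cook", "Snohomish", "Cook", "Snohomish", "Cook"],
    "state": ["Washington", "Illinois", "Washington",
              "Illinois", "Washington", "Illinois"],
    "cases": [1, 2, 3, 4, 5, 6],
    "deaths": [0, 0, 1, 0, 1, 1],
})

first_rows = df.head()   # first 5 rows by default
print(first_rows)

print(df.head(2))        # or ask for a specific number of rows
print(df.columns)        # column names
print(df.dtypes)         # type of each column
```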
Selections: Columns
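A minimal sketch of column selection on a toy DataFrame (made-up values, assumed column names). Indexing with one name gives a Series, and indexing with a list of names gives a smaller DataFrame.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "county": ["Snohomish", "Cook", "Kings"],
    "state": ["Washington", "Illinois", "New York"],
    "cases": [10, 20, 30],
    "deaths": [1, 2, 3],
})

deaths = df["deaths"]              # one column -> Series
subset = df[["state", "deaths"]]   # list of columns -> DataFrame

print(deaths)
print(subset)
```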
Selections: Row
- To select a specific row, we use the `iloc` attribute (see the docs).
- You can then select columns within a row
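Row selection could look something like this. `iloc` picks rows by integer position, starting at 0; the data below is made up and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "county": ["Snohomish", "Cook", "Kings"],
    "state": ["Washington", "Illinois", "New York"],
    "cases": [10, 20, 30],
    "deaths": [1, 2, 3],
})

row = df.iloc[1]                  # second row, returned as a Series
print(row)

row_deaths = df.iloc[1]["deaths"] # a single column within that row
print(row_deaths)

first_two = df.iloc[0:2]          # slices work too (end is exclusive)
print(first_two)
```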
Advanced Selection
- Row and column selection methods are very extensive and powerful. You should really read the docs for a full understanding.
- Here is an example of getting the rows where deaths exceed cases by at least a factor of 2
- `.loc` is a powerful attribute which can take a condition on the column values to filter the rows.
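One way to write that filter, sketched on made-up rows with assumed column names. The comparison produces a boolean Series (one True/False per row), and `.loc` keeps only the rows where it is True.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "county": ["Snohomish", "Cook", "Kings"],
    "cases": [10, 1, 30],
    "deaths": [1, 2, 3],
})

# True where deaths are at least twice the number of cases.
mask = df["deaths"] >= 2 * df["cases"]

odd_rows = df.loc[mask]
print(odd_rows)
```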
Adding columns
- Maybe we want to flag certain counties with lots of cases.
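A possible way to add such a flag, assuming the new column is called `hot` and using an arbitrary cutoff of 1000 cases. The threshold is purely illustrative and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "county": ["Snohomish", "Cook", "Kings"],
    "cases": [500, 1500, 2500],
})

THRESHOLD = 1000  # hypothetical cutoff for "lots of cases"

# Assigning to a new column name creates the column.
df["hot"] = df["cases"] > THRESHOLD
print(df)
```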
Writing DataFrames
- Let's say we want to publish our DataFrame with the `hot` column.
- We simply write it to a new CSV file which can be loaded later.
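A sketch of the write-then-reload round trip. The output file name is hypothetical, and the file goes into a temporary folder so the snippet does not touch your project; in practice you would pick a path such as one inside `my_project/data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy DataFrame with the extra "hot" column (made-up values).
df = pd.DataFrame({
    "county": ["Snohomish", "Cook"],
    "cases": [500, 1500],
    "hot": [False, True],
})

# Hypothetical output file inside a temporary folder.
out_path = Path(tempfile.mkdtemp()) / "covid_hot.csv"

# index=False leaves the row numbers out of the file.
df.to_csv(out_path, index=False)

# Loading it back works like any other CSV.
reloaded = pd.read_csv(out_path)
print(reloaded)
```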
Grouping
- Maybe we want to know averages for the columns per state
- The `groupby` method comes in handy here.
The logic behind grouping is a bit involved so I suggest reading the docs.
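A minimal grouping sketch on made-up rows with assumed column names. The numeric columns are picked explicitly before averaging, since taking the mean of text columns like `county` makes no sense.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "county": ["Snohomish", "King", "Cook", "Lake"],
    "state": ["Washington", "Washington", "Illinois", "Illinois"],
    "cases": [2, 4, 3, 5],
    "deaths": [0, 2, 1, 1],
})

# Split the rows by state, then average each numeric column per group.
state_means = df.groupby("state")[["cases", "deaths"]].mean()
print(state_means)
```

The result is indexed by state, so `state_means.loc["Washington"]` gives the averages for that state.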
Sorting
- Which states have the most daily deaths on average?
- I used the `sort_values` method with `ascending=False` for a decreasing sort by deaths
- Then I print the `state` and `deaths` columns.
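The sort could be sketched as follows, again on made-up rows. Calling `reset_index()` after grouping is my own choice here: it turns `state` back into a regular column so it can be printed next to `deaths`.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "state": ["Washington", "Washington", "Illinois",
              "Illinois", "New York", "New York"],
    "cases": [10, 20, 30, 40, 50, 60],
    "deaths": [0, 2, 1, 5, 10, 20],
})

# Per-state averages, with "state" kept as a regular column.
state_means = df.groupby("state")[["cases", "deaths"]].mean().reset_index()

# Largest average deaths first.
ranked = state_means.sort_values("deaths", ascending=False)

print(ranked[["state", "deaths"]])
```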
Plotting with Seaborn
- Seaborn is a wrapper on top of matplotlib which works nicely with DataFrames.
Deaths per Day
- Let's compare deaths over time between three states.
- Notice the use of `isin()` to filter rows whose column values are within a list of accepted values.

Mortality by State
- Let's do one more.
- Let's group by state again and instead of averaging take totals (sum)
- We also add a new column called `mortality`
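A sketch of how this could be done. The definition of `mortality` used here, total deaths divided by total cases, is my assumption since the formula is not spelled out above; the rows are made up and the bar plot is just one reasonable way to show the result.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this to see a window

import pandas as pd
import seaborn as sns

# Toy stand-in for the loaded dataset (made-up values).
df = pd.DataFrame({
    "state": ["Washington", "Washington", "Illinois", "Illinois"],
    "cases": [100, 100, 50, 150],
    "deaths": [4, 6, 1, 1],
})

# Totals per state instead of averages.
totals = df.groupby("state")[["cases", "deaths"]].sum().reset_index()

# Assumed definition: fraction of cases that ended in death.
totals["mortality"] = totals["deaths"] / totals["cases"]
print(totals)

# One bar per state.
ax = sns.barplot(data=totals, x="state", y="mortality")
ax.set_title("Mortality by state (toy data)")
```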
