By: Carlos Oliver (aka. @dondraper :man_in_business_suit_levitating:)
Disclaimer: This is a real dataset, but the following visualizations and statistics are done purely for illustration purposes, not as a real analysis of COVID.
Put the dataset at my_project/data/ (we'll be working in the my_project folder). Create a file in my_project/ called covid.py.
import pandas as pd
df = pd.read_csv('../data/covid.csv')
If the data comes as an Excel spreadsheet instead, pandas can read that too:

import pandas as pd
df = pd.read_excel('../data/covid.xls')
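As an aside (not part of the original walkthrough): both readers accept many optional arguments. For instance, read_csv can parse the date column into datetime objects while loading, which saves a conversion step later.

# Optional: have pandas parse the 'date' column as datetimes while reading
df = pd.read_csv('../data/covid.csv', parse_dates=['date'])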
The table is now loaded into the DataFrame df. To take a quick look at it, call the head method, which prints the first few rows of the table.

print(df.head())
date county state fips cases deaths
0 2020-01-21 Snohomish Washington 53061.0 1 0
1 2020-01-22 Snohomish Washington 53061.0 1 0
2 2020-01-23 Snohomish Washington 53061.0 1 0
3 2020-01-24 Cook Illinois 17031.0 1 0
4 2020-01-24 Snohomish Washington 53061.0 1 0
To list the column names:

print(df.columns)
Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')
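Two other quick inspection tools worth knowing (my addition, not from the original steps): shape gives the table's dimensions and dtypes lists each column's data type.

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type of each column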
You can select a single column by indexing the DataFrame with its name:

dates = df['date']
print(dates)
Output:
0 2020-01-21
1 2020-01-22
2 2020-01-23
3 2020-01-24
4 2020-01-24
5 2020-01-25
...
To select several columns at once, pass a list of column names:

print(df[['date', 'county']])
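A selected column is a pandas Series, so you can compute on it directly. A quick illustrative sketch (the printed values depend on your copy of the dataset):

print(df['cases'].sum())      # total of the cases column
print(df['deaths'].max())     # largest single value in deaths
print(df['state'].nunique())  # number of distinct states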
To access a row by its position, use the iloc attribute (see the pandas docs for details).

print(df.iloc[2])
date 2020-01-23
county Snohomish
state Washington
fips 53061
cases 1
deaths 0
Name: 2, dtype: object
You can then index into the returned row to pull out a single value:

print(df.iloc[2]['date'])
2020-01-23
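iloc also accepts slices and negative positions, much like Python lists. A small sketch:

print(df.iloc[0:3])  # first three rows
print(df.iloc[-1])   # last row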
.loc is a powerful attribute which can take a condition on the column values to filter the rows. For example, to find rows where deaths are more than twice the cases:

print(df.loc[df['deaths'] > df['cases'] * 2])
date county state fips cases deaths
26292 2020-04-02 Unknown California NaN 0 1
28473 2020-04-03 Unknown Arizona NaN 0 4
29475 2020-04-03 Unknown Minnesota NaN 0 3
29803 2020-04-03 Unknown New York NaN 37 608
30135 2020-04-03 Unknown Pennsylvania NaN 0 1
30475 2020-04-03 Unknown Utah NaN 0 3
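Conditions can also be combined with & (and) and | (or), wrapping each comparison in parentheses. An illustrative filter of my own, not from the original post:

# Washington rows with more than 10 cases
wa_hot = df.loc[(df['state'] == 'Washington') & (df['cases'] > 10)]
print(wa_hot.head())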
Creating a new column is as simple as assigning to it. Let's add a boolean column called hot that flags rows with more than 100 cases:

df['hot'] = df['cases'] > 100
print(df.head())
date county state fips cases deaths hot
0 2020-01-21 Snohomish Washington 53061.0 1 0 False
1 2020-01-22 Snohomish Washington 53061.0 1 0 False
2 2020-01-23 Snohomish Washington 53061.0 1 0 False
3 2020-01-24 Cook Illinois 17031.0 1 0 False
4 2020-01-24 Snohomish Washington 53061.0 1 0 False
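Since hot is a boolean column, value_counts gives a quick tally of how many rows fall on each side of the threshold (my addition, not in the original post):

print(df['hot'].value_counts())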
Now we can write the DataFrame, with its new hot column, back to disk.

df.to_csv('new_data.csv')
# or to Excel format
df.to_excel('new_data.xls')
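One detail worth noting as an aside: by default to_csv also writes the DataFrame's index as an extra column. If you don't want that, pass index=False.

df.to_csv('new_data.csv', index=False)  # skip writing the row index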
What if we want statistics for each state rather than for each row? The groupby method comes in handy here.

# groupby object (as_index=False keeps 'state' as a regular column)
state_grouped = df.groupby(['state'], as_index=False)
# new dataframe with the mean of each numeric column per state
means_df = state_grouped.mean(numeric_only=True)
print(means_df)
state fips cases deaths hot
0 Alabama 1067.804565 95.768957 3.544569 0.222011
1 Alaska 2148.606250 23.179455 0.465347 0.059406
2 Arizona 4014.058216 411.876394 18.223048 0.372677
3 Arkansas 5075.446098 37.349700 0.763608 0.077983
4 California 6058.734863 671.394359 25.821298 0.382446
5 Colorado 8062.313048 206.642279 10.232297 0.233342
The logic behind grouping is a bit involved so I suggest reading the docs.
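If you want several statistics at once instead of just the mean, groupby also works with agg. A small sketch, not part of the original walkthrough:

# mean and maximum deaths per state in one call
summary = df.groupby('state')['deaths'].agg(['mean', 'max'])
print(summary.head())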
print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])
state deaths
33 New York 274.444252
31 New Jersey 212.262184
6 Connecticut 171.599364
22 Massachusetts 152.209790
8 District of Columbia 140.775000
41 Puerto Rico 58.486486
Here we use the sort_values method with ascending=False for a decreasing sort by deaths, keeping only the state and deaths columns.

Next, let's plot deaths over time for a few states. We can use isin() to filter rows whose column values are within a list of accepted values.

import seaborn as sns
import matplotlib.pyplot as plt

g = sns.lineplot(x="date", y="deaths", hue="state", data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])

# I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
    if ind % 30 == 0:  # every 30th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)

plt.show()
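One optional improvement of my own, not in the original code: parsing the date column into real datetime values with pd.to_datetime lets matplotlib space and label the time axis automatically, which can make the manual tick-hiding loop above unnecessary.

# Convert the 'date' strings into datetime objects before plotting
df['date'] = pd.to_datetime(df['date'])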
To wrap up, let's compute a rough mortality rate (total deaths divided by total cases) for each state and plot it.

tot = df.groupby(['state'], as_index=False).sum(numeric_only=True)
tot['mortality'] = tot['deaths'] / tot['cases']
tot = tot.sort_values(by='mortality', ascending=False)
sns.barplot(x='state', y='mortality', data=tot)
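With fifty-odd state names along the x-axis the labels will collide, so rotating them keeps the chart readable. This is a generic matplotlib adjustment I'm adding, not part of the original snippet:

plt.xticks(rotation=90)  # stand the state names upright so they don't overlap
plt.tight_layout()
plt.show()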