Intro to pandas

<a target="_blank" href="https://drive.google.com/file/d/1qX5Hr4L59cHP6nHFO2FDaWeFp4DunYHi/view?usp=sharing"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a> <br/> # Intro to pandas pandas is a [Python](https://www.python.org/) package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real-world** data analysis in Python. The two primary data structures of pandas, [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series "pandas.Series") (1-dimensional) and [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame") (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. ## Getting started **Import pandas** ```python import pandas as pd ``` **Creating a Series object** A Pandas `Series` is a one-dimensional array of indexed data. It can be created from a list or array as follows: ```python data = pd.Series([0.25, 0.5, 0.75, 1.0]) data ``` 0 0.25 1 0.50 2 0.75 3 1.00 dtype: float64 Like with a list, data can be accessed by the associated index via the familiar Python square-bracket notation: ```python data[1] ``` 0.5 ```python data[2:] ``` 2 0.75 3 1.00 dtype: float64 **Creating a DataFrame object** ```python raw_data = [ ['Tokyo', 13298000], ['Seoul', 10049000], ['Moskva', 12197500], ['HCMC', 8993082], ['Hanoi', 8053663], ] df = pd.DataFrame(raw_data, columns=['City', 'Population']) df ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>City</th> <th>Population</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Tokyo</td> <td>13298000</td> </tr> <tr> <th>1</th> <td>Seoul</td> <td>10049000</td> </tr> <tr> <th>2</th> <td>Moskva</td> <td>12197500</td> </tr> <tr> <th>3</th> <td>HCMC</td> <td>8993082</td> </tr> <tr> <th>4</th> <td>Hanoi</td> <td>8053663</td> </tr> </tbody> </table> </div> Another way: ```python raw_data = { 'City': ['Tokyo', 'Seoul', 'Moskva', 'HCMC', 'Hanoi'], 'Population': [13298000, 10049000, 12197500, 8993082, 8053663] } pd.DataFrame(raw_data) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>City</th> <th>Population</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>Tokyo</td> <td>13298000</td> </tr> <tr> <th>1</th> <td>Seoul</td> <td>10049000</td> </tr> <tr> <th>2</th> <td>Moskva</td> <td>12197500</td> </tr> <tr> <th>3</th> <td>HCMC</td> <td>8993082</td> </tr> <tr> <th>4</th> <td>Hanoi</td> <td>8053663</td> </tr> </tbody> </table> </div> **Reading data from a csv file** ```python salaries = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/salaries.csv') ``` ```python salaries.info() ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 148654 entries, 0 to 148653 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 148654 non-null int64 1 EmployeeName 148654 non-null object 2 JobTitle 148654 non-null object 3 BasePay 148045 non-null float64 4 OvertimePay 148650 non-null float64 5 OtherPay 148650 non-null float64 6 Benefits 112491 non-null float64 7 TotalPay 148654 non-null float64 8 TotalPayBenefits 148654 non-null float64 9 Year 148654 non-null int64 10 Notes 0 non-null float64 11 Agency 148654 non-null object 12 Status 0 non-null float64 dtypes: float64(8), int64(2), object(3) memory usage: 14.7+ MB **DataFrame methods to get an overview** | Method | Description | | --------------- | --------------------------------------- | | `df.head()` | first five rows | | `df.tail()` | last five rows | | `df.sample(5)` | random sample of rows | | `df.shape` | number of rows/columns | | `df.describe()` | calculates measures of central tendency | | `df.info()` | memory footprint and datatypes | ```python salaries.head() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Id</th> <th>EmployeeName</th> <th>JobTitle</th> <th>BasePay</th> <th>OvertimePay</th> <th>OtherPay</th> <th>Benefits</th> <th>TotalPay</th> <th>TotalPayBenefits</th> <th>Year</th> <th>Notes</th> <th>Agency</th> <th>Status</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>NATHANIEL FORD</td> <td>GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY</td> <td>167411.18</td> <td>0.00</td> <td>400184.25</td> <td>NaN</td> <td>567595.43</td> <td>567595.43</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>2</td> <td>GARY JIMENEZ</td> <td>CAPTAIN III (POLICE DEPARTMENT)</td> <td>155966.02</td> <td>245131.88</td> <td>137811.38</td> <td>NaN</td> <td>538909.28</td> <td>538909.28</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>2</th> <td>3</td> <td>ALBERT PARDINI</td> <td>CAPTAIN III (POLICE DEPARTMENT)</td> <td>212739.13</td> <td>106088.18</td> <td>16452.60</td> <td>NaN</td> <td>335279.91</td> <td>335279.91</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>3</th> <td>4</td> <td>CHRISTOPHER CHONG</td> <td>WIRE ROPE CABLE MAINTENANCE MECHANIC</td> <td>77916.00</td> <td>56120.71</td> <td>198306.90</td> <td>NaN</td> <td>332343.61</td> <td>332343.61</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>4</th> <td>5</td> <td>PATRICK GARDNER</td> <td>DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)</td> <td>134401.60</td> <td>9737.00</td> <td>182234.59</td> <td>NaN</td> <td>326373.19</td> <td>326373.19</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> </tbody> </table> </div> ```python salaries.shape ``` (148654, 13) ```python print('Number of employees: ', salaries.shape[0]) ``` Number of employees: 148654 ```python salaries.describe() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Id</th> <th>BasePay</th> <th>OvertimePay</th> <th>OtherPay</th> <th>Benefits</th> <th>TotalPay</th> <th>TotalPayBenefits</th> <th>Year</th> <th>Notes</th> <th>Status</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>148654.000000</td> <td>148045.000000</td> <td>148650.000000</td> <td>148650.000000</td> <td>112491.000000</td> <td>148654.000000</td> <td>148654.000000</td> <td>148654.000000</td> <td>0.0</td> <td>0.0</td> </tr> <tr> <th>mean</th> <td>74327.500000</td> <td>66325.448841</td> <td>5066.059886</td> <td>3648.767297</td> <td>25007.893151</td> <td>74768.321972</td> <td>93692.554811</td> <td>2012.522643</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>std</th> <td>42912.857795</td> <td>42764.635495</td> <td>11454.380559</td> <td>8056.601866</td> <td>15402.215858</td> <td>50517.005274</td> <td>62793.533483</td> <td>1.117538</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>min</th> <td>1.000000</td> <td>-166.010000</td> <td>-0.010000</td> <td>-7058.590000</td> <td>-33.890000</td> <td>-618.130000</td> <td>-618.130000</td> <td>2011.000000</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>25%</th> <td>37164.250000</td> <td>33588.200000</td> <td>0.000000</td> <td>0.000000</td> <td>11535.395000</td> <td>36168.995000</td> <td>44065.650000</td> <td>2012.000000</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>50%</th> <td>74327.500000</td> <td>65007.450000</td> <td>0.000000</td> <td>811.270000</td> <td>28628.620000</td> <td>71426.610000</td> <td>92404.090000</td> <td>2013.000000</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>75%</th> <td>111490.750000</td> <td>94691.050000</td> <td>4658.175000</td> <td>4236.065000</td> <td>35566.855000</td> <td>105839.135000</td> <td>132876.450000</td> <td>2014.000000</td> <td>NaN</td> <td>NaN</td> </tr> <tr> <th>max</th> <td>148654.000000</td> <td>319275.010000</td> <td>245131.880000</td> <td>400184.250000</td> <td>96570.660000</td> <td>567595.430000</td> <td>567595.430000</td> <td>2014.000000</td> <td>NaN</td> <td>NaN</td> </tr> </tbody> </table> </div> ## Selecting & Filtering **Select a column** ```python salaries['EmployeeName'] ``` ```python type(salaries['EmployeeName']) # The type of a column is Series ``` pandas.core.series.Series **Select multiple columns** ```python salaries[['EmployeeName', 'JobTitle']] ``` ```python type(salaries[['EmployeeName', 'JobTitle']]) # DataFrame ``` **Select data based on index and columns** ```python salaries.loc[1, 'EmployeeName'] ``` 'GARY JIMENEZ' ```python salaries.loc[1, ['EmployeeName', 'JobTitle']] ``` EmployeeName GARY JIMENEZ JobTitle CAPTAIN III (POLICE DEPARTMENT) Name: 1, dtype: object **Select data based on position** ```python salaries.iloc[1, 1] ``` 'GARY JIMENEZ' ```python salaries.iloc[1, [1, 2]] ``` EmployeeName GARY JIMENEZ JobTitle CAPTAIN III (POLICE DEPARTMENT) Name: 1, dtype: object **Filtering** _Find the person named "Khoa Trinh":_ ```python salaries[salaries['EmployeeName'] == "Khoa Trinh"] ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Id</th> <th>EmployeeName</th> <th>JobTitle</th> <th>BasePay</th> <th>OvertimePay</th> <th>OtherPay</th> <th>Benefits</th> <th>TotalPay</th> <th>TotalPayBenefits</th> <th>Year</th> <th>Notes</th> <th>Agency</th> <th>Status</th> </tr> </thead> <tbody> <tr> <th>36161</th> <td>36162</td> <td>Khoa Trinh</td> <td>Electronic Maintenance Tech</td> <td>111921.0</td> <td>146415.32</td> <td>78057.41</td> <td>53102.29</td> <td>336393.73</td> <td>389496.02</td> <td>2012</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> </tbody> </table> </div> _What is the name of highest paid person (TotalPay)?_ ```python salaries['TotalPay'].max() ``` 567595.43 ```python salaries[salaries['TotalPay'] == salaries['TotalPay'].max()] ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Id</th> <th>EmployeeName</th> <th>JobTitle</th> <th>BasePay</th> <th>OvertimePay</th> <th>OtherPay</th> <th>Benefits</th> <th>TotalPay</th> <th>TotalPayBenefits</th> <th>Year</th> <th>Notes</th> <th>Agency</th> <th>Status</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>NATHANIEL FORD</td> <td>GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY</td> <td>167411.18</td> <td>0.0</td> <td>400184.25</td> <td>NaN</td> <td>567595.43</td> <td>567595.43</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> </tbody> </table> </div> _Who had the income greater than $300.000 in 2011?_ ```python salaries[(salaries['TotalPay'] > 300000) & (salaries['Year'] == 2011)] ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Id</th> <th>EmployeeName</th> <th>JobTitle</th> <th>BasePay</th> <th>OvertimePay</th> <th>OtherPay</th> <th>Benefits</th> <th>TotalPay</th> <th>TotalPayBenefits</th> <th>Year</th> <th>Notes</th> <th>Agency</th> <th>Status</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>1</td> <td>NATHANIEL FORD</td> <td>GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY</td> <td>167411.18</td> <td>0.00</td> <td>400184.25</td> <td>NaN</td> <td>567595.43</td> <td>567595.43</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>1</th> <td>2</td> <td>GARY JIMENEZ</td> <td>CAPTAIN III (POLICE DEPARTMENT)</td> <td>155966.02</td> <td>245131.88</td> <td>137811.38</td> <td>NaN</td> <td>538909.28</td> <td>538909.28</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>2</th> <td>3</td> <td>ALBERT PARDINI</td> <td>CAPTAIN III (POLICE DEPARTMENT)</td> <td>212739.13</td> <td>106088.18</td> <td>16452.60</td> <td>NaN</td> <td>335279.91</td> <td>335279.91</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>3</th> <td>4</td> <td>CHRISTOPHER CHONG</td> <td>WIRE ROPE CABLE MAINTENANCE MECHANIC</td> <td>77916.00</td> <td>56120.71</td> <td>198306.90</td> <td>NaN</td> <td>332343.61</td> <td>332343.61</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>4</th> <td>5</td> <td>PATRICK GARDNER</td> <td>DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)</td> <td>134401.60</td> <td>9737.00</td> <td>182234.59</td> <td>NaN</td> <td>326373.19</td> <td>326373.19</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>5</th> <td>6</td> <td>DAVID SULLIVAN</td> <td>ASSISTANT DEPUTY CHIEF II</td> <td>118602.00</td> <td>8601.00</td> <td>189082.74</td> <td>NaN</td> <td>316285.74</td> <td>316285.74</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>6</th> <td>7</td> <td>ALSON LEE</td> <td>BATTALION CHIEF, (FIRE DEPARTMENT)</td> <td>92492.01</td> <td>89062.90</td> <td>134426.14</td> <td>NaN</td> <td>315981.05</td> <td>315981.05</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>7</th> <td>8</td> <td>DAVID KUSHNER</td> <td>DEPUTY DIRECTOR OF INVESTMENTS</td> <td>256576.96</td> <td>0.00</td> <td>51322.50</td> <td>NaN</td> <td>307899.46</td> <td>307899.46</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>8</th> <td>9</td> <td>MICHAEL MORRIS</td> <td>BATTALION CHIEF, (FIRE DEPARTMENT)</td> <td>176932.64</td> <td>86362.68</td> <td>40132.23</td> <td>NaN</td> <td>303427.55</td> <td>303427.55</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> <tr> <th>9</th> <td>10</td> <td>JOANNE HAYES-WHITE</td> <td>CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)</td> <td>285262.00</td> <td>0.00</td> <td>17115.73</td> <td>NaN</td> <td>302377.73</td> <td>302377.73</td> <td>2011</td> <td>NaN</td> <td>San Francisco</td> <td>NaN</td> </tr> </tbody> </table> </div> ```python # Number of people who had income greater than $300k from 2011 to 2014 salaries[salaries['TotalPay'] > 300000].groupby('Year')['Id'].count() ``` Year 2011 10 2012 17 2013 20 2014 14 Name: Id, dtype: int64 _How many job titles appears in the dataset?_ ```python salaries['JobTitle'].unique() # type: numpy array ``` array(['GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY', 'CAPTAIN III (POLICE DEPARTMENT)', 'WIRE ROPE CABLE MAINTENANCE MECHANIC', ..., 'Conversion', 'Cashier 3', 'Not provided'], dtype=object) ```python salaries['JobTitle'].nunique() # nunique will give the length of the result returned by unique() ``` 2159 It's equivalent to: ```python len(salaries['JobTitle'].unique()) ``` _How many times each job title occurs?_ ```python salaries['JobTitle'].value_counts() # type: Series ``` Transit Operator 7036 Special Nurse 4389 Registered Nurse 3736 Public Svc Aide-Public Works 2518 Police Officer 3 2421 ... CONSERVATORSHIP/CASE MGNT SUPV 1 POWER GENERATION TECHNICIAN I 1 Sr Sprv Prob Ofc, Juv Prob 1 ASSISTANT INDUSTRIAL HYGIENIST 1 CHIEF TELEPHONE OPERATOR 1 Name: JobTitle, Length: 2159, dtype: int64 ```python print('Top 5 most common jobs:') salaries['JobTitle'].value_counts().head(5) ``` Top 5 most common jobs: Transit Operator 7036 Special Nurse 4389 Registered Nurse 3736 Public Svc Aide-Public Works 2518 Police Officer 3 2421 Name: JobTitle, dtype: int64 ## Exercises - _What is the name of the lowest paid person?_ ```python # YOUR_CODE_HERE ``` - _What is the time range (from which year to which year) of this dataset?_ ```python # YOUR_CODE_HERE ``` - _How many people get paid less than $1000?_ ```python # YOUR_CODE_HERE ``` - _What is the ratio (in percentage %) between rich people (TotalPay > 300.000) and poor people (TotalPay < 1000)?_ ```python # YOUR_CODE_HERE ``` - _Is the number of poor people increased/decreased over years?_ ```python # YOUR_CODE_HERE ``` ## References - [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html) - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) - [San Francisco city employee salary data](https://www.kaggle.com/kaggle/sf-salaries)