<a target="_blank" href="https://drive.google.com/file/d/1qX5Hr4L59cHP6nHFO2FDaWeFp4DunYHi/view?usp=sharing"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
<br/>
# Intro to pandas
pandas is a [Python](https://www.python.org/) package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real-world** data analysis in Python.
The two primary data structures of pandas, [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series "pandas.Series") (1-dimensional) and [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame") (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.
## Getting started
**Import pandas**
```python
import pandas as pd
```
**Creating a Series object**
A Pandas `Series` is a one-dimensional array of indexed data. It can be created from a list or array as follows:
```python
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
```
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
Like with a list, data can be accessed by the associated index via the familiar Python square-bracket notation:
```python
data[1]
```
0.5
```python
data[2:]
```
2 0.75
3 1.00
dtype: float64
**Creating a DataFrame object**
```python
raw_data = [
['Tokyo', 13298000],
['Seoul', 10049000],
['Moskva', 12197500],
['HCMC', 8993082],
['Hanoi', 8053663],
]
df = pd.DataFrame(raw_data, columns=['City', 'Population'])
df
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>City</th>
<th>Population</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Tokyo</td>
<td>13298000</td>
</tr>
<tr>
<th>1</th>
<td>Seoul</td>
<td>10049000</td>
</tr>
<tr>
<th>2</th>
<td>Moskva</td>
<td>12197500</td>
</tr>
<tr>
<th>3</th>
<td>HCMC</td>
<td>8993082</td>
</tr>
<tr>
<th>4</th>
<td>Hanoi</td>
<td>8053663</td>
</tr>
</tbody>
</table>
</div>
Another way:
```python
raw_data = {
'City': ['Tokyo', 'Seoul', 'Moskva', 'HCMC', 'Hanoi'],
'Population': [13298000, 10049000, 12197500, 8993082, 8053663]
}
pd.DataFrame(raw_data)
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>City</th>
<th>Population</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Tokyo</td>
<td>13298000</td>
</tr>
<tr>
<th>1</th>
<td>Seoul</td>
<td>10049000</td>
</tr>
<tr>
<th>2</th>
<td>Moskva</td>
<td>12197500</td>
</tr>
<tr>
<th>3</th>
<td>HCMC</td>
<td>8993082</td>
</tr>
<tr>
<th>4</th>
<td>Hanoi</td>
<td>8053663</td>
</tr>
</tbody>
</table>
</div>
**Reading data from a csv file**
```python
salaries = pd.read_csv('https://raw.githubusercontent.com/dhminh1024/practice_datasets/master/salaries.csv')
```
```python
salaries.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 148654 non-null int64
1 EmployeeName 148654 non-null object
2 JobTitle 148654 non-null object
3 BasePay 148045 non-null float64
4 OvertimePay 148650 non-null float64
5 OtherPay 148650 non-null float64
6 Benefits 112491 non-null float64
7 TotalPay 148654 non-null float64
8 TotalPayBenefits 148654 non-null float64
9 Year 148654 non-null int64
10 Notes 0 non-null float64
11 Agency 148654 non-null object
12 Status 0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
**DataFrame methods to get an overview**
| Method | Description |
| --------------- | --------------------------------------- |
| `df.head()` | first five rows |
| `df.tail()` | last five rows |
| `df.sample(5)` | random sample of rows |
| `df.shape` | number of rows/columns |
| `df.describe()` | calculates measures of central tendency |
| `df.info()` | memory footprint and datatypes |
```python
salaries.head()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Id</th>
<th>EmployeeName</th>
<th>JobTitle</th>
<th>BasePay</th>
<th>OvertimePay</th>
<th>OtherPay</th>
<th>Benefits</th>
<th>TotalPay</th>
<th>TotalPayBenefits</th>
<th>Year</th>
<th>Notes</th>
<th>Agency</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>NATHANIEL FORD</td>
<td>GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY</td>
<td>167411.18</td>
<td>0.00</td>
<td>400184.25</td>
<td>NaN</td>
<td>567595.43</td>
<td>567595.43</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>GARY JIMENEZ</td>
<td>CAPTAIN III (POLICE DEPARTMENT)</td>
<td>155966.02</td>
<td>245131.88</td>
<td>137811.38</td>
<td>NaN</td>
<td>538909.28</td>
<td>538909.28</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>ALBERT PARDINI</td>
<td>CAPTAIN III (POLICE DEPARTMENT)</td>
<td>212739.13</td>
<td>106088.18</td>
<td>16452.60</td>
<td>NaN</td>
<td>335279.91</td>
<td>335279.91</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>CHRISTOPHER CHONG</td>
<td>WIRE ROPE CABLE MAINTENANCE MECHANIC</td>
<td>77916.00</td>
<td>56120.71</td>
<td>198306.90</td>
<td>NaN</td>
<td>332343.61</td>
<td>332343.61</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>PATRICK GARDNER</td>
<td>DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)</td>
<td>134401.60</td>
<td>9737.00</td>
<td>182234.59</td>
<td>NaN</td>
<td>326373.19</td>
<td>326373.19</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
salaries.shape
```
(148654, 13)
```python
print('Number of employees: ', salaries.shape[0])
```
Number of employees: 148654
```python
salaries.describe()
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Id</th>
<th>BasePay</th>
<th>OvertimePay</th>
<th>OtherPay</th>
<th>Benefits</th>
<th>TotalPay</th>
<th>TotalPayBenefits</th>
<th>Year</th>
<th>Notes</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<th>count</th>
<td>148654.000000</td>
<td>148045.000000</td>
<td>148650.000000</td>
<td>148650.000000</td>
<td>112491.000000</td>
<td>148654.000000</td>
<td>148654.000000</td>
<td>148654.000000</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>mean</th>
<td>74327.500000</td>
<td>66325.448841</td>
<td>5066.059886</td>
<td>3648.767297</td>
<td>25007.893151</td>
<td>74768.321972</td>
<td>93692.554811</td>
<td>2012.522643</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>std</th>
<td>42912.857795</td>
<td>42764.635495</td>
<td>11454.380559</td>
<td>8056.601866</td>
<td>15402.215858</td>
<td>50517.005274</td>
<td>62793.533483</td>
<td>1.117538</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>min</th>
<td>1.000000</td>
<td>-166.010000</td>
<td>-0.010000</td>
<td>-7058.590000</td>
<td>-33.890000</td>
<td>-618.130000</td>
<td>-618.130000</td>
<td>2011.000000</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>25%</th>
<td>37164.250000</td>
<td>33588.200000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>11535.395000</td>
<td>36168.995000</td>
<td>44065.650000</td>
<td>2012.000000</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>50%</th>
<td>74327.500000</td>
<td>65007.450000</td>
<td>0.000000</td>
<td>811.270000</td>
<td>28628.620000</td>
<td>71426.610000</td>
<td>92404.090000</td>
<td>2013.000000</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>75%</th>
<td>111490.750000</td>
<td>94691.050000</td>
<td>4658.175000</td>
<td>4236.065000</td>
<td>35566.855000</td>
<td>105839.135000</td>
<td>132876.450000</td>
<td>2014.000000</td>
<td>NaN</td>
<td>NaN</td>
</tr>
<tr>
<th>max</th>
<td>148654.000000</td>
<td>319275.010000</td>
<td>245131.880000</td>
<td>400184.250000</td>
<td>96570.660000</td>
<td>567595.430000</td>
<td>567595.430000</td>
<td>2014.000000</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
## Selecting & Filtering
**Select a column**
```python
salaries['EmployeeName']
```
```python
type(salaries['EmployeeName']) # The type of a column is Series
```
pandas.core.series.Series
**Select multiple columns**
```python
salaries[['EmployeeName', 'JobTitle']]
```
```python
type(salaries[['EmployeeName', 'JobTitle']]) # DataFrame
```
**Select data based on index and columns**
```python
salaries.loc[1, 'EmployeeName']
```
'GARY JIMENEZ'
```python
salaries.loc[1, ['EmployeeName', 'JobTitle']]
```
EmployeeName GARY JIMENEZ
JobTitle CAPTAIN III (POLICE DEPARTMENT)
Name: 1, dtype: object
**Select data based on position**
```python
salaries.iloc[1, 1]
```
'GARY JIMENEZ'
```python
salaries.iloc[1, [1, 2]]
```
EmployeeName GARY JIMENEZ
JobTitle CAPTAIN III (POLICE DEPARTMENT)
Name: 1, dtype: object
**Filtering**
_Find the person named "Khoa Trinh":_
```python
salaries[salaries['EmployeeName'] == "Khoa Trinh"]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Id</th>
<th>EmployeeName</th>
<th>JobTitle</th>
<th>BasePay</th>
<th>OvertimePay</th>
<th>OtherPay</th>
<th>Benefits</th>
<th>TotalPay</th>
<th>TotalPayBenefits</th>
<th>Year</th>
<th>Notes</th>
<th>Agency</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<th>36161</th>
<td>36162</td>
<td>Khoa Trinh</td>
<td>Electronic Maintenance Tech</td>
<td>111921.0</td>
<td>146415.32</td>
<td>78057.41</td>
<td>53102.29</td>
<td>336393.73</td>
<td>389496.02</td>
<td>2012</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
_What is the name of highest paid person (TotalPay)?_
```python
salaries['TotalPay'].max()
```
567595.43
```python
salaries[salaries['TotalPay'] == salaries['TotalPay'].max()]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Id</th>
<th>EmployeeName</th>
<th>JobTitle</th>
<th>BasePay</th>
<th>OvertimePay</th>
<th>OtherPay</th>
<th>Benefits</th>
<th>TotalPay</th>
<th>TotalPayBenefits</th>
<th>Year</th>
<th>Notes</th>
<th>Agency</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>NATHANIEL FORD</td>
<td>GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY</td>
<td>167411.18</td>
<td>0.0</td>
<td>400184.25</td>
<td>NaN</td>
<td>567595.43</td>
<td>567595.43</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
_Who had the income greater than $300.000 in 2011?_
```python
salaries[(salaries['TotalPay'] > 300000) & (salaries['Year'] == 2011)]
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Id</th>
<th>EmployeeName</th>
<th>JobTitle</th>
<th>BasePay</th>
<th>OvertimePay</th>
<th>OtherPay</th>
<th>Benefits</th>
<th>TotalPay</th>
<th>TotalPayBenefits</th>
<th>Year</th>
<th>Notes</th>
<th>Agency</th>
<th>Status</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>NATHANIEL FORD</td>
<td>GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY</td>
<td>167411.18</td>
<td>0.00</td>
<td>400184.25</td>
<td>NaN</td>
<td>567595.43</td>
<td>567595.43</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>GARY JIMENEZ</td>
<td>CAPTAIN III (POLICE DEPARTMENT)</td>
<td>155966.02</td>
<td>245131.88</td>
<td>137811.38</td>
<td>NaN</td>
<td>538909.28</td>
<td>538909.28</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>ALBERT PARDINI</td>
<td>CAPTAIN III (POLICE DEPARTMENT)</td>
<td>212739.13</td>
<td>106088.18</td>
<td>16452.60</td>
<td>NaN</td>
<td>335279.91</td>
<td>335279.91</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>CHRISTOPHER CHONG</td>
<td>WIRE ROPE CABLE MAINTENANCE MECHANIC</td>
<td>77916.00</td>
<td>56120.71</td>
<td>198306.90</td>
<td>NaN</td>
<td>332343.61</td>
<td>332343.61</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>PATRICK GARDNER</td>
<td>DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)</td>
<td>134401.60</td>
<td>9737.00</td>
<td>182234.59</td>
<td>NaN</td>
<td>326373.19</td>
<td>326373.19</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>5</th>
<td>6</td>
<td>DAVID SULLIVAN</td>
<td>ASSISTANT DEPUTY CHIEF II</td>
<td>118602.00</td>
<td>8601.00</td>
<td>189082.74</td>
<td>NaN</td>
<td>316285.74</td>
<td>316285.74</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>6</th>
<td>7</td>
<td>ALSON LEE</td>
<td>BATTALION CHIEF, (FIRE DEPARTMENT)</td>
<td>92492.01</td>
<td>89062.90</td>
<td>134426.14</td>
<td>NaN</td>
<td>315981.05</td>
<td>315981.05</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>7</th>
<td>8</td>
<td>DAVID KUSHNER</td>
<td>DEPUTY DIRECTOR OF INVESTMENTS</td>
<td>256576.96</td>
<td>0.00</td>
<td>51322.50</td>
<td>NaN</td>
<td>307899.46</td>
<td>307899.46</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>8</th>
<td>9</td>
<td>MICHAEL MORRIS</td>
<td>BATTALION CHIEF, (FIRE DEPARTMENT)</td>
<td>176932.64</td>
<td>86362.68</td>
<td>40132.23</td>
<td>NaN</td>
<td>303427.55</td>
<td>303427.55</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
<tr>
<th>9</th>
<td>10</td>
<td>JOANNE HAYES-WHITE</td>
<td>CHIEF OF DEPARTMENT, (FIRE DEPARTMENT)</td>
<td>285262.00</td>
<td>0.00</td>
<td>17115.73</td>
<td>NaN</td>
<td>302377.73</td>
<td>302377.73</td>
<td>2011</td>
<td>NaN</td>
<td>San Francisco</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
```python
# Number of people who had income greater than $300k from 2011 to 2014
salaries[salaries['TotalPay'] > 300000].groupby('Year')['Id'].count()
```
Year
2011 10
2012 17
2013 20
2014 14
Name: Id, dtype: int64
_How many job titles appears in the dataset?_
```python
salaries['JobTitle'].unique() # type: numpy array
```
array(['GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY',
'CAPTAIN III (POLICE DEPARTMENT)',
'WIRE ROPE CABLE MAINTENANCE MECHANIC', ..., 'Conversion',
'Cashier 3', 'Not provided'], dtype=object)
```python
salaries['JobTitle'].nunique() # nunique will give the length of the result returned by unique()
```
2159
It's equivalent to:
```python
len(salaries['JobTitle'].unique())
```
_How many times each job title occurs?_
```python
salaries['JobTitle'].value_counts() # type: Series
```
Transit Operator 7036
Special Nurse 4389
Registered Nurse 3736
Public Svc Aide-Public Works 2518
Police Officer 3 2421
...
CONSERVATORSHIP/CASE MGNT SUPV 1
POWER GENERATION TECHNICIAN I 1
Sr Sprv Prob Ofc, Juv Prob 1
ASSISTANT INDUSTRIAL HYGIENIST 1
CHIEF TELEPHONE OPERATOR 1
Name: JobTitle, Length: 2159, dtype: int64
```python
print('Top 5 most common jobs:')
salaries['JobTitle'].value_counts().head(5)
```
Top 5 most common jobs:
Transit Operator 7036
Special Nurse 4389
Registered Nurse 3736
Public Svc Aide-Public Works 2518
Police Officer 3 2421
Name: JobTitle, dtype: int64
## Exercises
- _What is the name of the lowest paid person?_
```python
# YOUR_CODE_HERE
```
- _What is the time range (from which year to which year) of this dataset?_
```python
# YOUR_CODE_HERE
```
- _How many people get paid less than $1000?_
```python
# YOUR_CODE_HERE
```
- _What is the ratio (in percentage %) between rich people (TotalPay > 300.000) and poor people (TotalPay < 1000)?_
```python
# YOUR_CODE_HERE
```
- _Is the number of poor people increased/decreased over years?_
```python
# YOUR_CODE_HERE
```
## References
- [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [San Francisco city employee salary data](https://www.kaggle.com/kaggle/sf-salaries)