# [Cleaning Data in Python(Datacamp)](https://www.datacamp.com/courses/cleaning-data-in-python)
###### tags: `Datacamp` `python` `Cleaning Data` `tidying data` `combining data`
> Adapted from [https://hackmd.io/0Euhw7ydReyUfY-kLUnXbg?both](https://hackmd.io/0Euhw7ydReyUfY-kLUnXbg?both)
>**Author: 何彥南**
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
Note: throughout these notes, `df` refers to a pandas DataFrame.
---
[TOC]
---
# Exploring your data
## [1-1] Diagnose data for cleaning
### 1. Cleaning data
* Prepare data for analysis
* Data almost never comes in clean
* Diagnose your data for problems
### 2. Common data problems
* Inconsistent column names
* Missing data
* Outliers
* Duplicate rows
* Untidy data
* Columns that need processing
* Column types can signal unexpected data values
### 3. Unclean data
> For example:

* The table above has several problems:
* Column name inconsistencies
* Missing data
* Country names are in French (a language/encoding issue)
### 4. Load your data | read_csv()
> Load your data
#### **pd.read_csv()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)]
* Common parameters:
* **path** (required): the file location.
* header: which row to use as the column names.
* header=None can be used when the file has no column names; columns are then labeled 0, 1, 2, ...
* The default is header=0, which uses row 0 as the column names.
* names: rename the columns; pass a list.
* usecols: which columns to load; pass a list of column indices or names.
* skiprows: how many rows to skip at the start; pass an int.
* encoding: the encoding used to read the file; pass a string.
```python=
import pandas as pd
df = pd.read_csv('literary_birth_rate.csv')
```
### 5. Visually inspect | head(), tail()
> Inspect the data by eye first, to rule out the errors you can see.

#### **df.head()** [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)]
* The first n rows (default 5).
#### **df.tail()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail)]
* The last n rows (default 5).
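A minimal sketch of both calls, assuming the literary_birth_rate.csv file loaded earlier:
```python=
import pandas as pd

df = pd.read_csv('literary_birth_rate.csv')
print(df.head())   # first 5 rows
print(df.tail(3))  # last 3 rows
```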
### 6. Visually inspect (2) | columns, shape, info()

#### **columns**:
* The column names (iterable).
#### **shape**:
* The dimensions of the table, returned as (rows, columns).
#### **info()**: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)
* Overall information about the table (dtypes, non-null counts, memory usage).
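A quick sketch of these three, assuming df is the DataFrame loaded above:
```python=
print(df.columns)  # column names (an Index object, iterable)
print(df.shape)    # (number of rows, number of columns)
df.info()          # dtype, non-null count, and memory usage per column
```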
## [1-2] Exploratory data analysis
### 1. Data type of each column
> The data type of each column.
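A minimal sketch; df.dtypes is covered in more detail in [4-1]:
```python=
# Each column has a dtype such as int64, float64, or object (strings)
print(df.dtypes)
```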

### 2. Frequency counts: continent | value_counts()
#### **df[column].value_counts()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)]
* Counts how many times each unique value occurs in the given column.
* Returns a Series indexed by the unique values, with the counts as values.
* Note:
* value_counts().values : the counts (one per category).
* value_counts().index : the categories.
* Parameters:
* ascending: bool; the default is descending order, True sorts ascending.
* normalize: bool; return proportions instead of counts.
* dropna: whether to drop missing values.
```python=
df.continent.value_counts(dropna=False)
df['continent'].value_counts(dropna=False)
#[Out]:
'''
AF 49
ASI 47
EUR 36
LAT 24
OCE 6
NAM 2
Name: cont
'''
```
* df.continent and df['continent'] are two ways of referring to the continent column of the DataFrame df.
* The result may contain missing or NaN values, which is something to watch for; this is explained later.
### 3. Summary statistics
> Summary statistics
* Numeric columns: numeric summary information for the table's columns.
* Outliers: unusually high (or low) values (also called extreme values); once spotted, we can investigate them further.
### 4. Summary statistics: Numeric data | describe()
#### **df.describe()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)]
* Shows summary statistics for all numeric columns.
```python=
df.describe()
#[Out]:
'''
female_literacy population
count 164.000000 1.220000e+02
mean 80.301220 6.345768e+07
std 22.977265 2.605977e+08
min 12.600000 1.035660e+05
25% 66.675000 3.778175e+06
50% 90.200000 9.995450e+06
75% 98.500000 2.642217e+07
max 100.000000 2.313000e+09
'''
```
#### practice
```python=
In [2]: # Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))
MANHATTAN 6310
BROOKLYN 2866
QUEENS 2121
BRONX 974
STATEN ISLAND 575
Name: Borough, dtype: int64
In [3]: # Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))
NY 12391
NJ 241
PA 38
CA 20
OH 19
FL 17
IL 17
CT 16
TX 13
TN 10
DC 7
MD 7
GA 6
KS 6
MA 6
VA 5
CO 4
WI 3
AZ 3
SC 3
MN 3
NC 2
UT 2
RI 2
VT 1
NM 1
MI 1
WA 1
IN 1
Name: State, dtype: int64
In [4]: # Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))
NOT APPLICABLE 7806
NaN 4205
ON-SITE 519
OFF-SITE 186
USE UNDER 300 CU.YD 130
Name: Site Fill, dtype: int64
```
## [1-3] Visual exploratory data analysis
### 1. Data visualization
* Great way to spot outliers and obvious errors
* More than just looking for patterns
* Plan data cleaning steps
### 2. Bar plots and histograms
* Bar plots for discrete data counts
* Histograms for continuous data counts
* Look at frequencies (a quick bar-plot sketch follows this list)
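A minimal bar-plot sketch for discrete counts, assuming the df with a continent column used in the value_counts() example above:
```python=
import matplotlib.pyplot as plt

# Bar plot of the discrete counts per continent
df['continent'].value_counts().plot(kind='bar')
plt.show()
```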
### 3. Histogram
> Good for showing the frequency distribution of continuous data.
```python=
df.population.plot(kind='hist')
#[Out]:<matplotlib.axes._subplots.AxesSubplot at 0x7f78e4abafd0>
import matplotlib.pyplot as plt
plt.show()
```

* From this plot we can see outliers at the upper end of the population distribution.
### 4. Identifying the error
> To identify the error, we can use basic pandas filtering to take a closer look at the outliers (continuing from above).
```python=
df[df.population > 1000000000]
#[Out]:
'''continent country female_literacy fertility population
0 ASI Chine 90.5 1.769 1.324655e+09
1 ASI Inde 50.8 2.682 1.139965e+09
162 OCE Australia 96.0 1.930 2.313000e+09
'''
```
* Not all outliers are bad data; you have to judge for yourself.
* Here it is clear that the population value for Australia is wrong.
### 5. Box plots
* Visualize basic summary statistics of a distribution:
* Outliers
* Min/max
* 25th, 50th, 75th percentiles
```python=
df.boxplot(column='population', by='continent')
#[Out]:<matplotlib.axes._subplots.AxesSubplot at 0x7ff5581bb630>
plt.show()
```
* Output:

* The circled points are outliers: two between 1 and 1.5 (billion), and one above 2 (billion).
### 6. Scatter plots
* Relationship between 2 numeric variables
* Flag potentially bad data
* Errors not found by looking at 1 variable
#### practice
- Visualizing single variables with histograms
```python=
In [1]: # Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Describe the column
df['Existing Zoning Sqft'].describe()
Out[1]:
count 1.284600e+04
mean 1.439973e+03
std 3.860757e+04
min 0.000000e+00
25% 0.000000e+00
50% 0.000000e+00
75% 0.000000e+00
max 2.873107e+06
Name: Existing Zoning Sqft, dtype: float64
In [2]: # Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
Out[2]: <matplotlib.axes._subplots.AxesSubplot at 0x7f72977433c8>
In [3]: # Display the histogram
plt.show()
```

- Visualizing multiple variables with boxplots
```python=
In [1]: # Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# Create the boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90)
Out[1]: <matplotlib.axes._subplots.AxesSubplot at 0x7f728c33a0f0>
In [2]: # Display the plot
plt.show()
```

- Visualizing multiple variables with scatter plots
```python=
In [1]: # Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# Create and display the first scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```

```python=
In [2]: # Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```

---
# Tidying data for analysis
## [2-1] Tidy data
* “Tidy Data” paper by Hadley Wickham, PhD
* Formalizes the way we describe the shape of data
* Gives us a goal when formatting our data
* “Standard way to organize data values within a dataset”
### 1. Principles of tidy data
* Columns represent separate variables
* Rows represent individual observations
* Observational units form tables
### 2. Converting to tidy data

* Better for reporting vs. better for analysis
* Tidy data makes it easier to fix common data problems
* The data problem we are trying to fix:
* Columns containing values, instead of variables
### 3. Melting | melt()
> Melting turns several columns that hold the same kind of value into two columns: a variable column (built from the original column names, showing which column each value came from) and a value column (holding the values that used to sit under those columns).
#### pd.melt(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)]
* Parameters:
* frame: the input DataFrame.
* id_vars: columns to keep fixed as identifiers.
* value_vars: columns to unpivot into a single variable column.
* var_name: name of the resulting variable column.
* value_name: name of the resulting value column.
```python=
pd.melt(frame=df,id_vars='name',value_vars=['treatment a', 'treatment b'],
var_name='treatment', value_name='result')
#[Out]:
'''
name treatment result
0 Daniel treatment a _
1 John treatment a 12
2 Jane treatment a 24
3 Daniel treatment b 42
4 John treatment b 31
5 Jane treatment b 27
'''
```
* Before melting, each row is keyed by person (so name is set as id_vars), recording each person's result for the two treatments.
* Some table shapes are suited to reporting but not to analysis; we need to know which shape suits our purpose.
* After pd.melt(), each row represents a single treatment, which makes the subsequent analysis easier and clearer.

#### practice
- Reshaping your data using melt
```python=
In [1]: # Print the head of airquality
print(airquality.head())
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
In [2]: # Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])
In [3]: # Print the head of airquality_melt
print(airquality_melt.head())
Month Day variable value
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
```
- Customizing melted data
```python=
In [1]: # Print the head of airquality
print(airquality.head())
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
In [2]: # Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')
In [3]: # Print the head of airquality_melt
print(airquality_melt.head())
Month Day measurement reading
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
```
## [2-2] Pivoting data
### 1. Pivot: un-melting data | pivot()
> Turn rows back into columns.

* Opposite of melting
* In melting, we turned columns into rows
* **Pivoting: turn unique values into separate columns**
* Analysis friendly shape to reporting friendly shape
* **Violates tidy data principles:**
* **rows contain observations**
* **Multiple variables stored in the same column**
#### df.pivot(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot)]
* Parameters:
* index: column to use as the new index.
* columns: column whose unique (categorical) values become the new columns.
* values: column whose values fill the new table.
```python=
weather_tidy = weather.pivot(index='date',columns='element',values='value')
print(weather_tidy)
#[Out]:
'''element tmax tmin
date
2010-01-30 27.8 14.5
2010-02-02 27.3 14.4
'''
```
### 2. Using pivot when you have duplicate entries
> When a given index/column combination has more than one value, pivot() raises the following error.

* Here, with date as the index, 2010-02-02 has two tmin values, so pivot() cannot reshape the data.
```python=
import numpy as np
weather2_tidy = weather.pivot(values='value',
index='date',
columns='element')
#[Out]:
'''---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-2962bb23f5a3> in <module>()
1 weather2_tidy = weather2.pivot(values='value',
2 index='date',
----> 3 columns='element')
ValueError: Index contains duplicate entries, cannot reshape
'''
```
### 3. Pivot table | pivot_table()
> Compared with pivot(), pivot_table() has extra parameters that let it handle the duplicate-entry error above.
#### df.pivot_table(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table)]
* Parameters:
* index: column to use as the new index.
* columns: column whose unique (categorical) values become the new columns.
* values: column whose values fill the new table.
* aggfunc: the function used to aggregate duplicate values.
* Has a parameter that specifies how to deal with duplicate values
* Example: Can aggregate the duplicate values by taking their average
```python=
weather2_tidy = weather.pivot_table(values='value',index='date',
columns='element',aggfunc=np.mean)
#[Out]:
'''element tmax tmin
date
2010-01-30 27.8 14.5
2010-02-02 27.3 15.4
'''
```
* Setting the aggfunc parameter tells pandas how to combine the duplicate values; here we use the mean (np.mean).
#### practice
```python=
In [1]: # Print the head of airquality_melt
print(airquality_melt.head())
Month Day measurement reading
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
In [2]: # Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')
In [3]: # Print the head of airquality_pivot
print(airquality_pivot.head())
measurement Ozone Solar.R Temp Wind
Month Day
5 1 41.0 190.0 67.0 7.4
2 36.0 118.0 72.0 8.0
3 12.0 149.0 74.0 12.6
4 18.0 313.0 62.0 11.5
5 NaN NaN 56.0 14.3
```
* Resetting the index of a DataFrame
```python=
In [1]: # Print the index of airquality_pivot
print(airquality_pivot.index)
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
names=['Month', 'Day'])
In [2]: # Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
In [3]: # Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
RangeIndex(start=0, stop=153, step=1)
In [4]: # Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
measurement Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67.0 7.4
1 5 2 36.0 118.0 72.0 8.0
2 5 3 12.0 149.0 74.0 12.6
3 5 4 18.0 313.0 62.0 11.5
4 5 5 NaN NaN 56.0 14.3
```
* Pivoting duplicate values
```python=
In [1]: # Pivot table the airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'],
columns='measurement', values='reading', aggfunc=np.mean)
In [2]: # Print the head of airquality_pivot before reset_index
print(airquality_pivot)
measurement Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67.0 7.4
1 5 2 36.0 118.0 72.0 8.0
2 5 3 12.0 149.0 74.0 12.6
.. ... ... ... ... ... ...
149 9 27 NaN 145.0 77.0 13.2
150 9 28 14.0 191.0 75.0 14.3
151 9 29 18.0 131.0 76.0 8.0
152 9 30 20.0 223.0 68.0 11.5
[153 rows x 6 columns]
In [4]: # Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
In [5]: # Print the head of airquality_pivot
print(airquality_pivot.head())
measurement Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67.0 7.4
1 5 2 36.0 118.0 72.0 8.0
2 5 3 12.0 149.0 74.0 12.6
3 5 4 18.0 313.0 62.0 11.5
4 5 5 NaN NaN 56.0 14.3
In [6]: # Print the head of airquality
print(airquality.head())
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
```
## [2-3] Beyond melt and pivot
### 1. Melting and parsing | str[ ]
* Another common problem:
* Columns contain multiple bits of information

* Here m014 means males aged 0-14, and m1524 means males aged 15-24.
```python=
pd.melt(frame=tb, id_vars=['country', 'year'])
#[Out]:
'''
country year variable value
0 AD 2000 m014 0
1 AE 2000 m014 2
2 AF 2000 m014 52
3 AD 2000 m1524 0
4 AE 2000 m1524 4
5 AF 2000 m1524 228
'''
```
* Nothing inherently wrong about the original data shape
* But it is not conducive to analysis
> Use .str indexing to extract the first character.
* df[column].str[ ]: [[doc](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#indexing-with-str)]
* str[0] is the first character (position 0), str[-1] is the last character, and str[1:3] grabs the characters at positions 1-2 (3 excluded).
```python=
tb_melt['sex'] = tb_melt.variable.str[0]
tb_melt
#[Out]:
'''
country year variable value sex
0 AD 2000 m014 0 m
1 AE 2000 m014 2 m
2 AF 2000 m014 52 m
3 AD 2000 m1524 0 m
4 AE 2000 m1524 4 m
5 AF 2000 m1524 228 m
'''
```
* Here we apply str[0] to the variable column to split the first character off into a sex column, which makes later analysis easier.
#### practice
- Splitting a column with .str
```python=
In [1]: tb.head()
Out[1]:
country year m014 m1524 m2534 ... f3544 f4554 f5564 f65 fu
0 AD 2000 0.0 0.0 1.0 ... NaN NaN NaN NaN NaN
1 AE 2000 2.0 4.0 4.0 ... 3.0 0.0 0.0 4.0 NaN
2 AF 2000 52.0 228.0 183.0 ... 339.0 205.0 99.0 36.0 NaN
3 AG 2000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 NaN
4 AL 2000 2.0 19.0 21.0 ... 8.0 8.0 5.0 11.0 NaN
[5 rows x 18 columns]
In [2]: # Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
In [3]: print(tb_melt.head())
Out[3]:
country year variable value
0 AD 2000 m014 0.0
1 AE 2000 m014 2.0
2 AF 2000 m014 52.0
3 AG 2000 m014 0.0
4 AL 2000 m014 2.0
In [4]: # Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
In [5]: # Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
In [6]: # Print the head of tb_melt
print(tb_melt.head())
country year variable value gender age_group
0 AD 2000 m014 0.0 m 014
1 AE 2000 m014 2.0 m 014
2 AF 2000 m014 52.0 m 014
3 AG 2000 m014 0.0 m 014
4 AL 2000 m014 2.0 m 014
```
* Splitting a column with `.split()` and `.get()`
> .str.split() splits the text into a list; .str.get() then retrieves an element from it.
```python=
In [1]: # Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'],
var_name='type_country', value_name='counts')
In [2]: print(ebola_melt.head())
Date Day type_country counts
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
3 1/2/2015 286 Cases_Guinea NaN
4 12/31/2014 284 Cases_Guinea 2730.0
In [3]: # Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')
In [4]: # Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)
In [5]: # Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)
In [6]: # Print the head of ebola_melt
print(ebola_melt.head())
Date Day type_country counts str_split type country
0 1/5/2015 289 Cases_Guinea 2776.0 [Cases, Guinea] Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 [Cases, Guinea] Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 [Cases, Guinea] Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN [Cases, Guinea] Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 [Cases, Guinea] Cases Guinea
```
---
# Combining data for analysis
## [3-1] Concatenating data
### 1. Combining data
* Data may not always come in 1 huge file
* 5 million row dataset may be broken into 5 separate datasets
* Easier to store and share
* May have new data for each day
* Important to be able to combine then clean, or vice versa
* Take the following tables as an example:

### 2. pandas concat | concat(), loc[ ], iloc[ ]
> Use pd.concat() to combine two tables.
#### pd.concat() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)]
* Parameters:
* objs (required): a list of the DataFrames to concatenate.
* ignore_index: reset the index after concatenating.
```python=
concatenated = pd.concat([weather_p1, weather_p2])
print(concatenated)
#[Out]:
'''
date element value
0 2010-01-30 tmax 27.8
1 2010-01-30 tmin 14.5
0 2010-02-02 tmax 27.3
1 2010-02-02 tmin 14.4
'''
```
* Note that the concatenated table keeps the original index values.
* Below we use ignore_index:
```python=
pd.concat([weather_p1, weather_p2], ignore_index=True)
#[Out]:
'''
date element value
0 2010-01-30 tmax 27.8
1 2010-01-30 tmin 14.5
2 2010-02-02 tmax 27.3
3 2010-02-02 tmin 14.4
'''
```
> We can use pandas .loc[ , ] to pick out values in a DataFrame by index and column.
* loc[index, column]: select by label (labels can be strings).
* iloc[index, column]: select by integer position only.
* Both loc and iloc take an index part and a column part; they differ only in whether positions are given by label or by integer (see the iloc sketch after the example below).
```python=
concatenated = concatenated.loc[0, :]
#[Out]:
'''
date element value
0 2010-01-30 tmax 27.8
0 2010-02-02 tmax 27.3
'''
```
* Here we select every row whose index label is 0 (the : means all columns).
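For comparison, a minimal iloc sketch on the same data (integer positions rather than labels), assuming weather_p1 and weather_p2 from above:
```python=
concatenated = pd.concat([weather_p1, weather_p2], ignore_index=True)
concatenated.iloc[0:2, :]  # first two rows, every column
concatenated.iloc[0, 2]    # row 0, column 2 ('value') -> 27.8
```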
### 3. Concatenating DataFrames
> Here we split age and sex out into a separate table, so that we can analyze by sex and age.

* Hint: use df[column].str[reference], as in the sketch below.
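A hedged sketch of that hint, assuming a melted table like tb_melt from [2-3] with a combined variable column such as m014 (the names extra and tb_tidy are illustrative):
```python=
import pandas as pd

# split the combined column: sex is the first character, age group is the rest
sex = tb_melt['variable'].str[0]
age_group = tb_melt['variable'].str[1:]

# put them in a separate table, then concatenate it column-wise onto the original
extra = pd.DataFrame({'sex': sex, 'age_group': age_group})
tb_tidy = pd.concat([tb_melt, extra], axis=1)
```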
### practice
* Combining rows of data
```python=
In [1]: # Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])
In [2]: # Print the shape of row_concat
print(row_concat.shape)
(297, 4)
In [3]: # Print the head of row_concat
print(row_concat.head())
Date/Time Lat Lon Base
0 4/1/2014 0:11:00 40.7690 -73.9549 B02512
1 4/1/2014 0:17:00 40.7267 -74.0345 B02512
2 4/1/2014 0:21:00 40.7316 -73.9873 B02512
3 4/1/2014 0:28:00 40.7588 -73.9776 B02512
4 4/1/2014 0:33:00 40.7594 -73.9722 B02512
```
* Combining columns of data
```python=
In [1]: print(ebola_melt.head(n=3))
Date Day status_country counts
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
In [2]: print(status_country.head(n=3))
status country
0 Cases Guinea
1 Cases Guinea
2 Cases Guinea
In [3]: # Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)
In [4]: # Print the shape of ebola_tidy
print(ebola_tidy.shape)
(1952, 6)
In [5]: # Print the head of ebola_tidy
print(ebola_tidy.head())
Date Day status_country counts status country
0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea
```
---
## [3-2] Finding and concatenating data
### 1. Concatenating many files
> When we have many similar files to process
* Leverage Python’s features with data cleaning in pandas
* In order to concatenate DataFrames:
* They must be in a list
* Can load them individually if there are only a few datasets
* But what if there are thousands?
* Solution: the glob function finds files based on a pattern
### 2. Globbing | the glob module
> The glob module is used to find file paths that match a pattern.
#### The glob module [[doc](https://docs.python.org/3/library/glob.html)]
* Pattern matching for file names
* Wildcards: * and ?
* Any csv file: *.csv
* Any single character: file_?.csv
* **Returns a list of file names**
* **Can use this list to load into separate DataFrames**
> Approach
* Load files from globbing into pandas
* Add the DataFrames into a list
* Concatenate multiple datasets at once
> `glob.glob`
```python=
import glob
csv_files = glob.glob('*.csv')
print(csv_files)
#[Out]:['file5.csv', 'file2.csv', 'file3.csv', 'file1.csv', 'file4.csv']
```
> Use a loop to load each path in csv_files and append the resulting DataFrame to list_data.
```python=
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
```
#### practice
* Finding files that match a pattern
```python=
In [1]: # Import necessary modules
import glob
import pandas as pd
In [2]: # Write the pattern: pattern
pattern = '*.csv'
In [3]: # Save all file matches: csv_files
csv_files = glob.glob(pattern)
In [4]: # Print the file names
print(csv_files)
['uber-raw-data-2014_06.csv', 'uber-raw-data-2014_05.csv', 'uber-raw-data-2014_04.csv']
In [5]: # Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
In [6]: # Print the head of csv2
print(csv2.head())
Date/Time Lat Lon Base
0 5/1/2014 0:02:00 40.7521 -73.9914 B02512
1 5/1/2014 0:06:00 40.6965 -73.9715 B02512
2 5/1/2014 0:15:00 40.7464 -73.9838 B02512
3 5/1/2014 0:17:00 40.7463 -74.0011 B02512
4 5/1/2014 0:17:00 40.7594 -73.9734 B02512
```
* Iterating and concatenating all matches
1. iterate through each of the filenames
2. read each filename into a DataFrame, and then
3. append it to the frames list.
```python=
In [1]: # Create an empty list: frames
frames = []
In [2]: # Iterate over csv_files
for csv in csv_files:
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    # Append df to frames
    frames.append(df)
In [3]: # Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
In [4]: # Print the shape of uber
print(uber.shape)
(297, 4)
In [5]: # Print the head of uber
print(uber.head())
Date/Time Lat Lon Base
0 6/1/2014 0:00:00 40.7293 -73.9920 B02512
1 6/1/2014 0:01:00 40.7131 -74.0097 B02512
2 6/1/2014 0:04:00 40.3461 -74.6610 B02512
3 6/1/2014 0:04:00 40.7555 -73.9833 B02512
4 6/1/2014 0:07:00 40.6880 -74.1831 B02512
```
## [3-3] Merge data
### 1. Combining data
> Besides concat, we can also use merge to combine data.
* Concatenation is not the only way data can be combined

### 2. Merging data | merge()
> Merge combines two different tables based on the values of specified columns.
* Similar to joining tables in SQL
* **Combine disparate datasets based on common columns**

#### pd.merge(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)]
* Parameters:
* left: the left DataFrame.
* right: the right DataFrame.
* on: column name(s) present in both DataFrames (label or list); default None.
* left_on: key column(s) in the left DataFrame.
* right_on: key column(s) in the right DataFrame.
```python=
pd.merge(left=state_populations, right=state_codes,
on=None, left_on='state', right_on='name')
#[Out]:
'''
state population_2016 name ANSI
0 California 39250017 California CA
1 Texas 27862596 Texas TX
2 Florida 20612439 Florida FL
3 New York 19745289 New York NY
'''
```
### 3. Types of merge
> The different types of merge
* All use the same function
* Only difference is the DataFrames you are merging
> One-to-one: each key value has exactly one matching row in both the left and right key columns.

> Many-to-one / one-to-many
* Above, the key column (state) contains New York twice; when the right table is merged in, the matching values from the right are repeated automatically.


> Many-to-Many
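The course does not walk through a many-to-many example here, but the idea is that when the key is duplicated on both sides, merge returns one row per combination of matches. A small sketch with made-up tables:
```python=
import pandas as pd

# hypothetical tables: the key 'site' is duplicated in both
surveys = pd.DataFrame({'site': ['DR-1', 'DR-1'], 'reading': [9.82, 7.80]})
visits = pd.DataFrame({'site': ['DR-1', 'DR-1'], 'dated': ['1927-02-08', '1927-02-10']})

# many-to-many: 2 x 2 = 4 rows for site DR-1
print(pd.merge(left=surveys, right=visits, on='site'))
```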
#### practice
* 1-to-1 data merge
```python=
In [1]: print(site.head())
name lat long
0 DR-1 -49.85 -128.57
1 DR-3 -47.15 -126.72
2 MSK-4 -48.87 -123.40
In [2]: print(visited.head())
ident site dated
0 619 DR-1 1927-02-08
1 734 DR-3 1939-01-07
2 837 MSK-4 1932-01-14
In [3]: # Merge the DataFrames: o2o
o2o = pd.merge(left = site,
right = visited, left_on='name', right_on='site')
In [4]: # Print o2o
print(o2o)
name lat long ident site dated
0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
1 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
2 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
```
* Many-to-1 data merge
```python=
In [1]: print(site.head())
name lat long
0 DR-1 -49.85 -128.57
1 DR-3 -47.15 -126.72
2 MSK-4 -48.87 -123.40
In [2]: print(visited.head())
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
In [3]: # Merge the DataFrames: m2o
m2o = pd.merge(left = site, right = visited, left_on = 'name', right_on = 'site')
In [4]: # Print m2o
print(m2o)
name lat long ident site dated
0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
1 DR-1 -49.85 -128.57 622 DR-1 1927-02-10
2 DR-1 -49.85 -128.57 844 DR-1 1932-03-22
3 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
4 DR-3 -47.15 -126.72 735 DR-3 1930-01-12
5 DR-3 -47.15 -126.72 751 DR-3 1930-02-26
6 DR-3 -47.15 -126.72 752 DR-3 NaN
7 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
```
---
# Cleaning data for analysis
## [4-1] Data types
### 1. Data types | dtypes
#### df.dtypes [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)]
> Take the following table as an example:

> df.dtypes shows the data type of each column.
```python=
print(df.dtypes)
#[Out]:
'''
name object
sex object
treatment a object
treatment b int64
dtype: object
'''
```
### 2. Converting data types | astype()
> Use astype() to change the data type of a column.
#### df[column].astype() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)]
```python=
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
df.dtypes
#[Out]:
'''
name object
sex category
treatment a object
treatment b object
dtype: object
'''
```
### 3. Categorical data
* Converting categorical data to ‘category’ dtype:
* Can make the DataFrame **smaller in memory**
* Makes them easier for other Python libraries to use for analysis (see the sketch below)
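A small sketch of the memory effect, using a made-up DataFrame with a repetitive string column; the exact savings depend on the data:
```python=
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female'] * 50000})
print(df['sex'].memory_usage(deep=True))                     # object dtype
print(df['sex'].astype('category').memory_usage(deep=True))  # much smaller
```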
### 4. Cleaning bad data
> A numeric column gets read in as strings because it contains '-' entries.

> The fix:
#### pd.to_numeric() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html)]
* Parameters:
* errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
* If ‘raise’, then invalid parsing will raise an exception
* If ‘coerce’, then invalid parsing will be set as NaN
* If ‘ignore’, then invalid parsing will return the input
```python=
df['treatment a'] = pd.to_numeric(df['treatment a'],errors='coerce')
df.dtypes
#[Out]:
'''
name object
sex category
treatment a float64
treatment b object
dtype: object
'''
```
* errors='coerce' converts values that cannot be parsed into NaN.
#### practice
- Converting data types
```python=
In [1]: # Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')
In [2]: # Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')
In [3]: # Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null category
smoker 244 non-null category
day 244 non-null object
time 244 non-null object
size 244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None
```
* Working with numeric data
```python=
In [1]: # Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')
In [2]: # Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')
In [3]: # Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 202 non-null float64
tip 220 non-null float64
sex 234 non-null category
smoker 229 non-null category
day 243 non-null category
time 227 non-null category
size 231 non-null float64
dtypes: category(4), float64(3)
memory usage: 7.2 KB
None
```
## [4-2] Using regular expressions to clean strings
### 1. String manipulation
* Much of data cleaning involves string manipulation
* Most of the world’s data is unstructured text
* Also have to do string manipulation to make datasets consistent with one another
* Many built-in and external libraries
* ‘re’ library for regular expressions
* A formal way of specifying a pattern
* Sequence of characters
* Pattern matching
* Similar to globbing
### 2. Example match

### 3. Using regular expressions | the re module
> The regular expression module re is a very powerful text-parsing tool.
#### **Re** [[doc](https://docs.python.org/3/library/re.html)]
```python=
import re
pattern = re.compile('\$\d*\.\d{2}')
result = pattern.match('$17.89')
'''Equivalent to:
result = re.match('\$\d*\.\d{2}', '$17.89')
'''
bool(result)
#[Out]:True
```
* re.compile(): compiles a regular expression pattern into a regular expression object.
* The compiled object can then be used with search() or match().
* re.search(pattern, string, flags=0): returns a match object for the first place the pattern matches anywhere in the string, or None if there is no match.
* re.match(pattern, string, flags=0): checks whether the pattern matches at the beginning of the string (a match vs. search sketch follows the example below).
```python=
import re
recipe = "I need 10 strawberries and 2 apples."
print(re.findall("\d+ [a-z]+", recipe))
```
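A quick sketch of the difference between match() and search(): match() only succeeds at the start of the string, while search() scans the whole string:
```python=
import re

pattern = re.compile('\d{2}')
print(bool(pattern.match('17 apples')))   # True: the string starts with two digits
print(bool(pattern.match('apples 17')))   # False: match only looks at the start
print(bool(pattern.search('apples 17')))  # True: search scans the whole string
```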
#### practice
- String parsing with regular expressions
```python=
In [1]: # Import the regular expression module
import re
# Compile the pattern: prog
prog = re.compile('\d{3}\-\d{3}\-\d{4}')
# See if the pattern matches
result = prog.match('123-456-7890')  # check whether the string matches the prog pattern
print(bool(result)) #True
True
In [2]: # See if the pattern matches
result2 = prog.match('1123-456-7890')  # check whether the string matches the prog pattern
print(bool(result2)) #False
False
```
* Extracting numerical values from strings
```python=
In [1]: # Import the regular expression module
import re
# Find the numeric values: matches
matches = re.findall('\d+',
'the recipe calls for 10 strawberries and 1 banana')
# Find all numeric values in the sentence; with just \d the result would be ['1', '0', '1']
# \d+ is used so that consecutive digits are grouped together, giving ['10', '1']
# Print the matches
print(matches)
['10', '1']
```
- Pattern matching
```python=
In [4]: # Write the first pattern
pattern1 = bool(re.match(pattern='\d{3}-\d{3}-\d{4}',
string='123-456-7890'))
print(pattern1)
True
In [5]: # Write the second pattern
pattern2 = bool(re.match(pattern='\$\d+.\d{2}',
string='$123.45'))
print(pattern2)
True
#Use [A-Z] to match any capital letter followed by \w*
#to match an arbitrary number of alphanumeric characters.
In [6]: # Write the third pattern
pattern3 = bool(re.match(pattern='[A-Z]\w*',
string='Australia'))
print(pattern3)
True
```
## [4-3] Using functions to clean data
### 1. Complex cleaning
* Cleaning step requires **multiple steps**
* Extract number from string
* Perform transformation on extracted number
* Python function
### 2. Apply | apply()
#### df.apply() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)]
* Parameters:
* axis: apply the function to each column (0) or to each row (1).
* Output: a Series.
* apply() runs a function over an entire sequence (list, Series, or column) at once.
```python=
print(df)
#[Out]:
'''
treatment a treatment b
Daniel 18 42
John 12 31
Jane 24 27
'''
df.apply(np.mean, axis=0)
#[Out]:
'''
treatment a 18.000000
treatment b 33.333333
dtype: float64
'''
df.apply(np.mean, axis=1)
#[Out]:
'''
Daniel 30.0
John 21.5
Jane 25.5
dtype: float64
'''
```
### 3. Applying functions
> We will combine a function, re, and apply() to clean the data, using the table below as an example:

> First, write a regular expression.
```python=
import re
from numpy import NaN
pattern = re.compile('^\$\d*\.\d{2}$')
```
> Next, write a function.
```python=
def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = icost.replace("$", "")
        tef = tef.replace("$", "")
        icost = float(icost)
        tef = float(tef)
        return icost - tef
    else:
        return NaN
```
* bool(): cast to a boolean.
* float(): cast to a float.
* replace(): replace part of a string with another string.
> Finally, use apply() together with the function to create a new column ('diff').
```python=
df_subset['diff'] = df_subset.apply(diff_money,axis=1,pattern=pattern)
print(df_subset.head())
#[Out]:
'''
Job # Doc # Borough Initial Cost Total Est. Fee diff
0 121577873 2 MANHATTAN $75000.00 $986.00 74014.0
1 520129502 1 STATEN ISLAND $0.00 $1144.00 -1144.0
2 121601560 1 MANHATTAN $30000.00 $522.50 29477.5
3 121601203 1 MANHATTAN $1500.00 $225.00 1275.0
4 121601338 1 MANHATTAN $19500.00 $389.50 19110.5
'''
```
* Note: axis=1 is used here, i.e. the function is applied row by row, because each new value is computed from other columns in the same row.
#### practice
- Custom functions to clean data
```python=
In [1]: tips.sex.head()
Out[1]:
0 Female
1 Male
2 Male
3 Male
4 Female
Name: sex, dtype: object
In [2]: # Define recode_gender()
def recode_gender(gender):
    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0
    # Return 1 if gender is 'Male'
    elif gender == 'Male':
        return 1
    # Return np.nan otherwise
    else:
        return np.nan
In [3]: # Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)
In [4]: # Print the first five rows of tips
print(tips.head())
total_bill tip sex smoker day time size recode
0 16.99 1.01 Female No Sun NaN 2.0 0.0
1 10.34 1.66 Male No Sun Dinner 3.0 1.0
2 21.01 3.50 Male No Sun Dinner 3.0 1.0
3 23.68 3.31 Male No Sun Dinner 2.0 1.0
4 NaN 3.61 Female No Sun Dinner 4.0 0.0
```
- Lambda functions
> Simplify the function with a lambda.
```python=
def my_square(x):
    return x ** 2
df.apply(my_square)
# equal to
df.apply(lambda x: x ** 2)
```
```python=
In [1]: # Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))
In [2]: # Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall('\d+\.\d+', x)[0])
In [3]: # Print the head of tips
print(tips.head())
total_bill tip sex smoker day time size total_dollar total_dollar_replace total_dollar_re
0 16.99 1.01 Female No Sun Dinner 2 $16.99 16.99 16.99
1 10.34 1.66 Male No Sun Dinner 3 $10.34 10.34 10.34
2 21.01 3.50 Male No Sun Dinner 3 $21.01 21.01 21.01
3 23.68 3.31 Male No Sun Dinner 2 $23.68 23.68 23.68
4 24.59 3.61 Female No Sun Dinner 4 $24.59 24.59 24.59
```
## [4-4] Duplicate and missing data
### 1. Duplicate data | drop_duplicates()
* Can skew results

> Handle them with drop_duplicates().
#### df.drop_duplicates() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)]
```python=
df = df.drop_duplicates()
print(df)
#[Out]:
'''
name sex treatment a treatment b
0 Daniel male - 42
1 John male 12 31
2 Jane female 24 27
'''
```
### 2. Missing data | dropna(), fillna()
> Missing values

* Ways to handle them:
* Leave as-is
* Drop them
* Fill missing values
> info() shows clearly how many missing values each column has.
```python=
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 202 non-null float64
tip 220 non-null float64
sex 234 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 231 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
None
'''
```
> Method 1: use dropna() to drop the missing values.
#### df.dropna() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)]
```python=
tips_dropped = tips_nan.dropna()
tips_dropped.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 243
Data columns (total 7 columns):
total_bill 147 non-null float64
tip 147 non-null float64
sex 147 non-null object
smoker 147 non-null object
day 147 non-null object
time 147 non-null object
size 147 non-null float64
dtypes: float64(3), object(4)
memory usage: 9.2+ KB
'''
```
> Method 2: use fillna() to fill the missing values.
#### df[columns].fillna() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)]
* Fill with provided value
* Use a summary statistic
```python=
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill','size']].fillna(0)
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 220 non-null float64
sex 244 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
'''
```
* Note: in pandas you can use double square brackets to select multiple columns, as in tips_nan[['total_bill', 'size']] above.
> Method 3: fill with a summary statistic.
* Be careful when filling with a summary statistic
* Have to make sure the value you are filling in makes sense
* Median is a better statistic in the presence of outliers (a median-based sketch follows the example below)
```python=
mean_value = tips_nan['tip'].mean()
print(mean_value)
#[Out]:2.964681818181819
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null object
smoker 229 non-null object
day 243 non-null object
time 227 non-null object
size 244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
'''
```
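As noted above, the median is more robust when a column has outliers; a minimal sketch of the equivalent fill using the median instead of the mean:
```python=
# fill missing tips with the median, which is less affected by outliers
median_value = tips_nan['tip'].median()
tips_nan['tip'] = tips_nan['tip'].fillna(median_value)
```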
#### practice
- Filling missing data
```python=
In [2]: # Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality['Ozone'].mean()
# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)
# Print the info of airquality
print(airquality.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone 153 non-null float64
Solar.R 146 non-null float64
Wind 153 non-null float64
Temp 153 non-null int64
Month 153 non-null int64
Day 153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None
```
## [4-5] Testing with asserts
### 1. Assert statements
#### The assert statement [[doc](https://docs.python.org/3/reference/simple_stmts.html#grammar-token-assert-stmt)]
> An assert statement is a convenient way to insert debugging assertion checks into a program.
* Programmatically vs visually checking
* If we drop or fill NaNs, we expect 0 missing values
* We can write an assert statement to verify this
* We can detect early warnings and errors
* This gives us confidence that our code is running correctly
```python=
assert 1 == 1
assert 1 == 2
#[Out]:
'''
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-65-a810b3a4aded> in <module>()
----> 1 assert 1 == 2
AssertionError:
'''
```
### 2. Google stock data

> Test column
```python=
assert google.Close.notnull().all()
#[Out]:
'''
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-49-eec77130a77f> in <module>()
----> 1 assert google.Close.notnull().all()
AssertionError:
'''
```
* Here notnull() tests every value in the Close column for missing values.
* all() then combines those results into a single boolean.
* Finally, assert raises an error if the check fails.
```python=
google_0 = google.fillna(value=0)
assert google_0.Close.notnull().all()
```
#### practice
- Testing your data with asserts
```python=
In [1]: # Assert that there are no missing values
assert pd.notnull(ebola).all().all()
# The first .all() method will return a True or False for each
# column, while the second .all() method will return a single
# True or False)
# If the expression after assert evaluates to True, execution continues; otherwise an AssertionError is raised
In [2]: # Assert that all values are >= 0
assert (ebola >= 0).all().all()
```