# [Cleaning Data in Python (DataCamp)](https://www.datacamp.com/courses/cleaning-data-in-python)
###### tags: `Datacamp` `python` `Cleaning Data` `tidying data` `combining data`

> Adapted from [https://hackmd.io/0Euhw7ydReyUfY-kLUnXbg?both](https://hackmd.io/0Euhw7ydReyUfY-kLUnXbg?both)
> **Author: 何彥南**

{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}

Note: throughout these notes, `df` is a pandas DataFrame.

---
[TOC]
---

# Exploring your data

## [1-1] Diagnose data for cleaning

### 1. Cleaning data
* Prepare data for analysis
* Data almost never comes in clean
* Diagnose your data for problems

### 2. Common data problems
* Inconsistent column names
* Missing data
* Outliers
* Duplicate rows
* Untidy data
* Columns that need further processing
* Column types can signal unexpected data values

### 3. Unclean data
> An example:

![](https://i.imgur.com/zG6VbbA.png)

* The table above has several problems:
    * Column name inconsistencies
    * Missing data
    * Country names are in French (a language/encoding issue)

### 4. Load your data | read_csv()
> Load your data.

#### **pd.read_csv()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)]
* Common parameters:
    * **path** (required): location of the file.
    * header: which row to use for the column names.
        * header=None can be used when the file has no header row; the columns are then numbered (0, 1, 2, ...).
        * The default, header=0, uses row 0 as the column names.
    * names: rename the columns; pass a list.
    * usecols: which columns to load; pass a list of positions (ints) or column names.
    * skiprows: how many rows to skip at the top of the file; pass an int.
    * encoding: the encoding to read the file with; pass a string.

```python=
import pandas as pd
df = pd.read_csv('literary_birth_rate.csv')
```
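The parameters listed above are easiest to understand together in one call. A minimal sketch, assuming a hypothetical file that has one metadata line above the header and contains `country` and `population` columns (the file name just reuses the example above):

```python=
import pandas as pd

# Hypothetical usage of the read_csv parameters described above:
# skip one metadata line, use the next row as the header, load only
# two of the columns, and decode the file as UTF-8.
df = pd.read_csv('literary_birth_rate.csv',
                 skiprows=1,                          # skip the first line of the file
                 header=0,                            # first remaining row = column names
                 usecols=['country', 'population'],   # load only these columns
                 encoding='utf-8')

# If the file had no header row at all, supply the names yourself:
# df = pd.read_csv('literary_birth_rate.csv', header=None,
#                  names=['continent', 'country', 'population'])
```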
### 5. Visually inspect | head(), tail()
> Look at the data first and rule out the errors you can see.

![](https://i.imgur.com/Rgi2sF8.png)

#### **df.head()** [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)]
* First n rows; default 5.

#### **df.tail()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail)]
* Last n rows; default 5.

### 6. Visually inspect (2) | columns, shape, info()
![](https://i.imgur.com/yKYt4HP.png)

#### **columns**:
* The column names (iterable).

#### **shape**:
* The dimensions of the table, as (rows, columns).

#### **info()**: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)
* Overall information about the table.

## [1-2] Exploratory data analysis

### 1. Data type of each column
> The data type of each column.

![](https://i.imgur.com/W8th4ro.png)

### 2. Frequency counts: continent | value_counts()
#### **df[column].value_counts()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)]
* Counts the occurrences of each distinct value in the given column.
* The output uses the distinct values as the index and the counts as the values.
* Note:
    * value_counts().values: the counts (one per distinct value).
    * value_counts().index: the distinct values.
* Parameters:
    * ascending: sort order; pass a bool. The default is descending; True sorts ascending.
    * normalize: return proportions instead of counts; pass a bool.
    * dropna: whether to drop missing values.

```python=
df.continent.value_counts(dropna=False)
df['continent'].value_counts(dropna=False)
#[Out]:
'''
AF     49
ASI    47
EUR    36
LAT    24
OCE     6
NAM     2
Name: continent, dtype: int64
'''
```
* On a DataFrame, `df.continent` and `df['continent']` both refer to the `continent` column of `df`.
* The result may contain missing or NaN values; these deserve attention and are covered later.

### 3. Summary statistics
* Numeric columns: numeric summaries of the data.
* Outliers (also called extreme values): we can see that a value is unusually high (or low) and then investigate it further.

### 4. Summary statistics: Numeric data | describe()
#### **df.describe()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)]
* Shows summary statistics for the numeric columns of the table.

```python=
df.describe()
#[Out]:
'''
       female_literacy    population
count       164.000000  1.220000e+02
mean         80.301220  6.345768e+07
std          22.977265  2.605977e+08
min          12.600000  1.035660e+05
25%          66.675000  3.778175e+06
50%          90.200000  9.995450e+06
75%          98.500000  2.642217e+07
max         100.000000  2.313000e+09
'''
```

#### practice
```python=
In [2]: # Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))
MANHATTAN        6310
BROOKLYN         2866
QUEENS           2121
BRONX             974
STATEN ISLAND     575
Name: Borough, dtype: int64

In [3]: # Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))
NY    12391
NJ      241
PA       38
CA       20
OH       19
FL       17
IL       17
CT       16
TX       13
TN       10
DC        7
MD        7
GA        6
KS        6
MA        6
VA        5
CO        4
WI        3
AZ        3
SC        3
MN        3
NC        2
UT        2
RI        2
VT        1
NM        1
MI        1
WA        1
IN        1
Name: State, dtype: int64

In [4]: # Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))
NOT APPLICABLE                       7806
NaN                                  4205
ON-SITE                               519
OFF-SITE                              186
USE UNDER 300 CU.YD                   130
Name: Site Fill, dtype: int64
```

## [1-3] Visual exploratory data analysis

### 1. Data visualization
* Great way to spot outliers and obvious errors
* More than just looking for patterns
* Plan data cleaning steps

### 2. Bar plots and histograms
* Bar plots for discrete data counts
* Histograms for continuous data counts
* Look at frequencies

### 3. Histogram
> Good for showing the frequency distribution of continuous data.

```python=
import matplotlib.pyplot as plt
df.population.plot(kind='hist')
#[Out]: <matplotlib.axes._subplots.AxesSubplot at 0x7f78e4abafd0>
plt.show()
```
![](https://i.imgur.com/td2eHvf.png)
* Here we can see outliers on the population axis, around 1 and beyond 2 (×10⁹).

### 4. Identifying the error
> To identify the anomalies, we can use pandas' basic filtering to inspect the outliers (continuing from above).

```python=
df[df.population > 1000000000]
#[Out]:
'''  continent    country  female_literacy  fertility    population
0        ASI      Chine             90.5      1.769  1.324655e+09
1        ASI       Inde             50.8      2.682  1.139965e+09
162      OCE  Australia             96.0      1.930  2.313000e+09
'''
```
* Not all outliers are bad data; you have to judge for yourself.
* Here it is clear that Australia's population value is wrong.

### 5. Box plots
* Visualize the basic distribution of the data:
    * Outliers
    * Min/max
    * 25th, 50th, 75th percentiles

```python=
df.boxplot(column='population', by='continent')
#[Out]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff5581bb630>
plt.show()
```
* Output:

![](https://i.imgur.com/in21IG4.png)
* The circled parts show the outliers: two between 1 and 1.5, and one above 2 (×10⁹).
### 6. Scatter plots
* Relationship between 2 numeric variables
* Flag potentially bad data
* Errors not found by looking at 1 variable

#### practice
- Visualizing single variables with histograms
```python=
In [1]: # Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Describe the column
df['Existing Zoning Sqft'].describe()
Out[1]:
count    1.284600e+04
mean     1.439973e+03
std      3.860757e+04
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      2.873107e+06
Name: Existing Zoning Sqft, dtype: float64

In [2]: # Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
Out[2]: <matplotlib.axes._subplots.AxesSubplot at 0x7f72977433c8>

In [3]: # Display the histogram
plt.show()
```
![](https://i.imgur.com/Q5L5kmT.png =400x)

- Visualizing multiple variables with boxplots
```python=
In [1]: # Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create the boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90)
Out[1]: <matplotlib.axes._subplots.AxesSubplot at 0x7f728c33a0f0>

In [2]: # Display the plot
plt.show()
```
![](https://i.imgur.com/C7t4OUS.png =400x)

- Visualizing multiple variables with scatter plots
```python=
In [1]: # Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create and display the first scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```
![](https://i.imgur.com/5ulpav0.png =400x)

```python=
In [2]: # Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
```
![](https://i.imgur.com/ZnJF5JW.png =400x)

---

# Tidying data for analysis

## [2-1] Tidy data
* "Tidy Data" paper by Hadley Wickham, PhD
* Formalizes the way we describe the shape of data
* Gives us a goal when formatting our data
* "Standard way to organize data values within a dataset"

### 1. Principles of tidy data
* Columns represent separate variables
* Rows represent individual observations
* Observational units form tables

### 2. Converting to tidy data
![](https://i.imgur.com/KSNRR8A.png)
* Better for reporting vs. better for analysis
* Tidy data makes it easier to fix common data problems
* The data problem we are trying to fix:
    * Columns containing values, instead of variables
### 3. Melting | melt()
> Melting turns several columns that hold the same kind of value into two columns: a category column (made from the original column names, recording which column each value came from) and a value column (made from the values that were under those columns).

#### pd.melt(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)]
* Parameters:
    * frame: the input DataFrame.
    * id_vars: the columns to keep fixed as identifiers.
    * value_vars: the columns to melt into a single categorical column.
    * var_name: the name of the resulting category column.
    * value_name: the name of the resulting value column.

```python=
pd.melt(frame=df, id_vars='name',
        value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')
#[Out]:
'''
     name    treatment  result
0  Daniel  treatment a       -
1    John  treatment a      12
2    Jane  treatment a      24
3  Daniel  treatment b      42
4    John  treatment b      31
5    Jane  treatment b      27
'''
```
* Before the transformation, the data is person-based (so we set `name` as id_vars), with each person's result recorded under two treatment columns.
* Some table shapes are good for presentation in a report but bad for analysis; we need to know which shape suits our purpose.
* After pd.melt(), each row represents one treatment, which makes later analysis more convenient and explicit.

![](https://i.imgur.com/KSNRR8A.png)

#### practice
- Reshaping your data using melt
```python=
In [1]: # Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

In [2]: # Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'])

In [3]: # Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN
```

- Customizing melted data
```python=
In [1]: # Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

In [2]: # Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')

In [3]: # Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
```

## [2-2] Pivoting data

### 1. Pivot: un-melting data | pivot()
> Turns rows back into columns.

![](https://i.imgur.com/FtOS4OK.png)
* Opposite of melting
* In melting, we turned columns into rows
* **Pivoting: turn unique values into separate columns**
* Analysis-friendly shape to reporting-friendly shape
* **Violates the tidy data principles:**
    * **rows contain observations**
    * **multiple variables stored in the same column**

#### df.pivot(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot)]
* Parameters:
    * index: the column to use as the index.
    * columns: the column whose categories become the new columns.
    * values: the column holding the values to fill in.

```python=
weather_tidy = weather.pivot(index='date', columns='element', values='value')
print(weather_tidy)
#[Out]:
'''element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  14.4
'''
```

### 2. Using pivot when you have duplicate entries
> When the same index/category pair has more than one value, pivot() raises the error below.

![](https://i.imgur.com/xSPN3J2.png)
* Here, with date as the index, 2010-02-02 has two tmin values, so pivot() cannot reshape the data.

```python=
import numpy as np
weather2_tidy = weather.pivot(values='value', index='date', columns='element')
#[Out]:
'''---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-2962bb23f5a3> in <module>()
      1 weather2_tidy = weather2.pivot(values='value',
      2                                index='date',
----> 3                                columns='element')
ValueError: Index contains duplicate entries, cannot reshape
'''
```
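Before falling back on pivot_table() below, you can locate the offending rows yourself with duplicated(). A small self-contained sketch; the weather table is rebuilt here with assumed duplicate tmin readings (14.4 and 16.4, whose mean matches the 15.4 shown in the next section):

```python=
import pandas as pd

# Rebuild the small weather table with a duplicated (date, element) pair.
weather = pd.DataFrame({
    'date':    ['2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmax', 'tmin', 'tmin'],
    'value':   [27.8, 14.5, 27.3, 14.4, 16.4],
})

# keep=False marks every member of a duplicated (date, element) group --
# exactly the rows that make pivot() raise ValueError.
dupes = weather[weather.duplicated(subset=['date', 'element'], keep=False)]
print(dupes)
#          date element  value
# 3  2010-02-02    tmin   14.4
# 4  2010-02-02    tmin   16.4
```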
### 3. Pivot table | pivot_table()
> Compared with pivot(), pivot_table() has more parameters and can handle the duplicate-entry error above during the conversion.

#### df.pivot_table(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table)]
* Parameters:
    * index: the column to use as the index.
    * columns: the column whose categories become the new columns.
    * values: the column holding the values to fill in.
    * aggfunc: how to aggregate; a function.
* Has a parameter that specifies how to deal with duplicate values
* Example: can aggregate the duplicate values by taking their average

```python=
weather2_tidy = weather.pivot_table(values='value', index='date',
                                    columns='element', aggfunc=np.mean)
#[Out]:
'''element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  15.4
'''
```
* Setting aggfunc tells pandas how to combine the duplicate values; here we take their mean (np.mean).

#### practice
```python=
In [1]: # Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN

In [2]: # Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

In [3]: # Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Ozone  Solar.R  Temp  Wind
Month Day
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
```

* Resetting the index of a DataFrame
```python=
In [1]: # Print the index of airquality_pivot
print(airquality_pivot.index)
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, ..., 31]],
           labels=[[0, 0, 0, ..., 4], [0, 1, 2, ..., 29]],
           names=['Month', 'Day'])

In [2]: # Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

In [3]: # Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
RangeIndex(start=0, stop=153, step=1)

In [4]: # Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
```

* Pivoting duplicate values
```python=
In [1]: # Pivot table the airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean)

In [2]: # Print the head of airquality_pivot before reset_index
print(airquality_pivot.head())
measurement  Ozone  Solar.R  Temp  Wind
Month Day
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3

In [4]: # Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

In [5]: # Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3

In [6]: # Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
```
## [2-3] Beyond melt and pivot

### 1. Melting and parsing | str[]
* Another common problem:
    * Columns contain multiple bits of information

![](https://i.imgur.com/VzQW5Pk.png =400x)
* Here m014 means males aged 0–14, and m1524 means males aged 15–24.

```python=
pd.melt(frame=tb, id_vars=['country', 'year'])
#[Out]:
'''
  country  year variable  value
0      AD  2000     m014      0
1      AE  2000     m014      2
2      AF  2000     m014     52
3      AD  2000    m1524      0
4      AE  2000    m1524      4
5      AF  2000    m1524    228
'''
```
* Nothing inherently wrong about the original data shape
* But it is not conducive to analysis

> Use str indexing to grab the first character.
* df[column].str[]: [[doc](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#indexing-with-str)]
    * str[0] is the first character (position 0), str[-1] is the last, and str[1:3] grabs positions 1 through 3 (excluding 3).

```python=
tb_melt['sex'] = tb_melt.variable.str[0]
tb_melt
#[Out]:
'''
  country  year variable  value sex
0      AD  2000     m014      0   m
1      AE  2000     m014      2   m
2      AF  2000     m014     52   m
3      AD  2000    m1524      0   m
4      AE  2000    m1524      4   m
5      AF  2000    m1524    228   m
'''
```
* Here we apply str[0] to the variable column to split out the first character into a sex column, which is convenient for later analysis.

#### practice
- Splitting a column with .str
```python=
In [1]: tb.head()
Out[1]:
  country  year  m014  m1524  m2534  ...  f3544  f4554  f5564   f65  fu
0      AD  2000   0.0    0.0    1.0  ...    NaN    NaN    NaN   NaN NaN
1      AE  2000   2.0    4.0    4.0  ...    3.0    0.0    0.0   4.0 NaN
2      AF  2000  52.0  228.0  183.0  ...  339.0  205.0   99.0  36.0 NaN
3      AG  2000   0.0    0.0    0.0  ...    0.0    0.0    0.0   0.0 NaN
4      AL  2000   2.0   19.0   21.0  ...    8.0    8.0    5.0  11.0 NaN

[5 rows x 18 columns]

In [2]: # Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])

In [3]: print(tb_melt.head())
Out[3]:
  country  year variable  value
0      AD  2000     m014    0.0
1      AE  2000     m014    2.0
2      AF  2000     m014   52.0
3      AG  2000     m014    0.0
4      AL  2000     m014    2.0

In [4]: # Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

In [5]: # Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

In [6]: # Print the head of tb_melt
print(tb_melt.head())
  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014
```
* Splitting a column with `.split()` and `.get()`
> split() splits a string into a list; get() then extracts an element from it.

```python=
In [1]: # Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

In [2]: print(ebola_melt.head())
         Date  Day  type_country  counts
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0

In [3]: # Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')

In [4]: # Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)

In [5]: # Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)

In [6]: # Print the head of ebola_melt
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
```

---

# Combining data for analysis

## [3-1] Concatenating data

### 1. Combining data
* Data may not always come in 1 huge file
    * A 5-million-row dataset may be broken into 5 separate datasets
    * Easier to store and share
* May have new data for each day
* Important to be able to combine, then clean, or vice versa
* The tables below are used as the example:

![](https://i.imgur.com/JAIxG6d.png)

### 2. pandas concat | concat(), loc[], iloc[]
> Use pd.concat() to combine two tables.

#### pd.concat() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)]
* Parameters:
    * required: a list of the DataFrames to combine.
    * ignore_index: reset the index after concatenating.

```python=
concatenated = pd.concat([weather_p1, weather_p2])
print(concatenated)
#[Out]:
'''
         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
0  2010-02-02    tmax   27.3
1  2010-02-02    tmin   14.4
'''
```
* Notice that the concatenated table keeps the original index values.
* Below we use ignore_index:

```python=
pd.concat([weather_p1, weather_p2], ignore_index=True)
#[Out]:
'''
         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
2  2010-02-02    tmax   27.3
3  2010-02-02    tmin   14.4
'''
```

> We can use pandas' .loc[,] to select values in a table by index and column.
* loc[index, column]: selects by label; labels can be strings.
* iloc[index, column]: selects by position; integers only.
* Both loc and iloc take an index part and a column part; they differ only in whether positions are specified by label or by integer position.

```python=
concatenated = concatenated.loc[0, :]
#[Out]:
'''
         date element  value
0  2010-01-30    tmax   27.8
0  2010-02-02    tmax   27.3
'''
```
* Here we select every row whose index label is 0 (`:` means "everything").
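Since iloc is described above but never demonstrated, here is a minimal sketch contrasting the two on a concatenated table with duplicate index labels (the rows reuse the values from the example above):

```python=
import pandas as pd

concatenated = pd.concat([
    pd.DataFrame({'date': ['2010-01-30'], 'element': ['tmax'], 'value': [27.8]}),
    pd.DataFrame({'date': ['2010-02-02'], 'element': ['tmax'], 'value': [27.3]}),
])  # both rows keep their original index label, 0

# .loc selects by LABEL: every row whose index label is 0 -> both rows
print(concatenated.loc[0, :])

# .iloc selects by POSITION: only the first physical row
print(concatenated.iloc[0, :])
```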
### 3. Concatenating DataFrames
> Here we split age and sex out into another table, so we can then analyze by sex and age group.

![](https://i.imgur.com/t6awMuL.png)
* Hint: use df[column].str[reference]

### practice
* Combining rows of data
```python=
In [1]: # Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

In [2]: # Print the shape of row_concat
print(row_concat.shape)
(297, 4)

In [3]: # Print the head of row_concat
print(row_concat.head())
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
```

* Combining columns of data
```python=
In [1]: print(ebola_melt.head(n=3))
       Date  Day status_country  counts
0  1/5/2015  289   Cases_Guinea  2776.0
1  1/4/2015  288   Cases_Guinea  2775.0
2  1/3/2015  287   Cases_Guinea  2769.0

In [2]: print(status_country.head(n=3))
  status country
0  Cases  Guinea
1  Cases  Guinea
2  Cases  Guinea

In [3]: # Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

In [4]: # Print the shape of ebola_tidy
print(ebola_tidy.shape)
(1952, 6)

In [5]: # Print the head of ebola_tidy
print(ebola_tidy.head())
         Date  Day status_country  counts status country
0    1/5/2015  289   Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288   Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287   Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286   Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284   Cases_Guinea  2730.0  Cases  Guinea
```

---

## [3-2] Finding and concatenating data

### 1. Concatenating many files
> When we have many similar files to process:
* Leverage Python's features with data cleaning in pandas
* In order to concatenate DataFrames:
    * They must be in a list
    * Can individually load if there are a few datasets
* But what if there are thousands?
* Solution: the glob function finds files based on a pattern

### 2. Globbing | the glob module
> The glob module looks up file paths.

#### glob module [doc](https://docs.python.org/3/library/glob.html)
* Pattern matching for file names
* Wildcards: * and ?
    * Any csv file: *.csv
    * Any single character: file_?.csv
* **Returns a list of file names**
* **Can use this list to load into separate DataFrames**

> Approach:
* Load files from globbing into pandas
* Add the DataFrames into a list
* Concatenate multiple datasets at once

> `glob.glob`
```python=
import glob
csv_files = glob.glob('*.csv')
print(csv_files)
#[Out]: ['file5.csv', 'file2.csv', 'file3.csv', 'file1.csv', 'file4.csv']
```

> Use a loop to load each path in csv_files and append the result to list_data.
```python=
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
```

#### practice
* Finding files that match a pattern
```python=
In [1]: # Import necessary modules
import glob
import pandas as pd

In [2]: # Write the pattern: pattern
pattern = '*.csv'

In [3]: # Save all file matches: csv_files
csv_files = glob.glob(pattern)

In [4]: # Print the file names
print(csv_files)
['uber-raw-data-2014_06.csv', 'uber-raw-data-2014_05.csv', 'uber-raw-data-2014_04.csv']

In [5]: # Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

In [6]: # Print the head of csv2
print(csv2.head())
          Date/Time      Lat      Lon    Base
0  5/1/2014 0:02:00  40.7521 -73.9914  B02512
1  5/1/2014 0:06:00  40.6965 -73.9715  B02512
2  5/1/2014 0:15:00  40.7464 -73.9838  B02512
3  5/1/2014 0:17:00  40.7463 -74.0011  B02512
4  5/1/2014 0:17:00  40.7594 -73.9734  B02512
```
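One caveat worth a sketch: as the outputs above suggest, glob.glob() returns matches in arbitrary, OS-dependent order. If the concatenation order matters (e.g. files are monthly chunks), sort the list first; the file names below are hypothetical:

```python=
import glob

# Sort for a deterministic, lexicographic order before loading.
csv_files = sorted(glob.glob('*.csv'))
print(csv_files)
# e.g. ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv', 'file5.csv']
```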
* Iterating and concatenating all matches
    1. Iterate through each of the filenames.
    2. Read each filename into a DataFrame.
    3. Append it to the frames list.

```python=
In [1]: # Create an empty list: frames
frames = []

In [2]: # Iterate over csv_files
for csv in csv_files:
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    # Append df to frames
    frames.append(df)

In [3]: # Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

In [4]: # Print the shape of uber
print(uber.shape)
(297, 4)

In [5]: # Print the head of uber
print(uber.head())
          Date/Time      Lat      Lon    Base
0  6/1/2014 0:00:00  40.7293 -73.9920  B02512
1  6/1/2014 0:01:00  40.7131 -74.0097  B02512
2  6/1/2014 0:04:00  40.3461 -74.6610  B02512
3  6/1/2014 0:04:00  40.7555 -73.9833  B02512
4  6/1/2014 0:07:00  40.6880 -74.1831  B02512
```

## [3-3] Merge data

### 1. Combining data
> Besides concat, we can also use merge to combine data.
* Concatenation is not the only way data can be combined

![](https://i.imgur.com/pIT9xM3.png)

### 2. Merging data | merge()
> Combine two different tables on the values of specified columns.
* Similar to joining tables in SQL
* **Combine disparate datasets based on common columns**

![](https://i.imgur.com/sHKuhwt.png)

#### pd.merge(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)]
* Parameters:
    * left: the left table.
    * right: the right table.
    * on: a column name (label or list) present in both tables; defaults to None.
    * left_on: the key column in the left table.
    * right_on: the key column in the right table.

```python=
pd.merge(left=state_populations, right=state_codes,
         on=None, left_on='state', right_on='name')
#[Out]:
'''
        state  population_2016        name ANSI
0  California         39250017  California   CA
1       Texas         27862596       Texas   TX
2     Florida         20612439     Florida   FL
3    New York         19745289    New York   NY
'''
```

### 3. Types of merge
> Different kinds of merge:
* All use the same function
* Only difference is the DataFrames you are merging

> One-to-one: the key column on each side has exactly one matching row.

![](https://i.imgur.com/fyWXN4u.png =400x)

> Many-to-one / one-to-many
* The key column (state) above contains two duplicated New York rows; when the right table is merged in, the corresponding values are automatically repeated.

![](https://i.imgur.com/HrcRXf3.png =400x)
![](https://i.imgur.com/5sbXi93.png =400x)

> Many-to-many (see the sketch after the practice below)

#### practice
* 1-to-1 data merge
```python=
In [1]: print(site.head())
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40

In [2]: print(visited.head())
   ident   site       dated
0    619   DR-1  1927-02-08
1    734   DR-3  1939-01-07
2    837  MSK-4  1932-01-14

In [3]: # Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

In [4]: # Print o2o
print(o2o)
    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
2  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
```

* Many-to-1 data merge
```python=
In [1]: print(site.head())
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40

In [2]: print(visited.head())
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26

In [3]: # Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

In [4]: # Print m2o
print(m2o)
    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
4   DR-3 -47.15 -126.72    735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72    751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72    752   DR-3         NaN
7  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
```
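The slides mention many-to-many merges without an example. A sketch under assumed data: when *both* tables repeat the key, merge produces every pairing (a Cartesian product per key value); the table contents here are made up:

```python=
import pandas as pd

# Hypothetical tables: each repeats the key 'CA' twice.
populations = pd.DataFrame({'state': ['CA', 'CA'], 'city': ['LA', 'SF']})
sites = pd.DataFrame({'state': ['CA', 'CA'], 'site': ['Alcatraz', 'Yosemite']})

# Many-to-many: 2 rows x 2 rows sharing the same key -> 4 rows.
m2m = pd.merge(left=populations, right=sites, on='state')
print(m2m)
#   state city      site
# 0    CA   LA  Alcatraz
# 1    CA   LA  Yosemite
# 2    CA   SF  Alcatraz
# 3    CA   SF  Yosemite
```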
---

# Cleaning data for analysis

## [4-1] Data types

### 1. Data types | dtypes
#### df.dtypes [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)]
> Using this table as the example:

![](https://i.imgur.com/xyOfHPz.png =400x)

> df.dtypes shows the data type of every column.
```python=
print(df.dtypes)
#[Out]:
'''
name           object
sex            object
treatment a    object
treatment b     int64
dtype: object
'''
```

### 2. Converting data types | astype()
> Use astype() to change a column's data type.
#### df[column].astype() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)]
```python=
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
df.dtypes
#[Out]:
'''
name             object
sex            category
treatment a      object
treatment b      object
dtype: object
'''
```

### 3. Categorical data
* Converting categorical data to the 'category' dtype:
    * Can make the DataFrame **smaller in memory** (a comparison sketch follows the practice below)
    * Can make them utilizable by other Python libraries for analysis

### 4. Cleaning bad data
> A numeric column was read in as strings because of a '-' value.

![](https://i.imgur.com/aIre55K.png =400x)

> The fix:
#### pd.to_numeric() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html)]
* Parameters:
    * errors: {'ignore', 'raise', 'coerce'}, default 'raise'
        * If 'raise', then invalid parsing will raise an exception
        * If 'coerce', then invalid parsing will be set as NaN
        * If 'ignore', then invalid parsing will return the input

```python=
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
df.dtypes
#[Out]:
'''
name             object
sex            category
treatment a     float64
treatment b      object
dtype: object
'''
```
* errors='coerce' turns unconvertible values into NaN.

#### practice
- Converting data types
```python=
In [1]: # Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

In [2]: # Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

In [3]: # Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None
```

* Working with numeric data
```python=
In [1]: # Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

In [2]: # Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

In [3]: # Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    202 non-null float64
tip           220 non-null float64
sex           234 non-null category
smoker        229 non-null category
day           243 non-null category
time          227 non-null category
size          231 non-null float64
dtypes: category(4), float64(3)
memory usage: 7.2 KB
None
```
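To see the "smaller in memory" claim from the categorical-data notes above, compare memory usage before and after the conversion. A sketch with made-up, low-cardinality data:

```python=
import pandas as pd

# 20,000 strings but only two distinct values -- ideal for 'category'.
sex = pd.Series(['Male', 'Female'] * 10000)

as_object = sex.memory_usage(deep=True)                        # Python string storage
as_category = sex.astype('category').memory_usage(deep=True)   # small integer codes + 2 labels

# The category version is typically far smaller.
print(as_object, as_category)
```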
## [4-2] Using regular expressions to clean strings

### 1. String manipulation
* Much of data cleaning involves string manipulation
* Most of the world's data is unstructured text
* Also have to do string manipulation to make datasets consistent with one another
* Many built-in and external libraries
* The 're' library for regular expressions
    * A formal way of specifying a pattern
    * Sequence of characters
* Pattern matching
    * Similar to globbing

### 2. Example match
![](https://i.imgur.com/MAslyWw.png =400x)

### 3. Using regular expressions | the re module
> The regular-expression module re is a very powerful text-parsing tool.
#### **re** [[doc](https://docs.python.org/3/library/re.html)]
```python=
import re
pattern = re.compile('\$\d*\.\d{2}')
result = pattern.match('$17.89')
'''equivalent to:
result = re.match('\$\d*\.\d{2}', '$17.89')
'''
bool(result)
#[Out]: True
```
* re.compile(): compiles a regular-expression pattern into a regular expression object.
    * Can then be used with search() or match().
* re.search(pattern, string, flags=0): scans the string and returns the first match anywhere in it, or None if there is none.
* re.match(pattern, string, flags=0): checks for a match only at the beginning of the string.

```python=
import re
recipe = "I need 10 strawberries and 2 apples."
print(re.findall("\d+ [a-z]+", recipe))
#[Out]: ['10 strawberries', '2 apples']
```

#### practice
- String parsing with regular expressions
```python=
In [1]: # Import the regular expression module
import re

# Compile the pattern: prog
prog = re.compile('\d{3}\-\d{3}\-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')  # check whether the string matches prog's pattern
print(bool(result))
True

In [2]: # See if the pattern matches
result2 = prog.match('1123-456-7890')  # check whether the string matches prog's pattern
print(bool(result2))
False
```

* Extracting numerical values from strings
```python=
In [1]: # Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall('\d+', 'the recipe calls for 10 strawberries and 1 banana')
# Find all numeric values in the sentence. With just \d, the result
# would be ['1', '0', '1']; \d+ groups consecutive digits together,
# giving ['10', '1'].

# Print the matches
print(matches)
['10', '1']
```

- Pattern matching
```python=
In [4]: # Write the first pattern
pattern1 = bool(re.match(pattern='\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
True

In [5]: # Write the second pattern
pattern2 = bool(re.match(pattern='\$\d+.\d{2}', string='$123.45'))
# note: the '.' here is unescaped, so it matches any character; '\.' would be stricter
print(pattern2)
True

# Use [A-Z] to match any capital letter followed by \w*
# to match an arbitrary number of alphanumeric characters.
In [6]: # Write the third pattern
pattern3 = bool(re.match(pattern='[A-Z]\w*', string='Australia'))
print(pattern3)
True
```

## [4-3] Using functions to clean data

### 1. Complex cleaning
* Cleaning sometimes requires **multiple steps**:
    * Extract a number from a string
    * Perform a transformation on the extracted number
* Write a Python function

### 2. Apply | apply()
#### df.apply() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)]
* Parameters:
    * axis: apply down each column (0) or across each row (1).
* Output: Series
* Applies a function across a whole sequence (list, Series, column) at once.

```python=
print(df)
#[Out]:
'''
        treatment a  treatment b
Daniel           18           42
John             12           31
Jane             24           27
'''
df.apply(np.mean, axis=0)
#[Out]:
'''
treatment a    18.000000
treatment b    33.333333
dtype: float64
'''
df.apply(np.mean, axis=1)
#[Out]:
'''
Daniel    30.0
John      21.5
Jane      25.5
dtype: float64
'''
```

### 3. Applying functions
> We now combine a function, re, and apply to clean the table below:

![](https://i.imgur.com/GTM5YnB.png =400x)

> First, write a regular expression:
```python=
import re
from numpy import NaN
pattern = re.compile('^\$\d*\.\d{2}$')
```

> Then write a function:
```python=
def diff_money(row, pattern):
    icost = row['Initial Cost']
    tef = row['Total Est. Fee']
    if bool(pattern.match(icost)) and bool(pattern.match(tef)):
        icost = icost.replace("$", "")
        tef = tef.replace("$", "")
        icost = float(icost)
        tef = float(tef)
        return icost - tef
    else:
        return(NaN)
```
* bool(): cast to a boolean
* float(): cast to a float
* replace(): replace part of a string with another string

> Then use apply() with the function to create a new column ('diff'):
```python=
df_subset['diff'] = df_subset.apply(diff_money, axis=1, pattern=pattern)
print(df_subset.head())
#[Out]:
'''
       Job #  Doc #        Borough Initial Cost Total Est. Fee     diff
0  121577873      2      MANHATTAN    $75000.00        $986.00  74014.0
1  520129502      1  STATEN ISLAND        $0.00       $1144.00  -1144.0
2  121601560      1      MANHATTAN    $30000.00        $522.50  29477.5
3  121601203      1      MANHATTAN     $1500.00        $225.00   1275.0
4  121601338      1      MANHATTAN    $19500.00        $389.50  19110.5
'''
```
* Note: axis=1 is set, i.e. row-wise, because each new value is computed from other values in the same row.
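A row-wise apply is the course's approach; the same cleaning can often be done column-wise with pandas string methods, which is a common vectorized alternative. A sketch, assuming every value matches the `$dd.dd` pattern (a small stand-in for df_subset is built so the snippet runs on its own; `regex=False` needs a reasonably recent pandas):

```python=
import pandas as pd

# Minimal stand-in for df_subset from the example above.
df_subset = pd.DataFrame({'Initial Cost': ['$75000.00', '$0.00'],
                          'Total Est. Fee': ['$986.00', '$1144.00']})

# Vectorized alternative to the row-wise diff_money():
# strip the '$', convert to float, then subtract whole columns at once.
icost = df_subset['Initial Cost'].str.replace('$', '', regex=False).astype(float)
tef = df_subset['Total Est. Fee'].str.replace('$', '', regex=False).astype(float)
df_subset['diff'] = icost - tef

print(df_subset)
#   Initial Cost Total Est. Fee     diff
# 0    $75000.00        $986.00  74014.0
# 1        $0.00       $1144.00  -1144.0
```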
#### practice
- Custom functions to clean data
```python=
In [1]: tips.sex.head()
Out[1]:
0    Female
1      Male
2      Male
3      Male
4    Female
Name: sex, dtype: object

In [2]: # Define recode_gender()
def recode_gender(gender):
    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0
    # Return 1 if gender is 'Male'
    elif gender == 'Male':
        return 1
    # Return np.nan
    else:
        return np.nan

In [3]: # Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

In [4]: # Print the first five rows of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size  recode
0       16.99  1.01  Female     No  Sun     NaN   2.0     0.0
1       10.34  1.66    Male     No  Sun  Dinner   3.0     1.0
2       21.01  3.50    Male     No  Sun  Dinner   3.0     1.0
3       23.68  3.31    Male     No  Sun  Dinner   2.0     1.0
4         NaN  3.61  Female     No  Sun  Dinner   4.0     0.0
```

- Lambda functions
> A function can be simplified with lambda:
```python=
def my_square(x):
    return x ** 2

df.apply(my_square)
# equal to
df.apply(lambda x: x ** 2)
```

```python=
In [1]: # Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

In [2]: # Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall('\d+\.\d+', x)[0])

In [3]: # Print the head of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar total_dollar_replace total_dollar_re
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99                16.99           16.99
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34                10.34           10.34
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01                21.01           21.01
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68                23.68           23.68
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59                24.59           24.59
```

## [4-4] Duplicate and missing data

### 1. Duplicate data | drop_duplicates()
* Can skew results

![](https://i.imgur.com/wvL9nqo.png =400x)

> Handle duplicates with drop_duplicates():
#### df.drop_duplicates() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)]
```python=
df = df.drop_duplicates()
print(df)
#[Out]:
'''
     name     sex treatment a  treatment b
0  Daniel    male           -           42
1    John    male          12           31
2    Jane  female          24           27
'''
```
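drop_duplicates() also accepts parameters not shown above. A sketch of the two most useful ones, subset and keep, on made-up rows:

```python=
import pandas as pd

df = pd.DataFrame({'name': ['Daniel', 'Daniel', 'Jane'],
                   'treatment b': [42, 31, 27]})

# Consider only 'name' when deciding what counts as a duplicate,
# and keep the LAST occurrence of each duplicated name.
deduped = df.drop_duplicates(subset=['name'], keep='last')
print(deduped)
#      name  treatment b
# 1  Daniel           31
# 2    Jane           27
```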
### 2. Missing data | dropna(), fillna()
> Missing values.

![](https://i.imgur.com/d1GTfMI.png =400x)

* Ways to handle them:
    * Leave as-is
    * Drop them
    * Fill in the missing values

> info() shows clearly how many missing values each column has:
```python=
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    202 non-null float64
tip           220 non-null float64
sex           234 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          231 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
None
'''
```

> Method 1: drop missing values with dropna()
#### df.dropna() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)]
```python=
tips_dropped = tips_nan.dropna()
tips_dropped.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 243
Data columns (total 7 columns):
total_bill    147 non-null float64
tip           147 non-null float64
sex           147 non-null object
smoker        147 non-null object
day           147 non-null object
time          147 non-null object
size          147 non-null float64
dtypes: float64(3), object(4)
memory usage: 9.2+ KB
'''
```

> Method 2: fill missing values with fillna()
#### df[columns].fillna() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)]
* Fill with a provided value
* Or use a summary statistic
```python=
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           220 non-null float64
sex           244 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
'''
```
* Note: in pandas you can select multiple columns with double brackets, as in tips_nan[['total_bill', 'size']] above.

> Method 3: fill with a test statistic
* Careful when using test statistics to fill
* Have to make sure the value you are filling in makes sense
* The median is a better statistic in the presence of outliers (see the sketch after the practice below)
```python=
mean_value = tips_nan['tip'].mean()
print(mean_value)
#[Out]: 2.964681818181819
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
'''
```

#### practice
- Filling missing data
```python=
In [2]: # Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality['Ozone'].mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None
```
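As noted above, the median is more robust than the mean when outliers are present; the fill itself follows exactly the same pattern. A sketch on a made-up column with one extreme value:

```python=
import pandas as pd
import numpy as np

# Small stand-in column with an outlier (100.0) and a missing value.
tips_nan = pd.DataFrame({'tip': [1.0, 2.0, 3.0, 100.0, np.nan]})

# The outlier drags the mean up to 26.5, but the median stays at 2.5,
# so the median is the safer fill value here.
median_value = tips_nan['tip'].median()
tips_nan['tip'] = tips_nan['tip'].fillna(median_value)
print(tips_nan)
```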
## [4-5] Testing with asserts

### 1. Assert statements
#### the assert statement [[doc](https://docs.python.org/3/reference/simple_stmts.html#grammar-token-assert-stmt)]
> Assertions are a convenient way to plant debugging checks into a program.
* Programmatically vs. visually checking
* If we drop or fill NaNs, we expect 0 missing values
* We can write an assert statement to verify this
* We can detect early warnings and errors
* This gives us confidence that our code is running correctly

```python=
assert 1 == 1
assert 1 == 2
#[Out]:
'''
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-65-a810b3a4aded> in <module>()
----> 1 assert 1 == 2
AssertionError:
'''
```

### 2. Google stock data
![](https://i.imgur.com/hyEXwc8.png)

> Test a column:
```python=
assert google.Close.notnull().all()
#[Out]:
'''
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-49-eec77130a77f> in <module>()
----> 1 assert google.Close.notnull().all()
AssertionError:
'''
```
* Here notnull() tests every value in the Close column for missingness.
* all() then combines the results into a single value.
* Finally, assert checks that nothing is wrong.

```python=
google_0 = google.fillna(value=0)
assert google_0.Close.notnull().all()
```

#### practice
- Testing your data with asserts
```python=
In [1]: # Assert that there are no missing values
assert pd.notnull(ebola).all().all()
# The first .all() returns a True or False for each column;
# the second .all() reduces those to a single True or False.
# If the expression after assert is true, nothing happens;
# if it is false, an AssertionError is raised.

In [2]: # Assert that all values are >= 0
assert (ebola >= 0).all().all()
```
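Beyond missing-value checks, asserts can encode other expectations built up in the earlier chapters, such as dtypes, value ranges, and the absence of duplicates. A closing sketch with assumed columns (the checks on the ebola data here are hypothetical):

```python=
# Hypothetical post-cleaning checks, in the same spirit as above.
assert ebola['Day'].dtype == 'int64'    # the Day column parsed as integers
assert (ebola['Day'] >= 0).all()        # no negative day counts
assert ebola.duplicated().sum() == 0    # no duplicate rows remain
```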