# Cleaning Data in Python
###### tags: `Datacamp` `python` `Cleaning Data`

>**Author: 何彥南**
>DataCamp course: [Cleaning Data in Python](https://www.datacamp.com/courses/cleaning-data-in-python)

**Notes:**
1. `df` stands for a pandas DataFrame.
2. `pd` is the usual alias for the pandas package.
3. Treat the official [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) as the authoritative reference.
4. Check your pandas version; some features shown here may not work in newer releases.
5. In the code blocks, `#[Out]:` marks output.

---
[TOC]

---
# Exploring your data
## [1-1] Diagnose data for cleaning
### 1. Cleaning data
* Prepare data for analysis
* Data almost never comes in clean
* Diagnose your data for problems

### 2. Common data problems
* Inconsistent column names
* Missing data
* Outliers
* Duplicate rows
* Untidy
* Need to process columns
* Column types can signal unexpected data values

### 3. Unclean data
> An example

![](https://i.imgur.com/zG6VbbA.png)
* The table above has several problems:
    * Column name inconsistencies
    * Missing data
    * Country names are in French (a language or encoding issue)

### 4. Load your data | read_csv()
> Load your data
#### **pd.read_csv()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)]
* Common parameters
    * **filepath_or_buffer** (required, first positional argument): location of the file.
    * header: row to use as the column names.
        * `header=None`: use when the file has no header row; the columns are then labelled 0, 1, 2, ...
        * By default the first row (index 0) is used as the column names.
    * names: column names to use, passed as a list.
    * usecols: columns to load, passed as a list of positions (index) or column names.
    * skiprows: number of lines to skip at the start of the file (int).
    * encoding: encoding used to read the file (string).

```python
import pandas as pd
df = pd.read_csv('literary_birth_rate.csv')
```

### 5. Visually inspect | head(), tail()
> Look at the data first and rule out the errors you can see.

![](https://i.imgur.com/Rgi2sF8.png)
#### **df.head()** [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)]
* First n rows, default 5.
#### **df.tail()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail)]
* Last n rows, default 5.

### 6. Visually inspect (2) | columns, shape, info()
![](https://i.imgur.com/yKYt4HP.png)
#### **columns**:
* Names of all columns (iterable).
#### **shape**:
* Dimensions of the table, returned as (rows, columns).
#### **info()**: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)
* Overall information about the table.

## [1-2] Exploratory data analysis
### 1. Data type of each column
> The data type of each column

![](https://i.imgur.com/W8th4ro.png)

### 2. Frequency counts: continent | value_counts()
#### **df[column].value_counts()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)]
* Counts how often each value occurs in the given column.
* The result is indexed by category; the values are the counts.
* Note:
    * value_counts().values: the count for each category
    * value_counts().index: the categories
* Parameters:
    * ascending: bool; sorted in descending order by default, `True` sorts ascending.
    * normalize: bool; return proportions instead of counts.
    * dropna: bool; exclude missing values from the counts (default `True`).

```python
df.continent.value_counts(dropna=False)
df['continent'].value_counts(dropna=False)
#[Out]:
'''
AF     49
ASI    47
EUR    36
LAT    24
OCE     6
NAM     2
Name: cont
'''
```
* On a DataFrame, df.continent and df['continent'] both select the column `continent` of `df`.
* The result may contain missing or NaN values. These deserve attention and are covered later.

### 3. Summary statistics
> Summary statistics
* Numeric columns: numeric summaries of each column.
* Outliers: values that are considerably higher or lower than the rest. Once spotted, investigate them further.

### 4. Summary statistics: Numeric data | describe()
#### **df.describe()**: [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)]
* Shows summary statistics for the numeric columns.

```python
df.describe()
#[Out]:
'''
       female_literacy    population
count       164.000000  1.220000e+02
mean         80.301220  6.345768e+07
std          22.977265  2.605977e+08
min          12.600000  1.035660e+05
25%          66.675000  3.778175e+06
50%          90.200000  9.995450e+06
75%          98.500000  2.642217e+07
max         100.000000  2.313000e+09
'''
```

## [1-3] Visual exploratory data analysis
### 1. Data visualization
* Great way to spot outliers and obvious errors
* More than just looking for patterns
* Plan data cleaning steps

### 2. Bar plots and histograms
* Bar plots for discrete data counts
* Histograms for continuous data counts
* Look at frequencies

### 3. Histogram
> Well suited to showing the frequency distribution of continuous data

```python
import matplotlib.pyplot as plt

# Pass the plot type by keyword; newer pandas versions warn about positional arguments here.
df.population.plot(kind='hist')
#[Out]:<matplotlib.axes._subplots.AxesSubplot at 0x7f78e4abafd0>
plt.show()
```
![](https://i.imgur.com/td2eHvf.png)
* The histogram shows outliers in population at roughly 1 billion and 2 billion.

### 4. Identifying the error
> To identify anomalies (errors), use the basic filtering features of pandas to inspect the outliers (continuing from above).

```python
df[df.population > 1000000000]
#[Out]:
'''
    continent    country  female_literacy  fertility    population
0         ASI      Chine             90.5      1.769  1.324655e+09
1         ASI       Inde             50.8      2.682  1.139965e+09
162       OCE  Australia             96.0      1.930  2.313000e+09
'''
```
* Not every outlier is bad data; use your own judgment.
* Here, Australia's population is clearly wrong.

### 5. Box plots
* Visualizes the basic distribution of the data:
    * Outliers
    * Min/max
    * 25th, 50th, 75th percentiles

```python
df.boxplot(column='population', by='continent')
#[Out]:<matplotlib.axes._subplots.AxesSubplot at 0x7ff5581bb630>
plt.show()
```
* Output

![](https://i.imgur.com/in21IG4.png)
* The circled points are the outliers: two between 1 and 1.5, and one above 2.

### 6. Scatter plots
* Relationship between 2 numeric variables
* Flag potentially bad data
* Errors not found by looking at 1 variable

---
# Tidying data for analysis
## [2-1] Tidy data
* “Tidy Data” paper by Hadley Wickham, PhD
* Formalize the way we describe the shape of data
* Gives us a goal when formatting our data
* “Standard way to organize data values within a dataset”

### 1. Principles of tidy data
* Columns represent separate variables
* Rows represent individual observations
* Observational units form tables

### 2. Converting to tidy data
![](https://i.imgur.com/KSNRR8A.png)
* Better for reporting vs. better for analysis
* Tidy data makes it easier to fix common data problems
* The data problem we are trying to fix:
    * Columns containing values, instead of variables

### 3. Melting | melt()
> Melting turns several columns that hold the same kind of value into two columns: a category column built from the original column names (recording which column each value came from) and a value column holding the original values.
#### pd.melt(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)]
* Parameters:
    * frame: the input DataFrame.
    * id_vars: columns to keep as identifiers.
    * value_vars: columns to melt into a single category column.
    * var_name: name of the resulting category column.
    * value_name: name of the resulting value column.

```python
pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')
#[Out]:
'''
     name    treatment result
0  Daniel  treatment a      -
1    John  treatment a     12
2    Jane  treatment a     24
3  Daniel  treatment b     42
4    John  treatment b     31
5    Jane  treatment b     27
'''
```
* Before melting, the data is organized by person (hence `name` as id_vars) and records both treatment results for each person.
* Some table shapes suit reporting but not analysis; we need to know which shape suits the task at hand.
* After pd.melt(), each row represents a single treatment, which makes later analysis easier and clearer.

![](https://i.imgur.com/KSNRR8A.png)

## [2-2] Pivoting data
### 1. Pivot: un-melting data | pivot()
> Turn rows into columns

![](https://i.imgur.com/FtOS4OK.png)
* Opposite of melting
* In melting, we turned columns into rows
* **Pivoting: turn unique values into separate columns**
* Analysis friendly shape to reporting friendly shape
* **Useful when the data violates the tidy data principle that rows contain observations:**
    * **Multiple variables stored in the same column**
#### df.pivot(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html#pandas.DataFrame.pivot)]
* Parameters:
    * index: column to use as the row index.
    * columns: column holding the categories; each category becomes a new column.
    * values: column holding the values that fill the new columns.

```python
weather_tidy = weather.pivot(index='date', columns='element', values='value')
print(weather_tidy)
#[Out]:
'''
element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  14.4
'''
```

### 2. Using pivot when you have duplicate entries
> When the same index and category have more than one value, pivot() raises the error below.

![](https://i.imgur.com/xSPN3J2.png)
* With date as the index, 2010-02-02 has more than one tmin value, so pivot() cannot reshape the table.

```python
import numpy as np
weather2_tidy = weather2.pivot(values='value',
                               index='date',
                               columns='element')
#[Out]:
'''
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-2962bb23f5a3> in <module>()
      1 weather2_tidy = weather2.pivot(values='value',
      2                                index='date',
----> 3                                columns='element')
ValueError: Index contains duplicate entries, cannot reshape
'''
```

### 3. Pivot table | pivot_table()
> Compared with pivot(), pivot_table() takes more parameters and can handle the duplicate-entry error shown above.
#### df.pivot_table(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table)]
* Parameters
    * index: column to use as the row index.
    * columns: column holding the categories; each category becomes a new column.
    * values: column holding the values that fill the new columns.
    * aggfunc: function used to aggregate duplicate values.
* Has a parameter that specifies how to deal with duplicate values
* Example: Can aggregate the duplicate values by taking their average

```python
weather2_tidy = weather2.pivot_table(values='value', index='date',
                                     columns='element', aggfunc=np.mean)
#[Out]:
'''
element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  15.4
'''
```
* The aggfunc parameter tells pandas how to combine the duplicate values. Here they are averaged (np.mean).

## [2-3] Beyond melt and pivot
### 1. Melting and parsing | str[ ]
* Another common problem:
    * Columns contain multiple bits of information

![](https://i.imgur.com/VzQW5Pk.png)
* Here m014 means males aged 0 to 14 and m1524 means males aged 15 to 24.

```python
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])
tb_melt
#[Out]:
'''
  country  year variable  value
0      AD  2000     m014      0
1      AE  2000     m014      2
2      AF  2000     m014     52
3      AD  2000    m1524      0
4      AE  2000    m1524      4
5      AF  2000    m1524    228
'''
```
* Nothing inherently wrong about original data shape
* Not conducive for analysis
> Use str to extract the first character
* df[column].str[ ]: [[doc](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#indexing-with-str)]
    * str[0] is the first character (position 0), str[-1] is the last character, and str[1:3] takes the characters at positions 1 to 3 (3 excluded)

```python
tb_melt['sex'] = tb_melt.variable.str[0]
tb_melt
#[Out]:
'''
  country  year variable  value sex
0      AD  2000     m014      0   m
1      AE  2000     m014      2   m
2      AF  2000     m014     52   m
3      AD  2000    m1524      0   m
4      AE  2000    m1524      4   m
5      AF  2000    m1524    228   m
'''
```
* Applying str[0] to the variable column splits the first character off into a sex column, which makes later analysis easier.

---
# Combining data for analysis
## [3-1] Concatenating data
### 1. Combining data
* Data may not always come in 1 huge file
    * 5 million row dataset may be broken into 5 separate datasets
    * Easier to store and share
    * May have new data for each day
* Important to be able to combine then clean, or vice versa
* The tables below are used as the example:

![](https://i.imgur.com/JAIxG6d.png)

### 2. pandas concat | concat(), loc[ ], iloc[ ]
> Use pd.concat() to combine two tables
#### pd.concat() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)]
* Parameters:
    * Required: a list of the tables to combine
    * ignore_index: reset the index after concatenating

```python
concatenated = pd.concat([weather_p1, weather_p2])
print(concatenated)
#[Out]:
'''
         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
0  2010-02-02    tmax   27.3
1  2010-02-02    tmin   14.4
'''
```
* Note that the combined table keeps the original index values.
* Below we use ignore_index:

```python
pd.concat([weather_p1, weather_p2], ignore_index=True)
#[Out]:
'''
         date element  value
0  2010-01-30    tmax   27.8
1  2010-01-30    tmin   14.5
2  2010-02-02    tmax   27.3
3  2010-02-02    tmin   14.4
'''
```
> With .loc[,] in pandas we can select values from a table by index and column.
* loc[index, column]: selects by label; labels may be strings.
* iloc[index, column]: selects by position; integers only.
* Both iloc and loc take an index part and a column part; they differ only in what is used to locate them.

```python
concatenated.loc[0, :]
#[Out]:
'''
         date element  value
0  2010-01-30    tmax   27.8
0  2010-02-02    tmax   27.3
'''
```
* This selects every row whose index is 0 (`:` means "all").

### 3. Concatenating DataFrames
> Here we split age and sex out into a separate table so that we can analyze by sex and age.

![](https://i.imgur.com/t6awMuL.png)
* Hint: use df[column].str[reference]

## [3-2] Finding and concatenating data
### 1. Concatenating many files
> When there are many similar files to process
* Leverage Python’s features with data cleaning in pandas
* In order to concatenate DataFrames:
    * They must be in a list
    * Can individually load if there are a few datasets
    * But what if there are thousands?
* Solution: glob function to find files based on a pattern

### 2. Globbing | the glob module
> The glob module is used to look up file paths
#### glob module [doc](https://docs.python.org/3/library/glob.html)
* Pattern matching for file names
* Wildcards: `*` and `?`
    * Any csv file: `*.csv`
    * Any single character: `file_?.csv`
* **Returns a list of file names**
    * **Can use this list to load into separate DataFrames**
> The plan
* Load files from globbing into pandas
* Add the DataFrames into a list
* Concatenate multiple datasets at once
> Use glob.glob() to find the files.

```python
import glob
csv_files = glob.glob('*.csv')
print(csv_files)
#[Out]:['file5.csv', 'file2.csv', 'file3.csv', 'file1.csv', 'file4.csv']
```
> Loop over the paths in csv_files, load each file, and append it to list_data.

```python
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
```

## [3-3] Merge data
### 1. Combining data
> Besides concat, we can also combine data with merge
* Concatenation is not the only way data can be combined

![](https://i.imgur.com/pIT9xM3.png)

### 2. Merging data | merge()
> Combine two different tables on the values of specified columns
* Similar to joining tables in SQL
* **Combine disparate datasets based on common columns**

![](https://i.imgur.com/sHKuhwt.png)
#### pd.merge(): [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)]
* Parameters:
    * left: the left table
    * right: the right table
    * on: column name(s) to join on (label or list); must exist in both tables. Default is None.
    * left_on: key column(s) in the left table
    * right_on: key column(s) in the right table

```python
pd.merge(left=state_populations, right=state_codes,
         on=None, left_on='state', right_on='name')
#[Out]:
'''
        state  population_2016        name ANSI
0  California         39250017  California   CA
1       Texas         27862596       Texas   TX
2     Florida         20612439     Florida   FL
3    New York         19745289    New York   NY
'''
```

### 3. Types of merge
> Different types of merge
* All use the same function
* Only difference is the DataFrames you are merging
> One-to-one: each key value in the left and right key columns matches exactly one row.

![](https://i.imgur.com/amSzGPw.png)
> Many-to-one / one-to-many

![](https://i.imgur.com/HrcRXf3.png)
* The key column above (state) contains New York twice; when the right table is merged in, the matching values are repeated automatically.

![](https://i.imgur.com/5sbXi93.png)
> Many-to-Many

---
# Cleaning data for analysis
## [4-1] Data types
### 1. Data types | dtypes
#### df.dtypes [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)]
> Using the table below as an example

![](https://i.imgur.com/xyOfHPz.png)
> df.dtypes shows the data type of each column.

```python
print(df.dtypes)
#[Out]:
'''
name           object
sex            object
treatment a    object
treatment b     int64
dtype: object
'''
```

### 2. Converting data types | astype()
> Use astype() to change a column's data type
#### df[column].astype() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)]

```python
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
df.dtypes
#[Out]:
'''
name             object
sex            category
treatment a      object
treatment b      object
dtype: object
'''
```

### 3. Categorical data
* Converting categorical data to ‘category’ dtype:
    * Can make the DataFrame **smaller in memory**
    * Can make them be utilized by other Python libraries for analysis

### 4. Cleaning bad data
> A numeric column was read as strings on load because of the '-' values.

![](https://i.imgur.com/aIre55K.png)
> Solution
#### pd.to_numeric() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html)]
* Parameters:
    * errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
        * If ‘raise’, then invalid parsing will raise an exception
        * If ‘coerce’, then invalid parsing will be set as NaN
        * If ‘ignore’, then invalid parsing will return the input

```python
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
df.dtypes
#[Out]:
'''
name             object
sex            category
treatment a     float64
treatment b      object
dtype: object
'''
```
* errors='coerce': values that cannot be converted become NaN

## [4-2] Using regular expressions to clean strings
### 1. String manipulation
* Much of data cleaning involves string manipulation
    * Most of the world’s data is unstructured text
* Also have to do string manipulation to make datasets consistent with one another
* Many built-in and external libraries
* ‘re’ library for regular expressions
    * A formal way of specifying a pattern
    * Sequence of characters
* Pattern matching
    * Similar to globbing

### 2. Example match
![](https://i.imgur.com/MAslyWw.png)

### 3. Using regular expressions | the re module
> The regular-expression module re is a very powerful text-parsing tool.
#### **re** [[doc](https://docs.python.org/3/library/re.html)]

```python
import re
# Raw strings keep the backslashes in the pattern intact.
pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')
# Equivalent to:
# result = re.match(r'\$\d*\.\d{2}', '$17.89')
bool(result)
#[Out]:True
```
* re.compile(): compiles a regular-expression pattern into a regular expression object.
    * Can be used together with search() or match().
* re.search(pattern, string, flags=0): returns a match object for the first location in the string where the pattern matches, or None if there is no match.
* re.match(pattern, string, flags=0): checks whether the pattern matches at the beginning of the string.

```python
import re
recipe = "I need 10 strawberries and 2 apples."
print(re.findall(r"\d+ [a-z]+", recipe))
#[Out]:['10 strawberries', '2 apples']
```

## [4-3] Using functions to clean data
### 1. Complex cleaning
* Cleaning step requires **multiple steps**
    * Extract number from string
    * Perform transformation on extracted number
* Python function

### 2. Apply | apply()
#### df.apply() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)]
* Parameters:
    * axis: apply the function to each column (0) or to each row (1)
* Output: Series
* Applies a function to a whole sequence (list, Series, column) in one go.

```python
print(df)
#[Out]:
'''
        treatment a  treatment b
Daniel           18           42
John             12           31
Jane             24           27
'''
df.apply(np.mean, axis=0)
#[Out]:
'''
treatment a    18.000000
treatment b    33.333333
dtype: float64
'''
df.apply(np.mean, axis=1)
#[Out]:
'''
Daniel    30.0
John      21.5
Jane      25.5
dtype: float64
'''
```

### 3. Applying functions
> We will combine a function, re, and apply to clean data, using the table below as the example:

![](https://i.imgur.com/GTM5YnB.png)
> First, write a regular expression

```python
import re
# NumPy 2.0 removed the `NaN` alias, so import `nan` under the old name.
from numpy import nan as NaN
pattern = re.compile(r'^\$\d*\.\d{2}$')
```
> Next, write a function

![](https://i.imgur.com/Rbps28g.png)
* bool(): converts to a boolean
* float(): converts to a float
* replace(): replaces a given substring with another string
> Then use apply() with the function to create a new column ('diff')

```python
df_subset['diff'] = df_subset.apply(diff_money, axis=1, pattern=pattern)
print(df_subset.head())
#[Out]:
'''
       Job #  Doc #        Borough Initial Cost Total Est. Fee     diff
0  121577873      2      MANHATTAN    $75000.00        $986.00  74014.0
1  520129502      1  STATEN ISLAND        $0.00       $1144.00  -1144.0
2  121601560      1      MANHATTAN    $30000.00        $522.50  29477.5
3  121601203      1      MANHATTAN     $1500.00        $225.00   1275.0
4  121601338      1      MANHATTAN    $19500.00        $389.50  19110.5
'''
```
* Note: axis=1 is used, i.e. the function is applied row by row, because each new value is computed from other values in the same row.

## [4-4] Duplicate and missing data
### 1. Duplicate data | drop_duplicates()
> Duplicate values
* Can skew results

![](https://i.imgur.com/wvL9nqo.png)
> Handle them with drop_duplicates()
#### df.drop_duplicates() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)]

```python
df = df.drop_duplicates()
print(df)
#[Out]:
'''
     name     sex treatment a  treatment b
0  Daniel    male           -           42
1    John    male          12           31
2    Jane  female          24           27
'''
```

### 2. Missing data | dropna(), fillna()
> Missing values

![](https://i.imgur.com/d1GTfMI.png)
* Options:
    * Leave as-is
    * Drop them
    * Fill missing value
> info() shows the non-null count of each column, which makes it clear how many values are missing

```python
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    202 non-null float64
tip           220 non-null float64
sex           234 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          231 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
None
'''
```
> Method 1: handle missing values with dropna()
#### df.dropna() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)]

```python
tips_dropped = tips_nan.dropna()
tips_dropped.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 243
Data columns (total 7 columns):
total_bill    147 non-null float64
tip           147 non-null float64
sex           147 non-null object
smoker        147 non-null object
day           147 non-null object
time          147 non-null object
size          147 non-null float64
dtypes: float64(3), object(4)
memory usage: 9.2+ KB
'''
```
> Method 2: fill missing values with fillna()
#### df[columns].fillna() [[doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)]
* Fill with provided value
* Use a summary statistic

```python
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           220 non-null float64
sex           244 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
'''
```
* Note: in pandas, double square brackets select multiple columns, as in tips_nan[['total_bill', 'size']] above
> Method 3: fill with a test statistic
* Careful when using test statistics to fill
* Have to make sure the value you are filling in makes sense
* Median is a better statistic in the presence of outliers

```python
mean_value = tips_nan['tip'].mean()
print(mean_value)
#[Out]:2.964681818181819
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
tips_nan.info()
#[Out]:
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        229 non-null object
day           243 non-null object
time          227 non-null object
size          244 non-null float64
dtypes: float64(3), object(4)
memory usage: 13.4+ KB
'''
```

## [4-5] Testing with asserts
### 1. Assert statements
#### The assert statement [[doc](https://docs.python.org/3/reference/simple_stmts.html#grammar-token-assert-stmt)]
> Assert: a convenient way to insert debugging checks (assertions) into a program.
* Programmatically vs visually checking
* If we drop or fill NaNs, we expect 0 missing values
* We can write an assert statement to verify this
* We can detect early warnings and errors
* This gives us confidence that our code is running correctly

```python
assert 1 == 1
assert 1 == 2
#[Out]:
'''
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-65-a810b3a4aded> in <module>()
----> 1 assert 1 == 2
AssertionError:
'''
```

### 2. Google stock data
![](https://i.imgur.com/hyEXwc8.png)
> Test column

```python
assert google.Close.notnull().all()
#[Out]:
'''
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-49-eec77130a77f> in <module>()
----> 1 assert google.Close.notnull().all()
AssertionError:
'''
```
* notnull() tests every value in the Close column for being non-null.
* all() combines the results into a single boolean, which is True only if every value passed.
* assert then checks that boolean and raises AssertionError if it is False.

```python
google_0 = google.fillna(value=0)
assert google_0.Close.notnull().all()
```
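
The fill-then-assert check above can be tried end to end without the course data. The following is only a minimal sketch: the small `google` DataFrame and the helper `has_no_missing` are hypothetical stand-ins invented for illustration, not the course's dataset or any pandas API.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `google` stock table used in the notes.
google = pd.DataFrame({
    "Open": [100.0, 101.5, np.nan],
    "Close": [100.5, np.nan, 102.0],
})


def has_no_missing(frame: pd.DataFrame) -> bool:
    """Illustrative helper: True when no cell in `frame` is missing."""
    # First all() reduces each column, second all() reduces across columns.
    return bool(frame.notnull().all().all())


# Count missing cells before cleaning; an assert at this point would fail.
missing_before = int(google.isnull().sum().sum())

# Fill every missing value with 0, as in the notes.
google_0 = google.fillna(value=0)

# After filling, both the whole-table and the single-column checks pass silently.
assert has_no_missing(google_0), "unexpected missing values after fillna"
assert google_0["Close"].notnull().all()
```

Giving the assert a message, as in the first check, may make a failure easier to trace when several checks run in the same script.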