sklearn Pipeline SimpleImputer Preprocessing
===
王辰禎, DCT, NTCU (Taiwan)

---
###### tags: `Pipeline`

**Data Science and Problem Solving, Week 05 Homework (3/21)**

[Colab code](https://colab.research.google.com/drive/1uzBirqz8ZffiVMq9fE7QM8QvDGGcG5lZ?usp=sharing)

---
#### **If you use Python IDLE from python.org, use CMD and pip to install the scikit-learn package.**
1. Open CMD (Command Prompt).
2. Enter `pip install scikit-learn`.

![Screenshot 2024-04-06 212806](https://hackmd.io/_uploads/B1aMDqkgC.png)

---
#### Main text: import the packages (the same way you would import, e.g., the random or math package)
##### scikit-learn is organized into six major parts, so imports take the form `from sklearn.[module] import [component]`. Note that the import name `sklearn` is all lowercase.
![image](https://hackmd.io/_uploads/B1ANg6qxxg.png)

---
#### Load the assignment data into a DataFrame
![image](https://hackmd.io/_uploads/SyYSxTcell.png)

---
#### Build the pipeline, defining each step's name and its corresponding action
![image](https://hackmd.io/_uploads/BkELeT5xle.png)

`SimpleImputer` handles missing values: `strategy` can be `'median'`, `'mean'`, or `'most_frequent'` (the mode, i.e. the value that occurs most often).

`MinMaxScaler()` performs min-max scaling: the column minimum maps to 0 and the maximum maps to 1, so the data are rescaled into the range 0 to 1.

*Common preprocessing scalers in sklearn: `StandardScaler`, `MinMaxScaler`, `MaxAbsScaler`, and `RobustScaler`.

---
#### Apply the pipeline to the numeric columns
##### First select the numeric columns and name them `numeric_features`.
##### Then assign back to the numeric part of `df` the result of applying the pipeline (`.fit_transform()`) to that same numeric part.
![image](https://hackmd.io/_uploads/ByUPe6celg.png)

---
#### Print the DataFrame
##### The values are rounded here only because long decimals are hard to read. When actually processing data, do not round arbitrarily, since rounding loses precision.
![image](https://hackmd.io/_uploads/H1y_eaqgee.png)

---
#### References:
1. [(scikit-learn.org) sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
2. [(scikit-learn.org) sklearn.preprocessing](https://scikit-learn.org/stable/api/sklearn.preprocessing.html)
   * [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
3. [(iT邦幫忙) [Day 5] Data Cleaning & Preprocessing_10程式中 (2020)](https://ithelp.ithome.com.tw/articles/10240494)
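
---
#### Sketch: the whole workflow in code
The steps in this note (import, load a DataFrame, build the pipeline, apply it to the numeric columns, print) appear only as screenshots, so the block below is a minimal sketch of the same flow, not the original homework code. The sample data, the column names (`name`, `height`, `weight`), the step names (`imputer`, `scaler`), and the choice of `strategy="median"` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data with missing values (not the assignment's data).
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height": [150.0, 160.0, np.nan, 170.0],
    "weight": [40.0, np.nan, 60.0, 80.0],
})

# Each step is a (name, transformer) pair; the step names are arbitrary labels.
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scaler", MinMaxScaler()),                     # rescale each column to 0..1
])

# Select only the numeric columns; the text column "name" is left untouched.
numeric_features = df.select_dtypes(include="number").columns

# Fit the pipeline on the numeric part and write the transformed values back.
df[numeric_features] = pipeline.fit_transform(df[numeric_features])

# Round only for display; df itself keeps the full-precision values.
print(df.round(2))
```

With this sample data the missing `height` is filled with the median 160 and the missing `weight` with 60, and both land at about 0.5 after scaling. Calling `df.round(2)` inside `print` returns a rounded copy, so the stored values are not altered.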