# 機器學習
###### tags: `數據分析` `機器學習`

## 機器學習概論

### 依照輸入分類

#### Supervised Learning
* 輸入資料有答案(label)

#### Unsupervised Learning
* 輸入資料沒答案(label)
* Cluster unlabelled data
* 讓電腦把相似的資料分在一起,相異的放遠一點

#### Semi-supervised Learning
* 部分資料有答案,部分資料沒答案

#### Reinforcement Learning
* 給它沒答案的資料
* 如果答對給予正回饋,答錯就給負回饋

### 依照輸出分類
![](https://i.imgur.com/N589xjK.png)

#### Regression
* 預測的東西是一個連續的數值
* ex:
    * 股價預測
    * 職棒比賽,確切比數
* 預測精確數值,難度較高

#### Classification
![](https://i.imgur.com/xP3zCmi.png)
* 預測的東西是一個類別
* ex:
    * 股價漲或跌的預測
    * 職棒比賽,誰贏誰輸
* 不需要精確預測,難度較簡單

### Machine learning workflow
![](https://i.imgur.com/43WJTP7.png)

#### 資料輸入

#### 特徵工程

##### 資料缺失值
* 數字型的
    * 整筆資料不要用
    * 填平均值或中位數
* 類別型的
    * 取眾數
    * 設立一個其他類別

##### 極端值
* 刪除極端值

##### Split data
* Training data
    * 訓練資料
    * 通常80%
    * 通常大於testing data
* Testing data
    * 測試資料
    * 通常20%
    * 通常小於training data

##### Normalization
![](https://i.imgur.com/QeZI0Rz.png)
* 欄位數量級差太多時,學到的參數的數量級也會相差非常大
* 讓每個欄位的值壓到0跟1之間,或轉成平均0、標準差1
* 常用方法
    * min_max
        * [0, 1]
        * $\dfrac{x_i - x_{min}}{x_{max} - x_{min}}$
    * Z-Score Standardization
        * 平均0、標準差1(多數值會落在-1跟1附近)
        * $\dfrac{x_i - \mu}{\sigma}$

#### Select model
![](https://i.imgur.com/YOQytfe.png)

##### Validate trained model - regression
* 拿來評量模型表現好不好

###### MSE
* $MSE = \dfrac{1}{n}\sum\limits_{i = 1}^n{(f_i - y_i)^2}$
    * $f_i$: 預測值
    * $y_i$: 實際答案
    * $n$: 數值數量
* 值越小,代表誤差越小

###### R squared(coefficient of determination)
* $R^2 = 1 - \dfrac{\sum\limits_{i = 1}^n{(y_i - f_i)^2}}{\sum\limits_{i = 1}^n{(y_i - \bar{y})^2}}$
    * $y_i$: 實際答案
    * $f_i$: 預測值
    * $\bar{y}$: 答案的平均
* 比較預測誤差跟答案本身的離散程度,衡量模型解釋了多少變異
* 越接近1代表表現越好,值越小表現越爛

##### Validate trained model - classification

###### $accuracy = \dfrac{\text{\# of correct predictions}}{\text{\# of data}}$
![](https://i.imgur.com/R55VtfB.png)

###### $accuracy = \dfrac{TP + TN}{TP + FP + FN + TN}$
![](https://i.imgur.com/JTKUISt.png)
* True Positive (TP)「真陽性」:真實情況是「有」,模型說「有」的個數。
* True Negative(TN)「真陰性」:真實情況是「沒有」,模型說「沒有」的個數。
* False Positive (FP)「偽陽性」:真實情況是「沒有」,模型說「有」的個數。
* False Negative(FN)「偽陰性」:真實情況是「有」,模型說「沒有」的個數。
* $Precision = \dfrac{TP}{TP + FP}$
* $Recall = \dfrac{TP}{TP + FN}$

###### F1-score
* $F1 = 2 * \dfrac{1}{\dfrac{1}{recall} + \dfrac{1}{precision}} = 2 * \dfrac{precision * recall}{precision + recall}$
* 越接近1表現越好,越接近0越差

###### ROC
* 針對不同的門檻值去畫出ROC curve
* AUC(area under curve)
    * AUC=0.5 (no discrimination 無鑑別力)
    * 0.7≦AUC≦0.8 (acceptable discrimination 可接受的鑑別力)
    * 0.8≦AUC≦0.9 (excellent discrimination 優良的鑑別力)
    * 0.9≦AUC≦1.0 (outstanding discrimination 極佳的鑑別力)

###### Confusion matrix
![](https://i.imgur.com/lU9P2oi.png)
* 把二元分類推廣到多元分類
* 對角線上是猜對的

##### Bias-Variance Tradeoff
![](https://i.imgur.com/D34ny7m.png)
* Variance : 預測穩不穩定(分散程度)
* Bias : 預測偏不偏,是否集中在正確答案附近

![](https://i.imgur.com/JHOkdaF.png)
* 選擇的模型太簡單(變數太少之類的),會落到high bias的區域
* 選擇的模型太複雜,會落到high variance的區域,會不穩定
    * overfitting

##### No free lunch theorem
![](https://i.imgur.com/gCZkIbJ.png)
* 沒有任何一個演算法可以勝過其他所有的演算法
* 根據不同問題有不同的解法

#### 輸出

### 應用領域
* 自動駕駛
* 支付
    * 人臉辨識支付
* 醫療應用
    * 腫瘤偵測
* 機器人理財
* 無人機
    * 自動避障
* 預測推銷
    * youtube推薦演算法
* 智慧工廠
* 語音智慧助理
    * siri

## Regression Supervised Learning

### Linear Regression
* 依現有的data,找到一條趨勢線,預測未知的data
* 線性解

#### Case

##### Suppose 房價只跟屋齡有關
![](https://i.imgur.com/SQhNMib.png)

##### 可以依照data畫出多條趨勢線
![](https://i.imgur.com/9vePYG8.png)

##### 利用cost function找到最適合(誤差最小)的趨勢線
$Cost = \sum\limits_{i = 1}^m{(y_i - \hat{y}_i)^2}$,其中 $\hat{y}_i$ 為趨勢線的預測值
![](https://i.imgur.com/qmAoam6.png)

##### 微積分求解
![](https://i.imgur.com/QOhpvo4.png)

#### Multivariate Linear Regression Models
![](https://i.imgur.com/CqqWjA3.png)
![](https://i.imgur.com/WrnknOZ.png)
* 多變量時變數太多,難以直接用公式求解
* 通常用gradient descent(下方附一個簡單的數值示意)

##### Gradient Descent
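* 補充示意:下面是一個最小的批次梯度下降實作(筆記自行補充,非課程原始碼;假設cost為MSE,learning rate與迭代次數都是自訂參數),實務上可以直接用sklearn的`LinearRegression`或`SGDRegressor`,這段只是把「沿著梯度反方向更新參數」的概念寫出來
```python=
import numpy as np

# 示意用的批次梯度下降(假設cost為MSE)
def gradient_descent(X, y, lr=0.01, n_iters=5000):
    m, n = X.shape
    X_b = np.hstack([np.ones((m, 1)), X])  # 加上截距項 x0 = 1
    theta = np.zeros(n + 1)                # 參數初始化為0
    for _ in range(n_iters):
        error = X_b @ theta - y            # 預測值 - 實際值
        grad = (2 / m) * X_b.T @ error     # MSE對theta的梯度
        theta -= lr * grad                 # 往梯度反方向更新參數
    return theta

# 小測試:y = 1 + 2x,理論上會學到 theta 約為 [1, 2]
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 1 + 2 * X.ravel()
print(gradient_descent(X, y))
```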
![](https://i.imgur.com/aZgSZav.png) * 只能找區域最小值 * learning rate太小會算很久,太大找不到最佳值,通常用0.1 到 1 之間 ![](https://i.imgur.com/xIvJGHN.png) ##### Gradient Descent V.S. Normal Equation ![](https://i.imgur.com/b4SzMwG.png) * Gradient Descent * learning rate要剛剛好 * 要迭代很多次 * n大時會比較好(資料量大) * Normal Equation * 不用選learning rate * 要計算矩陣逆運算,較麻煩 * n大時算很慢(因為矩陣逆運算) ##### Overfitting * 最左(underfitting) * 無法找到理想趨勢 * 中間(理想狀況) * 最右(overfitting) * 在train data上fit的很好,但有新東西就gg了 * 像死讀書,刷考古題的學生 * 變數越多越容易overfitting ![](https://i.imgur.com/rwBfkAg.png) #### Code ```python= from sklearn import preprocessing, linear_model from sklearn.metrics import mean_squared_error, r2_score from sklearn.model_selection import train_test_split import numpy as np import pandas as pd # 資料輸入 df = pd.read_csv('./dataset/housing.csv', header = None, delim_whitespace=True) # 答案取出 y = df[13] x = df.drop(13, axis = 1) # Split data x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1) # Normalization scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) # Model Select model = linear_model.LinearRegression() model.fit(x_train, y_train) # Predict y_pred = model.predict(x_test) print('Cofficient : {}'.format(model.coef_)) print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred))) print('Variance score : {}'.format(r2_score(y_test, y_pred))) ``` ``` Cofficient : [-1.02554944 0.96553896 0.16729159 0.58076865 -2.02969655 2.57536082 0.17046044 -2.84987298 2.50778431 -1.85852862 -2.05829633 0.82864609 -3.86813384] Mean squared error : 34.403276579602064 Variance score : 0.6739078917414478 ``` ### Polynomial Regression * 非線性解 * 低維度變高維度 #### nth-degree Polynomial Regression * n次多項式,當作多個feature ![](https://i.imgur.com/urCAN7u.png) * 其他model ![](https://i.imgur.com/67Hy72H.png) #### Graph ![](https://i.imgur.com/jWV1dd9.png) #### 交叉項(cross term) * 兩變數有加減乘除的關係(多為乘法) ![](https://i.imgur.com/1YOzhWH.png) * 原本兩個feature,經由相乘或是次方,多出很多個feature ![](https://i.imgur.com/6HUg0xS.png) ```python= import numpy as np import pandas as pd from sklearn import linear_model, preprocessing from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error, r2_score from sklearn.preprocessing import PolynomialFeatures # 匯入檔案 df = pd.read_csv('./dataset/winequality-red.csv') # 處理answer and data y = df['quality'] x = df.drop('quality', axis = 1) # 產生degree 為 2 的feature poly = PolynomialFeatures(degree = 2).fit(x) x = poly.transform(x) # Split data x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1) # Normalization scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) # Select model model = linear_model.LinearRegression() model.fit(x_train, y_train) # Predict y_pred = model.predict(x_test) # 查看係數 print('The coefficient : {}\n'.format(model.coef_)) print('Mean squared error : {}'.format(mean_squared_error(y_test, y_pred))) print('Variance score : {}'.format(r2_score(y_test, y_pred))) ``` ``` The coefficient : [-2.20439093e-11 -3.32036477e+01 -3.46762494e+01 -1.98569253e+01 -1.40652656e+01 -6.85273381e+01 -7.74114997e+01 1.00727869e+02 -2.50880009e+01 -7.93478035e+01 4.63945174e+01 -6.70038809e+00 -1.07912073e+00 -4.28103408e-01 -1.80118094e-01 -5.27741073e-01 -9.30638006e-01 -6.23191815e-01 5.34638897e-01 3.69394013e+01 -1.63849368e+00 6.25798078e-01 6.86240624e-02 -1.07543978e-01 3.25708843e-02 -6.49414077e-02 2.15989372e-01 
-5.66509424e-02 3.43575308e-01 3.51401508e+01 -1.05497188e+00 -2.55960941e-01 8.98178254e-01 -3.09937110e-02 9.34787505e-02 8.98840857e-02 1.48808583e-01 -3.20881226e-02 2.21970228e+01 -3.13608092e+00 -3.12173913e-01 1.11046257e+00 -1.59627758e-01 -2.03532456e-01 3.43111110e-02 -6.51049711e-04 1.79109779e+01 -2.69409190e+00 1.29726359e-02 -3.65247752e-01 6.44755908e-02 -5.20131174e-02 -1.18159029e-01 6.92374227e+01 -5.56852862e-01 1.09112468e-01 5.56498217e-01 -1.63840021e-01 -3.70231215e-02 7.79292790e+01 -5.47020463e-01 -4.70828794e-01 1.29740195e+00 8.93119391e-02 -1.01689650e+02 1.28353602e+00 4.09945512e-01 -1.61154603e+00 2.10910693e+01 8.08256509e+01 -4.72908479e+01 4.26542501e+00 -2.92688018e+00 1.05074947e+00 2.71306226e+00 -3.52447248e-01 7.49136902e-02 -1.16178316e-01] Mean squared error : 0.4308785588449687 Variance score : 0.29206509289757054 ``` ### Logistic Regression * X : feature * $\theta$ : 參數 ![](https://i.imgur.com/G8PGmq1.png) #### Sigmoid function * Output is [0, 1] * model的輸出常被拿來當作機率 ![](https://i.imgur.com/H0DiBSV.png) #### Cross-entropy)(Cost Function)補) * 給電腦兩個機率分佈,分析兩個機率分佈相不相似 * 兩機率分佈幾乎一樣,算出來越小 $C(\theta_0, \theta_1, ... , \theta_n) = \dfrac{1}{2m} \sum\limits_{i = 1} ^ m{}$ #### Information(資訊含量) * pi : 發生某事的機率 * Define Information : $log({\dfrac{1}{pi}})$ * 當某事機率越大,資訊含量越低 * 太陽從東邊升起的機率是100 % * Information = $log({\dfrac{1}{1}}) = 0$ * 當某事機率越小,資訊含量越高 * 明天下雨的機率是50 % * Information = $log({\dfrac{1}{\dfrac{1}{2}}}) = 0.3$ #### Entropy V.S. Cross-entropy ##### Entropy * 資訊含量的期望值 * 資訊含量越不確定時越高,越確定時越低 ![](https://i.imgur.com/E0NRmby.png) ##### Cross-entropy * 輸入兩個機率分佈 * 衡量兩個機率分佈相不相近 ![](https://i.imgur.com/AAxFM59.png) #### Example ![](https://i.imgur.com/KMHIjkk.png) * Cross entropy 通常大於entropy ![](https://i.imgur.com/dOvjGCT.png) #### Cost Function(cross-entropy)(問號) * h 與 1 - h都是機率分佈 * y 跟 1 - y,非一即零, ![](https://i.imgur.com/QyZi1iX.png) #### Learning * Gradient descent ### Code ```python= import numpy as np import pandas as pd from sklearn import preprocessing, linear_model from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression pima = pd.read_csv('./dataset/pima-indians-diabetes.csv') #x = pima[['pregnant', 'insulin', 'bmi', 'age']] y = pima['label'] x = pima.drop(['label'], axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = LogisticRegression() model.fit(x_train, y_train) print(model.coef_) print(model.intercept_) y_pred = model.predict(x_test) print(y_pred) accuracy = model.score(x_test, y_test) print(accuracy) ``` ``` [[ 0.33498141 1.03029784 -0.30681406 -0.02318156 -0.07418398 0.67294487 0.2001099 0.20383277]] [-0.90342628] [0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0] 0.7835497835497836 ``` ## Classification Supervised Learning ### K-Nearest Neighbor * 可同時classification跟regression * 通常用來classification * 無變數的 * 只需要決定k值 * k : 有多少鄰近的類別 * k = 1 * 最近的一個是正方形,所以他是正方形類別 * k = 3 * 最近的三個是一個正方刑跟兩個三角形,三角形較多,故為三角形類別 * k = 7 * 
最近的七個是四個正方刑跟三個三角形,正方形較多,故為正方形類別 ![](https://i.imgur.com/fbfCRso.png) #### Step 1. Look at data * 把原本的資料做好分類 ![](https://i.imgur.com/0GLcNrl.png) 2. Caculate distances * 將每筆data跟所求data的距離求出來 ![](https://i.imgur.com/LcWHE8C.png) 3. Find neighbors * 找最近的k個資料 ![](https://i.imgur.com/77tBvJM.png) 4. Vote from labels * 最後做投票 * 看哪一個類型最多 ![](https://i.imgur.com/yetpiek.png) #### How to Define Distance ![](https://i.imgur.com/DQtJJp9.png) * Manhattan distance(L1 distance) ![](https://i.imgur.com/ysa4cWO.png) * Euclidean Distance(L2 distance) ![](https://i.imgur.com/HpQO3lY.png) #### How to choose K * 通常取奇數 * 必可以選出來,不會有相等的狀況 * K is small * 易受不好的資料影響 ![](https://i.imgur.com/jdZLnXa.png) * K is large * 不小心預測出多數的類別 * 永遠預測到同一個類別 ![](https://i.imgur.com/prMTU9i.png) #### Curse of dimensionality * 利用次方增維 ![](https://i.imgur.com/kr6pI1r.png) * 在高維度較容易分類 ![](https://i.imgur.com/RMGjZmB.png) * 如果太高維,會造成overfitting #### Code ```python= import pandas as pd import numpy as np from sklearn import preprocessing, datasets, neighbors from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score df = pd.read_csv('./dataset/seeds_dataset.csv', header = None) y = df[7] - 1 x = df.drop(7, axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = neighbors.KNeighborsClassifier(n_neighbors = 5) model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_samples = accuracy_score(y_test, y_pred, normalize = False) print('number of correct samples : {}'.format(num_correct_samples)) print('accuracy : {}'.format(accuracy)) ``` ``` number of correct samples : 39 accuracy : 0.9285714285714286 ``` ### Decision Tree * 把決策過程變成一棵樹的結構 ![](https://i.imgur.com/i3qzTEM.png) #### How to split on each node? ![](https://i.imgur.com/sB7ooB2.png) #### How to define a good split? 
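* 補充示意:先用一小段自己編的小例子,把「gain = 分割前 - 分割後」量化一次(筆記自行補充,這裡假設用下面CART小節會介紹的Gini impurity當指標)
```python=
from collections import Counter

def gini(labels):
    # Gini impurity = 1 - sum(p_k^2),越小代表這個節點越「純」
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(parent, children):
    # gain = 分割前的Gini - 分割後(加權平均)的Gini,越大代表分割越好
    n = len(parent)
    after = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - after

# 假設10筆資料,5個Yes、5個No,用兩種特徵各切成兩堆
parent = ['Yes'] * 5 + ['No'] * 5
split_A = [['Yes', 'Yes', 'Yes', 'No', 'No'], ['Yes', 'Yes', 'No', 'No', 'No']]  # 分得不乾淨
split_B = [['Yes', 'Yes', 'Yes', 'Yes', 'No'], ['Yes', 'No', 'No', 'No', 'No']]  # 分得比較乾淨
print(gini_gain(parent, split_A))  # ≈ 0.02
print(gini_gain(parent, split_B))  # ≈ 0.18,gain較大,B是比較好的分割
```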
* 決策效果極明顯的,比如如果A就一定Yes,反之則No,則A就是一個好的node * gain = 分割前 - 分割後 * gain越大越好 #### CART * Classification and Regression Trees(CART) * Binary tree * Gini為分割良好指標 ![](https://i.imgur.com/pqiPexP.png) ##### Gini Impurity * 公式 ![](https://i.imgur.com/O1eExNH.png) * 越勢均力敵時,gini越大,反之,gini越小 * gini越小越好 ![](https://i.imgur.com/64rVGmP.png) ![](https://i.imgur.com/j7AeaRS.png) ##### Exampl_1 * 分割後,求出gini的期望值 ![](https://i.imgur.com/CtVkR8c.png) * A * 0.5 - 0.4852 = 0.015 * B * 0.5 - 0.37 = 0.13 * B較佳 ![](https://i.imgur.com/yJ3RmWN.png) ##### Example_2 * 遇到數值型的 * 取平均當分界線(在這邊是80) ![](https://i.imgur.com/v6jfRiP.png) ![](https://i.imgur.com/TDdH5HL.png) #### ID3 * Iterative Dichotomiser 3(ID3) * 可以有多個分支 ![](https://i.imgur.com/blq6Xac.png) ##### Entropy * 公式 ![](https://i.imgur.com/b1rWiaI.png) * 越勢均力敵,entropy越大,反之,entropy越小 * Entropy 越小越好 ![](https://i.imgur.com/dYBiBYF.png) ![](https://i.imgur.com/8c2sST3.png) ##### Example_1 * 分割後,求出entropy的期望值 ![](https://i.imgur.com/rrf9zCE.png) * A * 0.301 - 0.294 = 0.007 * B * 0.301 - 0.242 = 0.069 * B較佳 ![](https://i.imgur.com/RFKTTjs.png) ##### Example_2 * Before splitting ![](https://i.imgur.com/Mz4ziQp.png) * After splitting ![](https://i.imgur.com/8TmtN4y.png) * 對所有欄位都split看看,取最好的 ![](https://i.imgur.com/tmPhc8B.png) * Overcast 的狀況都會是yes,所以就不需要決策node了 ![](https://i.imgur.com/r7rxeAF.png) * 重複前面的動作,找出其他的節點 ![](https://i.imgur.com/B0pbAUx.png) * 完成 ![](https://i.imgur.com/5hyNXAB.png) ##### 優點 * 寫成ifelse,可解釋性高 ##### Step ![](https://i.imgur.com/AzuQNRg.png) ##### Pruning * 為了分類,可能會overfitting,決策樹長很深 * Pre-prunung * 設一些條件,如果達到條件就不往下長 * 如果node所含的資料少於特定大小,就不繼續往下長 * Post-prunung * 長完後,取自己想要的部分 #### Code(CART) ```python= import pandas as pd import numpy as np from sklearn import preprocessing, neighbors from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, confusion_matrix from sklearn.tree import DecisionTreeClassifier df = pd.read_csv('./dataset/abalone.csv', header = None) Mean = np.mean(df[8]) df[0] = pd.Categorical(df[0]).codes df[8] = df[8].apply(lambda x : 0 if x > Mean else 1) y = df[8] x = df.drop(8, axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = DecisionTreeClassifier() model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_samples = accuracy_score(y_test, y_pred, normalize = False) con_matrix = confusion_matrix(y_test, y_pred) print('number of correct sample : {}'.format(num_correct_samples)) print('accuracy : {}'.format(accuracy)) print('con_matrix : {}'.format(con_matrix)) ``` ``` number of correct sample : 595 accuracy : 0.7117224880382775 con_matrix : [[318 118] [123 277]] ``` ### Naive Bayes * 機率分類器 * 條件機率 * 資料量大時很適合用Naive Bayes * 特徵越多會呈線性成長 * 文本分類 #### How Naive Bayes Classifier Work * 看哪邊機率大,就屬於那一邊 ![](https://i.imgur.com/X59Vvz4.png) * 轉成數學式 ![](https://i.imgur.com/j136N4A.png) * 簡化後 * 看哪邊機率比較大,就屬於哪一類 ![](https://i.imgur.com/6fvhfNO.png) #### Example * 想知道句子是否跟運動有關 ![](https://i.imgur.com/SPylVxK.png) * 想知道"a very close game"是否跟運動有關 ![](https://i.imgur.com/XsT8MB4.png) * 推導後 ![](https://i.imgur.com/ONfKSnR.png) ##### How to calculate term? 
* $p(game|Sports) = \dfrac{2}{11}$ * 11 : 在sports的條件下,有十一個字 * 2 : 在sports的條件下,出現過2次games * 竟量不要有0的狀況,因為這樣機率直接變零了 ![](https://i.imgur.com/PgMAXUl.png) ##### Laplace smoothing * 為了避免出現機率0的狀況,導致條件機率為零,導致無意義的狀況發生 * 分母加整個文本有多少不同的字 * 分子加一 * 經過Laplace smoothing的算法後,就稱之為Multinomial Naive Bayes ![](https://i.imgur.com/FVXfW93.png) * 開算囉 ![](https://i.imgur.com/fYAB0SA.png) * 帶入條件機率的式子 * sports > not sports,所以猜他是sports類別 ![](https://i.imgur.com/RH92aCc.png) #### Different probability assumption * 假設機率都是高斯的常態分佈 ![](https://i.imgur.com/uMTWAjR.png) ##### Example * 利用一些體態,預測是男生還是女生 ![](https://i.imgur.com/YhHOk3K.png) * 算平均與標準差 ![](https://i.imgur.com/a8SovYP.png) * 下大於上,故猜測是female ![](https://i.imgur.com/yGOIwnD.png) #### Code ```python= import numpy as np import pandas as pd import time from sklearn import preprocessing from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB from sklearn.metrics import accuracy_score, confusion_matrix df = pd.read_csv("./dataset/titanic/train.csv") df['Sex'] = pd.Categorical(df['Sex']).codes df['Embarked'] = pd.Categorical(df['Embarked']).codes df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1) df = df.dropna(axis = 0, how = 'any') y = df['Survived'] x = df.drop('Survived', axis = 1) x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3) model = GaussianNB() model.fit(x_train, y_train) y_pred = model.predict(x_test) print('Number of misabeled points out of total {} points : {}, performance {:05.2f}%' .format( x_test.shape[0], (y_test != y_pred).sum(), 100 * (1 - y_test != y_pred).sum() / x_test.shape[0]) ) accuracy = accuracy_score(y_test, y_pred) num_correct_sample = accuracy_score(y_test, y_pred, normalize = False) print('number of correct sample : {}'.format(num_correct_sample)) print('accuracy : {}'.format(accuracy)) ``` ``` Number of misabeled points out of total 215 points : 55, performance 74.42% number of correct sample : 160 accuracy : 0.7441860465116279 ``` ### Random Forests * ensemble learning * 把多個分類器並在一起 * 建立多棵dicision tree * 看哪一個類別多,他就是那個類別 ![](https://i.imgur.com/bsc1VVt.png) #### Example * random 選取N個row,得到許多sub_data ![](https://i.imgur.com/2xZQuiM.png) * 針對每個data,在隨機選取k個特徵 ![](https://i.imgur.com/SYQ3Zog.png) * 將這k個特徵,拿去做dicision tree(CART) ![](https://i.imgur.com/nIDgrwT.png) * 每筆sub_data都這樣做,會得到很多dicision tree ![](https://i.imgur.com/m34aa6l.png) * 接著看這n棵樹,判斷哪一個類別比較多,他就是那個類別 #### Code ```python= import pandas as pd import numpy as np from sklearn import preprocessing, neighbors, datasets from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, confusion_matrix from sklearn.ensemble import RandomForestClassifier df = pd.read_csv('./dataset/seeds_dataset.csv', header = None) y = df[7] x = df.drop(7, axis = 1).copy() x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = RandomForestClassifier(max_depth = 6, n_estimators =15) model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_sample = accuracy_score(y_test, y_pred, normalize = False) print('number_correct_sample : {}'.format(num_correct_sample)) print('accuracy : {}'.format(accuracy)) ``` ``` number_correct_sample : 39 accuracy : 0.9285714285714286 ``` ### Support Vector Machine * 找到一個超平面,讓資料間margin最大 * H3最好,離兩個資料群最近的點最遠 
![](https://i.imgur.com/uPW7T07.png) * supprot vector * 兩個類別最近的點 * support hyperlane * 通過兩個類別最近的點的平面 * margin * 兩條supprt hyperlane的距離 ![](https://i.imgur.com/svdVDaH.png) * xi : 資料特徵 * yi : 類別{+1, -1} ![](https://i.imgur.com/w0kdU3O.png) ![](https://i.imgur.com/ewqOJ4e.png) ![](https://i.imgur.com/5nQIOwo.png) ![](https://i.imgur.com/0MjPZiX.png) * 在分母很難處理,把它變到分子,求最大值改成求最小值 ![](https://i.imgur.com/Mz3KtjT.png) * Hard cost * 百分之百不允許有人越界 * 一定會找到一個平面把兩個類別分開來 * soft cost * 允許有人越界 * $\epsilon_i$ : 越界多少,越界的程度 * C : 懲罰程度 * 越大代表越不允許越界 * 越小代表越允許越界 ![](https://i.imgur.com/TlHhggt.png) #### Linear v.s. nonlinear problems ![](https://i.imgur.com/uSti0Mk.png) ##### SVM Kernel Trick * 用某種函數把維度提升 * 提高維度就有可能分類 ![](https://i.imgur.com/o2JoYHz.png) ##### Example * 二維變三維 ![](https://i.imgur.com/wNd5zsI.png) * 另外一個最佳化問題比較好算,才返回去算原來的最佳化問題 ![](https://i.imgur.com/ESgH58P.png) ##### Common Kernel in SVM ![](https://i.imgur.com/CZMhWQO.png) #### Multi-class in SVM * 如果有kclass * Method 1 :one-against-rest * 每一個svm判斷是不是該類別 * 產生k個二元svm來做分類 ![](https://i.imgur.com/uE73hCQ.png) * Method 2 :one-against-one * 每一個svm判斷是第m個還是第n個類別 * 共會有$\dfrac{k(k - 1)}{2}$種 ![](https://i.imgur.com/TGS3FT4.png #### Code ```python= import pandas as pd import numpy as np from sklearn import preprocessing, datasets, linear_model from sklearn.model_selection import train_test_split from sklearn.svm import SVC from sklearn.metrics import accuracy_score, confusion_matrix pima = pd.read_csv('./dataset/pima-indians-diabetes.csv') df=pima[['pregnant', 'insulin', 'bmi', 'age', 'label']] x=df[['pregnant', 'insulin', 'bmi', 'age']] y=df['label'] x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1) scaler = preprocessing.StandardScaler().fit(x_train) x_train = scaler.transform(x_train) x_test = scaler.transform(x_test) model = SVC(kernel = 'rbf') model.fit(x_train, y_train) y_pred = model.predict(x_test) accuracy = accuracy_score(y_test, y_pred) num_correct_sample = accuracy_score(y_test, y_pred, normalize = False) print('num_correct_sample : {}'.format(num_correct_sample)) print('accuracy : {}'.format(accuracy)) ``` ``` num_correct_sample : 160 accuracy : 0.6926406926406926 ``` ## Classification Unsupervised Learning ### K-means * k : 把這些資料分成k個類別(因為unsupervised沒類別的概念,所以要自己分類) * 1. * 隨便取k比資料,分成k類 * 再將相近的資料,歸類為該類別 * 2. * 算出新的重心 * 3. * 找出同類別的資料的重心 * 4. * 再重新用重心分類一次 ![](https://i.imgur.com/OZ0hPaj.png) * 5. 不斷repeat上面四個步驟,直到收斂不改變 ![](https://i.imgur.com/ozY5B9j.png) ```python= import numpy as np import pandas as pd from sklearn.cluster import KMeans data = pd.read_csv('./dataset/xclara.csv') print("Input Data and Shape") print(data.shape) f1 = data['V1'].values f2 = data['V2'].values x = np.array(list(zip(f1, f2))) model = KMeans(n_clusters = 3) model = model.fit(x) labels = model.predict(x) centroids = model.cluster_centers_ print('centroids : {}'.format(centroids)) print('prediction on each data : {}'.format(labels)) labels = model.predict(np.array([[12.0, 14.0]])) print('prediction on data point (12.0, 14.0) : {}'.format(labels)) ``` ``` Input Data and Shape (3000, 2) centroids : [[ 69.92418447 -10.11964119] [ 40.68362784 59.71589274] [ 9.4780459 10.686052 ]] prediction on each data : [2 2 2 ... 
0 0 0] prediction on data point (12.0, 14.0) : [2] ``` ### DBSCAN * Density-based spatial clustering of applications with noise * 以密度來衡量哪個資料屬於哪個類別 * 優點:他很能抵抗noise data * 藍色區域密度大被歸為一類,紅色同理 ![](https://i.imgur.com/eeLbYLP.png) #### Terminlogy in DBSCAN * Density : 密度 * 指定某半徑,能匡到的資料數目即為密度 * core point : 核心點 * 定義了某半徑,以某資料點為圓心,做一個圓,匡到了一些資料點,如果超過我所設定最少量的資料點,就稱為core point * border point : * 定義了某半徑,以某資料點為圓心,做一個圓,匡到了一些資料點,如果少於我所設定最少量的資料點,但其中有匡到core point,就稱為border point * noise point : * 非core point 且非 border point ![](https://i.imgur.com/UwgYOgD.png) #### Directly Reachable * 假設有點p跟點q,兩點間連一條直線,經過的點不是core point就是border point,就稱為directly reachable * core point及附近的border point會被歸為同一類 ![](https://i.imgur.com/QpCbKnu.png) #### Example ![](https://i.imgur.com/fZdhYF7.png) * Directly reachable會被歸為一類 ![](https://i.imgur.com/yFSYauN.png) #### problem * 半徑沒設好會有一些問題 ![](https://i.imgur.com/Sajpam8.png) #### Code ```python= import numpy as np import pandas as pd from sklearn import preprocessing, metrics from sklearn.cluster import DBSCAN from sklearn.datasets.samples_generator import make_blobs df = pd.read_csv('./dataset/iris.csv', header = None) df = df.drop(4, axis = 1) print("Input Data and Shape") print(df.head()) print(df.shape) x = np.array(df) model = DBSCAN(eps = 0.3, min_samples = 5).fit(x) labels = model.labels_ n_clusters = len(set(labels)) - (1 if -1 in labels else 0) print('number of clusters : {}'.format(n_clusters)) print('cluster on x {}'.format(labels)) ``` ``` Input Data and Shape 0 1 2 3 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 (150, 4) number of clusters : 3 cluster on x [ 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 -1 -1 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 0 0 -1 -1 -1 0 0 -1 0 0 0 0 -1 0 0 -1 0 0 0 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 1 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 1 1 -1 -1 1 -1 1 1 1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 2] ``` ### EM * Expectation-Maximization * 藉由迭代,找到機率模型裡的參數 #### Maximum Likelihood Estimation * 給予一個機率分佈,想要找出一些參數 ![](https://i.imgur.com/IzmbX5h.png) * 數字overflow,所以取log處理 ![](https://i.imgur.com/DAGbQ71.png) * 估計整棵樹蘋果的平均重量 ![](https://i.imgur.com/QpW8LlK.png) * 將值帶入 ![](https://i.imgur.com/fv5PifS.png) * 取log ![](https://i.imgur.com/IAdp3CG.png) * 母體平均大約等於隨機抽樣的平均 ![](https://i.imgur.com/HJi8BSF.png) #### Likelihood Estimation Table ![](https://i.imgur.com/kX4OSXp.png) ![](https://i.imgur.com/APOIcjE.png) #### Example * 欲將資料分為兩個類別 ![](https://i.imgur.com/sXAYziq.png) * 隨機另兩個高斯曲線 ![](https://i.imgur.com/RwmR3eV.png) ##### E step * 一剛開始的狀況,因為是隨便給的,所以分得很爛 * 計算資料隸屬於黃色的比例跟隸屬於藍色的比例 * 最左邊的資料 : 70 % 是黃色類別,30 % 是藍色類別 * 左二 : 40 % 是黃色類別,60 % 是藍色類別 * 左三 : 10 % 是黃色類別,90 % 是藍色類別 ![](https://i.imgur.com/AR7kcam.png) ##### M step * 去修正機率模型的參數(平均跟標準差) * 計算likelyhood,更新平均跟標準差,畫出新的高斯分佈 ![](https://i.imgur.com/aLgUbtF.png) ##### 完成 ![](https://i.imgur.com/zZujfwg.png) #### 缺點 * 資料可能不是高斯常態分佈 * 可以使用其他分佈 ### Different cluster method ![](https://i.imgur.com/YHupseY.png) ## Dimension Reduction * unsupervised * 讓資料壓縮 * reduce time complexity * reduce space complexity * 增加可視化程度 ### Illustraion ![](https://i.imgur.com/tx5xMbn.png) ![](https://i.imgur.com/WNrivPf.png) #### Code ```python= import pandas as pd import numpy as np from sklearn import metrics, mixture, preprocessing df = pd.read_csv('./dataset/iris.csv', header = None) df = df.drop(4, axis = 1) print('Input Date and Shape') print(df.head()) x = 
np.array(df) model = mixture.GaussianMixture(n_components = 3).fit(x) x_pred = model.predict(x) print(x_pred) ``` ``` Input Date and Shape 0 1 2 3 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 0 2 2 2 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] ``` ### SVD * singular-value decomposition * 矩陣可拆分 ![](https://i.imgur.com/Eq6Ybkn.png) #### Matrix rank * 第一個row剪掉第二個row會等於第三個row * 代表第三個row是多餘的,代表線性獨立的row只有兩個 ![](https://i.imgur.com/o04RhMT.png) #### Example_1 * 聽不懂啦幹 ![](https://i.imgur.com/hNE9axu.png) ![](https://i.imgur.com/tmS7cSM.png) ![](https://i.imgur.com/N9RHfBn.png) * v : 選轉矩陣 * $\sigma$ : 壓扁 * U : 選轉矩陣 ![](https://i.imgur.com/nDnox93.png) #### Example_2 * 對電影的評分 ![](https://i.imgur.com/O2qCIB9.png) ![](https://i.imgur.com/oWeol1A.png) * 把數值最小的row及相對應的row或column砍掉 ![](https://i.imgur.com/FUnTIfZ.png) ![](https://i.imgur.com/U0Avmx5.png) * 相似於原始矩陣 ![](https://i.imgur.com/LVj58Bx.png) * 留下V * 把五維轉成二維 ![](https://i.imgur.com/aRzlaiG.png) #### How many singular values to keep? * 盡量保持80%以上的資料 ### PCA * principal component analysis * 把所有資料投影在垂直座標軸上 ![](https://i.imgur.com/2pTSjvr.png) #### Example ![](https://i.imgur.com/SLhqoLN.png) #### Which projection axis is better * 投影在紅線上好,還是綠線上好 * 投影在紅線上好,離散程度較高 * 較不會有兩點重疊 ![](https://i.imgur.com/A67qq5v.png) * 投影在w1軸上,離散程度要最大 ![](https://i.imgur.com/i8qtkeq.png) * 投影在w2軸上,離散程度盡量大 * 避免跟第一個軸一樣,所以兩軸要垂直 ![](https://i.imgur.com/ApKaUdn.png) #### PCA concept * 找一個軸,離散程度最大的 * 再找下一個軸,並確保跟其他軸內積相成等於零 ![Uploading file..._pb3u7qm37]() ##### Covariance ![](https://i.imgur.com/SJz0Lf3.png) ##### correlation ![](https://i.imgur.com/6XS7RRE.png) ##### Covariance Matrix ![](https://i.imgur.com/4sOAhPu.png) ![](https://i.imgur.com/T1Tyrhf.png) ![](https://i.imgur.com/O81xu4m.png) ![](https://i.imgur.com/CIqHxSX.png) ##### Step ![](https://i.imgur.com/011JL86.png) #### Example ![](https://i.imgur.com/JkQx8r1.png) ![](https://i.imgur.com/8ruwXHH.png) #### Code(PCA) ```python= import pandas as pd import numpy as np from sklearn.decomposition import PCA df = pd.read_csv('./dataset/seeds_dataset.csv', header = None) df = df.drop(7, axis = 1) print(df.head()) x = np.array(df) pca = PCA(n_components = 3) pca.fit(x) x_reduced = pca.transform(x) print('singular values is {}'.format(pca.singular_values_)) print('after pca, all of data is reduced to 3D') print(x_reduced) ``` ``` 0 1 2 3 4 5 6 0 15.26 14.84 0.8710 5.763 3.312 2.221 5.220 1 14.88 14.57 0.8811 5.554 3.333 1.018 4.956 2 14.29 14.09 0.9050 5.291 3.337 2.699 4.825 3 13.84 13.94 0.8955 5.324 3.379 2.259 4.805 4 16.14 14.99 0.9034 5.658 3.562 1.355 5.175 singular values is [47.49531899 21.09635322 3.92284041] after pca, all of data is reduced to 3D [[ 6.63448376e-01 -1.41732098e+00 4.12356541e-02] [ 3.15666512e-01 -2.68922915e+00 2.31726953e-01] [-6.60499302e-01 -1.13150635e+00 5.27087232e-01] [-1.05527590e+00 -1.62119002e+00 4.37015260e-01] [ 1.61999921e+00 -2.18338442e+00 3.33990920e-01] [-4.76938007e-01 -1.33649437e+00 3.55360614e-01] [-1.84834720e-01 -1.50364411e-01 1.41497264e-01] [-7.80629616e-01 -1.12979883e+00 2.79757608e-01] [ 2.28210810e+00 -1.36001690e+00 -3.50729413e-01] [ 1.97854147e+00 -1.49468793e+00 -2.93947251e-03] [ 3.69122947e-01 8.86722511e-01 1.13264978e-01] [-7.11021200e-01 -2.10663730e+00 1.37552595e-01] 
[-1.21370535e+00 9.46878939e-02 4.85809237e-01] [-1.16908541e+00 -7.42962899e-01 2.58209340e-01] [-1.19272176e+00 -9.53268162e-01 2.58450639e-01] [-5.08171207e-01 3.77958424e-01 6.56217572e-01] [-1.37469698e+00 1.32290559e+00 8.01997838e-01] [ 1.05726438e+00 -2.01562875e+00 4.33972249e-01] [-1.50961097e-01 -2.02235813e+00 7.52022386e-01] [-2.46241293e+00 7.37473835e-02 2.12996833e-01] [-6.31332100e-01 -7.18305655e-01 -5.43487434e-02] [-6.89698660e-01 -1.11182531e+00 -1.69906624e-02] [ 1.40769072e+00 -2.80658086e+00 3.14889098e-01] [-2.84267672e+00 -2.66880642e+00 -5.42113806e-02] [ 4.33268215e-01 -1.88984464e+00 1.04814954e-01] [ 1.81289158e+00 -2.60002176e+00 5.34196985e-02] [-2.02131332e+00 -6.08743328e-01 1.84384422e-01] [-2.19571862e+00 -1.49837622e+00 2.35406301e-02] [-7.44468841e-01 -1.06518721e+00 1.56860725e-01] [-1.50350480e+00 -3.68206745e-01 -8.59614612e-03] [-1.52075320e+00 -3.06180225e+00 -1.73200953e-01] [ 7.61190256e-01 -2.09488759e-01 1.65627857e-01] [-7.67738428e-01 1.26295451e-01 -1.20622472e-01] [-8.23965933e-01 -1.70715020e+00 5.32025512e-02] [ 4.39542396e-01 -1.52858534e+00 -5.57773440e-02] [ 1.52205298e+00 -1.25609762e+00 1.35317730e-01] [ 1.65240525e+00 -6.75119440e-01 -3.33822258e-03] [ 2.47674445e+00 -4.51537548e-01 3.09238192e-01] [ 1.15750673e-02 -5.96250977e-01 3.52544529e-02] [-1.11443822e+00 2.83345206e+00 5.68536979e-01] [-1.37160170e+00 -1.30108591e+00 3.86783538e-02] [-1.36349513e+00 -1.63960124e+00 7.21418337e-03] [-1.88302954e+00 -1.51985066e+00 4.14988118e-01] [ 6.29560566e-01 1.10062048e+00 6.34515325e-03] [ 2.84412124e-01 -5.60641213e-01 2.98249761e-01] [-9.60044753e-01 -2.29723526e+00 1.41107629e-01] [ 8.18964617e-01 -2.26570276e+00 1.53990241e-01] [ 1.96621301e-01 -7.40666486e-01 2.29592441e-01] [ 1.53276057e-02 -1.02061146e+00 2.02590478e-01] [ 2.54235169e-01 -1.55022082e+00 -1.05705755e-01] [-5.05384531e-01 1.97773604e-01 1.75366006e-01] [ 7.11973095e-01 1.96593925e+00 5.15648832e-01] [-3.55829381e-01 3.79579753e-01 -1.55069874e-01] [-5.66856332e-01 -4.55332330e-01 8.98393269e-02] [ 1.80968891e-02 -2.21678367e+00 -3.93360924e-01] [ 4.78424858e-01 -1.71349178e+00 -1.93034894e-01] [-3.75464636e-01 -9.76631316e-01 3.10946012e-01] [ 2.83883316e-01 -2.56463849e+00 2.83654402e-01] [ 7.69429731e-01 -1.63154473e+00 1.52399604e-01] [-2.77110124e+00 -2.60034941e+00 2.33323254e-01] [-3.80344820e+00 -1.51695365e+00 2.37149111e-01] [-4.00534905e+00 -1.97086343e+00 2.03609501e-01] [-2.87823982e+00 -8.86757382e-01 4.63038181e-01] [-1.87406423e+00 2.13355284e-01 7.89854003e-02] [-2.05089330e+00 -2.82503499e+00 1.19739519e-01] [-2.16820614e+00 -1.67334585e+00 4.54792497e-01] [-2.59673795e-01 -2.44512020e+00 -6.00299972e-02] [-7.07084427e-01 -1.59064814e+00 -5.52933377e-02] [-2.37114217e-01 -2.28118595e+00 -1.49729134e-01] [-2.28478840e+00 -4.60103809e-01 -1.17460180e-01] [ 3.16409179e+00 8.04126878e-01 -2.66641999e-01] [ 2.20955077e+00 1.27850851e+00 -1.57740826e-01] [ 2.62062165e+00 1.18221241e+00 3.71401616e-02] [ 4.76779625e+00 -1.57604721e-01 9.29992692e-02] [ 2.21232336e+00 6.01138596e-01 -1.39544203e-01] [ 2.07181452e+00 1.50200244e+00 -7.00860345e-02] [ 2.84275994e+00 5.04000805e-01 -2.39983894e-01] [ 6.46235442e+00 1.60085825e+00 -1.52794770e-01] [ 4.47829765e+00 1.97530486e+00 -3.07050833e-01] [ 2.61486732e+00 -5.12991394e-01 2.02041872e-02] [ 1.67836371e+00 2.07292392e+00 -4.87254221e-02] [ 4.03755261e+00 2.14071720e+00 3.50841784e-01] [ 5.71859177e+00 2.21396348e+00 1.90304067e-01] [ 5.58807242e+00 -1.50990711e+00 -3.07129758e-01] [ 5.32256756e+00 
-5.11477550e-02 -1.35205623e-01] [ 4.00732381e+00 -7.29979610e-01 -3.00704179e-01] [ 4.70505178e+00 -1.45419923e+00 -9.90969343e-02] [ 4.79038389e+00 6.44967998e-01 -5.69065929e-01] [ 6.69570804e+00 2.94421273e+00 3.03912945e-01] [ 6.46037602e+00 2.15264305e+00 2.01205274e-01] [ 6.14337744e+00 -9.43926428e-01 -4.15476808e-01] [ 4.39514760e+00 -1.61012778e-02 -1.54841426e-03] [ 4.46139621e+00 1.12624509e-01 -7.95827310e-02] [ 3.78494886e+00 2.79030096e+00 3.90137034e-01] [ 4.01629279e+00 1.80247983e+00 -6.82260736e-01] [ 2.38051551e+00 3.23431649e-01 -3.38653786e-01] [ 5.03718740e+00 4.35062776e-01 -1.48395998e-01] [ 4.92049767e+00 -8.97804759e-01 -6.05902714e-01] [ 3.94065862e+00 -3.15810290e-01 -4.91695711e-01] [ 4.53327546e+00 -9.29513362e-01 -2.00722187e-01] [ 1.65697361e+00 7.28440774e-01 1.41627453e-01] [ 3.63867456e+00 -1.18043005e+00 6.80825411e-02] [ 4.97852528e+00 1.24163180e+00 2.63041542e-01] [ 4.94143745e+00 3.05328003e-01 -2.47908931e-01] [ 4.63589248e+00 2.70940363e-01 -1.14295115e-01] [ 4.52405499e+00 -5.83403419e-01 1.38095195e-01] [ 4.51574435e+00 -2.71311394e-01 -8.74599219e-02] [ 3.12281151e+00 4.56212973e-01 -8.62751578e-02] [ 5.83137312e+00 3.30385849e-01 -4.76189658e-01] [ 4.35718398e+00 -1.41741787e+00 -6.25084495e-02] [ 4.15756069e+00 -9.50833185e-01 9.65051938e-02] [ 5.08259001e+00 6.24704579e-01 6.28827972e-02] [ 4.89139771e+00 -9.82917842e-01 1.28275950e-01] [ 4.44322173e+00 3.57224575e+00 1.56974171e-01] [ 6.67155248e+00 1.84060496e+00 9.12010421e-02] [ 4.90749190e+00 -8.18130733e-01 -2.55684480e-01] [ 4.37369427e+00 1.17679493e+00 4.41493473e-01] [ 4.87191916e+00 1.48790820e-02 -9.56013893e-02] [ 4.44858857e+00 5.06661089e-01 9.35385490e-02] [ 5.88457254e+00 1.27049077e-01 -1.80453388e-01] [ 5.68383523e+00 2.94064884e+00 2.60580298e-01] [ 3.70570300e+00 4.03190365e-01 -1.09288956e-01] [ 1.48852809e+00 7.87950654e-01 -8.22192480e-02] [ 3.98326071e+00 -2.15055765e-01 1.49554483e-01] [ 1.15533338e+00 -2.55675983e-01 5.98162419e-01] [ 4.23450934e+00 1.03173593e+00 1.64406843e-01] [ 4.21700578e+00 1.24930690e+00 -1.56980693e-01] [ 3.62299759e+00 -9.85552845e-01 -1.99064864e-02] [ 6.17386444e+00 -1.00395914e+00 -1.89332536e-01] [ 2.71378793e+00 2.00947941e+00 3.92678397e-01] [ 3.86089703e+00 -3.73513274e-01 8.03597573e-02] [ 4.61500274e+00 -2.10261231e-01 9.79900380e-02] [ 5.92114800e-01 8.66320178e-01 -3.00484271e-01] [ 1.48591457e+00 7.74439749e-01 -1.72450887e-01] [ 6.90660085e-01 1.38976691e+00 -1.68271877e-01] [ 5.30993584e-01 -4.14621634e-02 2.10574049e-01] [ 2.89265049e+00 2.11576887e-01 -2.20244499e-01] [ 1.10275966e+00 -8.95136177e-01 -5.29279898e-01] [ 1.08106342e+00 -8.23294917e-01 -3.54727287e-01] [ 1.58041453e+00 2.92679166e-01 -2.25284288e-01] [-2.08047727e+00 1.36508853e+00 -1.99716808e-01] [-2.04891061e+00 3.11005161e+00 -6.38273447e-02] [-1.93112897e+00 2.06805453e+00 3.42899489e-02] [-3.14761497e+00 1.38674593e+00 -2.02387013e-02] [-3.35755431e+00 3.62389743e-01 -2.78959609e-01] [-4.22241525e+00 1.97238736e+00 -3.44133543e-01] [-3.55211446e+00 -1.92651406e+00 -3.78402644e-01] [-2.74248901e+00 3.68275745e-01 9.39425325e-02] [-2.26028697e+00 -7.15757433e-01 -3.01514983e-01] [-4.59256683e+00 1.21366537e+00 -4.10485088e-01] [-3.49115131e+00 1.07925934e+00 -2.38194431e-01] [-3.44038060e+00 2.89298232e+00 -2.07485411e-01] [-2.88398331e+00 7.17984884e-01 -3.58805444e-01] [-3.96469498e+00 -8.67031460e-01 -2.74174690e-01] [-3.85801218e+00 -1.19622949e-01 -3.44490680e-01] [-4.23854664e+00 1.60807548e+00 -2.98762740e-01] [-3.89609307e+00 -8.50339426e-01 
-6.58059699e-02] [-2.98607257e+00 7.68415914e-01 -3.42792920e-01] [-3.33747668e+00 2.84753275e-01 -5.22692614e-01] [-3.83091968e+00 1.23660236e+00 -3.79258301e-01] [-2.36752198e+00 -8.93936559e-01 -5.13938956e-01] [-3.15770694e+00 1.92519249e-01 -3.20706760e-01] [-3.23143479e+00 8.85496109e-01 -4.51321791e-02] [-2.61469342e+00 3.94920094e-01 -8.06497416e-02] [-4.49824136e+00 2.13615465e+00 6.33923291e-02] [-2.94329647e+00 -1.88566242e+00 -4.82086609e-02] [-2.76609970e+00 8.91774133e-01 -1.72181504e-01] [-2.89907369e+00 -4.09293999e-01 -4.02323115e-01] [-3.90252444e+00 1.58340894e-01 -2.77227460e-01] [-3.95458414e+00 -6.73045369e-01 -2.41546685e-01] [-4.52101870e+00 2.49812224e+00 -2.49745319e-01] [-4.04118364e+00 2.51581412e+00 1.29405374e-01] [-4.04672999e+00 1.00738290e-01 -9.09399508e-02] [-4.03387689e+00 1.39431425e+00 -9.26707967e-02] [-4.51655795e+00 9.40408717e-01 -4.06201269e-01] [-4.67880742e+00 4.91842458e-01 -5.80895287e-02] [-4.15214114e+00 1.12758416e+00 -1.65527388e-01] [-4.67134302e+00 4.21033845e-01 -1.71844958e-01] [-4.01783124e+00 1.67978706e+00 3.14971477e-03] [-2.60793947e+00 -2.37304506e+00 -3.55362015e-01] [-4.03445958e+00 7.40517346e-01 1.30612374e-01] [-2.84072766e+00 9.33512708e-01 5.46936118e-02] [-3.09276593e+00 7.75654915e-01 -5.61725952e-02] [-3.75632591e+00 1.04703805e+00 -4.29715405e-02] [-2.41503022e+00 2.20440679e+00 -8.67431609e-02] [-3.57450320e+00 -7.19443063e-02 -4.01721420e-01] [-3.37275950e+00 8.03901892e-01 -4.60390433e-01] [-4.43114913e+00 -7.75918129e-02 -1.41347439e-01] [-4.55059431e+00 3.26576882e+00 1.99485527e-01] [-5.00252227e+00 6.36766609e-01 1.71812058e-01] [-4.55827661e+00 1.13664276e+00 -9.51701336e-02] [-4.04374701e+00 -2.25814885e-01 -6.87186458e-02] [-3.36161741e+00 -5.27869965e-01 -4.74331964e-02] [-4.56094950e+00 5.95574083e-01 -2.83496804e-01] [-3.11855689e+00 3.31944329e-02 3.57435656e-02] [-2.52946420e+00 8.37109623e-01 3.64003383e-01] [-2.58655448e+00 1.44846001e+00 3.01071387e-01] [-1.83336074e+00 7.30693089e-01 2.15840207e-01] [-2.36059318e+00 -6.86820620e-01 -2.53316928e-01] [-2.35819688e+00 -1.20488410e+00 3.56048199e-01] [-2.97998867e+00 1.39806029e+00 1.31248700e-01] [-2.41873891e+00 -1.74952111e+00 4.09312427e-01] [-4.21924202e+00 -1.94251854e-01 1.18503747e-01] [-3.08869155e+00 4.37638669e+00 4.99058361e-01] [-2.78962325e+00 -1.41941777e-01 5.02954010e-02] [-3.04187227e+00 -4.73126171e-01 1.95045363e-01] [-4.10906270e+00 1.09340872e-01 -8.74005598e-02] [-2.50003394e+00 4.30796502e+00 5.32818431e-01] [-3.33207854e+00 -5.25289746e-01 -9.81079349e-02] [-3.10755116e+00 1.54975743e+00 1.21282793e-01]] ``` ## Kaggle 實戰 https://www.kaggle.com/c/titanic ### Goal It is your job to predict if a passenger survived the sinking of the Titanic or not. For each in the test set, you must predict a 0 or 1 value for the variable. ### Metric Your score is the percentage of passengers you correctly predict. This is known as accuracy. ### Submission File Format You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows. The file should have exactly 2 columns: PassengerId (sorted in any order) Survived (contains your binary predictions: 1 for survived, 0 for deceased) PassengerId,Survived 892,0 893,1 894,0 Etc. 
### 匯入資料 ```python= train_df = pd.read_csv('train.csv') test_df = pd.read_csv('test.csv') combine = [train_df, test_df] ``` ### 觀察資料 #### 看header ```python= print(train_df.columns.values) ``` #### Print a concise summary of a DataFrame. ```python= train_df.info() print('_'*40) test_df.info() ``` ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB ________________________________________ <class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 11 columns): PassengerId 418 non-null int64 Pclass 418 non-null int64 Name 418 non-null object Sex 418 non-null object Age 332 non-null float64 SibSp 418 non-null int64 Parch 418 non-null int64 Ticket 418 non-null object Fare 417 non-null float64 Cabin 91 non-null object Embarked 418 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 36.0+ KB ``` #### Generate descriptive statistics. ```python= train_df.describe() ``` ![](https://i.imgur.com/hhQN91w.png) #### 評分標準 ```python= round(model.score(X_train, Y_train) * 100, 2) ``` ### 匯出CSV ```python= submission = pd.DataFrame({ "PassengerId": test_df["PassengerId"], "Survived": Y_pred }) submission.to_csv('submission.csv', index=False) ```
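### 完整流程示意
* 下面把前面的步驟(匯入資料、特徵工程、選模型、評分、匯出CSV)串成一個最小可跑的範例;這是筆記自行補充的示意寫法,假設沿用前面章節介紹過的`LogisticRegression`,特徵工程只做最簡單的處理(丟掉文字欄位、類別轉數字、缺失值補中位數),`preprocess`是示意用的自訂函式,實際比賽時特徵工程與模型選擇都還有很大的空間
```python=
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

# 匯入資料
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# 簡單的特徵工程(示意用):丟掉文字欄位、類別轉數字、缺失值補中位數
def preprocess(df):
    df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
    df['Sex'] = pd.Categorical(df['Sex']).codes
    df['Embarked'] = pd.Categorical(df['Embarked']).codes
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    return df

train = preprocess(train_df)
test = preprocess(test_df)

y_train = train['Survived']
x_train = train.drop(['Survived', 'PassengerId'], axis=1)
x_test = test.drop('PassengerId', axis=1)

# Normalization
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# Select model + 評分
model = LogisticRegression()
model.fit(x_train, y_train)
print(round(model.score(x_train, y_train) * 100, 2))

# Predict + 匯出CSV
y_pred = model.predict(x_test)
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": y_pred
})
submission.to_csv('submission.csv', index=False)
```
* 跑完後會在目前資料夾產生`submission.csv`,內容即為上面要求的PassengerId與Survived兩個欄位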