機器學習 - HackMD

###### tags: `AI/ML` # 機器學習 ## 混淆矩陣 <a href = "https://www.gss.com.tw/eis/157-eis82/1540-eis82-2">reference </a><br> 將混淆矩陣內分成四個象限(真假以預測結果為基準): * 實際為真且預測為真(True Positive,TP) * 實際為真且預測為假(False Negative,FN) * 實際為假且預測為假(True Negative,TN) * 實際為假且預測為真(Fasle Positive,FP) <html> <a href = 'https://www.ycc.idv.tw/confusion-matrix.html'> <img src ="https://i.imgur.com/RbVHzPR.jpg"> </a> </html> 1. 靈敏度(Recall,Sensitivity) - 真陽性率，陽性樣本中，被檢測出為陽性的機率 TP/(TP+FN) 2. 特異度(Specificity) - 真陰性率，陰性樣本中，被檢測為陰性的機率 TN/(FP+TN) 3. 準確率(Precision) - 預測陽性內實際為陽性的機率 TP/(TP+FP) 4. F1 score - 2/(1/precision + 1/recall) 5. ACC - 正確預測的機率 (TP+TN)/(FP+FN+TP+TN) <a> 三種平均方式及差異 ![](https://i.imgur.com/wcmCPt5.jpg) </a> ### ROC 與 PR ![](https://i.imgur.com/v8jKLa2.jpg) &emsp;ROC曲線當中，以閾值為變動量，紀錄當閾值改變時測出來的Sensitivity 以及 1-Specificity(偽陽性)大小為何，當期圖形下面積(AUC) 為1時，代表模型有極佳準確率，當在(0,1)點時，代表有最大敏感度且偽陽性為零 * AUC=0.5 (no discrimination 無鑑別力) * 0.7≦AUC≦0.8 (acceptable discrimination 可接受的鑑別力) * 0.8≦AUC≦0.9 (excellent discrimination 優良的鑑別力) * 0.9≦AUC≦1.0 (outstanding discrimination 極佳的鑑別力) PR curve - 用來預測模型資料不平均的典型指標 <a href = 'https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/'> ![](https://i.imgur.com/I0a7jCP.png) </a> ROC 和 PR 在用來看效能時，ROC大多是直接看模型的本體效能(sensitivity and specificity)，因此不太會隨著資料點的增多或減少而有較大的改變，而PR則是看盛行率(prevalence)，會隨著資料點的增減而有較大的改變(可以看快篩的介紹)，因此在快篩的說明書上，大多只附sensitivity 以及 specificity，而不會附上precision 以及 recall。 ## Underfitting/Overfitting ### Underfitting &emsp;通常發生於模型的強度無法完全分辨所需要的資料點，因此會導致無法完全擬合的情況，使得成效不佳，牽涉到Bias & Variance的部分(文章下半部分)，可藉由觀察模型的learning curve判別，若訓練與驗證集資料曲線貼合彼此，則可能代表發生此情況，此時增加訓練資料並無法有效改善狀況，要改善此狀況需從模型本身加強，通常會出現此情況已代表模型本身強度過於弱小，因此需想辦法加強模型強度或是改變模型複雜度。 ### Overfitting &emsp;訓練完成的模型過度擬和於訓練資料，因此而導致今天未知資料在判別時，缺少彈性，進而導致在真實使用上不如training時良佳，可藉由learning curve判別，當今天的測試集損失，隨著訓練資料增加或是epoch增加時，會逐漸上升，代表此時模型已經無法公允的判別未知情況，也就是過擬和於訓練資料，此時可使用regularization、batch normalizaion、drop out等方式侷限模型發展。 -------------------- ## 統計資料 p-value - 強度指標，主要用來衡量推翻虛無指標的強度有多強，當P-value越小，則代表推翻虛無假說的強度愈強，在判斷顯著性時，以小於一域值視為(significant or non-sighnificant)，若使用雙尾檢定產生的結果，其p值通常較大，採用單尾檢定的結果其p值通常較小，然而由於其在真實情況上通常沒有真實意義（皆為統計結果），因此逐漸被信賴區間取代。信賴區間（Confidence Interval, CI）- 主要用以量化估計值的不確定性，通常以95%表示，代表實驗者有信心能夠有95%的受試者（樣本)介在指定範圍內。勝算比（Odds ratio, OR）- 為試驗組中發生結果的勝算（Odds）與對照組中發生結果的勝算，此兩者間的比值就稱為勝算比（OR）。 U-test - 用以觀看兩母體中的中位數是否有差別(需為連續變數)，通常取p value小於0.05作為H0成立 T-test - 用以觀看兩母題的平均數是否有差異(需要為常態分佈、連續變數)，通常取 p value<0.05 作為H0成立 <a href = 'https://www.technologynetworks.com/informatics/articles/one-way-vs-two-way-anova-definition-differences-assumptions-and-hypotheses-306553'> anova test - </a> * one-way: 用以觀看兩組以上的母體當中平均是否有差異(需要為常態分布、連續變數、變異數相同)，通常取p value小於0.05 作為H0成立 * two-way: 用以觀看兩組以上的母體時，兩因子間是否有關連性，以及單因子中各族群平均是否相同(需要為常態分布、連續變數、變異數相同) e.q. 今天要探討收入與性別的關聯性，則可以收集多組不同國家、種族內的收入與性別，並且利用two-way anova分析是否個別族群符合H0，若個別族群間不符合H0則代表族群內部相異性為否，以及兩變數間是否為H0，若兩族群間為H0則代表兩族群沒有內部關聯性。 ------------------------------------------ ## ML 模型 ### Random Forest &emsp;用更多的決策樹組成模型，避免單一決策樹造成過擬和的問題。 <br> &emsp;邏輯:從總樣本中抽取n個樣本(可重複抽取)，並且選取K個特徵建立決策樹，重複上列m次，則可產生m棵決策樹以及m組資料，最後利用每棵樹的結果多數決，決定最後總結果。 <a href = 'https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/'> Random forest hyperparameter </a> <a href ='https://ithelp.ithome.com.tw/m/articles/10272586'> ![](https://i.imgur.com/oMP0PKI.png) </a> 常見參數: >1. max_depth: The max_depth of a tree in Random Forest is defined as the longest path between the root node and the leaf node. > -> modify this can decide the performance of model. >2. min_sample_split: Parameter that tells the decision tree in a random forest the minimum required number of observations in any given node to split it. Default = 2 >-> when setting this too small may cause overfitting(分類直到單個leaf剩下N個解) >3. max_leaf_nodes: This hyperparameter sets a condition on the splitting of the nodes in the tree and hence restricts the growth of the tree. -> when setting this too big may cause overfitting, too small may cause underfitting(設定決策樹最後剩下幾個nodes) >4. min_samples_leaf: This Random Forest hyperparameter specifies the minimum number of samples that should be present in the leaf node after splitting a node. Default = 1 >->(分類直到單個leaf剩下n個資料) >5. n_estimators: Number of trees in the forest. >6. max_sample: The max_samples hyperparameter determines what fraction of the original dataset is given to any individual tree. >->when having more than 0.2,this doesn't really effect the f1 score,but less this could reduce training time >7. max_features: This resembles the number of maximum features provided to each tree in a random forest. >8. bootstrap: Method for sampling data points (with or without replacement). Default = True >9. criterion: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. > ### light GBM &emsp;決策樹的變形，以直方圖為邏輯基準，先將資料用直方圖方式分類，用父減單一子分支的方式，計算另一分支數量，則可減少計算分支數量(差加速)，在建立決策數時，只延伸單側(應該是利用梯度決定-利用GOSS)，當中使用的GBDT為利用殘差作運算。 <a href = 'https://hackmd.io/@WangJengYun/HyLtemyxI'> Hack MD light GBM</a> <a href = 'https://zhuanlan.zhihu.com/p/99069186'> 知呼 light GBM </a> ### SVM(支援向量機) &emsp;將二維空間利用kernel擴增到多維度空間，則可利用多維度空間的方式更優化的去做分類(以此方式可做非線性分類 e.q.環狀)，但此模型在運行上效率較差，需要較多的時間tuning，且必須準確較準其iter次數，以避免其模型卡住或過擬和 <a href = 'https://ithelp.ithome.com.tw/articles/10270447'> ![](https://i.imgur.com/yGnSnTs.png) </a> ### Adaboost &emsp; boosting算法:在集成訓練初始的的弱學習器中，所有權值皆相同，在每次訓練弱學習器時，先前的學習器訓練結果會影響後續的學習權值，於先前學習器內學習效果較差的訓練樣本(loss較大)會於後續學習時加重其權值（加強訓練該特徵），於最後集成中，會將誤差率小的弱分類器權重加大，將誤差率大的弱分類器權重調小，集成強分類器（將正確率高的權重加大）。 ### Logistic Regression &emsp;用以分類"兩"類別的分類器，藉由訓練方程式內各參數所相依的權重w,與方程式bias(b)，可由w*(參數)+...+b得出z，並將z傳入sigmoid function，將其範圍侷限於0～1，定義若經過sigmoid function後值大於0.5屬於class 1，小於0.5屬於class 2。<br> <img src = 'https://i.imgur.com/WVCIVag.png' width=200><<--sigmoid function ## learning curve, bias&variance <a href = 'https://medium.com/ai%E5%8F%8D%E6%96%97%E5%9F%8E/learning-model-%E5%AD%B8%E7%BF%92%E6%9B%B2%E7%B7%9A%E8%A8%BA%E6%96%B7%E6%A8%A1%E5%9E%8B%E7%9A%84%E5%81%8F%E5%B7%AE%E5%92%8C%E6%96%B9%E5%B7%AE-%E4%B8%A6%E5%84%AA%E5%8C%96%E6%A8%A1%E5%9E%8B-47e472507923'>Learning Model : 學習曲線診斷模型的偏差和方差，並優化模型</a> #### **bias&variance** &emsp; 定義一個完美模型f可以完美的特徵與目標間關聯，f只存在於理想(完全未知)，定義利用已知訓練集訓練出來的模型為f^，已知由不同訓練集訓練出的f^ 不同，因此定義不同訓練集時f^的改變程度為*variance*。 &emsp; *bias*(偏差)，當實際對應關係與假設差距越大，bias越大。 &emsp; 在訓練時希望能夠得到較小的bias以及較小的variance，然而當variance極小時，代表模型十分擬和訓練集，因此當實際情況帶入時bias可能會愈高，模型的 bias 越低，它適應數據的能力就越強，同時 variance 也越高。所以，bias 越低，variance 越高。因此在數學上兩者無法同時達成，則需要trade-off。 ![](https://i.imgur.com/zg4DXiC.png) ![](https://i.imgur.com/VW68L4n.png) ### **learning curve** ![](https://i.imgur.com/V2lk6l5.png) &emsp; 在機器學習橫軸為訓練樣本的情況下，初始時MSE為0(train set)為正常的情況，因為此時的訓練樣本只有一個，因此模型很容易完全擬和這個情況，但當訓練樣本的數量增大時，會使得MSE逐漸增大(train set)，因為當樣本增加後，模型在訓練時便會造成誤差的產生，然而在validation set情況上則相反，在初始只有訓練一個樣本時，模型是完全擬和那個樣本，因此泛化性較低(無限趨近於無泛化性)，因此會造成初始時valiation set的MSE極高，而隨著訓練大小的增加，模型的泛化性逐漸增加，使得validation set的MSE下降，直至收斂。 ---------- ## K-fold validation &emsp;在機器學習中，時常使用K摺作為驗證技術，在使用K摺驗證時，會將切割完成的訓練集，切分為K等分，當中取一等分作為驗證集，剩餘的K-1等分則會做為訓練資料，在使用K-1等分資料訓練完成後，使用驗證集回測模型，並且依造回測結果，調整模型超參數，直至訓練成效達到預期，於此之後便不會再調整參數，並以測試集作為最終結果成績，並不會於測試集後又重複調整參數，避免data peaking的問題。 <img src = "https://i.imgur.com/arcwZVJ.png"> --- ## SHAP ``` def shapely( model =None, file_train = None, file_test = None, random_state_1 = 42, ): '==============================upsampling train/origin=======================' fold_train = pd.read_csv(file_train) #rename(消去columns項內非英文以及數字的解) fold_train = fold_train.drop(columns=['Unnamed: 0']) fold_train = fold_train.rename(columns=lambda x:re.sub('[^A-Za-z0-9_]+', '', x)) #print(i,'-1') # upsampling train set one_message = fold_train[fold_train["Class"] == 1 ] zero_message =fold_train[fold_train["Class"] == 0] one_message_up = resample(one_message, replace=True, n_samples=len(zero_message), random_state=random_state_1) fold_train = pd.concat([one_message_up,zero_message]) fold_train = shuffle(fold_train,random_state = random_state_1) fold_train= fold_train.reset_index(drop = True) # set the train set for single fold x_train = fold_train.drop(columns=['Class']) y_train = fold_train['Class'] roc_model = model.fit(x_train,y_train) explainer = shap.Explainer(roc_model,x_train) shap_values = explainer(x_train) shap.plots.beeswarm(shap_values, max_display=25) ```