NTCU Machine Learning Assignment

:::info
I'll come back and tidy up the process when I have time.
:::

## Results

### random-forest

Senior students' example:
![image](https://hackmd.io/_uploads/rJQ7eJXMgl.png)

My result:
![image](https://hackmd.io/_uploads/r14Xde7Mex.png)

### K-means

My result:
![image](https://hackmd.io/_uploads/B1O9Jgmzle.png)

### Hybrid model

Senior students' example:
![image](https://hackmd.io/_uploads/HkTye17Mee.png)

My result:
![image](https://hackmd.io/_uploads/H13NCyQfgx.png)

Training and Testing Loss
![Hybrid-loss](https://hackmd.io/_uploads/SybBARGfeg.png)

Threshold vs. F1 Score
![Hybrid-Threshold](https://hackmd.io/_uploads/SJSE0Affgl.png)

## Prerequisites

Training dataset link:
> https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data

The data has already been transformed with PCA:
> Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'.

The dataset page notes that the class distribution is extremely imbalanced, so AUPRC should be used as the accuracy metric:
> we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC).

For now, though, I focus on improving the F1 score.

Senior students' example:
![image](https://hackmd.io/_uploads/rJMiI2TZll.png)

## random forest

Original version
![image](https://hackmd.io/_uploads/B1sU8gCble.png)

n_estimators = 300, max_depth = 20, class_weight = {0: 1, 1: 5}
![image](https://hackmd.io/_uploads/HkAa4x0Wlx.png)

n_estimators = 200, max_depth = 20, class_weight = {0: 1, 1: 7}
![image](https://hackmd.io/_uploads/B1eCBy7Ggx.png)

## Hybrid model

My own implementation of Isolation Forest + XGBoost scored as follows. Recall is only 80, lower than the senior students' example.
![image](https://hackmd.io/_uploads/HkssT6h-ge.png)

### Initial results

Printing the current training error and testing error:
![image](https://hackmd.io/_uploads/r1UxaT3bee.png)

And the training and testing loss during training. In the later stage, the testing loss does not keep decreasing together with the training loss.
![image](https://hackmd.io/_uploads/r1xIzAhZxg.png)

### Optimization 1: Handle the data imbalance

Here I use XGBoost's built-in `scale_pos_weight` parameter. `scale_pos_weight` addresses class imbalance by adjusting the weight of the positive samples (fraud) during training, when the loss function is computed.

```py=
from xgboost import XGBClassifier

# Weight the positive (fraud) class by the negative/positive ratio
model = XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    scale_pos_weight=len(y_train[y_train==0]) / len(y_train[y_train==1]),  # class imbalance
    random_state=42
)
```

After this change, I got the following results:
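![image](https://hackmd.io/_uploads/B1x-EC2Zxe.png)

Recall also improved.
![image](https://hackmd.io/_uploads/rymI4C2Zgg.png)

I decided to further address the Recall issue caused by the data imbalance. These are the common approaches:

| Method | Description | Pros | Cons |
| ----------------- | ------------------- | -------------- | --------- |
| **Oversampling** | Duplicate or synthesize positive-class samples | Wastes no data | Prone to overfitting |
| **SMOTE** | Synthesize new positive-class samples by interpolating between neighboring positive points | Increases data diversity, avoids pure duplication | May produce unrealistic samples |
| **Undersampling** | Randomly drop negative-class samples until the ratio is close | Fast and effective | Loses a lot of real data |

In the end I decided to go with SMOTE + Undersampling: increase the number of positive samples (fraud) and reduce the number of negative samples (normal) so that the two classes end up close in size. The training and testing loss results above also made me think the model could tolerate the possible side effects of SMOTE + Undersampling.

I implemented SMOTE + Undersampling with the imblearn library. SMOTEENN is a sampling method that combines Oversampling (SMOTE) with Undersampling (Edited Nearest Neighbours cleaning).

```py=
from imblearn.combine import SMOTEENN

sme = SMOTEENN(random_state=42)
X_resampled, y_resampled = sme.fit_resample(X_train, y_train.ravel())
```

The data after resampling. The number of fraud cases is now roughly the same as the number of normal cases.
![image](https://hackmd.io/_uploads/Hkqx2Ahbel.png)

Here is the result of the run. Recall is higher than in the previous run, but precision is terrible. I suspect this is because I forgot to remove `scale_pos_weight`.
![image](https://hackmd.io/_uploads/Hy3is02bge.png)

The training loss and testing loss are also in a pretty good range.
![image](https://hackmd.io/_uploads/SyII3CnWge.png)

Hmm... still just as bad. Let me switch to Undersampling only for now.
![image](https://hackmd.io/_uploads/S1VCCAhblg.png)

Recall actually reached 94, but precision is only 3. I'd rather stick with the more stable `scale_pos_weight`.
![image](https://hackmd.io/_uploads/HypRkyablg.png)

Trying Oversampling:
![image](https://hackmd.io/_uploads/HJqCz1pWel.png)

Recall and AUPRC are decent, but the Precision Score could be better.
![image](https://hackmd.io/_uploads/BkFbm1aZeg.png)

In the end I went with the more stable `scale_pos_weight` for now.

### Optimization 2: Make the model more aggressive

`gamma` (alias `min_split_loss`) is one of XGBoost's important hyperparameters; other tree-based boosting libraries have a comparable setting (for example `min_split_gain` in LightGBM). It sets the minimum loss reduction required to split a tree node, so it controls how conservative the splitting is, with the goal of preventing overfitting.

Result with gamma set to 0.1. Training is still very stable.
![image](https://hackmd.io/_uploads/B1Ow416bge.png)

After testing a few values, gamma = 0.5 gives good numbers.
![image](https://hackmd.io/_uploads/r11takaZxx.png)

gamma 0.5, max_depth 5
![image](https://hackmd.io/_uploads/HkpKegTZxe.png)

gamma 0.5, max_depth 7
![image](https://hackmd.io/_uploads/BJHkbl6-xg.png)

scale coefficient 0.8
![image](https://hackmd.io/_uploads/S1YT9ipZee.png)

scale coefficient 0.9. LOL, exactly the same as the example.
* iso:
    * n_estimators = 600
* xgboost
    * scale coefficient 0.9
    * max_depth = 7
    * early_stopping_rounds = 10

![image](https://hackmd.io/_uploads/S1Ul2iTZgg.png)

subsample = 0.8
![image](https://hackmd.io/_uploads/BkDk1haWxg.png)

subsample = 0.9
![image](https://hackmd.io/_uploads/SJIjy2aZel.png)

### Optimization 3: Find the best Threshold

Fixed the data.
![image](https://hackmd.io/_uploads/rJuIVTTZeg.png)

Got a good F1 score.
![image](https://hackmd.io/_uploads/H1BYPapZxl.png)

I lowered max_depth to 5 and picked the threshold that maximizes the F1 score. This gave the highest F1 score so far, and both precision and recall look good.
![image](https://hackmd.io/_uploads/BybDdapblg.png)

### Optimization 4: Switch models

Switching to an LOF model might give better results. I tried it earlier and the results did improve, but I did not keep going with it.

lof
![image](https://hackmd.io/_uploads/Hky37bT-gx.png)

## To be organized

max_depth = 4, precision reached 99.9
![image](https://hackmd.io/_uploads/B1t8Y6TZxl.png)

n_estimators = 200
![image](https://hackmd.io/_uploads/SJCMDlaWgl.png)

n_estimators = 300
![image](https://hackmd.io/_uploads/rJj2cxp-ll.png)

n_estimators = 600
![image](https://hackmd.io/_uploads/BJUOjlTbll.png)

scale_pos_weight without the coefficient
![image](https://hackmd.io/_uploads/HJcBEWa-le.png)

early_stopping_rounds=10
![image](https://hackmd.io/_uploads/B1uRKWpbgl.png)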
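
A note on the threshold search from Optimization 3: the notes do not include the code that was actually used, so the following is only a minimal sketch of one way to pick the threshold that maximizes F1. The function names `f1_at_threshold` and `best_f1_threshold` and the toy labels and scores are all made up for illustration; they are not results from the credit card dataset.

```python
def f1_at_threshold(y_true, scores, threshold):
    """Return (precision, recall, f1) when predicting 1 for scores >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, scores):
    """Try every distinct score as a threshold and keep the one with the highest F1."""
    best = (0.5, 0.0, 0.0, 0.0)  # (threshold, precision, recall, f1)
    for threshold in sorted(set(scores)):
        precision, recall, f1 = f1_at_threshold(y_true, scores, threshold)
        if f1 > best[3]:
            best = (threshold, precision, recall, f1)
    return best


# Toy example: hypothetical labels (1 = fraud) and predicted fraud probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, precision, recall, f1 = best_f1_threshold(y_true, scores)
print(f"threshold={threshold}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

With a real model, `scores` would be the positive-class probabilities (for example the second column of `predict_proba`). Choosing the threshold on a validation split rather than on the test set keeps the reported F1 from being optimistic. scikit-learn's `precision_recall_curve` performs the same kind of sweep and can replace the hand-written loop.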