[Learn Note] [Machine Learning] 9 Common Methods for Feature Importance Analysis in Python

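# 1. Permutation Importance (RandomForestClassifier)

This method randomly shuffles the values of a single feature and measures how much the model's performance changes. A clear drop in performance indicates that the feature is important to the model. The method is model-agnostic: it works with any fitted estimator, including tree models such as the RandomForestClassifier used here, as well as linear regression or neural networks.

## Group 1

![Feature Importance - RandomForestClassifier - Permutation Importance - Least Missing Value Feature](https://hackmd.io/_uploads/H1M3NCczC.png)

## Group 2

![Feature Importance - RandomForestClassifier - Permutation Importance - Full Feature](https://hackmd.io/_uploads/Syt3V09GC.png)

# 2. Built-in Feature Importance (coef_ or feature_importances_) (RandomForestClassifier)

Some models compute feature importance internally as part of training. In a linear regression model, the coefficients (coef_) show how each feature affects the prediction; they are comparable across features only when the features are on the same scale. Tree models such as random forests or XGBoost expose a feature_importances_ attribute that reflects how much each feature contributes to the splits.

## Group 1

![Feature Importance - RandomForestClassifier - Least Missing Value Feature](https://hackmd.io/_uploads/B1xCpERcGA.png)

## Group 2

![Feature Importance - RandomForestClassifier - Full Feature](https://hackmd.io/_uploads/S14CVCqz0.png)

# 3. Leave-one-out (RandomForestClassifier)

This method removes one feature at a time, retrains the model, and observes the change in performance. If removing a feature causes a significant drop in performance, that feature is important. The approach is simple but time-consuming, because the model must be retrained once per feature.

## Group 1

![Feature Importance - Leave-one-out - Least Missing Value Feature](https://hackmd.io/_uploads/ByYe7KhMR.png)

## Group 2

![Feature Importance - Leave-one-out - Full Feature](https://hackmd.io/_uploads/SkpgXF2MA.png)

# 4. Correlation Analysis

Correlation analysis measures the linear relationship between two variables and is commonly used to pick out the features that are highly correlated with the target variable. In high-dimensional data, it can help reduce irrelevant or redundant features.

## Group 1

![Correlation between Features and Target (Absolute Values) - Least Missing Value Feature](https://hackmd.io/_uploads/SJU-XY3GC.png)

## Group 2

![Correlation between Features and Target (Absolute Values) - Full Feature](https://hackmd.io/_uploads/S15-mthfC.png)

# 5. Recursive Feature Elimination (RandomForestClassifier)

RFE is a recursive method. Starting from the full feature set, it fits the model, removes the least important features according to the model's own importance scores, and repeats the process on the remaining features until the desired number of features is left. The order in which features are eliminated gives a ranking. Choosing the subset size by evaluating model performance requires a cross-validated variant such as scikit-learn's RFECV.

## Group 1

![Feature Selection Ranking - RandomForestClassifier - Least Missing Value Feature](https://hackmd.io/_uploads/SkeMQF2M0.png)

## Group 2

![Feature Selection Ranking - RandomForestClassifier - Full Feature](https://hackmd.io/_uploads/BkBMQY3GA.png)

# 6. XGBoost Feature Importance (XGBClassifier)

XGBoost is a powerful gradient-boosted tree model that computes the importance of each feature automatically, usually measured with metrics such as gain or cover. These metrics reflect how much each feature contributes to the splits of the decision trees.

## Group 1

![Feature Importance - XGBoost Classifier - Least Missing Value Feature](https://hackmd.io/_uploads/rJpMXKnGC.png)

## Group 2

![Feature Importance - XGBoost Classifier - Full Feature](https://hackmd.io/_uploads/B1-m7Knz0.png)

# 7. Principal Component Analysis (PCA)

PCA is an unsupervised dimensionality reduction technique. It applies a linear transformation that projects the original features onto a new set of orthogonal axes, called principal components, chosen so that each successive component captures as much of the remaining variance in the data as possible. This reduces the number of features while retaining most of the variation in the data.

## Group 1

![PCA Visualization - Least Missing Value Feature](https://hackmd.io/_uploads/SyIQ7Ynf0.png)

## Group 2

![PCA Visualization - Full Feature](https://hackmd.io/_uploads/SJ9XXt3MA.png)

# 8. Analysis of Variance (ANOVA)

ANOVA compares the means of several groups of samples and can be used to measure how strongly one or more independent variables affect a dependent variable. Applied to feature selection, it helps pick out the features whose means differ significantly across the classes of the target variable.

The F-value is the ratio of the variance between the target classes to the variance within them, and the p-value indicates how statistically significant that relationship is.

## Group 1

![ANOVA F-Scores - Least Missing Value Feature](https://hackmd.io/_uploads/rJNNmFhzA.png)

![ANOVA P-Values - Least Missing Value Feature](https://hackmd.io/_uploads/B1KE7F2MC.png)

## Group 2

![ANOVA F-Scores - Full Feature](https://hackmd.io/_uploads/ryvHmYnMR.png)

![ANOVA P-Values - Full Feature](https://hackmd.io/_uploads/r15rQKnfA.png)

# 9. Chi-Squared Test

The chi-squared test is a statistical test of association between categorical variables. In feature selection, it helps pick out the features that have a significant association with the target variable, and it is usually applied to classification problems.

## Group 1

![Chi-Squared Scores - Least Missing Value Feature](https://hackmd.io/_uploads/HJMLQK2GA.png)

![Chi-Squared P-Values - Least Missing Value Feature](https://hackmd.io/_uploads/S1rIXF2GA.png)

## Group 2

![Chi-Squared Scores - Full Feature](https://hackmd.io/_uploads/BkOIXYnzA.png)

![Chi-Squared P-Values - Full Feature](https://hackmd.io/_uploads/r13UQK3zR.png)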
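As a rough illustration of the model-based methods (1, 2, 3, and 5), the sketch below runs them on one RandomForestClassifier with scikit-learn. The note does not show its own code or dataset, so everything here is an assumption: the data is synthetic (`make_classification`), the feature names `f0`..`f7` and the helper `fit_forest` are hypothetical, and the leave-one-out step is implemented as retraining without each column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, the first 3 informative.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical feature names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def fit_forest(X_fit, y_fit):
    """Hypothetical helper: fit a fresh random forest with fixed settings."""
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)


model = fit_forest(X_train, y_train)
baseline = model.score(X_test, y_test)  # accuracy with all features

# Method 1: permutation importance, measured on held-out data.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_scores = dict(zip(names, perm.importances_mean))

# Method 2: built-in impurity-based importance of the fitted forest.
builtin_scores = dict(zip(names, model.feature_importances_))

# Method 3: leave-one-out, retraining without each feature in turn.
# A positive value means accuracy dropped when the feature was removed.
drop_scores = {}
for i, name in enumerate(names):
    reduced = fit_forest(np.delete(X_train, i, axis=1), y_train)
    drop_scores[name] = baseline - reduced.score(np.delete(X_test, i, axis=1), y_test)

# Method 5: recursive feature elimination; rank 1 marks the kept features.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=3)
rfe.fit(X_train, y_train)
rfe_ranking = dict(zip(names, rfe.ranking_))

print(f"{'feature':<8}{'permute':>10}{'built-in':>10}{'drop':>10}{'rfe':>6}")
for name in names:
    print(f"{name:<8}{perm_scores[name]:>10.3f}{builtin_scores[name]:>10.3f}"
          f"{drop_scores[name]:>10.3f}{rfe_ranking[name]:>6d}")
```

The four columns are on different scales, so they are best compared by rank rather than by value. Permutation and leave-one-out scores can be slightly negative for uninformative features, which just reflects noise in the held-out accuracy.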
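The statistical methods (4, 7, 8, and 9) need no trained classifier. The following is a minimal sketch under assumptions, not the note's actual code: the data is synthetic, the feature names are hypothetical, and because scikit-learn's `chi2` accepts only non-negative inputs, the features are min-max scaled first as a workaround (the test is really intended for counts or categorical indicators).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption): 6 features, the first 3 informative.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical feature names

# Method 4: absolute Pearson correlation between each feature and the target.
abs_corr = np.array(
    [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
)

# Method 7: PCA on standardized features. explained_variance_ratio_ gives the
# share of variance per component; the absolute loadings show how strongly
# each original feature contributes to each component.
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
explained = pca.explained_variance_ratio_
loadings = np.abs(pca.components_)  # shape: (2 components, 6 features)

# Method 8: one-way ANOVA F-test of each feature against the class labels.
f_scores, f_pvalues = f_classif(X, y)

# Method 9: chi-squared test. Inputs must be non-negative, so the features
# are rescaled to [0, 1] first (a workaround for continuous data).
X_nonneg = MinMaxScaler().fit_transform(X)
chi_scores, chi_pvalues = chi2(X_nonneg, y)

print(f"{'feature':<8}{'|corr|':>8}{'F':>10}{'F p':>10}{'chi2':>8}{'chi2 p':>9}")
for i, name in enumerate(names):
    print(f"{name:<8}{abs_corr[i]:>8.3f}{f_scores[i]:>10.2f}{f_pvalues[i]:>10.3g}"
          f"{chi_scores[i]:>8.2f}{chi_pvalues[i]:>9.3g}")
print("PCA explained variance ratio:", np.round(explained, 3))
```

Higher correlation, F, and chi-squared scores paired with small p-values point to a stronger relationship with the target. PCA is different in kind: it describes the variance structure of the features without looking at the target, so its loadings are a description of the data rather than a ranking of predictive value.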