# Machine Learning Hands-On Examples
> [name=Claire Weng]

## Example 1
> References
> [IBM HR Analytics💼Employee Attrition & Performance](https://www.kaggle.com/code/faressayah/ibm-hr-analytics-employee-attrition-performance/notebook)
> [Colab notebook](https://colab.research.google.com/drive/1KqF0C7JC78Jz9RA_z_LZf2T0WrJvn5E4?usp=sharing)

---

## Experiment Steps

### 📌 Environment Setup (Google Colab)
Run the following cell in Google Colab to import the required Python packages:

```python
from google.colab import drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')
```

### 📌 1. Download and Load the IBM HR Analytics Employee Attrition & Performance Dataset
This project uses the IBM HR Analytics Employee Attrition & Performance dataset from Kaggle.
* Download the dataset: [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/code/faressayah/ibm-hr-analytics-employee-attrition-performance/notebook)
* Upload the CSV so Colab can read it (the path below points to a `HackMD` folder in Google Drive), then run the following code:

:::success
==Mount Google Drive==
<style>
.green {color: green;}
</style>
<style>
.orange {color: orange;}
</style>
<span class="green">from google.colab import drive</span>
<span class="green">drive.mount('/content/drive')</span>
:::

```python
# Load the data
df = pd.read_csv("/content/drive/MyDrive/HackMD/WA_Fn-UseC_-HR-Employee-Attrition.csv")
print("First five rows:")
print(df.head())

# Basic exploration
print("Dataset info:")
df.info()  # info() prints directly and returns None
print("Descriptive statistics:")
print(df.describe())

# Visualize the target distribution
plt.figure(figsize=(8, 4))
sns.countplot(x='Attrition', data=df, palette='coolwarm')
plt.title('Employee Attrition Distribution')
plt.show()
```

### 📌 2. Data Preprocessing (Feature Engineering)

```python
print("Missing-value check:")
print(df.isnull().sum())

# Drop columns that carry no information (IDs and constants)
df.drop(columns=['EmployeeNumber', 'Over18', 'StandardHours', 'EmployeeCount'], inplace=True)

# Encode the target ('Attrition' becomes 0/1)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

# Collect the categorical features for one-hot encoding
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Attrition is already numeric, so only remove it if it is actually in the list
if 'Attrition' in categorical_cols:
    categorical_cols.remove('Attrition')

# One-hot encode; `sparse_output=False` replaces the old `sparse=False`
encoder = OneHotEncoder(sparse_output=False, drop='first')
# Reuse df's index so the later concat aligns row by row
df_encoded = pd.DataFrame(encoder.fit_transform(df[categorical_cols]), index=df.index)

# Name the new columns
df_encoded.columns = encoder.get_feature_names_out(categorical_cols)

# Replace the original categorical columns with the encoded ones
df.drop(columns=categorical_cols, inplace=True)
df = pd.concat([df, df_encoded], axis=1)

# Separate features and target
X = df.drop('Attrition', axis=1)
y = df['Attrition']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

### 📌 3. Train/Test Split

```python
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
```

### 📌 4. Model Training

```python
rf_model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

gb_model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42)
gb_model.fit(X_train, y_train)
```

### 📌 5. Model Evaluation

```python
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

# Accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f'Random Forest accuracy: {accuracy_rf:.2f}')
print(f'Gradient Boosting accuracy: {accuracy_gb:.2f}')

# Classification reports
print('Random Forest classification report:\n', classification_report(y_test, y_pred_rf))
print('Gradient Boosting classification report:\n', classification_report(y_test, y_pred_gb))

# Confusion matrices
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.subplot(1,2,2)
sns.heatmap(confusion_matrix(y_test, y_pred_gb), annot=True, fmt='d', cmap='Oranges')
plt.title("Gradient Boosting Confusion Matrix")
plt.show()

# ROC-AUC analysis
y_prob_rf = rf_model.predict_proba(X_test)[:,1]
y_prob_gb = gb_model.predict_proba(X_test)[:,1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
fpr_gb, tpr_gb, _ = roc_curve(y_test, y_prob_gb)

plt.figure(figsize=(8,6))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC={roc_auc_score(y_test, y_prob_rf):.2f})')
plt.plot(fpr_gb, tpr_gb, label=f'Gradient Boosting (AUC={roc_auc_score(y_test, y_prob_gb):.2f})')
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```

### 📌 6. Feature Importance Analysis

```python
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': rf_model.feature_importances_})
feature_importances.sort_values(by='Importance', ascending=False, inplace=True)

plt.figure(figsize=(12,6))
sns.barplot(x='Importance', y='Feature', data=feature_importances[:10])
plt.title('Top 10 Most Important Features')
plt.show()
```

## Troubleshooting (Problems and Solutions)
The following problems came up while running the data preprocessing step. Each is listed with its fix.

---

### ❌ Problem 1: The Attrition column is not of type object
### 📌 Description
By the time the categorical columns are collected, Attrition has already been mapped to numbers (0/1), so it is no longer of type object. `categorical_cols.remove('Attrition')` therefore raises a `ValueError`, because Attrition is not in `categorical_cols`.
### ✅ Solution
1. **Inspect `categorical_cols`**
   Run `print(categorical_cols)` first to check whether Attrition is really in the list.
2. **Guard the removal with `if 'Attrition' in categorical_cols:`**
   This prevents `remove` from raising when the element is absent.

```python
# Collect the categorical features for one-hot encoding
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Only remove Attrition if it is actually in the list
if 'Attrition' in categorical_cols:
    categorical_cols.remove('Attrition')

# One-hot encode (this still uses the old `sparse` argument; see Problem 2)
encoder = OneHotEncoder(sparse=False, drop='first')
df_encoded = pd.DataFrame(encoder.fit_transform(df[categorical_cols]))

# Name the new columns
df_encoded.columns = encoder.get_feature_names_out(categorical_cols)

# Replace the original categorical columns with the encoded ones
df.drop(columns=categorical_cols, inplace=True)
df = pd.concat([df, df_encoded], axis=1)
```

### Re-running still produces an error
### ❌ Problem 2: The `sparse` parameter was removed in newer scikit-learn
### 📌 Description
`OneHotEncoder(sparse=False, drop='first')` uses the `sparse` parameter, which newer scikit-learn releases no longer accept (it was deprecated in 1.2 and removed in 1.4). Use `sparse_output` instead.
### ✅ Solution

```python
# Updated OneHotEncoder call: `sparse=False` becomes `sparse_output=False`
encoder = OneHotEncoder(sparse_output=False, drop='first')
df_encoded = pd.DataFrame(encoder.fit_transform(df[categorical_cols]))

# Name the new columns
df_encoded.columns = encoder.get_feature_names_out(categorical_cols)

# Replace the original categorical columns with the encoded ones
df.drop(columns=categorical_cols, inplace=True)
df = pd.concat([df, df_encoded], axis=1)
```

### 📌 Consolidated preprocessing code

```python
print("Missing-value check:")
print(df.isnull().sum())

# Drop columns that carry no information (IDs and constants)
df.drop(columns=['EmployeeNumber', 'Over18', 'StandardHours', 'EmployeeCount'], inplace=True)

# Encode the target ('Attrition' becomes 0/1)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

# Collect the categorical features for one-hot encoding
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Only remove Attrition if it is actually in the list
if 'Attrition' in categorical_cols:
    categorical_cols.remove('Attrition')

# One-hot encode with the current parameter name
encoder = OneHotEncoder(sparse_output=False, drop='first')
df_encoded = pd.DataFrame(encoder.fit_transform(df[categorical_cols]), index=df.index)

# Name the new columns
df_encoded.columns = encoder.get_feature_names_out(categorical_cols)

# Replace the original categorical columns with the encoded ones
df.drop(columns=categorical_cols, inplace=True)
df = pd.concat([df, df_encoded], axis=1)
```

---

## Example 2
> References
> [Airbnb Analysis, Visualization and Prediction](https://www.kaggle.com/code/chirag9073/airbnb-analysis-visualization-and-prediction/notebook)
> [Colab notebook](https://colab.research.google.com/drive/1PkiQUOiHv5CF_rDO-7dIJFgzWcNSHQRp?usp=sharing)

---

## Experiment Steps

### 📌 Environment Setup (Google Colab)
Run the following cell in Google Colab to import the required Python packages:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
```

### 📌 1. Download and Load the Airbnb Analysis, Visualization and Prediction Dataset
This project uses the dataset from the Kaggle notebook "Airbnb Analysis, Visualization and Prediction" (the file `AB_NYC_2019.csv`).
* Download the dataset: [Airbnb Analysis, Visualization and Prediction](https://www.kaggle.com/code/chirag9073/airbnb-analysis-visualization-and-prediction/notebook)
* Upload the CSV so Colab can read it (the path below points to a `HackMD` folder in Google Drive), then run the following code:

:::success
==Mount Google Drive==
<style>
.green {color: green;}
</style>
<style>
.orange {color: orange;}
</style>
<span class="green">from google.colab import drive</span>
<span class="green">drive.mount('/content/drive')</span>
:::

```python
# Load the data
df = pd.read_csv('/content/drive/MyDrive/HackMD/AB_NYC_2019.csv')

# Exploratory data analysis (EDA)
print("Dataset info:")
df.info()  # info() prints directly and returns None
print("\nFirst 5 rows:")
print(df.head())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Plot the price distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=50, kde=True)
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.xlim(0, 500)  # clip the axis so extreme values do not hide the bulk of the data
plt.show()
```

### 📌 2. Data Preprocessing (Feature Engineering)

```python
# Remove listings priced at 0 or unusually high
# Part of data cleaning: keep the price range reasonable
df = df[(df['price'] > 0) & (df['price'] < 500)]

# Fill missing values (median for the numeric column)
# Ensures `reviews_per_month` contains no NaN
df['reviews_per_month'] = df['reviews_per_month'].fillna(df['reviews_per_month'].median())

# Convert categorical variables to numeric (one-hot encoding)
# `neighbourhood_group` and `room_type` are categorical and must be encoded for the model
df = pd.get_dummies(df, columns=['neighbourhood_group', 'room_type'], drop_first=True)

# Select features and target
# A few numeric features plus the one-hot columns created above
features = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
            'reviews_per_month', 'calculated_host_listings_count', 'availability_365'] + \
           [col for col in df.columns if 'neighbourhood_group_' in col or 'room_type_' in col]

X = df[features]  # feature matrix
y = df['price']   # target variable
```

### 📌 3. Train/Test Split

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### 📌 4. Model Training and Evaluation

```python
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean squared error (MSE): {mse}")
print(f"R-squared (R²): {r2}")
```

### 📌 5. Feature Importance Analysis

```python
feature_importances = pd.DataFrame({
    'Feature': features,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(10))
plt.title('Top 10 Feature Importances')
plt.show()
```

## Troubleshooting (Problems and Solutions)

---

### ❌ Problem 1: Unhandled missing values were only noticed after training and evaluation
### 📌 Description
**1. How missing values affect the model**
* Effect on training: whether an estimator accepts NaN depends on the library version (RandomForestRegressor, for example, only gained native support in scikit-learn 1.4). Even when training runs, missing values can distort the patterns the model learns and make predictions unstable.
* Effect on testing: many models cannot handle NaN directly, so prediction may raise an error, or some rows may be silently excluded, which undermines the credibility of the results.

**2. How missing values affect the metrics**
* Mean squared error (MSE) can be over- or underestimated: unhandled missing values can make some rows behave abnormally and inflate MSE, or rows may be excluded at test time and make MSE look too low.
* R² can be misleading: if missing values change the variability of the data, the model's apparent explanatory power (R²) is biased.

### ✅ Solution
1. **Confirm how far the problem reaches**
   * Check `df.isnull().sum()` and make sure X_train and X_test contain no missing values at all.
   * Check `y_train.isnull().sum()` and make sure the target contains no NaN.
2. **Re-evaluate the model: if missing values are found, fill or drop them first, then retrain.**

**If the training or test data still contain missing values, you can:**
* Numeric features: fill with the median or mean (`df.fillna(df.median(numeric_only=True))`).
* Categorical features: fill with "Unknown" or the mode (`df.fillna("Unknown")`).
* Drop samples with too many missing values (`df.dropna()`).

```python
# Missing-value summary
# Count the missing values in each column to see which ones need handling
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

# Fill 'name' and 'host_name'
# These are string columns; use 'Unknown' so NaN does not affect the analysis
df['name'] = df['name'].fillna('Unknown')
df['host_name'] = df['host_name'].fillna('Unknown')

# Fill 'last_review'
# A date column; 'No Review' is a placeholder meaning the listing has no reviews
# (converting to a specific date or dropping the row are alternatives)
df['last_review'] = df['last_review'].fillna('No Review')

# Fill 'reviews_per_month'
# A numeric column; 0 means the listing has never been reviewed
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Confirm the result
# Recount to make sure every column that needed filling has been handled
missing_values_after = df.isnull().sum()
print("\nMissing values after handling:")
print(missing_values_after)
```

### ❌ Problem 2: The initial model's score needs improvement
### 📌 Description
The project started with a Random Forest Regressor for price prediction, with only basic data cleaning and feature engineering. After the first training run, it was clear the model still had a lot of room for improvement.
### ✅ Solution
**1. Model selection**
The first version used **RandomForestRegressor**. It was replaced with **GradientBoostingRegressor**, with **GridSearchCV** added for hyperparameter tuning.

**2. Feature engineering**
The first version used only **latitude** and **longitude**. A new feature, **distance_to_manhattan** (**the distance from the center of Manhattan**), was added to strengthen the spatial signal.

```python
# Add the distance feature
# Note: calculate_distance is a helper whose definition is not shown in these notes
df['distance_to_manhattan'] = calculate_distance(df['latitude'], df['longitude'])

# Select the feature columns; the target remains the price
features = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
            'reviews_per_month', 'calculated_host_listings_count',
            'availability_365', 'distance_to_manhattan'] + \
           [col for col in df.columns if 'neighbourhood_group_' in col or 'room_type_' in col]
```

**3. Data preprocessing**
* **Standardization (StandardScaler):**
  * The features are standardized with `StandardScaler()`, a step the original version lacked. Tree-based models such as **Gradient Boosting** are largely insensitive to feature scale, so the gain from this step alone is likely small.

```python
from sklearn.preprocessing import StandardScaler

# Standardize the features so variables with different ranges share one scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

**4. Hyperparameter tuning**
* The original **RandomForestRegressor**(n_estimators=100, random_state=42) used fixed parameters.
* **GridSearchCV** now runs a **hyperparameter search** for **GradientBoostingRegressor**, testing combinations of n_estimators, learning_rate, max_depth, and subsample to improve performance.

**5. Model evaluation**
* **Searching for the best parameters with GridSearchCV** may have improved the model's accuracy.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Hyperparameter search space for the Gradient Boosting Regressor
param_grid = {
    'n_estimators': [100, 200],     # number of trees
    'learning_rate': [0.05, 0.1],   # how much each tree contributes to the final prediction
    'max_depth': [3, 5],            # maximum tree depth, controls model complexity
    'subsample': [0.8, 1.0]         # fraction of samples used to fit each tree
}

# Initialize the Gradient Boosting model
model = GradientBoostingRegressor(random_state=42)

# Grid search with 5-fold cross-validation, scored by R² (coefficient of determination)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)

# Best model (best parameter combination)
best_model = grid_search.best_estimator_
```

### ❌ Problem 3: After switching to GradientBoostingRegressor, the score still needs improvement
### 📌 Description
After switching to GradientBoostingRegressor for price prediction, there was still room to optimize the model.
### ✅ Solution
### The main optimizations and improvements are:
### 1. Data cleaning and preprocessing
* **Second version:** handled price outliers (kept prices greater than 0 and less than 500), filled the missing values in reviews_per_month, and applied one-hot encoding.
* **Final version:** on top of the second version's cleaning, adds further steps:
  * Drops rows with a missing name or host_name, which reduces the influence of uninformative records.
  * **Restricts minimum_nights further, keeping only listings with fewer than 30 nights, which removes implausible booking records.**

```python
# Handle missing values
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df.dropna(subset=['name', 'host_name'], inplace=True)

# Remove outliers
df = df[(df['price'] > 0) & (df['price'] < 500)]
df = df[df['minimum_nights'] < 30]
```

### 2. Feature engineering
* **Second version:** computed the distance to the center of Manhattan and one-hot encoded the categorical variables.
* **Final version:** in addition to the distance, adds:
  * A **lat_lon** feature, the product of latitude and longitude, which may help capture interaction effects between the two coordinates.

```python
# Add a new feature combining latitude and longitude
df['lat_lon'] = df['latitude'] * df['longitude']
```

### 3. Data transformation
* **Second version:** standardized the features but did not transform the target.
* **Final version:** besides standardizing the features, uses **QuantileTransformer** to map the target y to a normal distribution, which helps the stability and predictive power of the model.

```python
from sklearn.preprocessing import QuantileTransformer

# Transform the target with QuantileTransformer
quantile_transformer = QuantileTransformer(output_distribution='normal', random_state=42)
y_transformed = quantile_transformer.fit_transform(y.values.reshape(-1, 1)).flatten()
```

### 4. Model training and tuning
* **Second version:** used a simple GridSearchCV over a small grid (n_estimators, learning_rate, max_depth, and subsample).
* **Final version:** also uses GridSearchCV, but searches a different region of the grid (more trees, lower learning rates, smaller subsample fractions) and **adds an explicit shuffled KFold cross-validation** to improve stability and generalization.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Search space for GradientBoostingRegressor
param_grid = {
    'n_estimators': [300, 400],
    'learning_rate': [0.03, 0.05],
    'max_depth': [4, 5],
    'subsample': [0.7, 0.8],
    'random_state': [42]
}

gbr = GradientBoostingRegressor()

# Cross-validation with shuffled KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning with GridSearchCV
grid_search = GridSearchCV(estimator=gbr, param_grid=param_grid, scoring='r2', cv=kf, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Show the best parameters
print("Best parameters found: ", grid_search.best_params_)
```

### 5. Model evaluation
* **Second version:** evaluated R-squared and MSE and ran a basic feature importance analysis.
* **Final version:** also evaluates R-squared and MSE, **and inverse-transforms the predictions** back to the original price scale so the results are easier to interpret.

```python
# Evaluate the model
best_model = grid_search.best_estimator_
y_pred_transformed = best_model.predict(X_test)

# Inverse-transform the predictions back to prices
y_pred = quantile_transformer.inverse_transform(y_pred_transformed.reshape(-1, 1)).flatten()

# R-squared and MSE
# y_test here is in the transformed space (the split was made on y_transformed),
# so both metrics below are on the transformed scale, not in price units
r2 = r2_score(y_test, y_pred_transformed)
mse = mean_squared_error(y_test, y_pred_transformed)

print(f"R-squared: {r2}")
print(f"Mean Squared Error: {mse}")
```

### 6. Visualization
* **Second version:** no visualization of the results.
* **Final version:** **adds a scatter plot of actual versus predicted prices**, which gives an intuitive view of model performance.

```python
# Scatter plot of predicted vs actual values
# np.asarray keeps this working whether y_test is a NumPy array or a pandas Series
y_test_price = quantile_transformer.inverse_transform(np.asarray(y_test).reshape(-1, 1)).flatten()

plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test_price, y=y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted Prices")
plt.show()
```

---

## Example 3
> References
> [Credit Card Default: a very pedagogical notebook](https://www.kaggle.com/code/lucabasa/credit-card-default-a-very-pedagogical-notebook/notebook)
> [Colab notebook](https://colab.research.google.com/drive/1y3E7_UA5iF0XO0kd4i3Qk4Dg5BPg-wPs?usp=sharing)

---

## Experiment Steps

### 📌 Environment Setup (Google Colab)
Run the following cell in Google Colab to import the required Python packages:

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix, RocCurveDisplay
import shap
import joblib
```

### 📌 1. Download and Load the Credit Card Default: a very pedagogical notebook Dataset
This project uses the dataset from the Kaggle notebook "Credit Card Default: a very pedagogical notebook" (the file `UCI_Credit_Card.csv`).
* Download the dataset: [Credit Card Default: a very pedagogical notebook](https://www.kaggle.com/code/lucabasa/credit-card-default-a-very-pedagogical-notebook/notebook)
* Upload the CSV so Colab can read it (the path below points to a `HackMD` folder in Google Drive), then run the following code:

:::success
==Mount Google Drive==
<style>
.green {color: green;}
</style>
<style>
.orange {color: orange;}
</style>
<span class="green">from google.colab import drive</span>
<span class="green">drive.mount('/content/drive')</span>
:::

```python
# Install the Noto Sans CJK font (run once; only needed if plot labels contain CJK text)
!apt-get -y install fonts-noto-cjk

# Register and select the CJK font
font_path = '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc'
fm.fontManager.addfont(font_path)  # register the file so matplotlib can find it by name
font_prop = fm.FontProperties(fname=font_path)
plt.rcParams['font.family'] = font_prop.get_name()
plt.rcParams['axes.unicode_minus'] = False

# Load the data
df = pd.read_csv('/content/drive/MyDrive/HackMD/UCI_Credit_Card.csv')

# Data overview
print("Number of columns and their names:")
print(len(df.columns), df.columns.tolist())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())
print("\nDuplicate rows:")
print(df.duplicated().sum())

# Basic statistics
print("Rows and columns:", df.shape)
print("\nFirst five rows:")
print(df.head())
print("\nDescriptive statistics:")
print(df.describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99]))

# Distribution of the target class
plt.figure(figsize=(8,5))
sns.countplot(x='default.payment.next.month', data=df)
plt.title('Target Distribution (Default or Not)', fontproperties=font_prop)
plt.xlabel('Default (1 = yes, 0 = no)', fontproperties=font_prop)
plt.ylabel('Count', fontproperties=font_prop)
plt.show()

# Feature correlation analysis
corr_matrix = df.corr()
plt.figure(figsize=(18,15))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Feature Correlation Matrix', fontproperties=font_prop)
plt.show()
```

### 📌 2. Data Preprocessing

```python
# Treat the special codes (-1, -2) as NaN, then impute
def handle_missing(df):
    pay_cols = ['PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']
    bill_cols = [f'BILL_AMT{i}' for i in range(1,7)]

    for col in pay_cols:
        df[col] = df[col].replace([-2, -1], np.nan)
        # Assign the result back; column-level inplace fillna is unreliable in recent pandas
        df[col] = df[col].fillna(df[col].mode()[0])

    for col in bill_cols:
        df[col] = df[col].replace(-2, np.nan)
        df[col] = df[col].fillna(df[col].median())

    return df

df = handle_missing(df)
```

### 📌 3. Feature Engineering: Categorical Encoding and Feature Scaling

```python
# Binning
df['AGE_BIN'] = pd.cut(df['AGE'], bins=[20, 30, 40, 50, 60, 100])

# One-hot encoding
X = pd.get_dummies(df.drop(['ID', 'default.payment.next.month'], axis=1),
                   columns=['SEX', 'EDUCATION', 'MARRIAGE', 'AGE_BIN'], drop_first=True)
feature_names = X.columns.tolist()
y = df['default.payment.next.month']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the numeric values (fit on the training set only)
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

### 📌 4. Feature Selection and Model Training (Stacking)

```python
# Feature selection
selector = RFECV(
    estimator=XGBClassifier(tree_method='hist', device='cpu', eval_metric='auc'),
    step=10,
    cv=5,
    scoring='roc_auc',
    min_features_to_select=20
)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Stacking
# use_label_encoder is dropped: XGBoost versions that accept `device` no longer use it
base_models = [
    ('xgb', XGBClassifier(tree_method='hist', device='cpu')),
    ('lgbm', LGBMClassifier(device='cpu'))
]
meta_model = XGBClassifier(tree_method='hist', device='cpu')

stack_model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    stack_method='predict_proba'
)

# Grid search
param_grid = {
    'xgb__learning_rate': [0.05, 0.1],
    'xgb__max_depth': [3, 5],
    'final_estimator__n_estimators': [50, 100]
}

grid_search = GridSearchCV(stack_model, param_grid, cv=3, scoring='roc_auc', n_jobs=-1, verbose=1)
grid_search.fit(X_train_sel, y_train)
```

### 📌 5. Model Evaluation and Visual Report

```python
# Pick the best model
best_model = grid_search.best_estimator_

# Predict
y_pred = best_model.predict(X_test_sel)
y_proba = best_model.predict_proba(X_test_sel)[:,1]

# Metrics
# AUC
print("AUC:", roc_auc_score(y_test, y_proba))
# Confusion matrix
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
# Classification report
print("\nClassification report:\n", classification_report(y_test, y_pred))

# ROC curve
RocCurveDisplay.from_estimator(best_model, X_test_sel, y_test)
plt.title("ROC Curve")
plt.show()
```

### 📌 6. Model Explanation with SHAP

```python
# Explainer and visualization
explainer = shap.TreeExplainer(best_model.named_estimators_['xgb'])
shap_values = explainer.shap_values(X_test_sel)
shap.summary_plot(shap_values, X_test_sel, feature_names=selector.get_feature_names_out())
```

### 📌 7. Saving the Model and Designing a Deployment Helper

```python
# Save the model together with its preprocessing objects
joblib.dump({
    'model': best_model,
    'scaler': scaler,
    'selector': selector
}, 'credit_model.pkl')

# Inference class
class CreditDefaultPredictor:
    def __init__(self, model_path):
        artifacts = joblib.load(model_path)
        self.model = artifacts['model']
        self.scaler = artifacts['scaler']
        self.selector = artifacts['selector']

    def predict(self, X_new):
        # Apply the same scaling and feature selection used in training
        X_new_scaled = self.scaler.transform(X_new)
        X_new_selected = self.selector.transform(X_new_scaled)
        return self.model.predict_proba(X_new_selected)
```

## Troubleshooting (Problems and Solutions)

---

### ❌ Problem 1: `best_model` is undefined in the SHAP step
### 📌 Description
The `best_model` variable was not defined when the SHAP cell ran. What was intended is `grid_search.best_estimator_`.

### ❌ Problem 2: `selector.get_feature_names_out()` sometimes fails
### 📌 Description
`selector.get_feature_names_out()` sometimes fails or returns unhelpful names, because the RFECV object was fitted on the scaled NumPy array and does not keep the original column names. The names have to be looked up from `X.columns` using the selector's support mask.

### ✅ Solution
#### A correct and stable version:

```python
# Take the first-layer XGBoost model out of the best stacking model
best_model = grid_search.best_estimator_
xgb_model = best_model.named_estimators_['xgb']

# Build the SHAP explainer (suited to tree-based models)
explainer = shap.Explainer(xgb_model, feature_names=X.columns[selector.get_support()])

# Compute the SHAP values
shap_values = explainer(X_test_sel)

# summary_plot (features ordered by contribution)
shap.summary_plot(shap_values, features=X_test_sel, feature_names=X.columns[selector.get_support()])
```
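
The fix above depends on `selector.get_support()` returning one boolean per original column, in the same order as `X.columns`. The lookup itself is plain boolean masking. Below is a minimal, dependency-free sketch of that step. The column names and the mask are illustrative stand-ins chosen for this example, not the actual selection produced by RFECV, and `selected_feature_names` is a hypothetical helper rather than part of scikit-learn or SHAP.

```python
# Illustrative stand-in for X.columns (a small, made-up subset of columns).
original_columns = ["LIMIT_BAL", "AGE", "PAY_0", "BILL_AMT1", "PAY_AMT1"]

# Illustrative stand-in for selector.get_support():
# one boolean per original column, True where the selector kept the feature.
support_mask = [True, False, True, False, True]


def selected_feature_names(columns, mask):
    """Return the names of the columns whose mask entry is True, in order.

    Hypothetical helper: it mirrors what X.columns[selector.get_support()]
    does, without needing pandas or scikit-learn.
    """
    if len(columns) != len(mask):
        raise ValueError("mask must have exactly one entry per column")
    return [name for name, keep in zip(columns, mask) if keep]


selected = selected_feature_names(original_columns, support_mask)
print(selected)  # ['LIMIT_BAL', 'PAY_0', 'PAY_AMT1']
```

The length check is the part worth keeping in real code: if the mask and the column list come from different versions of the feature matrix (for example, before and after one-hot encoding), the names would silently be attached to the wrong SHAP values.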