【Kaggle - Intro to Machine Learning_Decision Tree】

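# **【Kaggle - Intro to Machine Learning_Decision Tree/Random Forest Regression】**

:::info
- How Models Work
- What Is a Decision Tree?
- Basic Data Exploration
- Your First Machine Learning Model
- Model Validation
- Underfitting and Overfitting
- What Is Random Forest Regression?
:::

Source [Kaggle](https://www.kaggle.com/learn/intro-to-machine-learning)

![Screenshot 2025-06-25 21.50.35](https://hackmd.io/_uploads/HJonC_FVel.png)

## What Is a Model?

Suppose I want to know what real estate in an area is worth. How should I go about it?

Most people would judge from registered sale prices and past experience, for example how much houses of the same size in the same location sold for before.

In machine learning, a "model" plays the role of that judgment. It learns patterns from a large amount of historical data (location, building age, floor, floor area, number of rooms, building type such as high-rise, mid-rise or walk-up apartment, distance to stations and schools, and so on) and then uses those patterns to make predictions on new data.

The learning process is called "fitting" or "training" the model, and the data used to train it is called the "training data".

This Kaggle course starts with the Decision Tree.

## What Is a Decision Tree?

Imagine playing a game of "10 questions". We answer a series of questions: answer A goes left, answer B goes right. Each answer leads to the next question until we arrive at a result.

[Image from the web](https://www.smartdraw.com/decision-tree/)

![Screenshot 2025-06-25 22.08.19](https://hackmd.io/_uploads/HyD17tYNge.png)

```
Root Node: the top of the tree and the starting point for all data; it asks the first question used to split the data
Branches: the lines extending from each node, representing the different answers or outcomes of a question
Internal Node: a branching point below the root; each internal node represents a "decision point" or "question"
Leaf Node: also called a "terminal node"; when data reaches a leaf node, the final classification result or predicted value has been reached
```

When we have a new house to value, we send it down the decision tree from the root node and answer the question at each node based on its features:

1. Does the house have more than two bedrooms? If no, go left; if yes, go right.
2. (Left) Is the lot larger than 8,500 square feet? If no, go left and reach the result 146k; if yes, go right and reach the result 188k.
3. (Right) Is the lot larger than 11,500 square feet? If no, go left and reach the result 170k; if yes, go right and reach the result 233k.

![Screenshot 2025-06-25 22.10.58](https://hackmd.io/_uploads/SkDtXFt4lg.png)

## Basic Data Exploration

Dataset used: [Melbourne Housing Snapshot](https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot)

![Screenshot 2025-06-25 23.05.56](https://hackmd.io/_uploads/SkOqlcFExg.png)

Check the data types:

```python
# info() must be called; without the parentheses only the method object is shown
melbourne_data.info()
```

![Screenshot 2025-06-25 23.18.31](https://hackmd.io/_uploads/ryGwXctVle.png)

All features (columns):

```python
melbourne_data.columns
```

![Screenshot 2025-06-25 23.11.42](https://hackmd.io/_uploads/SJk6W5FVll.png)

```python
# Count the values in every column, including missing ones (dropna=False).
# melbourne_data_ori is assumed to be the DataFrame as originally loaded.
for c in melbourne_data_ori.columns:
    print(c, ":", melbourne_data_ori[c].value_counts(dropna=False))
    print()
```

![Screenshot 2025-06-25 23.23.23](https://hackmd.io/_uploads/B1xYN9KElx.png)

```
'Suburb': suburb or area name, Categorical
'Address': address, Textual
'Rooms': number of rooms, Numerical
'Type': property type, h: House, u: Unit (apartment), t: Townhouse, Categorical
'Price': price, Numerical
'Method': sale method, S: Sold, SA: Sold After auction, PI: Passed In (did not sell at auction), VB: Vendor Bid (bid by the seller, common at auctions), SP: Sold Prior (sold before the auction), Categorical
'SellerG': selling agency, Categorical
'Date': date the property was sold, Date
'Distance': distance to the city centre, Numerical
'Postcode': postcode, Numerical
'Bedroom2': number of bedrooms, Numerical
'Bathroom': number of bathrooms, Numerical
'Car': number of car spaces, Numerical
'Landsize': land size, Numerical
'BuildingArea': building area, Numerical
'YearBuilt': year built, Categorical
'CouncilArea': governing council, Categorical
'Lattitude': latitude, Numerical
'Longtitude': longitude, Numerical
'Regionname': region name, Categorical
'Propertycount': number of properties, Numerical
```

To handle missing values, we use the simplest approach here and drop them. In practice you might instead fill them with the mean of a group, with 0, and so on.

![Screenshot 2025-06-25 23.06.00](https://hackmd.io/_uploads/rJmcx5tVle.png)

## Your First Machine Learning Model

First set y, the prediction target:

```python
y = melbourne_data.Price
```

Then set X, the features used for training:

```python
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
```

Model:

```python
from sklearn.tree import DecisionTreeRegressor

# random_state fixes the randomness so the result is reproducible
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)
```

```python
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
```

![Screenshot 2025-06-25 23.38.49](https://hackmd.io/_uploads/S1FzOcY4lg.png)

## Model Validation

The model above has a problem: if we train on the data and then test on the very same data, the evaluation cannot be trusted. (The error comes out as 0.)

![Screenshot 2025-06-27 175030](https://hackmd.io/_uploads/SyLtF12Vge.png)

So we split the data into a training set and a validation set. The training set is used for training and the validation set for prediction. The predictions are then compared with the true values to compute the error.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Fit a fresh model on the training set only
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

# Predict on the training set and the validation set
train_predictions = melbourne_model.predict(train_X)
val_predictions = melbourne_model.predict(val_X)

# MAE
train_mae = mean_absolute_error(train_y, train_predictions)
val_mae = mean_absolute_error(val_y, val_predictions)

print(f"MAE training error: {train_mae:.2f}")
print(f"MAE validation error: {val_mae:.2f}")
```

![Screenshot 2025-06-27 181604](https://hackmd.io/_uploads/H1a_Jx3Vgl.png)

## Underfitting and Overfitting

Underfitting: the model is too simple to learn the structure in the training data → both training error and validation error are high

Overfitting: the model is too complex and learns the noise in the training data → training error is very small, but validation error is high

Our previous model overfits, so we can limit the number of leaf nodes, which defaults to None. [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

Use max_leaf_nodes (the number of leaf nodes) to control model complexity:

max_leaf_nodes small → the model is heavily constrained and learns too little → prone to underfitting

max_leaf_nodes large → the model has a lot of freedom and learns too much → prone to overfitting

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    # Predict on both splits
    preds_train = model.predict(train_X)
    preds_val = model.predict(val_X)
    # MAE on both splits
    train_mae = mean_absolute_error(train_y, preds_train)
    val_mae = mean_absolute_error(val_y, preds_val)
    return train_mae, val_mae
```

```python
for max_leaf_nodes in [5, 50, 500, 5000]:
    train_mae, val_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes:<5} \tTraining MAE: {train_mae:.2f} \tValidation MAE: {val_mae:.2f}")
```

![Screenshot 2025-06-27 182608](https://hackmd.io/_uploads/ryjRWg2Egl.png)

- 5: underfitting (the model is too simple)
- 50: somewhat better, but still underfitting
- 500: the model is more capable and starts to fit the structure of the data
- 5000: overfitting; excellent on the training set, but poor generalization

## What Is Random Forest Regression?

Here we try switching to RandomForestRegressor.

Random Forest Regression is an ensemble learning method, the regression version of the Random Forest algorithm. It combines many decision trees, each trained on a different random sample of the data.

How does it work?

1. Random sampling of the data (Bagging). It uses bootstrap sampling: draw from the training data at random with replacement. Each sample is the same size as the original data, but data points may repeat, and each sample is used to train one Decision Tree. With 1000 training rows, that means draw one row -> put it back -> draw another, repeated 1000 times.
2. Random selection of features (Feature Randomness). At each node split, only a random subset of the features is considered instead of all of them. This makes the trees differ from one another, which increases diversity and lowers the risk of overfitting.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)

# Predict on the training set and the validation set
train_preds = forest_model.predict(train_X)
melb_preds = forest_model.predict(val_X)

train_mae = mean_absolute_error(train_y, train_preds)
val_mae = mean_absolute_error(val_y, melb_preds)

print(f"MAE training error: {train_mae:.2f}")
print(f"MAE validation error: {val_mae:.2f}")
```

![Screenshot 2025-06-27 182659](https://hackmd.io/_uploads/S1DMGx3Exe.png)

Compared with the decision tree, the random forest does not have the lowest training error, but its validation MAE, that is, its generalization error, is smaller.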
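
The two sources of randomness described in the random forest steps (bootstrap sampling and random feature selection) can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it internally. The helper names `bootstrap_indices` and `random_feature_subset` are made up for this example, and the choice of 2 candidate features per split is an arbitrary assumption.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features to consider at one split."""
    return rng.sample(features, k)

rng = random.Random(0)  # fixed seed so the run is repeatable

# 1. Bagging: 1000 draws with replacement from 1000 rows
n_rows = 1000
sample = bootstrap_indices(n_rows, rng)
unique_fraction = len(set(sample)) / n_rows

# 2. Feature randomness: consider only some features at a split
# (k=2 is an arbitrary choice for illustration)
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
candidates = random_feature_subset(features, 2, rng)

print(f"rows drawn: {len(sample)}, distinct rows: {unique_fraction:.1%}")
print("features considered at this split:", candidates)
```

Because rows are put back after each draw, a bootstrap sample of this size contains on average about 63% of the distinct original rows, and the rest are repeats. In `RandomForestRegressor` the corresponding settings are the `bootstrap` and `max_features` parameters.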