# 應用 MATLAB 於模型預測與機器學習 >作者: 蔡承佑 ## Machine Learning :::spoiler Definition ![machine learning definition](https://hackmd.io/_uploads/SkTqz5gHA.png) ::: :::spoiler supervised learning workflow ![supervised machine learning workflow](https://hackmd.io/_uploads/Sk2Q4cgBC.png) - data - traing data: 用來訓練模型 - validation data: 用來驗證模型的正確性 - test data: 預測的資料 - find best model training options 1. choose a model 2. select features 3. tune parameters over fitting: fitting the noise means 用過於複雜的模型預測資料 ::: ## glossary of terms :::spoiler model an algorithm that predicts a response using a set of features. A model is trained on existing data and used to make predictions on new observations. ::: :::spoiler regression model a machine learning model that outputs a **continuous numeric response**. For example, predicting stock prices is a regression problem. ::: :::spoiler classification model a machine learning model that outputs a prediction from a **discrete set** of possible outcomes. For example, predicting if a medical image indicates healthy or cancer is a classifiction problem ::: :::spoiler refers to values used to create the model. Some model parameters are learned by the machine learning algorithm during training. Other parameters are set by the user prior to training. ::: :::spoiler hyperparameter parameters required by the model that are set by the user. Hyperparameters are not learned through model training but often determined through an optimization process. (不用再training但通常用在優化的過程) ::: :::spoiler training data data used to train a model. A final model is trained using the full training and validation data. ::: :::spoiler validation data data used to evaluate model performance during the training process. Validation data helps prevent choosing a model that overfits the training data (see overfitting below). A final model is trained using the full training and validation data. ::: :::spoiler test data data used to simulate new observations. Test data is split from a full data set early in the machine learning process and not used during preprocessing and model training steps. Test data is used to evaluate a final model. ::: :::spoiler resubstitution validation using the training data to evaluate a machine learning model. This approach provides no protection against overfitting because the same observations used to train the model are substituted into the model for calculating metrics. ::: :::spoiler overfitting a model that obtains high accuracy with the training data but does poorly with new data. This often happens because the model fits to random fluctuations in the training data. Validation data helps **prevent overfitting by using a subset of data** to evaluate model performance. ::: :::spoiler underfitting a model that is too simplistic to capture some trends in the data, resulting in large errors. For example, using a single sinusoid to model temperature may capture seasonal trends, but miss daily variations due to day and nighttime differences. ::: :::spoiler wide dataset a dataset where the number of features is similar to, or greater than the number of observations. ::: # Regression ## step1 資料處理 1. 資料: import a data file into MATLAB 2. 視覺化: visualize the data to look for relationships between variables 3. 處理: determine how the raw data should be cleaned before it is used to create predictive models ### Import data :::spoiler read data data = readtable("*path*") ::: :::spoiler read data (for checking if data is missing) ```matlab= files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv"; ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true); taxi = readall(ds); head(taxi) ``` ::: ### Visualizations :::spoiler 點狀圖 ```matlab= scatter(data.Distance, data.Fare) gscatter(data.Distance, data.Fare, data.RateCode) xlabel("Distance (mi.)") ylabel("Fare ($)") xlim([0 40]) ylim([0 200]) ``` >scatter點狀圖 gscatter 三個變數(第三個為categories) >xlabel ylabel 軸的名稱 >xlim ylim 圖表範圍 >![scattergram](https://hackmd.io/_uploads/r1dmGnlHC.png) ::: :::spoiler Histogram 1. 通常先看大範圍 ```matlab= histogram(data.Fare) xlabel("Fare ($)") ylabel("Occurances") ``` >![螢幕擷取畫面 2024-06-08 001847](https://hackmd.io/_uploads/ryeBm2lBA.png) 2. 接著再確定distanceBins ```matlab= fareBins = 0:1:60; histogram(data.Fare, fareBins) xlabel("Fare ($)") ylabel("Occurances") ``` >distanceBins 用來處理範圍 0到60相隔1 >![螢幕擷取畫面 2024-06-08 002156](https://hackmd.io/_uploads/SkYhXngrC.png) ::: :::spoiler boxplot ```matlab= boxplot(taxiC.Distance, "Orientation", "horizontal"); xlabel("Distance") ``` ::: :::spoiler Location ```matlab= geoscatter(data.PickupLat, data.PickupLon, '.', 'SizeData', 1) geolimits([40 41],[-75 -73]) ``` >拖拉地圖並update可以設置geolimits >![螢幕擷取畫面 2024-06-08 002957](https://hackmd.io/_uploads/r1DcSneSR.png) ::: ### clean data #### check if data missing :::spoiler ```matlab= files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv"; ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true); taxi = readall(ds); head(taxi) ``` ```matlab= numMissing = nnz(ismissing(taxi)) ``` >nnz 是 check id 不為 0 的 或不存在的 ::: <br/> #### 但資料missing並不代表他是合理的資料 - 就樣距離小於0就是不合理的 :::spoiler summary查看 ```matlab= summary(taxi) ``` ::: :::spoiler 幾%不合理的值 要<2% ```matlab= distErrPcnt = 100*nnz(taxi.Distance <= 0)/height(taxi) ``` ```matlab= locationCleanLoss = 100*( 1 - height(taxiC2)/height(taxiC)) ``` ::: :::spoiler 去除不合理值 (值) ```matlab= taxiC = taxi(taxi.Distance > 0, :); ``` ::: :::spoiler 查看%的值 prctile ```mtalab= pTilesDistance = prctile(taxiC.Distance, [0, 99.99]) ``` >remove top 0.01% data ::: :::spoiler 去除不合理值 (%數) rmoutliers ```mtalab= taxiC = rmoutliers(taxiC, "percentiles", [0, 99.99], "DataVariables", "Distance"); histogram(taxiC.Distance, 100); xlabel("Distance"); ``` >remove top 0.01% data ::: <br/> - 經緯度跨越到海上不合理 縮減經緯度 :::spoiler ```matlab= lat1 = 40; lat2 = 42; lon1 = -75; lon2 = -73; loc2keep = taxiC.PickupLat >= lat1 & taxiC.PickupLat <= lat2 & ... taxiC.DropoffLat >= lat1 & taxiC.DropoffLat <= lat2 & ... taxiC.PickupLon >= lon1 & taxiC.PickupLon <= lon2 & ... taxiC.DropoffLon >= lon1 & taxiC.DropoffLon <= lon2; taxiC2 = taxiC(loc2keep, :); ``` >loc2keep是決定的邏輯運算子 ::: :::spoiler visulization ```matlab= geoplot(taxiC2.PickupLat, taxiC2.PickupLon, "b.", "MarkerSize", 0.5); title("Pickups"); ``` ```matlab= geoplot(taxiC2.DropoffLat, taxiC2.DropoffLon, "r.", "MarkerSize", 0.5); title("Dropoffs"); ``` >可以看到比上面Location的圖更密集 >![螢幕擷取畫面 2024-06-09 135603](https://hackmd.io/_uploads/HJ5BVaGS0.png) ```matlab= histogram(taxiC2.DropoffLat, 100); hold on; histogram(taxiC2.PickupLat, 100); hold off; legend(["DropoffLat" "PickupLat"]); ``` >use hold to compare >![螢幕擷取畫面 2024-06-09 140124](https://hackmd.io/_uploads/rkXLHpzHC.png) ::: <br/> - time :::spoiler show time ```matlab= taxiC2.TimeOfDay = timeofday(taxiC2.PickupTime); histogram(taxiC2.TimeOfDay); xlabel("Pickup Time of Day"); ``` >![螢幕擷取畫面 2024-06-09 141248](https://hackmd.io/_uploads/ryLl_pfHA.png) ::: :::spoiler 如何知道需不需clean(看histogram的維度) ```matlab= taxiC2.TimeOfDay = hours(taxiC2.TimeOfDay); taxiC2.Duration = minutes(taxiC2.DropoffTime - taxiC2.PickupTime); histogram(taxiC2.Duration, 100); xlabel("Trip Duration (minutes)"); ``` >![螢幕擷取畫面 2024-06-09 142635](https://hackmd.io/_uploads/BkMEj6zHC.png) ::: :::spoiler proccess ![螢幕擷取畫面 2024-06-09 143559](https://hackmd.io/_uploads/rybOTpGH0.png) ::: :::spoiler summary 1. 處理資料 (可以用舊的值生成新的值) 2. 用圖表視覺化(histogram) 3. 如果維度過大就要clean 4. 去除不可能的值 5. 用prctile計算邊界 6. 再用rmoutliers去除邊界 7. 用histogram再次查看合理的值 ::: <br/> ##### 可用histogram或boxplot察看結果 ## Step2 model ![螢幕擷取畫面 2024-06-09 155655](https://hackmd.io/_uploads/rkCLl17SR.png) ### Linear Regression :::spoiler ![linear regression](https://hackmd.io/_uploads/rJM060MH0.png) ::: :::spoiler MSE ![MSE](https://hackmd.io/_uploads/SyfdARfHA.png) ::: :::spoiler advantage ![advantage of Simple Linear](https://hackmd.io/_uploads/SyIzyJ7HR.png) ::: :::spoiler polynomial function 次方可以是為新的變數,所以仍算是一種regression >![螢幕擷取畫面 2024-06-09 155337](https://hackmd.io/_uploads/r1Z911mrC.png) polynomial terms && interaction terms >![螢幕擷取畫面 2024-06-09 155432](https://hackmd.io/_uploads/B10Ry1mB0.png) ::: ### Decision Trees :::spoiler ![螢幕擷取畫面 2024-06-09 155917](https://hackmd.io/_uploads/ByoyWJQr0.png) split 是 true false problem leaves 是 results (avarage) need to set some parameter ex:number of split (limit the tree growth) ::: :::spoiler 分類 1. Fine Tree - 通常深度較深 - 子葉的資料較小 2. Medium Tree - 介於 Fine Tree 跟 Coarse Tree 之間 3. Coarse Tree - 子葉的資料較大 ::: ### how to select model :::spoiler ![how to select mode](https://hackmd.io/_uploads/Byu-MJ7SA.png) 1. app > choose model 2. New Session 選擇資料 3. 選擇response and preictors variables 4. 選擇validation (Resubstitution Validation 就是 No Validation) >![螢幕擷取畫面 2024-06-09 161122](https://hackmd.io/_uploads/rku0XkmBC.png) 5. 選擇x-axis決定predition variable 6. feature selection 7. 選擇model (All quick to train 通常為 regression 跟 tree 因為較快)> train 8. 看RMSE決定較好的模型 9. 調整模型 (Summary > Model Hyperparameters) (optimizer) 10. 產出 - export: 直接產出模型 但無法當成function使用,可以查看各種值 - generate function ::: :::spoiler tool to use ![螢幕擷取畫面 2024-06-09 161552](https://hackmd.io/_uploads/S1S4Bk7HC.png) ::: ## Step 3 Training ![螢幕擷取畫面 2024-06-09 190921](https://hackmd.io/_uploads/SJ9tpZ7BC.png) :::spoiler build linear model ```matlab= linearModel = fitlm(taxi, "linear", ... "ResponseVar", "Duration", ... "PredictorVars", ["Distance", "TimeOfDay"]) ``` ```matlab= linearModel = fitlm(taxi, "poly14", ... "ResponseVar", "Duration", ... "PredictorVars", ["Distance", "TimeOfDay"]) ``` >polyjk j代表最高幾次方 k代表交互最高階數 ```matlab= linearModel = fitlm(taxi, "Duration ~ 1 + Distance + TimeOfDay") ``` ::: :::spoiler build tree ```matlab= treeModel = fitrtree(taxi, "Duration", ... "PredictorNames", ["Distance", "TimeOfDay"]) ``` ```matlab= view(treeModel) % in command line view(treeModel, "mode", "graph") ``` customize: >MinLeafSize (default is 1) >MaxNumSplits >fine tree -> corse tree >![螢幕擷取畫面 2024-06-09 184618](https://hackmd.io/_uploads/r19f_b7B0.png) ```matlab= treeModel = fitrtree(taxi, "Duration", ... "PredictorNames", ["Distance", "TimeOfDay"], ... "MaxNumSplits", 20) ``` ::: :::spoiler view coefficient ```matlab= linearModel.Coefficients ``` ::: :::spoiler use model to predict predict(model, data) ```matlab= yPredict = predict(linearModel, taxi) ``` ```matlab= yPredict = predict(treeModel, taxi) ``` ::: :::spoiler compare ```matlab= scatter(taxi.TimeOfDay, taxi.Duration, '.') hold on scatter(taxi.TimeOfDay, yPredict, '.') hold off legend("Actual", "Predict") ``` >![螢幕擷取畫面 2024-06-09 174755](https://hackmd.io/_uploads/HyUvqxQSC.png) ::: ## Evaluate Models :::spoiler residuals ![residuals](https://hackmd.io/_uploads/By_HEMXHR.png) ::: :::spoiler MAE ![MAE](https://hackmd.io/_uploads/Sy20rMQS0.png) ::: :::spoiler SSE ![SSE](https://hackmd.io/_uploads/S1ym8GXSR.png) ```matlab= SSE = sum((yPredict - yActual).^2) ``` ::: :::spoiler MSE ![MSE2](https://hackmd.io/_uploads/rylvUM7HC.png) ::: :::spoiler RMSE 因為MSE沒有相同單位,所以在開根號 ![RMSE](https://hackmd.io/_uploads/H1k3If7HC.png) ::: :::spoiler SST ![SST](https://hackmd.io/_uploads/rkmmDMQHA.png) ```matlab= SST = sum((yActual - mean(yActual)).^2) ``` ::: :::spoiler R^2 bigger more fit ![R square](https://hackmd.io/_uploads/H1nLPG7r0.png) ::: :::spoiler summary ![summary of evaluation](https://hackmd.io/_uploads/Skwl_GXrA.png) ![螢幕擷取畫面 2024-06-09 195701](https://hackmd.io/_uploads/rkEjOG7B0.png) ::: :::spoiler rMetrics 查看各估計值 ```matlab= yActual = taxi.Duration rMetrics(yActual, yPredict) ``` ![螢幕擷取畫面 2024-06-09 201727](https://hackmd.io/_uploads/HksvafXrA.png) ::: :::spoiler compare 預測資料 橘色與藍色覆蓋約多代表越準 >![螢幕擷取畫面 2024-06-09 202123](https://hackmd.io/_uploads/HJbvRzmHR.png) 越對稱於對角線越準 >![螢幕擷取畫面 2024-06-09 202313](https://hackmd.io/_uploads/H1OACMXSA.png) >![螢幕擷取畫面 2024-06-09 203717](https://hackmd.io/_uploads/BJoGMm7HR.png) 查看 residual (越貼近水平那條直線residual越少) >![螢幕擷取畫面 2024-06-09 203854](https://hackmd.io/_uploads/SymdM7QB0.png) >![螢幕擷取畫面 2024-06-09 204143](https://hackmd.io/_uploads/S1ZN7XmSA.png) ::: # Classification ## Models ![螢幕擷取畫面 2024-06-09 224248](https://hackmd.io/_uploads/Hk15kSmrA.png) :::spoiler Logistic Regression ![螢幕擷取畫面 2024-06-09 224424](https://hackmd.io/_uploads/r1P1lS7S0.png) Function >![螢幕擷取畫面 2024-06-09 224530](https://hackmd.io/_uploads/HyjzxHXS0.png) coefficients decide threshold >![螢幕擷取畫面 2024-06-09 224646](https://hackmd.io/_uploads/H1scxrQr0.png) when to use it >![螢幕擷取畫面 2024-06-09 224816](https://hackmd.io/_uploads/HkcalBXH0.png) ::: :::spoiler KNN K = 3 >![螢幕擷取畫面 2024-06-09 225015](https://hackmd.io/_uploads/rknN-H7H0.png) difficult to capture and slower when k is big and data is large >![螢幕擷取畫面 2024-06-09 225053](https://hackmd.io/_uploads/BJQObBQS0.png) type >![螢幕擷取畫面 2024-06-09 225157](https://hackmd.io/_uploads/SkAcWHXrC.png) ::: :::spoiler SVM find a line separate two classes >![螢幕擷取畫面 2024-06-09 225426](https://hackmd.io/_uploads/BkUEzrQHA.png) Kernal Method >![螢幕擷取畫面 2024-06-09 230125](https://hackmd.io/_uploads/BkGk4S7BR.png) >![螢幕擷取畫面 2024-06-09 230158](https://hackmd.io/_uploads/HkPlESmB0.png) >![螢幕擷取畫面 2024-06-09 230232](https://hackmd.io/_uploads/r1W7NSQHC.png) ::: ## Implement :::spoiler Classification Learner App Workflow ![螢幕擷取畫面 2024-06-09 231411](https://hackmd.io/_uploads/ryM7wSmSC.png) 1. read data 2. App > choose classification learner 3. new session > From Work space 4. data、response、predictors、validation 5. 橘色x代表toll pay但預測no toll pay >![螢幕擷取畫面 2024-06-09 233100](https://hackmd.io/_uploads/Hy0TcB7SR.png) 6. export ::: :::spoiler Confusion Matrix (recall、fallout、accuracy、precision) TP = true positive FN = false negative FP = false positive TN = true negative >![螢幕擷取畫面 2024-06-10 000005](https://hackmd.io/_uploads/ryVR-87SC.png) > >![螢幕擷取畫面 2024-06-10 000320](https://hackmd.io/_uploads/H1ALMUmBC.png) Fallout && Recall relation >![螢幕擷取畫面 2024-06-10 001134](https://hackmd.io/_uploads/S13HNUmB0.png) >![螢幕擷取畫面 2024-06-10 001728](https://hackmd.io/_uploads/BkYpSLmSA.png) ::: :::spoiler perfect situation ![螢幕擷取畫面 2024-06-10 000730](https://hackmd.io/_uploads/HyN8Q8mSR.png) ::: ### Logistic Regression :::spoiler categorical ```matlab= taxiData.WasTollPaidCat = categorical(taxiData.WasTollPaid,[false true],["No Toll" "Toll"]); ``` ::: :::spoiler fitglm ```matlab= logRegMdl = fitglm(taxiData,'linear','Distribution','binomial', ... "PredictorVars",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],... "ResponseVar","WasTollPaidCat") ``` ::: :::spoiler predict (scores) ```matlab= scoresLogReg = predict(logRegMdl,taxiData) ``` ::: :::spoiler convert scores to predictions ```matlab= thresholdLogReg =0.5; predictedLogReg = scoresLogReg >= thresholdLogReg; predictedLogReg = categorical(predictedLogReg , [false true], ["No Toll" "Toll"]) ``` ::: :::spoiler Performance Metrics (confusionchart) (cMetrics) ```matlab= confusionchart(taxiData.WasTollPaidCat,predictedLogReg) cmLogReg = confusionmat(taxiData.WasTollPaidCat,predictedLogReg) recallTrueClass = cmLogReg(2,2)/(cmLogReg(2,2)+cmLogReg(2,1)) ``` ```matlab= cMetrics(taxiData.WasTollPaidCat,predictedLogReg) ``` ::: :::spoiler ROC curve ```matlab= [falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll"); plot(falloutsLogReg,recallsLogReg); xlabel("FPR (Fallout)"); ylabel("TPR (Recall)"); hold on; [falloutTLogReg,recallTLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll","TVals",thresholdLogReg); ccDot = plot(falloutTLogReg,recallTLogReg,"ro","MarkerFaceColor","r"); title("ROC Curve with Positive Class: Toll") legend(ccDot, "T = " + string( thresholdLogReg ) + " Recall = " + string(recallTLogReg) + " Fallout = " + string(falloutTLogReg) ,... "Location","best") hold off; ``` >red dot determine by threshold (決定正負的中心點) >如果no toll是positive則變成 ```matlab= [falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,1-scoresLogReg,"No Toll"); ``` ::: ### KNN :::spoiler build model ```matlab= knnMdl = fitcknn(taxiData,"WasTollPaidCat","NumNeighbors",50,"DistanceWeight","equal",... "PredictorNames",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],... "ResponseName","WasTollPaidCat") ``` ::: :::spoiler predict (default threshold is 0.5) ```matlab= [predictedKNN,scoresKNN] = predict(knnMdl,taxiData) ``` ::: :::spoiler convert scores to predictions ```matlab= thresholdKNN = 0.5; predKNNNoToll = scoresKNN(:,1) >= thresholdKNN; predKNNNoToll = categorical(predKNNNoToll,[true false],["No Toll" "Toll"]) ``` ::: :::spoiler Performance Metrics ```matlab= cMetrics(taxiData.WasTollPaidCat,predKNNNoToll) ``` ::: :::spoiler ROC curve ```matlab= [falloutsKNN,recallsKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll"); clf plot(falloutsKNN,recallsKNN); xlabel("FPR (Fallout)"); ylabel("TPR (Recall)"); hold on; [falloutTKNN,recallTKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll","TVals",thresholdKNN); ccDot = plot(falloutTKNN,recallTKNN,"ro","MarkerFaceColor","r"); title("ROC Curve with Positive Class: No Toll") legend(ccDot, "T = " + string(thresholdKNN) + " Fallout = " + string(falloutTKNN) + " Recall = " + string(recallTKNN) ) hold off; ``` ::: :::spoiler accuracy 越高不一定越好 x很多 >![螢幕擷取畫面 2024-06-10 014341](https://hackmd.io/_uploads/rJZZ5wQHC.png) 集中在一種分類 >![螢幕擷取畫面 2024-06-10 014506](https://hackmd.io/_uploads/HJoE5DmBC.png) ::: ==K值決定,必須看種類內資料數有多少== :::spoiler Multiclass ![螢幕擷取畫面 2024-06-10 025739](https://hackmd.io/_uploads/rJlwiuXHC.png) one vs one && one vs all >![螢幕擷取畫面 2024-06-10 031650](https://hackmd.io/_uploads/Sy86kYQBC.png) ::: # Choose Optimal Model (Validation) :::spoiler overfit and underfit ![螢幕擷取畫面 2024-06-10 042330](https://hackmd.io/_uploads/HJBU1cQHC.png) solve overfitting add more data => validation data >![螢幕擷取畫面 2024-06-10 042413](https://hackmd.io/_uploads/S1y9JqXrA.png) ![螢幕擷取畫面 2024-06-10 043024](https://hackmd.io/_uploads/HJOlWqXr0.png) ::: :::spoiler Validation ![螢幕擷取畫面 2024-06-10 042647](https://hackmd.io/_uploads/SJV7gqQHC.png) holdout >![螢幕擷取畫面 2024-06-10 042804](https://hackmd.io/_uploads/SJIvgcQBC.png) k-fold >![螢幕擷取畫面 2024-06-10 042858](https://hackmd.io/_uploads/ryZogq7SC.png) compare >![螢幕擷取畫面 2024-06-10 042934](https://hackmd.io/_uploads/Hk3Te5mSC.png) ::: :::spoiler partition test and train ```matlab= rng(1); taxiPartitions = cvpartition(height(taxiData), "HoldOut", 0.2) taxiTestIdx = test(taxiPartitions) taxiTest = taxiData(taxiTestIdx, : ); taxiTrainIdx = training(taxiPartitions) taxiTrain = taxiData(taxiTrainIdx, : ); taxiTrain = basicPreprocessing(taxiTrain); taxiTrain = addTimeOfDay(taxiTrain); taxiTrain = addDayOfWeek(taxiTrain); ``` 1. setting the seed ```matlab= rng(11) ``` 2. cvpartition with height ```matlab= healthData_holdout = cvpartition(height(healthData),"Holdout",0.4) ``` 3. apply training and test ```matlab= trainingDataR = healthData(training(healthData_holdout), : ) testDataR = healthData(test(healthData_holdout), : ) ``` 4. create two separate data sets for training/validation and test with indices from step 3 ::: ## feature selection - Filter Methods >![螢幕擷取畫面 2024-06-10 105429](https://hackmd.io/_uploads/HkheokErC.png) - Wrapper Methods >![螢幕擷取畫面 2024-06-10 105510](https://hackmd.io/_uploads/ByPNiJNrA.png) - Embeded Methods >![螢幕擷取畫面 2024-06-10 105607](https://hackmd.io/_uploads/S1vwsk4BR.png) :::spoiler embeded methods ```matlab= impValues = predictorImportance(trainedModel.ClassificationTree) ``` >值越高代表該feature越重要 ![螢幕擷取畫面 2024-06-10 114509](https://hackmd.io/_uploads/SyiAIgEBC.png) >左邊是impValues,曲線是累積值 >接著就可以選擇較高的feature了 ```matlab= obsTrainSmall = obsTrain(:, [3032 654 948 2328 9]) ``` >接著再training一次 ::: ## Regularization to Prevent Overfitting :::spoiler penalty term ![螢幕擷取畫面 2024-06-10 115135](https://hackmd.io/_uploads/rySLdgEHR.png) overfitting >![螢幕擷取畫面 2024-06-10 115156](https://hackmd.io/_uploads/HyHuOlErA.png) underfitting >![螢幕擷取畫面 2024-06-10 115225](https://hackmd.io/_uploads/BkeqdxVrC.png) caculation (每個模型適用不同的計算方式) >![螢幕擷取畫面 2024-06-10 115314](https://hackmd.io/_uploads/SkcnOxVSC.png) >Lasso Regression 也可以視為一種feature selection因為B為0 >![螢幕擷取畫面 2024-06-10 120136](https://hackmd.io/_uploads/Byu2qeVSR.png) default setting (hyper parameter) >![螢幕擷取畫面 2024-06-10 121418](https://hackmd.io/_uploads/BJvn6eVS0.png) ::: :::spoiler proccess 1. 標準化 ```matlab= meanObs = mean(obsTrain) stdObs = std(obsTrain) obsTrain = (obsTrain - meanObs)./stdObs ``` 2. train ridge and lasso models - fitrlinear (regression) - fitclinear (classification) ```matlab= mdl = fitclinear(obsTrain, grpTrain, ... "Learner", "logistic", ... "Regularization", "ridge", ... "KFold", 20) ``` 3. predict ```matlab= grpPredict = kfoldPredict(mdl); cMetrics(grpTrain, grpPredict) ``` >Use lasso regression when you want to remove some features. >Use ridge regression when you want all the features to contribute ::: ## Ensemble Models models accuracy 可能相近,但預測的結果卻不同,此時就須要ensemble model組合多個model cost: > 1. training time > 2. memory utilization > 3. prediction speed ![螢幕擷取畫面 2024-06-11 143957](https://hackmd.io/_uploads/ryO8Z_rHR.png) :::spoiler Boosted Ensembles proccess >![螢幕擷取畫面 2024-06-11 143346](https://hackmd.io/_uploads/BJR0J_rHR.png) >![螢幕擷取畫面 2024-06-11 143411](https://hackmd.io/_uploads/SywggurrR.png) Results >![螢幕擷取畫面 2024-06-11 144737](https://hackmd.io/_uploads/HJMS7uBH0.png) ::: :::spoiler Bagged Ensembles Results >![螢幕擷取畫面 2024-06-11 143533](https://hackmd.io/_uploads/rkfvg_rH0.png) ::: :::spoiler Bagged Ensembles (Random Forests) problem >若都用同樣資料則訓練出來的tree可能具有高度相似的結構 solution >將資料拆分成不同feature 的 subset ::: ## Parameter ### Model Parameters - estimated from data - values are optimized by the algorithm itself - they're not manually set ### Model Hyperparameters - cannot be estimated from data - can be manually set - used to help estimate model parameters - For example KNN 的 K 值 這裡用K值跟Distance Metric (KNN)做舉例 :::spoiler How to determine ![螢幕擷取畫面 2024-06-11 201420](https://hackmd.io/_uploads/HJo2J6HrR.png) Grid Search >找到所有的K跟Distance Metric並看哪種組合較好 >![螢幕擷取畫面 2024-06-11 201713](https://hackmd.io/_uploads/HksIeaHBC.png) Random Search >![螢幕擷取畫面 2024-06-11 201954](https://hackmd.io/_uploads/Sy6xbprH0.png) ::: ## Test Model :::spoiler 1. Test > Test Data (select taxiTest) 2. select test model 3. test all > test selected 4. 可以看residual plot 確認是否overfit ::: :::spoiler summary 整套 ML 的基礎流程 ![螢幕擷取畫面 2024-06-11 204706](https://hackmd.io/_uploads/ryVdD6HBC.png) 其中reducing complexity就是feature selection ![螢幕擷取畫面 2024-06-11 204842](https://hackmd.io/_uploads/SkbpPTHHR.png) ::: # Using Your Model ![螢幕擷取畫面 2024-06-11 220915](https://hackmd.io/_uploads/r11oc0HBR.png) :::spoiler proccess 其他語言也可以調用 matlab code 1. create project 2. commit 3. share matlab 可以把 code 轉成其他語言,刻到硬體上 matlab 可以把 model 變成 GUI 使用這介面 matlab web ::: # Automated machine learning - fitcauto - fitrauto :::spoiler load data ```matlab= load ovariancancer.mat obs grp % Set the rng seed rng(2); cv = cvpartition(grp,"Holdout",0.2); % Split into training and test data obsTrain = obs(training(cv),:); grpTrain = grp(training(cv)); obsTest = obs(test(cv),:); grpTest = grp(test(cv)); % Normalize the training and test data meanObs = mean(obsTrain); stdObs = std(obsTrain); obsTrainNorm = (obsTrain - meanObs)./stdObs ``` ::: :::spoiler Select feature The code below uses x^2^ tests to select 100 predictive features, which is 2.5% of the original 4000 features. ```matlab= % Use chi-squared tests to rank features by importance [idx,scores] = fscchi2(obsTrainNorm,grpTrain); % Create new training set using top 100 features obsTrainSmall = obsTrainNorm(:,idx(1:100)) ``` ::: :::spoiler Use fitcauto to select a model and hyperparameters ```matlab= mdl = fitcauto(obsTrainSmall,grpTrain); ``` ![螢幕擷取畫面 2024-06-11 224725](https://hackmd.io/_uploads/ryecQ1LBC.png) ::: :::spoiler Test the optimized model ```matlab= % Apply same pre-processing steps obsTestNorm = (obsTest - meanObs)./stdObs; obsTestSmall = obsTestNorm(:,idx(1:100)); % Predict labels grpPredict = predict(mdl,obsTestSmall); % Display metrics cMetrics(grpTest,grpPredict) % Display confusion matrix confusionchart(grpTest,grpPredict) ``` ::: :::spoiler Disadvantages - Long training times - Lack of full-workflow automation - No guarantee of the "best" model (因為 iteration 都是根據 default ) :::