Predictive Modeling and Machine Learning with MATLAB

Author: 蔡承佑

Machine Learning

Definition

machine learning definition

supervised learning workflow


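data
- training data: used to train the model
- validation data: used to check the model's accuracy during training
- test data: held-out data that the final model makes predictions on
find the best model and training options
1. choose a model
2. select features
3. tune parameters

overfitting: fitting the noise, i.e. using an overly complex model to predict the data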

glossary of terms

model

an algorithm that predicts a response using a set of features. A model is trained on existing data and used to make predictions on new observations.

regression model

a machine learning model that outputs a continuous numeric response. For example, predicting stock prices is a regression problem.

classification model

a machine learning model that outputs a prediction from a discrete set of possible outcomes. For example, predicting if a medical image indicates healthy or cancer is a classification problem.

model parameter

refers to values used to create the model. Some model parameters are learned by the machine learning algorithm during training. Other parameters are set by the user prior to training.

hyperparameter

parameters required by the model that are set by the user. Hyperparameters are not learned through model training but often determined through an optimization process.
(not learned during training, but usually chosen through an optimization process)

training data

data used to train a model. A final model is trained using the full training and validation data.

validation data

data used to evaluate model performance during the training process. Validation data helps prevent choosing a model that overfits the training data (see overfitting below). A final model is trained using the full training and validation data.

test data

data used to simulate new observations. Test data is split from a full data set early in the machine learning process and not used during preprocessing and model training steps. Test data is used to evaluate a final model.

resubstitution validation

using the training data to evaluate a machine learning model. This approach provides no protection against overfitting because the same observations used to train the model are substituted into the model for calculating metrics.

overfitting

a model that obtains high accuracy with the training data but does poorly with new data. This often happens because the model fits to random fluctuations in the training data. Validation data helps prevent overfitting by using a subset of data to evaluate model performance.

underfitting

a model that is too simplistic to capture some trends in the data, resulting in large errors. For example, using a single sinusoid to model temperature may capture seasonal trends, but miss daily variations due to day and nighttime differences.

wide dataset

a dataset where the number of features is similar to, or greater than the number of observations.

Regression

Step 1 Data processing

Import: import a data file into MATLAB
Visualize: visualize the data to look for relationships between variables
Clean: determine how the raw data should be cleaned before it is used to create predictive models

Import data

read data

data = readtable("path")

read data (for checking if data is missing)




files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv";
ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true);
taxi = readall(ds);
head(taxi)

Visualizations

Scatter plot






scatter(data.Distance, data.Fare)
gscatter(data.Distance, data.Fare, data.RateCode)
xlabel("Distance (mi.)")
ylabel("Fare ($)")
xlim([0 40])
ylim([0 200])

scatter draws a scatter plot; gscatter takes a third variable and groups the points by its categories
xlabel, ylabel: axis labels
xlim, ylim: axis limits

Histogram

Start with the default bins to see the overall range

histogram(data.Fare)
xlabel("Fare ($)")
ylabel("Occurrences")

Then choose explicit bin edges (fareBins)

fareBins = 0:1:60;
histogram(data.Fare, fareBins)
xlabel("Fare ($)")
ylabel("Occurrences")

fareBins sets the bin edges: 0 to 60 in steps of 1

boxplot


boxplot(taxiC.Distance, "Orientation", "horizontal");
xlabel("Distance")

Location



geoscatter(data.PickupLat, data.PickupLon, '.', 'SizeData', 1)

geolimits([40 41],[-75 -73])

Dragging the map and then choosing update can set geolimits for you

clean data

check if data missing




files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv";
ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true);
taxi = readall(ds);
head(taxi)


numMissing = nnz(ismissing(taxi))

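nnz counts the nonzero elements, so nnz(ismissing(taxi)) is the number of missing entries

However, a value that is not missing is not necessarily valid

For example, a distance of 0 or less is not valid

Inspect the data with summary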


summary(taxi)

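The percentage of invalid values should be below 2%


distErrPcnt = 100*nnz(taxi.Distance <= 0)/height(taxi)


% taxiC and taxiC2 are the cleaned tables created in the steps below
locationCleanLoss = 100*( 1 - height(taxiC2)/height(taxiC))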

Remove invalid values (by value)


taxiC = taxi(taxi.Distance > 0, :);

Check percentile values with prctile


pTilesDistance = prctile(taxiC.Distance, [0, 99.99])

remove top 0.01% data

Remove invalid values (by percentile) with rmoutliers



taxiC = rmoutliers(taxiC, "percentiles", [0, 99.99], "DataVariables", "Distance");
histogram(taxiC.Distance, 100);
xlabel("Distance");

remove top 0.01% data

Coordinates that fall out at sea are not valid, so restrict the latitude and longitude range












lat1 = 40;
lat2 = 42;

lon1 = -75;
lon2 = -73;

loc2keep = taxiC.PickupLat >= lat1 & taxiC.PickupLat <= lat2 & ...
    taxiC.DropoffLat >= lat1 & taxiC.DropoffLat <= lat2 & ...
    taxiC.PickupLon >= lon1 & taxiC.PickupLon <= lon2 & ...
    taxiC.DropoffLon >= lon1 & taxiC.DropoffLon <= lon2;

taxiC2 = taxiC(loc2keep, :);

loc2keep is the logical index that decides which rows to keep

visualization


geoplot(taxiC2.PickupLat, taxiC2.PickupLon, "b.", "MarkerSize", 0.5);
title("Pickups");


geoplot(taxiC2.DropoffLat, taxiC2.DropoffLon, "r.", "MarkerSize", 0.5);
title("Dropoffs");

The points are visibly denser than in the Location plot above



histogram(taxiC2.DropoffLat, 100); hold on;
histogram(taxiC2.PickupLat, 100); hold off;
legend(["DropoffLat" "PickupLat"]);

use hold to compare

time

show time




taxiC2.TimeOfDay = timeofday(taxiC2.PickupTime);

histogram(taxiC2.TimeOfDay);
xlabel("Pickup Time of Day");

How to tell whether cleaning is needed (look at the range of the histogram axis)





taxiC2.TimeOfDay = hours(taxiC2.TimeOfDay);
taxiC2.Duration = minutes(taxiC2.DropoffTime - taxiC2.PickupTime);

histogram(taxiC2.Duration, 100);
xlabel("Trip Duration (minutes)");

process


summary

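Process the data (new variables can be derived from existing ones)
Visualize with plots (histogram)
If the axis range is far too wide, the data needs cleaning
Remove impossible values
Compute the bounds with prctile
Remove values outside the bounds with rmoutliers
Check the remaining values again with histogram

Use histogram or boxplot to inspect the result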

Step 2 Model


Linear Regression

linear regression

MSE

advantage

advantage of Simple Linear

polynomial function

A power of a predictor can be treated as a new variable, so a polynomial fit still counts as linear regression
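A minimal sketch of that idea, assuming a made-up, noise-free data set (the values of x and y below are hypothetical): the squared term is added as an extra column of the design matrix, and ordinary least squares recovers the coefficients.

```matlab
% Hypothetical toy data: y is an exact quadratic in x (no noise).
x = (0:5)';
y = 1 + 2*x + 3*x.^2;

% Treat x.^2 as a new predictor column; the model stays linear in the coefficients.
X = [ones(size(x)), x, x.^2];

% Ordinary least squares via the backslash operator.
b = X \ y;   % b is approximately [1; 2; 3]
```

With a table input, fitlm builds the same kind of design matrix from a model specification, so this is only meant to show why the fit is still "linear".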

polynomial terms && interaction terms

Decision Trees


each split is a true/false question
each leaf holds a result (the average response)
some parameters must be set, e.g. the number of splits (to limit tree growth)

Types

Fine Tree
- usually deeper
- fewer observations per leaf
Medium Tree
- between Fine Tree and Coarse Tree
Coarse Tree
- more observations per leaf

how to select model


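Apps > choose the learner app (Regression Learner for this section)
New Session: select the data
select the response and predictor variables
select the validation scheme (Resubstitution Validation means no validation)

choose the x-axis variable to pick which predictor is plotted
feature selection
select a model (All Quick-To-Train usually covers regression and tree models because they train fast) > Train
compare RMSE to decide which model is better
tune the model (Summary > Model Hyperparameters) (optimizer)
output
- export: exports the trained model directly; it cannot be called as a function, but its values can be inspected
- generate function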

tool to use


Step 3 Training


build linear model



linearModel = fitlm(taxi, "linear", ...
    "ResponseVar", "Duration", ...
    "PredictorVars", ["Distance", "TimeOfDay"])



linearModel = fitlm(taxi, "poly14", ...
    "ResponseVar", "Duration", ...
    "PredictorVars", ["Distance", "TimeOfDay"])

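polyij: i is the highest degree of the first predictor (Distance) and j is the highest degree of the second predictor (TimeOfDay), so "poly14" is degree 1 in Distance and degree 4 in TimeOfDay; interaction terms are included, with degree no higher than the largest specified degree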


linearModel = fitlm(taxi, "Duration ~ 1 + Distance + TimeOfDay")

build tree


treeModel = fitrtree(taxi, "Duration", ...
    "PredictorNames", ["Distance", "TimeOfDay"])


view(treeModel) % in command line
view(treeModel, "mode", "graph")

customize:

MinLeafSize (default is 1)
MaxNumSplits
a larger MinLeafSize or a smaller MaxNumSplits moves the model from a fine tree toward a coarse tree



treeModel = fitrtree(taxi, "Duration", ...
    "PredictorNames", ["Distance", "TimeOfDay"], ...
    "MaxNumSplits", 20)

view coefficient


linearModel.Coefficients

use model to predict

predict(model, data)


yPredict = predict(linearModel, taxi)


yPredict = predict(treeModel, taxi)

compare





scatter(taxi.TimeOfDay, taxi.Duration, '.')
hold on
scatter(taxi.TimeOfDay, yPredict, '.')
hold off
legend("Actual", "Predict")

Evaluate Models

residuals

MAE

SSE


SSE = sum((yPredict - yActual).^2)

MSE

RMSE

MSE is in squared units of the response, so RMSE takes the square root to return to the original units

SST


SST = sum((yActual - mean(yActual)).^2)

R^2: the closer to 1, the better the fit

R square
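The metrics listed above can be computed directly from the actual and predicted responses. The sketch below assumes two short, made-up vectors (yActual and yPredict are hypothetical values) and divides by the number of observations for MSE; some tools divide by the degrees of freedom instead.

```matlab
% Hypothetical actual and predicted responses (column vectors).
yActual  = [10; 12; 15; 20];
yPredict = [11; 11; 16; 18];

residuals = yActual - yPredict;             % per-observation errors
n = numel(yActual);

MAE  = mean(abs(residuals));                % mean absolute error
SSE  = sum(residuals.^2);                   % sum of squared errors
MSE  = SSE / n;                             % mean squared error (squared units)
RMSE = sqrt(MSE);                           % same units as the response
SST  = sum((yActual - mean(yActual)).^2);   % total sum of squares
R2   = 1 - SSE/SST;                         % coefficient of determination
```

For these values MAE is 1.25, SSE is 7, MSE is 1.75, and SST is 56.75, so R2 is about 0.877.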

summary


rMetrics (a helper function, not a MATLAB built-in) displays the evaluation metrics


yActual = taxi.Duration
rMetrics(yActual, yPredict)


compare

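Predicted data
The more the orange and blue points overlap, the more accurate the model

The more symmetric the points are about the diagonal, the more accurate the model

Check the residuals (the closer the points lie to the horizontal line, the smaller the residuals)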

Classification

Models


Logistic Regression


Function

coefficients decide threshold

when to use it

KNN

K = 3

difficult to capture and slower when k is big and data is large

type

SVM

find a line that separates the two classes

Kernel Method

Implement

Classification Learner App Workflow


read data
Apps > choose Classification Learner
New Session > From Workspace
select the data, response, predictors, and validation
an orange x marks a trip where a toll was paid but the model predicted no toll

export

Confusion Matrix (recall, fallout, accuracy, precision)

TP = true positive
FN = false negative
FP = false positive
TN = true negative
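The four metrics named in the heading follow from these counts. A minimal sketch, assuming made-up counts (the numbers below are hypothetical, not taken from the taxi data):

```matlab
% Hypothetical counts from a binary confusion matrix.
TP = 40;  FN = 10;
FP = 5;   TN = 45;

recall    = TP / (TP + FN);                    % true positive rate
fallout   = FP / (FP + TN);                    % false positive rate
precision = TP / (TP + FP);                    % share of positive predictions that are correct
accuracy  = (TP + TN) / (TP + FN + FP + TN);   % share of all predictions that are correct
```

Here recall is 0.8, fallout is 0.1, and accuracy is 0.85. Recall and fallout are the two axes of the ROC curve, which is why they are usually read together.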

Fallout && Recall relation

perfect situation


Logistic Regression

categorical


taxiData.WasTollPaidCat = categorical(taxiData.WasTollPaid,[false true],["No Toll" "Toll"]);

fitglm



logRegMdl = fitglm(taxiData,'linear','Distribution','binomial', ...
    "PredictorVars",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],...
    "ResponseVar","WasTollPaidCat")

predict (scores)


scoresLogReg = predict(logRegMdl,taxiData)

convert scores to predictions



thresholdLogReg =0.5;
predictedLogReg = scoresLogReg >= thresholdLogReg;
predictedLogReg = categorical(predictedLogReg , [false true], ["No Toll" "Toll"])

Performance Metrics (confusionchart) (cMetrics)



confusionchart(taxiData.WasTollPaidCat,predictedLogReg)
cmLogReg = confusionmat(taxiData.WasTollPaidCat,predictedLogReg)
recallTrueClass = cmLogReg(2,2)/(cmLogReg(2,2)+cmLogReg(2,1))


cMetrics(taxiData.WasTollPaidCat,predictedLogReg)

ROC curve














[falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll");

plot(falloutsLogReg,recallsLogReg); 
xlabel("FPR (Fallout)");
ylabel("TPR (Recall)");
hold on; 

[falloutTLogReg,recallTLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll","TVals",thresholdLogReg);
ccDot = plot(falloutTLogReg,recallTLogReg,"ro","MarkerFaceColor","r"); 

title("ROC Curve with Positive Class: Toll")
legend(ccDot, "T = " + string( thresholdLogReg ) + " Recall = " + string(recallTLogReg) + " Fallout = " + string(falloutTLogReg) ,...
    "Location","best")
hold off;

the red dot is determined by the threshold (the cutoff between positive and negative predictions)
If No Toll is the positive class, the call becomes:


[falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,1-scoresLogReg,"No Toll");

KNN

build model



knnMdl = fitcknn(taxiData,"WasTollPaidCat","NumNeighbors",50,"DistanceWeight","equal",...
    "PredictorNames",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],...
    "ResponseName","WasTollPaidCat")

predict (default threshold is 0.5)


[predictedKNN,scoresKNN] = predict(knnMdl,taxiData)

convert scores to predictions




thresholdKNN = 0.5; 
predKNNNoToll = scoresKNN(:,1) >= thresholdKNN;
predKNNNoToll = categorical(predKNNNoToll,[true false],["No Toll" "Toll"])

Performance Metrics


cMetrics(taxiData.WasTollPaidCat,predKNNNoToll)

ROC curve











[falloutsKNN,recallsKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll");
clf
plot(falloutsKNN,recallsKNN);
xlabel("FPR (Fallout)");
ylabel("TPR (Recall)");
hold on;
[falloutTKNN,recallTKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll","TVals",thresholdKNN);
ccDot = plot(falloutTKNN,recallTKNN,"ro","MarkerFaceColor","r"); 
title("ROC Curve with Positive Class: No Toll")
legend(ccDot, "T = " + string(thresholdKNN) + " Fallout = " + string(falloutTKNN) + " Recall = " + string(recallTKNN) )
hold off;

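Higher accuracy is not necessarily better

the plot can still show many x marks (misclassified points)

when the predictions concentrate in a single class

When choosing K, check how many observations each class contains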

Multiclass


one vs one && one vs all
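As a rough illustration of the two schemes, the sketch below fits both codings with fitcecoc. It assumes the fisheriris sample data set (three classes) rather than the taxi data, and SVM binary learners; both choices are assumptions made only for this example.

```matlab
load fisheriris            % sample data: meas (150x4), species (3 classes)

% One-vs-one: one binary learner per pair of classes.
mdlOvO = fitcecoc(meas, species, "Learners", "svm", "Coding", "onevsone");

% One-vs-all: one binary learner per class.
mdlOvA = fitcecoc(meas, species, "Learners", "svm", "Coding", "onevsall");

numOvO = numel(mdlOvO.BinaryLearners);   % K*(K-1)/2 learners
numOvA = numel(mdlOvA.BinaryLearners);   % K learners
```

With three classes both schemes happen to need three binary learners; for more classes one-vs-one grows quadratically while one-vs-all grows linearly.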

Choose Optimal Model (Validation)

overfit and underfit


solve overfitting
add more data => validation data


Validation


holdout

k-fold

compare

partition test and train
















rng(1);
taxiPartitions = cvpartition(height(taxiData), "HoldOut", 0.2)

taxiTestIdx = test(taxiPartitions)

taxiTest = taxiData(taxiTestIdx, : );

taxiTrainIdx = training(taxiPartitions)

taxiTrain = taxiData(taxiTrainIdx, : );

taxiTrain = basicPreprocessing(taxiTrain);

taxiTrain = addTimeOfDay(taxiTrain);

taxiTrain = addDayOfWeek(taxiTrain);

setting the seed


rng(11)

cvpartition with height


healthData_holdout = cvpartition(height(healthData),"Holdout",0.4)

apply training and test


trainingDataR = healthData(training(healthData_holdout), : )
testDataR = healthData(test(healthData_holdout), : )

create two separate data sets for training/validation and test with indices from step 3
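For comparison with the holdout splits above, a k-fold partition can be sketched the same way. The number of observations and the number of folds below are hypothetical.

```matlab
rng(1);                                  % fix the seed so the folds are reproducible
n = 100;                                 % hypothetical number of observations
cvp = cvpartition(n, "KFold", 5);        % 5 folds, each used once for validation

for k = 1:cvp.NumTestSets
    trainIdx = training(cvp, k);         % logical index of the training rows for fold k
    valIdx   = test(cvp, k);             % logical index of the validation rows for fold k
    fprintf("fold %d: %d training, %d validation\n", k, nnz(trainIdx), nnz(valIdx));
end
```

Every observation is used for validation exactly once, which costs k training runs instead of one.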

feature selection

Filter Methods


Wrapper Methods


Embedded Methods


embedded methods


impValues = predictorImportance(trainedModel.ClassificationTree)

The higher the value, the more important the feature

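In the plot (screenshot not reproduced here), impValues is on the left axis and the curve is the cumulative sum
The features with the highest importance values can then be selected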


obsTrainSmall = obsTrain(:, [3032 654 948 2328 9])

Then train the model again on the reduced feature set

Regularization to Prevent Overfitting

penalty term


overfitting

underfitting

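calculation (each model uses a different formula)

Lasso regression can also be viewed as a form of feature selection, because it can shrink some coefficients (β) to exactly 0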

default setting (hyper parameter)

process

Standardization




meanObs = mean(obsTrain)
stdObs = std(obsTrain)

obsTrain = (obsTrain - meanObs)./stdObs

train ridge and lasso models
- fitrlinear (regression)
- fitclinear (classification)




mdl = fitclinear(obsTrain, grpTrain, ...
    "Learner", "logistic", ...
    "Regularization", "ridge", ...
    "KFold", 20)

predict


grpPredict = kfoldPredict(mdl);
cMetrics(grpTrain, grpPredict)

Use lasso regression when you want to remove some features.
Use ridge regression when you want all the features to contribute
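A small sketch of the difference, using fitrlinear on synthetic data. The data, the Lambda value, and the variable names are assumptions for illustration only; only the first two of ten features actually drive the response.

```matlab
rng(0);                                       % reproducible synthetic data
n = 200;  p = 10;
X = randn(n, p);
y = 3*X(:,1) - 2*X(:,2) + 0.1*randn(n, 1);    % only the first two features matter

lassoMdl = fitrlinear(X, y, "Learner", "leastsquares", ...
    "Regularization", "lasso", "Lambda", 0.1);
ridgeMdl = fitrlinear(X, y, "Learner", "leastsquares", ...
    "Regularization", "ridge", "Lambda", 0.1);

numZeroLasso = nnz(lassoMdl.Beta == 0);       % lasso can set coefficients exactly to zero
numZeroRidge = nnz(ridgeMdl.Beta == 0);       % ridge shrinks coefficients but rarely zeroes them
```

Comparing numZeroLasso with numZeroRidge for a few Lambda values is a quick way to see how many features each penalty keeps.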

Ensemble Models

Models may have similar accuracy yet make different predictions; in that case an ensemble model is needed to combine several models

cost:

training time

memory utilization

prediction speed


Boosted Ensembles

process

Results

Bagged Ensembles

Results

Bagged Ensembles (Random Forests)

problem

If every tree is trained on the same data, the trees may end up with highly similar structures

solution

Train the trees on different subsets of the features
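One way to sketch this in MATLAB is a bagged ensemble whose tree template samples a random subset of features at each split. The data set (fisheriris), the number of trees, and the subset size below are assumptions chosen for illustration.

```matlab
load fisheriris                          % sample data: meas (150x4), species
rng(1);                                  % bagging is random; fix the seed

% Each split considers a random subset of 2 of the 4 features.
t = templateTree("NumVariablesToSample", 2);

forestMdl = fitcensemble(meas, species, ...
    "Method", "Bag", ...
    "NumLearningCycles", 50, ...
    "Learners", t);

label = predict(forestMdl, meas(1, :));  % predict the class of one observation
```

Because each tree sees a different bootstrap sample and different candidate features, the trees are less correlated than trees grown on identical inputs.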

Parameter

Model Parameters

estimated from data
values are optimized by the algorithm itself
they're not manually set

Model Hyperparameters

cannot be estimated from data
can be manually set
used to help estimate model parameters
For example, the value of K in KNN

The examples here use K and the distance metric of KNN

How to determine


Grid Search

Try every combination of K and distance metric and see which combination performs best
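A minimal grid-search sketch over those two hyperparameters, scoring each combination by 5-fold cross-validation loss. The candidate values and the fisheriris data set are assumptions made for this example.

```matlab
load fisheriris                          % sample data: meas, species
rng(1);                                  % fix the seed so the folds are reproducible

kValues   = [1 5 15];
distances = ["euclidean" "cityblock" "chebychev"];
cvLoss    = zeros(numel(kValues), numel(distances));

for i = 1:numel(kValues)
    for j = 1:numel(distances)
        cvMdl = fitcknn(meas, species, ...
            "NumNeighbors", kValues(i), ...
            "Distance", distances(j), ...
            "KFold", 5);                 % 5-fold cross-validated model
        cvLoss(i, j) = kfoldLoss(cvMdl); % misclassification rate
    end
end

[bestLoss, idx] = min(cvLoss(:));
[bi, bj] = ind2sub(size(cvLoss), idx);
fprintf("best: K = %d, distance = %s, loss = %.3f\n", kValues(bi), distances(bj), bestLoss);
```

The cost grows with the product of the candidate counts, which is the main reason random search is preferred for larger grids.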

Random Search

Test Model

Test > Test Data (select taxiTest)
select test model
test all > test selected
check the residual plot to see whether the model overfits

summary: the basic end-to-end ML workflow


Here, reducing complexity means feature selection


Using Your Model


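process

Other languages can also call MATLAB code

create project
commit
share

MATLAB can convert code to other languages for deployment on hardware

MATLAB can turn a model into a GUI app, which can also be used through a MATLAB web app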

Automated machine learning

fitcauto
fitrauto

load data
















load ovariancancer.mat obs grp

% Set the rng seed
rng(2);
cv = cvpartition(grp,"Holdout",0.2);

% Split into training and test data
obsTrain = obs(training(cv),:);
grpTrain = grp(training(cv));
obsTest = obs(test(cv),:);
grpTest = grp(test(cv));

% Normalize the training and test data
meanObs = mean(obsTrain);
stdObs = std(obsTrain);
obsTrainNorm = (obsTrain - meanObs)./stdObs

Select feature

The code below uses chi-squared (χ²) tests to select 100 predictive features, which is 2.5% of the original 4000 features.





% Use chi-squared tests to rank features by importance
[idx,scores] = fscchi2(obsTrainNorm,grpTrain);

% Create new training set using top 100 features
obsTrainSmall = obsTrainNorm(:,idx(1:100))

Use fitcauto to select a model and hyperparameters


mdl = fitcauto(obsTrainSmall,grpTrain);


Test the optimized model












% Apply same pre-processing steps
obsTestNorm = (obsTest - meanObs)./stdObs;
obsTestSmall = obsTestNorm(:,idx(1:100));

% Predict labels
grpPredict = predict(mdl,obsTestSmall);

% Display metrics
cMetrics(grpTest,grpPredict)

% Display confusion matrix
confusionchart(grpTest,grpPredict)

Disadvantages

Long training times
Lack of full-workflow automation
No guarantee of the "best" model (because every iteration works from the default settings)
