Predictive Modeling and Machine Learning with MATLAB

Author: 蔡承佑

Machine Learning

Definition

machine learning definition

supervised learning workflow

supervised machine learning workflow

  • data

    • training data: used to train the model
    • validation data: used to validate the model's correctness
    • test data: data to run predictions on
  • find best model training options

    1. choose a model
    2. select features
    3. tune parameters

overfitting: fitting the noise, that is, using an overly complex model to predict the data

glossary of terms

model

an algorithm that predicts a response using a set of features. A model is trained on existing data and used to make predictions on new observations.

regression model

a machine learning model that outputs a continuous numeric response. For example, predicting stock prices is a regression problem.

classification model

a machine learning model that outputs a prediction from a discrete set of possible outcomes. For example, predicting whether a medical image indicates healthy or cancerous tissue is a classification problem.

parameter

refers to values used to create the model. Some model parameters are learned by the machine learning algorithm during training. Other parameters are set by the user prior to training.

hyperparameter

parameters required by the model that are set by the user. Hyperparameters are not learned through model training but often determined through an optimization process.
(They are not learned during training, but are usually set through an optimization process.)

training data

data used to train a model. A final model is trained using the full training and validation data.

validation data

data used to evaluate model performance during the training process. Validation data helps prevent choosing a model that overfits the training data (see overfitting below). A final model is trained using the full training and validation data.

test data

data used to simulate new observations. Test data is split from a full data set early in the machine learning process and not used during preprocessing and model training steps. Test data is used to evaluate a final model.

resubstitution validation

using the training data to evaluate a machine learning model. This approach provides no protection against overfitting because the same observations used to train the model are substituted into the model for calculating metrics.

overfitting

a model that obtains high accuracy with the training data but does poorly with new data. This often happens because the model fits to random fluctuations in the training data. Validation data helps prevent overfitting by using a subset of data to evaluate model performance.

underfitting

a model that is too simplistic to capture some trends in the data, resulting in large errors. For example, using a single sinusoid to model temperature may capture seasonal trends, but miss daily variations due to day and nighttime differences.

wide dataset

a dataset where the number of features is similar to, or greater than the number of observations.

Regression

Step 1 Data processing

  1. Data: import a data file into MATLAB
  2. Visualization: visualize the data to look for relationships between variables
  3. Processing: determine how the raw data should be cleaned before it is used to create predictive models

Import data

read data

data = readtable("path")

read data (to check whether data is missing)
files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv";
ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true);
taxi = readall(ds);
head(taxi)

Visualizations

Scatter plot
scatter(data.Distance, data.Fare)
gscatter(data.Distance, data.Fare, data.RateCode)
xlabel("Distance (mi.)")
ylabel("Fare ($)")
xlim([0 40])
ylim([0 200])

scatter: scatter plot; gscatter: scatter plot of three variables (the third is a categorical grouping variable)
xlabel, ylabel: axis labels
xlim, ylim: axis limits

Histogram
  1. Usually look at the full range first
histogram(data.Fare)
xlabel("Fare ($)")
ylabel("Occurrences")

  2. Then settle on the bin edges (fareBins)
fareBins = 0:1:60;
histogram(data.Fare, fareBins)
xlabel("Fare ($)")
ylabel("Occurrences")

fareBins sets the bin edges: 0 to 60 in steps of 1

boxplot
boxplot(taxiC.Distance, "Orientation", "horizontal");
xlabel("Distance")
Location
geoscatter(data.PickupLat, data.PickupLon, '.', 'SizeData', 1)
geolimits([40 41],[-75 -73])

You can drag the map and then update the code to set geolimits

clean data

check if data missing

files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv";
ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true);
taxi = readall(ds);
head(taxi)
numMissing = nnz(ismissing(taxi))

nnz counts the nonzero (true) elements, so this gives the number of missing entries


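However, a value that is present is not necessarily a reasonable value

  • For example, a distance of 0 or less is unreasonable
Inspect with summary
summary(taxi)
Percentage of unreasonable values (should be < 2%)
distErrPcnt = 100*nnz(taxi.Distance <= 0)/height(taxi)
locationCleanLoss = 100*( 1 - height(taxiC2)/height(taxiC)) % run this after taxiC2 has been created by the location filter
Remove unreasonable values (by value)
taxiC = taxi(taxi.Distance > 0, :);
Inspect percentile values with prctile
pTilesDistance = prctile(taxiC.Distance, [0, 99.99])

Remove unreasonable values (by percentile) with rmoutliers
taxiC = rmoutliers(taxiC, "percentiles", [0, 99.99], "DataVariables", "Distance");
histogram(taxiC.Distance, 100);
xlabel("Distance");

This removes the top 0.01% of the data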


  • Coordinates that land in the sea are unreasonable, so restrict the latitude/longitude range
lat1 = 40; lat2 = 42;
lon1 = -75; lon2 = -73;
loc2keep = taxiC.PickupLat >= lat1 & taxiC.PickupLat <= lat2 & ...
    taxiC.DropoffLat >= lat1 & taxiC.DropoffLat <= lat2 & ...
    taxiC.PickupLon >= lon1 & taxiC.PickupLon <= lon2 & ...
    taxiC.DropoffLon >= lon1 & taxiC.DropoffLon <= lon2;
taxiC2 = taxiC(loc2keep, :);

loc2keep is the logical index that decides which rows are kept

visualization
geoplot(taxiC2.PickupLat, taxiC2.PickupLon, "b.", "MarkerSize", 0.5);
title("Pickups");
geoplot(taxiC2.DropoffLat, taxiC2.DropoffLon, "r.", "MarkerSize", 0.5);
title("Dropoffs");

The points are denser than in the Location plot above

histogram(taxiC2.DropoffLat, 100);
hold on;
histogram(taxiC2.PickupLat, 100);
hold off;
legend(["DropoffLat" "PickupLat"]);

use hold to compare


  • time
show time
taxiC2.TimeOfDay = timeofday(taxiC2.PickupTime);
histogram(taxiC2.TimeOfDay);
xlabel("Pickup Time of Day");

How to tell whether cleaning is needed (look at the range the histogram spans)
taxiC2.TimeOfDay = hours(taxiC2.TimeOfDay);
taxiC2.Duration = minutes(taxiC2.DropoffTime - taxiC2.PickupTime);
histogram(taxiC2.Duration, 100);
xlabel("Trip Duration (minutes)");

process

summary
  1. Process the data (new variables can be derived from existing ones)
  2. Visualize with plots (histogram)
  3. If the range is too wide, the data needs cleaning
  4. Remove impossible values
  5. Compute the bounds with prctile
  6. Remove values outside the bounds with rmoutliers
  7. Check the remaining values again with histogram

Use histogram or boxplot to inspect the result
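
The cleaning steps above can be sketched end to end on a small synthetic table. This is only a minimal sketch, not the course's code: the table `trips`, its made-up `Distance` values, and the variable names are assumptions for illustration, while `prctile` and `rmoutliers` are called the same way as in the notes.

```matlab
% Synthetic stand-in for the taxi table (all values are made up)
rng(0);                                    % reproducible random numbers
n = 10000;
Distance = [abs(randn(n-3, 1))*3 + 0.1; 0; -1; 5000];  % two invalid rows, one extreme row
trips = table(Distance);

% 1) Drop impossible values (a distance must be positive)
tripsC = trips(trips.Distance > 0, :);

% 2) Inspect the percentile bounds before removing anything
bounds = prctile(tripsC.Distance, [0, 99.99]);

% 3) Remove rows outside those percentiles
tripsC = rmoutliers(tripsC, "percentiles", [0, 99.99], "DataVariables", "Distance");

% 4) Report how much data the cleaning cost (the notes aim for < 2%)
lossPcnt = 100 * (1 - height(tripsC)/height(trips));
fprintf("Upper bound: %.2f, rows lost: %.2f%%\n", bounds(2), lossPcnt);
```

Checking `lossPcnt` after each cleaning step keeps the amount of discarded data visible, which is the point of the percentage checks in the notes.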

Step 2 Model


Linear Regression

linear regression

MSE

MSE

advantage

advantage of Simple Linear

polynomial function

A power of a variable can be treated as a new variable, so the model still counts as linear regression

polynomial terms && interaction terms

Decision Trees

A split is a true/false question
Leaves hold the results (the average response)
Some parameters need to be set, e.g. the number of splits (to limit tree growth)

Types
  1. Fine Tree

    • usually deeper
    • fewer observations per leaf
  2. Medium Tree

    • between Fine Tree and Coarse Tree
  3. Coarse Tree

    • more observations per leaf

how to select model

  1. Apps > choose the model app
  2. New Session: select the data
  3. Select the response and predictor variables
  4. Select the validation scheme (Resubstitution Validation means no validation)
  5. Choose the x-axis to decide which predictor variable is plotted
  6. Feature selection
  7. Choose models (All Quick-To-Train is usually regression and trees, because they train fast) > Train
  8. Compare RMSE to decide which model is better
  9. Tune the model (Summary > Model Hyperparameters) (optimizer)
  10. Export
    • export: outputs the trained model directly; it cannot be called as a function, but its values can be inspected
    • generate function

Step 3 Training


build linear model
linearModel = fitlm(taxi, "linear", ...
    "ResponseVar", "Duration", ...
    "PredictorVars", ["Distance", "TimeOfDay"])
linearModel = fitlm(taxi, "poly14", ...
    "ResponseVar", "Duration", ...
    "PredictorVars", ["Distance", "TimeOfDay"])

polyjk: j is the highest degree of the first predictor and k is the highest degree of the second predictor; interaction terms are included, with their degree capped at the larger of the two

linearModel = fitlm(taxi, "Duration ~ 1 + Distance + TimeOfDay")
build tree
treeModel = fitrtree(taxi, "Duration", ...
    "PredictorNames", ["Distance", "TimeOfDay"])
view(treeModel) % text description in the command line
view(treeModel, "mode", "graph") % graphical view

customize:

MinLeafSize (default is 1)
MaxNumSplits
fine tree -> coarse tree

treeModel = fitrtree(taxi, "Duration", ...
    "PredictorNames", ["Distance", "TimeOfDay"], ...
    "MaxNumSplits", 20)
view coefficient
linearModel.Coefficients
use model to predict

predict(model, data)

yPredict = predict(linearModel, taxi)
yPredict = predict(treeModel, taxi)
compare
scatter(taxi.TimeOfDay, taxi.Duration, '.')
hold on
scatter(taxi.TimeOfDay, yPredict, '.')
hold off
legend("Actual", "Predict")

Evaluate Models

residuals

MAE

SSE

SSE = sum((yPredict - yActual).^2)
MSE

RMSE

MSE is not in the same units as the response, so take its square root

SST

SST = sum((yActual - mean(yActual)).^2)
R^2

The larger R^2 is, the better the fit
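
For reference, the metrics in this section can be written in standard notation. The symbols are assumptions chosen here rather than taken from the notes: $y_i$ is the actual response, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual responses, and $n$ the number of observations.

```latex
\begin{aligned}
e_i &= y_i - \hat{y}_i && \text{(residual)} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert \\
\mathrm{SSE} &= \sum_{i=1}^{n} e_i^{2} \\
\mathrm{MSE} &= \frac{\mathrm{SSE}}{n} \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{SST} &= \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^{2} \\
R^{2} &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
\end{aligned}
```

The SSE and SST lines match the two MATLAB expressions given in the notes; squaring makes the sign convention of the residual irrelevant.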

summary

rMetrics shows each of the evaluation metrics
yActual = taxi.Duration
rMetrics(yActual, yPredict)
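
rMetrics is not a built-in MATLAB function, and its internals are not shown in these notes. As a rough sketch, the same quantities can be computed by hand with base MATLAB; the two vectors below are made-up numbers used purely for illustration.

```matlab
% Made-up actual and predicted responses (illustration only)
yActual  = [10; 12; 15; 20; 25];
yPredict = [11; 11; 16; 18; 27];

res  = yActual - yPredict;                  % residuals
MAE  = mean(abs(res));                      % mean absolute error
SSE  = sum(res.^2);                         % sum of squared errors
MSE  = SSE / numel(yActual);                % mean squared error
RMSE = sqrt(MSE);                           % back in the units of the response
SST  = sum((yActual - mean(yActual)).^2);   % total sum of squares
R2   = 1 - SSE/SST;                         % coefficient of determination

fprintf("MAE = %.2f, RMSE = %.2f, R^2 = %.3f\n", MAE, RMSE, R2);
```

Dividing SSE by the number of observations is one common convention for MSE; some tools divide by the degrees of freedom instead, so small differences from other reported values are possible.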

compare

Predicted data
The more the orange and blue points overlap, the more accurate the model

The more symmetric the points are about the diagonal, the more accurate the model

Inspect the residuals (the closer the points lie to the horizontal line, the smaller the residuals)

Classification

Models


Logistic Regression

Function

coefficients decide threshold

when to use it

KNN

K = 3

difficult to capture and slower when k is big and the data is large

type

SVM

find a line that separates two classes

Kernel Method
Implement

Classification Learner App Workflow

  1. read data
  2. Apps > choose Classification Learner
  3. New Session > From Workspace
  4. data, response, predictors, validation
  5. An orange x marks a trip where a toll was paid but the model predicted no toll
  6. export
Confusion Matrix (recall, fallout, accuracy, precision)

TP = true positive
FN = false negative
FP = false positive
TN = true negative
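
In terms of these four counts, the metrics named above have the following standard definitions (stated here for reference; the symbols are the abbreviations just listed).

```latex
\begin{aligned}
\text{recall (TPR)} &= \frac{TP}{TP + FN} \\
\text{fallout (FPR)} &= \frac{FP}{FP + TN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{accuracy} &= \frac{TP + TN}{TP + FN + FP + TN}
\end{aligned}
```

Recall and fallout are the two axes of the ROC curve: recall on the vertical axis, fallout on the horizontal axis.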

Fallout && Recall relation

perfect situation

Logistic Regression

categorical
taxiData.WasTollPaidCat = categorical(taxiData.WasTollPaid,[false true],["No Toll" "Toll"]);
fitglm
logRegMdl = fitglm(taxiData,'linear','Distribution','binomial', ...
    "PredictorVars",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],...
    "ResponseVar","WasTollPaidCat")
predict (scores)
scoresLogReg = predict(logRegMdl,taxiData)
convert scores to predictions
thresholdLogReg = 0.5;
predictedLogReg = scoresLogReg >= thresholdLogReg;
predictedLogReg = categorical(predictedLogReg , [false true], ["No Toll" "Toll"])
Performance Metrics (confusionchart) (cMetrics)
confusionchart(taxiData.WasTollPaidCat,predictedLogReg)
cmLogReg = confusionmat(taxiData.WasTollPaidCat,predictedLogReg)
recallTrueClass = cmLogReg(2,2)/(cmLogReg(2,2)+cmLogReg(2,1))
cMetrics(taxiData.WasTollPaidCat,predictedLogReg)
ROC curve
[falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll");
plot(falloutsLogReg,recallsLogReg);
xlabel("FPR (Fallout)");
ylabel("TPR (Recall)");
hold on;
[falloutTLogReg,recallTLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll","TVals",thresholdLogReg);
ccDot = plot(falloutTLogReg,recallTLogReg,"ro","MarkerFaceColor","r");
title("ROC Curve with Positive Class: Toll")
legend(ccDot, "T = " + string( thresholdLogReg ) + " Recall = " + string(recallTLogReg) + " Fallout = " + string(falloutTLogReg) ,...
    "Location","best")
hold off;

The red dot is determined by the threshold (the cutoff that separates positive from negative)
If No Toll is the positive class, this becomes

[falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,1-scoresLogReg,"No Toll");
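
To see how the threshold moves the operating point along the ROC curve, recall and fallout can be computed by hand for a couple of thresholds. This is a minimal sketch with made-up labels and scores (the vectors `isToll` and `scores` are assumptions for illustration, not the taxi data), with Toll as the positive class.

```matlab
% Made-up ground truth and scores (illustration only)
isToll = logical([0; 0; 1; 1; 0; 1]);            % true class (1 = Toll)
scores = [0.10; 0.35; 0.80; 0.45; 0.20; 0.90];   % predicted probability of Toll

thresholds = [0.3 0.5];
recall  = zeros(size(thresholds));
fallout = zeros(size(thresholds));
for k = 1:numel(thresholds)
    pred = scores >= thresholds(k);              % predicted Toll at this threshold
    TP = nnz( pred &  isToll);                   % toll paid, predicted toll
    FN = nnz(~pred &  isToll);                   % toll paid, predicted no toll
    FP = nnz( pred & ~isToll);                   % no toll, predicted toll
    TN = nnz(~pred & ~isToll);                   % no toll, predicted no toll
    recall(k)  = TP / (TP + FN);                 % true positive rate
    fallout(k) = FP / (FP + TN);                 % false positive rate
    fprintf("T = %.1f  recall = %.2f  fallout = %.2f\n", ...
        thresholds(k), recall(k), fallout(k));
end
```

With these numbers, lowering the threshold from 0.5 to 0.3 raises recall from 0.67 to 1.00 but also raises fallout from 0.00 to 0.33, which is the trade-off the ROC curve traces out.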

KNN

build model
knnMdl = fitcknn(taxiData,"WasTollPaidCat","NumNeighbors",50,"DistanceWeight","equal",...
    "PredictorNames",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],...
    "ResponseName","WasTollPaidCat")
predict (default threshold is 0.5)
[predictedKNN,scoresKNN] = predict(knnMdl,taxiData)
convert scores to predictions
thresholdKNN = 0.5;
predKNNNoToll = scoresKNN(:,1) >= thresholdKNN;
predKNNNoToll = categorical(predKNNNoToll,[true false],["No Toll" "Toll"])
Performance Metrics
cMetrics(taxiData.WasTollPaidCat,predKNNNoToll)
ROC curve
[falloutsKNN,recallsKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll");
clf
plot(falloutsKNN,recallsKNN);
xlabel("FPR (Fallout)");
ylabel("TPR (Recall)");
hold on;
[falloutTKNN,recallTKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll","TVals",thresholdKNN);
ccDot = plot(falloutTKNN,recallTKNN,"ro","MarkerFaceColor","r");
title("ROC Curve with Positive Class: No Toll")
legend(ccDot, "T = " + string(thresholdKNN) + " Fallout = " + string(falloutTKNN) + " Recall = " + string(recallTKNN) )
hold off;
Higher accuracy is not necessarily better

Many x marks (misclassifications)

Predictions concentrate in a single class

When choosing K, consider how many observations each class contains

Multiclass

one vs one && one vs all

Choose Optimal Model (Validation)

overfit and underfit

solve overfitting
add more data => validation data

Validation

holdout

k-fold

compare

partition test and train
rng(1);
taxiPartitions = cvpartition(height(taxiData), "HoldOut", 0.2)
taxiTestIdx = test(taxiPartitions)
taxiTest = taxiData(taxiTestIdx, : );
taxiTrainIdx = training(taxiPartitions)
taxiTrain = taxiData(taxiTrainIdx, : );
taxiTrain = basicPreprocessing(taxiTrain);
taxiTrain = addTimeOfDay(taxiTrain);
taxiTrain = addDayOfWeek(taxiTrain);
  1. setting the seed
rng(11)
  2. cvpartition with height
healthData_holdout = cvpartition(height(healthData),"Holdout",0.4)
  3. apply training and test
trainingDataR = healthData(training(healthData_holdout), : )
testDataR = healthData(test(healthData_holdout), : )
  4. create two separate data sets for training/validation and test with indices from step 3
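
The holdout code above has a k-fold counterpart that uses the same `cvpartition`, `training`, and `test` functions. The following is a minimal sketch on synthetic data: the table `tbl`, its columns `x` and `y`, and the choice of five folds are assumptions made for illustration.

```matlab
rng(1);                                     % reproducible partition and noise
n = 200;
x = linspace(0, 10, n)';
y = 3*x + 2 + randn(n, 1);                  % synthetic linear data with unit noise
tbl = table(x, y);

k = 5;
folds = cvpartition(n, "KFold", k);         % k disjoint validation folds
rmse = zeros(k, 1);
for i = 1:k
    trainTbl = tbl(training(folds, i), :);  % k-1 folds for training
    valTbl   = tbl(test(folds, i), :);      % remaining fold for validation
    mdl      = fitlm(trainTbl, "y ~ 1 + x");
    yHat     = predict(mdl, valTbl);
    rmse(i)  = sqrt(mean((valTbl.y - yHat).^2));
end
meanRMSE = mean(rmse);                      % validation error averaged over folds
fprintf("Mean validation RMSE over %d folds: %.3f\n", k, meanRMSE);
```

Every observation is used for validation exactly once, at the cost of training the model k times instead of once.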

feature selection

  • Filter Methods


  • Wrapper Methods


  • Embedded Methods

embedded methods
impValues = predictorImportance(trainedModel.ClassificationTree)

The higher the value, the more important the feature

On the left are the impValues; the curve is the cumulative value
The features with the highest importance can then be selected

obsTrainSmall = obsTrain(:, [3032 654 948 2328 9])

Then train again

Regularization to Prevent Overfitting

penalty term

overfitting

underfitting

calculation (each model uses a different formula)

Lasso regression can also be seen as a form of feature selection, because it can shrink coefficients (β) to 0
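
As a sketch of what the penalty term looks like for least-squares regression, the textbook forms of ridge and lasso are given below. The notation is an assumption chosen here ($\beta_0$ intercept, $\beta_j$ coefficients, $\lambda \ge 0$ regularization strength, $p$ features); specific functions may scale the terms differently or use a different loss, as noted above.

```latex
\begin{aligned}
\text{ridge:}\quad & \min_{\beta_0,\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\text{lasso:}\quad & \min_{\beta_0,\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\end{aligned}
```

A larger $\lambda$ pushes the coefficients toward zero. The absolute-value penalty can set some of them exactly to zero, while the squared penalty only shrinks them.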

default setting (hyperparameter)

process
  1. Standardize
meanObs = mean(obsTrain)
stdObs = std(obsTrain)
obsTrain = (obsTrain - meanObs)./stdObs
  2. train ridge and lasso models
    • fitrlinear (regression)
    • fitclinear (classification)
mdl = fitclinear(obsTrain, grpTrain, ...
    "Learner", "logistic", ...
    "Regularization", "ridge", ...
    "KFold", 20)
  3. predict
grpPredict = kfoldPredict(mdl);
cMetrics(grpTrain, grpPredict)

Use lasso regression when you want to remove some features.
Use ridge regression when you want all the features to contribute.

Ensemble Models

Models may have similar accuracy and still make different predictions; in that case an ensemble model is needed to combine several models

cost:

  1. training time
  2. memory utilization
  3. prediction speed

Boosted Ensembles

process

Results

Bagged Ensembles

Results

Bagged Ensembles (Random Forests)

problem

If every tree is trained on the same data, the trees may end up with very similar structures

solution

Split the data into subsets with different features
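
A bagged and a boosted regression ensemble can be compared with `fitrensemble`. This is a minimal sketch under assumptions: it uses the `carsmall` sample data that ships with the Statistics and Machine Learning Toolbox rather than the taxi data, and the 50 learning cycles and 5 folds are arbitrary choices for illustration.

```matlab
load carsmall                               % sample data: Weight, Horsepower, MPG, ...
cars = rmmissing(table(Weight, Horsepower, MPG));   % drop rows with missing values

rng(1);                                     % reproducible folds and bootstrap samples

% Bagged trees, evaluated with 5-fold cross-validation
cvBag = fitrensemble(cars, "MPG", ...
    "Method", "Bag", "NumLearningCycles", 50, "KFold", 5);

% Boosted trees (least-squares boosting), same validation scheme
cvBoost = fitrensemble(cars, "MPG", ...
    "Method", "LSBoost", "NumLearningCycles", 50, "KFold", 5);

rmseBag   = sqrt(kfoldLoss(cvBag));         % kfoldLoss returns MSE for regression
rmseBoost = sqrt(kfoldLoss(cvBoost));
fprintf("Bagged RMSE: %.2f, boosted RMSE: %.2f\n", rmseBag, rmseBoost);
```

Both ensembles train 50 trees per fold, so the costs listed above (training time, memory, prediction speed) grow with the number of learning cycles.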

Parameter

Model Parameters

  • estimated from data
  • values are optimized by the algorithm itself
  • they're not manually set

Model Hyperparameters

  • cannot be estimated from data
  • can be manually set
  • used to help estimate model parameters
  • For example, the value of K in KNN

K and the distance metric (KNN) are used as the example here

How to determine

Grid Search

Try every combination of K and distance metric and see which combination works best
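
A grid search over K and the distance metric can be sketched with two nested loops. The example below is only an illustration under assumptions: it uses the `fisheriris` sample data from the Statistics and Machine Learning Toolbox, and the candidate values in `kValues` and `distances` are arbitrary.

```matlab
load fisheriris                             % meas (150x4 features), species (labels)
rng(1);
cvp = cvpartition(species, "KFold", 5);     % same folds for every candidate

kValues   = [1 3 5 11];
distances = {'euclidean', 'cityblock', 'chebychev'};
loss = zeros(numel(kValues), numel(distances));

for i = 1:numel(kValues)
    for j = 1:numel(distances)
        cvMdl = fitcknn(meas, species, ...
            "NumNeighbors", kValues(i), ...
            "Distance", distances{j}, ...
            "CVPartition", cvp);
        loss(i, j) = kfoldLoss(cvMdl);      % cross-validated misclassification rate
    end
end

[bestLoss, idx] = min(loss(:));             % best combination on the grid
[bi, bj] = ind2sub(size(loss), idx);
fprintf("Best: K = %d, distance = %s, loss = %.3f\n", ...
    kValues(bi), distances{bj}, bestLoss);
```

Reusing one partition for all candidates keeps the comparison fair, since every combination is scored on the same validation folds.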

Random Search


Test Model

  1. Test > Test Data (select taxiTest)
  2. select test model
  3. test all > test selected
  4. Check the residual plot to see whether the model overfits
Summary: the basic end-to-end ML workflow

Here, reducing complexity means feature selection

Using Your Model

process

Other languages can also call MATLAB code

  1. create project
  2. commit
  3. share

MATLAB can convert code to other languages and deploy it onto hardware

MATLAB can turn a model into a GUI; this interface can be used as a MATLAB web app

Automated machine learning

  • fitcauto
  • fitrauto
load data
load ovariancancer.mat obs grp
% Set the rng seed
rng(2);
cv = cvpartition(grp,"Holdout",0.2);
% Split into training and test data
obsTrain = obs(training(cv),:);
grpTrain = grp(training(cv));
obsTest = obs(test(cv),:);
grpTest = grp(test(cv));
% Normalize the training and test data
meanObs = mean(obsTrain);
stdObs = std(obsTrain);
obsTrainNorm = (obsTrain - meanObs)./stdObs
Select feature

The code below uses χ² tests to select 100 predictive features, which is 2.5% of the original 4000 features.

% Use chi-squared tests to rank features by importance
[idx,scores] = fscchi2(obsTrainNorm,grpTrain);
% Create new training set using top 100 features
obsTrainSmall = obsTrainNorm(:,idx(1:100))
Use fitcauto to select a model and hyperparameters
mdl = fitcauto(obsTrainSmall,grpTrain);

Test the optimized model
% Apply same pre-processing steps
obsTestNorm = (obsTest - meanObs)./stdObs;
obsTestSmall = obsTestNorm(:,idx(1:100));
% Predict labels
grpPredict = predict(mdl,obsTestSmall);
% Display metrics
cMetrics(grpTest,grpPredict)
% Display confusion matrix
confusionchart(grpTest,grpPredict)
Disadvantages
  • Long training times
  • Lack of full-workflow automation
  • No guarantee of the "best" model (because the iterations are all based on default settings)