Author: 蔡承佑
data
find best model training options
Overfitting: fitting the noise, i.e. using an overly complex model to predict the data.
Machine learning model: an algorithm that predicts a response using a set of features. A model is trained on existing data and used to make predictions on new observations.
Regression model: a machine learning model that outputs a continuous numeric response. For example, predicting stock prices is a regression problem.
Classification model: a machine learning model that outputs a prediction from a discrete set of possible outcomes. For example, predicting whether a medical image indicates healthy tissue or cancer is a classification problem.
Model parameters: values used to create the model. Some model parameters are learned by the machine learning algorithm during training. Other parameters are set by the user prior to training.
Hyperparameters: parameters required by the model that are set by the user. Hyperparameters are not learned through model training but are often determined through an optimization process.
(Not learned during training; usually determined through an optimization process.)
Training data: data used to train a model. A final model is trained using the full training and validation data.
Validation data: data used to evaluate model performance during the training process. Validation data helps prevent choosing a model that overfits the training data (see overfitting below). A final model is trained using the full training and validation data.
Test data: data used to simulate new observations. Test data is split from the full data set early in the machine learning process and is not used during the preprocessing and model training steps. Test data is used to evaluate a final model.
Resubstitution validation: using the training data to evaluate a machine learning model. This approach provides no protection against overfitting because the same observations used to train the model are substituted into the model for calculating metrics.
Overfitting: a model that obtains high accuracy with the training data but does poorly with new data. This often happens because the model fits random fluctuations in the training data. Validation data helps prevent overfitting by using a subset of the data to evaluate model performance.
Underfitting: a model that is too simplistic to capture some trends in the data, resulting in large errors. For example, using a single sinusoid to model temperature may capture seasonal trends but miss the daily variation between day and night.
Wide (high-dimensional) data: a dataset where the number of features is similar to, or greater than, the number of observations.
data = readtable("path")
files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv";
ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true);
taxi = readall(ds);
head(taxi)
scatter(data.Distance, data.Fare)
gscatter(data.Distance, data.Fare, data.RateCode)
xlabel("Distance (mi.)")
ylabel("Fare ($)")
xlim([0 40])
ylim([0 200])
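scatter: scatter plot. gscatter: scatter plot with three variables (the third is a categorical grouping variable).
xlabel, ylabel: axis labels.
xlim, ylim: axis ranges of the plot.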
histogram(data.Fare)
xlabel("Fare ($)")
ylabel("Occurrences")
fareBins = 0:1:60;
histogram(data.Fare, fareBins)
xlabel("Fare ($)")
ylabel("Occurrences")
fareBins sets the bin edges: from 0 to 60 in steps of 1.
boxplot(taxiC.Distance, "Orientation", "horizontal");
xlabel("Distance")
geoscatter(data.PickupLat, data.PickupLon, '.', 'SizeData', 1)
geolimits([40 41],[-75 -73])
Dragging the map and clicking update sets the geolimits.
files = "C:\Users\TUF Gaming\Documents\MATLAB\Predictive Modeling and Machine Learning\Predictive Modeling and Machine Learning\Taxi Data\yellow_tripdata_2015-01.csv";
ds = fileDatastore(files, "ReadFcn", @importTaxiDataWithoutCleaning, "UniformRead", true);
taxi = readall(ds);
head(taxi)
numMissing = nnz(ismissing(taxi))
nnz counts the nonzero elements of its input. Applied to ismissing(taxi), it counts the missing entries in the table.
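As a minimal sketch of what this count means, the toy table below (hypothetical values, not the taxi data) has one missing number and one missing string. `ismissing` returns a logical mask with one entry per table cell, and `nnz` counts the `true` entries.

```matlab
% Toy table with two missing entries (hypothetical values, for illustration only)
Distance = [1.2; NaN; 3.4];
RateCode = ["Standard"; missing; "JFK"];
T = table(Distance, RateCode);

missingMask = ismissing(T);                  % 3-by-2 logical, true where a cell is missing
numMissing = nnz(missingMask);               % total number of missing cells
rowsWithMissing = nnz(any(missingMask, 2));  % number of rows with at least one missing cell
```

In this toy table there are two missing cells, both in the second row, so counting rows and counting cells give different answers.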
summary(taxi)
distErrPcnt = 100*nnz(taxi.Distance <= 0)/height(taxi)
locationCleanLoss = 100*( 1 - height(taxiC2)/height(taxiC)) % evaluate only after taxiC and taxiC2 have been created below
taxiC = taxi(taxi.Distance > 0, :);
pTilesDistance = prctile(taxiC.Distance, [0, 99.99])
Remove the top 0.01% of the data (the longest trips):
taxiC = rmoutliers(taxiC, "percentiles", [0, 99.99], "DataVariables", "Distance");
histogram(taxiC.Distance, 100);
xlabel("Distance");
The histogram shows Distance after the top 0.01% of the data has been removed.
lat1 = 40;
lat2 = 42;
lon1 = -75;
lon2 = -73;
loc2keep = taxiC.PickupLat >= lat1 & taxiC.PickupLat <= lat2 & ...
taxiC.DropoffLat >= lat1 & taxiC.DropoffLat <= lat2 & ...
taxiC.PickupLon >= lon1 & taxiC.PickupLon <= lon2 & ...
taxiC.DropoffLon >= lon1 & taxiC.DropoffLon <= lon2;
taxiC2 = taxiC(loc2keep, :);
loc2keep is the logical index (mask) that decides which rows to keep.
geoplot(taxiC2.PickupLat, taxiC2.PickupLon, "b.", "MarkerSize", 0.5);
title("Pickups");
geoplot(taxiC2.DropoffLat, taxiC2.DropoffLon, "r.", "MarkerSize", 0.5);
title("Dropoffs");
The points are visibly denser than in the Location plot above.
histogram(taxiC2.DropoffLat, 100); hold on;
histogram(taxiC2.PickupLat, 100); hold off;
legend(["DropoffLat" "PickupLat"]);
Use hold on to overlay the two histograms for comparison.
taxiC2.TimeOfDay = timeofday(taxiC2.PickupTime);
histogram(taxiC2.TimeOfDay);
xlabel("Pickup Time of Day");
taxiC2.TimeOfDay = hours(taxiC2.TimeOfDay);
taxiC2.Duration = minutes(taxiC2.DropoffTime - taxiC2.PickupTime);
histogram(taxiC2.Duration, 100);
xlabel("Trip Duration (minutes)");
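A power of a predictor can be treated as a new variable, so the model still counts as (linear) regression.
Polynomial terms and interaction terms
Each split is a true/false question.
Leaves hold the results (the average response).
Some parameters need to be set, e.g. the number of splits (to limit tree growth).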
Fine Tree
Medium Tree
Coarse Tree
linearModel = fitlm(taxi, "linear", ...
"ResponseVar", "Duration", ...
"PredictorVars", ["Distance", "TimeOfDay"])
linearModel = fitlm(taxi, "poly14", ...
"ResponseVar", "Duration", ...
"PredictorVars", ["Distance", "TimeOfDay"])
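polyjk: j is the highest degree of the first predictor and k is the highest degree of the second predictor. Interaction terms are included as well, as long as their degree does not exceed the larger of j and k.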
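To make the naming rule concrete, here is a sketch of the terms that "poly14" would contain for the two predictors above, writing d for Distance and t for TimeOfDay. It assumes the rule documented for fitlm model names (degree up to 1 in the first predictor, up to 4 in the second, interactions up to total degree 4); the coefficient numbering is only my own labeling.

```latex
\text{Duration} = \beta_0 + \beta_1 d
  + \beta_2 t + \beta_3 t^2 + \beta_4 t^3 + \beta_5 t^4
  + \beta_6\, d\,t + \beta_7\, d\,t^2 + \beta_8\, d\,t^3
```

The fitted model's coefficient table lists the terms that were actually used, which is a quick way to check this expansion.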
linearModel = fitlm(taxi, "Duration ~ 1 + Distance + TimeOfDay")
treeModel = fitrtree(taxi, "Duration", ...
"PredictorNames", ["Distance", "TimeOfDay"])
view(treeModel) % in command line
view(treeModel, "mode", "graph")
customize:
MinLeafSize (default is 1)
MaxNumSplits
Increasing MinLeafSize or decreasing MaxNumSplits moves from a fine tree toward a coarse tree.
treeModel = fitrtree(taxi, "Duration", ...
"PredictorNames", ["Distance", "TimeOfDay"], ...
"MaxNumSplits", 20)
linearModel.Coefficients
predict(model, data)
yPredict = predict(linearModel, taxi)
yPredict = predict(treeModel, taxi)
scatter(taxi.TimeOfDay, taxi.Duration, '.')
hold on
scatter(taxi.TimeOfDay, yPredict, '.')
hold off
legend("Actual", "Predict")
yActual = taxi.Duration
SSE = sum((yPredict - yActual).^2)
MSE is in squared units of the response, so take the square root (RMSE) to get back to the same units.
SST = sum((yActual - mean(yActual)).^2)
rMetrics(yActual, yPredict)
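The quantities above can be combined into the usual regression metrics. The sketch below uses four made-up durations rather than the taxi data, and it is not the course's rMetrics helper, whose contents are not shown in these notes; it only illustrates how SSE, MSE, RMSE, and R-squared relate.

```matlab
% Hypothetical actual and predicted durations in minutes (illustration only)
yActual  = [10; 12; 15; 20];
yPredict = [11; 11; 16; 18];

SSE  = sum((yPredict - yActual).^2);        % sum of squared errors
MSE  = SSE / numel(yActual);                % mean squared error, in minutes^2
RMSE = sqrt(MSE);                           % root mean squared error, back in minutes
SST  = sum((yActual - mean(yActual)).^2);   % total sum of squares around the mean
R2   = 1 - SSE/SST;                         % fraction of variance explained
```

An R2 close to 1 means the model explains most of the variation in the response, while RMSE gives the typical error size in the response's own units.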
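Predicted data
The more the orange and blue points overlap, the more accurate the model.
The more symmetric the points are about the diagonal, the more accurate the model.
Inspect the residuals (the closer the points lie to the horizontal line, the smaller the residuals).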
Function
coefficients decide threshold
when to use it
K = 3
Patterns are harder to capture, and prediction is slower, when k is big and the data set is large.
type
Find a line that separates the two classes
Kernel Method
TP = true positive
FN = false negative
FP = false positive
TN = true negative
Relationship between fallout and recall
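In terms of the four counts defined above, recall (true positive rate) and fallout (false positive rate) are the standard ratios below.

```latex
\text{Recall (TPR)} = \frac{TP}{TP + FN}
\qquad
\text{Fallout (FPR)} = \frac{FP}{FP + TN}
```

Lowering the classification threshold labels more observations as positive, so recall and fallout can only rise or stay the same together. The ROC curve traces this trade-off across all thresholds.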
taxiData.WasTollPaidCat = categorical(taxiData.WasTollPaid,[false true],["No Toll" "Toll"]);
logRegMdl = fitglm(taxiData,'linear','Distribution','binomial', ...
"PredictorVars",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],...
"ResponseVar","WasTollPaidCat")
scoresLogReg = predict(logRegMdl,taxiData)
thresholdLogReg = 0.5;
predictedLogReg = scoresLogReg >= thresholdLogReg;
predictedLogReg = categorical(predictedLogReg , [false true], ["No Toll" "Toll"])
confusionchart(taxiData.WasTollPaidCat,predictedLogReg)
cmLogReg = confusionmat(taxiData.WasTollPaidCat,predictedLogReg)
recallTrueClass = cmLogReg(2,2)/(cmLogReg(2,2)+cmLogReg(2,1))
cMetrics(taxiData.WasTollPaidCat,predictedLogReg)
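As a rough sketch of how such metrics follow from a confusion matrix, the example below uses a made-up 2-by-2 matrix in the layout returned by confusionmat (rows are true classes, columns are predicted classes, ordered No Toll then Toll). It is not the course's cMetrics helper, whose contents are not shown in these notes.

```matlab
% Hypothetical confusion matrix: rows = true class, columns = predicted class
%            pred No Toll   pred Toll
cm = [            90            10;     % true No Toll
                   5            15];    % true Toll

TN = cm(1,1);  FP = cm(1,2);
FN = cm(2,1);  TP = cm(2,2);

recall    = TP / (TP + FN);         % share of actual Toll trips that were found
fallout   = FP / (FP + TN);         % share of No Toll trips wrongly flagged as Toll
precision = TP / (TP + FP);         % share of predicted Toll trips that are correct
accuracy  = (TP + TN) / sum(cm(:)); % share of all trips classified correctly
```

With imbalanced classes such as these, accuracy can look good even when recall is modest, which is why the notes track recall and fallout separately.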
[falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll");
plot(falloutsLogReg,recallsLogReg);
xlabel("FPR (Fallout)");
ylabel("TPR (Recall)");
hold on;
[falloutTLogReg,recallTLogReg] = perfcurve(taxiData.WasTollPaidCat,scoresLogReg,"Toll","TVals",thresholdLogReg);
ccDot = plot(falloutTLogReg,recallTLogReg,"ro","MarkerFaceColor","r");
title("ROC Curve with Positive Class: Toll")
legend(ccDot, "T = " + string( thresholdLogReg ) + " Recall = " + string(recallTLogReg) + " Fallout = " + string(falloutTLogReg) ,...
"Location","best")
hold off;
The red dot is determined by the threshold (the cutoff that separates positive from negative predictions).
If No Toll is the positive class, this becomes:
[falloutsLogReg,recallsLogReg] = perfcurve(taxiData.WasTollPaidCat,1-scoresLogReg,"No Toll");
knnMdl = fitcknn(taxiData,"WasTollPaidCat","NumNeighbors",50,"DistanceWeight","equal",...
"PredictorNames",[ "PickupLon" "PickupLat" "DropoffLon" "DropoffLat"],...
"ResponseName","WasTollPaidCat")
[predictedKNN,scoresKNN] = predict(knnMdl,taxiData)
thresholdKNN = 0.5;
predKNNNoToll = scoresKNN(:,1) >= thresholdKNN;
predKNNNoToll = categorical(predKNNNoToll,[true false],["No Toll" "Toll"])
cMetrics(taxiData.WasTollPaidCat,predKNNNoToll)
[falloutsKNN,recallsKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll");
clf
plot(falloutsKNN,recallsKNN);
xlabel("FPR (Fallout)");
ylabel("TPR (Recall)");
hold on;
[falloutTKNN,recallTKNN] = perfcurve(taxiData.WasTollPaidCat,scoresKNN(:,1),"No Toll","TVals",thresholdKNN);
ccDot = plot(falloutTKNN,recallTKNN,"ro","MarkerFaceColor","r");
title("ROC Curve with Positive Class: No Toll")
legend(ccDot, "T = " + string(thresholdKNN) + " Fallout = " + string(falloutTKNN) + " Recall = " + string(recallTKNN) )
hold off;
Many predictors (x)
Concentrated in a single class
The choice of K depends on how many observations each class contains.
One-vs-one and one-vs-all
solve overfitting
add more data => validation data
holdout
k-fold
compare
rng(1);
taxiPartitions = cvpartition(height(taxiData), "HoldOut", 0.2)
taxiTestIdx = test(taxiPartitions)
taxiTest = taxiData(taxiTestIdx, : );
taxiTrainIdx = training(taxiPartitions)
taxiTrain = taxiData(taxiTrainIdx, : );
taxiTrain = basicPreprocessing(taxiTrain);
taxiTrain = addTimeOfDay(taxiTrain);
taxiTrain = addDayOfWeek(taxiTrain);
rng(11)
healthData_holdout = cvpartition(height(healthData),"Holdout",0.4)
trainingDataR = healthData(training(healthData_holdout), : )
testDataR = healthData(test(healthData_holdout), : )
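The holdout splits above have a k-fold counterpart. The sketch below is a minimal illustration on synthetic data (the variables x and y and the linear model are assumptions, not part of the course data): each of the 5 folds serves once as validation data while the model is trained on the remaining folds.

```matlab
% Synthetic data for illustration only (not the taxi or health data)
rng(1);
n = 100;
x = rand(n, 1);
y = 3*x + 0.1*randn(n, 1);
tbl = table(x, y);

c = cvpartition(n, "KFold", 5);          % 5 folds of roughly equal size
rmse = zeros(c.NumTestSets, 1);
for i = 1:c.NumTestSets
    trainTbl = tbl(training(c, i), :);   % rows used to fit the model in fold i
    validTbl = tbl(test(c, i), :);       % rows held out for validation in fold i
    mdl = fitlm(trainTbl, "y ~ 1 + x");
    yPred = predict(mdl, validTbl);
    rmse(i) = sqrt(mean((yPred - validTbl.y).^2));
end
meanRMSE = mean(rmse);                   % average validation error over the folds
```

Compared with a single holdout split, k-fold uses every observation for validation exactly once, at the cost of training the model k times.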
impValues = predictorImportance(trainedModel.ClassificationTree)
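A higher value means the feature is more important.
On the left are the impValues; the curve is their cumulative value.
Next, select the features with the highest importance.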
obsTrainSmall = obsTrain(:, [3032 654 948 2328 9])
Then train the model again on the reduced feature set.
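One way to pick the column indices from the importance values is to sort them and keep the features that account for most of the total importance. The sketch below uses a made-up importance vector, and the 85% cutoff is an arbitrary assumption for illustration.

```matlab
% Hypothetical importance values for 8 features (illustration only)
impValues = [0.02 0.30 0.00 0.15 0.08 0.25 0.01 0.19];

[sortedImp, order] = sort(impValues, "descend");   % most important feature first
cumShare = cumsum(sortedImp) / sum(sortedImp);     % cumulative share of total importance

numKeep = find(cumShare >= 0.85, 1);   % smallest count that reaches the assumed 85% cutoff
topIdx = order(1:numKeep);             % column indices of the features to keep
```

The resulting topIdx plays the same role as the hand-picked index list above: it selects the columns of the training matrix to retrain on.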
overfitting
underfitting
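Calculation (each model uses a different calculation method)
Lasso regression can also be viewed as a form of feature selection, because it drives some coefficients (B) to 0.
Default settings (hyperparameters)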
meanObs = mean(obsTrain)
stdObs = std(obsTrain)
obsTrain = (obsTrain - meanObs)./stdObs
mdl = fitclinear(obsTrain, grpTrain, ...
"Learner", "logistic", ...
"Regularization", "ridge", ...
"KFold", 20)
grpPredict = kfoldPredict(mdl);
cMetrics(grpTrain, grpPredict)
Use lasso regression when you want to remove some features.
Use ridge regression when you want all the features to contribute.
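The difference between the two comes from the penalty added to the training loss. As a sketch in my own notation (L is the unpenalized loss, the coefficients are written as beta, and lambda is the regularization strength; specific functions may scale the penalty by a constant):

```latex
\text{Ridge:}\quad \min_{\beta}\; L(\beta) + \lambda \sum_{j} \beta_j^{2}
\qquad
\text{Lasso:}\quad \min_{\beta}\; L(\beta) + \lambda \sum_{j} \lvert \beta_j \rvert
```

The squared penalty shrinks all coefficients smoothly toward zero without usually making them exactly zero, while the absolute-value penalty can set some coefficients exactly to zero, which removes those features from the model.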
Models may have similar accuracy yet make different predictions. In that case an ensemble model is needed to combine multiple models.
cost:
- training time
- memory utilization
- prediction speed
Process
Results
Results
problem
If every tree is trained on the same data, the resulting trees may have highly similar structures.
solution
Split the data into subsets that use different features.
Here K and the distance metric (KNN) are used as the example.
Grid Search
Try every combination of K and distance metric and see which combination performs best.
Random Search
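A minimal grid-search sketch for the KNN example above. It assumes the fisheriris sample data that ships with the Statistics and Machine Learning Toolbox instead of the course data, and the candidate values of K and the distance metrics are arbitrary choices. Each combination is scored with 5-fold cross-validation loss.

```matlab
% Grid search over K and distance metric, scored by 5-fold cross-validation
rng(1);
load fisheriris                                    % provides meas (150-by-4) and species
ks = [1 5 15];                                     % candidate values of K (assumed)
dists = {'euclidean', 'cityblock', 'chebychev'};   % candidate distance metrics (assumed)

loss = zeros(numel(ks), numel(dists));
for i = 1:numel(ks)
    for j = 1:numel(dists)
        cvMdl = fitcknn(meas, species, ...
            "NumNeighbors", ks(i), "Distance", dists{j}, "KFold", 5);
        loss(i, j) = kfoldLoss(cvMdl);             % misclassification rate for this pair
    end
end

[bestLoss, bestIdx] = min(loss(:));                % lowest cross-validated loss
[iBest, jBest] = ind2sub(size(loss), bestIdx);
bestK = ks(iBest);
bestDist = dists{jBest};
```

A random search would score only a random sample of the (K, distance) pairs instead of all of them, which scales better when there are many hyperparameters.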
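Here, reducing complexity means feature selection.
Other languages can also call MATLAB code.
MATLAB can convert code to other languages and deploy it to hardware.
MATLAB can turn a model into a GUI and serve that interface as a MATLAB web app.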
load ovariancancer.mat obs grp
% Set the rng seed
rng(2);
cv = cvpartition(grp,"Holdout",0.2);
% Split into training and test data
obsTrain = obs(training(cv),:);
grpTrain = grp(training(cv));
obsTest = obs(test(cv),:);
grpTest = grp(test(cv));
% Normalize the training and test data
meanObs = mean(obsTrain);
stdObs = std(obsTrain);
obsTrainNorm = (obsTrain - meanObs)./stdObs
The code below uses χ² tests to select 100 predictive features, which is 2.5% of the original 4000 features.
% Use chi-squared tests to rank features by importance
[idx,scores] = fscchi2(obsTrainNorm,grpTrain);
% Create new training set using top 100 features
obsTrainSmall = obsTrainNorm(:,idx(1:100))
mdl = fitcauto(obsTrainSmall,grpTrain);
% Apply same pre-processing steps
obsTestNorm = (obsTest - meanObs)./stdObs;
obsTestSmall = obsTestNorm(:,idx(1:100));
% Predict labels
grpPredict = predict(mdl,obsTestSmall);
% Display metrics
cMetrics(grpTest,grpPredict)
% Display confusion matrix
confusionchart(grpTest,grpPredict)