### 1. Problem: Classify Weather Types Based on Weather-related Features

This problem involves classifying weather into four categories, `Rainy`, `Sunny`, `Cloudy`, and `Snowy`, based on a set of input features describing environmental conditions. Weather classification is particularly useful because it can be applied in real time. Accurately predicting the current weather provides precise data for quick decisions and makes it possible to react rapidly to sudden changes in conditions. Real-time predictions also tend to be more accurate because they rely on the most recent data, which helps reduce the risks posed by extreme events such as storms or cold snaps. Accurate and timely forecasts can therefore save lives and avoid large financial losses by allowing action to be taken in time. For example, by September 2024 there had been 20 weather disasters in the U.S. causing over $53 billion in damage, underlining how much accurate and timely forecasts can do to reduce the impact of such events (NCEI, 2024).

### 2. Dataset Description

The weather classification dataset contains 13,200 observations and 11 features, of which 7 are numerical variables and 4 are categorical variables:

* The dataset contains no missing or duplicate values.
* Numerical features: Temperature, Humidity, Wind Speed, Precipitation, Atmospheric Pressure, UV Index, Visibility.
* Categorical features: Cloud Cover, Season, Location, Weather Type.

##### Dataset dictionary:

![Screenshot 2024-12-28 at 2.52.26 PM](https://hackmd.io/_uploads/SyGHgQ6Skl.png)

### 3. Initial Data Analysis and Visualization

The IDA process has four parts: exploring the target variable distribution, visualizing the numerical features, visualizing the categorical features, and exploring relationships between the target variable and the other variables. A brief plotting sketch for figures of this kind is included after Section 3.4.

##### 3.1. Target Variable (Weather Type) Distribution

The pie chart shows the distribution of `Weather Type` across its categories (Cloudy, Rainy, Snowy, Sunny). Each category accounts for an equal 25% of the data, so the dataset is balanced, which is good for modelling and mitigates class bias.

![image](https://hackmd.io/_uploads/rJ5alQ6SJl.png)

##### 3.2. Visualizing Numerical Features

Distribution and density of all numerical attributes: the plots show that `Wind Speed` and `Visibility` are skewed, potentially requiring pre-processing for better model performance. `Atmospheric Pressure` contains extreme values that should be handled carefully.

![image](https://hackmd.io/_uploads/Hk7kZX6HJg.png)

##### 3.3. Visualizing Categorical Features

The target feature (Weather Type) is balanced, which helps in building a robust classifier. `Cloud Cover` has an uneven distribution, while `Location` is evenly distributed. `Season` contains roughly twice as much winter data as any other season, so predictors may perform better on winter observations.

![image](https://hackmd.io/_uploads/rJPz-Q6Bkg.png)

##### 3.4. Weather Type Distribution by Other Variables

The bar charts show how the weather types are spread across factors such as temperature, cloud cover, season, and location, and suggest that each weather type is associated with specific conditions. For example, some weather types occur more often within certain temperature ranges, visibility levels, or seasons.

![image](https://hackmd.io/_uploads/Hyq4Wmpr1x.png)
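
The plotting code behind these figures is not included in the report. As a rough illustration only, the sketch below shows how comparable plots could be produced with `ggplot2` (an assumption; the package is not used elsewhere in this report), relying on the data frame `data` and the column names used later in the analysis.

```{r, warning=FALSE, message=FALSE}
# Sketch only: example IDA plots, assuming `data` is loaded with the column
# names used later in the report (Weather.Type, Temperature, Season).
library(ggplot2)

# Target distribution (shown here as a bar chart rather than a pie chart)
ggplot(data, aes(x = Weather.Type, fill = Weather.Type)) +
  geom_bar() +
  labs(title = "Weather Type Distribution", x = "Weather Type", y = "Count")

# Density of a numerical feature, e.g. Temperature
ggplot(data, aes(x = Temperature)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  labs(title = "Temperature Density")

# Weather type broken down by a categorical feature, e.g. Season
ggplot(data, aes(x = Season, fill = Weather.Type)) +
  geom_bar(position = "dodge") +
  labs(title = "Weather Type by Season")
```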
##### 3.5. Justification for Using All Data in Prediction

In the prediction tasks that follow, we use all of the features in order to capture as much information as possible. Here is why:

* Using all features is effective: each variable provides a different insight into the weather. Numerical data gives exact measurements, while the categorical variables describe the environment. By using all of these variables, the models can learn from a wide range of influences.
* Boosting model performance: using all features helps the model capture complex patterns and relationships, which improves accuracy. Dropping variables could discard valuable information, especially when those variables are important for predicting weather types.
* Balanced weather types: because the Weather Type classes are balanced, the model can generalize better. Including both numerical and categorical features helps the model distinguish between the different weather conditions more effectively.

### 4. Data Preprocessing & Feature Engineering

##### 4.1. Encoding Categorical Variables

* Variables such as Cloud Cover, Season, Location, and Weather Type are converted to factors so that the algorithms treat them as distinct groups or labels rather than as arbitrary strings.

```{r, warning=FALSE, message=FALSE}
# Convert categorical columns to factors
data$Cloud.Cover <- as.factor(data$Cloud.Cover)
data$Season <- as.factor(data$Season)
data$Location <- as.factor(data$Location)
data$Weather.Type <- as.factor(data$Weather.Type)
```

##### 4.2. Outlier Handling and Handling Unusual Values

* We remove unrealistic values in the Humidity, Precipitation, and Wind Speed columns based on their expected physical ranges. This prevents impossible values from hurting the model's learning and performance.

```{r, warning=FALSE, message=FALSE}
library(dplyr)

# Filter unrealistic humidity values (valid range 0-100%)
data <- data %>% filter(Humidity >= 0 & Humidity <= 100)

# Filter unrealistic precipitation values (valid range 0-100%)
data <- data %>% filter(Precipitation >= 0 & Precipitation <= 100)

# Filter unrealistic wind speeds (36.9 is taken as the maximum plausible hurricane-level wind speed)
data <- data %>% filter(Wind.Speed >= 0 & Wind.Speed <= 36.9)
```

* We use the IQR rule to find outliers in the Temperature, Atmospheric Pressure, and Visibility columns. Any value more than 1.5 × IQR below the first quartile or above the third quartile is treated as an outlier and removed.

```{r, warning=FALSE, message=FALSE}
# IQR outlier detection for temperature
Q1 <- quantile(data$Temperature, 0.25)
Q3 <- quantile(data$Temperature, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
data <- data %>% filter(Temperature >= lower_bound & Temperature <= upper_bound)

# IQR outlier detection for atmospheric pressure
Q1 <- quantile(data$Atmospheric.Pressure, 0.25)
Q3 <- quantile(data$Atmospheric.Pressure, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
data <- data %>% filter(Atmospheric.Pressure >= lower_bound & Atmospheric.Pressure <= upper_bound)

# IQR outlier detection for visibility
Q1 <- quantile(data$Visibility.km, 0.25)
Q3 <- quantile(data$Visibility.km, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
data <- data %>% filter(Visibility.km >= lower_bound & Visibility.km <= upper_bound)

# Remove any rows left with NAs after handling unusual values
data_clean <- na.omit(data)
```
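
The three IQR blocks above repeat the same steps for each column. Purely as an illustration of one way to reduce that duplication, the sketch below wraps the rule in a small helper function; the function name `filter_iqr` is ours and is not used elsewhere in the analysis.

```{r, warning=FALSE, message=FALSE}
# Sketch only: a helper that applies the same 1.5 * IQR rule to any numeric column.
# The name `filter_iqr` is illustrative and not part of the original analysis.
filter_iqr <- function(df, column, k = 1.5) {
  x <- df[[column]]
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  iqr <- Q3 - Q1
  df[x >= Q1 - k * iqr & x <= Q3 + k * iqr, ]
}

# Equivalent to the three blocks above:
# data <- filter_iqr(data, "Temperature")
# data <- filter_iqr(data, "Atmospheric.Pressure")
# data <- filter_iqr(data, "Visibility.km")
```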
##### 4.3. Feature Scaling

* We scale the numerical columns with the `scale()` function so that each has mean 0 and standard deviation 1. This standardizes the data and can improve model performance.

```{r, warning=FALSE, message=FALSE}
library(dplyr)
numeric_cols <- c("Temperature", "Humidity", "Wind.Speed", "Precipitation",
                  "Atmospheric.Pressure", "UV.Index", "Visibility.km")
data_scaled <- data_clean
data_scaled[numeric_cols] <- scale(data_clean[numeric_cols])
```

##### 4.4. Feature Engineering

* A new categorical feature, Temp.Category, is created by dividing Temperature into three ranges: Cold, Mild, and Hot. This simplifies the relationship the model has to learn, especially for models that struggle with complex, non-linear relationships between temperature and the target.

```{r, warning=FALSE, message=FALSE}
library(dplyr)
data_scaled$Temp.Category <- cut(data_scaled$Temperature,
                                 breaks = c(-Inf, 15, 30, Inf),
                                 labels = c("Cold", "Mild", "Hot"))
```

* An interaction feature, Humidity_Wind_Interaction, is created by multiplying Humidity and Wind.Speed. It captures the combined effect of these two variables on the weather, letting the model account for how they act together. For instance, high wind speed and high humidity may have a larger impact in combination than either does on its own. Interaction terms can reveal relationships between variables that are not obvious when they are considered separately.

```{r, warning=FALSE, message=FALSE}
data_scaled$Humidity_Wind_Interaction <- data_scaled$Humidity * data_scaled$Wind.Speed
```

### 5. Classification Algorithms Used & Model Evaluation

##### 5.1. Split Dataset

The dataset is split into 70% training data and 30% testing data using stratified sampling on the Weather.Type variable, which keeps the classes balanced in both sets.

```{r, warning=FALSE, message=FALSE}
library(caret)
set.seed(5003)
trainIndex <- createDataPartition(data$Weather.Type, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
```

##### 5.2.1 Support Vector Machine (SVM) Model Used

**Initial Model**

A basic SVM model with a radial kernel was applied, using the default regularization parameter (C = 1).

```{r, warning=FALSE, message=FALSE}
library(e1071)
library(caret)
svm_model <- svm(Weather.Type ~ ., data = train_data, kernel = "radial")
pred_svm <- predict(svm_model, test_data)
```

**Fine-tuned Model**

Using the `train()` function from the `caret` package, we performed a random search for the best hyperparameters, focusing on `C` (how much the model tolerates errors) and `sigma` (how far each support vector's influence reaches). We used 5-fold cross-validation to prevent overfitting. The best values for `C` and `sigma` were **1.18** and **0.06**, respectively; a sketch of this tuning step is shown after the bullet points below.

- **Parameter Tuning**:
  - `C` controls how strict the model is about misclassifications. A smaller `C` allows more mistakes, helping the model generalize, while a larger `C` fits the training data more closely but increases the risk of overfitting.
  - `sigma` controls how far each support vector influences the decision boundary. Tuning `sigma` balances flexibility and smoothness in the model's predictions.
  - The final SVM model was chosen based on the highest cross-validation accuracy (reached in the 7th iteration) with the best `C` and `sigma` values.
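
The report does not include the code for this tuning step. A minimal sketch of how it could look is given below, mirroring the `caret` random-search setup used for the decision tree later in the report; the object names and `tuneLength` value are assumptions.

```{r, warning=FALSE, message=FALSE}
# Sketch only: random search over C and sigma for a radial-kernel SVM,
# mirroring the caret tuning used for the decision tree later in the report.
# Object names (train_control_svm, best_svm_model) and tuneLength are assumptions.
library(caret)
set.seed(5003)
train_control_svm <- trainControl(method = "cv", number = 5, search = "random")
best_svm_model <- train(Weather.Type ~ ., data = train_data,
                        method = "svmRadial",
                        trControl = train_control_svm,
                        tuneLength = 10)

# Best C and sigma found by the search
best_svm_model$bestTune

# Predictions on the test set with the tuned model
pred_svm_tuned <- predict(best_svm_model, newdata = test_data)
```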
##### 5.2.2 SVM Model Evaluation

**Confusion Matrix**

The confusion matrix shows how well the model's predictions match the actual values.

- Most Accurate Predictions: The model does very well on Snowy and Sunny, with many correct predictions in both categories.
- Major Misclassifications: The model has trouble telling Cloudy and Rainy apart, with many Cloudy cases wrongly predicted as Rainy.
- Overall: The confusion matrix supports the earlier results of high accuracy and sensitivity, but it highlights where improvement is needed, particularly in distinguishing Cloudy from Rainy.

![image](https://hackmd.io/_uploads/r1szXQaSyx.png)

**ROC Curve**

ROC curves visualize how well the classifier performs across different thresholds.

- AUC Scores: The Area Under the Curve (AUC) for each weather type is shown in the legend; scores closer to 1 mean better performance. Here the AUC for Cloudy is 0.96, while Rainy, Snowy, and Sunny reach 0.99, so the model is very good at separating these weather types.
- Interpreting the Curves: The curves almost reach the top-left corner, which is the ideal shape for an ROC curve, meaning the model is both highly sensitive and highly specific. The Cloudy class has slightly lower sensitivity, shown by its curve sitting a little closer to the diagonal line that represents random guessing.

![image](https://hackmd.io/_uploads/rkbNm76HJe.png)

**Overview SVM Evaluation**

- Accuracy: The model achieved a high accuracy of 97.21%, correctly classifying most of the test data.
- Mean Sensitivity: Also 97.21%, showing the model identifies true positives well across all weather types.
- Mean Precision: 97.22%, meaning that when the model predicts a given weather type, it is usually correct.
- Mean F1-Score: The F1-score, which balances precision and recall, is 97.21%, indicating the model both identifies and correctly predicts weather types well.

![image](https://hackmd.io/_uploads/ryJCQXpByl.png)

##### 5.3.1. Model-2 Decision Tree Used

**Initial Model**

- Training: A basic decision tree is built with the `rpart()` function, with the method set to "class" for classification.
- **Exploring the decision tree plot**:
  - The tree first splits on whether `Temperature >= 7`, and then splits further on `UV Index` and `Precipitation`.
  - Each node shows the probability of each weather type and highlights the most likely type for that split.
  - The leaf nodes represent the final predictions, with the majority weather type in each node determining the outcome.

```{r, warning=FALSE, message=FALSE, fig.align='center'}
library(rpart)
library(rpart.plot)
set.seed(5003)
decision_tree_model <- rpart(Weather.Type ~ ., data = train_data, method = "class")

# Plot the decision tree
rpart.plot(decision_tree_model)

# Predictions on the test set
test_pred <- predict(decision_tree_model, newdata = test_data, type = "class")
```

![image](https://hackmd.io/_uploads/HJ_MEXaryl.png)

**Fine-tuned Model**

Using the `train()` function from the `caret` package, we performed a random search to fine-tune the complexity parameter (`cp`), with 5-fold cross-validation to avoid overfitting and to ensure the model generalizes well. The best `cp` value found was **5.90e-04**. This fine-tuned model should perform better on the test set because of the optimized tree structure.

- **Parameter Tuning**:
  - `cp` controls how much the tree is pruned. A smaller `cp` lets the tree grow deeper and fit the data closely, at the risk of overfitting; a larger `cp` prunes the tree more, reducing complexity but possibly underfitting.
  - The final decision tree model was chosen based on the highest cross-validation accuracy, reached in the 5th iteration with the optimal `cp` value of **5.90e-04**. This `cp` balances the tree's complexity against its predictive ability.
  - The tuning process tested different `cp` values, and as the **accuracy vs. complexity parameter** plot shows, accuracy drops at higher `cp` values, indicating that too much pruning hurts performance.

```{r, warning=FALSE, message=FALSE, fig.align='center'}
library(knitr)
set.seed(5003)
train_control_dt <- trainControl(method = "cv", number = 5, search = "random")
best_decision_tree_model <- train(Weather.Type ~ ., data = train_data,
                                  method = "rpart",
                                  trControl = train_control_dt,
                                  tuneLength = 10)

# Print the best complexity parameter
dt_best_tune <- best_decision_tree_model$bestTune
dt_best_tune_df <- as.data.frame(dt_best_tune)
dt_best_tune_df$cp <- formatC(dt_best_tune_df$cp, format = "e", digits = 2)
kable(dt_best_tune_df, caption = "Best Complexity for Decision Tree Model")

plot(best_decision_tree_model, main = "Best Decision Tree Model: Complexity Parameter vs Accuracy")

# Predict on the test data
test_pred_dt <- predict(best_decision_tree_model, newdata = test_data)
```

![Screenshot 2024-12-28 at 3.09.17 PM](https://hackmd.io/_uploads/HJPEEXpH1l.png)

![image](https://hackmd.io/_uploads/S1WVvXaSyl.png)

##### 5.3.2. Decision Tree Model Evaluation

**Confusion Matrix**

- Accuracy: The model predicts Rainy and Snowy well, with few mistakes. However, it struggles to differentiate Cloudy from Sunny, leading to a high number of misclassifications between these two weather types.
- Error Distribution: Most mistakes occur when the model predicts Sunny or Rainy instead of Cloudy, suggesting that the features used by the decision tree do not clearly separate these weather types.

![image](https://hackmd.io/_uploads/SJevwXpSye.png)

**ROC Curve**

- **Overall Performance**:
  - The curves sit very close to the top-left corner, meaning the model performs well across all weather types.
  - The AUC scores are high for all classes, ranging from 0.95 for Cloudy to 0.99 for Snowy.
- **Class Needing Improvement**:
  - Cloudy has the lowest AUC score (0.95), meaning the model struggles slightly more to differentiate it from the other weather types. This matches the confusion matrix, where many Cloudy cases were incorrectly labelled as Sunny.

![image](https://hackmd.io/_uploads/Hy-0PXaSJe.png)

**Overview Model Evaluation**

- Strong Overall Performance: All key metrics (accuracy, sensitivity, precision, and F1-score) are around 96.98%, showing that the model performs well across the board.
- Balanced Performance: The similar values for sensitivity and precision mean the model balances catching true positives against avoiding false positives, which matters in multi-class classification where both kinds of error count.
- Robustness: The high test accuracy and balanced F1-score suggest the model generalizes well, avoiding underfitting and overfitting.

![Screenshot 2024-12-28 at 3.25.11 PM](https://hackmd.io/_uploads/HJNyOmprke.png)
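
The confusion matrices, ROC curves, and summary metrics reported for each model in this section are not accompanied by code in the report. The sketch below shows one way they could be computed, illustrated with the tuned decision tree predictions; the use of the `pROC` package and the one-vs-rest AUC construction are assumptions rather than the report's original method, and the same pattern would apply to the other models.

```{r, warning=FALSE, message=FALSE}
# Sketch only: one way to produce the confusion matrix, per-class AUC values,
# and summary metrics, shown here for the tuned decision tree. pROC and the
# one-vs-rest construction are assumptions, not the report's original code.
library(caret)
library(pROC)

# Confusion matrix and overall accuracy
cm <- confusionMatrix(test_pred_dt, test_data$Weather.Type)
cm$table
cm$overall["Accuracy"]

# Mean sensitivity, precision, and F1-score across the four classes
colMeans(cm$byClass[, c("Sensitivity", "Precision", "F1")])

# One-vs-rest AUC for each class, from predicted class probabilities
prob_dt <- predict(best_decision_tree_model, newdata = test_data, type = "prob")
classes <- levels(test_data$Weather.Type)
auc_scores <- sapply(classes, function(cls) {
  roc_cls <- roc(response = as.numeric(test_data$Weather.Type == cls),
                 predictor = prob_dt[[cls]])
  as.numeric(auc(roc_cls))
})
auc_scores

# Example one-vs-rest ROC curve for a single class
roc_cloudy <- roc(response = as.numeric(test_data$Weather.Type == "Cloudy"),
                  predictor = prob_dt[["Cloudy"]])
plot(roc_cloudy, main = "One-vs-rest ROC: Cloudy")
```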
##### 5.4.1. Model-3 Random Forest Used

**Initial Model**

In the initial model, the default random forest parameters were used for training: `mtry` is 3, `ntree` is 500, and `nodesize` is 1 by default. During training, the OOB (Out-of-Bag) error rate of the model is 8.56%, which indicates good generalization ability. As the number of trees grows towards 500, the OOB error rate gradually stabilizes, suggesting that building more trees does not significantly improve performance at this stage.

```{r, warning=FALSE, message=FALSE, fig.align='center'}
library(randomForest)

# Defaults for classification: mtry = sqrt(number of predictors), ntree = 500, nodesize = 1
set.seed(5003)

# Train the random forest classifier with default parameters
default_rf_model <- randomForest(
  formula = Weather.Type ~ .,
  data = train_data
)

# Plot the OOB error rate against the number of trees
plot(
  default_rf_model$err.rate[, "OOB"],
  type = "l",
  col = "blue",
  lwd = 2,
  xlab = "Number of Trees",
  ylab = "OOB Error Rate",
  main = "OOB Error Rate vs Number of Trees"
)
```

![image](https://hackmd.io/_uploads/ryiIdmaHJe.png)

**Fine-tuned Model**

- **Use grid search to find the best mtry**: Grid search is used to find the optimal `mtry` value. The search selected `mtry = 3` as the best value, giving the highest cross-validation accuracy of 97.88%.

```{r, warning=FALSE, message=FALSE}
library(knitr)
set.seed(5003)
train_control_rf <- trainControl(method = 'cv', number = 10, search = 'grid')
tune_grid_rf <- expand.grid(mtry = c(1:sqrt(ncol(train_data) - 1)))
rf_model <- train(Weather.Type ~ ., data = train_data,
                  method = 'rf',
                  trControl = train_control_rf,
                  tuneGrid = tune_grid_rf)

# Output the best mtry and its cross-validation results
best_mtry <- rf_model$bestTune$mtry
best_result <- rf_model$results[rf_model$results$mtry == best_mtry, ]
best_result$Accuracy <- paste0(round(best_result$Accuracy * 100, 2), "%")
best_result$Kappa <- paste0(round(best_result$Kappa * 100, 2), "%")
best_result$AccuracySD <- formatC(best_result$AccuracySD, format = "e", digits = 2)
best_result$KappaSD <- formatC(best_result$KappaSD, format = "e", digits = 2)
kable(best_result, caption = "Best Tuned Random Forest Model Results",
      align = c("l", "l", "l", "l", "l", "l"))
```

![Screenshot 2024-12-28 at 3.28.04 PM](https://hackmd.io/_uploads/H1Nqd7pSyl.png)

- **Use a for loop to find the best number of trees**:
  - Each line in the plot represents a different class error. The four colored dashed lines correspond to the four target categories (Cloudy, Rainy, Snowy, and Sunny), while the solid black line is the overall OOB (Out-of-Bag) error estimate across all samples.
  - The choice of **425** trees is reasonable because adding more trees beyond this point does not significantly improve model performance.
```{r, warning=FALSE, message=FALSE, fig.align='center'}
library(knitr)
ntree_values <- seq(50, 500, by = 25)
oob_error_rates <- numeric(length(ntree_values))

set.seed(5003)
for (i in 1:length(ntree_values)) {
  rf_model1 <- randomForest(as.factor(Weather.Type) ~ ., data = train_data,
                            mtry = 3,
                            ntree = ntree_values[i],
                            importance = TRUE,
                            oob.prox = TRUE)
  # OOB error rate after the final tree of this forest
  oob_error_rates[i] <- rf_model1$err.rate[ntree_values[i], "OOB"]
}

# Find the best ntree and the minimum OOB error rate
best_ntree <- ntree_values[which.min(oob_error_rates)]
min_oob_error <- min(oob_error_rates)

ntree_result <- data.frame(
  Best_ntree = best_ntree,
  Min_OOB_Error = formatC(min_oob_error, format = "e", digits = 2)
)
kable(ntree_result, caption = "Best Ntree and Its OOB Error Rate", align = c("l", "l"))

plot(rf_model1)
```

![Screenshot 2024-12-28 at 3.30.12 PM](https://hackmd.io/_uploads/BJfMY76S1x.png)

![image](https://hackmd.io/_uploads/rkmJtQpSkx.png)

- **Use a for loop to find the best nodesize**: The nodesize parameter is tuned, and the value with the best mean cross-validation accuracy is selected.

```{r, warning=FALSE, message=FALSE}
# Define the range of nodesize values
nodesize_values <- c(1:10)
store_nodesize <- list()

set.seed(5003)
tune_grid_rf1 <- expand.grid(mtry = 3)

# Train a model for each nodesize value
for (nodesize in nodesize_values) {
  rf_model2 <- train(
    Weather.Type ~ .,
    data = train_data,
    method = "rf",
    tuneGrid = tune_grid_rf1,
    trControl = train_control_rf,
    ntree = 425,
    nodesize = nodesize
  )
  store_nodesize[[as.character(nodesize)]] <- rf_model2
}

# Find the best nodesize based on mean cross-validation accuracy
results_nodesize <- resamples(store_nodesize)
summary_results_rf <- summary(results_nodesize)
mean_accuracies <- summary_results_rf$statistics$Accuracy[, "Mean"]
best_nodesize_index <- which.max(mean_accuracies)
best_nodesize <- nodesize_values[best_nodesize_index]

nodesize_result <- data.frame(
  Optimal_Nodesize = best_nodesize,
  Mean_Accuracy = paste0(round(max(mean_accuracies) * 100, 2), "%")
)
kable(nodesize_result, caption = "Best Nodesize and Its Mean Accuracy", align = c("l", "l"))
```

![Screenshot 2024-12-28 at 3.31.34 PM](https://hackmd.io/_uploads/r14vtQaSyl.png)

- **Train the final random forest model**: Refitting with the best parameter values from the grid search and cross-validation tuning gives a final random forest with an OOB error rate of 2.11%. The model's classification error rate is below 5% for all four categories: Cloudy, Rainy, Snowy, and Sunny.
```{r, warning=FALSE, message=FALSE}
library(knitr)
set.seed(5003)
final_rf_model <- randomForest(
  formula = Weather.Type ~ .,
  data = train_data,
  mtry = 3,
  ntree = 425,
  nodesize = 1,
  importance = TRUE
)

# Prediction on the test set
test_pred_rf <- predict(final_rf_model, newdata = test_data)

# Display class errors and the OOB estimate
class_error_numeric <- as.numeric(final_rf_model$confusion[, "class.error"])
class_error <- data.frame("Class Error" = formatC(class_error_numeric, format = "e", digits = 2))
rownames(class_error) <- rownames(final_rf_model$confusion)
oob_error_rate <- final_rf_model$err.rate[nrow(final_rf_model$err.rate), "OOB"]
class_error["OOB Estimate", ] <- formatC(oob_error_rate, format = "e", digits = 2)
transposed_df <- as.data.frame(t(class_error))
kable(transposed_df, caption = "Class Error and OOB Estimate of Error Rate")
```

![Screenshot 2024-12-28 at 3.32.39 PM](https://hackmd.io/_uploads/SJ4iK7TSkl.png)

##### 5.4.2 Random Forest Model Evaluation

**Confusion Matrix**

- Strong Performance: The model does well overall, correctly predicting most instances, especially for Sunny, Snowy, and Cloudy, with over 820 correct predictions for each.
- Cloudy and Rainy Confusion: There are some misclassifications between Cloudy and Rainy: 25 Rainy instances were predicted as Cloudy and 16 Cloudy instances as Rainy. The features distinguishing these two weather types may be too weak or may overlap under certain conditions.
- Sunny Misclassifications: The model predicts Sunny weather accurately, with only a few misclassifications involving Cloudy and Rainy.

![image](https://hackmd.io/_uploads/HkTJcQar1l.png)

**ROC Curve**

- High AUC Scores: All four weather types have AUC scores above 0.97, showing that the Random Forest model performs very well for every class.
- Slight Misclassifications: The Cloudy class has a slightly lower AUC score (0.97) than the others, indicating the model struggles a little to distinguish Cloudy. This matches the confusion matrix, where Cloudy and Sunny were often mixed up.
- Best Performing Classes: Rainy and Snowy have the highest AUC scores (0.99), meaning the model is almost perfect at predicting these weather types.

![image](https://hackmd.io/_uploads/ByQZ57arJe.png)

**Overview Random Forest Evaluation**

- Test Set Accuracy: The model achieved 97.62% accuracy on the test set, showing that it generalizes well and makes accurate predictions.
- Overall Performance: The model has high accuracy, balanced sensitivity and precision, and a strong F1-score. The high AUC scores for most weather types highlight its ability to distinguish between them.
- Improvement: There is still slight room for improvement in differentiating Cloudy from Sunny weather.

![Screenshot 2024-12-28 at 3.34.34 PM](https://hackmd.io/_uploads/rkPzcQpHJe.png)
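
Because the final forest is trained with `importance = TRUE`, its variable importance can be inspected directly. This is not part of the original evaluation, but a short sketch is included here since feature importance is relevant to the interpretability concerns raised in Section 7.

```{r, warning=FALSE, message=FALSE, fig.align='center'}
# Sketch only: inspect which features the final forest relies on. This uses the
# importance scores already stored in final_rf_model (importance = TRUE above);
# it is an addition, not part of the original evaluation.
library(randomForest)

# Mean decrease in accuracy and in Gini impurity for each predictor
importance(final_rf_model)

# Dot chart of the two importance measures
varImpPlot(final_rf_model, main = "Variable Importance: Final Random Forest")
```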
##### 5.5.1. Model-4 XGBoost Used

**Initial Model**

- **Prepare Data for XGBoost**:
  - We convert the `Weather.Type` factor into numeric labels and subtract 1, since XGBoost expects class labels starting from 0.
  - The `model.matrix()` function converts the data into the matrix format XGBoost requires, after removing the `Weather.Type` column, which is our target variable.
  - `xgb.DMatrix()` is a dedicated data structure used by XGBoost to speed up training.

```{r, warning=FALSE, message=FALSE}
library(xgboost)
library(Matrix)

# Convert labels and prepare the data matrices
train_labels <- as.numeric(train_data$Weather.Type) - 1
test_labels <- as.numeric(test_data$Weather.Type) - 1

# Convert to matrix format (drop the target column)
train_matrix <- model.matrix(~.+0, data = train_data[,-which(names(train_data) %in% c("Weather.Type"))])
test_matrix <- model.matrix(~.+0, data = test_data[,-which(names(test_data) %in% c("Weather.Type"))])

# Create DMatrix objects
dtrain <- xgb.DMatrix(data = train_matrix, label = train_labels)
dtest <- xgb.DMatrix(data = test_matrix, label = test_labels)
```

- **Set Parameters for XGBoost**: These parameters are set up for multi-class classification. The `gbtree` booster uses decision trees to capture complex patterns, `multi:softmax` makes the model predict the class with the highest probability, and the `merror` metric measures the proportion of incorrect predictions, which is the quantity of interest when evaluating classification accuracy.

```{r, warning=FALSE, message=FALSE}
# Define parameters
params <- list(
  booster = "gbtree",
  objective = "multi:softmax",                    # Multi-class classification
  num_class = length(unique(data$Weather.Type)),  # Number of classes
  eval_metric = "merror"                          # Classification error rate
)
```

- **Perform 5-Fold Cross-Validation**:
  - Cross-validation (`xgb.cv`) splits the training data into 5 parts and trains the model 5 times, each time holding out one part as the validation set.
  - `nrounds = 100` is the maximum number of boosting rounds.
  - `early_stopping_rounds = 10` stops training if there is no improvement for 10 rounds.
  - `best_nrounds` stores the optimal number of boosting rounds, i.e. the round with the lowest test error.
  - Output: the model reached its lowest test classification error at the 7th iteration.

```{r, warning=FALSE, message=FALSE}
library(knitr)
set.seed(5003)

# Use cross-validation on the training set
cv_results <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 100,
  nfold = 5,
  stratified = TRUE,
  verbose = 0,
  early_stopping_rounds = 10,
  maximize = FALSE
)

# Show the best nrounds and its evaluation metrics
best_nrounds <- cv_results$best_iteration
best_nrounds_result <- cv_results$evaluation_log[best_nrounds, ]
best_nrounds_result$train_merror_mean <- formatC(best_nrounds_result$train_merror_mean, format = "e", digits = 2)
best_nrounds_result$train_merror_std <- formatC(best_nrounds_result$train_merror_std, format = "e", digits = 2)
best_nrounds_result$test_merror_mean <- formatC(best_nrounds_result$test_merror_mean, format = "e", digits = 2)
best_nrounds_result$test_merror_std <- formatC(best_nrounds_result$test_merror_std, format = "e", digits = 2)
kable(best_nrounds_result, caption = "Best Round Evaluation Metrics", align = c("l", "l", "l", "l", "l"))
```

![Screenshot 2024-12-28 at 3.35.27 PM](https://hackmd.io/_uploads/HJTP97arJe.png)

- **Train & Test Initial XGBoost Model:**
  - `xgb.train()` trains the XGBoost model on all the training data (`dtrain`) with the optimal number of boosting rounds (`best_nrounds`).
  - `predict()` generates predictions on the test data (`dtest`); the predicted class labels are stored in `test_pred_xgb`.
```{r, warning=FALSE, message=FALSE}
# Train the initial model using the best number of rounds
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)

# Predict on the test set
test_pred_xgb <- predict(xgb_model, dtest)

# Convert numeric predictions to categorical labels
class.name <- c("Cloudy", "Rainy", "Snowy", "Sunny")
test_pred_xgb <- factor(test_pred_xgb, levels = 0:3, labels = class.name)
```

**Fine-tuned Model**

- Tune Hyperparameters using Grid Search: We use grid search to fine-tune the hyperparameters of the XGBoost model for multi-class classification, looking for the combination that gives the best performance. The grid search helps optimize the model while avoiding overfitting, and cross-validation ensures the selected hyperparameters perform well across different data subsets.

```{r, warning=FALSE, message=FALSE}
library(knitr)
set.seed(5003)

# Parameter tuning using a grid search
tune_grid <- expand.grid(
  nrounds = seq(50, 150, by = 10),
  max_depth = seq(3, 10, by = 1),
  eta = c(0.01, 0.1, 0.3),          # Learning rate
  gamma = c(0, 1, 5),
  colsample_bytree = c(0.5, 0.7, 1),
  min_child_weight = c(1, 5, 10),
  subsample = c(0.6, 0.8, 1)
)

# Run caret train with the grid search
tuned_model <- train(
  x = train_matrix,
  y = train_labels,
  method = "xgbTree",
  tuneGrid = tune_grid,
  trControl = trainControl(method = "cv", number = 5),
  verbosity = 0
)

# Display the best parameters as a table
best_params <- tuned_model$bestTune
kable(best_params, caption = "Best Parameters from Grid Search Tuning",
      align = c("l", "l", "l", "l", "l", "l", "l", "l"))

final_params <- list(
  booster = "gbtree",
  objective = "multi:softmax",                    # Multi-class classification
  num_class = length(unique(data$Weather.Type)),  # Number of classes
  max_depth = best_params$max_depth,
  eta = best_params$eta,
  gamma = best_params$gamma,
  colsample_bytree = best_params$colsample_bytree,
  min_child_weight = best_params$min_child_weight,
  subsample = best_params$subsample
)

# Train the final model with the best number of rounds
final_model <- xgb.train(
  params = final_params,
  data = dtrain,
  nrounds = best_params$nrounds
)

# Predict on the test set
test_pred_xgb <- predict(final_model, dtest)

# Convert numeric predictions to categorical labels
test_pred_xgb <- factor(test_pred_xgb, levels = 0:3, labels = class.name)
```

![Screenshot 2024-12-28 at 3.37.27 PM](https://hackmd.io/_uploads/BJIp9Q6rJl.png)

##### 5.5.2. XGBoost Model Evaluation

**Confusion Matrix**

- Strong Performance: The model performs well, making many correct predictions, especially for Snowy and Sunny, where the error rate is low.
- Cloudy and Rainy Confusion: There is some confusion between Cloudy and Rainy, with 29 instances of each misclassified as the other, suggesting these two weather types share similar features in the dataset.
- Overall Accuracy: With a high number of correct predictions across all categories, the model shows strong overall accuracy.

![image](https://hackmd.io/_uploads/B1OWi76Bkl.png)

**ROC Curve**

- AUC Scores: An AUC close to 1 means the model has excellent discriminatory power. All AUC scores are above 0.95, showing the model performs well across all weather types.
- Strong Performance for Rainy and Snowy: The model performs nearly perfectly for Rainy and Snowy, with very little confusion in predicting these weather types.
- ROC Curve: The ROC curve plots Sensitivity (true positive rate) against 1 − Specificity (false positive rate). A curve closer to the top-left corner means better performance.
- High Sensitivity and Low False Positives: Since the curves for all four weather types sit close to the top-left corner, the model has high sensitivity and low false-positive rates for every weather type.

![image](https://hackmd.io/_uploads/rkrMsmaByg.png)

**Overview XGBoost Evaluation**

With accuracy, sensitivity, precision, and F1-score all around 97.5%, the XGBoost model shows excellent performance in classifying the different weather types, identifying the correct instances while maintaining high precision.

![Screenshot 2024-12-28 at 3.39.13 PM](https://hackmd.io/_uploads/rJxVjQpSyx.png)

### 6. Final Model Selection

##### 6.1. Compare with Confusion Matrix

**SVM Model**: SVM shows balanced performance across all weather types, but there are some misclassifications, especially for Cloudy and Sunny; many Cloudy instances were wrongly predicted as Rainy, while Snowy and Sunny were predicted more accurately.

**Decision Tree Model**: The decision tree does reasonably well but has more misclassification issues. It struggles to classify Cloudy correctly, often predicting it as Rainy or Sunny, although it handles Snowy well, with most of those classifications correct.

**Random Forest Model**: Random Forest does better than both the Decision Tree and SVM, with fewer misclassifications, especially for Cloudy and Rainy. Sunny and Snowy were predicted with high accuracy.

**XGBoost Model**: XGBoost performs similarly to Random Forest, with very few misclassifications; most Cloudy, Rainy, and Snowy instances were predicted correctly. Overall, XGBoost has very strong predictive power, especially for the Snowy and Rainy classes.

![image](https://hackmd.io/_uploads/Hk4dimpSyg.png)

##### 6.2. Compare with ROC Curve

All of the models are strong for all classes, with AUC values close to 1.

**SVM Model**: Snowy and Rainy have the highest AUC scores. Cloudy has the lowest AUC at 0.96, which still indicates very good classification.

**Decision Tree Model**: The Decision Tree shows slightly lower AUC values than SVM. Cloudy again has the lowest AUC, and its curve reflects some mistakes in classifying this weather type, but Snowy and Sunny perform well, with AUC values close to 0.99.

**Random Forest Model**: Cloudy has an AUC of 0.97, which is high but slightly lower than the other weather types.

**XGBoost Model**: As with Random Forest, Rainy and Snowy reach AUC scores of 0.99, while Cloudy and Sunny also perform well, showing XGBoost's strong ability to classify weather types correctly.

![image](https://hackmd.io/_uploads/ByLYjXpSke.png)

##### 6.3. Compare with Accuracy, Precision, Sensitivity and F1-Score

**Accuracy**: Random Forest has the highest accuracy at 97.62%, with XGBoost close behind at 97.53%. SVM and Decision Tree are slightly lower, at 97.21% and 96.97%, respectively.

**Mean Sensitivity (Recall)**: Random Forest and XGBoost again lead the other models with similar sensitivity, showing they are better at correctly identifying the actual positives in each class. SVM and Decision Tree have slightly lower sensitivity.

**Mean Precision (Positive Predictive Value)**: Random Forest leads with the highest precision, 97.63%, meaning that when it predicts a given weather type, it is more likely to be correct than the other models.

**Mean F1-Score**: Random Forest has the top F1-score at 97.62%, followed closely by XGBoost at 97.53%.
SVM and Decision Tree still perform well, with F1-scores above 96.9%, showing a good balance between precision and recall.

![Screenshot 2024-12-28 at 3.40.58 PM](https://hackmd.io/_uploads/HyYcj76HJx.png)

##### 6.4. Strengths & Limitations of Each Model

**SVM Model:**
- **Strengths:**
  - Strong generalization ability and works well on small datasets.
  - High accuracy, including when dealing with imbalanced data.
- **Limitations:**
  - Can be computationally expensive, especially for larger datasets.
  - Needs careful hyperparameter tuning, especially the kernel choice.

**Decision Tree Model:**
- **Strengths:**
  - Easy to interpret and explain to others.
  - A simple model that works well for quick insights.
- **Limitations:**
  - Tends to overfit, especially without pruning.
  - Accuracy and precision are lower than more complex models such as Random Forest or XGBoost.

**Random Forest Model:**
- **Strengths:**
  - Good at avoiding overfitting because it averages many trees.
  - Offers high accuracy and generalization across different datasets.
- **Limitations:**
  - Not as easy to interpret as a single decision tree.
  - Can be computationally heavy for large datasets.

**XGBoost Model:**
- **Strengths:**
  - Provides very high predictive accuracy and handles large datasets efficiently.
  - Built-in regularization helps reduce overfitting.
- **Limitations:**
  - Harder to tune because it has many hyperparameters.
  - Higher computational cost than simpler models.

##### 6.5. Select Final Model

After comparing the models on the confusion matrices, ROC curves, accuracy, precision, sensitivity, F1-score, and their strengths and limitations, the **Random Forest** model is the best choice for this weather classification task. It performs best overall, with the highest accuracy, precision, sensitivity, and F1-score across all weather types, handles overfitting well, and remains interpretable enough for practical use. The **XGBoost** model is a close second thanks to its strong performance and flexibility, but Random Forest is chosen for its balance of simplicity and strong results across all metrics.

### 7. Conclusion

##### 7.1. Discussion of Potential Shortcomings

**Development Process Shortcomings:**

- **Model Tuning Challenges:** Finding the best hyperparameters required many grid searches and cross-validation runs, which consumed substantial computation. This was especially true for Random Forest and XGBoost, which have many hyperparameters such as `max_depth`, `nrounds`, and `min_child_weight`. While this process improved performance, it added considerable time and resource cost to developing the final model. A more efficient approach, such as Bayesian optimization, might have reduced the time and computation required.

**Final Results Shortcomings:**

- **Interpretability:** Although Random Forest and XGBoost achieved the best accuracy, precision, sensitivity, and F1-score, they are harder to understand than simpler models such as decision trees. Explaining how these models make decisions can be difficult, which could be a problem in real-world uses where the model must be explained.
- **Overfitting Risk:** Even with cross-validation and other techniques, the very high performance suggests the models might overfit slightly. This is especially a risk for XGBoost, which is flexible and can overfit if not tuned properly.
##### 7.2. Future Work or Improvement Areas

**Development Process:**

- **Hyperparameter Optimization:** In future work, more efficient methods such as Bayesian optimization or genetic algorithms could make the tuning process faster and cheaper (Alibrahim & Ludwig, 2021), helping build better models with fewer resources.
- **Model Interpretability:** To address the interpretability issue, future work could use methods such as SHAP (SHapley Additive exPlanations), which make complex models like Random Forest and XGBoost easier to explain by showing how each feature contributes to the predictions (Nohara et al., 2022).

**Final Results:**

- **Performance Across Different Datasets:** Although the models worked well on this dataset, it is important to check how they perform on datasets with different characteristics (for example, more complex or imbalanced data) to ensure they generalize to new data and other kinds of problems.
- **Model Efficiency:** While Random Forest and XGBoost achieved high accuracy, they demand substantial computation, especially during training. It would be worth exploring more efficient variants or lightweight methods such as LightGBM to improve speed without losing performance (Truong et al., 2023).
- **Incorporating Temporal Data:** The current task does not consider how weather patterns change over time. Future work could use time-series models such as LSTM (Long Short-Term Memory) networks or Temporal Convolutional Networks (TCN) to capture those patterns and potentially improve prediction accuracy (Gopali et al., 2021).

##### 7.3. Conclusion

This project aimed to classify weather types using SVM, Decision Tree, Random Forest, and XGBoost models. After comparing the models on confusion matrices, ROC curves, accuracy, precision, sensitivity, and F1-scores, Random Forest was chosen as the best model, with XGBoost close behind. Both models showed a strong ability to generalize and classify the different weather types. However, the project also highlighted some issues, such as the high computational cost of tuning and the need for better ways to make the models deployable in the real world. Future work will focus on improving model efficiency, making the models more interpretable, testing on different datasets, and possibly incorporating temporal data to better capture weather patterns. In the end, this project demonstrated the potential of machine learning for real-time weather classification with high accuracy, while also pointing out areas that need further research and improvement.

### 8. References

- Alibrahim, H., & Ludwig, S. A. (2021). Hyperparameter optimization: Comparing genetic algorithm against grid search and Bayesian optimization. 2021 IEEE Congress on Evolutionary Computation (CEC). https://doi.org/10.1109/cec45853.2021.9504761
- Gopali, S., Abri, F., Siami-Namini, S., & Namin, A. S. (2021). A comparison of TCN and LSTM models in detecting anomalies in time series data. IEEE. https://doi.org/10.1109/BigData52589.2021.9671488
- NOAA National Centers for Environmental Information (NCEI). (2024). U.S. billion-dollar weather and climate disasters. https://www.ncei.noaa.gov/access/billions/. DOI: 10.25921/stkw-7w73
- Nohara, Y., Matsumoto, K., Soejima, H., & Nakashima, N. (2022). Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Computer Methods and Programs in Biomedicine, 214, 106584. https://doi.org/10.1016/j.cmpb.2021.106584
- Truong, V.-H., Tangaramvong, S., & Papazafeiropoulos, G. (2023). An efficient LightGBM-based differential evolution method for nonlinear inelastic truss optimization. Expert Systems with Applications, 237, 121530. https://doi.org/10.1016/j.eswa.2023.121530