## ASSESSMENT SUMMARY
| Grade | Skill |
|:------------------------------------ |:------------------------------------:|
|  | Data Visualization and Communication |
|  | Machine Learning |
|  | Scripting and Command Line |
## ASSESSMENT DETAILS
### Data Visualization and Communication
**Summary**: Visualisation and communication were up to scratch. Hypotheses were clearly stated and later validated. A presentation was also included.
1. Traditional data analysis was performed.
2. Data understanding was gained by visualising the data in tabular form.
3. Summarised the data set and identified the steps needed for data cleaning.
4. The necessary type conversions and encodings were pinpointed:
> flight_date to date type
> delay_time to numeric
> code delay_time = Cancelled as 99
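In R, these conversions might look like the following sketch; the data frame name `flights` and the exact column labels are assumptions, not taken from the submission:

```r
library(dplyr)

# Hypothetical sketch of the noted conversions; `flights` and the
# column names are assumed, not from the submission.
flights <- flights %>%
  mutate(
    flight_date = as.Date(flight_date),
    # encode cancelled flights as 99 before converting to numeric
    delay_time  = if_else(delay_time == "Cancelled", "99", delay_time),
    delay_time  = as.numeric(delay_time)
  )
```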
5. Explored the null values and deduced that Airline was the only feature containing nulls:
> 0.19% of flights with Airline = NULL
6. An important visualisation showed that only a few flights claimed a refund when the delay time was 3. Decided to exclude flights with null airline values:
> Only a few flights (<1%) claim a refund when delay time = 3
> exclude Airline = NULL
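The exclusion itself is a one-liner; this sketch assumes the nulls appear as literal `NA` values in a `flights` data frame:

```r
library(dplyr)

# Drop the <1% of rows with a missing airline
# (`flights` and `Airline` names assumed)
flights <- flights %>% filter(!is.na(Airline))
```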
7. Finally, a TODO list was made highlighting the tasks to be performed.
### Machine Learning
**Summary**: Feature engineering was done with great care. What was done, and why, is clearly explained in the presentation. All the tasks from the TODO list created while observing the data were completed.
#### Feature Engineering
1. Removed the null values.
2. Extra time features that aid prediction were engineered; the reasoning was presented as visualisations in the presentation:
> Bring the statistical metrics per time unit from history.
> Add statistical metrics regarding the previous hour.
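A sketch of what "statistical metrics per time unit from history" could look like with dplyr; the grouping column and metric names here are assumptions, not from the submission:

```r
library(dplyr)

# Hypothetical: summarise historical delay statistics per hour of day,
# then join them back onto each flight as features.
hourly_stats <- flights %>%
  group_by(hour) %>%
  summarise(
    hour_mean_delay = mean(delay_time, na.rm = TRUE),
    hour_sd_delay   = sd(delay_time, na.rm = TRUE)
  )
flights <- flights %>% left_join(hourly_stats, by = "hour")
```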
3. Compared time with delays and concluded that delays are most common in quarter 3.


4. Extra flight features were engineered with the help of external sources:
> Add airports metadata from an external data set.
> Bring the statistical metrics per airline, arrival airport, and country.
5. The following snippet shows how the important flight features were extracted:
```r
library(data.table)  # fread
library(dplyr)
library(geosphere)   # distm, distHaversine

extend_flight <- function(df) {
  # extend flight-related features with external airport data
  # airports: https://github.com/epranka/airports-db
  airports_raw <- fread("./data/airports-db/raw/airports.csv")
  hk_longitude <- 113.915
  hk_latitude <- 22.3089
  airports <- airports_raw %>%
    filter(!iata_code %in% c('', '-', '0') & type != 'closed') %>%
    rowwise() %>%
    mutate(
      # great-circle distance from Hong Kong to each airport
      distance_to_hk = distm(c(hk_longitude, hk_latitude),
                             c(longitude_deg, latitude_deg),
                             fun = distHaversine)
    ) %>%
    select(iata_code, type, latitude_deg, longitude_deg, iso_country, distance_to_hk) %>%
    # deduplicate: keep one row per IATA code
    group_by(iata_code) %>%
    slice(which.max(distance_to_hk))
  out <- df %>%
    left_join(airports, by = c("Arrival" = "iata_code")) %>%
    # standardise the distance feature
    mutate_at(c("distance_to_hk"), ~(scale(.) %>% as.vector)) %>%
    select(-latitude_deg, -longitude_deg)
  out
}
```
6. Delay time was compared across countries, airlines and arrival airports; the distribution of delay time by country was plotted.

7. Four data sets were finally saved for future use:
> origin, origin + time, origin + flight, origin + time + flight
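Persisting the four variants could be done with `saveRDS`; the object and file names here are illustrative only:

```r
# Illustrative only: persist each feature-set variant for reuse
# (object and file names assumed, not from the submission)
saveRDS(df_origin,             "./data/origin.rds")
saveRDS(df_origin_time,        "./data/origin_time.rds")
saveRDS(df_origin_flight,      "./data/origin_flight.rds")
saveRDS(df_origin_time_flight, "./data/origin_time_flight.rds")
```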
#### Machine Learning Model
1. Modelling process for both training and testing was included in the presentation in the form of a flow diagram.
2. Four models were trained: Linear Regression, Random Forest, eXtreme Gradient Boosting, and SVM. This approach allows the model with the most promising results to be chosen.
3. Training was done on each of the four data sets.
4. Time to train the models was noted.
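The `train_result` list used below is consistent with caret's `train()`. A minimal sketch of such a loop, where `train_set` and the caret method names are assumptions based on the models listed above:

```r
library(caret)

# Hypothetical training loop; `train_set` and the method names
# are assumed, not taken from the submission.
methods <- c("lm", "rf", "xgbTree", "svmLinear")
train_result <- lapply(methods, function(m) {
  time_taken <- system.time(
    fit <- train(delay_time ~ ., data = train_set, method = m,
                 trControl = trainControl(method = "cv", number = 5),
                 metric = "RMSE")
  )
  fit$time_taken <- time_taken  # record training time, as in point 4
  fit
})
```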
5. Models were compared against each other on their root mean square error (RMSE) and mean absolute error (MAE) across the four data sets:
```r
# model results
model_compare <- data.frame(
  LM   = c(min(train_result[[1]]$results$RMSE), min(train_result[[1]]$results$MAE)),
  RF   = c(min(train_result[[2]]$results$RMSE), min(train_result[[2]]$results$MAE)),
  XGBT = c(min(train_result[[3]]$results$RMSE), min(train_result[[3]]$results$MAE)),
  SVM  = c(min(train_result[[4]]$results$RMSE), min(train_result[[4]]$results$MAE))
)
rownames(model_compare) <- c('RMSE', 'MAE')
model_compare
```
6. Finally, the origin + time data set was chosen for testing with the four models.
7. Identified SVM with linear settings as the best-performing model:
> Training RMSE=14.5 MAE=2.7
> Testing RMSE=17.8 MAE=3.6
8. Finally, the model was saved for future use.
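Saving the chosen model could again be a `saveRDS` call; the index and file name here are illustrative only:

```r
# Illustrative: persist the chosen SVM model for later prediction
# (list index and path assumed, not from the submission)
saveRDS(train_result[[4]], "./models/svm_linear.rds")
```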
### Scripting and Command Line
**Summary**: The project was done in RStudio and included a README file and a presentation.
1. A README.txt was included, describing the project structure.
2. The packages required for prediction were listed in the README file itself.
3. The code is readable, with uniform spacing and consistent styling.
4. Reusable functions were collected in a features.R file and imported into the other files in the project, ensuring code reusability and reducing code size.
5. The presentation covered almost everything, from data understanding and feature engineering to training and testing.
6. The future scope of the project was included in the presentation.