## ASSESSMENT SUMMARY
| Grade | SKILL |
|:------------------------------------ |:------------------------------------:|
|  | Data Visualization and Communication |
|  | Machine Learning |
|  | Scripting and Command Line |
## ASSESSMENT DETAILS
### Data Visualization and Communication
**Summary**: The visualisation and communication are up to scratch. The presentation's good visuals made the procedure and other details easy to follow.
1. Effort was put into Exploratory Data Analysis (EDA).
2. Explored how delay varied with time.

3. Explored how delay varied with airline and arrival airport.

4. Departure was not compared with delay, as it was correctly observed that the departure airport was the same for all flights and therefore served no purpose for prediction.
5. Hence, it was decided to include features from the following domains:
> Time Domain
> Location Domain
> Airline Domain
6. The 'distance' between airports was identified as an important parameter for prediction.
### Machine Learning
**Summary**: Feature engineering was done perfectly, and every decision was justified in the README. Different models were also tested, and all steps and final observations were included in the presentation.
#### Feature Engineering
1. A label feature was engineered to store the final value predicted by the model.
2. Time features were included, and the reason for doing so was given in the README. This was done efficiently with the help of a function, which keeps the code readable:
```python=
# Create time features
import datetime

def get_weekday(string):
    # Parse an ISO date string ('YYYY-MM-DD') and return the weekday name, e.g. 'Monday'
    year, month, day = (int(i) for i in string.split('-'))
    dt = datetime.date(year, month, day)
    return dt.strftime("%A")
```
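For example, applied to a date column (the column name 'Date' is an assumption, not confirmed by the notebook):
```python=
# Hypothetical usage: the actual date column name may differ
df['weekday'] = df['Date'].apply(get_weekday)
```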
3. A new feature, airline_grp, was created to store the categorical group to which each airline belongs based on its average delay. A good way to approach the problem:
```python=
# Rank airlines by mean delay label to find the most and least delayed carriers
df.groupby('Airline')['label'].mean().sort_values(ascending = False).head(20)
df.groupby('Airline')['label'].mean().sort_values(ascending = False).tail(20)
```
4. The results of the above snippet were analysed and the following categories were identified:
```python=
highest_delay = ['BO', 'HB', 'P7']
high_delay = ['O3', 'SV', 'BG']
medium_delay = ['E8', 'O8', 'PK', 'LV', 'MF', 'OX', '2P', 'HO']
lowest_delay = ['XF', 'VQ', 'TV', '3V', 'HZ', 'QG', 'SO', 'JD', 'JT', 'B5', 'Y8', 'NQ', 'TT']
```
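The exact assignment step is not reproduced here; a minimal sketch of how these lists could be mapped onto the airline_grp column might look as follows (airline_group is a hypothetical helper, not the author's code):
```python=
# Hypothetical helper: map each airline code to its delay group
def airline_group(code):
    if code in highest_delay:
        return 'highest'
    elif code in high_delay:
        return 'high'
    elif code in medium_delay:
        return 'medium'
    elif code in lowest_delay:
        return 'lowest'
    return 'normal'

df['airline_grp'] = df['Airline'].apply(airline_group)
```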
5. The same approach was used to create a feature named arrival_grp.
6. Good use was made of external sources to calculate the distance between airports, from which a new feature, 'distance', was engineered. It was also identified that there was only one departure airport:
```python=
# Create new variable for distance
import pandas as pd

airports = pd.read_csv("https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat", header = None)
airports = airports.set_axis(['v0', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13'], axis=1)
```
```python=
# Look up each arrival airport's coordinates and compute the geodesic distance
from geopy.distance import geodesic

def get_distance(airport):
    # 'start' holds the (latitude, longitude) of the single departure airport, defined earlier
    target = (airports[airports['v4'] == airport].v6.values[0],
              airports[airports['v4'] == airport].v7.values[0])
    return geodesic(start, target).miles

df['distance'] = df['Arrival'].apply(lambda airport: get_distance(airport))
```
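The snippet above relies on a `start` coordinate defined earlier in the notebook; a sketch of how it could be looked up from the same table (the IATA code below is a placeholder, not taken from the project):
```python=
# Hypothetical: look up the single departure airport's (lat, lon) once
dep_code = 'XXX'  # placeholder IATA code for the departure airport
start = (airports[airports['v4'] == dep_code].v6.values[0],
         airports[airports['v4'] == dep_code].v7.values[0])
```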
7. Finally, the new data frame containing the feature-engineering changes was saved for future use:
```python=
df.to_csv('dataframe.csv', index = False)
```
8. Categorical variables were converted to indicator columns using dummies. This was done for the date-time features and the two group features, arrival_grp and airline_grp:
```python=
arrival_grp_df = pd.get_dummies(df_model['arrival_grp'], prefix='arrival', drop_first=True)
df_model = pd.concat([df_model, arrival_grp_df], axis=1)
df_model.drop(['arrival_grp'], axis=1, inplace=True)
```
#### Machine Learning Model
1. Different models were used in order to compare the performance of each. The baseline model was Logistic Regression.
2. It was identified that the prediction label distribution was heavily imbalanced, with label = 1 as the minority class. Therefore, the following was done, as sketched below:
> The Synthetic Minority Over-sampling Technique (SMOTE) was used to oversample the minority class (label = 1). This approach significantly improves modelling performance on such an imbalanced dataset.

This was a crucial observation made before model building.
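A minimal sketch of this step using the imbalanced-learn library (the resampled variable names are assumptions):
```python=
# Oversample the minority class (label = 1) with SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```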
3. A function was built that takes a model as input and prints its results, yet another example of keeping the code concise:
```python=
# Print train and test accuracy and AUC-ROC for a fitted model
from sklearn.metrics import accuracy_score, roc_auc_score

def model_result(model):
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    print('\nTRAIN RESULTS:')
    print("Accuracy:", accuracy_score(y_train, train_pred))
    print("AUC-ROC:", roc_auc_score(y_train, train_pred))
    print('\nTEST RESULTS:')
    print("Accuracy:", accuracy_score(y_test, test_pred))
    print("AUC-ROC:", roc_auc_score(y_test, test_pred))
```
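Example usage with the Logistic Regression baseline (a sketch; the estimator settings are assumptions):
```python=
# Fit the baseline model and report its metrics
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_result(logreg)
```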
4. An accuracy of 95.6% was achieved, although the AUC-ROC of 0.5 shows the baseline was simply predicting the majority class:
> TRAIN RESULTS:
> Accuracy: 0.9561647274535693
> AUC-ROC: 0.5
>
> TEST RESULTS:
> Accuracy: 0.9561643835616438
> AUC-ROC: 0.5
5. As mentioned before, SMOTE (Synthetic Minority Over-sampling Technique) was applied, which helped in working with the imbalanced dataset:
> TRAIN RESULTS:
> Accuracy: 0.7047641203742169
> AUC-ROC: 0.587518071087921
>
> TEST RESULTS:
> Accuracy: 0.9069568279978497
> AUC-ROC: 0.5817559158004117
6. The next model experimented with was Random Forest. This shows that every aspect of training was considered in order to achieve the best model:
> TEST RESULTS:
> Accuracy: 0.8675996811685543
> AUC-ROC: 0.7615040895890443
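A sketch of this step (the hyperparameters and the use of the SMOTE-resampled data are assumptions):
```python=
# Train a Random Forest and report metrics via the helper above
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_res, y_train_res)
model_result(rf)
```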
7. Finally, Gradient Boosting was taken into consideration. Since that model is computationally heavy, Light Gradient Boosting (LightGBM) was also considered:
> Results for LightGBM:
> TEST RESULTS:
> Accuracy: 0.893658590839157
> AUC-ROC: 0.6968553347860308
>
> Results for Gradient Boost:
> TEST RESULTS:
> Accuracy: 0.8909707676052422
> AUC-ROC: 0.7187307346332649
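A sketch of the LightGBM step (all parameters are assumptions):
```python=
# Train a LightGBM classifier and report metrics
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.05, random_state=42)
lgbm.fit(X_train_res, y_train_res)
model_result(lgbm)
```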
8. An ROC curve was plotted for each of the models that were built; a sketch of this step follows.
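A minimal sketch for one model (the fitted rf estimator from the sketch above is an assumption):
```python=
# Plot an ROC curve from predicted probabilities
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label='Random Forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```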

9. Finally, the model was analysed by calculating the importance of each feature and how each contributes towards prediction. This helps in understanding how the model performed and whether something better could have been done; see the sketch below.
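A sketch of reading feature importances from a fitted tree ensemble (assumes X_train is a DataFrame and rf is the fitted forest from above):
```python=
# Rank features by their importance to the model
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```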
10. The Brier score and Mean Absolute Error gave promising results and said a lot about how the model performed:
```python=
get_mae(test_proba, actual_claim)
#35.21178885709525
get_brier_error(test_proba, actual_claim)
#28011.43678950288
```
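get_mae and get_brier_error are project-specific helpers not shown in the excerpt; a plausible sketch of what they might compute (an assumption; the unnormalised scale of the reported values suggests the errors were not averaged to the classical [0, 1] Brier range):
```python=
# Hypothetical implementations of the project's error helpers
import numpy as np

def get_mae(pred, actual):
    # Mean absolute error between predictions and actual values
    return np.mean(np.abs(np.asarray(pred) - np.asarray(actual)))

def get_brier_error(pred, actual):
    # Squared-error (Brier-style) total; summed to match the reported scale (assumption)
    return np.sum((np.asarray(pred) - np.asarray(actual)) ** 2)
```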
11. Finally, the following was concluded:
> Among the factors from the time, location, and airline domains, a long distance between the departure and arrival airports is the strongest indicator of a delay of more than 3 hours. In addition, flights on Tuesday, Wednesday, and Sunday are also more likely to be delayed by more than 3 hours.
### Scripting and Command Line
**Summary**: The analysis was carried out in a Jupyter notebook and the files were shared through a GitHub repository. A presentation and a README were included for a productive understanding of the project.
1. The project was hosted on GitHub, which made it easily accessible.
2. A README.txt was included in the GitHub repository. It contained briefly explained sections, namely Code deployment, Project overview, Project details, Outcome highlights, and Future work, which helped in understanding every part of the Jupyter notebook. Hence, separating implementation from explanation was a good idea.
3. The packages required for prediction were imported at the beginning of the notebooks.
4. The code is readable, with uniform spacing and consistent styling. Descriptive variable names were used, which avoids confusion.
5. Code and text styling are consistent throughout the notebook.
6. A presentation with simple yet effective styling was included. It not only described the various stages of the project but also used visuals to make it more presentable.
7. The presentation covered the project's future scope and possible improvements that could be made.