## ASSESSMENT SUMMARY
| Grade | SKILL |
|:------------------------------------ |:------------------------------------:|
|  | Data Visualization and Communication |
|  | Machine Learning |
|  | Scripting and Command Line |
## ASSESSMENT DETAILS
### Data Visualization and Communication
**Summary**: Communication skills are good. Simple but effective use of language makes it clear what is being done and why something is or is not done.
1. Exploratory Data Analysis (EDA) is done properly, starting with basic data import and then digging deeper into the data.
2. Insights are communicated well throughout the notebook.
3. The snippet below shows some basic yet necessary data exploration:
```python=
print('RAW data shape:', df_eda.shape)
print(df_eda.dtypes.value_counts())
df_eda.head()
```
4. Missing values were identified in a neat way, by building a separate summary table. A good example of code reusability:
```python=
import pandas as pd

def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Sort the table by percentage of missing values, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
```
5. The frequency of each unique value of the delay_time feature was computed, to get useful insight into how delay_time can contribute to a better model:
```python=
print(df_eda['delay_time'].value_counts()[0:10])
```
6. A brief conclusion of the Exploratory Data Analysis was given:
> So, after looking at the dataset quickly, we spotted that only one column seems to have missing values, which wouldn't affect much in our model. The train/validation dataset split will have to be coordinated, since the number of '800' in the "is_claim" variable is quite limited. Also, that both 'flight_id' is all unique per event/row, so it won't aid in giving more information to the trained model.
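The point about coordinating the train/validation split around the rare '800' label is worth making concrete. Below is a minimal sketch of a stratified split, assuming the engineered dataframe df_final and binary is_claim target produced in the next section; the stratify call and the 80/20 ratio are illustrative assumptions, not quoted from the notebook.
```python=
from sklearn.model_selection import train_test_split

# Predictors and the binary target (is_claim after replacing 800 with 1)
X = df_final.drop('is_claim', axis=1)
y = df_final['is_claim']

# stratify=y keeps the rare positive class at the same ratio
# in both the training and the validation set (assumed 80/20 split)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print('Positive rate - train: %.4f, validation: %.4f'
      % (y_train.mean(), y_val.mean()))
```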
### Machine Learning
**Summary**: Feature engineering is done decently. Everything mentioned in the EDA has been taken care of during feature engineering. Model building was also carried out steadily.
#### Feature Engineering
1. A brief introduction to the feature engineering was given in a short and effective paragraph:
> We could either go treating the question as a time series question, or we could treat it as a binary/logistic signal question. We're going to do the second option, since we are more familiar with these type of ML models.
2. Dealt with missing values. In the Exploratory Data Analysis it was found that only one column, Airline, had null values. After filling the nulls, the missing_values_table function was used to confirm it:
```python=
df_eda_no_nan = df_eda.fillna({"Airline": "unknown"})
missing_values = missing_values_table(df_eda_no_nan)
missing_values.head()
```
> OUTPUT:
> Your selected dataframe has 10 columns.
> There are 0 columns that have missing values.
3. Features not contributing to the prediction were removed (note that df_eda_no_nan_dummies, used here, is the one-hot encoded frame created in step 4 below):
```python=
# Remove columns:
df_final = df_eda_no_nan_dummies.drop(['flight_id', 'flight_no', 'flight_date'], axis=1)
```
4. Dealt with categorical values, and converted some values to numeric form for easier computation:
```python=
# One-hot encoding those 'object' type variables:
df_eda_no_nan_dummies = pd.get_dummies(df_eda_no_nan, columns=["Departure", "Arrival", "Airline"])
# Label encode the 'flight_no' variable:
df_eda_no_nan_dummies["flight_no_cat"] = df_eda_no_nan_dummies["flight_no"].astype('category').cat.codes
# Replace 800 to 1 in the 'is_claim' variable:
df_final['is_claim'] = df_final['is_claim'].replace(800, 1)
# Replace 'Cancelled' value and convert to numeric (float) the 'delay_time' variable:
df_final['delay_time'] = df_final['delay_time'].replace('Cancelled', '3.0')
df_final["delay_time"] = pd.to_numeric(df_final["delay_time"], downcast="float")
```
#### Machine Learning Model
1. Decided to go with both regression and classification, showcasing a thorough approach. Used XGBoost for both, namely XGBRegressor and XGBClassifier. A minimal sketch of such a setup is shown below.
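The notebook excerpt does not show the model construction itself; the sketch below illustrates one plausible setup that would produce the accuracy, mae and mse variables printed in the next snippet. It builds on the stratified split sketched earlier, and the hyperparameters are assumptions.
```python=
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Fit an XGBoost classifier on the training split (illustrative settings)
clf = XGBClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out validation set
y_pred = clf.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
```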
2. An accuracy of more than 99% was achieved, which looks remarkable, though given the class imbalance noted in the EDA (very few '800' claims) a high raw accuracy should be read with some caution.
3. Mean absolute error (MAE) and mean squared error (MSE) were also calculated, for extra insight into how the model performed:
```python=
print("Accuracy: %.4f%%" % (accuracy * 100.0))
print("MAE: %.4f" % mae)
print("MSE: %.4f" % mse)
```
> OUTPUT:
> Accuracy: 99.9518%
> MAE: 0.0005
> MSE: 0.0005
4. The importance of each feature was identified using the model; the importance plot in the notebook shows that the delay_time feature influences the claim rate the most. A sketch of how such importances can be extracted is given below.
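The importance plot itself is not reproduced in this review; the sketch below shows one standard way to extract and plot importances from a fitted XGBoost model (the specific calls are assumptions, not quotes from the notebook).
```python=
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Rank features by the fitted classifier's importance scores
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))

# Built-in XGBoost bar chart of the top features
plot_importance(clf, max_num_features=10)
plt.show()
```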
5. A similar procedure was repeated for the regressor model. The metrics achieved were:
> OUTPUT:
> Accuracy: 99.9518%
> MAE: 0.0005
> MSE: 0.0005
The accuracy matches that of the classifier model; a sketch of how an accuracy can be derived from a regressor is given below.
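Reporting an accuracy for a regressor implies that its continuous predictions were thresholded into 0/1 labels; the sketch below assumes a 0.5 cutoff, which is an assumption rather than something stated in the notebook.
```python=
from xgboost import XGBRegressor
from sklearn.metrics import accuracy_score

# Fit a regressor on the same training split (illustrative settings)
reg = XGBRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)

# Threshold the continuous output at 0.5 to get binary labels
reg_pred = (reg.predict(X_val) > 0.5).astype(int)
print("Accuracy: %.4f%%" % (accuracy_score(y_val, reg_pred) * 100.0))
```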
6. The feature importances of the regressor model were also found to match those of the classifier.
7. Though the accuracy neither improved nor degraded relative to the classifier, this provided extra confirmation of, and agreement with, the previous model; a short agreement check is sketched after this item.
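The agreement between the two models can be quantified directly; a one-line check building on the hypothetical variables from the sketches above:
```python=
# Fraction of validation rows where the thresholded regressor
# agrees with the classifier's predicted label
agreement = (reg_pred == y_pred).mean()
print("Classifier/regressor agreement: %.4f" % agreement)
```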
### Scripting and Command Line
**Summary**: The analysis was done in a Jupyter notebook and the files were shared through a GitHub repository.
1. The project was hosted on GitHub, which makes it easily accessible.
2. A README.txt was included in the GitHub repository.
3. The packages required for prediction were listed in the README, along with other necessary information.
4. The code is readable, with uniform spacing and consistent styling. Descriptive variable names are used, which avoids confusion.
5. Code and text styling is consistent throughout the notebook.