## ASSESSMENT SUMMARY

| Grade | SKILL |
|:------------------------------------ |:------------------------------------:|
| ![](https://i.imgur.com/O45Rolb.png) | Data Visualization and Communication |
| ![](https://i.imgur.com/Ed8qV2H.png) | Machine Learning |
| ![](https://i.imgur.com/IwphzV5.png) | Scripting and Command Line |

## ASSESSMENT DETAILS

### Data visualization and communication

**Summary**: Communicates well with the reader through distinctive visualizations. The communication throughout gives a clear gist of what has been done, without needing to read the actual code.

1. Separate modules are used for distinct tasks such as model building and prediction, which makes the project easier to understand.
2. A ReadMe is included, with comprehensive instructions on what is being done and how to get everything up and running. ReadMe snippet:
![](https://i.imgur.com/IB4F4oZ.png)
3. Model logging is in place, a good way to keep track of trained models.
4. flight_delay.pdf is a well-written document. Effort has been put into explaining what the model does, with screenshots of graphs and other necessary information. The conclusion clearly states the future scope of the model and how it can be improved:
> We train a Random Forest model to get a strong baseline model. To further improve model, we can:
5. The following exploratory data analysis plots are not only attractive but interactive as well.
![](https://i.imgur.com/uGnTBml.png)
![](https://i.imgur.com/dNlOhwk.png)
Several other plots are provided for a better understanding of model prediction.
* These are plotted against the claim rate. Self-engineered features, and not only the existing ones, have also been plotted against claim_rate, like the following:
![](https://i.imgur.com/7EscR3X.png)
Some other plots can be seen in the screenshot below.
![](https://i.imgur.com/NikTbvk.png)

### Machine Learning

**Summary**: Feature engineering is quite impressive. Several aspects of the features have been taken into consideration.

#### Feature Engineering

1. Data wrangling is done by splitting off the numeric columns:
```python=
def numeric_filter(df):  # -> df, df_numeric
    'filter numeric columns; return df w/o numeric cols and df w/ only numeric cols'
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    n_df = df.select_dtypes(numerics)
    df = df.drop(n_df.columns, axis=1)
    # suffix the numeric column names with "_num"
    n_df = n_df.rename({i: i + "_num" for i in n_df.columns}, axis=1)
    return df, n_df
```
2. Categorical variables are handled by encoding them with the scikit-learn-compatible `category_encoders` package.
3. It was found that most categorical variables have more than 10 categories, a useful insight, which led to the conclusion:
> Apply Target Encoding and Count Encoding instead of One-Hot encoding.
```python=
from category_encoders import TargetEncoder, CountEncoder

def categ_encoder(df, df_y, cols, encoders=('target', 'count')):  # -> fitted_df, fitted_encoders, encoders
    'encode category columns'
    assert len(encoders) > 0, 'encoders is empty'
    df = df[cols]
    fitted_encoders = []
    fitted_df = []
    get_encoder = {
        'target': lambda: TargetEncoder(cols=cols).fit(df, df_y),
        'count': lambda: CountEncoder(cols=cols, handle_unknown=0, normalize=True).fit(df),
        # 'onehot': lambda: OneHotEncoder(cols=cols).fit(df)
    }
    for en_name in encoders:
        encoder = get_encoder[en_name]()
        x = encoder.transform(df)
        # suffix columns with the encoder name to keep them distinguishable
        x = x.rename({i: i + "_" + en_name for i in x.columns}, axis=1)
        fitted_encoders.append(encoder)
        fitted_df.append(x)
    return fitted_df, fitted_encoders, encoders
```
![](https://i.imgur.com/aegWBgI.png)
4. Model evaluation is done using a confusion matrix in order to assess the performance of the classifier (a minimal sketch follows this list).
5. Important features have been identified.
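To make point 4 above concrete, here is a minimal sketch of confusion-matrix evaluation with scikit-learn; the toy data and variable names are hypothetical placeholders, not taken from the submission:
```python=
# Minimal confusion-matrix sketch (scikit-learn); toy data stands in for the real features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1
```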
In addition, new self-generated features are created:
```python=
import pandas as pd

def add_features(df):
    df = df.copy()
    df['timezone'] = df['timezone'].astype('float')
    # hours from UTC+8, wrapping around the 24-hour clock
    df['timezone_diff'] = df.timezone.apply(lambda x: min(abs(x - 8), 24 - abs(x - 8)))
    df['year'] = df['flight_date'].dt.year
    df['week_from_2013'] = df['flight_date'].dt.isocalendar().week + (df['year'] - min(df['year'])) * 52
    df['month'] = df.flight_date.dt.month
    df['month_from_2013'] = df['month'] + (df['year'] - min(df['year'])) * 12
    df['weekday'] = df.flight_date.dt.weekday.astype('object')
    # bucket coordinates and altitude into up-to-10 quantile bins
    df['latitude_q10'] = pd.qcut(df['latitude'], 10, duplicates='drop').astype(str)
    df['longitude_q10'] = pd.qcut(df['longitude'], 10, duplicates='drop').astype(str)
    df['altitude_q10'] = pd.qcut(df['altitude'], 10, duplicates='drop').astype(str)
    return df.drop(['flight_date'], axis=1)
```

#### Machine Learning Model

**Summary**: Experimented with different models and chose the Random Forest classifier. Ensemble learning is no doubt a good choice here.

1. Proper reasoning and awareness behind choosing Random Forest is given:
> Here I choose Random Forest
> • Fast / Strong / Robust
> • Easy to interpret
> • If proven to be useful, we can further train boosting model to improve acc.
2. Upsampling is nicely done (see the upsampling sketch at the end of this review).
![](https://i.imgur.com/XMn4DIo.png)
The increased accuracy achieved as a result is clearly visible.
3. Grid search is used to decide the model parameters (see the grid-search sketch at the end of this review).
* Grid search on the number of trees and the max tree depth.
* The best combination is number of trees = 200 and max tree depth = 5.
![](https://i.imgur.com/fYzNUBo.png)
4. The importance of each feature is calculated:
```python=
def write_model_info(f, model, x, writeFI=False):
    'writeFI: whether to write feature importance as another file'
    f.write('model info:\n----\n')
    f.write('model paras: {}\n'.format(model.get_params()))
    f.write('**feature_importances**:\n')
    # sort features by importance, highest first
    iter_ = zip(x.columns, model.feature_importances_)
    for i, j in sorted(iter_, key=lambda x: x[1], reverse=True):
        f.write('{:<15}: {}\n'.format(i, j))
    f.write('**feature_importances**\n')
```
Hence it is concluded that
> Target and count encoding are proved to be useful.

### Scripting and Command Line

1. A Requirements.txt is included, stating the necessary environment setup and the versions of the packages used.
2. Files are arranged with proper naming conventions. Separate folders are provided for models, logging, plots, and datasets.
3. Code is readable thanks to proper use of comments, a function for every task, and proper line spacing.
4. Separate .py files are used for every task, and functions required in other .py files are imported directly. This reduces code duplication and is a good example of code reusability.
5. ReadMe.txt describes the role of each file and folder and gives instructions on how to run everything.
6. The model is saved properly (a minimal persistence sketch is shown below).
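To make point 6 concrete, here is a minimal sketch of model persistence with `joblib`; the `models/rf_model.joblib` path and the placeholder model are hypothetical, not taken from the submission:
```python=
# Minimal persistence sketch (joblib); the path below is a hypothetical placeholder.
import os
import joblib
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
# ... fit the model on the training data before saving ...

os.makedirs('models', exist_ok=True)            # keep saved models in their own folder
joblib.dump(model, 'models/rf_model.joblib')    # serialize the model to disk

loaded = joblib.load('models/rf_model.joblib')  # reload it later for prediction
```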
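Similarly, for point 2 of the Machine Learning Model section, a minimal sketch of what upsampling the minority class might look like, assuming `sklearn.utils.resample` on toy data; the submission's actual approach and column names may differ:
```python=
# Minimal upsampling sketch (sklearn.utils.resample); data and names are placeholders.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'feature': range(10), 'label': [0] * 8 + [1] * 2})  # toy imbalanced data

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]
# sample the minority class with replacement up to the majority size
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, upsampled])
print(balanced['label'].value_counts())  # classes are now balanced
```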
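Finally, relating back to point 3 of the Machine Learning Model section, a minimal sketch of a grid search over the number of trees and the max tree depth, assuming scikit-learn's `GridSearchCV` and toy stand-in data; the submission's actual grid values are not reproduced here:
```python=
# Minimal grid-search sketch (scikit-learn); toy data stands in for the real features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    'n_estimators': [50, 100, 200],  # number of trees
    'max_depth': [3, 5, 10],         # max tree depth
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # the review reports 200 trees, depth 5 as best on the real data
```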