# 18. Wednesday, 07-08-2019, ML - Decision Tree & Random Forest

```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```

Create the feature matrix by dropping the label column from the previous DataFrame:

```python=
X = data.drop(columns='label').values
```

## Decision Tree & Random Forest classifiers

**Import the needed libraries and initialize the models**

```python=
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
```

**Train the models**

```python=
dtc.fit(X_train, y_train)
rfc.fit(X_train, y_train)
```

## Precision and Accuracy

**Predict with both models and compare their accuracies**

The Random Forest model clearly achieves a higher accuracy than the Decision Tree:

```python=
# Decision Tree
dtc_predictions = dtc.predict(X_test)
dtc_acc = accuracy_score(y_test, dtc_predictions)
print(dtc_acc)
>>> 0.76

# Random Forest
rfc_predictions = rfc.predict(X_test)
rfc_acc = accuracy_score(y_test, rfc_predictions)
print(rfc_acc)
>>> 0.8784
```

We can also compare the two models with confusion matrices; here as well, the Random Forest predictions are more accurate than the Decision Tree's.

```python=
# Pass y_true first so rows are the actual labels and columns the predictions
dtc_cm = confusion_matrix(y_test, dtc_predictions)
sns.heatmap(dtc_cm, annot=True, fmt='d', cmap="plasma")
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()
```

Decision Tree
![](https://i.imgur.com/EKuCTAd.png)

```python=
rfc_cm = confusion_matrix(y_test, rfc_predictions)
sns.heatmap(rfc_cm, annot=True, fmt='d', cmap="viridis")
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()
```

Random Forest
![](https://i.imgur.com/G0DCiP1.png)

## Optimize the model with K-Fold CV and GridSearchCV

**Check the accuracy of the current model while changing the number of trees in the
forest**

```python=
# n is n_estimators: the number of trees in the forest
n = [1, 5, 10, 20, 50, 100, 200, 500]

# 'result' stores the accuracy score for each number of trees
result = []
for i in n:
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    result.append(accuracy_score(y_test, predictions))

# Finally, plot n against result using plt.scatter()
plt.scatter(x=n, y=result)
print(result)
```

![](https://i.imgur.com/yp2TzZr.png)

We can see that more trees take more time to train, but past a certain point the accuracy barely changes.

**Tuning via K-Fold CV**

* Get a more reliable accuracy estimate with cross_val_score
* Choose the number of folds, then take the mean of the fold scores

```python=
from sklearn.model_selection import cross_val_score

rfc_2 = RandomForestClassifier()

# Cross-validation score of each of the 3 folds
cross_val_score(rfc_2, X_train, y_train, cv=3)
>>> array([0.86900958, 0.8608    , 0.86858974])

# Three separate scores are hard to compare -> take the mean
cross_val_score(rfc_2, X_train, y_train, cv=3).mean()
>>> 0.8666557808361324
```

* Plot the cross-validation score (K-Fold) for each number of trees

```python=
val_results = []
for i in n:
    clf = RandomForestClassifier(n_estimators=i)
    val_results.append(cross_val_score(clf, X_train, y_train, cv=3).mean())

plt.scatter(n, val_results)
```

![](https://i.imgur.com/zllutE4.png)

```python=
final_rfc = RandomForestClassifier(n_estimators=100)
final_rfc.fit(X_train, y_train)

final_predictions = final_rfc.predict(X_test)
accuracy_score(y_test, final_predictions)
>>> 0.936
```

### GridSearchCV

It makes the search faster and easier:

```python=
from sklearn.model_selection import GridSearchCV
```

```python=
# Define the parameter grid for GridSearchCV
param_grid = {
    "n_estimators": [1, 5, 10, 20, 50, 100, 200, 500]
}

gridcv = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=3)
gridcv.fit(X_train, y_train)
```

We don't need to draw a chart to compare: GridSearchCV keeps track of the scores and reports the best one together with the best parameter.

```python=
gridcv.best_score_
>>> 0.928

gridcv.best_params_
>>> {'n_estimators': 500}

final_rfc = RandomForestClassifier(n_estimators=500)
final_rfc.fit(X_train, y_train)

final_predictions = final_rfc.predict(X_test)
accuracy_score(y_test, final_predictions)
>>> 0.9376
```

GridSearchCV is faster and easier, but sometimes we still tune manually, because we may have to weigh accuracy against resources. For example, it is not worth spending five times the compute (500 trees instead of 100) for such a small gain in accuracy (0.9376 vs. 0.936).
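The whole tuning workflow above can be sketched end to end in one self-contained snippet. Note the assumptions: `make_classification` stands in for the lecture's `data` DataFrame, and the grid values and `random_state` are illustrative, not the values used in class.

```python=
# Minimal end-to-end sketch of the GridSearchCV tuning workflow.
# Assumption: a synthetic dataset replaces the lecture's DataFrame.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# A small illustrative grid: fewer trees are cheaper to train
param_grid = {"n_estimators": [10, 50, 100]}
gridcv = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=param_grid, cv=3)
gridcv.fit(X_train, y_train)

# GridSearchCV refits the best model on all of X_train by default,
# so gridcv.predict() uses the best n_estimators found
best_n = gridcv.best_params_["n_estimators"]
test_acc = accuracy_score(y_test, gridcv.predict(X_test))
print(best_n, round(test_acc, 4))
```

The cross-validated scores for every grid point are also kept in `gridcv.cv_results_`, which is handy when weighing a small accuracy gain against the extra training cost, as discussed above.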