# 18. Wednesday, 07-08-2019, ML - Decision Tree & Random Forest
```python=
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```
Create the feature matrix `X` by dropping the label column from the previous DataFrame
```python=
# Feature matrix: every column except the target
X = data.drop(columns='label').values
```
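The cells below use `X_train`, `y_train`, `X_test`, and `y_test`, which are not defined in this section. A minimal sketch of the missing split, assuming the target column is `label` as above and an 80/20 hold-out (the ratio and `random_state` are assumptions):
```python=
from sklearn.model_selection import train_test_split

# Target vector: the 'label' column we dropped from X (assumed name)
y = data['label'].values
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```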
## Decision Tree & Random Forest classifier
**Import the needed libraries and initialize the models**
```python=
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
```
**Train the models**
```python=
dtc.fit(X_train, y_train)
rfc.fit(X_train, y_train)
```
## Precision and Accuracy
**Predict with both models and compare their accuracies**
The Random Forest model clearly reaches a higher accuracy than the Decision Tree:
```python=
#Decision Tree
dtc_predictions = dtc.predict(X_test)
dtc_acc = accuracy_score(y_test, dtc_predictions)
print(dtc_acc)
>>>
0.76
# Random Forest
rfc_predictions = rfc.predict(X_test)
rfc_acc = accuracy_score(y_test, rfc_predictions)
print(rfc_acc)
>>>
0.8784
```
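Accuracy is a single number; `classification_report` (imported earlier but not used yet) adds per-class precision, recall, and F1, for example:
```python=
# Per-class precision, recall, and F1 for the Random Forest predictions
print(classification_report(y_test, rfc_predictions))
```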
We can also compare the models with confusion matrices.
Here too, the Random Forest's predictions look more accurate than the Decision Tree's:
```python=
dtc_cm = confusion_matrix(y_test, dtc_predictions)
sns.heatmap(dtc_cm, annot=True, fmt='d', cmap="plasma")
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()
```
*Decision Tree confusion matrix*

```python=
rfc_cm = confusion_matrix(y_test, rfc_predictions)
sns.heatmap(rfc_cm, annot=True, fmt='d', cmap="viridis")
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()
```
*Random Forest confusion matrix*

## Optimize the model with K-Fold CV and GridSearchCV
**Check the accuracy of the current model for different numbers of trees in the forest**
```python=
# n is n_estimators: the number of trees in the forest
n = [1, 5, 10, 20, 50, 100, 200, 500]
# 'result' collects the accuracy score for each forest size
result = []
for i in n:
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    result.append(accuracy_score(y_test, predictions))
# Finally, plot n against result with plt.scatter()
plt.scatter(x=n, y=result)
print(result)
```

We can see that more trees take longer to train, but past a certain point the accuracy barely changes.
**Tuning via K-Fold CV**
* Get a more reliable accuracy estimate with `cross_val_score`
* Choose the number of folds, then take the mean of the fold scores
```python=
from sklearn.model_selection import cross_val_score

rfc_2 = RandomForestClassifier()
# Cross-validation score: one accuracy value per fold
cross_val_score(rfc_2, X_train, y_train, cv=3)
>>>
array([0.86900958, 0.8608    , 0.86858974])
# Three separate scores are hard to compare -> take the mean
cross_val_score(rfc_2, X_train, y_train, cv=3).mean()
>>>
0.8666557808361324
```
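By default, an integer `cv` with a classifier uses stratified folds without shuffling; to control the splitting explicitly we can pass a `KFold` object instead (the `shuffle` and `random_state` values below are assumptions):
```python=
from sklearn.model_selection import KFold

# Explicit 3-fold splitter with shuffling
kf = KFold(n_splits=3, shuffle=True, random_state=42)
cross_val_score(rfc_2, X_train, y_train, cv=kf).mean()
```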
* Plot the mean cross-validation score (K-Fold) for each forest size
```python=
val_results = []
for i in n:
    clf = RandomForestClassifier(n_estimators=i)
    val_results.append(cross_val_score(clf, X_train, y_train, cv=3).mean())
plt.scatter(n, val_results)
```

```python=
# 100 trees looks like a good accuracy/cost trade-off in the chart above
final_rfc = RandomForestClassifier(n_estimators=100)
final_rfc.fit(X_train, y_train)
final_predictions = final_rfc.predict(X_test)
accuracy_score(y_test, final_predictions)
>>>
0.936
```
### GridSearch CV
GridSearchCV automates the loop above: it cross-validates every candidate parameter value and reports the best one, which is faster to write and less error-prone.
```python=
from sklearn.model_selection import GridSearchCV
```
```python=
# Define the parameter grid for GridSearchCV
param_grid = {
    "n_estimators": [1, 5, 10, 20, 50, 100, 200, 500]
}
gridcv = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=3)
gridcv.fit(X_train, y_train)
```
There is no need to plot a comparison chart: GridSearchCV tracks the results and reports the best score together with the best parameters.
```python=
gridcv.best_score_
>>>
0.928
gridcv.best_params_
>>>
{'n_estimators': 500}
final_rfc = RandomForestClassifier(n_estimators=500)
final_rfc.fit(X_train, y_train)
final_predictions = final_rfc.predict(X_test)
accuracy_score(y_test, final_predictions)
>>>
0.9376
```
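Refitting by hand is not strictly necessary: with the default `refit=True`, GridSearchCV already retrains the best model on the whole training set and exposes it as `best_estimator_`:
```python=
# gridcv.best_estimator_ is already fit on X_train with the best params
final_predictions = gridcv.best_estimator_.predict(X_test)
accuracy_score(y_test, final_predictions)
```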
GridSearchCV is faster and easier, but sometimes a manual comparison is still worthwhile, because we may want to trade accuracy against resources.
Ex: it is not worth growing 500 trees instead of 100 just to gain a tiny bit of accuracy (0.936 vs. 0.9376).
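This accuracy-versus-resources trade-off can also be read straight out of `cv_results_`, which records the mean fit time of every candidate:
```python=
# Compare mean training time against mean CV score per n_estimators
cv_df = pd.DataFrame(gridcv.cv_results_)
print(cv_df[['param_n_estimators', 'mean_fit_time', 'mean_test_score']])
```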