# Random forest hyperparameter tuning
```
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Convert the pandas splits to NumPy arrays
x_training = x_train.to_numpy()
y_training = y_train.to_numpy().ravel()
x_validation = x_valid.to_numpy()
y_validation = y_valid.to_numpy().ravel()
m_samples = x_training.shape[0]  # training set size; max_samples must not exceed this

# 3 * 3 * 4 * 3 = 108 hyperparameter combinations
param_grid = [
    {'bootstrap': [True], 'n_estimators': [50, 100, 150], 'max_depth': [3, 4, 5],
     'min_samples_split': [2, 4, 6, 8], 'max_samples': [1000, 1500, 2000]}
]
forest_clf = RandomForestClassifier()
grid_search = GridSearchCV(forest_clf, param_grid, cv=5,
                           scoring='f1',
                           return_train_score=True)
grid_search.fit(x_training, y_training)
print(grid_search.best_params_)
print('train', grid_search.best_score_)
# Evaluate the tuned model on the held-out validation set
best_model = grid_search.best_estimator_
print('validation', f1_score(y_validation, best_model.predict(x_validation)))
```
Since our own random forest implementation was not optimized, building the trees took a long time.
sklearn's RandomForestClassifier is optimized for this: each tree is built independently of the others, so the trees in a forest can be fit in parallel.
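A minimal sketch of this, reusing the `x_training` and `y_training` arrays from the code above: setting `n_jobs=-1` asks sklearn to build the trees on all available CPU cores.
```
# Sketch: the same classifier with parallel tree building enabled.
# n_jobs=-1 uses all available CPU cores; each tree is fit independently.
from sklearn.ensemble import RandomForestClassifier

parallel_forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
parallel_forest.fit(x_training, y_training)
```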
We use sklearn's GridSearchCV to select the model with the best F1 score. GridSearchCV tries every combination in param_grid; in this example that is 3 × 3 × 4 × 3 = 108 combinations (n_estimators × max_depth × min_samples_split × max_samples), as checked in the sketch below. GridSearchCV also uses k-fold cross-validation, so it does not waste data on a separate hold-out validation set.
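As a quick check of that count, a small sketch reusing `param_grid` from above:
```
# Sketch: count how many hyperparameter combinations GridSearchCV will try.
from sklearn.model_selection import ParameterGrid

print(len(ParameterGrid(param_grid)))  # 3 * 3 * 4 * 3 = 108; with cv=5 that is 540 model fits
```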
These are the best hyperparameters found by the search:
