**Random Forest Classifier Overview:**

Random Forest is an ensemble method that builds many decision trees during training and combines their predictions. It is effective for both classification and regression tasks.

**Example Using the Iris Dataset:**

**Step 1: Import Libraries**

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
```

**Step 2: Load and Explore the Iris Dataset**

```python
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data    # Feature matrix
y = iris.target  # Target labels
```

**Step 3: Split the Data into Training and Testing Sets**

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

**Step 4: Create and Train the Random Forest Classifier Model**

```python
# Create a Random Forest classifier model
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
random_forest_model.fit(X_train, y_train)
```

*Parameters That Can Be Changed* (several of these are combined in the sketch after this list)

1. **n_estimators** (default=100): The number of decision trees in the forest. More trees generally improve model performance but increase computation time.
2. **criterion** (default="gini"): The function used to measure the quality of a split. You can choose "gini" for Gini impurity or "entropy" for information gain.
3. **max_depth** (default=None): The maximum depth of each decision tree in the forest. Limiting the depth can help prevent overfitting.
4. **min_samples_split** (default=2): The minimum number of samples required to split an internal node. Increasing this value can help prevent overfitting.
5. **min_samples_leaf** (default=1): The minimum number of samples required at a leaf node. Like min_samples_split, increasing this value can prevent overfitting and yields smaller trees.
6. **max_features** (default="sqrt"): The number of features considered when searching for the best split. Accepts an integer count, a float fraction of the total features, "sqrt" (square root of the feature count), "log2" (base-2 logarithm of the feature count), or None (use all features). Older scikit-learn versions also accepted "auto" as an alias for "sqrt"; it has since been removed.
7. **bootstrap** (default=True): Whether to draw bootstrap samples when building trees. If set to False, the entire dataset is used to build each tree.
8. **random_state** (default=None): The seed controlling the randomness of bootstrapping and feature sampling. Setting this ensures reproducibility.
9. **class_weight** (default=None): Weights associated with classes. You can set this to "balanced" to automatically adjust weights inversely proportional to class frequencies in the input data.
10. **n_jobs** (default=None): The number of CPU cores to use for parallelism during training. Setting it to -1 uses all available cores.
11. **oob_score** (default=False): Whether to calculate an out-of-bag (OOB) score, a performance estimate based on the samples each tree never saw during training.
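To see how these parameters fit together, here is a minimal sketch of a more heavily configured classifier. The specific values are illustrative choices for demonstration, not tuned recommendations for the Iris dataset.

```python
# A sketch of a more heavily configured classifier; the values below
# are illustrative, not recommendations for the Iris dataset.
tuned_model = RandomForestClassifier(
    n_estimators=200,          # more trees, at the cost of training time
    criterion="entropy",       # split on information gain instead of Gini
    max_depth=5,               # cap tree depth to reduce overfitting
    min_samples_split=4,       # require 4 samples before splitting a node
    min_samples_leaf=2,        # require 2 samples at each leaf
    max_features="sqrt",       # consider sqrt(n_features) at each split
    bootstrap=True,            # sample with replacement for each tree
    oob_score=True,            # score on the samples each tree never saw
    class_weight="balanced",   # reweight classes by inverse frequency
    n_jobs=-1,                 # use all available CPU cores
    random_state=42,           # make results reproducible
)
tuned_model.fit(X_train, y_train)

# The OOB score is a built-in generalization estimate that needs no
# separate validation split.
print(f"OOB score: {tuned_model.oob_score_:.3f}")
```

In practice, values like these are usually selected by cross-validation rather than by hand, for example with scikit-learn's `GridSearchCV`.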
**Step 5: Make Predictions**

```python
# Make predictions on the test data
y_pred = random_forest_model.predict(X_test)
```

**Step 6: Evaluate the Model**

```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Generate a classification report
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:")
print(classification_rep)
```

**Explanation:**

1. We import the necessary libraries: NumPy for numerical operations, Matplotlib for visualization, and scikit-learn for Random Forest classification and dataset loading.
2. We load the Iris dataset, which contains features such as sepal and petal measurements, and target labels representing three iris species.
3. We split the dataset into training and testing sets, using 80% of the data for training and 20% for testing.
4. We create a Random Forest classifier model using `RandomForestClassifier`, specifying the number of decision trees in the forest with the `n_estimators` parameter.
5. We train the model on the training data using `fit`.
6. We use the trained model to make predictions on the test data.
7. We evaluate the model's performance using accuracy and generate a classification report that includes precision, recall, F1-score, and support for each class.

Random Forest is a powerful ensemble method that combines the strength of multiple decision trees to improve classification accuracy and generalization. This example demonstrates its implementation for a classification task on the Iris dataset; you can adapt it for your own classification tasks and datasets.
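One convenient follow-up, and a use for the Matplotlib import above, is to inspect which features drive the forest's decisions. The sketch below plots the fitted model's impurity-based `feature_importances_`; the figure size and labels are arbitrary choices.

```python
# Visualize which features the forest relied on, using the
# impurity-based importances aggregated across its trees.
importances = random_forest_model.feature_importances_

plt.figure(figsize=(8, 4))
plt.bar(iris.feature_names, importances)
plt.ylabel("Importance")
plt.title("Random Forest Feature Importances (Iris)")
plt.tight_layout()
plt.show()
```

On Iris, such a plot typically shows the petal measurements carrying most of the importance, which is a quick sanity check that the model has picked up the dataset's well-known structure.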