**Decision Tree Classification Overview:**
A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It makes decisions by splitting the dataset into subsets based on the most significant attributes (features) to predict the target variable.
We'll use the Iris dataset, which is a popular choice for classification tasks.
**Example Using Iris Dataset:**
**Step 1: Import Libraries**
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
```
**Step 2: Load and Explore the Iris Dataset**
```python
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data # Feature matrix
y = iris.target # Target labels
```
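Step 2 only loads the data; if you also want to explore it, here is a minimal optional sketch of what that might look like (the values noted in the comments are properties of the standard Iris dataset):
```python
# Optional: basic inspection of the loaded data
print(X.shape)              # (150, 4) -> 150 samples, 4 features
print(iris.feature_names)   # sepal/petal length and width (in cm)
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(np.bincount(y))       # [50 50 50] -> 50 samples per species
```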
**Step 3: Split the Data into Training and Testing Sets**
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
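Because the Iris classes are balanced, a plain random split works well here; if you want to guarantee that each species keeps the same proportion in both sets, `train_test_split` also accepts a `stratify` argument. An optional variant of the call above:
```python
# Optional: stratify on the labels so the 80/20 split preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```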
**Step 4: Create and Train the Decision Tree Model**
```python
# Create a Decision Tree classifier model
decision_tree_model = DecisionTreeClassifier(random_state=42)
# Train the model on the training data
decision_tree_model.fit(X_train, y_train)
```
*Params That Can Be Changed*
1. **criterion** (default="gini"):
- The function used to measure the quality of a split. Choose between "gini" for Gini impurity or "entropy" for information gain.
2. **max_depth** (default=None):
- The maximum depth of the decision tree. Limiting the depth can help prevent overfitting.
3. **min_samples_split** (default=2):
- The minimum number of samples required to split an internal node. Increasing this value can help prevent overfitting.
4. **min_samples_leaf** (default=1):
- The minimum number of samples required to be in a leaf node. Similar to min_samples_split, increasing this value can prevent overfitting and result in smaller trees.
5. **max_features** (default=None):
- The maximum number of features to consider when looking for the best split. You can specify an integer, a float fraction of the total number of features, "sqrt", "log2", or None (use all features).
6. **random_state** (default=None):
- Controls the randomness of the estimator (for example, how ties between equally good splits are broken). Setting it to a fixed integer makes results reproducible. A short sketch after this list shows how these arguments are typically combined.
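To show how these parameters are passed in practice, here is a minimal sketch; the variable name `tuned_tree` and the specific values (`criterion="entropy"`, `max_depth=3`, `min_samples_leaf=2`) are illustrative choices, not tuned settings:
```python
# Illustrative (not tuned) hyperparameter choices for the same Iris data
tuned_tree = DecisionTreeClassifier(
    criterion="entropy",   # use information gain instead of Gini impurity
    max_depth=3,           # cap tree depth to reduce overfitting
    min_samples_leaf=2,    # require at least 2 samples in every leaf
    random_state=42,       # make results reproducible
)
tuned_tree.fit(X_train, y_train)
print(accuracy_score(y_test, tuned_tree.predict(X_test)))
```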
**Step 5: Make Predictions**
```python
# Make predictions on the test data
y_pred = decision_tree_model.predict(X_test)
```
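If you need class probabilities rather than hard labels, the classifier also exposes `predict_proba`; a small optional addition (the slice of five test samples is arbitrary):
```python
# Optional: per-class probability estimates for the first five test samples
proba = decision_tree_model.predict_proba(X_test[:5])
print(proba)  # one row per sample, one column per class (setosa, versicolor, virginica)
```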
**Step 6: Evaluate the Model**
```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Generate a classification report
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:")
print(classification_rep)
```
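A confusion matrix is another common way to see which species get mixed up; an optional sketch using scikit-learn's `confusion_matrix`:
```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```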
**Step 7: Visualization (Optional)**
```python
# Plot the decision tree (optional)
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(decision_tree_model, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()
```
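If plotting is inconvenient (for example, in a terminal session), the same tree can also be printed as text rules; an optional alternative using `export_text`:
```python
from sklearn.tree import export_text

# Optional: text rendering of the fitted tree, one line per split or leaf
rules = export_text(decision_tree_model, feature_names=iris.feature_names)
print(rules)
```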
**Explanation:**
1. We import the necessary libraries, including NumPy for numerical operations, Matplotlib for visualization, and scikit-learn for decision tree classification and dataset loading.
2. We load the Iris dataset, which contains four features (sepal and petal length and width) and target labels representing three iris species.
3. We split the dataset into training and testing sets. Here, we use 80% of the data for training and 20% for testing.
4. We create a decision tree classifier model and train it on the training data. The model learns to classify iris species based on the features.
5. We use the trained model to make predictions on the test data.
6. We evaluate the model's performance using accuracy and generate a classification report that includes precision, recall, F1-score, and support for each class.
7. Optionally, we can visualize the decision tree to see how it makes decisions by splitting the data based on different feature values.
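As a follow-up to the parameter notes above, one common way to choose a value such as `max_depth` is cross-validation on the training set; a minimal sketch (the candidate depths are arbitrary):
```python
from sklearn.model_selection import cross_val_score

# Compare a few candidate depths with 5-fold cross-validation on the training data
for depth in (2, 3, 5, None):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```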
**Decision Tree Regressor Overview:**
A decision tree regressor is a supervised machine learning algorithm used for regression tasks. Instead of splitting data to classify into classes, it splits the data to predict continuous numeric values.
We'll use the Boston Housing dataset, a classic regression benchmark (note that it has been removed from recent scikit-learn releases; Step 2 mentions an alternative).
**Example Using Boston Housing Dataset:**
**Step 1: Import Libraries**
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
```
**Step 2: Load and Explore the Boston Housing Dataset**
```python
# Load the Boston Housing dataset (only available in scikit-learn versions before 1.2; see the note below)
boston = datasets.load_boston()
X = boston.data # Feature matrix
y = boston.target # Target values (housing prices)
```
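Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet above only runs on older versions. On current releases, one readily available substitute is the California Housing dataset; the rest of the example works the same way if you load it instead:
```python
from sklearn.datasets import fetch_california_housing

# Alternative for scikit-learn >= 1.2: California Housing (8 numeric features,
# target is the median house value of each district)
housing = fetch_california_housing()
X = housing.data
y = housing.target
```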
**Step 3: Split the Data into Training and Testing Sets**
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
**Step 4: Create and Train the Decision Tree Regressor Model**
```python
# Create a Decision Tree regressor model
decision_tree_regressor = DecisionTreeRegressor(random_state=42)
# Train the model on the training data
decision_tree_regressor.fit(X_train, y_train)
```
**Step 5: Make Predictions**
```python
# Make predictions on the test data
y_pred = decision_tree_regressor.predict(X_test)
```
*Params That Can Be Changed*
1. **criterion** (default="squared_error", called "mse" in older scikit-learn versions):
- The function used to measure the quality of a split. Common options are "squared_error" (mean squared error), "friedman_mse" (Friedman's mean squared error), "absolute_error", and "poisson".
2. **splitter** (default="best"):
- The strategy used to choose the split at each node. Options are "best" (choose the best split among the candidate features) or "random" (choose the best of a set of random splits).
3. **max_depth** (default=None):
- The maximum depth of the decision tree. Limiting the depth can help prevent overfitting.
4. **min_samples_split** (default=2):
- The minimum number of samples required to split an internal node. Increasing this value can help prevent overfitting.
5. **min_samples_leaf** (default=1):
- The minimum number of samples required to be in a leaf node. Similar to min_samples_split, increasing this value can prevent overfitting and result in smaller trees.
6. **max_features** (default=None):
- The maximum number of features to consider when looking for the best split. You can specify an integer, a float fraction of the total number of features, "sqrt", "log2", or None (use all features).
7. **random_state** (default=None):
- Controls the randomness of the estimator (for example, how ties between equally good splits are broken). Setting it to a fixed integer makes results reproducible.
8. **min_impurity_decrease** (default=0.0):
- A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
9. **presort** (default="deprecated"):
- A deprecated parameter with no effect; it has been removed entirely in recent scikit-learn releases, so avoid passing it. A short sketch after this list shows how the other arguments are typically combined.
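Here is a minimal sketch of how several of these arguments are passed together; the variable name `tuned_regressor` and the particular values are illustrative, not tuned:
```python
# Illustrative (not tuned) hyperparameter choices for the same training data
tuned_regressor = DecisionTreeRegressor(
    max_depth=5,                  # cap tree depth to reduce overfitting
    min_samples_leaf=5,           # require at least 5 samples per leaf
    min_impurity_decrease=0.01,   # only split when impurity drops by at least this much
    random_state=42,              # make results reproducible
)
tuned_regressor.fit(X_train, y_train)
print(mean_squared_error(y_test, tuned_regressor.predict(X_test)))
```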
**Step 6: Evaluate the Model**
```python
# Calculate mean squared error (MSE) and R-squared (R2) for evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
```
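To give these numbers some context, it can help to compare them against a trivial baseline; an optional sketch using scikit-learn's `DummyRegressor`, which ignores the features and always predicts the training-set mean:
```python
from sklearn.dummy import DummyRegressor

# Baseline that predicts the mean target value for every sample
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"Baseline MSE: {baseline_mse:.2f}")
```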
**Step 7: Visualization (Optional)**
```python
# Plot the decision tree (optional)
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(decision_tree_regressor, feature_names=boston.feature_names, filled=True)
plt.title("Decision Tree Regressor Visualization")
plt.show()
```
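A fitted tree also exposes `feature_importances_`, which is often easier to read than the full plot when there are many features; a small optional addition (it assumes the Boston variables loaded above):
```python
# Optional: impurity-based importance of each feature (values sum to 1)
for name, importance in zip(boston.feature_names, decision_tree_regressor.feature_importances_):
    print(f"{name}: {importance:.3f}")
```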
**Explanation:**
1. We import the necessary libraries, including NumPy for numerical operations, Matplotlib for visualization, and scikit-learn for decision tree regression and dataset loading.
2. We load the Boston Housing dataset, which contains features like crime rate, number of rooms, and more, along with target values representing housing prices.
3. We split the dataset into training and testing sets. Here, we use 80% of the data for training and 20% for testing.
4. We create a decision tree regressor model and train it on the training data. The model learns to predict housing prices based on the features.
5. We use the trained model to make predictions on the test data.
6. We evaluate the model's performance using mean squared error (MSE) and R-squared (R2) values, which are common metrics for regression tasks. Lower MSE and higher R2 values indicate better performance.
7. Optionally, we can visualize the decision tree to see how it makes splits based on different feature values.