## Logistic Regression Overview

Logistic regression is a classification algorithm used to model the probability of a binary outcome (1/0, Yes/No, True/False) from one or more predictor variables. It can also be extended to multiclass classification. For this demonstration, we'll use the famous Iris dataset, a common choice for classification tasks that ships with scikit-learn.

## Example Using Iris Dataset

### Step 1: Import Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
```

### Step 2: Load and Explore the Iris Dataset

```python
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data    # Feature matrix
y = iris.target  # Target labels

# Use only the first two features (sepal length and sepal width) for simplicity
X = X[:, :2]
```

### Step 3: Split the Data into Training and Testing Sets

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

### Step 4: Create and Train the Logistic Regression Model

```python
# Create a Logistic Regression model
logistic_model = LogisticRegression()

# Train the model on the training data
logistic_model.fit(X_train, y_train)
```

**Parameters That Can Be Changed:**

1. **penalty** (default="l2"): The regularization term used to prevent overfitting. Choose "l1" for L1 regularization, "l2" for L2 regularization, or `None` for no regularization. Note that "l1" is only supported by the "liblinear" and "saga" solvers.
2. **C** (default=1.0): Inverse of regularization strength; smaller values specify stronger regularization.
3. **solver** (default="lbfgs"): The algorithm to use for optimization. Options include "lbfgs", "newton-cg", "liblinear", "sag", and "saga".
4. **max_iter** (default=100): The maximum number of iterations allowed for the solver to converge.
5. **multi_class** (default="auto"): The strategy for handling multiple classes. Options include "ovr" (one-vs-rest) and "multinomial" (softmax regression).
6. **random_state** (default=None): Seed for the random number generator; it is used to shuffle the data when the solver is "sag", "saga", or "liblinear". Setting it ensures reproducibility.
7. **class_weight** (default=None): Weights associated with classes. Set this to "balanced" to automatically adjust weights inversely proportional to class frequencies in the input data.

### Step 5: Make Predictions

```python
# Make predictions on the test data
y_pred = logistic_model.predict(X_test)
```

### Step 6: Evaluate the Model

```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Generate a classification report
classification_rep = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:")
print(classification_rep)
```

### Step 7: Visualization (Optional)

```python
# Plot decision boundaries over a grid of feature values (optional)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = logistic_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Logistic Regression Decision Boundaries')
plt.show()
```

**Explanation:**

1. We import the necessary libraries for numerical operations, visualization, logistic regression, and dataset handling.
2. We load the Iris dataset and use only the first two features for simplicity.
3. The dataset is split into training and testing sets.
4. We create and train a logistic regression model with customizable parameters.
5. Predictions are made on the test data.
6. Model performance is evaluated with accuracy and a classification report.
7. Optionally, the decision boundaries can be visualized in the feature space for better understanding.
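To see how the parameters listed above can be combined in practice, here is a minimal sketch. The specific values (C=0.5, max_iter=200, and so on) are illustrative, not tuned; the point is only to show the knobs being set and the per-class probabilities that `predict_proba` returns.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, :2], iris.target, test_size=0.2, random_state=42
)

# Illustrative (untuned) settings: stronger regularization, balanced class weights
model = LogisticRegression(
    penalty="l2",             # L2 regularization (the default)
    C=0.5,                    # smaller C = stronger regularization
    solver="lbfgs",           # compatible with the "l2" penalty
    max_iter=200,             # give the solver more room to converge
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=42,
)
model.fit(X_train, y_train)

# predict_proba returns one probability per class; each row sums to 1
proba = model.predict_proba(X_test)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy * 100:.2f}%")
print("First test sample class probabilities:", proba[0].round(3))
```

Because the dataset here is reasonably balanced, `class_weight="balanced"` changes little; it matters far more on skewed data.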
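Since the overview describes the binary case but the Iris demo is multiclass, a short binary sketch may help connect the two. Restricting Iris to its first two classes (this restriction is our own choice for illustration) gives a true binary problem, where the model's probability for class 1 is exactly the sigmoid of a linear function of the features:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Keep only two classes (setosa = 0, versicolor = 1) to get a binary problem
mask = iris.target < 2
X_bin, y_bin = iris.data[mask, :2], iris.target[mask]

clf = LogisticRegression().fit(X_bin, y_bin)

# In the binary case the model computes P(y = 1 | x) = sigmoid(w . x + b)
z = X_bin @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# This matches the class-1 column of predict_proba
assert np.allclose(manual_proba, clf.predict_proba(X_bin)[:, 1])
```

Multiclass logistic regression generalizes this by replacing the sigmoid with a softmax over one linear score per class (or by fitting one-vs-rest binary models, depending on `multi_class`).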