# Classification

## Overview

Classification in machine learning is a supervised learning technique where the objective is to develop a model that assigns categorical labels (or classes) to data points. The model learns the patterns between features (attributes of the data) and their corresponding labels from a labeled training dataset. The trained model is then used to predict the class of unseen samples.

### Examples of Classification Applications

* **Email Spam Filtering:** Distinguishing between 'spam' and 'not spam' emails.
* **Medical Diagnosis:** Identifying the presence or absence of a disease based on medical information.
* **Customer Churn Prediction:** Predicting whether customers are likely to discontinue using a service.
* **Sentiment Analysis:** Categorizing text data (e.g., reviews, social posts) as positive, negative, or neutral.
* **Image Classification:** Assigning images to object categories (e.g., dog, cat, car, etc.).

### Core Concepts

* **Features:** The attributes or variables of a data point (e.g., color, size, word frequency).
* **Labels:** The target categories (e.g., spam/not spam, diseased/healthy).
* **Classes:** The possible categories a data point can belong to.
* **Decision Boundary:** The line or hyperplane the model learns to separate different classes in the data.

### Popular Classification Methods

* **Discriminative Models:**
    * These models directly model the decision boundary between classes.
    * Examples:
        * Logistic Regression
        * K-Nearest Neighbors (KNN)
        * Support Vector Machines (SVM)
* **Generative Models:**
    * These learn the distribution of features within each class and then use probability to classify new data points.
    * Examples:
        * Naive Bayes
        * Linear Discriminant Analysis (LDA)
        * Quadratic Discriminant Analysis (QDA)

### Evaluation Metrics

Choosing the right evaluation metrics is essential for understanding and optimizing your classifier's performance.

* **Confusion Matrix:** A table that summarizes your model's predictions:
    * **True Positive (TP):** Correctly predicted positive cases
    * **False Positive (FP):** Incorrectly predicted positive cases (Type I error)
    * **True Negative (TN):** Correctly predicted negative cases
    * **False Negative (FN):** Incorrectly predicted negative cases (Type II error)

    ![image_5baaeac4c0](https://hackmd.io/_uploads/Hk-CeOWbC.png)

* **Accuracy:** The proportion of correct predictions out of all predictions.
    $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$
    * Accuracy can be misleading when dealing with imbalanced datasets (i.e., one class is much more frequent than the other).
* **Precision:** The fraction of true positive predictions out of all positive predictions made by the model.
    $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
    * Prioritizes minimizing false positives (e.g., email spam detection).
* **Recall (Sensitivity):** The fraction of true positive predictions out of all actual positive cases.
    $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
    * Prioritizes minimizing false negatives (e.g., medical diagnosis of a serious disease).
* **F1-Score:** The harmonic mean of precision and recall.
    $$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
    * A useful metric when you seek a balance between precision and recall, especially for imbalanced classes.
* **F-Beta Score:** A weighted harmonic mean of precision and recall, where the beta value controls the emphasis on one or the other.
    $$F_{\beta} = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}$$
    * $\beta=1$ yields the F1 score, which equally balances precision and recall.
    * $\beta<1$ gives more weight to precision.
    * $\beta>1$ gives more weight to recall.
* **ROC Curve and AUC:**
    * **ROC Curve:** A graph plotting the True Positive Rate (Recall) against the False Positive Rate ($\frac{FP}{FP+TN}$) at various classification thresholds.
        * Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
        * A perfect classifier would have a ROC curve that touches the top-left corner (100% sensitivity, 100% specificity).
    * **AUC (Area Under the ROC Curve):** Measures a classifier's ability to discriminate between classes. Higher AUC indicates better performance.
        * A model whose predictions are 100% wrong has an AUC of 0.
        * A model whose predictions are 100% correct has an AUC of 1.

### Example

```python=
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score, roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

beta1 = 0.5
beta2 = 1
beta3 = 1.5
f_beta1 = fbeta_score(y_true, y_pred, beta=beta1)
f_beta2 = fbeta_score(y_true, y_pred, beta=beta2)
f_beta3 = fbeta_score(y_true, y_pred, beta=beta3)

print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1-Score: ", f1)
print(f"F-Beta Score (beta={beta1}): ", f_beta1)
print(f"F-Beta Score (beta={beta2}): ", f_beta2)
print(f"F-Beta Score (beta={beta3}): ", f_beta3)

# ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred)
auc_score = roc_auc_score(y_true, y_pred)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
```
```
Accuracy:  0.625
Precision:  0.75
Recall:  0.6
F1-Score:  0.6666666666666666
F-Beta Score (beta=0.5):  0.7142857142857143
F-Beta Score (beta=1):  0.6666666666666666
F-Beta Score (beta=1.5):  0.639344262295082
```
![image](https://hackmd.io/_uploads/rkEG-07-R.png)
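In the example above, `roc_curve` receives hard 0/1 predictions, so the curve has only a few points. In practice the ROC curve is usually computed from predicted probability scores (e.g., from `predict_proba`), which gives one point per distinct threshold. A minimal sketch with hypothetical scores (the `y_score` values below are illustrative and not produced by any model in this section):

```python=
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_score = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.35, 0.7]  # hypothetical predicted probabilities

# One (FPR, TPR) point per distinct threshold in y_score
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("Thresholds:", thresholds)
print("AUC:", roc_auc_score(y_true, y_score))
```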
## Discriminative Model

Discriminative models excel at directly learning decision boundaries between classes to classify data points. They focus on modeling the probability of a data point belonging to a specific class, given its features. Let's dive into some of the most popular discriminative algorithms:

### Logistic Regression

* **Purpose:** Models the probability of a binary outcome $p(X) = \mathbb{P}(Y=1|X)$.
* **How it works:**
    * **Sigmoid function:** Logistic regression employs the sigmoid function to model probabilities between $0$ and $1$:
    $$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$$
    ![image](https://hackmd.io/_uploads/Sk4Y5dr-C.png)
    * **Logit function:** The logit function transforms probabilities into a linear expression, aiding interpretability:
    $$\text{logit}(p(X)) = \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
    * **Maximum Likelihood Estimation:** Logistic regression finds the optimal coefficients (the $\beta$ values) by maximizing the likelihood of the observed data:
    $$\ell(\hat{\beta}_0, \dots, \hat{\beta}_p) = \prod_{i:y_i=1} p(x_i)\prod_{j:y_j=0} (1-p(x_j))$$
* **Key Advantages:**
    * Interpretable coefficients indicate the effect of each feature on class probabilities (see the odds-ratio sketch after the example below).
    * Provides class probability estimates, useful for ranking or uncertainty assessments.

#### Example: Predicting Credit Card Default

This example demonstrates how to use logistic regression to predict the probability of credit card default based on balance and student status.

```python=
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Set a random seed for reproducibility
np.random.seed(0)

# Generate a synthetic dataset
n_samples = 1000
balance = np.random.normal(1500, 800, n_samples)  # Generate balance amounts
student = np.random.binomial(1, 0.2, n_samples)   # Generate student status (1 for student, 0 for non-student)

# Generate binary outcome 'default'
# The coefficients (-4, 0.0025, 0.5) are chosen arbitrarily for the purpose of generating a synthetic dataset
logit = -4 + 0.0025 * balance + 0.5 * student             # Calculate the logit (linear combination of predictors and coefficients)
prob_default = 1 / (1 + np.exp(-logit))                   # Convert logit to probability using the logistic function
default = np.random.binomial(1, prob_default, n_samples)  # Generate binary default outcome based on the probability

# Create a DataFrame
df = pd.DataFrame({
    'balance': balance,
    'student': student,
    'default': default
})

print("Synthetic dataset:")
print(df.head())  # Print the first few rows of the DataFrame

# Fit logistic regression models
# sm.add_constant() adds a column of ones to the feature matrix, which acts as the intercept term in the regression model.
X1 = sm.add_constant(df['balance'])  # Model with only the balance feature
model_balance = sm.Logit(df['default'], X1)
result_balance = model_balance.fit(disp=0)  # Fit the model and suppress the output

X2 = sm.add_constant(df['student'])  # Model with only the student status feature
model_student = sm.Logit(df['default'], X2)
result_student = model_student.fit(disp=0)

X3 = sm.add_constant(df[['balance', 'student']])  # Model with both balance and student status features
model_both = sm.Logit(df['default'], X3)
result_both = model_both.fit(disp=0)

# Predict default probability for new observations
new_balance = 1000        # Set a new balance value
new_student_status = 1    # Set a new student status (1 for student)

prob_default_balance = float(result_balance.predict([1, new_balance]))
prob_default_student_yes = float(result_student.predict([1, new_student_status]))
prob_default_student_no = float(result_student.predict([1, 1 - new_student_status]))
prob_default_new = float(result_both.predict([1, new_balance, new_student_status]))

# Print model summaries and predicted probabilities
print("\nModel with only balance:")
print(result_balance.summary())
print(f"\nProbability of default for balance of $1000: {prob_default_balance:.2%}")

print("\nModel with only student status:")
print(result_student.summary())
print(f"\nProbability of default for a student: {prob_default_student_yes:.2%}")
print(f"Probability of default for a non-student: {prob_default_student_no:.2%}")

print("\nModel with both balance and student status:")
print(result_both.summary())
print(f"\nProbability of default for a student with balance of $1000: {prob_default_new:.2%}")
```
```
Synthetic dataset:
       balance  student  default
0  2911.241877        1        0
1  1820.125767        0        1
2  2282.990387        1        1
3  3292.714559        1        1
4  2994.046392        0        1

Model with only balance:
                           Logit Regression Results
==============================================================================
Dep. Variable:                default   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Mon, 22 Apr 2024   Pseudo R-squ.:                  0.3068
Time:                        20:05:18   Log-Likelihood:                -479.15
converged:                       True   LL-Null:                       -691.22
Covariance Type:            nonrobust   LLR p-value:                 3.046e-94
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.6285      0.245    -14.825      0.000      -4.108      -3.149
balance        0.0024      0.000     15.244      0.000       0.002       0.003
==============================================================================

Probability of default for balance of $1000: 21.84%
Model with only student status:
                           Logit Regression Results
==============================================================================
Dep. Variable:                default   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Mon, 22 Apr 2024   Pseudo R-squ.:                0.007691
Time:                        20:05:18   Log-Likelihood:                -685.91
converged:                       True   LL-Null:                       -691.22
Covariance Type:            nonrobust   LLR p-value:                  0.001111
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2341      0.072     -3.249      0.001      -0.375      -0.093
student        0.5005      0.154      3.245      0.001       0.198       0.803
==============================================================================

Probability of default for a student: 56.62%
Probability of default for a non-student: 44.17%

Model with both balance and student status:
                           Logit Regression Results
==============================================================================
Dep. Variable:                default   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      997
Method:                           MLE   Df Model:                            2
Date:                Mon, 22 Apr 2024   Pseudo R-squ.:                  0.3161
Time:                        20:05:18   Log-Likelihood:                -472.74
converged:                       True   LL-Null:                       -691.22
Covariance Type:            nonrobust   LLR p-value:                 1.306e-95
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.8376      0.258    -14.856      0.000      -4.344      -3.331
balance        0.0024      0.000     15.199      0.000       0.002       0.003
student        0.6938      0.196      3.548      0.000       0.311       1.077
==============================================================================

Probability of default for a student with balance of $1000: 32.04%
```
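To illustrate the coefficient interpretability mentioned above, exponentiating the fitted coefficients turns them into odds ratios. A minimal follow-up sketch, assuming `result_both` from the example above is still in scope:

```python=
import numpy as np

# Exponentiated logistic-regression coefficients are odds ratios
odds_ratios = np.exp(result_both.params)
print(odds_ratios)

# For example, the fitted 'balance' coefficient of about 0.0024 corresponds to an odds ratio of
# roughly exp(0.0024) ≈ 1.0024: each additional dollar of balance multiplies the odds of default
# by about 1.0024, holding student status fixed.
```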
### Multinomial Logistic Regression

Multinomial logistic regression is a powerful extension of logistic regression, designed to handle classification problems where there are three or more possible classes ($K > 2$). It's a discriminative model, directly learning the decision boundaries that separate data points belonging to different categories.

* **Probabilistic Modeling:** Multinomial logistic regression calculates the probability of a data point belonging to each possible class. The class with the highest probability is assigned as the prediction:
$$\mathbb{P}(Y = k|X = x) = \begin{cases} \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{1 + \sum_{i=1}^{K-1}e^{\beta_{i0} + \beta_{i1}x_1 + \cdots + \beta_{ip}x_p}} &\text{ for } k = 1,\dots,K-1\\ \frac{1}{1 + \sum_{i=1}^{K-1}e^{\beta_{i0} + \beta_{i1}x_1 + \cdots + \beta_{ip}x_p}} &\text{ for } k = K \end{cases}$$
* **Baseline Class:** One class is selected as the reference point. Multinomial logistic regression effectively runs $K-1$ separate logistic regressions, each comparing one class against the chosen baseline.
* **Log Odds Ratios and Interpretation:** To compare classes directly (not just against the baseline), log odds ratios are calculated (see the coefficient sketch after the example below):
$$\log \left( \frac{\mathbb{P}(Y = k | X = x)}{\mathbb{P}(Y = k' | X = x)} \right) = (\beta_{k0}-\beta_{k'0}) + (\beta_{k1}-\beta_{k'1})X_1 + \cdots + (\beta_{kp}-\beta_{k'p})X_p$$
* The coefficients $\beta$ provide valuable insights about the influence of features on the odds of a data point belonging to one class versus another.

#### Example $(K=3)$

This example demonstrates multinomial logistic regression for a 3-class problem.

```python=
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Generate data
def make_data(n_data, n_features, cov_matrices, shifts):
    rng = np.random.RandomState()
    X = np.concatenate([
        rng.randn(n_data, n_features) @ cov + shift
        for cov, shift in zip(cov_matrices, shifts)
    ])
    y = np.concatenate([
        np.full(n_data, i) for i in range(len(cov_matrices))
    ])
    return X, y

def grid(X):  # feature X = [X_1, X_2]
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    return xx, yy, grid_points

def plot_data(X, y, title):
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=20)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)

# Define covariance matrices for each class
cov_class_1 = np.array([[1, 0.3], [0.3, 1]])
cov_class_2 = np.array([[0.8, 0], [0, 0.9]])
cov_class_3 = np.array([[1, 0.1], [0.1, 0.9]])
cov_matrices = [cov_class_1, cov_class_2, cov_class_3]
shifts = [[0, 0], [1, 3], [2, 2]]

# Generate data
X, y = make_data(n_data=20, n_features=2, cov_matrices=cov_matrices, shifts=shifts)

# Create grid to plot decision boundaries
xx, yy, grid_points = grid(X)

# Fit Multinomial Logistic Regression
log_reg_clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
log_reg_clf.fit(X, y)

# Predict on grid points
Z_log_reg = log_reg_clf.predict(grid_points)
Z_log_reg = Z_log_reg.reshape(xx.shape)

# Plot decision boundaries and data
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z_log_reg, alpha=0.4, cmap='tab10')
plt.contour(xx, yy, Z_log_reg, linestyles=['--'], colors='black')
plot_data(X, y, title="Multinomial Logistic Regression Decision Boundaries")
plt.show()
```

![image](https://hackmd.io/_uploads/BkcZQfV-0.png)
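As a follow-up to the log-odds discussion above, here is a minimal sketch, assuming `log_reg_clf` from the example above is still in scope. Note that scikit-learn's multinomial fit uses a symmetric softmax over all $K$ classes rather than a baseline class, so `coef_` has one row per class; differences between rows still give log odds between classes. The point `x_new` below is illustrative:

```python=
import numpy as np

print("Coefficient matrix shape (K classes x p features):", log_reg_clf.coef_.shape)
print("Intercepts:", log_reg_clf.intercept_)

# Log odds of class 1 vs class 0 at a new point x: (beta_1 - beta_0) . x + (b_1 - b_0)
x_new = np.array([1.0, 2.0])  # illustrative point
log_odds_1_vs_0 = (log_reg_clf.coef_[1] - log_reg_clf.coef_[0]) @ x_new \
                  + (log_reg_clf.intercept_[1] - log_reg_clf.intercept_[0])
print("Log odds of class 1 vs class 0 at x_new:", log_odds_1_vs_0)
```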
### K-Nearest Neighbors (KNN)

* **Purpose:** Classifies unlabeled data points based on the majority class of their $K$ closest neighbors in the training set.
* **How it works:**
    1. **Choose a value for $K$:** This hyperparameter represents the number of neighbors to consider.
    2. **Calculate distances:** Find the $K$ nearest neighbors in the training set to a new data point. Euclidean distance is common, but other distance metrics can be used.
    3. **Majority vote:** Classify the new data point according to the most common class among its $K$ nearest neighbors.
* **Key Considerations:**
    * **Data scaling:** Standardizing or normalizing features is often necessary to prevent features with larger ranges from dominating distance calculations.
    * **Choosing $K$:** An overly small $K$ leads to overfitting (high variance); an excessively large $K$ leads to underfitting (high bias). Cross-validation helps find the optimal $K$ value (see the cross-validation sketch after the example below).

#### Example: Comparing Different $K$ Values

This example demonstrates how the choice of $K$ affects the decision boundaries of a KNN classifier.

```python=
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN
import matplotlib.pyplot as plt

# Generate data
cov_class_1 = np.array([[1, 0.3], [0.3, 1]])
cov_class_2 = np.array([[0.8, 0], [0, 0.9]])
cov_matrices = [cov_class_1, cov_class_2]
shifts = [[0, 0], [1, 3]]
X, y = make_data(n_data=20, n_features=2, cov_matrices=cov_matrices, shifts=shifts)

# Set up subplots
fig, axes = plt.subplots(1, 3, figsize=(24, 8))

# Fit KNN for different K values and plot
n_neighbors = [2, 5, 10]
knn_clfs = [KNN(n_neighbors=n) for n in n_neighbors]
xx, yy, grid_points = grid(X)

for ax, knn_clf, n_neighbor in zip(axes.flatten(), knn_clfs, n_neighbors):
    knn_clf.fit(X, y)
    Z_knn = knn_clf.predict(grid_points).reshape(xx.shape)
    ax.contourf(xx, yy, Z_knn, alpha=0.4, cmap='tab10')
    ax.contour(xx, yy, Z_knn, levels=[0.5], linestyles=['--'], colors='black')
    ax.set_title(f"KNN Classifier (K={n_neighbor})", fontsize=20)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=50)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())

plt.tight_layout()
plt.show()
```

![image](https://hackmd.io/_uploads/SkXJBzVWC.png)
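The cross-validation point above can be sketched briefly. A minimal sketch, assuming the synthetic `X`, `y` from the example above are still in scope; the scaler sits inside a pipeline so it is fit only on the training folds, and the 5-fold setup and the range of $K$ values are illustrative choices:

```python=
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

k_values = range(1, 11)
cv_scores = []
for k in k_values:
    knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(knn_pipe, X, y, cv=5)  # 5-fold cross-validated accuracy
    cv_scores.append(scores.mean())

best_k = k_values[int(np.argmax(cv_scores))]
print(f"Best K by 5-fold CV accuracy: {best_k} (accuracy = {max(cv_scores):.3f})")
```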
## Generative Models

Generative models offer a powerful way to model the underlying probability distributions that give rise to your data. Unlike discriminative models, which directly learn decision boundaries, generative models focus on understanding the joint probability distribution $\mathbb{P}(X, Y)$ of features $X$ and class labels $Y$. This provides several advantages:

* **Data Generation:** Generate new, synthetic samples that resemble your training data (useful for simulations or augmenting datasets).
* **Uncertainty Quantification:** Make more informed predictions along with estimates of uncertainty.
* **Semi-Supervised Learning:** Leverage generative models effectively even when there's a large amount of unlabeled data.

### Key Concepts

* **Joint Probability Modeling:** Generative models strive to learn the joint distribution $\mathbb{P}(X, Y)$.
* **Bayes' Theorem for Classification:** Once you have the joint distribution, you can use Bayes' Theorem for classification:
$$\mathbb{P}(Y=k | X=x) = \frac{\mathbb{P}(X=x | Y=k) \cdot \mathbb{P}(Y=k)}{\mathbb{P}(X=x)}$$
* **Bayes Classifier:** Explicitly utilizes these probabilities to assign the most probable class to new data points:
$$\mathbb{P}(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{i=1}^K \pi_i f_i(x)}$$
where:
    * $\pi_k$ is the prior probability of class $k$.
    * $f_k(x)$ is the class-conditional density $\mathbb{P}(X=x|Y=k)$.

### Linear Discriminant Analysis

*Linear Discriminant Analysis (LDA)* is a classic classification method that excels when features from different classes follow Gaussian distributions with a shared covariance matrix. It builds linear decision boundaries for separating classes.

#### Key Assumptions

* **Gaussian Distributed Features (within each class):** Each class is modeled as a multivariate Gaussian distribution.
* **Homoscedasticity:** All classes share the same covariance matrix.

#### Understanding LDA

1. **Class-Conditional Densities:** The Gaussian probability density function (p.d.f.) for class $k$ is:
$$f_k(x)=\frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)\right)$$
    where:
    * $x$: A data point (a vector of features).
    * $\mu_k$: The mean vector of class $k$.
    * $\Sigma$: The shared covariance matrix.
2. **Discriminant Functions:** LDA uses Bayes' Theorem and these densities to derive a discriminant function for each class:
$$\delta_k(x)= x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$$
    where $\pi_k$ is the prior probability of class $k$.
3. **Decision Rule:** LDA assigns a new data point $x$ to the class with the highest discriminant function value:
$$\arg\max_k \delta_k(x)$$

#### Parameter Estimation

* **Mean Vector $\mu_k$:**
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:y_i=k} x_i$$
* **Shared Covariance Matrix $\Sigma$:**
$$\hat{\Sigma}=\frac{1}{n-K} \sum_{k=1}^K \sum_{i:y_i=k} (x_i-\hat{\mu}_k) (x_i-\hat{\mu}_k)^T$$
* **Prior Probabilities $\pi_k$:**
$$\hat{\pi}_k = \frac{n_k}{n}$$

#### LDA Decision Boundaries

The decision boundary between two classes is the set of points where their discriminant functions are equal. For LDA, these boundaries are *linear* (hyperplanes in higher dimensions).

1. **Univariate Case $(p=1)$:** In this case, the *Bayes decision boundary* is the point $x$ such that $\delta_1(x)=\delta_2(x)$.
    * For equal priors ($\pi_1=\pi_2=0.5$), it simplifies to $x=\frac{\mu_1+\mu_2}{2}$.

```python=
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Example 1: Bayes Decision Boundary
mu1, mu2 = -1, 1       # means
sigma1, sigma2 = 1, 1  # standard deviations

x = np.linspace(-5, 5, 100)
y1 = norm.pdf(x, mu1, sigma1)
y2 = norm.pdf(x, mu2, sigma2)

plt.figure(figsize=(8, 6))
plt.plot(x, y1, color='green', label="N(-1,1)")
plt.plot(x, y2, color='magenta', label="N(1,1)")
plt.axvline(x=0, color='black', linestyle='-', label="Bayes Decision Boundary")
plt.title("Bayes Decision Boundary")
plt.legend()
plt.show()

# Example 2: LDA vs Bayes Decision Boundary
cov_matrices = [[[1]], [[1]]]
shifts = [-1, 1]
X, y = make_data(n_data=20, n_features=1, cov_matrices=cov_matrices, shifts=shifts)

# Bayes decision boundary
bayes_decision_boundary = np.mean(shifts)

# Fit LDA
clf = LDA()
clf.fit(X, y)
lda_decision_boundary = -clf.intercept_ / clf.coef_[0]

# Plot
plt.figure(figsize=(10, 6))
plt.hist(X[:20], bins=10, color='green', alpha=0.5, label='N(-1,1)')
plt.hist(X[20:], bins=10, color='magenta', alpha=0.5, label='N(1,1)')
plt.axvline(bayes_decision_boundary, color='black', linestyle='-', label='Bayes')
plt.axvline(lda_decision_boundary, color='black', linestyle='--', label='LDA')
plt.legend()
plt.xlabel('X')
plt.ylabel('Frequency')
plt.title('LDA vs Bayes Decision Boundary')
plt.show()
```

![image](https://hackmd.io/_uploads/Bk31KGV-R.png)
![image](https://hackmd.io/_uploads/SJ3gKMVbA.png)
2. **Multivariate Case $(p>1)$:** In this case, the *Bayes decision boundary* is
$$\{x \in \mathbb{R}^p \mid \delta_i(x)=\delta_j(x) \text{ for some } i\neq j\}$$

```python=
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import matplotlib.pyplot as plt

# Define covariance matrix and mean shifts for each class
cov_matrix = np.array([[1, 0.5], [0.5, 1]])
cov_matrices = [cov_matrix] * 3
shifts = [[0, 0], [3, 4], [4, 3]]

# Generate data
X, y = make_data(n_data=20, n_features=2, cov_matrices=cov_matrices, shifts=shifts)

# Fit LDA
clf = LDA()
clf.fit(X, y)

# Create grid for plotting decision boundaries
xx, yy, grid_points = grid(X)

# LDA decision boundary
Z_lda = clf.predict(grid_points)
Z_lda = Z_lda.reshape(xx.shape)

# Bayes decision boundary
pdfs = [multivariate_normal(mean=shift, cov=cov_matrix).pdf(grid_points) for shift in shifts]
Z_bayes = np.argmax(pdfs, axis=0)
Z_bayes = Z_bayes.reshape(xx.shape)

# Plot
plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z_lda, alpha=0.4, cmap='tab10')
plt.contour(xx, yy, Z_lda, levels=[0.5, 1.5], linestyles=['--'], colors='black')
plt.contour(xx, yy, Z_bayes, levels=[0.5, 1.5], linestyles=['-'], colors='black')
plot_data(X, y, title="LDA(--) vs Bayes(-) Decision Boundaries")
plt.show()
```

![image](https://hackmd.io/_uploads/B1SVpfNZR.png)

### Quadratic Discriminant Analysis

*Quadratic Discriminant Analysis (QDA)* is a generative classification method that offers more flexibility than Linear Discriminant Analysis (LDA). Let's break down how it works:

#### Relaxing LDA's Assumption

* **LDA:** Assumes all classes share the same covariance matrix.
* **QDA:** Relaxes this assumption, allowing each class to have its own covariance matrix:
$$f_k(x)=\frac{1}{(2\pi)^{p/2} \det(\Sigma_k)^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right)$$

#### Consequences

* **Decision Boundaries:** QDA's decision boundaries become *quadratic* (as opposed to linear in LDA). This can better model problems where class distributions have distinct shapes.
* **Parameter Estimation:** QDA requires estimating more parameters (a separate covariance matrix for each class).

#### Quadratic Discriminant Function

Similar to LDA, QDA employs discriminant functions for classification:
$$\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k -\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log(\det \Sigma_k)+\log(\pi_k)$$

![image](https://hackmd.io/_uploads/SJ012Pp1R.png)

* **Red dashed line (QDA):** Represents the decision boundary modeled by QDA. Its quadratic shape captures the differences in class distributions.
* **Black dashed line (LDA):** Shows the decision boundary produced by LDA, which is linear due to LDA's shared covariance assumption.
* **Black line (Bayes):** Represents the optimal (Bayes) decision boundary based on the true underlying distributions.
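The figure above is shown as given; the following is a minimal sketch of fitting LDA and QDA side by side, not the code that produced that figure. It assumes the `make_data`, `grid`, and `plot_data` helpers from the multinomial logistic regression example are still in scope, and the class covariances and shifts are illustrative choices:

```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis as LDA,
                                           QuadraticDiscriminantAnalysis as QDA)

# Two classes with clearly different covariance structures (illustrative values)
cov_matrices = [np.array([[1.0, 0.7], [0.7, 1.0]]),
                np.array([[1.0, -0.6], [-0.6, 1.0]])]
shifts = [[0, 0], [2, 2]]
X, y = make_data(n_data=50, n_features=2, cov_matrices=cov_matrices, shifts=shifts)

# Fit LDA (shared covariance) and QDA (class-specific covariances)
lda_clf = LDA().fit(X, y)
qda_clf = QDA().fit(X, y)

# Predict on a grid and draw both decision boundaries
xx, yy, grid_points = grid(X)
Z_lda = lda_clf.predict(grid_points).reshape(xx.shape)
Z_qda = qda_clf.predict(grid_points).reshape(xx.shape)

plt.figure(figsize=(8, 8))
plt.contour(xx, yy, Z_lda, levels=[0.5], linestyles=['--'], colors='black')  # linear boundary
plt.contour(xx, yy, Z_qda, levels=[0.5], linestyles=['--'], colors='red')    # quadratic boundary
plot_data(X, y, title="LDA (black) vs QDA (red) Decision Boundaries")
plt.show()
```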
### Naive Bayes

*Naive Bayes (NB)* is a flexible and often surprisingly effective generative classifier. It's built upon the idea of conditional independence between features, which simplifies computations.

* **Assumption:** Within each class, the predictors are independent:
$$f_k(x)=f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$$
where $f_{kj}$ is the density of the $j$-th predictor in class $k$.
* **Modeling Feature Distributions:**
    * **Continuous Features:** Often modeled with Gaussian (normal) distributions. You estimate the mean $\mu_{jk}$ and variance $\sigma_{jk}^2$ for each feature $j$ within each class $k$. Alternatively, non-parametric methods like Kernel Density Estimation (KDE) can be used.
    * **Categorical Features:** Class-conditional probabilities are estimated directly from the frequencies of each feature value within each class (see the brief sketch below).
* **Bayes' Theorem for Classification:**
$$\mathbb{P}(Y=k|X=x)=\frac{\pi_k \times f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)}{\sum_{i=1}^K \pi_i \times f_{i1}(x_1) \times f_{i2}(x_2) \times \cdots \times f_{ip}(x_p)}$$
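A minimal sketch of the categorical case, using scikit-learn's `CategoricalNB` on hypothetical integer-encoded features (the data values below are purely illustrative):

```python=
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Two categorical features encoded as integers (e.g. color in {0,1,2}, size in {0,1})
X_cat = np.array([[0, 0], [0, 1], [1, 1], [2, 0], [2, 1], [1, 0]])
y_cat = np.array([0, 0, 0, 1, 1, 1])

# Class-conditional probabilities come from value frequencies; alpha adds Laplace smoothing
cat_nb = CategoricalNB(alpha=1.0)
cat_nb.fit(X_cat, y_cat)

print(cat_nb.predict([[2, 1]]))        # predicted class for a new point
print(cat_nb.predict_proba([[2, 1]]))  # estimated class probabilities
```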
#### Example

Here's an example comparing Naive Bayes with diagonal and non-diagonal covariance matrices:

```python=
import numpy as np
from sklearn.naive_bayes import GaussianNB as NB
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

# Diagonal Covariance Matrix
cov_matrix = np.array([[1, 0], [0, 1.5]])                 # Create a diagonal covariance matrix
cov_class_1, cov_class_2, cov_class_3 = [cov_matrix] * 3  # Use the same covariance matrix for all classes
shifts = [[0, 0], [1, 2], [3, 1]]

# Generate data using the diagonal covariance matrices and mean shifts
X, y = make_data(n_data=30, n_features=2, cov_matrices=[cov_class_1, cov_class_2, cov_class_3], shifts=shifts)

# Fit Naive Bayes classifier
nb_clf = NB()
nb_clf.fit(X, y)

# Create a grid of points for plotting decision boundaries
xx, yy, grid_points = grid(X)

# Predict classes for the grid points using Naive Bayes
Z_nb = nb_clf.predict(grid_points)
Z_nb = Z_nb.reshape(xx.shape)

# Calculate the true posterior probabilities for the grid points
pdfs = [multivariate_normal(mean=shift, cov=cov_matrix).pdf(grid_points) for shift in shifts]
pdfs = np.array(pdfs)

# Assign each point to the class with the highest probability (Bayes classifier)
Z_bayes = np.argmax(pdfs, axis=0)
Z_bayes = Z_bayes.reshape(xx.shape)

# Plot the decision boundaries and data points
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.contourf(xx, yy, Z_nb, alpha=0.4, cmap='tab10')
ax1.contour(xx, yy, Z_nb, levels=[0.5, 1.5], linestyles=['--'], colors='black')
ax1.contour(xx, yy, Z_bayes, levels=[0.5, 1.5], linestyles=['-'], colors='black')
ax1.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=20)
ax1.set_title('Naive Bayes vs Bayes Classifier\nDiagonal Covariance')

# Non-Diagonal Covariance Matrix
cov_class_1 = np.array([[1, 0.5], [0.5, 1.5]])
cov_class_2 = np.array([[1, 0.5], [0.5, 1.5]])
cov_class_3 = np.array([[1, 0.5], [0.5, 1.5]])
cov_matrices = [cov_class_1, cov_class_2, cov_class_3]  # Use the same non-diagonal covariance matrix for all classes
shifts = [[0, 0], [1, 2], [3, 1]]

# Generate data using the non-diagonal covariance matrices and mean shifts
X, y = make_data(n_data=30, n_features=2, cov_matrices=cov_matrices, shifts=shifts)

# Fit Naive Bayes classifier
nb_clf = NB()
nb_clf.fit(X, y)

# Create a grid of points for plotting decision boundaries
xx, yy, grid_points = grid(X)

# Predict classes for the grid points using Naive Bayes
Z_nb = nb_clf.predict(grid_points)
Z_nb = Z_nb.reshape(xx.shape)

# Calculate the true posterior probabilities for the grid points
# (using the same non-diagonal covariance that was used to generate the data)
pdfs = [multivariate_normal(mean=shift, cov=cov_class_1).pdf(grid_points) for shift in shifts]
pdfs = np.array(pdfs)

# Assign each point to the class with the highest probability (Bayes classifier)
Z_bayes = np.argmax(pdfs, axis=0)
Z_bayes = Z_bayes.reshape(xx.shape)

# Plot the decision boundaries and data points
ax2.contourf(xx, yy, Z_nb, alpha=0.4, cmap='tab10')
ax2.contour(xx, yy, Z_nb, levels=[0.5, 1.5], linestyles=['--'], colors='black')
ax2.contour(xx, yy, Z_bayes, levels=[0.5, 1.5], linestyles=['-'], colors='black')
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=20)
ax2.set_title('Naive Bayes vs Bayes Classifier\nNon-Diagonal Covariance')

plt.tight_layout()
plt.show()
```

![image](https://hackmd.io/_uploads/Skzmz74-A.png)

## Comparing Classification Methods

Choosing the right classification method depends on your data's characteristics and modeling goals. Here's a breakdown of key methods and how they compare:

| Method | Key Assumptions | Decision Boundary |
|---|---|---|
| Logistic Regression | None about predictor distribution | Linear |
| KNN | None about predictor distribution | Varies (more flexible) |
| LDA | Multivariate normality, equal class covariance matrices | Linear |
| QDA | Multivariate normality, class-specific covariances | Quadratic |
| Naive Bayes | Conditional feature independence given the class label | Varies (more flexible) |

#### Guidance for Choosing a Model

* **Start Simple, Iterate:** Begin with logistic regression or LDA due to their simplicity. If performance needs improvement, consider more complex models.
* **Check Assumptions:**
    * Assess the normality of your features. LDA/QDA are better suited for Gaussian-like data.
    * Consider interactions between features. Strong dependencies might violate Naive Bayes' independence assumption.
* **Data Size and Complexity:**
    * **Small Datasets:** LDA often performs better than QDA to avoid overfitting.
    * **Large Datasets:** QDA's flexibility becomes advantageous.
    * **Many Features:** Naive Bayes might excel due to fewer parameters to estimate.

#### Additional Insights

* **Trade-offs:** More flexible models (QDA, Naive Bayes) have higher variance and can overfit. Simpler models (LDA, Logistic Regression) tend to have higher bias but might be more robust.
* **Hybrid Approaches:** It's possible to combine assumptions. For example, you could use Naive Bayes but relax the independence assumption for certain strongly correlated features.
* **No One-Size-Fits-All:** The best method is often determined empirically through evaluation and cross-validation.

### Example: Direction Prediction of Stock Market Data

In this example, we explore the `Smarket` stock market dataset from the ISLP package. Our objective is to **predict the direction of the stock market** ('Up' or 'Down') from predictors that include lagged daily returns (`Lag1`–`Lag5`) and trading volume (`Volume`).
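Before modeling, a quick optional look at the data can help. A minimal sketch, assuming the ISLP package is installed (the `Direction` column is dropped from the correlation matrix because it is categorical):

```python=
from ISLP import load_data

Smarket = load_data('Smarket')
print(Smarket.head())                              # first few rows
print(Smarket['Direction'].value_counts())         # class balance of the target
print(Smarket.drop(columns=['Direction']).corr())  # correlations among the numeric predictors
```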
#### Logistic Regression

```python=
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS, summarize
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import numpy as np
import statsmodels.api as sm

# Load the data
Smarket = load_data('Smarket')
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'

# Split data into train and test sets
train = (Smarket.Year < 2005)
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]

# Transform train and test sets to label ("Up" and "Down") sets
L_train = Smarket.Direction.loc[train]
L_test = Smarket.Direction.loc[~train]

# Logistic Regression
glm_train = sm.GLM(y_train, X_train, family=sm.families.Binomial())  # The default link function is logit.
results = glm_train.fit()
probs = results.predict(exog=X_test)

# Create a confusion matrix and evaluate metrics
labels = np.array(['Down'] * 252)
labels[probs > 0.5] = 'Up'
print("Logistic Regression (Initial features):")
print(confusion_table(labels, L_test))
print(f"\nTest Accuracy = {np.mean(labels == L_test):.4f}")
print(f"Test Error = {np.mean(labels != L_test):.4f}")
print(f"Precision = {precision_score(L_test, labels, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, labels, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, labels, pos_label='Up'):.4f}")
print("\nModel summary:")
print(summarize(results))
print("\n")
```
```
Logistic Regression (Initial features):
Truth      Down   Up
Predicted
Down         77   97
Up           34   44

Test Accuracy = 0.4802
Test Error = 0.5198
Precision = 0.5641
Recall = 0.3121
F1 score = 0.4018

Model summary:
              coef  std err      z  P>|z|
intercept   0.1912    0.334  0.573  0.567
Lag1       -0.0542    0.052 -1.046  0.295
Lag2       -0.0458    0.052 -0.884  0.377
Lag3        0.0072    0.052  0.139  0.889
Lag4        0.0064    0.052  0.125  0.901
Lag5       -0.0042    0.051 -0.083  0.934
Volume     -0.1163    0.240 -0.485  0.628
```

* None of the predictor variables (`Lag1` to `Lag5`, `Volume`) is statistically significant at the commonly used significance levels (e.g., 0.05, 0.01); all of the p-values are greater than these levels.
* Since `Lag1` and `Lag2` have the smallest p-values, they may be the most promising predictors among these features. However, the lack of statistical significance suggests that any improvement from focusing on them might be marginal.
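To see this ranking programmatically, a minimal follow-up sketch, assuming `results` from the fit above is still in scope (statsmodels results expose p-values as a pandas Series):

```python=
# Sort predictors by p-value; Lag1 and Lag2 should appear near the top of the list
print(results.pvalues.drop('intercept').sort_values())
```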
#### Improved Logistic Regression

```python=+
# Feature selection and improved Logistic Regression
model = MS(['Lag1', 'Lag2']).fit(Smarket)
X = model.transform(Smarket)
X_train, X_test = X.loc[train], X.loc[~train]
glm_train = sm.GLM(y_train, X_train, family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
labels = np.array(['Down'] * 252)
labels[probs > 0.5] = 'Up'
print("Logistic Regression (Lag1, Lag2):")
print(confusion_table(labels, L_test))
print(f"\nTest Accuracy = {np.mean(labels == L_test):.4f}")
print(f"Test Error = {np.mean(labels != L_test):.4f}")
print(f"Precision = {precision_score(L_test, labels, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, labels, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, labels, pos_label='Up'):.4f}")
print("\n")
```
```
Logistic Regression (Lag1, Lag2):
Truth      Down   Up
Predicted
Down         35   35
Up           76  106

Test Accuracy = 0.5595
Test Error = 0.4405
Precision = 0.5824
Recall = 0.7518
F1 score = 0.6563
```

* Accuracy: $48\% \Rightarrow 56\%$
* Precision: $56.4\% \Rightarrow 58.2\%$

#### LDA

```python=+
lda = LDA(store_covariance=True)
X_train, X_test = [M.drop(columns=['intercept']) for M in [X_train, X_test]]
lda.fit(X_train, L_train)
lda_pred = lda.predict(X_test)
print("LDA:")
print(confusion_table(lda_pred, L_test))
print(f"\nTest Accuracy = {np.mean(lda_pred == L_test):.4f}")
print(f"Test Error = {np.mean(lda_pred != L_test):.4f}")
print(f"Precision = {precision_score(L_test, lda_pred, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, lda_pred, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, lda_pred, pos_label='Up'):.4f}")
print("\n")
```
```
LDA:
Truth      Down   Up
Predicted
Down         35   35
Up           76  106

Test Accuracy = 0.5595
Test Error = 0.4405
Precision = 0.5824
Recall = 0.7518
F1 score = 0.6563
```

#### QDA

```python=+
qda = QDA(store_covariance=True)
qda.fit(X_train, L_train)
qda_pred = qda.predict(X_test)
print("QDA:")
print(confusion_table(qda_pred, L_test))
print(f"\nTest Accuracy = {np.mean(qda_pred == L_test):.4f}")
print(f"Test Error = {np.mean(qda_pred != L_test):.4f}")
print(f"Precision = {precision_score(L_test, qda_pred, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, qda_pred, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, qda_pred, pos_label='Up'):.4f}")
print("\n")
```
```
QDA:
Truth      Down   Up
Predicted
Down         30   20
Up           81  121

Test Accuracy = 0.5992
Test Error = 0.4008
Precision = 0.5990
Recall = 0.8582
F1 score = 0.7055
```

#### NB

```python=+
nb = GaussianNB()
nb.fit(X_train, L_train)
nb_labels = nb.predict(X_test)
print("Naive Bayes:")
print(confusion_table(nb_labels, L_test))
print(f"\nTest Accuracy = {np.mean(nb_labels == L_test):.4f}")
print(f"Test Error = {np.mean(nb_labels != L_test):.4f}")
print(f"Precision = {precision_score(L_test, nb_labels, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, nb_labels, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, nb_labels, pos_label='Up'):.4f}")
print("\n")
```
```
Naive Bayes:
Truth      Down   Up
Predicted
Down         29   20
Up           82  121

Test Accuracy = 0.5952
Test Error = 0.4048
Precision = 0.5961
Recall = 0.8582
F1 score = 0.7035
```

#### KNN

```python=+
# KNN
best_accuracy, best_precision, best_recall, best_f1 = 0, 0, 0, 0
best_K_acc, best_K_pre, best_K_rec, best_K_f1 = 0, 0, 0, 0

for K in range(1, 20):
    knn = KNN(n_neighbors=K)
    knn_pred = knn.fit(X_train, L_train).predict(X_test)
    accuracy = accuracy_score(L_test, knn_pred)
    precision = precision_score(L_test, knn_pred, pos_label='Up')
    recall = recall_score(L_test, knn_pred, pos_label='Up')
    f1 = f1_score(L_test, knn_pred, pos_label='Up')
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_K_acc = K
    if precision > best_precision:
        best_precision = precision
        best_K_pre = K
    if recall > best_recall:
        best_recall = recall
        best_K_rec = K
    if f1 > best_f1:
        best_f1 = f1
        best_K_f1 = K

print("KNN:")
print(f"Best Accuracy: {best_accuracy:.4f} at K = {best_K_acc}")
print(f"Best Precision: {best_precision:.4f} at K = {best_K_pre}")
print(f"Best Recall: {best_recall:.4f} at K = {best_K_rec}")
print(f"Best F1 Score: {best_f1:.4f} at K = {best_K_f1}")
```
```
KNN:
Best Accuracy: 0.5317 at K = 3
Best Precision: 0.5960 at K = 4
Best Recall: 0.6099 at K = 3
Best F1 Score: 0.5931 at K = 3
```

#### Comparison

| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression (Initial features) | 48.01% | 56.41% | 31.21% | 40.18% |
| Logistic Regression (Lag1, Lag2) | 55.95% | 58.24% | 75.18% | 65.63% |
| LDA | 55.95% | 58.24% | 75.18% | 65.63% |
| QDA | 59.92% | 59.9% | 85.82% | 70.55% |
| Naive Bayes | 59.52% | 59.61% | 85.82% | 70.35% |
| KNN (K = 3) | 53.17% | 59.6% | 60.99% | 59.31% |

* **Feature Selection Matters:** Focusing on `Lag1` and `Lag2` significantly boosted model performance. This highlights the importance of understanding your data and selecting relevant features.
* **QDA's Advantage:** QDA outperformed the other models, achieving the highest accuracy (59.92%) and F1-score (70.55%). Its ability to model more complex decision boundaries likely contributed to its success.
* **KNN's Shortcoming:** KNN ($K=3$) underperforms here, suggesting that the relationships within the data are better captured by the other models, particularly QDA and Naive Bayes, which have the highest recall and F1 scores.
* **Room for Improvement:** These results, while promising, indicate the potential for further gains through:
    * **Feature Engineering:** Exploring technical indicators and interactions between features.
    * **Alternative Models:** Considering tree-based methods or time series models.

## Reference

* Chap 4 in [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)
* [Supervised and Unsupervised Learning (an Intuitive Approach)](https://medium.com/@metehankozan/supervised-and-unsupervised-learning-an-intuitive-approach-cd8f8f64b644)
* [Evaluation Metrics for Classification](https://medium.com/@impythonprogrammer/evaluation-metrics-for-classification-fc770511052d)
* [Multinomial logistic regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)
* [Generative model](https://en.wikipedia.org/wiki/Generative_model)
* [Logistic Regression, LDA, QDA, and KNN](https://intro-stat-learning.github.io/ISLP/labs/Ch04-classification-lab.html#logistic-regression)