# Classification
## Overview
Classification in machine learning is a supervised learning technique where the objective is to develop a model to assign categorical labels (or classes) to data points. The model learns the patterns between features (attributes of the data) and their corresponding labels from a labeled training dataset. Then, the trained model is used to predict the class of unseen samples.
### Examples of Classification Applications
* **Email Spam Filtering:** Distinguishing between 'spam' and 'not spam' emails.
* **Medical Diagnosis:** Identifying the presence or absence of a disease based on medical information.
* **Customer Churn Prediction:** Predicting whether customers are likely to discontinue using a service.
* **Sentiment Analysis:** Categorizing text data (e.g., reviews, social posts) as positive, negative, or neutral.
* **Image Classification:** Assigning images to object categories (e.g., dog, cat, car, etc.).
### Core Concepts
* **Features:** The attributes or variables of a data point (e.g., color, size, word frequency).
* **Labels:** The target categories (e.g., spam/not spam, diseased/healthy).
* **Classes:** The possible categories a data point can belong to.
* **Decision Boundary:** The boundary (a line, curve, or hyperplane) the model learns in feature space to separate the different classes.
### Popular Classification Methods
* **Discriminative Models:**
* These models directly model the decision boundary between classes.
* Examples:
* Logistic Regression
* K-Nearest Neighbors (KNN)
* Support Vector Machines (SVM)
* **Generative Models:**
* These learn the distribution of features within each class and then use probability to classify new data points.
* Examples:
* Naive Bayes
* Linear Discriminant Analysis (LDA)
* Quadratic Discriminant Analysis (QDA)
### Evaluation Metrics
Choosing the right evaluation metrics is essential for understanding and optimizing your classifier's performance.
* **Confusion Matrix:** A table that summarizes your model's predictions:
* **True Positive (TP):** Correctly predicted positive cases
* **False Positive (FP):** Incorrectly predicted positive cases (Type I error)
* **True Negative (TN):** Correctly predicted negative cases
* **False Negative (FN):** Incorrectly predicted negative cases (Type II error)

* **Accuracy:** The proportion of correct predictions out of all predictions. $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$
* Accuracy can be misleading when dealing with imbalanced datasets (i.e., one class is much more frequent than the other).
* **Precision:** The fraction of true positive predictions out of all positive predictions made by the model. $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
    * Prioritizes minimizing false positives (e.g., email spam filtering, where flagging legitimate mail is costly).
* **Recall (Sensitivity):** The fraction of true positive predictions out of all actual positive cases. $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
* Prioritizes minimizing false negatives (e.g., medical diagnosis of a serious disease).
* **F1-Score:** The harmonic mean of precision and recall. $$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
* A useful metric when you seek a balance between precision and recall, especially for imbalanced classes.
* **F-Beta Score:** A weighted average of precision and recall, where the beta value controls the emphasis on one or the other. $$F_{\beta} = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}$$
* $\beta=1$ yields the F1 score, which equally balances precision and recall.
    * $\beta<1$ gives more weight to precision.
    * $\beta>1$ gives more weight to recall.
* **ROC Curve and AUC:**
    * **ROC Curve:** A graph plotting the True Positive Rate (Recall) against the False Positive Rate ($\frac{\text{FP}}{\text{FP}+\text{TN}}$) at various classification thresholds.
* Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
* A perfect classifier would have a ROC curve that touches the top-left corner (100% sensitivity, 100% specificity).
* **AUC (Area Under the ROC Curve):** Measures a classifier's ability to discriminate between classes. Higher AUC indicates better performance.
* A model whose predictions are 100% wrong has an AUC of 0.
* A model whose predictions are 100% correct has an AUC of 1.
### Example
```python=
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score, roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
beta1 = 0.5
beta2 = 1
beta3 = 1.5
f_beta1 = fbeta_score(y_true, y_pred, beta=beta1)
f_beta2 = fbeta_score(y_true, y_pred, beta=beta2)
f_beta3 = fbeta_score(y_true, y_pred, beta=beta3)
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1-Score: ", f1)
print(f"F-Beta Score (beta={beta1}): ", f_beta1)
print(f"F-Beta Score (beta={beta2}): ", f_beta2)
print(f"F-Beta Score (beta={beta3}): ", f_beta3)
# ROC curve (roc_curve expects continuous scores; with hard 0/1 predictions it has only a few
# points -- in practice, pass predicted probabilities, e.g. from predict_proba)
fpr, tpr, thresholds = roc_curve(y_true, y_pred)
auc_score = roc_auc_score(y_true, y_pred)
# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()
```
```
Accuracy: 0.625
Precision: 0.75
Recall: 0.6
F1-Score: 0.6666666666666666
F-Beta Score (beta=0.5): 0.7142857142857143
F-Beta Score (beta=1): 0.6666666666666666
F-Beta Score (beta=1.5): 0.639344262295082
```
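The scalar metrics above can be cross-checked against the raw confusion matrix. A short sketch using `sklearn.metrics.confusion_matrix` on the same `y_true`/`y_pred`; the expected counts in the comments follow from the definitions above:
```python=
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # expected: [[2 1], [2 3]] -> TN=2, FP=1, FN=2, TP=3

tn, fp, fn, tp = cm.ravel()
print("Accuracy:", (tp + tn) / (tp + tn + fp + fn))  # (3 + 2) / 8 = 0.625
print("Precision:", tp / (tp + fp))                  # 3 / 4 = 0.75
print("Recall:", tp / (tp + fn))                     # 3 / 5 = 0.6
```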

## Discriminative Models
Discriminative models excel at directly learning decision boundaries between classes to classify data points. They focus on modeling the probability of a data point belonging to a specific class, given its features. Let's dive into some of the most popular discriminative algorithms:
### Logistic Regression
* **Purpose:** Models the probability of a binary outcome $p(X) = \mathbb{P}(Y=1|X)$.
* **How it works:**
* **Sigmoid function:** Logistic regression employs the sigmoid function to model probabilities between $0$ and $1$: $$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$$

* **Logit function:** The logit function transforms probabilities into a linear expression, aiding interpretability: $$\text{logit}(p(X)) = \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
* **Maximum Likelihood Estimation:** Logistic regression uses this process to find the optimal coefficients ($\beta$ values) that maximize the likelihood of the observed data. $$l(\hat{\beta}_0, \dots, \hat{\beta}_p) = \prod_{i:y_i=1} p(x_i)\prod_{j:y_j=0} (1-p(x_j))$$
* **Key Advantages:**
* Interpretable coefficients indicate the effect of each feature on class probabilities.
* Provides class probability estimates, useful for ranking or uncertainty assessments.
#### Example: Predicting Credit Card Default
This example demonstrates how to use logistic regression to predict the probability of credit card default based on balance and student status.
```python=
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Set a random seed for reproducibility
np.random.seed(0)
# Generate a synthetic dataset
n_samples = 1000
balance = np.random.normal(1500, 800, n_samples) # Generate balance amounts
student = np.random.binomial(1, 0.2, n_samples) # Generate student status (1 for student, 0 for non-student)
# Generate binary outcome 'default'
# The coefficients (-4, 0.0025, 0.5) are chosen arbitrarily for the purpose of generating a synthetic dataset
logit = -4 + 0.0025 * balance + 0.5 * student # Calculate the logit (linear combination of predictors and coefficients)
prob_default = 1 / (1 + np.exp(-logit)) # Convert logit to probability using the logistic function
default = np.random.binomial(1, prob_default, n_samples) # Generate binary default outcome based on the probability
# Create a DataFrame
df = pd.DataFrame({
'balance': balance,
'student': student,
'default': default
})
print("Synthetic dataset:")
print(df.head()) # Print the first few rows of the DataFrame
# Fit logistic regression models
# sm.add_constant() is to add a column of ones to the features matrix, which acts as a constant term or intercept in the regression model.
X1 = sm.add_constant(df['balance']) # Model with only the balance feature
model_balance = sm.Logit(df['default'], X1)
result_balance = model_balance.fit(disp=0) # Fit the model and suppress the output
X2 = sm.add_constant(df['student']) # Model with only the student status feature
model_student = sm.Logit(df['default'], X2)
result_student = model_student.fit(disp=0)
X3 = sm.add_constant(df[['balance', 'student']]) # Model with both balance and student status features
model_both = sm.Logit(df['default'], X3)
result_both = model_both.fit(disp=0)
# Predict default probability for new observations
new_balance = 1000 # Set a new balance value
new_student_status = 1 # Set a new student status (1 for student)
prob_default_balance = float(result_balance.predict([1, new_balance]))
prob_default_student_yes = float(result_student.predict([1, new_student_status]))
prob_default_student_no = float(result_student.predict([1, 1 - new_student_status]))
prob_default_new = float(result_both.predict([1, new_balance, new_student_status]))
# Print model summaries and predicted probabilities
print("\nModel with only balance:")
print(result_balance.summary())
print(f"\nProbability of default for balance of $1000: {prob_default_balance:.2%}")
print("\nModel with only student status:")
print(result_student.summary())
print(f"\nProbability of default for a student: {prob_default_student_yes:.2%}")
print(f"Probability of default for a non-student: {prob_default_student_no:.2%}")
print("\nModel with both balance and student status:")
print(result_both.summary())
print(f"\nProbability of default for a student with balance of $1000: {prob_default_new:.2%}")
```
```
Synthetic dataset:
balance student default
0 2911.241877 1 0
1 1820.125767 0 1
2 2282.990387 1 1
3 3292.714559 1 1
4 2994.046392 0 1
Model with only balance:
Logit Regression Results
==============================================================================
Dep. Variable: default No. Observations: 1000
Model: Logit Df Residuals: 998
Method: MLE Df Model: 1
Date: Mon, 22 Apr 2024 Pseudo R-squ.: 0.3068
Time: 20:05:18 Log-Likelihood: -479.15
converged: True LL-Null: -691.22
Covariance Type: nonrobust LLR p-value: 3.046e-94
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -3.6285 0.245 -14.825 0.000 -4.108 -3.149
balance 0.0024 0.000 15.244 0.000 0.002 0.003
==============================================================================
Probability of default for balance of $1000: 21.84%
Model with only student status:
Logit Regression Results
==============================================================================
Dep. Variable: default No. Observations: 1000
Model: Logit Df Residuals: 998
Method: MLE Df Model: 1
Date: Mon, 22 Apr 2024 Pseudo R-squ.: 0.007691
Time: 20:05:18 Log-Likelihood: -685.91
converged: True LL-Null: -691.22
Covariance Type: nonrobust LLR p-value: 0.001111
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.2341 0.072 -3.249 0.001 -0.375 -0.093
student 0.5005 0.154 3.245 0.001 0.198 0.803
==============================================================================
Probability of default for a student: 56.62%
Probability of default for a non-student: 44.17%
Model with both balance and student status:
Logit Regression Results
==============================================================================
Dep. Variable: default No. Observations: 1000
Model: Logit Df Residuals: 997
Method: MLE Df Model: 2
Date: Mon, 22 Apr 2024 Pseudo R-squ.: 0.3161
Time: 20:05:18 Log-Likelihood: -472.74
converged: True LL-Null: -691.22
Covariance Type: nonrobust LLR p-value: 1.306e-95
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -3.8376 0.258 -14.856 0.000 -4.344 -3.331
balance 0.0024 0.000 15.199 0.000 0.002 0.003
student 0.6938 0.196 3.548 0.000 0.311 1.077
==============================================================================
Probability of default for a student with balance of $1000: 32.04%
```
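Because the coefficients live on the log-odds scale, exponentiating them gives multiplicative effects on the odds of default. A short sketch of this interpretation, assuming the fitted `result_both` object from the code above (the numeric values quoted in the comments come from the printed summary):
```python=
import numpy as np

# Odds ratios: exp(coefficient) is the multiplicative change in the odds of
# default per one-unit increase in the predictor, holding the others fixed
print(np.exp(result_both.params))

# Using the printed coefficients: each additional $100 of balance multiplies the
# odds of default by roughly exp(0.0024 * 100) ~ 1.27, and being a student
# multiplies the odds by roughly exp(0.6938) ~ 2.0, at a fixed balance.
print("Odds ratio per $100 of balance:", np.exp(0.0024 * 100))
print("Odds ratio for student vs non-student:", np.exp(0.6938))
```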
### Multinomial Logistic Regression
Multinomial logistic regression is a powerful extension of logistic regression, designed to handle classification problems where there are three or more possible classes ($K > 2$). It's a discriminative model, directly learning the decision boundaries that separate data points belonging to different categories.
* **Probabilistic Modeling:** Multinomial logistic regression calculates the probability of a data point belonging to each possible class. The class with the highest probability is assigned as the prediction: $$\mathbb{P}(Y = k|X = x) =
\begin{cases}
\frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{1 + \sum_{i=1}^{K-1}e^{\beta_{i0} + \beta_{i1}x_1 + \cdots + \beta_{ip}x_p}} &\text{ for } k = 1,...,K-1\\
\frac{1}{1 + \sum_{i=1}^{K-1}e^{\beta_{i0} + \beta_{i1}x_1 + \cdots + \beta_{ip}x_p}} &\text{ for } k = K
\end{cases}
$$
* **Baseline Class:** One class is selected as the reference point. Multinomial logistic regression effectively runs $K-1$ separate logistic regressions, each comparing one class against the chosen baseline.
* **Log Odds Ratios and Interpretation:** To compare classes directly (not just against the baseline), log odds ratios are calculated: $$\log \left( \frac{\mathbb{P}(Y = k | X = x)}{\mathbb{P}(Y = k' | X = x)} \right) = (\beta_{k0}-\beta_{k'0}) + (\beta_{k1}-\beta_{k'1})x_1 + \cdots + (\beta_{kp}-\beta_{k'p})x_p$$
* The coefficients $\beta$ provide valuable insights about the influence of features on the odds of a data point belonging to one class versus another.
#### Example $(K=3)$
This example demonstrates multinomial logistic regression for a 3-class problem.
```python=
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Generate data
def make_data(n_data, n_features, cov_matrices, shifts):
    # Sample each class from a multivariate normal with the given covariance matrix,
    # then shift it by the class mean
    rng = np.random.RandomState()
    X = np.concatenate([
        rng.multivariate_normal(mean=np.zeros(n_features), cov=cov, size=n_data) + shift
        for cov, shift in zip(cov_matrices, shifts)
    ])
    y = np.concatenate([
        np.full(n_data, i)
        for i in range(len(cov_matrices))
    ])
    return X, y
def grid(X): # feature X = [X_1,X_2]
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
np.arange(y_min, y_max, 0.01))
grid_points = np.c_[xx.ravel(), yy.ravel()]
return xx, yy, grid_points
def plot_data(X, y, title):
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=20)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(title)
# Define covariance matrices for each class
cov_class_1 = np.array([[1, 0.3], [0.3, 1]])
cov_class_2 = np.array([[0.8, 0], [0, 0.9]])
cov_class_3 = np.array([[1, 0.1], [0.1, 0.9]])
cov_matrices = [cov_class_1, cov_class_2, cov_class_3]
shifts = [[0, 0], [1, 3], [2, 2]]
# Generate data
X, y = make_data(n_data=20, n_features=2, cov_matrices=cov_matrices, shifts=shifts)
# Create grid to plot decision boundaries
xx, yy, grid_points = grid(X)
# Fit Multinomial Logistic Regression
log_reg_clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
log_reg_clf.fit(X, y)
# Predict on grid points
Z_log_reg = log_reg_clf.predict(grid_points)
Z_log_reg = Z_log_reg.reshape(xx.shape)
# Plot decision boundaries and data
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z_log_reg, alpha=0.4, cmap='tab10')
plt.contour(xx, yy, Z_log_reg, linestyles=['--'], colors='black')
plot_data(X,y,title="Multinomial Logistic Regression Decision Boundaries")
plt.show()
```

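The fitted model's `coef_` and `intercept_` are the coefficients of the linear scores inside the formula above (scikit-learn uses a symmetric softmax parameterization rather than singling out a baseline class, but the resulting probabilities are equivalent). A small sketch, assuming the fitted `log_reg_clf` and the data `X` from the code above, verifying that `predict_proba` is the softmax of those scores:
```python=
import numpy as np

# Linear scores for each class, then softmax to turn them into class probabilities
scores = X @ log_reg_clf.coef_.T + log_reg_clf.intercept_            # shape (n_samples, 3)
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))      # subtract max for numerical stability
manual_proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Should match the library's own probability estimates
print(np.allclose(manual_proba, log_reg_clf.predict_proba(X)))       # expected: True
```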
### K-Nearest Neighbors (KNN)
* **Purpose:** Classifies unlabeled data points based on the majority class of their $K$ closest neighbors in the training set.
* **How it works:**
1. **Choose a value for $K$:** This hyperparameter represents the number of neighbors to consider.
2. **Calculate distances:** Find the $K$ nearest neighbors in the training set to a new data point. Euclidean distance is common, but other distance metrics can be used.
3. **Majority vote:** Classify the new data point according to the most common class among its $K$ nearest neighbors.
* **Key Considerations:**
* **Data scaling:** Standardizing or normalizing features is often necessary to prevent features with larger ranges from dominating distance calculations.
    * **Choosing $K$:** An overly small $K$ leads to overfitting (high variance); an excessively large $K$ leads to underfitting (high bias). Cross-validation helps find a good $K$, as sketched below.
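A minimal sketch of choosing $K$ by cross-validation, using `GridSearchCV` on a made-up synthetic dataset (the scaling step reflects the data-scaling note above):
```python=
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class dataset (a stand-in for real data)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Scale features, then search over K with 5-fold cross-validation
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
param_grid = {'knn__n_neighbors': list(range(1, 21))}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print("Best K:", search.best_params_['knn__n_neighbors'])
print("Cross-validated accuracy:", search.best_score_)
```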
#### Example: Comparing Different $K$ Values
This example demonstrates how the choice of K affects the decision boundaries of a KNN classifier.
```python=
import numpy as np
from sklearn.neighbors import KNeighborsClassifier as KNN
import matplotlib.pyplot as plt
# Generate data (reuses the make_data and grid helpers defined in the multinomial logistic regression example)
cov_class_1 = np.array([[1, 0.3], [0.3, 1]])
cov_class_2 = np.array([[0.8, 0], [0, 0.9]])
cov_matrices = [cov_class_1, cov_class_2]
shifts = [[0, 0], [1, 3]]
X, y = make_data(n_data=20, n_features=2, cov_matrices=cov_matrices, shifts=shifts)
# Set up subplots
fig, axes = plt.subplots(1, 3, figsize=(24, 8))
# Fit KNN for different K values and plot
n_neighbors = [2, 5, 10]
knn_clfs = [KNN(n_neighbors=n) for n in n_neighbors]
xx, yy, grid_points = grid(X)
for ax, knn_clf, n_neighbor in zip(axes.flatten(), knn_clfs, n_neighbors):
knn_clf.fit(X, y)
Z_knn = knn_clf.predict(grid_points).reshape(xx.shape)
ax.contourf(xx, yy, Z_knn, alpha=0.4, cmap='tab10')
ax.contour(xx, yy, Z_knn, levels=[0.5], linestyles=['--'], colors='black')
ax.set_title(f"KNN Classifier (K={n_neighbor})", fontsize=20)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=50)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
plt.tight_layout()
plt.show()
```

## Generative Models
Generative models offer a powerful way to model the underlying probability distributions that give rise to your data. Unlike discriminative models, which directly learn decision boundaries, generative models focus on understanding the joint probability distribution, $\mathbb{P}(X, Y)$ of features $X$ and class labels $Y$. This provides several advantages:
* **Data Generation:** Generate new, synthetic samples that resemble your training data (useful for simulations or augmenting datasets).
* **Uncertainty Quantification:** Make more informed predictions along with estimates of uncertainty.
* **Semi-Supervised Learning:** Leverage generative models effectively even when there's a large amount of unlabeled data.
### Key Concepts
* **Joint Probability Modeling:** Generative models strive to learn the joint distribution $\mathbb{P}(X, Y)$.
* **Bayes' Theorem for Classification:** Once you have the joint distribution, you can use Bayes' Theorem for classification: $$\mathbb{P}(Y=k | X=x) = \frac{\mathbb{P}(X=x | Y=k) \cdot \mathbb{P}(Y=k)}{\mathbb{P}(X=x)}$$
* **Bayes Classifier:** Explicitly utilizes these probabilities to assign the most probable class to new data points: $$\mathbb{P}(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum_{i=1}^K\pi_if_i(x)}$$ where:
* $\pi_k$ is the prior probability of class $k$.
* $f_k(X)$ is the class-conditional density $\mathbb{P}(X|Y=k)$.
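A tiny numeric sketch of the Bayes classifier formula, with made-up priors and class-conditional density values at a single point $x$:
```python=
import numpy as np

# Made-up quantities for a two-class problem at a single point x
priors = np.array([0.3, 0.7])      # pi_1, pi_2
densities = np.array([0.5, 0.2])   # f_1(x), f_2(x)

# Posterior P(Y=k | X=x) = pi_k f_k(x) / sum_i pi_i f_i(x)
posteriors = priors * densities / np.sum(priors * densities)
print(posteriors)                                     # approx [0.517, 0.483]
print("Predicted class:", np.argmax(posteriors) + 1)  # class 1 wins despite its smaller prior
```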
### Linear Discriminant Analysis
*Linear Discriminant Analysis (LDA)* is a classic classification method that excels when the features within each class follow a Gaussian distribution with a shared covariance matrix. It builds linear decision boundaries to separate the classes.
#### Key Assumptions:
* **Gaussian Distributed Features (within each class):** Each class is modeled as a multivariate Gaussian distribution.
* **Homoscedasticity:** All classes share the same covariance matrix.
#### Understanding LDA
1. **Class-Conditional Densities:** The Gaussian probability density function (p.d.f.) for class $k$ is: $$f_k(x)=\frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp(-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k))$$ where:
* $x$: A data point (a vector of features).
* $\mu_k$: The mean vector of class $k$.
* $\Sigma$: The shared covariance matrix.
2. **Discriminant Functions:** LDA uses Bayes' Theorem and these densities to derive discriminant functions for each class: $$\delta_k(x)= x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$$ where $\pi_k$ is the prior probability of class $k$.
3. **Decision Rule:** LDA assigns a new data point $x$ to the class with the highest discriminant function value: $$\arg\max_k \delta_k(x)$$
#### Parameter Estimation
* **Mean Vector $\mu_k$:** $$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:y_i=k} x_i $$
* **Shared Covariance Matrix $\Sigma$:** $$\hat{\Sigma}=\frac{1}{n-K} \sum_{k=1}^K \sum_{i:y_i=k} (x_i-\hat{\mu}_k) (x_i-\hat{\mu}_k)^T$$
* **Prior Probabilities $\pi_k$:** $$\hat{\pi}_k = \frac{n_k}{n}$$
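A sketch that computes these plug-in estimates by hand and feeds them into the discriminant functions, then checks the resulting predictions against scikit-learn's LDA. It assumes the `make_data` helper defined in the earlier examples; the covariance matrix and class means are made up:
```python=
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Two Gaussian classes with a shared covariance matrix (reuses make_data from above)
cov = np.array([[1, 0.3], [0.3, 1]])
X, y = make_data(n_data=50, n_features=2, cov_matrices=[cov, cov], shifts=[[0, 0], [2, 2]])
n, K = len(y), 2

# Plug-in estimates: priors, class means, pooled covariance
pis = np.array([np.mean(y == k) for k in range(K)])
mus = np.array([X[y == k].mean(axis=0) for k in range(K)])
Sigma = sum((X[y == k] - mus[k]).T @ (X[y == k] - mus[k]) for k in range(K)) / (n - K)
Sigma_inv = np.linalg.inv(Sigma)

# Discriminant functions delta_k(x) and the resulting class assignments
def delta(x, k):
    return x @ Sigma_inv @ mus[k] - 0.5 * mus[k] @ Sigma_inv @ mus[k] + np.log(pis[k])

manual_pred = np.array([np.argmax([delta(x, k) for k in range(K)]) for x in X])

# Compare with scikit-learn's LDA predictions
sk_pred = LDA().fit(X, y).predict(X)
print("Agreement with sklearn LDA:", np.mean(manual_pred == sk_pred))
```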
#### LDA Decision Boundaries
The decision boundary between two classes is the set of points where their discriminant functions are equal. For LDA, these boundaries are *linear (or hyperplanes in higher dimensions)*.
1. **Univariate Case $(p=1)$:** In this case, the *Bayes Decision Boundary* is the point $x$ such that $\delta_1(x)=\delta_2(x)$.
* For equal priors ($\pi_1=\pi_2=0.5$), it simplifies to $x=\frac{\mu_1+\mu_2}{2}$.
```python=
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Example 1: Bayes Decision Boundary
mu1, mu2 = -1, 1 # means
sigma1, sigma2 = 1, 1 # standard deviations
x = np.linspace(-5, 5, 100)
y1 = norm.pdf(x, mu1, sigma1)
y2 = norm.pdf(x, mu2, sigma2)
plt.figure(figsize=(8, 6))
plt.plot(x, y1, color='green', label="N(-1,1)")
plt.plot(x, y2, color='magenta', label="N(1,1)")
plt.axvline(x=0, color='black', linestyle='-', label="Bayes Decision Boundary")
plt.title("Bayes Decision Boundary")
plt.legend()
plt.show()
# Example 2: LDA vs Bayes Decision Boundary (reuses the make_data helper defined earlier)
cov_matrices=[[[1]],[[1]]]
shifts = [-1, 1]
X, y = make_data(n_data=20, n_features=1, cov_matrices=cov_matrices, shifts=shifts)
# Bayes decision boundary
bayes_decision_boundary = np.mean(shifts)
# Fit LDA
clf = LDA()
clf.fit(X, y)
lda_decision_boundary = float(-clf.intercept_[0] / clf.coef_[0][0]) # the point where the LDA decision function is zero
# Plot
plt.figure(figsize=(10, 6))
plt.hist(X[:20], bins=10, color='green', alpha=0.5, label='N(-1,1)')
plt.hist(X[20:], bins=10, color='magenta', alpha=0.5, label='N(1,1)')
plt.axvline(bayes_decision_boundary, color='black', linestyle='-', label='Bayes')
plt.axvline(lda_decision_boundary, color='black', linestyle='--', label='LDA')
plt.legend()
plt.xlabel('X')
plt.ylabel('Frequency')
plt.title('LDA vs Bayes Decision Boundary')
plt.show()
```


2. **Multivariate Case $(p>1)$:** In this case, the *Bayes decision boundary* is $$\{x \in \mathbb{R}^p| \delta_i(x)=\delta_j(x) \text{ for some } i\neq j\}$$
```python=
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import matplotlib.pyplot as plt
# Define covariance matrix and mean shifts for each class
cov_matrix = np.array([[1, 0.5], [0.5, 1]])
cov_matrices = [cov_matrix] * 3
shifts = [[0, 0], [3, 4], [4, 3]]
# Generate data (reuses the make_data, grid, and plot_data helpers defined earlier)
X, y = make_data(n_data=20, n_features=2, cov_matrices=cov_matrices, shifts=shifts)
# Fit LDA
clf = LDA()
clf.fit(X, y)
# Create grid for plotting decision boundaries
xx, yy, grid_points = grid(X)
# LDA decision boundary
Z_lda = clf.predict(grid_points)
Z_lda = Z_lda.reshape(xx.shape)
# Bayes decision boundary
pdfs = [multivariate_normal(mean=shift, cov=cov_matrix).pdf(grid_points) for shift in shifts]
Z_bayes = np.argmax(pdfs, axis=0)
Z_bayes = Z_bayes.reshape(xx.shape)
# Plot
plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z_lda, alpha=0.4, cmap='tab10')
plt.contour(xx, yy, Z_lda, levels=[0.5, 1.5], linestyles=['--'], colors='black')
plt.contour(xx, yy, Z_bayes, levels=[0.5, 1.5], linestyles=['-'], colors='black')
plot_data(X, y, title="LDA(--) vs Bayes(-) Decision Boundaries")
plt.show()
```

### Quadratic Discriminant Analysis
*Quadratic Discriminant Analysis (QDA)* is a generative classification method that offers more flexibility than Linear Discriminant Analysis (LDA). Let's break down how it works:
#### Relaxing LDA's Assumption:
* **LDA:** Assumes all classes share the same covariance matrix.
* **QDA:** Relaxes this assumption, allowing each class to have its own covariance matrix $\Sigma_k$: $$f_k(x)=\frac{1}{(2\pi)^{p/2} \det(\Sigma_k)^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\right)$$
#### Consequences
* **Decision Boundaries:** QDA's decision boundaries become *quadratic* (as opposed to linear in LDA). This can better model problems where class distributions have distinct shapes.
* **Parameter Estimation:** QDA requires estimating more parameters (a separate covariance matrix for each class).
#### Quadratic Discriminant Function
Similar to LDA, QDA employs discriminant functions for classification:
$$
\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k -\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log(\det \Sigma_k )+\log(\pi_k)
$$

In a plot comparing the three decision boundaries:
* **Red dashed line (QDA):** Represents the decision boundary modeled by QDA. Its quadratic shape captures the differences in class distributions.
* **Black dashed line (LDA):** Shows the decision boundary produced by LDA, which is linear due to LDA's shared covariance assumption.
* **Black line (Bayes):** Represents the optimal (Bayes) decision boundary based on the true underlying distributions.
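The figure these bullets describe is not embedded here. A minimal sketch that produces a comparable plot, assuming the `make_data`, `grid`, and `plot_data` helpers defined in the earlier examples and two made-up Gaussian classes with different covariance matrices:
```python=
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA
import matplotlib.pyplot as plt

# Two classes with clearly different covariance structures
cov_matrices = [np.array([[1.0, 0.0], [0.0, 1.0]]),
                np.array([[2.0, 0.8], [0.8, 0.6]])]
shifts = [[0, 0], [2, 2]]
X, y = make_data(n_data=50, n_features=2, cov_matrices=cov_matrices, shifts=shifts)

xx, yy, grid_points = grid(X)

# Fit QDA and LDA
qda_clf = QDA().fit(X, y)
lda_clf = LDA().fit(X, y)
Z_qda = qda_clf.predict(grid_points).reshape(xx.shape)
Z_lda = lda_clf.predict(grid_points).reshape(xx.shape)

# Bayes decision boundary from the true class-conditional densities (equal priors)
pdfs = [multivariate_normal(mean=shift, cov=cov).pdf(grid_points)
        for shift, cov in zip(shifts, cov_matrices)]
Z_bayes = np.argmax(pdfs, axis=0).reshape(xx.shape)

# Plot: QDA (red dashed), LDA (black dashed), Bayes (solid black)
plt.figure(figsize=(8, 8))
plt.contourf(xx, yy, Z_qda, alpha=0.4, cmap='tab10')
plt.contour(xx, yy, Z_qda, levels=[0.5], linestyles=['--'], colors='red')
plt.contour(xx, yy, Z_lda, levels=[0.5], linestyles=['--'], colors='black')
plt.contour(xx, yy, Z_bayes, levels=[0.5], linestyles=['-'], colors='black')
plot_data(X, y, title="QDA(red --) vs LDA(black --) vs Bayes(-) Decision Boundaries")
plt.show()
```
Because the two classes have different spreads, QDA's boundary can bend around the tighter class, while LDA is restricted to a straight line.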
### Naive Bayes
*Naive Bayes (NB)* is a flexible and often surprisingly effective generative classifier. It's built upon the idea of conditional independence between features, which simplifies computations.
* **Assumption:** Within each class, the predictors are conditionally independent: $$f_k(x)=f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)$$ where $f_{kj}$ is the density of the $j$-th predictor in class $k$.
* **Modeling Feature Distributions:**
* **Continuous Features:** Often modeled with Gaussian distributions (Normal distributions). You estimate the mean $\mu_{jk}$ and variance $\sigma_{jk}^2$ for each feature $j$ within each class $k$. Alternatively, non-parametric methods like Kernel Density Estimation (KDE) can be used.
    * **Categorical Features:** Class-conditional probabilities are estimated directly from the frequencies of each feature value within each class (see the sketch after this list).
* **Bayes' Theorem for Classification:** $$\mathbb{P}(Y=k|X=x)=\frac{\pi_k \times f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)}{\sum_{i=1}^K \pi_i \times f_{i1}(x_1) \times f_{i2}(x_2) \times \cdots \times f_{ip}(x_p)}$$
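For the categorical case, a minimal sketch with made-up integer-encoded features, using scikit-learn's `CategoricalNB`, which estimates the class-conditional frequencies described above (with Laplace smoothing):
```python=
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Made-up dataset: two categorical features, already integer-encoded
# Feature 0: weather (0=sunny, 1=rainy); Feature 1: day type (0=weekday, 1=weekend)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # e.g., 1 = "go outside", 0 = "stay in"

# CategoricalNB estimates P(feature value | class) from frequencies, with Laplace smoothing
clf = CategoricalNB(alpha=1.0)
clf.fit(X, y)

print(clf.predict([[0, 1]]))        # predicted class for a sunny weekend
print(clf.predict_proba([[0, 1]]))  # posterior probabilities for each class
```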
#### Example
Here's an example comparing Naive Bayes against the Bayes classifier when the true class covariance matrix is diagonal versus non-diagonal:
```python=
import numpy as np
from sklearn.naive_bayes import GaussianNB as NB
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
# Diagonal Covariance Matrix
cov_matrix = np.array([[1, 0], [0, 1.5]]) # Create a diagonal covariance matrix
cov_class_1, cov_class_2, cov_class_3 = [cov_matrix] * 3 # Use the same covariance matrix for all classes
shifts = [[0, 0], [1, 2], [3, 1]]
# Generate data using the diagonal covariance matrices and mean shifts
X, y = make_data(n_data=30, n_features=2, cov_matrices=[cov_class_1, cov_class_2, cov_class_3], shifts=shifts)
# Fit Naive Bayes classifier
nb_clf = NB()
nb_clf.fit(X, y)
# Create a grid of points for plotting decision boundaries
xx, yy, grid_points = grid(X)
# Predict classes for the grid points using Naive Bayes
Z_nb = nb_clf.predict(grid_points)
Z_nb = Z_nb.reshape(xx.shape)
# Calculate the true posterior probabilities for the grid points
pdfs = [multivariate_normal(mean=shift, cov=cov_matrix).pdf(grid_points) for shift in shifts]
pdfs = np.array(pdfs)
# Assign each point to the class with the highest probability (Bayes classifier)
Z_bayes = np.argmax(pdfs, axis=0)
Z_bayes = Z_bayes.reshape(xx.shape)
# Plot the decision boundaries and data points
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.contourf(xx, yy, Z_nb, alpha=0.4, cmap='tab10')
ax1.contour(xx, yy, Z_nb, levels=[0.5, 1.5], linestyles=['--'], colors='black')
ax1.contour(xx, yy, Z_bayes, levels=[0.5, 1.5], linestyles=['-'], colors='black')
ax1.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=20)
ax1.set_title('Naive Bayes vs Bayes Classifier\nDiagonal Covariance')
# Non-Diagonal Covariance Matrix
cov_class_1 = np.array([[1, 0.5], [0.5, 1.5]])
cov_class_2 = np.array([[1, 0.5], [0.5, 1.5]])
cov_class_3 = np.array([[1, 0.5], [0.5, 1.5]])
cov_matrices = [cov_class_1, cov_class_2, cov_class_3] # Use the same non-diagonal covariance matrix for all classes
shifts = [[0, 0], [1, 2], [3, 1]]
# Generate data using the non-diagonal covariance matrices and mean shifts
X, y = make_data(n_data=30, n_features=2, cov_matrices=cov_matrices, shifts=shifts)
# Fit Naive Bayes classifier
nb_clf = NB()
nb_clf.fit(X, y)
# Create a grid of points for plotting decision boundaries
xx, yy, grid_points = grid(X)
# Predict classes for the grid points using Naive Bayes
Z_nb = nb_clf.predict(grid_points)
Z_nb = Z_nb.reshape(xx.shape)
# Calculate the true class-conditional densities for the grid points (using the non-diagonal covariances)
pdfs = [multivariate_normal(mean=shift, cov=cov).pdf(grid_points) for shift, cov in zip(shifts, cov_matrices)]
pdfs = np.array(pdfs)
# Assign each point to the class with the highest probability (Bayes classifier)
Z_bayes = np.argmax(pdfs, axis=0)
Z_bayes = Z_bayes.reshape(xx.shape)
# Plot the decision boundaries and data points
ax2.contourf(xx, yy, Z_nb, alpha=0.4, cmap='tab10')
ax2.contour(xx, yy, Z_nb, levels=[0.5, 1.5], linestyles=['--'], colors='black')
ax2.contour(xx, yy, Z_bayes, levels=[0.5, 1.5], linestyles=['-'], colors='black')
ax2.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', s=20)
ax2.set_title('Naive Bayes vs Bayes Classifier\nNon-Diagonal Covariance')
plt.tight_layout()
plt.show()
```

## Comparing Classification Methods
Choosing the right classification method depends on your data's characteristics and modeling goals. Here's a breakdown of key methods and how they compare:
| Method | Key Assumptions | Decision Boundary |
|---|---|---|
| Logistic Regression | None about predictor distribution | Linear |
| KNN | None about predictor distribution | Varies (more flexible) |
| LDA | Multivariate normality, equal class covariance matrices | Linear |
| QDA | Multivariate normality, class-specific covariances | Quadratic |
| Naive Bayes | Conditional feature independence given the class label | Varies (more flexible) |
#### Guidance for Choosing a Model
* **Start Simple, Iterate:** Begin with logistic regression or LDA due to their simplicity. If performance needs improvement, consider more complex models.
* **Check Assumptions:**
* Assess the normality of your features. LDA/QDA are better suited for Gaussian-like data.
* Consider interactions between features. Strong dependencies might violate Naive Bayes' independence assumption.
* **Data Size and Complexity:**
* **Small Datasets:** LDA often performs better than QDA to avoid overfitting.
* **Large Datasets:** QDA's flexibility becomes advantageous.
* **Many Features:** Naive Bayes might excel due to fewer parameters to estimate.
#### Additional Insights
* **Trade-offs:** More flexible models (QDA, Naive Bayes) have higher variance and can overfit. Simpler models (LDA, Logistic Regression) tend to have higher bias but might be more robust.
* **Hybrid Approaches:** It's possible to combine assumptions. For example, you could use Naive Bayes but relax the independence assumption for certain strongly correlated features.
* **No One-Size-Fits-All:** The best method is often determined empirically through evaluation and cross-validation.
### Example: Direction Prediction of Stock Market Data
In this example, we explore the `Smarket` stock market dataset from ISLP. The objective is to **predict the direction of the stock market** ('Up' or 'Down') from predictors that include lagged daily returns (`Lag1`–`Lag5`) and trading volume (`Volume`).
#### Logistic Regression
```python=
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec as MS, summarize
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import numpy as np
import statsmodels.api as sm
# Load the data
Smarket = load_data('Smarket')
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
# Split data into train and test sets
train = (Smarket.Year < 2005)
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
# Transform train and test sets to label("Up" and "Down") sets
L_train = Smarket.Direction.loc[train]
L_test = Smarket.Direction.loc[~train]
# Logistic Regression
glm_train = sm.GLM(y_train, X_train, family=sm.families.Binomial()) # The default link function is logit.
results = glm_train.fit()
probs = results.predict(exog=X_test)
# Create a confusion matrix and evaluate metrics
labels = np.array(['Down'] * 252) # 252 test observations (the trading days in 2005)
labels[probs > 0.5] = 'Up'
print("Logistic Regression (Initial features):")
print(confusion_table(labels, L_test))
print(f"\nTest Accuracy = {np.mean(labels == L_test):.4f}")
print(f"Test Error = {np.mean(labels != L_test):.4f}")
print(f"Precision = {precision_score(L_test, labels, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, labels, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, labels, pos_label='Up'):.4f}")
print("\nModel summary:")
print(summarize(results))
print("\n")
```
```
Logistic Regression (Initial features):
Truth Down Up
Predicted
Down 77 97
Up 34 44
Test Accuracy = 0.4802
Test Error = 0.5198
Precision = 0.5641
Recall = 0.3121
F1 score = 0.4018
Model summary:
coef std err z P>|z|
intercept 0.1912 0.334 0.573 0.567
Lag1 -0.0542 0.052 -1.046 0.295
Lag2 -0.0458 0.052 -0.884 0.377
Lag3 0.0072 0.052 0.139 0.889
Lag4 0.0064 0.052 0.125 0.901
Lag5 -0.0042 0.051 -0.083 0.934
Volume -0.1163 0.240 -0.485 0.628
```
* None of the predictor variables (`Lag1` to `Lag5`, `Volume`) are statistically significant at commonly used significance levels (e.g., 0.05, 0.01), as indicated by p-values that are all well above these thresholds.
* In this case, since `Lag1` and `Lag2` have the smallest p-values, they may be the most promising predictors among these features. However, remember that while these features may improve the model, the lack of statistical significance suggests that any improvements might be marginal.
#### Improved Logistic Regression
```python=+
# Feature selection and improved Logistic Regression
model = MS(['Lag1', 'Lag2']).fit(Smarket)
X = model.transform(Smarket)
X_train, X_test = X.loc[train], X.loc[~train]
glm_train = sm.GLM(y_train, X_train, family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
labels = np.array(['Down'] * 252)
labels[probs > 0.5] = 'Up'
print("Logistic Regression (Lag1, Lag2):")
print(confusion_table(labels, L_test))
print(f"\nTest Accuracy = {np.mean(labels == L_test):.4f}")
print(f"Test Error = {np.mean(labels != L_test):.4f}")
print(f"Precision = {precision_score(L_test, labels, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, labels, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, labels, pos_label='Up'):.4f}")
print("\n")
```
```
Logistic Regression (Lag1, Lag2):
Truth Down Up
Predicted
Down 35 35
Up 76 106
Test Accuracy = 0.5595
Test Error = 0.4405
Precision = 0.5824
Recall = 0.7518
F1 score = 0.6563
```
* Accuracy : $48\% \Rightarrow 56\%$
* Precision : $56.4\% \Rightarrow 58.2\%$
#### LDA
```python=+
lda = LDA(store_covariance=True)
X_train, X_test = [M.drop(columns=['intercept'])
for M in [X_train, X_test]]
lda.fit(X_train, L_train)
lda_pred = lda.predict(X_test)
print("LDA:")
print(confusion_table(lda_pred, L_test))
print(f"\nTest Accuracy = {np.mean(lda_pred == L_test):.4f}")
print(f"Test Error = {np.mean(lda_pred != L_test):.4f}")
print(f"Precision = {precision_score(L_test, lda_pred, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, lda_pred, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, lda_pred, pos_label='Up'):.4f}")
print("\n")
```
```
LDA:
Truth Down Up
Predicted
Down 35 35
Up 76 106
Test Accuracy = 0.5595
Test Error = 0.4405
Precision = 0.5824
Recall = 0.7518
F1 score = 0.6563
```
#### QDA
```python=+
qda = QDA(store_covariance=True)
qda.fit(X_train, L_train)
qda_pred = qda.predict(X_test)
print("QDA:")
print(confusion_table(qda_pred, L_test))
print(f"\nTest Accuracy = {np.mean(qda_pred == L_test):.4f}")
print(f"Test Error = {np.mean(qda_pred != L_test):.4f}")
print(f"Precision = {precision_score(L_test, qda_pred, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, qda_pred, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, qda_pred, pos_label='Up'):.4f}")
print("\n")
```
```
QDA:
Truth Down Up
Predicted
Down 30 20
Up 81 121
Test Accuracy = 0.5992
Test Error = 0.4008
Precision = 0.5990
Recall = 0.8582
F1 score = 0.7055
```
#### NB
```python=+
nb = GaussianNB()
nb.fit(X_train, L_train)
nb_labels = nb.predict(X_test)
print("Naive Bayes:")
print(confusion_table(nb_labels, L_test))
print(f"\nTest Accuracy = {np.mean(nb_labels == L_test):.4f}")
print(f"Test Error = {np.mean(nb_labels != L_test):.4f}")
print(f"Precision = {precision_score(L_test, nb_labels, pos_label='Up'):.4f}")
print(f"Recall = {recall_score(L_test, nb_labels, pos_label='Up'):.4f}")
print(f"F1 score = {f1_score(L_test, nb_labels, pos_label='Up'):.4f}")
print("\n")
```
```
Naive Bayes:
Truth Down Up
Predicted
Down 29 20
Up 82 121
Test Accuracy = 0.5952
Test Error = 0.4048
Precision = 0.5961
Recall = 0.8582
F1 score = 0.7035
```
#### KNN
```python=+
# KNN
best_accuracy, best_precision, best_recall, best_f1 = 0, 0, 0, 0
best_K_acc, best_K_pre, best_K_rec, best_K_f1 = 0, 0, 0, 0
for K in range(1, 20):
knn = KNN(n_neighbors=K)
knn_pred = knn.fit(X_train, L_train).predict(X_test)
accuracy = accuracy_score(L_test, knn_pred)
precision = precision_score(L_test, knn_pred, pos_label='Up')
recall = recall_score(L_test, knn_pred, pos_label='Up')
f1 = f1_score(L_test, knn_pred, pos_label='Up')
if accuracy > best_accuracy:
best_accuracy = accuracy
best_K_acc = K
if precision > best_precision:
best_precision = precision
best_K_pre = K
if recall > best_recall:
best_recall = recall
best_K_rec = K
if f1 > best_f1:
best_f1 = f1
best_K_f1 = K
print("KNN:")
print(f"Best Accuracy: {best_accuracy:.4f} at K = {best_K_acc}")
print(f"Best Precision: {best_precision:.4f} at K = {best_K_pre}")
print(f"Best Recall: {best_recall:.4f} at K = {best_K_rec}")
print(f"Best F1 Score: {best_f1:.4f} at K = {best_K_f1}")
```
```
KNN:
Best Accuracy: 0.5317 at K = 3
Best Precision: 0.5960 at K = 4
Best Recall: 0.6099 at K = 3
Best F1 Score: 0.5931 at K = 3
```
#### Comparison
| Model | Accuracy | Precision | Recall | F1 Score |
|------------------|----------|-----------|--------|----------|
| Logistic Regression (Initial features) | 48.02% | 56.41% | 31.21% | 40.18% |
| Logistic Regression (Lag1, Lag2) | 55.95% | 58.24% | 75.18% | 65.63% |
| LDA | 55.95% | 58.24% | 75.18% | 65.63% |
| QDA | 59.92% | 59.90% | 85.82% | 70.55% |
| Naive Bayes | 59.52% | 59.61% | 85.82% | 70.35% |
| KNN (K = 3) | 53.17% | 59.6% | 60.99% | 59.31% |
* **Feature Selection Matters:** Focusing on `Lag1` and `Lag2` significantly boosted model performance. This highlights the importance of understanding your data and selecting relevant features.
* **QDA's Advantage:** QDA outperformed other models, achieving the highest accuracy (59.92%) and F1-score (70.55%). Its ability to model more complex decision boundaries likely contributed to its success.
* **KNN's Shortcoming:** KNN ($K=3$) performs noticeably worse here, suggesting that the relationships in the data are better captured by the other models, particularly QDA and Naive Bayes, which achieve the highest recall and F1 scores.
* **Room for Improvement:** These results, while promising, indicate the potential for further gains through:
* **Feature Engineering:** Exploring technical indicators and interactions between features.
* **Alternative Models:** Considering tree-based methods or time series models.
## Reference
* Chapter 4 of [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)
* [Supervised and Unsupervised Learning (an Intuitive Approach)](https://medium.com/@metehankozan/supervised-and-unsupervised-learning-an-intuitive-approach-cd8f8f64b644)
* [Evaluation Metrics for Classification](https://medium.com/@impythonprogrammer/evaluation-metrics-for-classification-fc770511052d)
* [Multinomial Logistic Regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)
* [Generative Models](https://en.wikipedia.org/wiki/Generative_model)
* [Logistic Regression, LDA, QDA, and KNN](https://intro-stat-learning.github.io/ISLP/labs/Ch04-classification-lab.html#logistic-regression)