# Introduction to Machine Learning (TI508M)
## I. Introduction to Machine Learning
### 1. What is Machine Learning
---
**What is Machine Learning?**
Machine Learning (ML) is a field of artificial intelligence (AI) that enables computers to learn and make predictions or decisions without being explicitly programmed. It relies on statistical models and algorithms that learn patterns from data.
- **Arthur Samuel (1959):** "Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed."
- **Tom Mitchell (1997):** "A computer program learns from experience (E) with respect to a task (T) and performance measure (P) if its performance on T, as measured by P, improves with experience E."
**Example Applications:**
- Spam email classification.
- Predicting customer purchase behavior.
- Image recognition (e.g., identifying cats in photos).
- Autonomous driving systems.
**Why Use Machine Learning?**
ML is particularly useful when:
- **Data is abundant:** Large datasets are available for training.
- **Explicit rules are unavailable:** For tasks like image recognition, manually coding rules is infeasible.
- **Dynamic environments:** The system must adapt to new data (e.g., fraud detection).
### 2. Types of Machine Learning
---
#### 2.1 Supervised Learning
Supervised learning involves training a model on labeled data, where each input is paired with a corresponding output.
- **Goal:** Predict future labels for unseen data.
- **Subtypes:**
- **Classification:** Predict categorical labels (e.g., spam vs. not spam).
- **Regression:** Predict continuous values (e.g., house prices).
**Example:**
- **Breast Cancer Detection:**
- Features: Tumor size, shape.
- Labels: Benign or malignant.
- **CO2 Emissions Prediction:**
- Features: Engine size, fuel consumption.
- Label: CO2 emissions.
**Key Steps:**
1. Collect labeled data.
2. Define a cost function to measure errors.
3. Use an algorithm to minimize the cost function.
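The three steps above can be sketched in a few lines. This is a minimal illustration, assuming invented toy data and a mean-squared-error cost for a one-parameter model $\hat{y} = \theta x$:

```python
import numpy as np

# Hypothetical labeled data (step 1); the true relationship is y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

def mse_cost(theta):
    """Step 2: cost function measuring prediction errors of y_hat = theta * x."""
    y_hat = theta * X
    return np.mean((y - y_hat) ** 2)

# Step 3 is to minimize this cost; here we just compare two candidate values
print(mse_cost(1.0))  # 7.5 (poor fit)
print(mse_cost(2.0))  # 0.0 (perfect fit)
```

An optimization algorithm such as gradient descent would search for the value of `theta` that makes this cost smallest.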
#### 2.2 Unsupervised Learning
Unsupervised learning identifies patterns in unlabeled data. The model works independently to find structure in the data.
- **Goal:** Discover hidden patterns or groupings.
- **Subtypes:**
- **Clustering:** Group similar data points (e.g., customer segmentation).
- **Association Rule Discovery:** Find relationships (e.g., market basket analysis).
**Example:**
- Clustering articles with similar topics.
- Grouping customer data to improve targeted marketing.
#### 2.3 Reinforcement Learning
Reinforcement learning involves an agent learning to take actions in an environment to maximize cumulative rewards.
- **Components:**
- **Agent:** Learner or decision-maker.
- **Environment:** External system interacting with the agent.
- **State:** Current situation of the agent.
- **Action:** Possible moves by the agent.
- **Reward:** Feedback signal for the action taken.
**Example:**
- Training a robotic arm to pick up objects.
- Learning to play chess through trial-and-error.
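The agent-environment loop above can be sketched with tabular Q-learning on a toy "corridor" environment. Everything here (the environment, rewards, and hyperparameters) is invented for illustration:

```python
import random

random.seed(0)

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

# Q-table: one row per state, one estimated value per action
Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    """Environment: returns (next_state, reward) for the agent's action."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(300):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy: explore occasionally, otherwise take the best-known action
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update rule
        Q[state][action] += ALPHA * (reward + GAMMA * max(Q[next_state]) - Q[state][action])
        state = next_state

# The learned greedy policy moves right (toward the reward) in every state
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Each component from the list appears here: the loop body is the agent, `step` is the environment, `state`/`action`/`reward` are exactly the signals it exchanges.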
### 3. Data Science and Data Preparation
---
**Data Science Overview**
Data science uses automated techniques to analyze large datasets and extract valuable insights.
- **Data Analysis:** Human-led exploration to gain insights.
- **Data Analytics:** Automates analysis to discover patterns.
**Preparing Data for Machine Learning**
Data preparation ensures data quality for model training.
- **Steps:**
1. Handle missing values.
2. Remove outliers.
3. Select relevant features.
4. Normalize or scale data.
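The four preparation steps above can be sketched with pandas. The dataset, column names, and thresholds below are invented for illustration:

```python
import pandas as pd

# Hypothetical raw dataset with a missing value and an obvious outlier
df = pd.DataFrame({
    "height_cm": [170, 165, None, 180, 999],   # None = missing, 999 = outlier
    "n_purchases": [3, 5, 2, 4, 1],
})

# 1. Handle missing values (here: drop rows that contain them)
df = df.dropna()

# 2. Remove outliers (here: a simple plausibility bound)
df = df[df["height_cm"] < 250]

# 3. Select relevant features
features = df[["height_cm"]]

# 4. Normalize to the [0, 1] range (min-max scaling)
normalized = (features - features.min()) / (features.max() - features.min())
print(normalized)
```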
**Key Concepts:**
- **Attributes:** Features describing data points.
- **Quantitative Attributes:**
- **Discrete:** Countable values (e.g., number of students).
- **Continuous:** Measurable values (e.g., height).
- **Qualitative Attributes:**
- **Ordinal:** Ordered categories (e.g., satisfaction levels).
- **Nominal:** Unordered categories (e.g., gender).
### 4. The Machine Learning Pipeline
---
1. **Define the Problem:** Identify the task and goal.
2. **Collect Data:** Gather diverse, representative datasets.
3. **Prepare Data:** Clean and preprocess the data.
4. **Design Features:** Select the most descriptive features.
5. **Train the Model:** Use training data to learn patterns.
6. **Test the Model:** Evaluate its performance on test data.
7. **Deploy the Model:** Implement it in production.
### 5. Key Comparisons
---
| **Aspect** | **Statistical Models** | **Machine Learning Models** |
|---------------------------|------------------------------------|--------------------------------------|
| **Assumptions** | Strong assumptions (e.g., linearity)| Fewer assumptions |
| **Human Effort** | High (manual feature engineering) | Low (automatic feature extraction) |
| **Data Dependency** | Limited | Relies on large datasets |
| **Predictive Power** | Moderate | High (complex relationships) |
**Example:**
- Statistical: Linear regression for house prices.
- Machine Learning: Random forest for fraud detection.
### 6. Applications of Machine Learning
---
- **Retail:** Market basket analysis, customer segmentation.
- **Finance:** Fraud detection, credit scoring.
- **Healthcare:** Medical diagnosis, disease prediction.
- **Manufacturing:** Process optimization, defect detection.
- **Web:** Search engines, recommendation systems.
## II. Python for Data Science
### 1. Why Python for Data Science?
---
Python is the most popular language in data science due to its simplicity, readability, and extensive community support. It offers:
- **Clear Syntax:** Easy to learn and write.
- **Open Source:** Free and accessible to everyone.
- **Scientific Libraries:** Libraries like NumPy, Pandas, Matplotlib, Scikit-learn, and TensorFlow provide powerful tools for data processing, visualization, and machine learning.
- **Popularity:** Widely used with a large support community.
### 2. Python Basics
---
**Syntax and Variables**
- **Variables:** Containers for storing data values.
```python
age = 25 # Integer
height = 5.9 # Float
name = "Alice" # String
is_student = True # Boolean
```
- **Comments:**
```python
# Single-line comment
"""
Multi-line comment
Useful for documenting code.
"""
```
**Input and Output**
- **Printing:**
```python
var = "Python is awesome"
print(var)
```
- **Input:**
```python
name = input("Enter your name: ")
print("Hello,", name)
```
**Control Structures**
- **Conditional Statements:**
```python
if age > 18:
print("Adult")
else:
print("Minor")
```
- **Loops:**
```python
for i in range(5):
print(i)
```
```python
while age < 30:
print("Young")
age += 1
```
### 3. Data Structures in Python
---
**Collections**
- **List:** Ordered, changeable, allows duplicates.
```python
fruits = ["apple", "banana", "cherry"]
```
- **Tuple:** Ordered, unchangeable, allows duplicates.
```python
coord = (10, 20)
```
- **Set:** Unordered, no duplicates.
```python
vowels = {"a", "e", "i", "o", "u"}
```
- **Dictionary:** Key-value pairs.
```python
capitals = {"France": "Paris", "Italy": "Rome"}
```
### 4. Key Libraries for Data Science
---
Python provides several powerful libraries for machine learning tasks. Each library is tailored to specific stages of the data science pipeline, from preprocessing to model evaluation.
**NumPy**
- **Purpose:** NumPy is used for numerical computations and efficient handling of multi-dimensional arrays. It is especially useful for tasks involving linear algebra, such as matrix operations, which are common in machine learning.
- **Use Cases:**
- Creating and manipulating large arrays or matrices.
- Performing element-wise operations or broadcasting.
- Useful in data preprocessing and mathematical modeling.
**Pandas**
- **Purpose:** Pandas excels in data manipulation and analysis. It provides structures like DataFrames for tabular data handling and Series for one-dimensional data.
- **Use Cases:**
- Reading and cleaning datasets.
- Handling missing data and reshaping.
- Grouping, merging, and aggregating data for feature engineering.
**Matplotlib**
- **Purpose:** This library is used for creating 2D visualizations of data, which help in understanding trends, distributions, or relationships.
- **Use Cases:**
- Visualizing feature distributions or relationships in exploratory data analysis.
- Plotting loss or accuracy during model training.
**Scikit-learn**
- **Purpose:** Scikit-learn is a comprehensive library for implementing machine learning models, evaluating them, and building pipelines.
- **Use Cases:**
- Preprocessing data (e.g., scaling, encoding).
- Training and evaluating models like decision trees or support vector machines.
- Cross-validation and hyperparameter tuning.
**TensorFlow**
- **Purpose:** TensorFlow is a deep learning library that supports both low-level operations and high-level APIs for neural network modeling.
- **Use Cases:**
- Building and training deep neural networks.
- Performing large-scale, distributed machine learning tasks.
These libraries form the backbone of modern machine learning workflows and are often used in combination for end-to-end data science solutions.
**NumPy**
- **Purpose:** Efficiently handle multi-dimensional arrays and perform mathematical computations.
- **Use Cases:** Common in data preprocessing and tasks involving linear algebra, such as matrix operations in machine learning.
- **Key Functions and Their Syntax:**
| Function | Description | Syntax Example |
|--------------|-------------------------------------------------|-------------------------------------|
| `np.array` | Creates an array. | `np.array([1, 2, 3])` |
| `np.dot` | Computes the dot product of two arrays/matrices.| `np.dot(a, b)` |
| `np.mean` | Calculates the mean of array elements. | `np.mean(array)` |
| `np.sum` | Computes the sum of array elements. | `np.sum(array)` |
| `np.reshape` | Reshapes an array to a new shape. | `array.reshape(3, 2)` |
- **Example Usage:**
```python
import numpy as np
array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[5, 6], [7, 8]])
# Dot product
result = np.dot(array1, array2)
print(result)
# Calculate mean
mean_value = np.mean(array1)
print("Mean:", mean_value)
```
**Pandas**
- **Purpose:** Simplifies the manipulation and analysis of structured data.
- **Use Cases:** Ideal for cleaning, organizing, and transforming tabular datasets. Often used for exploratory data analysis (EDA).
- **Key Functions and Their Syntax:**
| Function | Description | Syntax Example |
|----------------|-----------------------------------------------------------------------------|-------------------------------------------|
| `read_csv` | Reads a CSV file into a DataFrame. | `pd.read_csv('file.csv')` |
| `to_csv` | Writes a DataFrame to a CSV file. | `data.to_csv('output.csv')` |
| `groupby` | Groups data by a specific column for aggregation. | `data.groupby('column').sum()` |
| `merge` | Merges two DataFrames on a key. | `pd.merge(df1, df2, on='key')` |
| `pivot` | Reshapes data by pivoting on a column. | `data.pivot(index='row', columns='col')` |
- **Example Usage:**
```python
import pandas as pd
# Create a DataFrame
data = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
# Group by example
grouped = data.groupby("Age").count()
print(grouped)
# Merge example
other_data = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [90, 85]})
merged = pd.merge(data, other_data, on="Name")
print(merged)
```
- **Read/Write Data:**
```python
data = pd.read_csv("file.csv")
data.to_csv("output.csv")
```
**Matplotlib**
- **Purpose:** 2D Data Visualization.
```python
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()
```
**Scikit-learn**
- **Purpose:** Scikit-learn is a comprehensive library for implementing machine learning models, evaluating them, and building pipelines.
- **Key Functions:**
| Function | Description | Syntax Example |
|----------------------|-----------------------------------------------------------------------------|--------------------------------------------|
| `train_test_split` | Splits data into training and testing sets. | `train_test_split(X, y, test_size=0.2)` |
| `StandardScaler` | Standardizes features by removing the mean and scaling to unit variance. | `scaler.fit_transform(X)` |
| `fit` | Trains a machine learning model on the given data. | `model.fit(X_train, y_train)` |
| `predict` | Generates predictions using the trained model. | `model.predict(X_test)` |
| `cross_val_score` | Evaluates a model's performance using cross-validation. | `cross_val_score(model, X, y, cv=5)` |
- **Example:**
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train a model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
```
### 5. Data Manipulation and Analysis
---
**DataFrame Operations in Pandas**
- **Creating a DataFrame:**
```python
data = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
```
- **Viewing Data:**
```python
data.head() # First 5 rows
data.tail() # Last 5 rows
```
- **Filtering Rows:**
```python
data[data["Age"] > 25]
```
- **GroupBy:**
```python
data.groupby("Age").count()
```
**Handling Missing Data**
- **Detecting Missing Values:**
```python
data.isnull().sum()
```
- **Dropping Missing Values:**
```python
data.dropna()
```
### 6. Visualizing Data
---
**Matplotlib**
- **Line Plot:**
```python
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Line Plot")
plt.show()
```
- **Histogram:**
```python
data["Age"].plot.hist()
```
**Seaborn**
- **Purpose:** Enhanced statistical data visualization.
```python
import seaborn as sns
sns.histplot(data["Age"])
```
### 7. Merging and Joining Data
---
**Merge DataFrames**
| Merge Type | Description | Code Example |
|-------------|---------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| **Inner** | Keeps only rows with matching keys in both DataFrames. | `pd.merge(df1, df2, on="Key", how="inner")` |
| **Outer** | Includes all rows from both DataFrames, filling with NaN where no match is found. | `pd.merge(df1, df2, on="Key", how="outer")` |
| **Left** | Includes all rows from the left DataFrame and matches from the right, filling with NaN if absent. | `pd.merge(df1, df2, on="Key", how="left")` |
| **Right** | Includes all rows from the right DataFrame and matches from the left, filling with NaN if absent. | `pd.merge(df1, df2, on="Key", how="right")` |
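The four merge types above can be compared on two tiny DataFrames (invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"Key": ["A", "B", "C"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"Key": ["B", "C", "D"], "y": [20, 30, 40]})

inner = pd.merge(df1, df2, on="Key", how="inner")   # only matching keys: B, C
outer = pd.merge(df1, df2, on="Key", how="outer")   # all keys: A, B, C, D (NaN where absent)
left = pd.merge(df1, df2, on="Key", how="left")     # all keys from df1: A, B, C
right = pd.merge(df1, df2, on="Key", how="right")   # all keys from df2: B, C, D

print(inner)
```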
**Concatenation**
- Concatenates DataFrames along rows or columns.
- Useful for stacking data or adding new columns.
**Example:**
```python
pd.concat([df1, df2], axis=0) # Stack rows
pd.concat([df1, df2], axis=1) # Add columns
```
## III. Supervised Learning and K-Nearest Neighbors
### 1. Introduction to Supervised Learning
---
**Definition**
Supervised learning is a type of machine learning where a model is trained on labeled data. The goal is to learn a mapping from input features to output labels to predict unseen data.
- **Key Components:**
- **Features (Inputs):** Independent variables used for prediction.
- **Labels (Outputs):** Dependent variable or target.
- **Model:** Mathematical representation of the learning process.
**Types of Supervised Learning**
1. **Classification:** Predict discrete labels (e.g., spam vs. not spam).
- Algorithms: K-Nearest Neighbors (KNN), Decision Trees.
2. **Regression:** Predict continuous values (e.g., house prices).
- Algorithms: Linear Regression, KNN for regression.
**Applications**
- **Spam Detection:** Classifying emails as spam or not spam.
- **Fraud Detection:** Identifying fraudulent transactions.
- **Medical Diagnosis:** Predicting diseases from patient data.
- **Image Recognition:** Identifying objects in images.
**Steps in Supervised Learning**
1. **Data Collection:** Gather labeled data.
2. **Data Preprocessing:** Clean and normalize data.
3. **Feature Selection:** Identify important features.
4. **Model Selection:** Choose a suitable algorithm.
5. **Training:** Train the model on the training data.
6. **Evaluation:** Use test data to evaluate performance.
7. **Hyperparameter Tuning:** Optimize model parameters.
8. **Deployment:** Deploy the model to make real-world predictions.
### 2. K-Nearest Neighbors (KNN)
---
**Overview**
KNN is an instance-based learning algorithm used for classification and regression. It relies on distance metrics to find the $k$ closest neighbors to a given data point and makes predictions based on those neighbors.
**How KNN Works**
1. Choose **$k$**, the number of nearest neighbors to consider.
2. Compute the distance between the query point and all training points using a distance metric:
- **Euclidean Distance:**
$$d = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$$
- **Manhattan Distance:**
$$d = \sum_{i=1}^n |x_i - y_i|$$
3. Sort the training data by distance to the query point.
4. Classification:
- Count the class labels of the k nearest neighbors.
- Assign the majority class to the query point.
5. Regression:
- Compute the mean (or weighted mean) of the target values of the k nearest neighbors.
>[!NOTE]Example Statement
>**Dataset:**
>| Point | Feature 1 | Feature 2 | Class |
>|-------|-----------|-----------|-------|
>| A | 2 | 3 | Red |
>| B | 1 | 1 | Blue |
>| C | 4 | 2 | Red |
>| D | 6 | 5 | Blue |
>
>Classify the new point (5, 4) taking $k=3$.
>[!Tip]Solution
>1. **Compute distances**:
> - Distance to A: $\sqrt{(5-2)^2 + (4-3)^2} = \sqrt{9 + 1} = \sqrt{10}$
> - Distance to B: $\sqrt{(5-1)^2 + (4-1)^2} = \sqrt{16 + 9} = \sqrt{25} = 5$
> - Distance to C: $\sqrt{(5-4)^2 + (4-2)^2} = \sqrt{1 + 4} = \sqrt{5}$
> - Distance to D: $\sqrt{(5-6)^2 + (4-5)^2} = \sqrt{1 + 1} = \sqrt{2}$
>2. **Sort neighbors by distance**:
> - D ($\sqrt{2}$), C ($\sqrt{5}$), A ($\sqrt{10}$), B ($5$)
>3. **Majority vote for $k = 3$**:
> - Neighbors: D (Blue), C (Red), A (Red).
> - Predicted Class: **Red**.
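The hand computation above can be checked with a short script that reproduces the same distances and majority vote:

```python
import math
from collections import Counter

# Training points and labels from the example above
points = {"A": (2, 3), "B": (1, 1), "C": (4, 2), "D": (6, 5)}
labels = {"A": "Red", "B": "Blue", "C": "Red", "D": "Blue"}
query = (5, 4)
k = 3

# Euclidean distance from the query point to every training point
dist = {name: math.dist(query, xy) for name, xy in points.items()}

# Sort by distance, keep the k nearest neighbors, then majority vote
nearest = sorted(dist, key=dist.get)[:k]
prediction = Counter(labels[n] for n in nearest).most_common(1)[0][0]
print(nearest, "->", prediction)  # ['D', 'C', 'A'] -> Red
```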
**Challenges and Solutions**
1. **Overfitting (Small $k$):**
- Small $k$ values may overfit noise in the data.
- Solution: Use cross-validation to determine an optimal $k$.
2. **Underfitting (Large $k$):**
- Large $k$ values may oversimplify the model.
- Solution: Balance $k$ to reflect dataset size and complexity.
3. **Feature Scaling:**
- Features with larger ranges dominate distance calculations.
- Solution: Normalize or standardize data.
4. **Curse of Dimensionality:**
- High-dimensional data reduces the effectiveness of distance metrics.
- Solution: Use dimensionality reduction techniques like PCA.
5. **Irrelevant Features:**
- Irrelevant features degrade performance.
- Solution: Perform feature selection.
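Challenge 1 above (choosing $k$ via cross-validation) can be sketched with scikit-learn. The dataset here is synthetic, generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification dataset (invented for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Score several candidate k values with 5-fold cross-validation
scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best k:", best_k)
```

In practice one would also scale the features first (challenge 3), e.g. with `StandardScaler`, before running this search.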
**Python Implementation (Quick Overview)**
**Classification Example:**
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Example data
X = [[2, 3], [1, 1], [4, 2], [6, 5]]
y = ['Red', 'Blue', 'Red', 'Blue']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
**Regression Example:**
```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
# Example data
X = [[1000], [1500], [2000], [2500]]
y = [300, 400, 500, 600]
# Train model
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
# Predict
y_pred = knn.predict([[1750]])
print("Predicted Value:", y_pred)
```
## IV. Regression
### 1. Introduction to Regression
---
**Definition**
Regression is a statistical technique used to predict the value of a dependent (response) variable based on one or more independent (predictor) variables. It identifies relationships between variables and provides a predictive model.
- **Key Types of Regression:**
- **Simple Linear Regression:** One predictor variable and one response variable.
- **Multiple Linear Regression:** Multiple predictor variables.
- **Polynomial Regression:** Models non-linear relationships using polynomial terms.
- **Logistic Regression:** Predicts probabilities for classification problems, outputting binary or categorical results.
**Applications**
- Predicting house prices based on size and location.
- Forecasting sales based on advertising budgets.
- Estimating energy consumption based on weather data.
- Classifying emails as spam or not spam (logistic regression).
### 2. Linear Regression
---
**Simple Linear Regression**
- **Equation:**
$$Y = \theta_0 + \theta_1 X + \epsilon$$
Where:
- $Y$: Dependent variable.
- $X$: Independent variable.
- $\theta_0$: Intercept.
- $\theta_1$: Slope (change in $Y$ for one unit change in $X$).
- $\epsilon$: Residual error (difference between observed and predicted values).
We have the following formulas for the slope and the intercept:
$$\theta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
$$\theta_0 = \bar{Y} - \theta_1 \bar{X}$$
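The two closed-form formulas above translate directly into NumPy. A minimal sketch on toy data with an exact linear relationship $Y = 2X$, so the expected estimates are $\theta_1 = 2$ and $\theta_0 = 0$:

```python
import numpy as np

# Toy data with an exact linear relationship Y = 2X
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

# Closed-form least-squares estimates, term for term as in the formulas above
theta_1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
theta_0 = Y.mean() - theta_1 * X.mean()
print(theta_0, theta_1)  # 0.0 2.0
```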
#### Multiple Linear Regression
- **Equation:**
$$Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + ... + \theta_n X_n + \epsilon$$
Where:
- $X_1, X_2, ..., X_n$: Independent variables.
- $\theta_1, \theta_2, ..., \theta_n$: Coefficients representing the effect of each predictor.
- **Application:** Predicting sales based on TV, radio, and newspaper advertising budgets.
**Polynomial Regression**
- **Equation:**
$$Y = \theta_0 + \theta_1 X + \theta_2 X^2 + ... + \theta_n X^n + \epsilon$$
- Captures non-linear relationships by introducing polynomial terms.
>[!NOTE]Example Statement
>**Scenario**
>We aim to predict house prices ($Y$) based on the size of the house in square footage ($X$).
>
>**Training Data**
>| House Size ($X$) (Square Feet) | Price ($Y$) (in $1000s) |
>|----------------------------------|---------------------------|
>| 2104 | 460 |
>| 1416 | 232 |
>| 1534 | 315 |
>| 852 | 178 |
>
>Predict the price of a 1500-square-foot house.
>[!Tip]Solution
>**Step 1: Calculate the Coefficients**
>The linear regression equation is:
>$$Y = \theta_0 + \theta_1 X$$
>Where:
>- $\theta_0$: Intercept
>- $\theta_1$: Slope
>
>1. **Mean of $X$ and $Y$:**
> $$\bar{X} = \frac{\sum X}{n} = \frac{2104 + 1416 + 1534 + 852}{4} = 1476.5$$
> $$\bar{Y} = \frac{\sum Y}{n} = \frac{460 + 232 + 315 + 178}{4} = 296.25$$
>
>2. **Compute $\theta_1$ (Slope):**
> $$\theta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
> - $(X_i - \bar{X})$: $[627.5, -60.5, 57.5, -624.5]$
> - $(Y_i - \bar{Y})$: $[163.75, -64.25, 18.75, -118.25]$
> - Numerator:
> $$\begin{align*}\sum (X_i - \bar{X})(Y_i - \bar{Y}) &= 627.5 \cdot 163.75 + (-60.5) \cdot (-64.25) \\&+ 57.5 \cdot 18.75 + (-624.5) \cdot (-118.25) \\ &= 181565.5\end{align*}$$
> - Denominator:
> $$\sum (X_i - \bar{X})^2 = 627.5^2 + (-60.5)^2 + 57.5^2 + (-624.5)^2 = 790723$$
> - Result:
> $$\theta_1 = \frac{181565.5}{790723} \approx 0.2296$$
>
>3. **Compute $\theta_0$ (Intercept):**
> $$\theta_0 = \bar{Y} - \theta_1 \bar{X}$$
> $$\theta_0 = 296.25 - 0.2296 \cdot 1476.5 \approx -42.75$$
>
>**Step 2: Final Regression Equation**
>$$Y = -42.75 + 0.2296 X$$
>
>**Step 3: Prediction**
>$$Y = -42.75 + 0.2296 \cdot 1500 = -42.75 + 344.40 = 301.65$$
>
>The predicted price for a 1500-square-foot house is approximately $301,650.
### 3. Logistic Regression
---
**Overview**
Logistic regression is used for binary classification problems. Unlike linear regression, it models the probability that a given input belongs to a particular class.
- **Equation:**
$$P(Y=1|X) = \frac{1}{1 + e^{-\theta_0 - \theta_1 X}}$$
Where:
- $P(Y=1|X)$: Probability of the positive class.
- $\theta_0$: Intercept.
- $\theta_1$: Coefficient for the independent variable $X$.
- **Sigmoid Function:** Converts the linear regression output into a probability:
$$h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}}$$
**Decision Boundary**
- Logistic regression predicts $Y=1$ if $h_\theta(X) \geq 0.5$, otherwise $Y=0$.
- The decision boundary is the threshold where probabilities switch between classes.
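The sigmoid and the 0.5 threshold can be sketched in a few lines. The fitted parameters $\theta_0 = -4$ and $\theta_1 = 1$ here are hypothetical, chosen only to place the decision boundary at $x = 4$:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted parameters (invented for illustration)
theta_0, theta_1 = -4.0, 1.0

def predict(x):
    # Threshold the probability at 0.5, as described above
    p = sigmoid(theta_0 + theta_1 * x)
    return 1 if p >= 0.5 else 0

# The boundary sits where theta_0 + theta_1 * x = 0, i.e. at x = 4
print(predict(3), predict(5))  # 0 1
```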
**Applications**
- Spam detection (spam or not spam).
- Medical diagnosis (disease presence or absence).
- Customer churn prediction (likely to leave or stay).
**Limitations**
- Assumes a linear relationship between the independent variables and the log-odds.
- Not suitable for non-linear relationships without transformation or feature engineering.
### 4. Gradient Descent for Linear Regression
---
**Overview**
Gradient Descent is an optimization algorithm used to minimize the cost function (e.g., Mean Squared Error).
- **Steps:**
1. Initialize $\theta_0$ and $\theta_1$ with random values.
2. Update parameters iteratively:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
Where:
- $\alpha$: Learning rate.
- $J(\theta)$: Cost function.
- **Cost Function (MSE):**
$$J(\theta) = \frac{1}{2m} \sum (Y_i - \hat{Y}_i)^2$$
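The update rule and cost above can be sketched from scratch for simple linear regression. The toy data follows $Y = 1 + 2X$, so gradient descent should recover approximately those parameters; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

# Toy data following Y = 1 + 2X
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])

theta_0, theta_1 = 0.0, 0.0   # step 1: initialize the parameters
alpha = 0.1                   # learning rate
m = len(X)

for _ in range(2000):         # step 2: iterative updates
    Y_hat = theta_0 + theta_1 * X
    error = Y_hat - Y
    # Partial derivatives of J(theta) = (1/2m) * sum(error^2)
    grad_0 = error.mean()
    grad_1 = (error * X).mean()
    theta_0 -= alpha * grad_0
    theta_1 -= alpha * grad_1

print(round(theta_0, 3), round(theta_1, 3))  # close to 1.0 and 2.0
```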
### 5. Python Implementation
---
**Simple Linear Regression**
```python
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
# Data
X = np.array([[2104], [1416], [1534], [852]])
Y = np.array([460, 232, 315, 178])
# Model
model = LinearRegression()
model.fit(X, Y)
# Predictions
predicted = model.predict(X)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Plot
plt.scatter(X, Y, color='blue')
plt.plot(X, predicted, color='red')
plt.show()
```
**Logistic Regression**
```python
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import accuracy_score
# Example data
X = np.array([[2.5], [3.5], [4.5], [5.5]])
y = np.array([0, 0, 1, 1])
# Model
log_reg = LogisticRegression()
log_reg.fit(X, y)
# Predictions
y_pred = log_reg.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))
```
### 6. Challenges and Solutions
---
**Overfitting**
- **Issue:** Model captures noise rather than the true relationship.
- **Solution:** Use regularization techniques (e.g., Ridge, Lasso).
**Multicollinearity**
- **Issue:** Independent variables are highly correlated.
- **Solution:** Remove or combine correlated predictors.
**Outliers**
- **Issue:** Skew results significantly.
- **Solution:** Detect using visualization or statistical tests and remove if necessary.
**Scaling**
- **Issue:** Features with different scales can bias the model.
- **Solution:** Normalize or standardize data.
## V. Unsupervised Learning and K-Means
### 1. Introduction to Unsupervised Learning
---
Unsupervised learning is a type of machine learning where the model learns patterns and relationships from unlabeled data. Unlike supervised learning, there are no predefined labels or target values; the algorithm explores the structure of the data to identify hidden patterns or groupings.
- **Objective:**
The main goal is to discover the underlying structure or distribution in the data without any prior knowledge about outcomes.
- **Key Tasks in Unsupervised Learning:**
1. **Clustering:** Group similar data points into clusters based on their features (e.g., KMeans, Hierarchical Clustering).
2. **Dimensionality Reduction:** Reduce the number of features while retaining important information (e.g., PCA).
3. **Anomaly Detection:** Identify outliers or unusual data points.
4. **Association Rule Learning:** Find relationships between variables in large datasets (e.g., Market Basket Analysis).
- **Applications:**
- Customer segmentation for targeted marketing.
- Image compression using dimensionality reduction techniques.
- Anomaly detection in fraud detection systems.
- Organizing large datasets into meaningful structures (e.g., document clustering).
Unsupervised learning is particularly useful when there is a lack of labeled data, making it a powerful approach for exploratory data analysis and pattern recognition.
### 2. Clustering with K-Means Algorithm
---
**What is Clustering?**
- Clustering is an **unsupervised learning** technique used to group a set of data points into distinct, non-overlapping clusters.
- The goal is to **maximize intra-cluster similarity** and **minimize inter-cluster similarity**.
**K-Means Algorithm**
K-Means is one of the most popular clustering algorithms. It partitions the dataset into $K$ clusters, where each cluster is represented by its **centroid** (mean of all points in the cluster).
**Objective**
The K-Means algorithm aims to minimize the **intra-cluster variance**:
$$\text{W} = \sum_{i=1}^K \sum_{X \in C_i} ||X - \mu_i||^2$$
Where:
- $K$: Number of clusters
- $C_i$: Points in cluster $i$
- $\mu_i$: Centroid of cluster $i$
- $||X - \mu_i||^2$: Squared Euclidean distance between a data point $X$ and its cluster centroid.
**Steps of K-Means**
1. **Initialize** $K$ cluster centroids randomly or using $K$-means++.
2. **Assign** each data point to the nearest cluster (based on distance to centroids).
3. **Update** the centroids by calculating the mean of all points in each cluster.
4. Repeat **Step 2** and **Step 3** until:
- Centroids do not change significantly (convergence), OR
- A maximum number of iterations is reached.
**Illustration of K-Means by Hand**
>[!Note]Example Statement
>We aim to group the following points into $K = 2$ clusters:
>| Point | X | Y |
>|-------|---|----|
>| A | 1 | 1 |
>| B | 1 | 4 |
>| C | 2 | 1 |
>| D | 3 | 3 |
>| E | 5 | 4 |
>
>[!Tip]Solution
>1. **Initialize centroids** randomly: $C_1 = (1,1)$, $C_2 = (5,4)$.
>2. **Assignment Step**: Compute distances to centroids.
> - Distance formula: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
>
>| Point | $d(C_1)$ | $d(C_2)$ | Assigned Cluster |
>|-------|-------------|-------------|------------------|
>| A | 0.0 | 5.0 | $C_1$ |
>| B | 3.0 | 4.0 | $C_1$ |
>| C | 1.0 | 5.0 | $C_1$ |
>| D | 2.8 | 2.2 | $C_2$ |
>| E | 5.0 | 0.0 | $C_2$ |
>
>3. **Update Centroids**:
> - $C_1 = \text{mean of points A, B, C} = \left( \frac{1+1+2}{3}, \frac{1+4+1}{3} \right) = (1.33, 2.0)$.
> - $C_2 = \text{mean of points D, E} = \left( \frac{3+5}{2}, \frac{3+4}{2} \right) = (4.0, 3.5)$.
>
>4. Repeat **assignment** and **update** steps until convergence.
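The assignment and update steps worked by hand above can be reproduced with NumPy, starting from the same initial centroids:

```python
import numpy as np

# Points A..E and the initial centroids C1, C2 from the example above
X = np.array([[1, 1], [1, 4], [2, 1], [3, 3], [5, 4]], dtype=float)
centroids = np.array([[1, 1], [5, 4]], dtype=float)

for _ in range(10):  # this tiny example converges after two iterations
    # Assignment step: index of the nearest centroid for each point
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # convergence: centroids no longer move
    centroids = new_centroids

print(labels)     # A, B, C in cluster 0; D, E in cluster 1
print(centroids)  # (1.33, 2.0) and (4.0, 3.5), matching the hand computation
```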
**Challenges in K-Means**
1. **Number of Clusters (K)**:
- **Solution**: Use the **Elbow Method** to find the optimal $K$.
2. **Initialization Sensitivity**:
- Random initialization can lead to suboptimal results.
- **Solution**: Use **K-Means++** for smarter initialization.
3. **Cluster Shape**:
- K-Means assumes spherical clusters, which may fail for complex shapes.
4. **Outliers**:
- Outliers can distort centroids.
- **Solution**: Remove or preprocess outliers before applying K-Means.
5. **Scalability**:
- K-Means can struggle with large datasets.
- **Solution**: Use **Mini-Batch K-Means** for faster convergence.
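As an example of the scalability workaround, scikit-learn's `MiniBatchKMeans` updates centroids from small random batches instead of the full dataset on every iteration. A short sketch on synthetic data (the blob dataset and batch size are arbitrary):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 points around 3 centers
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=42)

# Mini-batches of 100 points keep each centroid update cheap
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=10, random_state=42)
mbk.fit(X)
print(mbk.cluster_centers_)
```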
**Python Implementation**
```python
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Data
X = np.array([[1, 1], [1, 4], [2, 1], [3, 3], [5, 4]])
# K-Means Clustering
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X)
# Results
print("Cluster Centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)
# Plot Clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', marker='o')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='x')
plt.title("K-Means Clustering")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```
**Output**:
- Cluster centers: Centroids of the $K$ clusters.
- Labels: Cluster assignments for each data point.
### **3. Elbow Method for KMeans**
---
**What is the Elbow Method?**
The Elbow Method is a technique used to determine the **optimal number of clusters $K$** in the KMeans algorithm. It helps identify the value of $K$ where adding more clusters provides minimal improvement, striking a balance between model accuracy and complexity.
**How the Elbow Method Works**
1. **Within-Cluster Sum of Squares (WCSS):**
WCSS measures the sum of squared distances between each data point and its corresponding cluster centroid. A lower WCSS indicates that data points are closer to their centroids.
$$WCSS = \sum_{i=1}^K \sum_{x \in C_i} ||x - \mu_i||^2$$
Where:
- $K$: Number of clusters
- $C_i$: Cluster $i$
- $x$: Data point
- $\mu_i$: Centroid of cluster $i$
2. **Steps to Apply the Elbow Method:**
- Train the KMeans model for different values of $K$ (e.g., from 1 to 10).
- For each $K$, calculate the WCSS.
- Plot the WCSS against $K$.
- Identify the "elbow point," where the WCSS curve starts to flatten out.
3. **Elbow Point:**
- The **optimal $K$** is located at the elbow point of the curve.
- At this point, increasing $K$ further results in diminishing returns (small reductions in WCSS).
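The steps above map directly onto scikit-learn, where `KMeans.inertia_` is exactly the WCSS. A sketch on synthetic blob data (the dataset and the range $K = 1$ to $10$ are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Compute the WCSS (inertia_) for K = 1..10
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

# Plot WCSS vs K and look for the elbow
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()
```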
**Illustration Explanation**

In the provided graph:
- The **X-axis** represents the number of clusters ($K$).
- The **Y-axis** shows the error rate or WCSS.
- As $K$ increases, the WCSS decreases, but the rate of decrease slows after a certain point.
- The **"elbow"** at $K = 3$ marks the optimal number of clusters. Beyond this point, increasing $K$ yields minimal improvement.
## VI. Hierarchical Clustering
### 1. **Introduction to Hierarchical Clustering**
---
Hierarchical clustering is an **unsupervised learning algorithm** used to group similar objects into clusters in a hierarchical structure. Unlike other clustering methods, it produces a **tree-like diagram** called a **dendrogram** that visually represents the nested groupings of data.
### 2. **Types of Hierarchical Clustering**
---
1. **Agglomerative Clustering (Bottom-Up Approach):**
- **Process:**
- Start with each data point as its own cluster.
- Iteratively merge the two closest clusters until a single cluster remains.
- **Visualization:** Dendrogram shows the merging process.
- **Example Use Case:** Customer segmentation.
2. **Divisive Clustering (Top-Down Approach):**
- **Process:**
- Start with all data points in one cluster.
- Iteratively split the clusters until each point forms its own cluster.
- **Visualization:** Dendrogram displays the splitting process.
- **Limitation:** Computationally expensive; less commonly used in practice.
### 3. **Key Concepts in Hierarchical Clustering**
---
**Distance Measures**
The algorithm relies on **distance measures** to compute the closeness between points or clusters. Common metrics include:
- **Euclidean Distance** (most common):
$$d(A, B) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$$
- **Manhattan Distance**:
$$d(A, B) = \sum_{i=1}^n |x_i - y_i|$$
**Linkage Methods**
Linkage methods determine how distances between clusters are calculated:
1. **Single Linkage (Minimum Distance):**
Distance between the closest points of two clusters.
2. **Complete Linkage (Maximum Distance):**
Distance between the farthest points of two clusters.
3. **Average Linkage:**
Average distance between all pairs of points in two clusters.
4. **Ward's Linkage:**
Minimizes the **sum of squared distances** within clusters, focusing on compact clusters.
| **Linkage Method** | **Description** | **Advantages** | **Disadvantages** |
|---------------------|-----------------------------------|------------------------------------|----------------------------------|
| Single Linkage | Closest points between clusters | Works well for elongated clusters | Sensitive to outliers |
| Complete Linkage | Farthest points between clusters | Produces compact clusters | Computationally expensive |
| Average Linkage | Average pairwise distances | Balanced approach | Slower than single linkage |
| Ward's Linkage | Minimizes intra-cluster variance | Produces compact and spherical clusters | Computationally intensive for large data |
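The practical effect of the linkage choice can be observed with SciPy's `linkage` function, whose output matrix records the distance at which each merge occurred. A small sketch reusing the five points from the worked example below:

```python
from scipy.cluster.hierarchy import linkage
import numpy as np

data = np.array([[2, 0], [0, 1], [0, 2], [3, 4], [5, 4]])  # M1..M5

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(data, method=method)
    # Column 2 of the linkage matrix holds the merge distances
    print(method, np.round(Z[:, 2], 2))
```

The merge order is the same here, but the heights at which clusters join differ by method, which changes where a dendrogram cut lands.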
### 4. **Steps in Agglomerative Hierarchical Clustering**
---
1. **Initialization:** Treat each data point as its own cluster.
2. **Compute Distance Matrix:** Calculate pairwise distances between all points.
3. **Merge Closest Clusters:** Based on the chosen linkage method, combine the two closest clusters.
4. **Update Distance Matrix:** Recompute distances between the new cluster and the remaining clusters.
5. **Repeat Steps 3–4:** Continue until all points are merged into one cluster.
6. **Construct the Dendrogram:** Visualize the clustering process and determine the number of clusters by **cutting the dendrogram** at a certain height.
### 5. **Example of Agglomerative Clustering**
---
>[!Note]Dataset Information
>We have 5 points:
>| **Point** | $X$ | $Y$ |
>|-----------|-------|-------|
>| M1 | 2 | 0 |
>| M2 | 0 | 1 |
>| M3 | 0 | 2 |
>| M4 | 3 | 4 |
>| M5 | 5 | 4 |
>[!Tip]Solution
>**Step 1: Compute Pairwise Distances**
>We calculate the pairwise **Euclidean distance** between all points using the formula:
>$$d(M_i, M_j) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
>| | **M1** | **M2** | **M3** | **M4** | **M5** |
>|-------|--------|--------|--------|--------|--------|
>| **M1** | 0.0 | 2.23 | 2.82 | 4.12 | 5.0 |
>| **M2** | 2.23 | 0.0 | 1.0 | 4.24 | 5.83 |
>| **M3** | 2.82 | 1.0 | 0.0 | 3.61 | 5.39 |
>| **M4** | 4.12 | 4.24 | 3.61 | 0.0 | 2.0 |
>| **M5** | 5.0 | 5.83 | 5.39 | 2.0 | 0.0 |
>
>**Step 2: Start Clustering**
>1. **Initial Step:** Each data point is treated as its own cluster.
> - Clusters: $C_1 = \{M1\}, C_2 = \{M2\}, C_3 = \{M3\}, C_4 = \{M4\}, C_5 = \{M5\}$.
>
>2. **Iteration 1: Merge Closest Points**
> - Closest points: $M2$ and $M3$ (distance = 1.0).
> - Merge into a new cluster $C_{23} = \{M2, M3\}$.
> - Updated clusters: $C_1, C_{23}, C_4, C_5$. Distances to $C_{23}$ below use **single linkage** (the minimum distance to $M2$ or $M3$).
>
> | | **M1** | $C_{23}$ | **M4** | **M5** |
> |-------|--------|-----------|--------|--------|
> | **M1** | 0.0 | 2.23 | 4.12 | 5.0 |
> | $C_{23}$ | 2.23 | 0.0 | 3.61 | 5.39 |
> | **M4** | 4.12 | 3.61 | 0.0 | 2.0 |
> | **M5** | 5.0 | 5.39 | 2.0 | 0.0 |
>
>3. **Iteration 2: Merge Next Closest Points**
> - Closest clusters: $M4$ and $M5$ (distance = 2.0, the smallest remaining).
> - Merge into a new cluster $C_{45} = \{M4, M5\}$.
> - Updated clusters: $C_1, C_{23}, C_{45}$.
>
> | | **M1** | $C_{23}$ | $C_{45}$ |
> |-------|--------|-----------|-----------|
> | **M1** | 0.0 | 2.23 | 4.12 |
> | $C_{23}$ | 2.23 | 0.0 | 3.61 |
> | $C_{45}$ | 4.12 | 3.61 | 0.0 |
>
>4. **Iteration 3: Merge Next Closest Points**
> - Closest clusters: $M1$ and $C_{23}$ (distance = 2.23).
> - Merge into a new cluster $C_{123} = \{M1, M2, M3\}$.
> - Updated clusters: $C_{123}, C_{45}$.
>
>| | $C_{123}$ | $C_{45}$ |
>|-------|------------|-----------|
>| $C_{123}$ | 0.0 | 3.61 |
>| $C_{45}$ | 3.61 | 0.0 |
>
>5. **Iteration 4: Final Merge**
> - Closest clusters: $C_{123}$ and $C_{45}$ (distance = 3.61, the single-linkage distance $d(M3, M4)$).
> - Merge into one final cluster $C_{12345} = \{M1, M2, M3, M4, M5\}$.
>
>This gives us the following dendrogram:
>
### 6. **Understanding a Dendrogram**
---
A **dendrogram** is a tree-like diagram that illustrates the process of hierarchical clustering. It visually represents the sequence in which clusters are merged or split, as well as the distances between them.
**Key Elements of a Dendrogram**
1. **Data Points (Leaf Nodes):**
- The **x-axis** displays the individual data points or observations (e.g., 0, 1, 2, 3, 4, 5).
- Each leaf represents a single data point before clustering starts.
2. **Height (Y-axis):**
- The **y-axis** represents the **distance** or dissimilarity between clusters being merged.
- The higher the vertical line, the greater the distance between the clusters.
3. **Horizontal Lines:**
- These lines connect clusters that are merged at each step.
- The **height of the connection** indicates the distance between the clusters being joined.
4. **Clusters:**
- By “cutting” the dendrogram at a certain height (horizontal threshold), we can determine the number of clusters.
- Points joined below the cut belong to the same cluster.
**Step-by-Step Interpretation of the Dendrogram**
>[!Note]Dendrogram
>Let's interpret the dendrogram constructed in the previous example. (The figure was likely generated with Ward's linkage, the method used in the Python implementation later in this section, which is why the merge heights differ from the pairwise distances computed by hand.)
>
1. **First Merge:**
- $M2$ and $M3$ are merged first at a height of **1** because they are the closest pair of points (minimum distance).
2. **Second Merge:**
- $M4$ and $M5$ are merged at a height of **2**, forming another small cluster.
3. **Third Merge:**
- $M1$ joins the cluster containing $M2$ and $M3$ at a height of **3**, as it is the next closest to this group.
4. **Final Merge:**
- The two large clusters $\{M1, M2, M3\}$ and $\{M4, M5\}$ are merged at a height of **7**, representing the largest distance in the dataset.
**How to Determine the Number of Clusters**
To decide on the number of clusters:
1. Draw a horizontal line across the dendrogram at a chosen height.
2. Count the number of vertical lines that intersect the horizontal line.
- For example, cutting at **height = 3** produces **2 clusters**:
- **Cluster 1:** $M1, M2, M3$
- **Cluster 2:** $M4, M5$
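Cutting the dendrogram at a chosen height can also be done programmatically with SciPy's `fcluster`. A sketch on the M1–M5 points, cutting at height 3 with Ward's linkage (the method used in the Python implementation below):

```python
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

data = np.array([[2, 0], [0, 1], [0, 2], [3, 4], [5, 4]])  # M1..M5
Z = linkage(data, method='ward')

# Cut the dendrogram at height 3: points merged below 3 share a cluster
clusters = fcluster(Z, t=3, criterion='distance')
print(clusters)  # M1, M2, M3 share one label; M4, M5 share another
```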
### 7. Pros & Cons, Applications and Python Implementation
---
| **Advantages** | **Disadvantages** |
|---------------------------------------------|------------------------------------------|
| No need to pre-specify the number of clusters | Computationally expensive for large datasets |
| Produces a dendrogram for easy interpretation | Sensitive to noise and outliers |
| Suitable for small to medium-sized datasets | Choice of linkage method affects results |
**Applications of Hierarchical Clustering**
1. **Biology:** Grouping genes or species based on similarity.
2. **Marketing:** Customer segmentation for targeted campaigns.
3. **Image Processing:** Image segmentation and object detection.
4. **Social Network Analysis:** Detecting communities or groups.
**Python Implementation of Hierarchical Clustering**
```python
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Sample Data
data = [[2, 0], [0, 1], [0, 2], [3, 4], [5, 4]]
# Perform Agglomerative Hierarchical Clustering
linkage_matrix = linkage(data, method='ward')
# Plot Dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linkage_matrix, labels=['M1', 'M2', 'M3', 'M4', 'M5'])
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Points")
plt.ylabel("Distance")
plt.show()
```
## VII. Model Evaluation
### **1. Importance of Model Evaluation**
---
Evaluating a machine learning model is essential to measure its performance, compare results, and ensure it generalizes well to unseen data. The choice of evaluation metrics depends on the type of problem:
- **Classification** (binary/multi-class)
- **Regression** (continuous numeric output)
### **2. Classification Metrics**
---
#### 2.1 **Confusion Matrix**
The **confusion matrix** summarizes the prediction results of a classifier:
| **Actual / Predicted** | **Positive (1)** | **Negative (0)** |
|-------------------------|------------------|------------------|
| **Positive (1)** | True Positive (TP) | False Negative (FN) |
| **Negative (0)** | False Positive (FP) | True Negative (TN) |
- **TP (True Positive):** Correctly predicted positives.
- **TN (True Negative):** Correctly predicted negatives.
- **FP (False Positive):** Incorrectly predicted as positive (Type I error).
- **FN (False Negative):** Incorrectly predicted as negative (Type II error).
>[!Tip]Example
>In a spam email classifier tested on 200 emails (100 spam and 100 not spam):
>- 95 not spam predicted correctly → TN = 95
>- 5 not spam predicted as spam → FP = 5
>- 97 spam emails predicted as spam → TP = 97
>- 3 spam predicted as not spam → FN = 3
>
>The confusion matrix is:
>| Actual / Predicted | Spam (1) | Not Spam (0) |
>|---------------------|----------|--------------|
>| **Spam (1)** | 97 | 3 |
>| **Not Spam (0)** | 5 | 95 |
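The same matrix can be reproduced with scikit-learn's `confusion_matrix`, reconstructing labels that match the counts above. Note that by default the rows and columns are ordered by label value (0 first), so the layout is `[[TN, FP], [FN, TP]]`:

```python
from sklearn.metrics import confusion_matrix

# 1 = spam, 0 = not spam; labels reconstructed to match the example counts
y_true = [1] * 100 + [0] * 100                     # 100 spam, 100 not spam
y_pred = [1] * 97 + [0] * 3 + [0] * 95 + [1] * 5   # 97 TP, 3 FN, 95 TN, 5 FP

print(confusion_matrix(y_true, y_pred))
# → [[95  5]
#    [ 3 97]]
```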
---
#### 2.2 **Accuracy**
The ratio of correctly predicted samples to the total samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
**Interpretation:**
- Accuracy represents how often the model makes correct predictions overall.
- It is easy to understand and widely used but can be misleading when the classes are imbalanced.
- For example, if 95% of emails are not spam, a model that always predicts "not spam" will have 95% accuracy even though it fails to identify any spam emails.
>[!Tip]Example
>From the confusion matrix:
>| **Actual / Predicted** | **Spam (1)** | **Not Spam (0)** |
>|-------------------------|--------------|------------------|
>| **Spam (1)** | 97 (TP) | 3 (FN) |
>| **Not Spam (0)** | 5 (FP) | 95 (TN) |
>
>- $TP = 97, TN = 95, FP = 5, FN = 3$
>- Accuracy:
>$$\text{Accuracy} = \frac{97 + 95}{97 + 95 + 5 + 3} = \frac{192}{200} = 0.96$$
>The model achieves **96% accuracy**, indicating good overall performance.
**Limitation:**
If only 10 out of 200 emails are spam, and the model predicts all as "not spam," the accuracy is:
$$\text{Accuracy} = \frac{190}{200} = 0.95$$
The model fails to detect spam (zero $TP$), but accuracy remains high, which is misleading.
**Python Example:**
```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)
```
---
#### 2.3 **Precision**
The proportion of true positives among all predicted positives:
$$\text{Precision} = \frac{TP}{TP + FP}$$
**Interpretation:**
- Precision focuses on the quality of positive predictions.
- A high precision means fewer false positives ($FP$).
- Precision is particularly important when false positives have serious consequences, such as marking important non-spam emails as spam.
>[!Tip]Example
>From the confusion matrix:
>| **Actual / Predicted** | **Spam (1)** | **Not Spam (0)** |
>|-------------------------|--------------|------------------|
>| **Spam (1)** | 97 (TP) | 3 (FN) |
>| **Not Spam (0)** | 5 (FP) | 95 (TN) |
>
>- $TP = 97, FP = 5$
>- Precision:
>$$\text{Precision} = \frac{97}{97 + 5} = \frac{97}{102} \approx 0.95$$
>The precision is **95%**, meaning 95% of the emails predicted as spam are indeed spam.
**When False Positives Matter:**
If false positives lead to critical emails being wrongly marked as spam, precision must be prioritized.
**Python Example:**
```python
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
print("Precision:", precision)
```
---
#### 2.4 **Recall (Sensitivity / True Positive Rate)**
The proportion of true positives correctly identified:
$$\text{Recall} = \frac{TP}{TP + FN}$$
**Interpretation:**
- Recall focuses on identifying all actual positives.
- A high recall means fewer false negatives ($FN$), which is crucial when missing positives is costly (e.g., undetected spam).
- Recall is particularly important when false negatives carry serious consequences, such as missing a cancer diagnosis.
>[!Tip]Example
>From the confusion matrix:
>| **Actual / Predicted** | **Spam (1)** | **Not Spam (0)** |
>|-------------------------|--------------|------------------|
>| **Spam (1)** | 97 (TP) | 3 (FN) |
>| **Not Spam (0)** | 5 (FP) | 95 (TN) |
>
>- $TP = 97, FN = 3$
>- Recall:
>$$\text{Recall} = \frac{97}{97 + 3} = \frac{97}{100} = 0.97$$
>The recall is **97%**, meaning 97% of actual spam emails were correctly identified.
**When False Negatives Matter:**
If false negatives result in spam emails bypassing filters, recall must be prioritized.
**Python Example:**
```python
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
print("Recall:", recall)
```
---
#### 2.5 **Specificity (True Negative Rate)**
The proportion of true negatives among all actual negatives:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
**Interpretation:**
- Specificity focuses on the quality of negative predictions.
- A high specificity means fewer false positives ($FP$), ensuring that the model correctly identifies most non-spam emails.
- Specificity is particularly important in scenarios where false positives are costly, such as falsely labeling critical non-spam emails as spam.
>[!Tip]Example
>From the confusion matrix:
>| **Actual / Predicted** | **Spam (1)** | **Not Spam (0)** |
>|-------------------------|--------------|------------------|
>| **Spam (1)** | 97 (TP) | 3 (FN) |
>| **Not Spam (0)** | 5 (FP) | 95 (TN) |
>
>- $TN = 95, FP = 5$
>- Specificity:
>$$\text{Specificity} = \frac{95}{95 + 5} = \frac{95}{100} = 0.95$$
>The specificity is **95%**, meaning that 95% of actual non-spam emails were correctly identified as not spam.
**When False Positives Matter:**
If false positives result in critical emails being misclassified as spam, specificity must be prioritized to reduce this error.
**Python Example:**
```python
from sklearn.metrics import confusion_matrix
# Assuming y_true and y_pred are defined
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print("Specificity:", specificity)
```
---
#### 2.6 **F1-Score**
The harmonic mean of precision and recall:
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
- **Best for:** Balancing precision and recall.
**Interpretation:**
- F1-Score balances precision and recall, providing a single performance measure.
- It is particularly useful when the dataset is imbalanced or when both false positives and false negatives are important.
- A high F1-Score indicates a good tradeoff between precision and recall.
>[!Tip]Example
>- Precision = $0.95$
>- Recall = $0.97$
>
>F1-Score:
>$$F1 = 2 \cdot \frac{0.95 \cdot 0.97}{0.95 + 0.97} = 2 \cdot \frac{0.9215}{1.92} \approx 0.96$$
>The F1-Score is **96%**, reflecting a good balance between precision and recall.
**Why Use F1-Score?**
- In scenarios like spam detection, where both false positives (important emails marked as spam) and false negatives (spam passing filters) are problematic, the F1-Score ensures both are considered.
**Python Example:**
```python
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print("F1-Score:", f1)
```
---
#### 2.7 **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**
**ROC Curve Overview**
The **ROC curve** (Receiver Operating Characteristic) visually evaluates the performance of a classification model by plotting the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various decision thresholds.
- **True Positive Rate (TPR) or Recall**:
$$\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
It represents the proportion of actual positives correctly predicted by the model.
- **False Positive Rate (FPR):**
$$\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$$
It shows the proportion of actual negatives incorrectly predicted as positives.
**Interpreting the ROC Curve**
- **X-Axis**: False Positive Rate (FPR) — Represents errors where non-spam (or class 0) data points are wrongly classified as spam (class 1).
- **Y-Axis**: True Positive Rate (TPR) — Represents how well the model identifies actual positives (spam).
- **The Shape of the ROC Curve**:
- A curve **closer to the top-left corner** indicates a better-performing model, as it achieves a high TPR with a low FPR.
- A curve along the **diagonal (gray dashed line)** corresponds to a random classifier (AUC = 0.5).
**ROC-AUC Score**
The **AUC (Area Under Curve)** represents the overall ability of the model to distinguish between the positive and negative classes.
- $\text{AUC} = 1.0$: The model perfectly distinguishes between classes.
- $\text{AUC} = 0.5$: The model performs no better than random guessing.
- $0.5 < \text{AUC} < 1.0$: The model has some degree of predictive power.
- $\text{AUC} < 0.5$: The model's predictions are worse than random guessing.
>[!Note]ROC-AUC Curve
>
>*N.B. I adjusted the values slightly when generating this curve to make it more realistic.*
>[!Tip]Interpretation
>The provided ROC curve has an **AUC = 0.94**, which indicates that the model performs very well at distinguishing between spam and not spam emails.
>
>1. **Low False Positive Rate (FPR):**
> - On the left side of the curve, the model achieves a high True Positive Rate (around 80–100%) with a very low FPR.
> - This means the model correctly identifies most spam emails without mistakenly labeling many non-spam emails as spam.
>
>2. **Performance Overall:**
> - The steep rise in the curve towards the top-left corner reflects that the model performs much better than random guessing (the diagonal line).
> - AUC = 0.94 means that for a randomly selected spam email and a randomly selected non-spam email, there is a **94% chance** that the model will correctly assign a higher score to the spam email.
>
>3. **Decision Thresholds:**
> - The curve shows how TPR and FPR trade off as the decision threshold changes.
> - By lowering the threshold, the model achieves a higher TPR (detects more spam emails) but at the cost of increasing the FPR (mislabeling non-spam emails as spam).
>
>**Summary of the Curve**
>- The ROC curve indicates excellent classification performance (AUC = 0.94).
>- The model can reliably distinguish between spam and not spam, with very few misclassifications.
>- This makes the model suitable for applications where minimizing false positives and false negatives is critical.
**Python Example**
```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Example data
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
# Plot ROC Curve
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--') # Random classifier line
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC-AUC Curve')
plt.legend()
plt.show()
```
### **3. Regression Metrics**
---
#### **3.1 Mean Absolute Error (MAE)**
The **Mean Absolute Error (MAE)** measures the average magnitude of the errors between actual values ($y_i$) and predicted values ($\hat{y}_i$), ignoring their direction. It is calculated as:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$$
#### **Interpretation:**
- MAE is intuitive and interpretable because it represents the error in the same unit as the target variable.
- Smaller values of MAE indicate a better fit of the model to the data.
- MAE treats all errors equally, which means it is less sensitive to large outliers compared to MSE.
---
#### **3.2 Mean Squared Error (MSE)**
The **Mean Squared Error (MSE)** is the average of the squared differences between actual values ($y_i$) and predicted values ($\hat{y}_i$):
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
#### **Interpretation:**
- MSE gives higher weight to larger errors because errors are squared.
- It is more sensitive to large outliers compared to MAE, which can be useful when large errors must be penalized.
- Smaller MSE values indicate better predictive accuracy.
---
#### **3.3 Root Mean Squared Error (RMSE)**
The **Root Mean Squared Error (RMSE)** is the square root of MSE, which brings the error back to the same units as the target variable:
$$\text{RMSE} = \sqrt{\text{MSE}}$$
#### **Interpretation:**
- RMSE provides an interpretable measure of error, as it is expressed in the same units as the target variable.
- Like MSE, RMSE penalizes large errors more heavily due to squaring.
- RMSE is often preferred when understanding the actual size of errors is important.
---
#### **3.4 R-Squared ($R^2$)**
The **R-Squared ($R^2$)** value measures the proportion of variance in the target variable that is explained by the model:
$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
Where:
- $y_i$: Actual values
- $\hat{y}_i$: Predicted values
- $\bar{y}$: Mean of the actual values
#### **Interpretation:**
- $R^2$ close to 1 indicates that the model explains most of the variance in the data (good fit).
- $R^2 = 0$ indicates that the model does not explain the variance better than a simple mean predictor.
- $R^2 < 0$: The model performs worse than predicting the mean.
---
#### **3.5 Adjusted R-Squared ($R_{\text{adj}}^2$)**
The **Adjusted R-Squared** is a modified version of $R^2$ that adjusts for the number of predictors (independent variables) in the model. It penalizes the inclusion of unnecessary predictors, preventing overestimation of model performance.
**Formula**
$$R_{\text{adj}}^2 = 1 - \left( \frac{(1 - R^2) \cdot (n - 1)}{n - p - 1} \right)$$
Where:
- $R^2$: R-Squared value
- $n$: Number of observations (data points)
- $p$: Number of predictors (independent variables)
**Interpretation**
- Adjusted $R^2$ provides a more realistic measure of the model's performance when multiple predictors are included.
- Unlike $R^2$, the adjusted $R^2$ **does not increase automatically** when additional predictors are added unless those predictors significantly improve the model.
- A higher $R_{\text{adj}}^2$ indicates a better balance between goodness of fit and model complexity.
**When to Use Adjusted $R^2$:**
- For **multiple linear regression** models with more than one predictor.
- To evaluate whether adding new features improves the model.
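scikit-learn does not ship an adjusted $R^2$ metric, so it is typically computed from `r2_score`'s output with a small helper. A sketch (the sample values `r2=0.9`, `n=50`, `p=3` are hypothetical):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared: penalizes R^2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical model: R^2 = 0.9 from 50 observations and 3 predictors
print(adjusted_r2(0.9, n=50, p=3))  # ≈ 0.893
```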
---
#### 3.6 **Example of Application Across Metrics**
>[!Note]Scenario
>We aim to predict house prices ($y$) based on house sizes ($X$) using a regression model. Below are the **actual prices** and **predicted prices** (in $1000s):
>
>| **House ID** | **Actual Price ($y_i$)** | **Predicted Price ($\hat{y}_i$)** |
>|--------------|----------------------------|-----------------------------------|
>| 1 | 200 | 210 |
>| 2 | 250 | 240 |
>| 3 | 300 | 280 |
>| 4 | 400 | 390 |
>| 5 | 500 | 470 |
>[!Tip]Solution
>**Step 1: Calculate the Errors**
>$$\text{Errors} = y_i - \hat{y}_i$$
>| **House ID** | $y_i$ | $\hat{y}_i$ | **Error** $(y_i - \hat{y}_i)$ | **Absolute Error** | **Squared Error** |
>|--------------|---------|---------------|--------------------------------|--------------------|-------------------|
>| 1 | 200 | 210 | -10 | 10 | 100 |
>| 2 | 250 | 240 | 10 | 10 | 100 |
>| 3 | 300 | 280 | 20 | 20 | 400 |
>| 4 | 400 | 390 | 10 | 10 | 100 |
>| 5 | 500 | 470 | 30 | 30 | 900 |
>
>**Step 2: Compute the Metrics**
>
>1. **Mean Absolute Error (MAE):**
$$\text{MAE} = \frac{1}{5} \sum |y_i - \hat{y}_i| = \frac{10 + 10 + 20 + 10 + 30}{5} = 16$$
> - **Interpretation:** On average, the predictions deviate by $16,000 from the actual prices.
>
>2. **Mean Squared Error (MSE):**
> $$\text{MSE} = \frac{1}{5} \sum (y_i - \hat{y}_i)^2 = \frac{100 + 100 + 400 + 100 + 900}{5} = 320$$
> - **Interpretation:** The squared errors average to 320 (in squared $1000s), so MSE is not directly interpretable in dollars. Large errors (like 30) have a disproportionately greater impact.
>
>3. **Root Mean Squared Error (RMSE):**
> $$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{320} \approx 17.89$$
> - **Interpretation:** The average size of errors is approximately $17,890.
>
>4. **R-Squared ($R^2$)**:
> First, compute the total variance ($\sum (y_i - \bar{y})^2$):
> $$\bar{y} = \frac{200 + 250 + 300 + 400 + 500}{5} = 330$$
> Total variance:
> $$\begin{align*}\sum (y_i - \bar{y})^2 &= (200-330)^2 + (250-330)^2 + (300-330)^2 \\ &+ (400-330)^2 + (500-330)^2 \\ &= 16900 + 6400 + 900 + 4900 + 28900 = 58000\end{align*}$$
> Residual variance:
> $$\sum (y_i - \hat{y}_i)^2 = 100 + 100 + 400 + 100 + 900 = 1600$$
> R-Squared:
> $$R^2 = 1 - \frac{1600}{58000} = 1 - 0.0276 \approx 0.972$$
> - **Interpretation:** The model explains **97.2% of the variance** in house prices.
>5. **Adjusted R-Squared ($R_{\text{adj}}^2$)**
> When multiple predictors are involved, $R_{\text{adj}}^2$ adjusts $R^2$ for the number of predictors to avoid overfitting.
> $$R_{\text{adj}}^2 = 1 - \left( \frac{(1 - R^2) \cdot (n - 1)}{n - p - 1} \right)$$
> Where:
> - $n = 5$ (number of observations)
> - $p = 2$ (suppose the model used two predictors, $X_1$ and $X_2$)
> - $R^2 = 0.972$
>
> Substitute values:
> $$R_{\text{adj}}^2 = 1 - \left( \frac{(1 - 0.972) \cdot (5 - 1)}{5 - 2 - 1} \right)$$
> $$R_{\text{adj}}^2 = 1 - \left( \frac{0.028 \cdot 4}{2} \right) = 1 - 0.056 = 0.944$$
>- **Interpretation:** After accounting for the two predictors, the model explains **94.4% of the variance** in house prices. The drop from $R^2 = 0.972$ to $R_{\text{adj}}^2 = 0.944$ shows that adding predictors introduces complexity that slightly reduces the adjusted explained variance.
**Conclusion**
- MAE gives an interpretable, average error size.
- MSE and RMSE penalize large errors, with RMSE offering interpretability.
- $R^2$ provides a measure of how well the model explains the target variable’s variability.
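The hand computations above can be checked with `sklearn.metrics` on the same house-price data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([200, 250, 300, 400, 500])  # actual prices ($1000s)
y_pred = np.array([210, 240, 280, 390, 470])  # predicted prices

mae = mean_absolute_error(y_true, y_pred)   # 16.0
mse = mean_squared_error(y_true, y_pred)    # 320.0
rmse = np.sqrt(mse)                         # ≈ 17.89
r2 = r2_score(y_true, y_pred)               # ≈ 0.972
print(mae, mse, rmse, r2)
```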
### **4. Choosing the Right Metric**
---
| **Task** | **Metric** | **When to Use** |
|---------------------|---------------------------|------------------------------------|
| **Classification** | Accuracy | Balanced classes |
| **Classification** | Precision | Minimize false positives |
| **Classification** | Recall | Minimize false negatives |
| **Classification** | F1-Score | Balance precision and recall |
| **Classification** | ROC-AUC | Measure overall classifier quality |
| **Regression** | MAE | Errors need clear interpretation |
| **Regression** | MSE / RMSE | Penalize large errors |
| **Regression** | $R^2$ | Model explains variability |
## **VIII. Cross-Validation**
### **1. Introduction to Cross-Validation**
---
Cross-validation is a resampling technique used to evaluate machine learning models on limited data. Its primary purpose is to assess the **generalizability** of a model, ensuring it performs well on unseen data.
- **Key Objectives:**
- Prevent **overfitting** (high variance): Model performs well on training data but poorly on test data.
- Prevent **underfitting** (high bias): Model is too simple and cannot capture patterns.
- Maximize model performance with reliable performance metrics.
### **2. Cross-Validation Process**
---
The dataset is split into **training** and **validation** subsets multiple times to test model performance under different splits.
- **Steps:**
1. Partition the dataset into **k subsets** (folds).
2. For each fold:
- Train the model on $k-1$ subsets.
- Validate the model on the remaining subset (fold).
3. Repeat the process $k$ times, using a different fold as the validation set each time.
4. Average the results across folds to estimate the model's performance.
### **3. Techniques of Cross-Validation**
---
#### **3.1 Holdout Validation**
- The dataset is split into two subsets: **training set** and **test set**.
- Training set: Used to fit the model.
- Test set: Used to evaluate model performance.
**Steps:**
1. Split data into training (e.g., 80%) and testing (20%).
2. Train the model on the training data.
3. Evaluate the model on the test data.
**Pros:** Simple and fast.
**Cons:**
- Not reliable on small datasets.
- Performance depends on the random split.
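The holdout steps above are usually implemented with scikit-learn's `train_test_split`. A sketch with a synthetic classification dataset and logistic regression as placeholder choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=42)

# 80% training, 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```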
#### **3.2 K-Fold Cross-Validation**
- The dataset is divided into **k folds** (subsets).
- At each iteration, one fold serves as the validation set, and the remaining $k-1$ folds form the training set.
- This process is repeated $k$ times, and the performance is averaged.
**Steps for $k = 5$:**
| Iteration | **Fold 1** | **Fold 2** | **Fold 3** | **Fold 4** | **Fold 5** |
|-----------|------------|------------|------------|------------|------------|
| 1 | Validation | Training | Training | Training | Training |
| 2 | Training | Validation | Training | Training | Training |
| 3 | Training | Training | Validation | Training | Training |
| 4 | Training | Training | Training | Validation | Training |
| 5 | Training | Training | Training | Training | Validation |
**Advantages:**
- Uses the entire dataset for training and testing.
- Reduces the risk of overfitting and underfitting.
**Python Code Example:**
```python
from sklearn.model_selection import KFold
import numpy as np
# Sample Data
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Test:", test_index)
```
#### **3.3 Stratified K-Fold Cross-Validation**
- An extension of K-Fold used for **classification tasks**.
- Ensures that the class distribution (proportion of labels) in each fold is similar to the original dataset.
**When to Use:**
- Class imbalance problems (e.g., 90% class 0, 10% class 1).
**Python Code Example:**
```python
from sklearn.model_selection import StratifiedKFold
import numpy as np
# Simulated data
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
```
#### **3.4 Leave-One-Out Cross-Validation (LOOCV)**
- A special case of K-Fold where $k = n$ (number of samples).
- For each iteration, a single data point is used as the validation set, and the remaining $n-1$ points are used for training.
**Advantages:**
- Maximum use of data.
- Useful when the dataset is very small.
**Disadvantages:**
- Computationally expensive for large datasets.
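scikit-learn provides LOOCV directly via `LeaveOneOut`; a minimal sketch on a toy four-sample dataset:

```python
from sklearn.model_selection import LeaveOneOut
import numpy as np

# Toy data: 4 samples
X = np.array([[1], [2], [3], [4]])

loo = LeaveOneOut()  # equivalent to KFold(n_splits=len(X))
for train_index, test_index in loo.split(X):
    print("Train:", train_index, "Test:", test_index)

print("Number of folds:", loo.get_n_splits(X))  # equals n
```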
#### **3.5 Nested Cross-Validation**
- Used for **hyperparameter tuning** and model evaluation simultaneously.
- Outer loop: Splits data into training and test sets.
- Inner loop: Performs K-Fold cross-validation to tune hyperparameters on the training set.
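One common way to realize this pattern in scikit-learn is to wrap a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop); the estimator and parameter grid below are illustrative choices, not prescribed by the course:

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Inner loop: 3-fold CV tunes the hyperparameter C on each training split
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: 5-fold CV evaluates the tuned model on held-out folds
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(grid, X, y, cv=outer_cv)
print("Nested CV accuracy:", scores.mean())
```

The outer score is an unbiased estimate of generalization performance because the test fold of each outer iteration is never seen during hyperparameter tuning.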
### **4. Cross-Validation in Practice**
---
#### **Example: K-Fold Cross-Validation on Regression Data**
1. **Dataset:** Predict house prices based on square footage.
2. **Goal:** Estimate model performance using K-Fold cross-validation.
**Python Implementation:**
```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np
# Simulated dataset: 10 samples, so each of the 5 test folds
# contains at least two points (R^2 is undefined on a single sample)
X = np.array([[1200], [1500], [1700], [2000], [2500],
              [1100], [1400], [1800], [2200], [2600]])
y = np.array([200, 240, 300, 350, 400,
              190, 230, 310, 360, 410])
# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
model = LinearRegression()
# Evaluate model
scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
print("R2 Scores:", scores)
print("Mean R2 Score:", np.mean(scores))
```
**Output Interpretation:**
- $R^2$ scores show how well the model performs across the folds.
- Averaging the scores provides a robust estimate of performance.
### **5. Advantages and Limitations of Cross-Validation**
---
| **Advantages** | **Limitations** |
|---------------------------------------------|-----------------------------------------|
| Reduces risk of overfitting/underfitting. | Computationally expensive for large datasets. |
| Uses the entire dataset for evaluation. | LOOCV is time-consuming for large $n$. |
| Provides a reliable performance estimate. | Requires careful implementation to avoid data leakage. |
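A common source of data leakage is fitting a preprocessing step (e.g. feature scaling) on the full dataset before splitting. One way to avoid this, sketched here with an illustrative scaler-plus-classifier pipeline, is to put preprocessing inside a `Pipeline` so it is refit within each training fold:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# The scaler is fit on each training fold only, so the validation
# fold never influences the preprocessing statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print("Accuracy per fold:", scores)
```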