# Unsupervised Learning
In machine learning, algorithms are broadly categorized into three types:
* **Supervised Learning:** Models learn from labeled data, using input-output pairs as a guide.
* **Semi-Supervised Learning:** Models use a blend of labeled and unlabeled data, particularly handy when labeled data is limited.
* **Unsupervised Learning:** The focus of this section, these models delve into unlabeled data to reveal hidden patterns, relationships, and structures without explicit targets.
## The Role of Unsupervised Learning
Unsupervised learning algorithms act like data detectives, sifting through unlabeled data to extract valuable insights. They shine in scenarios where:
* **Labeled data is scarce or costly:** Eliminating the need for laborious and potentially expensive labeling.
* **Exploratory analysis is key:** Unveiling inherent patterns and connections within the data itself.
### Key Goals
* **Data Visualization:** Transforming complex, high-dimensional data into visual representations that make patterns and trends easier to grasp.
* **Subgroup Discovery (Clustering):** Identifying natural clusters or groupings within the data based on shared characteristics, without any prior knowledge of group labels.
## Principal Component Analysis (PCA)
PCA is a cornerstone of dimensionality reduction. Let's break down how it works:
**Core Idea:** PCA identifies new directions (principal components) within the data. These directions are linear combinations of the original features that capture the maximum amount of variance in the dataset.
**Intuition:** Think of PCA as finding the axes that best explain the spread and relationships in your data. The first principal component points in the direction of greatest variance, the second in the direction of second-greatest variance (orthogonal to the first), and so on.
**Benefits:**
* **Compression:** Reduces the number of features while retaining crucial information.
* **Visualization:** Enables plotting high-dimensional data in a lower-dimensional space (e.g., 2D or 3D), aiding interpretation.
* **Noise Reduction:** Filtering out noise by focusing on the most significant patterns.
* **Improved Performance:** Can boost the performance of other machine learning algorithms by reducing the "curse of dimensionality."
### Algorithm
:::success
<font color=blue>Algorithm (Principal Component Analysis)</font>
**Input:**
* Data matrix `X` (`n x p`), where $n$ is the number of observations and $p$ is the number of features
* Number of principal components desired $k$
**Output:**
* Transformed data matrix `Z` (`n x k`)
* Principal component loadings (eigenvectors) `V` ($p \times k$)
* Variance explained by each principal component (eigenvalues)
**Procedure:**
1. **Center the data:**
* Subtract the mean of each column (feature) from the corresponding column in `X`.
* The resulting matrix `X_centered` has zero mean for each column.
2. **Calculate the covariance matrix (Optional):**
* Compute the covariance matrix `S` of `X_centered`: `S = (1/(n-1)) * X_centered^T * X_centered`
* If your data is very high-dimensional (large p), you might opt to directly compute the singular value decomposition (SVD) in the next step, as it can be more computationally efficient.
3. **Eigenvalue Decomposition (EVD) or Singular Value Decomposition (SVD):**
* EVD:
* Compute the eigendecomposition of the covariance matrix S: `S = V * Λ * V^T` where:
* `V` is the matrix of eigenvectors (principal component loadings)
* `Λ` is the diagonal matrix of eigenvalues (variance explained)
* SVD (Alternative):
* Compute the singular value decomposition of the centered data matrix X_centered: `X_centered = U * Σ * V^T` where:
* `U` is the matrix of left singular vectors
* `Σ` is the diagonal matrix of singular values (related to variance explained)
* `V` is the matrix of right singular vectors (principal component loadings)
* Note that `Λ = (1/(n-1)) * Σ^2`
4. **Select Principal Components:**
* Sort the eigenvalues (or singular values) in descending order.
* Select the top `k` eigenvectors (or right singular vectors) corresponding to the largest k eigenvalues (or singular values).
* These selected eigenvectors form the matrix `V` (p x k).
5. **Project Data onto Principal Components:**
* Compute the transformed data matrix `Z`: `Z = X_centered * V`
* The columns of `Z` represent the scores for each principal component.
:::
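To make the steps concrete, here is a minimal NumPy sketch of the SVD route (step 3): center `X`, keep the top `k` right singular vectors as loadings, and project. The function name `pca_svd` and the random demo data are illustrative, not part of any library.
```python=
import numpy as np

def pca_svd(X, k):
    """Minimal PCA via the SVD: returns scores Z (n x k), loadings V (p x k),
    and the variance explained by each retained component."""
    # Step 1: center the data
    X_centered = X - X.mean(axis=0)
    # Step 3 (SVD): X_centered = U * diag(S) * Vt
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Step 4: keep the top-k right singular vectors as loadings (p x k)
    V = Vt[:k].T
    # Eigenvalues of the covariance matrix: Lambda = S^2 / (n - 1)
    explained_variance = S[:k] ** 2 / (X.shape[0] - 1)
    # Step 5: project the centered data onto the principal components
    Z = X_centered @ V
    return Z, V, explained_variance

# Quick check on random data: 100 observations, 5 features, k = 2
rng = np.random.default_rng(0)
Z, V, var = pca_svd(rng.normal(size=(100, 5)), k=2)
print(Z.shape, V.shape, var)
```
The scikit-learn example later in this section performs the same projection through the `PCA` class.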
#### Key Points
* **Dimensionality Reduction:** PCA transforms the data into a lower-dimensional space (k dimensions) while retaining as much of the original variance as possible.
* **Interpretation:** Each principal component is a linear combination of the original features, and the loadings (elements of V) indicate the importance of each feature in each component.
* **Choice of k:** The number of principal components k is usually chosen to capture a desired amount of variance or based on a scree plot (a plot of eigenvalues in descending order).
* **Scaling:** If features have vastly different scales, consider standardizing (scaling to zero mean and unit variance) the data before applying PCA.
### Choosing the Number of Components
The **cumulative explained variance ratio** guides the selection of k (the number of principal components):
* Each principal component explains a portion of the total variance in the data.
* The cumulative variance plot shows how this explained variance accumulates as you include more components.
* Aim for a balance: Select enough components to capture a large portion of the variance (e.g., 90%, 95%) while achieving substantial dimensionality reduction; scikit-learn can make this selection automatically, as sketched below.
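Passing a float between 0 and 1 as `n_components` tells scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch, using the digits data and a 95% threshold as illustrative choices:
```python=
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Keep the smallest number of components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```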
### Example: Visualizing Handwritten Digits
The following Python code demonstrates PCA in action. It takes the 64-dimensional handwritten digits dataset from scikit-learn and reduces it to 2 dimensions for visualization:
```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the 64-dimensional data from the digits dataset
digits_data = load_digits().data
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(digits_data)
# Project the 64-dimensional data to 2-dimensional
pca_2d = PCA(n_components=2)
reduced_data = pca_2d.fit_transform(scaled_data)
# Plotting the Reduced Data
plt.figure(figsize=(8, 6))
plt.title('Digits data projected onto the first two principal components', fontsize=16)
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=10, alpha=0.5)
plt.xlabel('Principal Component 1', fontsize=14)
plt.ylabel('Principal Component 2', fontsize=14)
plt.show()
# Plot the cumulative explained variance ratio
pca = PCA().fit(scaled_data)
plt.figure(figsize=(8, 6))
plt.title('Cumulative Explained Variance Ratio', fontsize=16)
plt.xlabel('Number of Components', fontsize=14)
plt.ylabel('Cumulative Variance Ratio (%)', fontsize=14)
plt.ylim(80, 101)
plt.xlim(18, 66)
plt.grid(axis='x')
cumsum = np.cumsum(pca.explained_variance_ratio_ * 100)
plt.plot(np.arange(1, len(cumsum) + 1), cumsum, marker='.', linestyle='-', color='b')
variance_thresholds = [85, 90, 95, 99]
for threshold in variance_thresholds:
plt.axhline(y=threshold, color='r', linestyle='--', alpha=0.5)
num_components = np.argmax(cumsum >= threshold) + 1
plt.text(num_components + 1, threshold - 2, f'{threshold}% ({num_components} components)', color='r', fontsize=12)
plt.show()
```

## Missing Values and Matrix Completion
Missing data is a common hurdle in data analysis and machine learning. Knowing how to handle missing values is crucial for building robust and accurate models. Matrix completion offers an elegant solution for imputing (estimating) missing entries in your dataset.
### The Problem of Missing Values
* **Why they happen:** Missing values can arise from a variety of sources, including:
* **Data collection issues:** Equipment malfunctions, human errors, or incomplete surveys.
* **Data preprocessing:** Deliberate removal of outliers or erroneous data.
* **Privacy concerns:** Sensitive information may be withheld.
* **Impact on modeling:** Many machine learning algorithms cannot directly handle missing values. They can lead to biased estimates, reduced model performance, and incorrect conclusions.
### Basic Strategies
* **Deletion:** Removing rows or columns with missing data. This is a straightforward approach but should be used with caution, as it can lead to substantial information loss, especially if the missingness is not random.
* **Mean/Median Imputation:** Replacing missing values with the mean or median of the observed values for that feature. This is simple but can distort the distribution of the data and underestimate variability. (Both basic strategies are shown in the sketch below.)
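Both basic strategies are one-liners in practice; a minimal sketch using pandas and scikit-learn's `SimpleImputer` (the tiny DataFrame is made up purely for illustration):
```python=
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'x1': [1.0, np.nan, 7.0], 'x2': [2.0, 3.0, np.nan]})

# Deletion: drop every row that contains a missing value
print(df.dropna())

# Median imputation: replace each missing entry with its column's median
print(SimpleImputer(strategy='median').fit_transform(df))
```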
### Matrix Completion: A More Sophisticated Approach
Matrix completion leverages the inherent structure of the data to fill in missing entries. It is particularly effective when the underlying data matrix has a low-rank structure, meaning it can be represented by a smaller number of latent factors.
**Key Idea:** Imagine your data as a partially filled grid. Matrix completion aims to "complete" this grid by finding the values that best fit the observed patterns.
#### Algorithm
:::success
<font color=blue>Algorithm: Matrix Completion via PCA (Alternating Least Squares)</font>
**Input:**
* Incomplete data matrix `X` (`n x p`) with observed entries indexed by `O`
* Number of principal components `M`
**Output:** Completed data matrix `tilde_X`
**Procedure:**
1. **Initialization:**
* Fill in missing entries with column means:
$$\tilde{x}_{ij} = \begin{cases} x_{ij} &\text{if } (i, j) \in \mathcal{O} \\ \bar{x}_j &\text{if } (i, j) \notin \mathcal{O}, \end{cases}$$ where $\bar{x}_j$ is the average of observed values in column `j`.
2. **Iterative Refinement (Alternating Least Squares):**
* Repeat until convergence (objective function stops decreasing significantly):
* **PCA Step:**
* Compute the first M principal components of `tilde_X`.
* Let `hat_A` (n x M) be the matrix of principal component scores.
* Let `hat_B` (p x M) be the matrix of principal component loadings.
* **Imputation Step:**
* Update missing entries of `tilde_X` using the low-rank approximation: $$\tilde{x}_{ij} = \sum_{m=1}^M \hat{a}_{im} \hat{b}_{jm} \quad \text{for } (i, j) \notin \mathcal{O}$$
* **Evaluate Objective Function:**
* Calculate the sum of squared errors on the observed entries: $$\sum_{(i, j) \in \mathcal{O}} \left( \tilde{x}_{ij} - \sum_{m=1}^M \hat{a}_{im} \hat{b}_{jm} \right)^2$$
3. **Output:** Return the final completed matrix `tilde_X`.
:::
#### Key Points and Considerations:
* **Low-Rank Assumption:** PCA-based matrix completion assumes that the underlying complete matrix is low-rank (i.e., it can be well-approximated by a small number of principal components).
* **Alternating Least Squares (ALS):** The algorithm alternates between finding the principal components and imputing missing values. This iterative process refines the estimates of both the low-rank structure and the missing entries.
* **Convergence:** The algorithm typically converges to a local minimum of the objective function. Monitoring the decrease in the objective function can help determine when to stop iterating.
* **Choice of M:** The number of principal components M is a hyperparameter that balances model complexity and accuracy. Cross-validation can be used to select an appropriate value.
* **Regularization:** To prevent overfitting, you can add regularization terms (e.g., L2 regularization) to the PCA step.
* **Alternative Approaches:** There are other matrix completion techniques, such as nuclear norm minimization, that don't rely on PCA; a rough sketch follows this list.
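As a rough illustration of the nuclear-norm idea, the sketch below implements a soft-impute style iteration: fill in the missing entries, take an SVD, soft-threshold the singular values, and repeat. The shrinkage parameter `lam`, the iteration count, and the toy rank-1 matrix are illustrative choices, not tuned values.
```python=
import numpy as np

def soft_impute(X, lam=0.1, n_iter=100):
    """Soft-impute sketch: iteratively replace missing entries with a
    soft-thresholded (nuclear-norm shrunk) SVD reconstruction."""
    mask = ~np.isnan(X)
    X_hat = np.where(mask, X, 0.0)               # start with zeros in the missing slots
    for _ in range(n_iter):
        U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)
        S_shrunk = np.maximum(S - lam, 0.0)      # soft-threshold the singular values
        Z = (U * S_shrunk) @ Vt                  # shrunk low-rank reconstruction
        X_hat = np.where(mask, X, Z)             # keep observed entries, update missing ones
    return X_hat

# Toy example: a rank-1 matrix with roughly 20% of its entries removed
rng = np.random.default_rng(0)
A = np.outer(rng.normal(size=8), rng.normal(size=6))
A_missing = np.where(rng.random(A.shape) < 0.2, np.nan, A)
print(np.max(np.abs(soft_impute(A_missing) - A)))  # worst-case reconstruction error
```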
### Real-World Application: Netflix and the Challenge of Missing Ratings
One of the most famous examples of matrix completion in action is the Netflix Prize challenge. Netflix, the streaming giant, faced a massive matrix completion problem with its movie rating data.
* **The Matrix:** Each row represented a customer, and each column represented a movie. Each cell in the matrix held a customer's rating for a movie (if they had watched and rated it). However, most cells were empty, as the average customer had only rated a tiny fraction of Netflix's vast movie library.
* **The Goal:** Netflix wanted to accurately predict the missing ratings to provide better movie recommendations to its users.
* **The Solution:** Matrix completion techniques, such as those based on PCA, played a key role in filling in those missing ratings, dramatically improving recommendation accuracy.
### Example: Imputing Missing Values in USArrests Data with PCA
Let's illustrate matrix completion using a smaller dataset: the `USArrests` data, which contains crime statistics for U.S. states. We'll artificially introduce missing values and then use PCA-based matrix completion to impute them. The block below uses scikit-learn's `PCA` in a compact loop; the spoiler that follows implements the full algorithm, with an explicit convergence check, using the SVD directly.
```python=
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.datasets import get_rdataset
# Load and standardize the USArrests data
USArrests = get_rdataset('USArrests').data
scaler = StandardScaler()
USArrests_scaled = scaler.fit_transform(USArrests)
# Introduce missing values randomly
np.random.seed(15)
n_omit = 20
missing_rows = np.random.choice(np.arange(USArrests_scaled.shape[0]), n_omit, replace=False)
missing_cols = np.random.choice(np.arange(USArrests_scaled.shape[1]), n_omit, replace=True)
USArrests_missing = USArrests_scaled.copy()
USArrests_missing[missing_rows, missing_cols] = np.nan
# Perform matrix completion (simplified example):
# scikit-learn's PCA cannot handle NaNs, so start from mean-imputed values
# and alternate between a rank-1 PCA fit and re-imputing the missing entries.
is_missing = np.isnan(USArrests_missing)
col_means = np.nanmean(USArrests_missing, axis=0)
USArrests_completed = np.where(is_missing, col_means, USArrests_missing)
pca = PCA(n_components=1)
for _ in range(100):
    scores = pca.fit_transform(USArrests_completed)
    approximation = pca.inverse_transform(scores)
    USArrests_completed[is_missing] = approximation[is_missing]
# Evaluate the imputation (correlation between imputed and actual)
correlation = np.corrcoef(USArrests_completed[is_missing], USArrests_scaled[is_missing])[0, 1]
print(f"Correlation between imputed and actual values: {correlation:.3f}")
```
:::spoiler Full iterative implementation (Algorithm above)
```python=
# Perform matrix completion (full iterative algorithm)
def low_rank(X, M=1):
    """Rank-M approximation of X via the SVD (principal component reconstruction)."""
    U, D, V = np.linalg.svd(X)
    L = U[:, :M] * D[None, :M]
    return L.dot(V[:M])

# Step 1: initialize missing entries with the column means of the observed values
USArrests_hat = USArrests_missing.copy()
USArrests_bar = np.nanmean(USArrests_hat, axis=0)
USArrests_hat[missing_rows, missing_cols] = USArrests_bar[missing_cols]

thresh = 1e-7
rel_err = 1
count = 0
is_missing = np.isnan(USArrests_missing)
mss_old = np.mean(USArrests_hat[~is_missing] ** 2)
mss0 = np.mean(USArrests_missing[~is_missing] ** 2)

while rel_err > thresh:
    count += 1
    # Step 2(a): compute the rank-1 approximation of the current completed matrix
    USArrests_approx = low_rank(USArrests_hat, M=1)
    # Step 2(b): update the missing entries from the approximation
    USArrests_hat[is_missing] = USArrests_approx[is_missing]
    # Step 2(c): measure the fit on the observed entries and the relative improvement
    mss = np.mean(((USArrests_missing - USArrests_approx)[~is_missing]) ** 2)
    rel_err = (mss_old - mss) / mss0
    mss_old = mss
    print("Iteration: {0}, MSS: {1:.3f}, Rel.Err: {2:.2e}".format(count, mss, rel_err))
```
:::
For reference, the raw (unstandardized) `USArrests` data:

|State|Murder|Assault|UrbanPop|Rape|
|---|---|---|---|---|
|Alabama|13.2|236|58|21.2|
|Alaska|10.0|263|48|44.5|
|Arizona|8.1|294|80|31.0|
|Arkansas|8.8|190|50|19.5|
|California|9.0|276|91|40.6|
|Colorado|7.9|204|78|38.7|
|Connecticut|3.3|110|77|11.1|
|Delaware|5.9|238|72|15.8|
|Florida|15.4|335|80|31.9|
|Georgia|17.4|211|60|25.8|
|Hawaii|5.3|46|83|20.2|
|Idaho|2.6|120|54|14.2|
|Illinois|10.4|249|83|24.0|
|Indiana|7.2|113|65|21.0|
|Iowa|2.2|56|57|11.3|
|Kansas|6.0|115|66|18.0|
|Kentucky|9.7|109|52|16.3|
|Louisiana|15.4|249|66|22.2|
|Maine|2.1|83|51|7.8|
|Maryland|11.3|300|67|27.8|
|Massachusetts|4.4|149|85|16.3|
|Michigan|12.1|255|74|35.1|
|Minnesota|2.7|72|66|14.9|
|Mississippi|16.1|259|44|17.1|
|Missouri|9.0|178|70|28.2|
|Montana|6.0|109|53|16.4|
|Nebraska|4.3|102|62|16.5|
|Nevada|12.2|252|81|46.0|
|New Hampshire|2.1|57|56|9.5|
|New Jersey|7.4|159|89|18.8|
|New Mexico|11.4|285|70|32.1|
|New York|11.1|254|86|26.1|
|North Carolina|13.0|337|45|16.1|
|North Dakota|0.8|45|44|7.3|
|Ohio|7.3|120|75|21.4|
|Oklahoma|6.6|151|68|20.0|
|Oregon|4.9|159|67|29.3|
|Pennsylvania|6.3|106|72|14.9|
|Rhode Island|3.4|174|87|8.3|
|South Carolina|14.4|279|48|22.5|
|South Dakota|3.8|86|45|12.8|
|Tennessee|13.2|188|59|26.9|
|Texas|12.7|201|80|25.5|
|Utah|3.2|120|80|22.9|
|Vermont|2.2|48|32|11.2|
|Virginia|8.5|156|63|20.7|
|Washington|4.0|145|73|26.2|
|West Virginia|5.7|81|39|9.3|
|Wisconsin|2.6|53|66|10.8|
|Wyoming|6.8|161|60|15.6|
Correlation between imputed and actual values: 0.711
## Clustering Methods
Clustering is a cornerstone of unsupervised machine learning, designed to reveal hidden structure within datasets that lack explicit labels. Its applications span diverse domains such as customer segmentation, image analysis, and anomaly detection. Let's delve into some of the most common clustering algorithms:
### K-means Clustering
K-means clustering aims to partition a dataset into a predefined number of groups K by minimizing the variance within each cluster. The algorithm iteratively assigns points to clusters and recalculates cluster centers until convergence.
#### Algorithm
:::success
<font color=blue>Algorithm: K-Means Clustering</font>
**Input:**
* Data matrix `X` (`n x p`), where `n` is the number of observations and p is the number of features
* Number of clusters `K`
**Output:**
* Cluster assignments for each observation (`1` to `K`)
* Cluster centroids (`K x p`)
**Procedure:**
1. **Initialization:**
* Randomly choose `K` points from the data as initial cluster centroids.
2. **Iterate until convergence:**
* **Assignment Step:** Assign each observation to the cluster whose centroid is closest (using Euclidean distance or other distance metric).
* **Update Step:** Recalculate the centroids of each cluster by taking the mean of all observations assigned to that cluster.
* **Convergence Check:** Stop iterating if the cluster assignments or centroids do not change significantly between iterations.
:::
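The assignment and update steps map directly onto a few lines of NumPy. Below is a minimal sketch of the procedure above, using a single random initialization and no restarts; the function name and the toy blob data are illustrative.
```python=
import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct observations as the starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # Convergence check: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on toy data
X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.7, random_state=0)
labels, centroids = kmeans(X, K=3)
print(centroids)
```
scikit-learn's `KMeans`, used in the example at the end of this section, adds k-means++ initialization and multiple restarts (`n_init`) on top of this basic loop.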
#### Key Points and Considerations:
* **Random Initialization:** The initial choice of centroids can affect the final clustering results. It's common to run the algorithm multiple times with different random initializations and keep the solution with the lowest total within-cluster variation $\sum_i \text{WCV}(C_i)$, where $$\text{WCV}(C_i) = \frac{1}{\#C_i} \sum_{x, y \, \in C_i} d(x, y)^2$$
* **Distance Metric:** Euclidean distance is commonly used, but other distance metrics like Manhattan distance or cosine similarity might be more suitable depending on the data.
* **Convergence:** K-means is not guaranteed to find the global optimum. It may converge to different local optima depending on the initialization.
* **Choosing K:** The number of clusters K is a hyperparameter that needs to be chosen carefully. Techniques like the elbow method or silhouette analysis can help determine a suitable value.
* **Assumptions:** K-means assumes that clusters are spherical and of equal size. It may not perform well if these assumptions are violated.
#### Choosing the Optimal Number of Clusters $K$
Determining the right $K$ is crucial. Two popular methods are:
* **Elbow Method:**
* Plot the total WCV against different values of $K$.
* Look for the "elbow" point (of maximum curvature) where adding more clusters yields diminishing returns in reducing the WCV.

* **Silhouette Analysis:**
* Calculates how well each point fits within its assigned cluster compared to its neighboring clusters.
* Choose $K$ that maximizes the average [Silhouette Score](https://en.wikipedia.org/wiki/Silhouette_(clustering)).
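Both methods are easy to script with scikit-learn; a minimal sketch in which the blob data and the candidate range of $K$ are illustrative:
```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                       # total within-cluster sum of squares
    silhouettes.append(silhouette_score(X, km.labels_))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(list(ks), inertias, marker='o')
axes[0].set(title='Elbow Method', xlabel='K', ylabel='Within-cluster sum of squares')
axes[1].plot(list(ks), silhouettes, marker='o')
axes[1].set(title='Silhouette Analysis', xlabel='K', ylabel='Average silhouette score')
plt.show()

print('Best K by average silhouette:', list(ks)[int(np.argmax(silhouettes))])
```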

#### Strengths and Considerations
* **Strengths:**
* Simple to understand and implement.
* Computationally efficient.
* Works well with spherical, well-separated clusters.
* **Considerations:**
* Requires pre-specifying $K$.
* Sensitive to the initial choice of centroids.
* Assumes spherical clusters and may not perform well with complex shapes.
### Hierarchical Clustering
Hierarchical clustering creates a tree-like structure (dendrogram) that visually represents the relationships between data points. It doesn't require a pre-specified number of clusters.
#### Algorithm
:::success
<font color=blue>Algorithm: Hierarchical Clustering (Agglomerative) </font>
**Input:**
* Data matrix `X` (`n x p`), where `n` is the number of observations and `p` is the number of features
* Distance metric (e.g., Euclidean distance)
* Linkage criterion (e.g., complete, single, average, ward)
**Output:** Dendrogram (tree-like structure representing the hierarchy of clusters)
**Procedure:**
1. **Initialization:**
* Start with `n` clusters, each containing a single data point.
* Calculate pairwise distances between all clusters using the chosen distance metric.
2. **Iterative Merging:**
* Repeat until only one cluster remains:
* **Find Closest Clusters:** Identify the two clusters with the smallest distance according to the linkage criterion.
* **Merge Clusters:** Combine the two closest clusters into a single cluster.
* **Update Distances:** Update the distance matrix by calculating distances between the new cluster and the remaining clusters using the chosen linkage criterion.
3. **Output Dendrogram:**
* Construct a dendrogram to visualize the hierarchical clustering results.
* The height of each merge in the dendrogram corresponds to the distance at which the merge occurred.
:::
#### Key Points
* **Agglomerative vs. Divisive:** This pseudocode outlines the agglomerative approach, where you start with individual points and iteratively merge. The divisive approach (less common) starts with one cluster and recursively splits it.
* **Linkage Criterion:** The choice of linkage criterion significantly impacts the resulting cluster hierarchy.
* **Distance Metric:** Euclidean distance is often used, but other metrics like Manhattan distance or correlation-based distance can be appropriate depending on the data.
* **Dendrogram Interpretation:** The dendrogram is a tree-like structure where the leaves represent the individual data points, and the branches represent the merges. The height of the merge indicates the dissimilarity between the merged clusters. 
#### Linkage Methods
* **Complete Linkage:** Maximum distance between points in two clusters. $$\max_{x \in C_i,\,y \in C_j}d(x, y)$$
* **Single Linkage:** Minimum distance between points in two clusters. $$\min_{x \in C_i,\,y \in C_j}d(x, y)$$
* **Average Linkage:** Average of all pairwise distances between points in two clusters. $$\frac{1}{\#C_i \cdot \#C_j}\sum_{x \in C_i,\,y \in C_j}d(x, y)$$
* **Centroid Linkage:** Distance between the centroids (mean points) of two clusters. $$d(\mu_{C_i}, \mu_{C_j})$$
* **Ward's Method:** Minimizes the increase in within-cluster variance caused by the merge. $$\sum_{x \in C_i \cup C_j}d(x, \mu_{C_i \cup C_j})^2 \;-\; \sum_{x \in C_i}d(x, \mu_{C_i})^2 \;-\; \sum_{x \in C_j}d(x, \mu_{C_j})^2$$
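In SciPy, the linkage criterion is simply the `method` argument of `linkage`. A short sketch that builds a hierarchy with each criterion on the same toy data and cuts each tree into two clusters (the blob data are illustrative):
```python=
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=2, cluster_std=0.8, random_state=0)

for method in ['complete', 'single', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)                     # hierarchy under this linkage criterion
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into two clusters
    print(f'{method:>8}: cluster sizes {np.bincount(labels)[1:]}')
```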
#### Strengths and Considerations
* **Strengths:**
* No need to pre-specify the number of clusters.
* Provides a visual representation of cluster hierarchy.
* Can accommodate various cluster shapes.
* **Considerations:**
* Computationally expensive for large datasets.
* Choice of linkage method can significantly impact results.
### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based algorithm that excels at finding clusters of arbitrary shapes and identifying outliers (noise points).
#### Algorithm
:::success
<font color=blue>Algorithm (DBSCAN)</font>
**Input:**
* Data matrix `X` (n x p), where n is the number of observations and p is the number of features
* `epsilon`: Radius of the neighborhood around a point
* `MinPts`: Minimum number of points required to form a dense region
**Output:** Cluster assignments for each data point (cluster label or "noise")
**Procedure:**
1. **Initialization:**
* Mark all points as unvisited.
* Set `cluster_label = 0` (initial cluster label)
2. **Iterate over unvisited points:**
* For each unvisited point `p`:
* Mark `p` as visited.
* Find neighbors of `p` within `epsilon` distance (neighbors).
* If `len(neighbors) < MinPts`:
* Mark `p` as noise.
* Else:
* Increment `cluster_label`.
* Assign `p` to `cluster_label`.
* Expand the cluster:
* Initialize a queue with `neighbors`.
* While the queue is not empty:
* Pop a point `q` from the queue.
* If `q` is unvisited:
* Mark `q` as visited.
* Find neighbors of `q` within `epsilon` distance (`q_neighbors`).
* If `len(q_neighbors) >= MinPts`:
* Add `q_neighbors` to the `queue`.
* Assign `q` to `cluster_label`.
3. **Output:** Return the cluster assignments for all points.
:::
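A compact NumPy translation of this pseudocode is sketched below. It uses a brute-force neighbor search, so it is only intended for small datasets; the label `0` marks noise, and a point first flagged as noise is claimed by a cluster if it later turns out to be a border point.
```python=
import numpy as np
from sklearn.datasets import make_blobs

def dbscan(X, epsilon, min_pts):
    """Minimal DBSCAN: returns an integer label per point (0 = noise)."""
    n = len(X)
    labels = np.zeros(n, dtype=int)       # 0 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    # Brute-force pairwise distances and epsilon-neighborhoods (each point is its own neighbor)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= epsilon) for i in range(n)]
    cluster_label = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                      # p stays labeled as noise for now
        cluster_label += 1                # p is a core point: start a new cluster
        labels[p] = cluster_label
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])   # q is also a core point: keep expanding
            if labels[q] == 0:
                labels[q] = cluster_label        # claim unassigned (or noise) points
    return labels

# Example usage on toy data
X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)
print(np.unique(dbscan(X, epsilon=0.5, min_pts=5), return_counts=True))
```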
#### Key Points and Considerations:
* **Density-Based:** DBSCAN groups points that are closely packed together (within `epsilon` distance) and have a sufficient number of neighbors `MinPts`.
* **Arbitrary Shapes:** DBSCAN can discover clusters of arbitrary shapes, unlike K-means, which assumes spherical clusters.
* **Noise Handling:** It automatically identifies and labels points that do not belong to any dense cluster as noise.
* **Hyperparameters:** The choice of `epsilon` and `MinPts` is crucial and often requires domain knowledge or careful tuning.
* **`epsilon`:** A larger `epsilon` leads to larger and fewer clusters. A smaller `epsilon` might result in many small clusters or more noise points.
* **`MinPts`:** A higher `MinPts` requires denser regions for cluster formation.
* **Distance Metric:** Euclidean distance is commonly used, but other metrics can be applied depending on the nature of the data.
* **Computational Complexity:** The worst-case complexity can be high, but efficient implementations (e.g., using spatial indexing) can improve performance.
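A common heuristic for picking `epsilon` is the k-distance plot: sort every point's distance to its `MinPts`-th nearest neighbor and look for the knee of the curve. A minimal sketch with scikit-learn, where the blob data and `MinPts = 5` are illustrative:
```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

min_pts = 5
# kneighbors counts each point as its own first neighbor, so the last column is
# (roughly) the distance to the MinPts-th nearest neighbor
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {min_pts}-th nearest neighbor')
plt.title('k-distance plot: choose epsilon near the knee')
plt.show()
```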
### Example: Clustering with Simulated Data
To solidify our understanding of clustering algorithms, let's apply them to a simple simulated dataset with two clear groups of points.
```python=
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
# Simulate data (two well-separated clusters)
np.random.seed(0)
X, _ = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)
```
#### K-Means Clustering
```python=+
# Perform K-means clustering with K = 2 and K = 3
kmeans_2 = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans_3 = KMeans(n_clusters=3, random_state=0).fit(X)
# Plot the results
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_2.labels_, s=50, cmap='viridis')
axes[0].set_title('K-Means (K=2)')
axes[1].scatter(X[:, 0], X[:, 1], c=kmeans_3.labels_, s=50, cmap='viridis')
axes[1].set_title('K-Means (K=3)')
plt.show()
```

#### Hierarchical Clustering
```python=+
# Perform hierarchical clustering with complete linkage
# (build the full merge tree; the fitted labels themselves are not used below)
agg_cluster = AgglomerativeClustering(n_clusters=None, distance_threshold=0, linkage='complete')
agg_cluster.fit(X)
# Create the dendrogram with SciPy's linkage/dendrogram utilities
from scipy.cluster.hierarchy import linkage, dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linkage(X, method='complete'))
plt.title('Hierarchical Clustering Dendrogram (Complete Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
```

#### DBSCAN Clustering
```python=+
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, s=50, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()
```

## Reference
* **Lecture Notes & Resources**
* Principal Component Analysis:
* [Sanjay Lall and Stephen Boyd, EE104, Stanford University](https://ee104.stanford.edu/lectures/pca.pdf)
* Unsupervised Learning:
* [Sanjay Lall and Stephen Boyd, EE104, Stanford University](https://ee104.stanford.edu/lectures/unsupervised.pdf)
* **Academic Papers & Articles**
* Missing Data:
* [A Review of Methods for Missing Data, Therese D. Pigott](https://doi.org/10.1076/edre.7.4.353.8937)
* Clustering Evaluation:
* [Silhouettes: A graphical aid to the interpretation and validation of cluster analysis](https://doi.org/10.1016/0377-0427(87)90125-7)
* [Estimating the Number of Clusters in a Data Set Via the Gap Statistic, Tibshirani et al.](https://doi.org/10.1111/1467-9868.00293)
* [A dendrite method for cluster analysis, Caliński & Harabasz](https://doi.org/10.1080/03610927408827101)
* **Online Resources & Tutorials**
* [DBSCAN: Density-Based Clustering Essentials](https://www.datanovia.com/en/lessons/dbscan-density-based-clustering-essentials/)
* [Determining The Optimal Number Of Clusters: 3 Must Know Methods](https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/#average-silhouette-method)
* [Gap Statistic (Python Implementation)](https://github.com/milesgranger/gap_statistic)
* **Textbook**
    * Chapter 12 of [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)