# Clustering
## Introduction
Clustering is a powerful data analysis technique that aims to **group similar objects together**. Imagine it like sorting a basket of fruit: you'd put apples with apples, oranges with oranges, and bananas with bananas. Clustering does the same with data, finding natural groupings within a dataset where objects in the same group are more alike than those in other groups. This process helps us uncover hidden structures and patterns that would otherwise remain obscured.
To illustrate, consider the image below:

Here, we see a seemingly random scattering of points. But after applying a clustering algorithm, three distinct groups emerge, revealing an underlying structure. This is the essence of clustering: bringing order to chaos and extracting meaningful insights from data.
### Why is clustering so important?
Because it plays a crucial role in a wide range of applications across diverse fields:
* **News Analysis:** Clustering can group news articles by topic, allowing us to identify trending themes and analyze public sentiment. For example, during an election, clustering news articles and social media posts can reveal the key issues and how public opinion is shifting.
* **E-commerce:** By grouping customers with similar purchasing behaviors, businesses can personalize recommendations and tailor marketing campaigns. For instance, an online retailer might identify a cluster of customers who frequently purchase outdoor gear and then recommend related products like hiking boots or camping equipment.
* **Image Segmentation:** Clustering can partition an image into distinct regions, facilitating object recognition and image understanding. This is crucial in medical imaging, where clustering can help identify tumors or other abnormalities. For example, in a brain scan, clustering can differentiate between healthy tissue and areas affected by a stroke.
* **Anomaly Detection:** By identifying clusters of "normal" data, we can detect outliers or anomalies that deviate significantly from these patterns. This has vital applications in fraud detection, network security, and manufacturing quality control. For example, a credit card company can use clustering to identify unusual spending patterns that might indicate fraudulent activity.
These are just a few examples of how clustering helps us understand and interact with the world around us. Other applications include:
* **Gene expression analysis:** Clustering genes with similar expression patterns can help identify genes involved in the same biological processes.
* **Social network analysis:** Clustering can reveal communities or groups of individuals with strong connections within a social network.
* **Document analysis:** Clustering documents based on their content can help organize large corpora of text.
### Types of Clustering Approaches
Different clustering algorithms employ different strategies to group data. Here are some common approaches:
* **Center-based clustering:** This approach assumes clusters are defined by central points, with each data point assigned to the cluster with the nearest center. Think of it like grouping houses around central landmarks. Algorithms like k-means and k-medoids fall under this category.
* **Density-based clustering:** Here, clusters are viewed as regions of high data point concentration separated by areas of low density. Imagine identifying densely populated areas on a map. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular algorithm in this category.
* **Spectral clustering:** This technique utilizes the eigenvectors of a similarity matrix to group data points, enabling the discovery of clusters with complex shapes. It's like identifying groups of friends based on their mutual connections in a social network.
To summarize:
| Approach | Description | Characteristics | Example Algorithm |
|----------|-------------|-----------------|-------------------|
| Center-based | Groups data around central points | Simple, efficient, assumes spherical clusters | k-means, k-medoids |
| Density-based | Identifies clusters as dense regions | Can handle arbitrary shapes, robust to outliers | DBSCAN |
| Spectral clustering | Uses eigenvectors of a similarity matrix | Can detect complex shapes, computationally expensive | Spectral clustering |
## Center-based Clustering
Center-based clustering methods operate on the principle that clusters are defined by central points, with data points assigned to the cluster whose center is closest. This intuitive approach encompasses popular algorithms like k-means and k-center, each with its own objective function and characteristics.
### k-Means Clustering
k-Means clustering aims to partition data into $k$ clusters $\{C_j\}_{j=1}^k$ by minimizing the sum of squared distances between data points and their assigned cluster centers. This objective function, known as inertia, can be formally expressed as:
$$
\begin{split}
\text{k-means score}
&= \sum_{i=1}^n \min_{j=1, \dots, k} \|\mathbf{x}_i - \pmb{\mu}_j \|^2 \\
&= \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \pmb{\mu}_j \|^2
\end{split}
$$
where:
* $\mathbf{x}_i$ represents a data point.
* $\pmb{\mu}_j$ represents the center of cluster $j$.
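To make the objective concrete, here is a minimal NumPy sketch (not taken from the text) that evaluates the k-means score for a given set of candidate centers; the synthetic data and the name `kmeans_score` are illustrative.

```python
# A minimal sketch of evaluating the k-means score (inertia) with NumPy.
import numpy as np

def kmeans_score(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    # Pairwise squared Euclidean distances, shape (n_points, k).
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes the squared distance to its closest center.
    return sq_dists.min(axis=1).sum()

# Example: three well-separated blobs and their true means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
mu = np.array([[0, 0], [5, 5], [0, 5]], dtype=float)
print(kmeans_score(X, mu))
```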
**Key Idea:** Imagine you're trying to place $k$ warehouses to serve a set of customers. You want to minimize the sum of squared distances from each customer to their nearest warehouse. k-Means clustering essentially solves this problem by iteratively refining the warehouse locations (centroids).
**Assumptions:**
* **Euclidean Space:** Data points are assumed to lie in a Euclidean space $\mathbb{R}^d$, where distances are calculated using the standard Euclidean distance formula.
* **Spherical Clusters:** The algorithm implicitly assumes that clusters are roughly spherical and of similar size. This is because the objective function penalizes deviations from the centroid equally in all directions. This can be a limitation when dealing with clusters of varying shapes and sizes. Techniques like using different distance metrics or kernel k-means can help address this issue.
#### Why Squared Distances? A Probabilistic Perspective
The choice of minimizing squared distances might seem arbitrary, but it has a strong foundation in probability. Consider a scenario where your data is generated from a mixture of $k$ Gaussian distributions, each representing a cluster. If these Gaussian distributions are well-separated and have equal variance, then minimizing the sum of squared distances to cluster centers is equivalent to maximizing the likelihood of observing the given data.
Here's a more detailed explanation:
* **Gaussian Mixture Model:** Assume the data points are generated from a mixture of $k$ spherical Gaussian distributions with means $\pmb{\mu}_1, \pmb{\mu}_2, \dots, \pmb{\mu}_k$ and identity covariance matrices. The probability density function of this mixture model is $$\text{Prob}(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}} \frac1k \sum_{i=1}^k e^{-||\mathbf{x}-\pmb{\mu}_i||^2}$$
* **Likelihood:** Given a dataset $A = \{ \mathbf{a}_1, \mathbf{a}_2 \dots, \mathbf{a}_n\}$, the likelihood of observing this data under the Gaussian mixture model with centers $\pmb{\mu}_1, \pmb{\mu}_2, \dots, \pmb{\mu}_k$ is: $$L(A | \pmb{\mu}_1, \pmb{\mu}_2, \dots, \pmb{\mu}_k) = \prod_{i=1}^{n} \text{Prob}(\mathbf{a}_i) = \frac{1}{(2\pi)^{nd/2}} \frac1{k^n} \prod_{i=1}^n \sum_{j=1}^k e^{-||\mathbf{a}_i-\pmb{\mu}_j||^2}$$
* **Approximation:** Let $\pmb{\mu}(\mathbf{x})$ denote the center nearest to $\mathbf{x}$. Since the exponential function decays rapidly, we can approximate the likelihood by considering only the dominant term in the summation: $$\sum_{j=1}^k e^{-||\mathbf{a}_i-\pmb{\mu}_j||^2} \approx e^{-||\mathbf{a}_i-\pmb{\mu}(\mathbf{a}_i)||^2}$$ This leads to the following approximation for the likelihood: $$L(A | \pmb{\mu}_1, \pmb{\mu}_2, \dots, \pmb{\mu}_k) \approx \frac{1}{(2\pi)^{nd/2}} \frac1{k^n} \prod_{i=1}^n e^{-||\mathbf{a}_i-\pmb{\mu}(\mathbf{a}_i)||^2} = c e^{-\sum_{i=1}^n ||\mathbf{a}_i-\pmb{\mu}(\mathbf{a}_i)||^2}$$ where $c$ is a constant.
* **Maximizing Likelihood:** Maximizing the likelihood is equivalent to minimizing the negative log-likelihood: $$-\log L(A | \pmb{\mu}_1, \pmb{\mu}_2, \dots, \pmb{\mu}_k) \approx - \log c + \sum_{i=1}^n ||\mathbf{a}_i-\pmb{\mu}(\mathbf{a}_i)||^2$$ Therefore, minimizing the sum of squared distances to the nearest cluster centers effectively maximizes the likelihood of the data under this Gaussian mixture model.
#### Finding the Optimal Centroids
Given a set of data points and a fixed number of clusters $k$, how do we find the optimal centroids that minimize the k-means score? The answer lies in a key property of centroids:
**Lemma** For a set of points $\{\mathbf{a}_1, \mathbf{a}_2, \dots \mathbf{a}_n\}$, the sum of squared distances to any point $\mathbf{x}$ equals the sum of squared distances to the centroid $\mathbf{c}$ plus $n$ times the squared distance from $\mathbf{x}$ to $\mathbf{c}$:
$$
\sum_{i=1}^{n} ||\mathbf{a}_i - \mathbf{x}||^2 = \sum_{i=1}^{n} ||\mathbf{a}_i - \mathbf{c}||^2 + n ||\mathbf{c} - \mathbf{x}||^2
$$
where $\mathbf{c} = \frac1n \sum_{i=1}^n \mathbf{a}_i$ is the centroid.
<font color=red>Bonus proof</font>
**Corollary** The sum of squared distances of a set of points to a point is minimized when that point is the centroid.
This property implies that for a given cluster, the optimal centroid that minimizes the sum of squared distances within that cluster is simply the mean of the points in the cluster.
#### Lloyd's Algorithm
Lloyd's algorithm is the most common method for performing k-means clustering. It starts with an initial set of centroids and iteratively refines them until convergence.
##### Algorithm
1. **Initialize:** Choose $k$ initial centroids (discussed later).
2. **Assign:** Assign each data point $\mathbf{a}_i$ to its nearest centroid $\mathbf{c}_j$, i.e., assign $\mathbf{a}_i$ to cluster $C_j$ if
$$
||\mathbf{a}_i - \mathbf{c}_j||^2 \leq ||\mathbf{a}_i - \mathbf{c}_l||^2 \text{ for all } l \neq j
$$
3. **Update:** Recalculate the centroids as the mean of the data points assigned to each cluster:
$$
\mathbf{c}_j = \frac{1}{|C_j|} \sum_{\mathbf{a}_i \in C_j} \mathbf{a}_i
$$
4. **Repeat:** Repeat steps 2 and 3 until the centroids stabilize or a predefined stopping criterion is met (e.g., the k-means score doesn't improve significantly).
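The steps above translate almost directly into code. Below is a minimal NumPy sketch of Lloyd's algorithm; the random initialization, tolerance, and helper name `lloyd_kmeans` are illustrative choices, not a prescribed implementation.

```python
# A sketch of Lloyd's algorithm following the four steps above.
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centers, here simply by sampling k distinct points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center.
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when the centers stabilize.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```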
##### Convergence
Lloyd's algorithm is guaranteed to converge: the k-means score never increases from one iteration to the next, so the algorithm terminates at a local optimum. However, it may not find the globally optimal solution. The algorithm can be sensitive to the initial center selection and may get trapped in suboptimal solutions. Different initializations or running the algorithm multiple times can help mitigate this issue.

##### Initialization Strategies
* **Farthest Traversal:** Pick an initial center, then the farthest point from it, then the farthest from those two, and so on. This method is sensitive to outliers.
* **k-means++:** This approach probabilistically selects initial centroids based on their squared distance from existing centers, making it less sensitive to outliers. More specifically, the probability of choosing a point $\mathbf{a}_i$ as the next centroid is proportional to $D(\mathbf{a}_i)^2$, where $D(\mathbf{a}_i)$ is the distance from $\mathbf{a}_i$ to its nearest existing centroid.
* **Other Algorithms:** Use the output of another clustering algorithm as a starting point.
* **SVD-based:** Methods that utilize singular value decomposition (SVD) to find initial centroids.
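As a concrete illustration of the k-means++ seeding rule described above, here is a short sketch in which each new center is drawn with probability proportional to $D(\mathbf{a}_i)^2$; the function name and random seed are illustrative.

```python
# A sketch of k-means++ seeding: each new center is sampled with
# probability proportional to the squared distance to the nearest chosen center.
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        sq_d = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
                      axis=1)
        probs = sq_d / sq_d.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```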
#### Ward's Algorithm
Ward's algorithm is an agglomerative hierarchical clustering method whose merge criterion is the k-means cost, so it can serve as an alternative way to produce a k-means-style clustering. By cutting the dendrogram produced by Ward's algorithm at the appropriate level, we obtain $k$ clusters.
##### Algorithm
1. **Initialize:** Start with each data point as its own cluster.
2. **Merge:** For each pair of clusters $C_i$ and $C_j$, compute the increase in the k-means cost if they were merged: $$\Delta \text{Cost}(C_i, C_j) = \text{Cost}(C_i \cup C_j) - \text{Cost}(C_i) - \text{Cost}(C_j)$$ where $$\text{Cost}(C) = \sum_{\mathbf{a}_i \in C} ||\mathbf{a}_i - \mathbf{c}||^2$$ is the k-means cost of cluster $C$, and $\mathbf{c}$ is the centroid of $C$. Merge the pair of clusters with the smallest $\Delta \text{Cost}$.
3. **Repeat:** Continue merging until $k$ clusters remain.
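In practice, SciPy's hierarchical clustering module provides a Ward linkage; the sketch below (assuming SciPy is installed, with illustrative synthetic data) cuts the resulting dendrogram into $k$ clusters.

```python
# A sketch using SciPy's Ward linkage and cutting the dendrogram at k clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in ([0, 0], [4, 4], [0, 4])])

Z = linkage(X, method="ward")                     # merges that minimize the cost increase
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into k = 3 clusters
print(np.bincount(labels))
```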

#### k-Means Clustering on the Line
When data points lie on a line $\mathbb{R}^1$, k-means clustering can be solved optimally in polynomial time using dynamic programming.
##### Dynamic Programming Approach
1. **Sort:** Sort the data points in ascending order.
2. **Subproblems:** Define a subproblem as finding the optimal k-means clustering for the first $i$ points using $j$ clusters.
3. **Recurrence:** The optimal solution for a subproblem is computed by considering every possible starting index of the last cluster: if the $j$-th cluster consists of the sorted points $m, m+1, \dots, i$, the cost is the optimal cost for the first $m-1$ points with $j-1$ clusters plus the within-cluster cost of points $m$ through $i$; we take the minimum over all valid $m$.
4. **Base Case:** The base case is when $j=1$ or $i=j$.
5. **Solution:** The final solution is obtained by solving the subproblem for $i=n$ and $j=k$.
**Complexity:** The time complexity of this dynamic programming approach is $O(kn^2)$, provided the within-cluster cost of any contiguous segment is evaluated in constant time using precomputed prefix sums.
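Below is a sketch of this dynamic program; the function name `kmeans_1d` and the example data are illustrative, and the prefix sums implement the constant-time segment-cost evaluation mentioned above.

```python
# A sketch of the 1-D k-means dynamic program described above.
import numpy as np

def kmeans_1d(points, k):
    a = np.sort(np.asarray(points, dtype=float))
    n = len(a)
    p1 = np.concatenate([[0.0], np.cumsum(a)])       # prefix sums
    p2 = np.concatenate([[0.0], np.cumsum(a ** 2)])  # prefix sums of squares

    def seg_cost(i, j):
        """Sum of squared deviations of a[i..j] (inclusive) from its mean."""
        m = j - i + 1
        s1, s2 = p1[j + 1] - p1[i], p2[j + 1] - p2[i]
        return s2 - s1 * s1 / m

    INF = float("inf")
    # dp[j][i] = optimal cost of clustering a[0..i] into j clusters.
    dp = np.full((k + 1, n), INF)
    dp[1] = [seg_cost(0, i) for i in range(n)]
    for j in range(2, k + 1):
        for i in range(j - 1, n):
            dp[j][i] = min(dp[j - 1][m - 1] + seg_cost(m, i)
                           for m in range(j - 1, i + 1))
    return dp[k][n - 1]

print(kmeans_1d([1.0, 1.2, 0.9, 5.0, 5.1, 9.0, 9.2, 8.8], k=3))
```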
### k-Center Clustering
k-Center clustering is another technique for partitioning a dataset into k clusters. Unlike k-Means clustering, which aims to minimize the sum of squared distances between data points and their cluster centers, k-Center clustering seeks to minimize the maximum distance between any data point and its corresponding cluster center. Because a single distant point can dominate this maximum, k-Center is more sensitive to outliers than k-Means, but it provides a worst-case guarantee for every point.
**Goal:** Partition a set of points into k clusters such that the maximum distance of any point to its cluster center (the radius of the cluster) is minimized.
**Key Idea:** Think of it like this: you need to build $k$ fire stations to serve a city. You want to place them strategically so as to **minimize the maximum distance any house must travel** to reach a fire station. k-Center clustering aims to find the optimal placement of these fire stations.

#### Algorithm: Farthest Traversal
The most common algorithm for k-Center clustering is the farthest traversal algorithm. It's a simple greedy approach that iteratively selects the point furthest away from the current set of centers as the next center.
##### Algorithm
1. Initialize: Pick any data point as the first center.
2. Iterate: For $t=2,3,\dots,k$:
* Select the data point that has the largest distance to its nearest existing center.
* Make this point the $t$-th center.
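A minimal sketch of the farthest-first traversal described above; the random choice of the first center and the function name are illustrative.

```python
# A sketch of farthest-first traversal for k-center.
import numpy as np

def farthest_first_traversal(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]            # step 1: an arbitrary first center
    # Distance from every point to its nearest chosen center so far.
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                      # step 2: farthest point becomes a center
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[centers], d.max()                       # centers and the resulting radius
```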
#### Approximation Guarantee
The farthest traversal algorithm provides a 2-approximation for the k-Center problem. This means that the radius found by the algorithm is at most twice the optimal radius.
**Theorem** If there exists a k-clustering with radius $\frac{r}{2}$, then the farthest traversal algorithm finds a k-clustering with radius at most $r$.
**Proof (by contradiction):**
* Assume there is a k-clustering with radius $\frac{r}{2}$, but the farthest traversal algorithm produces a clustering with radius greater than $r$.
* This implies there exists a data point $x$ that is at a distance greater than $r$ from all the centers chosen by the algorithm.
* Since the algorithm always selects the farthest remaining point as the next center, each new center was, at the moment it was chosen, at least as far from the existing centers as $x$ is, and therefore at distance greater than $r$ from all previous centers.
* This implies we have $k+1$ points (the $k$ centers chosen by the algorithm plus the point $x$) that are pairwise more than distance $r$ apart.
* However, by the pigeonhole principle, two of these $k+1$ points must lie in the same cluster of the assumed k-clustering with radius $\frac{r}{2}$, and by the triangle inequality two points in the same cluster can be at most $r$ apart. This contradiction proves the theorem.
$\Box$
### Finding Low-Error Clusterings
While k-Means and k-Center focus on optimizing specific objective functions, in many real-world scenarios, we have a notion of a "correct" or "true" clustering. This true clustering might be based on some underlying ground truth or expert knowledge. In these cases, the goal is to find a clustering that is as close as possible to this true clustering.
This raises several questions:
* How do we measure the distance between two clusterings?
* What algorithms can we use to find low-error clusterings?
* What assumptions do we need to make about the data and the true clustering to guarantee the success of these algorithms?
The following sections will explore these questions and introduce techniques like spectral clustering and approximation stability that aim to find clusterings that are provably close to the true clustering under certain conditions.
## Density-based Clustering
While center-based clustering relies on the distance to central points, density-based clustering offers a different perspective. It defines clusters as **regions of high data point concentration separated by areas of low density**. This approach is particularly effective when dealing with clusters that exhibit complex shapes or varying densities, scenarios where center-based methods often struggle.
### Motivation
Think of stars clustered in a galaxy. They are not necessarily grouped around a single center but rather form dense clouds in the vastness of space. This analogy highlights the strength of density-based clustering in capturing:
* **Non-convex clusters:** Clusters with arbitrary shapes that cannot be easily represented by spheres or ellipsoids.
* **Varying densities:** Clusters with different densities within the same dataset.
To illustrate the difference, consider the following visualizations:

These examples demonstrate how density-based clustering can uncover hidden structures that center-based methods might miss.
### Single Linkage Clustering
Single linkage clustering is a classic algorithm for density-based clustering. It follows an agglomerative approach, starting with each data point as its own cluster and iteratively merging the closest pair of clusters.
#### Algorithm
1. **Initialization:** Begin with each data point as a separate cluster.
2. **Merge:** Repeatedly merge the two closest clusters, where the distance between two clusters is defined as the minimum distance between any two points in those clusters: $$d_{min}(C,C') = \min_{\mathbf{x}\in C, \ \mathbf{y} \in C'} d(\mathbf{x},\mathbf{y})$$
3. **Termination:** Continue merging until a stopping criterion is met (e.g., a desired number of clusters is reached).
#### Intuition
Imagine gradually "flooding" the data space. As the "water level" rises, clusters that are close together get connected first, forming larger clusters. This process continues until the desired number of "islands" (clusters) remain.

#### Connection to Minimum Spanning Trees
Interestingly, single linkage clustering has a close relationship with minimum spanning trees. It's equivalent to running Kruskal's minimum spanning tree algorithm and stopping when the desired number of connected components (clusters) is reached. This connection highlights the algorithm's focus on connecting nearby points and forming a "backbone" of the clusters.
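The following sketch (assuming SciPy is available) illustrates this MST view: build the minimum spanning tree of the complete distance graph, delete its $k-1$ heaviest edges, and read off the connected components as clusters.

```python
# A sketch of single linkage via the minimum spanning tree.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def single_linkage_via_mst(X, k):
    D = squareform(pdist(X))                              # pairwise Euclidean distances
    mst = minimum_spanning_tree(csr_matrix(D)).toarray()
    edge_weights = np.sort(mst[mst > 0])
    if k > 1:
        # Deleting the k-1 heaviest MST edges leaves k components
        # (ties at the threshold may remove a few extra edges).
        mst[mst >= edge_weights[-(k - 1)]] = 0
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels
```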
#### Theoretical Guarantee
Under certain conditions, single linkage clustering can perfectly recover the true underlying clusters. This happens when:
1. **Inter-cluster separation:** Points in different true clusters are sufficiently far apart.
2. **Intra-cluster connectivity:** Points within the same true cluster are sufficiently close together.
#### Sensitivity to Noise and Outliers
Despite its theoretical guarantees, single linkage clustering suffers from a significant drawback: it's **highly sensitive to noise and outliers**. Even a single outlier can act as a "bridge" between two otherwise well-separated clusters, leading to incorrect merging. This sensitivity motivates the need for more robust approaches.
### Robust Linkage Clustering
Robust linkage clustering aims to overcome the sensitivity of single linkage clustering to noise and outliers. These methods incorporate mechanisms to prevent spurious connections caused by noisy data points.
#### Wishart's Algorithm
Wishart's algorithm takes a density-based approach to achieve robustness.
##### Algorithm
1. **Balls around points:** Place a ball of radius $r$ around each data point.
2. **Increase radius:** Gradually increase the radius $r$ from $0$.
3. **Activate points:** A point becomes "active" when its ball contains at least $t$ other points ($t$ is a parameter).
4. **Connect active points:** Connect two active points with an edge when their balls intersect.
#### Intuition
The parameter $t$ acts as a density threshold. By requiring a minimum number of points within a ball before a point is considered "active," Wishart's algorithm avoids connecting clusters based on thin bridges of outliers or noise. The choice of $t$ depends on the expected density of the clusters and the level of noise in the data.
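A rough sketch of these steps for a single fixed radius $r$ (assuming SciPy is available); the treatment of inactive points as noise and all names are illustrative choices rather than the exact algorithm from the text.

```python
# A sketch of Wishart-style activation and connection at a fixed radius r.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def wishart_clusters(X, r, t):
    D = squareform(pdist(X))
    # A point is "active" if its ball of radius r contains at least t other points.
    neighbor_counts = (D <= r).sum(axis=1) - 1        # exclude the point itself
    active = neighbor_counts >= t
    # Connect two active points when their balls intersect, i.e. distance <= 2r.
    adj = (D <= 2 * r) & np.outer(active, active)
    np.fill_diagonal(adj, False)
    _, labels = connected_components(csr_matrix(adj), directed=False)
    labels[~active] = -1                              # mark inactive points as noise
    return labels
```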

#### Theoretical Guarantee
With an appropriate choice of $t$, a modified version of Wishart's algorithm can recover a nearly correct clustering under certain assumptions about the density of the true clusters.
## Spectral Clustering
Spectral clustering is a sophisticated technique that leverages the power of linear algebra to uncover hidden cluster structures within data. Unlike center-based methods that rely on distances to central points, spectral clustering analyzes the connectivity patterns between data points. This allows it to identify clusters of arbitrary shapes, even those that are non-convex or intertwined.
### Intuition
Imagine a social network where individuals are connected by "friendship" links. Spectral clustering can be viewed as a method for identifying communities within this network. It does this by analyzing the "vibrational modes" of the network, represented by the eigenvectors of a similarity matrix. These modes reveal how tightly knit different groups of individuals are, effectively uncovering the underlying community structure.
#### Advantages
* **Handles arbitrary shapes:** Can identify clusters of any shape, including those that are non-convex or elongated.
* **Effective in high dimensions:** Performs well even when data points reside in a high-dimensional space.
* **Robust to noise:** Can be more robust to noise and outliers compared to some other clustering methods.
### Steps Involved in Spectral Clustering
Spectral clustering typically involves the following key steps:
1. **Construct the Similarity Matrix $S$:** This matrix quantifies the pairwise similarities between data points. The choice of similarity measure depends on the nature of the data and the desired notion of similarity. Common choices include:
* **Gaussian kernel:** Measures similarity based on Euclidean distance: $$S_{ij} = \exp\left(-\frac{||\mathbf{x}_i - \mathbf{x}_j||^2}{2\sigma^2}\right)$$ where $\sigma$ controls the width of the Gaussian kernel.
* **k-nearest neighbors:** Connects each point to its k nearest neighbors, creating a sparse similarity matrix.
* **Other kernels:** Polynomial kernels, cosine similarity, etc., can be used depending on the application.

2. **Compute the Laplacian Matrix $L$:** The Laplacian matrix is derived from the similarity matrix and encodes the connectivity structure of the data. There are different variants, each with its own properties:
* **Unnormalized Laplacian:** $L = D - S$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j S_{ij}$.
* **Normalized Laplacian (symmetric):** $L_{sym} = I - D^{-1/2} S D^{-1/2}$.
* **Normalized Laplacian (random walk):** $L_{rw} = I - D^{-1} S$.
3. **Perform Eigenvalue Decomposition:** Compute the eigenvectors of the Laplacian matrix corresponding to its **smallest eigenvalues**. These eigenvectors capture the low-frequency "vibrational modes" of the data, which correspond to the underlying cluster structure.
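The following sketch constructs a Gaussian-kernel similarity matrix and the three Laplacian variants listed above; the bandwidth $\sigma$ and the function name are illustrative.

```python
# A sketch of the similarity matrix and the three Laplacian variants.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def laplacians(X, sigma=1.0):
    sq_d = squareform(pdist(X, metric="sqeuclidean"))
    S = np.exp(-sq_d / (2 * sigma ** 2))                    # Gaussian-kernel similarity
    d = S.sum(axis=1)                                       # degrees
    D = np.diag(d)
    L_unnorm = D - S                                        # L = D - S
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ S @ D_inv_sqrt    # I - D^{-1/2} S D^{-1/2}
    L_rw = np.eye(len(X)) - np.diag(1.0 / d) @ S            # I - D^{-1} S
    return L_unnorm, L_sym, L_rw
```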
### Why Project?
Projecting the data points onto a lower-dimensional subspace spanned by the eigenvectors of the Laplacian has a crucial effect: it **enhances cluster separability**. This is because the projection effectively "stretches" the distances between clusters while "shrinking" the distances within clusters.
#### Theorem
Let $X$ be an $n \times d$ data matrix and $X_k$ be the projection of the rows of $X$ onto the subspace spanned by the first $k$ eigenvectors of the normalized Laplacian matrix. For any matrix $C$ with rank less than or equal to $k$ (representing a k-clustering):
$$
||X_k - C||_F^2 \leq 8k ||X - C||_2^2
$$
where $||.||_F$ is the Frobenius norm and $||.||_2$ is the spectral norm.
<font color=red>Bonus proof</font>
#### Implication
This theorem shows that the projection $X_k$ is closer to any k-clustering $C$ than the original data matrix $X$ (up to a factor of $8k$). This means that if the cluster centers in the desired clustering $C$ are sufficiently separated, the projection will make it easier to distinguish between clusters.

### Algorithm and Analysis
Here's a more precise spectral clustering algorithm:
#### Algorithm
1. **Construct:** Compute the similarity matrix $S$ and the Laplacian matrix $L$ (e.g., the normalized Laplacian).
2. **Decompose:** Compute the first $k$ eigenvectors $v_1, v_2, \dots, v_k$ of $L$ corresponding to the $k$ smallest eigenvalues.
3. **Project:** Let $V$ be the $n \times k$ matrix whose columns are $v_1, v_2, \dots, v_k$. Normalize the rows of $V$ to have unit length.
4. **Cluster:** Apply k-means clustering to the rows of the normalized matrix $V$.
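Putting the four steps together, here is a compact sketch using NumPy and scikit-learn's `KMeans` for the final step; the Gaussian kernel, the bandwidth $\sigma$, and the symmetric normalized Laplacian are illustrative choices.

```python
# A sketch of the spectral clustering pipeline above.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: Gaussian-kernel similarity and normalized symmetric Laplacian.
    S = np.exp(-squareform(pdist(X, metric="sqeuclidean")) / (2 * sigma ** 2))
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ S @ D_inv_sqrt
    # Step 2: eigenvectors for the k smallest eigenvalues (eigh sorts ascending).
    _, eigvecs = np.linalg.eigh(L_sym)
    V = eigvecs[:, :k]
    # Step 3: normalize the rows of V to unit length.
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    # Step 4: k-means on the embedded rows.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)
```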
#### Theoretical Guarantee
Under certain conditions, spectral clustering can be shown to recover a clustering close to the true clustering. These conditions typically involve:
* **Separation:** The different clusters should be well-separated in the similarity graph.
* **Connectivity:** The points within each cluster should be well-connected.
* **Size:** Each cluster should have a sufficient number of points.
### Variations and Extensions
* **Different Laplacians:** Different Laplacian matrices (unnormalized, normalized symmetric, normalized random walk) can be used, each with its own properties and effects on the clustering results.
* **Local Spectral Clustering:** For identifying small, localized communities within a large network, techniques like random walks and early approximations of eigenvectors can be used. This is more computationally efficient than computing the full eigendecomposition.
* **Overlapping Communities:** To handle scenarios where data points may belong to multiple clusters, methods that find minimum 1-norm vectors in the eigenvector subspace can be employed.
* **Uncovering Hidden Structure:** Techniques that iteratively weaken the dominant structure and reapply spectral clustering can be used to uncover hidden clusters that might be masked by more prominent communities.
**Key Takeaway:** Spectral clustering is a versatile and powerful tool for uncovering complex cluster structures in data. Its ability to handle arbitrary shapes, high dimensionality, and noise makes it a valuable technique in various applications, including image segmentation, community detection in social networks, and bioinformatics.
## Kernel Methods
Kernel methods provide a powerful and elegant way to extend traditional clustering algorithms to handle **non-linearly separable data**. They achieve this by implicitly mapping data points into a higher-dimensional space where clusters may become linearly separable. This "kernel trick" allows us to leverage the simplicity and efficiency of linear methods like k-means while capturing complex relationships in the data.
### Motivation: Beyond Linear Separability
Many traditional clustering algorithms, such as k-means, implicitly assume that clusters can be separated by linear boundaries (lines, planes, hyperplanes). However, real-world data often exhibits more intricate patterns.
#### Examples of Non-linearly Separable Data
* **Concentric circles:** Imagine data points forming two concentric circles. No straight line can separate these clusters.
* **Spirals:** Consider data points forming two intertwined spirals. Again, a linear separator would fail to distinguish these clusters.
* **Complex shapes:** Clusters can take on arbitrary shapes, including crescents, curves, and intertwined structures, making linear separation impossible.
Kernel methods address this limitation by transforming the data into a higher-dimensional space where these non-linear relationships can be represented as linear ones.
### The Kernel Trick
The kernel trick is a clever technique that avoids the computational cost of explicitly mapping data to a higher-dimensional space. Instead, it directly computes the dot product of the transformed data points using a kernel function.
#### How it works
1. **Mapping to a higher-dimensional space:** Imagine a mapping function $\phi(\mathbf{x})$ that transforms a data point $\mathbf{x}$ to a higher-dimensional space.
2. **Dot product in the higher-dimensional space:** The dot product between two transformed points $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ captures their similarity in this space.
3. **Kernel function:** A kernel function $K(\mathbf{x},\mathbf{y})$ directly computes this dot product without explicitly calculating $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$: $$K(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y})$$
#### Benefits
* **Efficiency:** Avoids the computational burden of explicitly calculating the high-dimensional mapping.
* **Flexibility:** Allows for a wide range of similarity measures by choosing different kernel functions.
#### Popular Kernel Functions
* **Gaussian kernel:** $$K(\mathbf{x},\mathbf{y}) = \exp\left(- \frac{||\mathbf{x} - \mathbf{y}||^2}{2\sigma^2}\right)$$ Measures similarity based on Euclidean distance, with $\sigma$ controlling the width of the Gaussian.
* **Polynomial kernel:** $$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + c)^d$$ Creates polynomial boundaries in the higher-dimensional space, with $c$ and $d$ as parameters.
* **Sigmoid kernel:** $$K(\mathbf{x}, \mathbf{y}) = \tanh(\alpha (\mathbf{x} \cdot \mathbf{y}) + c)$$ Mimics neural networks with a sigmoid activation function.
### Kernel k-Means
Kernel k-means extends the traditional k-means algorithm to handle non-linearly separable data. It uses the kernel trick to implicitly cluster data points based on their relationships in the higher-dimensional space.
#### Algorithm
1. **Choose a kernel function:** Select a kernel function $K(\mathbf{x}, \mathbf{y})$ that captures the desired notion of similarity.
2. **Compute the kernel matrix:** Calculate the kernel matrix $K$ where $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$.
3. **Perform k-means:** Apply the k-means algorithm using the kernel matrix to compute distances between data points and cluster centers. The algorithm operates on the original data points but implicitly clusters them in the higher-dimensional space defined by the kernel.
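The key computational point is that squared distances to cluster means in feature space can be written entirely in terms of kernel entries: $\|\phi(\mathbf{x}_i) - \pmb{\mu}_C\|^2 = K_{ii} - \frac{2}{|C|}\sum_{j \in C} K_{ij} + \frac{1}{|C|^2}\sum_{j,l \in C} K_{jl}$. The sketch below uses this identity; the Gaussian kernel, random initialization, and function name are illustrative.

```python
# A sketch of kernel k-means: feature-space distances from kernel entries only.
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)                  # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            members = labels == c
            m = max(members.sum(), 1)
            # ||phi(x_i) - mu_c||^2 from kernel entries alone.
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example with a Gaussian kernel matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
sq_d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_d / (2 * 1.0 ** 2))
print(np.bincount(kernel_kmeans(K, k=2)))
```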
### Kernels for Affinity Matrices
Kernels can also be used to construct **affinity matrices** for graph-based clustering methods.
* **Affinity matrix:** An affinity matrix represents the pairwise similarities between data points.
* **Kernel functions as similarity measures:** A kernel function can be used to define the entries of the affinity matrix: $A_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$.
This approach combines the power of kernel methods with the flexibility of graph-based clustering.
**Key Takeaway:** Kernel methods provide a powerful and versatile framework for clustering data with non-linear structures. By implicitly mapping data points to a higher-dimensional space, they enable the use of linear clustering algorithms in non-linear settings. This opens up new possibilities for uncovering hidden patterns and relationships in complex datasets.
## Approximation Stability
In many real-world clustering scenarios, we have an underlying belief that a "true" clustering exists, even if we don't know it beforehand. This true clustering might represent some inherent structure or ground truth in the data. Approximation stability provides a framework for analyzing clustering algorithms in these situations by linking the quality of a clustering solution to its similarity to this true clustering.
### Core Idea and Motivation
Imagine you're clustering images of animals. You represent each image as a point in a high-dimensional space based on its features. While you don't have explicit labels for each animal, you assume a "correct" clustering exists where images of the same species are grouped together.
Now, suppose you use an algorithm like k-means to cluster these images. Ideally, the clustering with the optimal k-means score would perfectly match the true clustering. However, finding this optimal solution is often computationally hard (NP-hard).
Approximation stability relaxes this requirement. It suggests that any clustering with a k-means score **close to the optimal score** is also **reasonably similar to the true clustering**. This assumption implies a certain structure in the data: clusterings with similar objective function values are also similar in how they partition the data.
This stability property allows us to design algorithms that exploit this structure to efficiently find clusterings close to the true clustering, even if finding the absolute optimal solution is computationally intractable.
#### Intuitive Example
Think of a landscape with several valleys. Each valley represents a cluster, and the elevation at a point represents the clustering objective function value. Approximation stability implies that points with similar elevations are likely to be in the same valley (cluster).

### Formal Definitions
To formalize approximation stability, we need to define:
* **Clustering Distance:** A measure of similarity between two clusterings.
* **Approximation Stability:** A property of a dataset that relates the clustering objective function value to the distance from the true clustering.
#### Clustering Distance
Given two k-clusterings $\mathcal{C} = \{C_1, \dots, C_k\}$ and $\mathcal{C}' = \{C'_1, \dots, C'_k\}$ of a dataset $A$, the distance between them, denoted by $\text{dist}(\mathcal{C}, \mathcal{C}')$, is the minimum fraction of points that need to be reassigned in one clustering to make it identical to the other. Formally:
$$
\text{dist}(\mathcal{C}, \mathcal{C}') = \min_{\sigma} \frac{1}{n} \sum_{i=1}^k |C_i \backslash C'_{\sigma(i)}|
$$
where $\sigma$ is a permutation of $\{1, \dots, k\}$ representing the best possible matching between the clusters in $\mathcal{C}$ and $\mathcal{C}'$, and $|C_i \backslash C'_{\sigma(i)}|$ denotes the number of points in $C_i$ that are not in $C'_{\sigma(i)}$.
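Computing this distance amounts to finding the best matching between the two sets of cluster labels, which an assignment (Hungarian-style) solver can do; below is a short sketch using `scipy.optimize.linear_sum_assignment`, with illustrative label vectors.

```python
# A sketch of the clustering distance via a best matching of cluster labels.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_distance(labels_a, labels_b, k):
    n = len(labels_a)
    # overlap[i, j] = number of points in cluster i of A and cluster j of B.
    overlap = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    # Maximizing the total overlap over permutations sigma minimizes reassignments.
    rows, cols = linear_sum_assignment(-overlap)
    matched = overlap[rows, cols].sum()
    return 1.0 - matched / n          # fraction of points that must be reassigned

print(clustering_distance([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], k=3))  # 0.0
```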
#### Approximation Stability
A dataset satisfies $(c,\epsilon)$-approximation-stability for a clustering objective (e.g., k-means, k-median) if every clustering $\mathcal{C}'$ whose cost is within a factor $c$ of the cost of the minimum-cost clustering $\mathcal{C}^*$ is close to the true clustering $\mathcal{C}_T$, in the sense that $\text{dist}(\mathcal{C}', \mathcal{C}_T) \leq \epsilon$.
In simpler terms, any clustering with a near-optimal objective function value is also close to the true clustering $\mathcal{C}_T$.
### Algorithms and Analysis (k-Median Example)
Let's illustrate how approximation stability can be used algorithmically with the k-median objective. The k-median objective aims to find $k$ centers that minimize the sum of distances from each point to its nearest center.
**Goal:** Given a dataset that satisfies $(c,\epsilon)$-approximation-stability for k-median, efficiently find a clustering $\mathcal{C}'$ that is close to the true clustering $\mathcal{C}_T$, i.e., $\text{dist}(\mathcal{C}', \mathcal{C}_T) < \epsilon$.
#### Assumptions
* The true clustering $\mathcal{C}_T$ is the same as the optimal k-median clustering $\mathcal{C}^*$.
* Each cluster in $\mathcal{C}_T$ has at least $2 \epsilon n$ points.
#### Algorithm (k-Median Stability)
1. **Construct Graph:** Create a graph $G$ where each data point is a node. Connect two nodes if their distance is less than a "critical distance" $d_{crit}$. This $d_{crit}$ is carefully chosen based on $c$, $\epsilon$, and the average distance of points to their cluster centers in $\mathcal{C}^*$.
2. **Connect Dense Regions:** Create a new graph $H$ by connecting nodes in $G$ that share at least a $1 - O(\epsilon)$ fraction of their neighbors. This helps merge dense regions of "good" points within the same cluster.
3. **Identify Clusters:** The connected components in $H$ correspond to the clusters.
4. **Assign Remaining Points:** Assign any remaining points to the cluster with the nearest median.
#### Theoretical Analysis
**Theorem** If a dataset satisfies $(c,\epsilon)$-approximation-stability for k-median and each cluster in the true clustering has at least $2 \epsilon n$ points, then the above algorithm finds a clustering $\mathcal{C}'$ with $\text{dist}(\mathcal{C}', \mathcal{C}_T) < \epsilon$.
<font color=red>Work out the details and make it precise.</font>
**Key Takeaway:** Approximation stability offers a powerful lens for designing and analyzing clustering algorithms. By leveraging the stability properties of data, we can efficiently find clusterings that are provably close to the true clustering, even when finding the exact optimal solution is computationally hard. This approach has broad implications for various clustering problems where a notion of ground truth exists.
## Community Finding in Graphs
Graphs are powerful tools for representing relationships between objects, whether it's users in a social network, proteins interacting in a cell, or web pages linked across the internet. Within these networks, communities – groups of nodes with denser connections amongst themselves than with the rest of the network – often represent meaningful structures. Community finding aims to identify these cohesive subgroups, providing insights into the organization and function of the network.

### Motivation
* **Understanding social structures:** In social networks, communities might correspond to groups of friends, families, or colleagues.
* **Identifying functional modules:** In biological networks, communities could represent proteins involved in the same biological process.
* **Improving recommendations:** In e-commerce, communities might reveal groups of users with similar interests, leading to better product recommendations.
* **Detecting anomalies:** Unusual community structures could indicate anomalies or fraudulent activity in financial networks.
#### Challenges in Social Networks
* **Scale:** Social networks can be massive, with billions of users and trillions of connections, making computational efficiency crucial.
* **Overlapping communities:** Individuals often belong to multiple communities, leading to overlapping structures that are challenging to detect.
* **Hidden structure:** Dominant communities can mask smaller, less apparent communities, requiring specialized techniques to uncover them.
### Defining Density in Graphs
To identify communities, we need to quantify the "density" of connections within a group of nodes. Several measures capture different notions of density:
* **Average degree:** The average number of connections per node within a subgraph.
* **Advantages:** Simple to compute.
* **Limitations:** Favors larger subgraphs, potentially missing small, tightly-knit communities.
* **Conductance:** The ratio of "cut" edges to the total edges incident to the smaller of the two sets created by a cut.
* **Advantages:** Captures how well-knit a subgraph is with respect to the rest of the graph.
* **Limitations:** Finding the minimum conductance cut is computationally hard.
* **Triangle density:** The number of triangles (three nodes all connected to each other) within a subgraph.
* **Advantages:** Captures higher-order connectivity patterns.
* **Limitations:** Can be sensitive to noise and outliers.
* **Modularity:** Measures the difference between the observed connections within a community and the expected connections if edges were placed randomly.
* **Advantages:** Widely used in community detection.
* **Limitations:** Can suffer from a resolution limit, failing to identify small communities.
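To make one of these measures concrete: the conductance bullet above describes the idea informally, and a common convention normalizes the number of edges crossing a cut by the smaller of the two volumes (sums of degrees). The sketch below uses that convention with an illustrative example graph.

```python
# A sketch of the conductance of a cut (S, S-bar) in an undirected graph.
import numpy as np

def conductance(A, S_mask):
    """A: symmetric 0/1 adjacency matrix; S_mask: boolean membership in S."""
    degrees = A.sum(axis=1)
    cut = A[np.ix_(S_mask, ~S_mask)].sum()            # edges leaving S
    vol_S, vol_rest = degrees[S_mask].sum(), degrees[~S_mask].sum()
    return cut / min(vol_S, vol_rest)

# Two triangles joined by a single edge: the natural cut has low conductance.
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(conductance(A, np.array([True, True, True, False, False, False])))  # 1/7
```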
### Algorithms for Community Finding
Various algorithms address the community finding problem, each with its own strengths and weaknesses:
#### Flow Methods
Flow methods, typically used for finding maximum flow in networks, can be adapted to identify dense subgraphs.
* **Algorithm:**
1. Perform a binary search on a parameter $\lambda$, representing the target average degree.
2. Construct a flow network where nodes represent edges and vertices of the original graph, with capacities designed to reflect $\lambda$.
3. Compute the minimum cut in the flow network.
4. The nodes on the source side of the minimum cut form a dense subgraph.
* **Application:** Effective for finding large, dense subgraphs.
* **Limitations:** Biased towards larger subgraphs, may miss small communities.
#### Spectral Methods
Spectral clustering, as discussed earlier, can be adapted for community detection in social networks.
* **Algorithm:**
1. Construct the adjacency matrix or a similarity matrix representing the network.
2. Compute the eigenvectors of the Laplacian matrix corresponding to the smallest eigenvalues.
3. Project the data onto the subspace spanned by these eigenvectors.
4. Apply a traditional clustering algorithm (e.g., k-means) in the projected space.
* **Application:** Can handle arbitrary shapes and large datasets, effective for overlapping communities.
* **Adaptations for social networks:**
* Use singular vectors of the adjacency matrix to define a "social space."
* Employ techniques like minimum 1-norm vectors to handle overlapping communities.
* Utilize random walks and subspace propagation to find small communities efficiently.
#### Recursive Clustering Based on Sparse Cuts
This method recursively partitions the network by identifying and "cutting" sparse connections between groups.
* **Algorithm:**
1. Choose a sparsity threshold (e.g., conductance).
2. Find a cut with conductance below the threshold.
3. Split the graph along this cut.
4. Repeat recursively until no further splits are possible.
* **Spectral connection:** Cheeger's inequality connects conductance to the eigenvalues of the Laplacian matrix, enabling the use of spectral methods to find approximate sparse cuts.
* **Application:** Effective for finding communities with well-defined boundaries.
### Advanced Techniques for Social Networks
* **Overlapping communities:**
* Minimum 1-norm vectors: Allow nodes to belong to multiple communities with varying degrees of membership.
* Fuzzy clustering: Assign probabilities of belonging to different communities.
* **Finding small communities:**
* Random walks: Explore the network locally to identify small, tightly-knit groups.
* Subspace propagation: Efficiently expand a subspace to capture local community structure.
* **Uncovering hidden structure:**
* Iteratively weaken dominant communities to reveal hidden ones.
* Use higher-order connectivity patterns (e.g., triangle density) to identify communities with specific structural properties.
**Key Takeaway:** Community finding is a vital task in network analysis, with applications across various domains. By understanding the different density measures, algorithms, and advanced techniques, we can effectively uncover the hidden structures and organization within complex networks, leading to valuable insights and applications.
## Dense Submatrices and Communities
Data often takes the form of a matrix, where rows and columns represent different entities, and entries reflect relationships or interactions between them. Consider a social network represented as a matrix where rows are users, columns are groups, and entries indicate group memberships. Or envision a document-term matrix where rows are documents, columns are words, and entries denote word frequencies.
Within these matrices, **dense submatrices** – blocks with a high concentration of large values – often reveal underlying communities or hidden structures. In the social network example, a dense submatrix might indicate a group of users sharing similar interests, as they belong to the same set of groups. In the document-term matrix, a dense submatrix could signify a collection of documents covering a common theme, evidenced by their frequent use of the same words.
### Bipartite Graph
Analyzing dense submatrices can be aided by visualizing the matrix as a bipartite graph.
* **Two Sets of Nodes:** One set represents the rows, the other the columns of the matrix.
* **Edges as Relationships:** An edge connects a row node to a column node if the corresponding matrix entry is significant (e.g., above a certain threshold).

This representation provides an intuitive way to understand connections. Finding dense submatrices in the matrix is analogous to finding **dense subgraphs** in this bipartite graph – groups of nodes with many edges connecting them.
### Defining Density
To identify these dense subgraphs, we need to quantify "density."
* **Density of a Submatrix:** For a subset $S$ of rows and $T$ of columns, the density $d(S, T)$ is: $$d(S,T) = \frac{A(S,T)}{\sqrt{|S||T|}}$$ where $A(S, T)$ is the sum of entries in the submatrix formed by $S$ and $T$, and $|S|$ and $|T|$ denote the number of rows and columns, respectively. This definition normalizes the entry sum by the submatrix size, preventing larger submatrices from inherently having higher densities.
* **Density of the Matrix:** The density $d(A)$ of the entire matrix $A$ is the maximum density over all possible submatrices: $$d(A) = \max_{S,T} d(S,T)$$
#### Special Case: Similarity Matrices
When the matrix represents similarities between objects (rows and columns represent the same entities), the density measure simplifies to:
$$
d(S,S) = \frac{A(S,S)}{|S|}
$$
This represents the average similarity within subset $S$. For a graph's adjacency matrix, this equates to the average degree of the subgraph induced by $S$.
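Evaluating the density for a *given* pair of subsets is straightforward (the hard part, discussed next, is maximizing over all subsets); here is a small sketch with an illustrative matrix.

```python
# A sketch evaluating d(S, T) = A(S, T) / sqrt(|S| |T|) for chosen subsets.
import numpy as np

def density(A, S, T):
    sub = A[np.ix_(S, T)]
    return sub.sum() / np.sqrt(len(S) * len(T))

# A 6x6 0/1 matrix with a dense 3x3 block in the top-left corner.
A = np.zeros((6, 6))
A[:3, :3] = 1
print(density(A, S=[0, 1, 2], T=[0, 1, 2]))             # 9 / sqrt(9) = 3.0
print(density(A, S=list(range(6)), T=list(range(6))))   # 9 / 6 = 1.5
```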
### Algorithms and Approximations
Finding the densest submatrix (i.e., finding $d(A)$ exactly) is computationally challenging.
* **Exact Solutions:**
* **Maximum Average Degree Subgraph:** For similarity matrices, network flow techniques efficiently find the subgraph with maximum average degree.
* **Approximations:**
* **Singular Values:** The density $d(A)$ can be approximated using the matrix's largest singular value $\sigma_1(A)$. Specifically, $d(A)$ is within a factor of $O(\log^2 n)$ of $\sigma_1(A)$ (for entries between 0 and 1).
**Theorem** For a matrix $A$ with entries between 0 and 1:
$$
\sigma_1(A) \geq d(A) \geq \frac{\sigma_1(A)}{4 \log n \log d}
$$
This links matrix density to its spectral properties. Subsets $S$ and $T$ achieving the lower bound can be found using the top singular vector of $A$.
### Average Cohesion
**Average cohesion** offers an alternative density measure based on similarities (dot products) between rows:
* **Average Cohesion of a Subset:** For a subset $S$ of rows, the average cohesion $f(S)$ is: $$f(S) = \frac{\sum_{i,j \in S} \mathbf{a}_i \cdot \mathbf{a}_j}{|S|}$$ where $\mathbf{a}_i$ represents the $i$-th row of $A$.
* **Average Cohesion of the Matrix:** The average cohesion $f(A)$ of matrix $A$ is the maximum average cohesion over all row subsets: $$f(A) = \max_S f(S)$$
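Evaluating $f(S)$ for a given subset is simple, since $\sum_{i,j \in S} \mathbf{a}_i \cdot \mathbf{a}_j = \|\sum_{i \in S} \mathbf{a}_i\|^2$; the sketch below uses this identity with an illustrative matrix.

```python
# A sketch evaluating the average cohesion f(S) for a subset S of rows.
import numpy as np

def average_cohesion(A, S):
    s = A[S].sum(axis=0)                 # sum of the selected rows
    return float(s @ s) / len(S)         # ||sum of rows||^2 / |S|

A = np.zeros((6, 4))
A[:3, :2] = 1                            # three rows sharing the same two columns
print(average_cohesion(A, [0, 1, 2]))    # (3^2 + 3^2) / 3 = 6.0
```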
#### Relationship with Density and Singular Values
Average cohesion is related to both density and the top singular value:
**Lemma**
$$
d(A)^2 \leq f(A) \leq d(A) \log n \quad \text{and} \quad \sigma_1(A) \geq f(A) \geq \frac{c \sigma_1(A)^2}{\log n}
$$
where $c$ is a constant.
* Unlike the general density $d(A)$, $f(A)$ can be found exactly using flow techniques.
**Key Takeaway:** Identifying dense submatrices reveals hidden structures and communities within data. While finding the densest submatrix precisely can be computationally difficult, approximations using singular values and alternative measures like average cohesion provide effective solutions. These techniques are applied in diverse fields, including social network analysis, bioinformatics, and information retrieval, where uncovering hidden relationships and communities is crucial for understanding complex data.
## Reference
* Chapter 7 of Blum A, Hopcroft J, Kannan R. Foundations of Data Science. Cambridge University Press; 2020.
## Further Reading
### General
* James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. (See Chapter 12.4: Clustering Methods) [link 1](https://youtu.be/xhLD4nEEVlE?si=NemA5DqcTNMfW36t) [link 2](https://youtu.be/aygQSHAbMFQ?si=bYVq5qwgfUryoYea)
* Scikit-learn User Guide: 2.3 Clustering [link](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering)
### K-Means
* Arthur, D., & Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (SODA '07) (pp. 1027–1035). Society for Industrial and Applied Mathematics.
* Witten, D. M., & Tibshirani, R. (2010). A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, 105(490), 713–726. https://doi.org/10.1198/jasa.2010.tm09415
* Awasthi, P., & Balcan, M.-F. (2018). Center Based Clustering: A Foundational Perspective. Journal contribution, Carnegie Mellon University. https://doi.org/10.1184/R1/6475499.v1
### Hierarchical Clustering
* Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58(301), 236–244. https://doi.org/10.1080/01621459.1963.10500845
### DBSCAN
* Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96) (pp. 226–231). AAAI Press.
* Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems, 42(3), Article 19. https://doi.org/10.1145/3068335
### Spectral Clustering
* Ng, A., Jordan, M., & Weiss, Y. (2001). On Spectral Clustering: Analysis and an algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (Vol. 14). MIT Press. [link](https://papers.nips.cc/paper_files/paper/2001/hash/801272ee79cfde7fa5960571fee36b9b-Abstract.html)
* von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17, 395–416. https://doi.org/10.1007/s11222-007-9033-z
* Gallier, J. (2013). Notes on elementary spectral graph theory. applications to graph clustering using normalized cuts. arXiv preprint arXiv:1311.2492.
### Kernel Methods for Clustering
* Filippone, M., Camastra, F., Masulli, F., & Rovetta, S. (2008). A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1), 176–190. https://doi.org/10.1016/j.patcog.2007.05.018