# **Introduction to Spectral Clustering**

Clustering helps us find natural groups hidden inside data. In a university dataset, for example, clustering might uncover groups such as:

* *High-achieving engineering students*
* *Artistic international students*
* *Pre-health students with strong volunteer backgrounds*

Classical methods like **k-means** assume that clusters are “round” and separable in the original feature space. But real data often forms complex shapes like curved structures, graphs, or “two moons”-shaped manifolds.

This is why **Spectral Clustering** is powerful. It converts your data into a graph, analyzes that graph using linear algebra, and uses **eigenvectors** of a special matrix to reveal natural partitions.

This article explains spectral clustering step by step, connects it to class concepts, and ends with a *full numerical example we can compute by hand*.

---

# **What Spectral Clustering Does**

It builds a similarity graph from your data, computes the graph Laplacian, extracts a few important eigenvectors, embeds the data into this new spectral space, and runs a simple clustering algorithm like k-means.

---

## **1. Building a Similarity Graph**

Given data points $x_1, x_2, \dots, x_n \in \mathbb{R}^d$, the first step in spectral clustering is to quantify how *similar* each pair of points is. Instead of clustering directly in the original feature space, spectral clustering converts the dataset into a **graph**, where relationships between points are easier to analyze.

**Gaussian Kernel (Similarity Function)**

To measure similarity between two data points $x_i$ and $x_j$, a commonly used function is the **Gaussian kernel**:

$w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

The Gaussian kernel is a **smooth similarity function** that converts Euclidean distance into a value between 0 and 1:

* Points that are **very close** have similarity values near 1.
* Points that are **far apart** have similarity values near 0.
* The parameter $\sigma$ controls how quickly similarity decays with distance:
  * Smaller $\sigma$ emphasizes local neighborhoods.
  * Larger $\sigma$ creates broader connections.

This function ensures that similarity decreases smoothly rather than abruptly.

**Adjacency (Similarity) Matrix (W)**

All pairwise similarities are stored in the **adjacency matrix** $W \in \mathbb{R}^{n \times n}$:

$W_{ij} = w_{ij}$

The adjacency matrix is a matrix representation of a graph:

* Each row and column corresponds to a data point (node).
* Each entry $W_{ij}$ indicates whether and how strongly two points are connected.
* $W$ is **symmetric**, since the similarity function is symmetric.
* Diagonal entries are typically set to zero (no self-loops).

Interpreting $W$:

* $W_{ij} \approx 1$: points $i$ and $j$ are very similar
* $W_{ij} \approx 0$: points $i$ and $j$ are dissimilar

---

### **k-Nearest Neighbors Graph (Sparsification)**

In practice, computing similarities between *all* pairs of points leads to a dense graph that is expensive to store and compute. To address this, spectral clustering often uses a **k-nearest neighbors (k-NN) graph**:

* Each point is connected only to its $k$ closest neighbors.
* All other similarities are set to zero.

This produces a **sparse graph** that:

* Preserves meaningful local structure
* Reduces noise from weak, long-distance connections
* Improves computational efficiency
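To make the graph-construction step concrete, here is a minimal NumPy sketch, assuming a Gaussian kernel with a hand-picked bandwidth `sigma` and a simple k-NN sparsification. The function names (`gaussian_similarity`, `knn_sparsify`) and the parameter values are illustrative choices, not part of any library's API.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Dense similarity matrix with W[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                   # no self-loops
    return W

def knn_sparsify(W, k=10):
    """Keep only each point's k strongest similarities, then symmetrize."""
    mask = np.zeros_like(W, dtype=bool)
    for i in range(W.shape[0]):
        mask[i, np.argsort(W[i])[-k:]] = True  # indices of the k largest entries in row i
    mask = mask | mask.T                       # keep an edge if either endpoint selects it
    return np.where(mask, W, 0.0)

# Example: a sparse similarity graph for 20 random 2-D points (illustrative data).
X = np.random.default_rng(0).normal(size=(20, 2))
W_sparse = knn_sparsify(gaussian_similarity(X, sigma=1.0), k=5)
```

The symmetrization above keeps an edge whenever *either* endpoint selects it; a stricter alternative is the mutual k-NN graph, which keeps an edge only when both endpoints do.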
---

### **From Similarity Graph to the Laplacian**

Once the similarity graph is constructed, its structure is summarized using the **graph Laplacian**. First, define the **degree matrix** $D$, a diagonal matrix where:

$D_{ii} = \sum_j W_{ij}$

The **unnormalized graph Laplacian** is then defined as:

$L = D - W$

The Laplacian captures how each node differs from its neighbors and encodes the global connectivity of the graph. Its eigenvalues and eigenvectors play a central role in spectral clustering.

---

### **Why This Step Matters**

This similarity graph and its associated Laplacian form the foundation of spectral clustering. All subsequent steps, including eigenvector computation, spectral embedding, and k-means clustering, operate on this graph structure rather than directly on the original data.

---

## **2. Compute the Degree Matrix and Laplacian**

As introduced above, the **degree matrix** $D$ is diagonal:

$D_{ii} = \sum_j W_{ij}$

The **unnormalized Laplacian** is:

$L = D - W$

This matrix encodes how strongly each node is connected to the rest of the graph.

---

## **3. Compute Eigenvalues and Eigenvectors**

We compute the eigenvectors corresponding to the **smallest** eigenvalues of $L$:

$Lv_i = \lambda_i v_i$

Sort the eigenvectors by eigenvalue:

$0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$

The first eigenvector (for $\lambda_1 = 0$) is the trivial constant vector. The next few eigenvectors contain the clustering structure.

### **Interpretation**

In the second eigenvector $v_2$ (the **Fiedler vector**):

* Nodes in the same cluster have similar values.
* Cluster membership is often (approximately) encoded by the *sign* of $v_2(i)$.

Eigenvectors give us a new representation where clusters are clearer.

---

## **4. Embed the Data Using These Eigenvectors**

Form a new matrix:

$U = [v_1, v_2, \dots, v_k]$

Each original data point becomes a row of $U$ (a point in $\mathbb{R}^k$).

For normalized Laplacians, we also normalize each row:

$U_{i,:} \leftarrow \frac{U_{i,:}}{\|U_{i,:}\|}$

---

## **5. Run k-means in the Spectral Space**

Finally:

* Run k-means on the rows of $U$
* Assign each point to the nearest cluster center

Because the eigenvectors already separate the graph's cluster structure geometrically, k-means performs very well in this space.

---

# **Pseudocode for Spectral Clustering**

```text
Input: Data matrix X (n x d), number of clusters k

1. Construct similarity matrix W:
   for i, j:
       W[i,j] = exp( - ||X[i] - X[j]||^2 / (2 * sigma^2) )

2. Compute degree matrix D:
   D[i,i] = sum_j W[i,j]

3. Compute Laplacian:
   L = D - W

4. Compute eigenpairs:
   Find the k smallest eigenvalues and eigenvectors of L

5. Form matrix U from these k eigenvectors

6. Normalize rows of U (optional)

7. Run k-means on rows of U

Output: Cluster labels for each data point
```
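Here is a minimal NumPy/SciPy sketch of the pseudocode above (unnormalized Laplacian, dense Gaussian similarity). The function name `spectral_clustering_demo` and the default `sigma=1.0` are illustrative, not a standard API; a production implementation would typically use a sparse graph and a sparse eigensolver instead.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering_demo(X, k, sigma=1.0):
    # 1. Similarity matrix W via the Gaussian kernel (dense, fine for small n)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # 2-3. Degree matrix and unnormalized Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W

    # 4-5. Eigenvectors for the k smallest eigenvalues of L (eigh sorts ascending)
    _, eigvecs = eigh(L)
    U = eigvecs[:, :k]                  # n x k spectral embedding

    # 6-7. Row normalization skipped for the unnormalized Laplacian; k-means on U
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Illustrative usage on random 2-D data:
# labels = spectral_clustering_demo(np.random.default_rng(0).normal(size=(30, 2)), k=2)
```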
# Diagram 1: From Data $\rightarrow$ Graph

This diagram shows how we convert raw data points into a similarity graph. Each point becomes a node, and edges represent how similar each pair of points is, based on a chosen similarity function. This graph is the foundation of spectral clustering because all subsequent mathematical steps, such as the Laplacian, its eigenvectors, and the embedding, operate on this graph.

![download](https://hackmd.io/_uploads/Sy29M1A--e.png)

# Diagram 2: Laplacian Eigenvector

This diagram illustrates the second eigenvector (the Fiedler vector) of the graph Laplacian. Each node receives a scalar value from the eigenvector. Nodes that are strongly connected in the graph get similar values, while nodes in different clusters diverge. The sign or value patterns in this eigenvector often reveal the natural cluster split; this is the key mathematical insight behind spectral clustering.

![download (1)](https://hackmd.io/_uploads/B17aGyA-bl.png)

# Diagram 3: Embedding + k-means

Here we embed the data into a new space using the Laplacian's eigenvectors and run k-means in that space. After embedding, points that belong to the same graph cluster are now close together, while points from different clusters are far apart. This makes the final clustering step easy because the spectral embedding transforms a complex, potentially non-convex clustering problem into a simple geometric one.

![download (7)](https://hackmd.io/_uploads/B1UWptef-l.png)

![download (2)](https://hackmd.io/_uploads/SJGkXkRZ-g.png)

# **A Fully Worked Mini Numerical Example (with Numbers)**

Let’s cluster **four points** arranged into two natural groups:

* Points 1 and 2 are close
* Points 3 and 4 are close
* The two groups are far apart

![download (3)](https://hackmd.io/_uploads/SkcO5h0Zbx.png)

> Figure: Four data points forming two well-separated clusters

We choose the similarity matrix:

$W = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$

This graph has two disconnected components:

* Component A: nodes 1–2
* Component B: nodes 3–4

### **Step 1: Degree Matrix**

$D = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$

Each node has degree 1.

### **Step 2: Laplacian**

$L = D - W = \begin{bmatrix} 1 & -1 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 1 \end{bmatrix}$

### **Step 3: Compute Eigenvalues & Eigenvectors**

The eigenvalues turn out to be:

$\lambda = [0, 0, 2, 2]$

Because there are **two connected components**, we get **two zero eigenvalues**. The corresponding eigenvectors are:

$v_1 = \begin{bmatrix}1 \\ 1 \\ 1 \\ 1\end{bmatrix}, \quad v_2 = \begin{bmatrix}1 \\ 1 \\ -1 \\ -1\end{bmatrix}$

* $v_1$ is the trivial constant vector (it always has eigenvalue $0$).
* **$v_2$ separates the two clusters perfectly:**
  * Points 1 & 2 → value $+1$
  * Points 3 & 4 → value $-1$

### **Visualization**

On the 2D plane, this looks like:

![download (4)](https://hackmd.io/_uploads/S17D-iCZZx.png)

> Figure: The second eigenvector $v_2$ assigns similar values to nodes within the same cluster and opposite values to different clusters.

If we draw the values of $v_2$ on a number line, we see:

![download (5)](https://hackmd.io/_uploads/HJR7bjA-Zx.png)

> Figure: Projecting nodes onto the Fiedler vector creates a 1D space where clusters become linearly separable.

There is a perfect cut between the two groups.

### **Step 4: Spectral Embedding**

Using the two eigenvectors:

$U = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}$

Normalize rows (optional):

* Points 1 & 2 → $(1,1)/\sqrt{2}$
* Points 3 & 4 → $(1,-1)/\sqrt{2}$

### **Step 5: k-means**

Run k-means with $k = 2$:

* Points 1 and 2 cluster together
* Points 3 and 4 cluster together

**Spectral clustering perfectly recovers the two groups.**
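As a quick sanity check, this worked example can be reproduced with a few lines of NumPy. Note that `numpy.linalg.eigh` may return a different (but equivalent) orthonormal basis of the eigenvalue-0 eigenspace, such as the indicator vectors of the two components rather than the $v_1, v_2$ shown above; any such basis still gives identical embedding rows for nodes 1–2 and for nodes 3–4.

```python
import numpy as np

W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))   # degree matrix: every node has degree 1
L = D - W                    # unnormalized Laplacian

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
print(np.round(eigvals, 6))            # approximately [0, 0, 2, 2]

U = eigvecs[:, :2]                     # embedding from the two zero-eigenvalue vectors
print(np.round(U, 3))                  # rows 1-2 coincide, rows 3-4 coincide
```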
---

# **Use Cases of Spectral Clustering**

Spectral clustering is especially powerful when clusters are **not convex** or **not linearly separable** in the original space. Some real-world examples:

---

### **1. Image Segmentation (Computer Vision)**

* **Conceptual use**:
  * Each pixel is treated as a node.
  * Edges encode similarity based on color, intensity, texture, and spatial proximity.
  * The goal is to partition the image into coherent regions (objects, foreground/background).
* **Real algorithms & applications**:
  * **Normalized Cuts (Shi & Malik)**, a classic spectral-clustering-based algorithm widely cited in the computer vision literature.
  * Used in early image segmentation pipelines and influential in modern segmentation techniques.
  * Image processing libraries and research frameworks often include spectral methods as baselines or benchmarks.
* **Why spectral clustering works here**:
  * Objects in images rarely form simple geometric shapes.
  * Spectral methods respect **connectivity and continuity**, not just raw pixel distance.

---

### **2. Social Networks & Community Detection**

* **Conceptual use**:
  * Nodes represent users.
  * Edge weights represent friendships, message frequency, or interaction strength.
  * Goal: identify communities with dense internal connections and sparse external ones.
* **Real algorithms & applications**:
  * **Graph partitioning algorithms** based on Laplacian eigenvectors are used in:
    * Social network analysis tools
    * Recommendation systems
    * Network science research
  * Used implicitly in platforms analyzing social graphs (e.g., detecting communities or interest groups).
* **Why spectral clustering works here**:
  * Communities are defined by **connectivity**, not spatial coordinates.
  * Spectral clustering naturally optimizes graph cut objectives that align with community structure.

---

### **3. Document and Text Clustering**

* **Conceptual use**:
  * Documents are nodes.
  * Similarity is computed using cosine similarity over **TF–IDF** or embedding vectors.
  * Goal: group documents by topic or theme.
* **Real algorithms & applications**:
  * Implemented in machine learning libraries such as **scikit-learn** (`SpectralClustering`)
  * Used in research pipelines for topic discovery and corpus exploration
  * Used in:
    * News article clustering
    * Academic paper organization
    * Search and recommendation systems
* **Why spectral clustering works here**:
  * Text data often lies on **high-dimensional, nonlinear manifolds**.
  * Spectral methods handle overlapping vocabularies and complex similarity relationships better than centroid-based methods.

---

### **4. Nonlinear Manifolds and Complex Geometric Shapes**

* **Conceptual use**:
  * Data points lie on curved or intertwined shapes (e.g., “two moons”, spirals).
  * Euclidean distance alone does not capture cluster membership.
* **Real algorithms & applications**:
  * It is used as a benchmark method in:
    * Manifold learning research
    * Dimensionality reduction comparisons
  * It is closely related to techniques such as:
    * Laplacian Eigenmaps
    * Diffusion Maps
* **Why spectral clustering works here**:
  * The similarity graph captures **local neighborhood structure**.
  * Eigenvectors of the Laplacian reveal global structure of the manifold that k-means cannot detect.

---

### **5. Software and Libraries Using Spectral Clustering**

Spectral clustering is implemented in widely used tools, including:

* **scikit-learn**: `sklearn.cluster.SpectralClustering`
* Graph processing libraries and academic research frameworks
* Computer vision and network analysis toolkits

These implementations are commonly used for:

* Exploratory data analysis
* Benchmarking clustering performance
* Research and prototyping in data science and informatics
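As a small end-to-end illustration (a sketch, not a benchmark), the scikit-learn estimator listed above can be run on the “two moons” data from Use Case 4. The specific values (`n_samples=300`, `noise=0.05`, `n_neighbors=10`) are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering, KMeans

# Two interlocking half-circles: non-convex clusters that defeat plain k-means.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Spectral clustering on a k-NN similarity graph.
spectral = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                              n_neighbors=10, assign_labels="kmeans",
                              random_state=0)
labels_spectral = spectral.fit_predict(X)

# Plain k-means for comparison: it tends to cut each moon in half,
# because the moons are not linearly separable in the original feature space.
labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```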
---

# **Limitations**

Spectral clustering is elegant, but not perfect. Some important limitations:

### **1. Computational Cost (Compared to Other Clustering Methods)**

One of the primary limitations of spectral clustering is its **computational expense**.

* Constructing the similarity matrix $W$ requires computing pairwise similarities between data points, which has:
  * **Time complexity:** $O(n^2)$
  * **Memory complexity:** $O(n^2)$
* Computing eigenvalues and eigenvectors of an $n \times n$ Laplacian matrix is also expensive, typically:
  * $O(n^3)$ in the worst case for a full eigendecomposition (and still costly for partial decompositions).

Because of this, spectral clustering **does not scale well** to very large datasets without approximations.

#### **Comparison with other clustering algorithms**

* **k-means**
  * Time complexity: $O(nkd)$ per iteration
  * Extremely fast and scalable
  * Works well for large datasets, but assumes convex, spherical clusters
* **Hierarchical clustering**
  * Often $O(n^2)$ or worse
  * Similar scalability issues, but no eigendecomposition required
* **DBSCAN**
  * Typically $O(n \log n)$ with spatial indexing
  * Scales better and handles noise, but struggles in high dimensions
* **Gaussian Mixture Models (GMMs)**
  * More expensive than k-means but still significantly cheaper than spectral clustering for large $n$

**Summary:** Spectral clustering is more computationally expensive than k-means, DBSCAN, and GMMs, making it better suited for **small to medium-sized datasets**.

---

### **2. Choosing Parameters**

Spectral clustering requires several parameter choices:

* Number of clusters $k$
* Similarity function (e.g., Gaussian kernel)
* Kernel bandwidth $\sigma$
* Graph construction method (e.g., k-nearest neighbors)

Poor parameter choices can dramatically degrade performance.

#### **Comparison**

* **k-means:** only requires choosing $k$
* **DBSCAN:** requires choosing $\varepsilon$ and a minimum number of points (but not $k$)
* **GMMs:** require choosing $k$ and a covariance structure

Spectral clustering has **more interacting parameters**, making it harder to tune.

---

### **3. Sensitivity to Graph Construction**

The quality of spectral clustering depends heavily on how the similarity graph is built.

* Using k-nearest neighbors vs. $\varepsilon$-neighborhoods can lead to different clusterings
* Too sparse a graph may disconnect meaningful clusters
* Too dense a graph may blur cluster boundaries

If the graph does not accurately reflect true relationships in the data, the Laplacian eigenvectors will not reveal meaningful structure (a small connectivity check appears after the comparison below).

#### **Comparison**

* **k-means and GMMs** operate directly in feature space and are less sensitive to graph construction choices
* **DBSCAN** is also sensitive to neighborhood definitions but does not rely on global graph structure
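To make the graph-construction sensitivity concrete, here is a small sketch (with illustrative parameter values) that counts the connected components of a k-NN graph as $k$ varies: a graph that is too sparse can fall apart into many components, which in turn produces many zero eigenvalues of the Laplacian.

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for k in (2, 5, 10, 30):
    # Symmetric k-NN connectivity graph (an edge is kept if either endpoint selects it).
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    A = A.maximum(A.T)
    n_components, _ = connected_components(A, directed=False)
    print(f"k = {k:2d} -> {n_components} connected component(s)")
```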
---

### **4. Not Probabilistic**

Spectral clustering produces **hard cluster assignments** only.

* No probability that a point belongs to a cluster
* No uncertainty or confidence estimates

#### **Comparison**

* **Gaussian Mixture Models** provide probabilistic cluster membership
* **Bayesian clustering models** can quantify uncertainty

This makes spectral clustering less suitable when interpretability or uncertainty estimation is important.

---

### **5. Assumes “Nice” Graph-Based Cluster Structure**

Spectral clustering implicitly assumes that:

* Clusters correspond to **well-separated regions** in the similarity graph
* The clustering objective can be approximated by minimizing a graph cut

If the real data does not form such structures, results may be unstable or misleading.

#### **Comparison**

* **DBSCAN** is more robust to irregular cluster shapes and noise
* **k-means** assumes geometric structure rather than connectivity
* **Hierarchical clustering** can reveal multiple levels of structure

---

## **When to Use Spectral Clustering**

Despite these limitations, spectral clustering is an excellent choice when:

* Data has **complex, non-convex structure**
* Relationships are better modeled as a **graph**
* Dataset size is moderate
* Cluster connectivity matters more than raw geometry

---

## **Key Takeaway**

Spectral clustering trades **computational efficiency** for **expressive power**. Compared to faster methods like k-means or DBSCAN, it excels at uncovering complex structure, but only when its assumptions and computational costs are appropriate for the problem.

---

## Connecting Back to Class Concepts

Spectral clustering ties together several ideas we’ve seen:

* **Eigenvalues and eigenvectors**: We use them to find important directions in matrices. Here, eigenvectors of the Laplacian reveal natural groupings in a graph, similar to how eigenvectors of a covariance matrix reveal principal components in PCA.
* **Matrices and transformations**: The Laplacian $L$ can be seen as a linear operator that measures how a function on the nodes changes across edges: $v^\top L v = \tfrac{1}{2}\sum_{i,j} W_{ij}\,(v_i - v_j)^2$. Minimizing this quadratic form leads to smooth partitions of the graph.
* **Similarity and distance**: Just like in k-NN, we rely on a notion of distance/similarity between data points to construct the graph.

Overall, spectral clustering is a beautiful example of how linear algebra and graph theory combine to produce a powerful data science tool.

# **Conclusion**

Spectral clustering is a powerful bridge between **graph theory**, **linear algebra**, and **machine learning**. It turns data into a similarity graph, uses eigenvectors of the Laplacian to reveal natural partitions, and clusters cleanly even when shapes are complicated.

---

# **References**

Belkin, Mikhail, and Partha Niyogi. “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.” *Neural Computation*, vol. 15, no. 6, 2003, pp. 1373–1396.

Ester, Martin, et al. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” *Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD)*, 1996, pp. 226–231.

Luxburg, Ulrike von. “A Tutorial on Spectral Clustering.” *Statistics and Computing*, vol. 17, no. 4, 2007, pp. 395–416.

MacQueen, J. “Some Methods for Classification and Analysis of Multivariate Observations.” *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*, vol. 1, 1967, pp. 281–297.

Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. “On Spectral Clustering: Analysis and an Algorithm.” *Advances in Neural Information Processing Systems*, vol. 14, 2002.

Pedregosa, Fabian, et al. “Scikit-Learn: Machine Learning in Python.” *Journal of Machine Learning Research*, vol. 12, 2011, pp. 2825–2830.

Shi, Jianbo, and Jitendra Malik. “Normalized Cuts and Image Segmentation.” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 22, no. 8, 2000, pp. 888–905.

Zelnik-Manor, Lihi, and Pietro Perona. “Self-Tuning Spectral Clustering.” *Advances in Neural Information Processing Systems*, vol. 17, 2004.

---