# IML: Unsupervised clustering

# Why do we care

:::info
Group the input data into clusters that share some characteristics
:::

- Find patterns in the data (data mining problem)
- Visualize the data in a simpler way
- Infer some properties of a given data point based on how it relates to other data points (statistical learning)

![](https://i.imgur.com/T01V0W4.png)

# Why is it tricky

Belongs to unsupervised learning
- No ground truth available to learn from or to evaluate the quality of the algorithm

![](https://i.imgur.com/1vBcxct.png)

How to assess how related data points are to each other?
- Which criteria (features) are the most relevant
- Which metrics make the most sense

How to assess the soundness of the resulting clusters? Are they relevant?

![](https://i.imgur.com/ojmJbwz.png)

# Families of clustering approaches

- **Distance-based clustering**
    - centroid-based approaches ($k$-means)
    - connectivity-based approaches (based on distance)
- **Density-based clustering**
    - sets of dense points
- **Distribution-based clustering**
    - likelihood of points to belong to the same distribution
- **Fuzzy clustering**
    - Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness (fuzzy $c$-means clustering, ...)

# $k$-means clustering

Partition $n$ observations $x_1,...,x_n$ into $k$ clusters $C=\{C_1,...,C_k\}$ where each observation $x_i$ belongs to the cluster $C_{j^*}$ whose mean $\mu_{j^*}$ is the closest: $x_i\in C_{j^*}$ with $j^*=\arg\min_j\Vert x_i-\mu_j \Vert_2$

![](https://i.imgur.com/JvfTtlf.png)
![](https://i.imgur.com/LVIwYHS.png)

The cross represents the centre; a point is added to the cluster whose centre is the closest (smallest distance):

![](https://i.imgur.com/qrvcbBc.png)

- Minimize the within-cluster sum of squares (variance)
- Overall optimization problem:

![](https://i.imgur.com/ZOvFM5r.png)

- NP-hard problem, no guarantee of finding the optimal value
- Stochastic and very sensitive to initial conditions
- Sensitive to outliers (thank you, $L_2$ norm)
- Probably the most used clustering algorithm

## $k$-means and Voronoi tessellation

:::info
**Voronoi tessellation** partition of the Euclidean space relative to discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to the cell seed than to any other seed
![](https://i.imgur.com/5xj4XVA.png)
:::

- $k$-means provides a way to obtain a Voronoi tessellation of the input space, where the seeds are the final cluster means
- Alternatively, one can use pre-computed Voronoi tessellation seeds as initial clusters for $k$-means

![](https://i.imgur.com/qtY0EkL.png)

## Determining the optimal number of clusters

At a glance, how many clusters would you say this image contains?

![](https://i.imgur.com/7c42CyF.png)

> 2, 3, 4, 14....

Compute the explained variance for an increasing number of clusters $k$

![](https://i.imgur.com/THW7YRL.png)

:::success
Plot and find the bend of the elbow
![](https://i.imgur.com/2GEZX4h.png)
:::

:::warning
Sometimes it does not work :(
![](https://i.imgur.com/eelA1rX.png)
:::
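To make the elbow method concrete, here is a minimal sketch (not part of the original notes; it uses scikit-learn, and the synthetic dataset and parameter values are illustrative assumptions) that fits $k$-means for increasing $k$ and plots the within-cluster sum of squares:

```python
# Minimal elbow-method sketch (illustrative only, synthetic data).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```

On data like this, the curve typically bends around $k=4$; on less well-separated data the bend may not be visible at all, which is exactly the failure case shown in the warning above.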
## Sometimes, $k$-means works...

But most of the time, not as expected.

:::warning
Probably because of the $L_2$ norm that $k$-means tries to minimize
:::

![](https://i.imgur.com/l8CifIm.png)

- Sensitive to the *curse of dimensionality*
- Forms "normalized Gaussian" (spherical) clusters
- Does not adapt to manifold geometry
- Sensitive to class imbalance
- Sensitive to outliers

## Simple Linear Iterative Clustering

> A kick-ass image segmentation algorithm using $k$-means

:::info
SLIC superpixels use a modified $k$-means clustering in $Labxy$ space to produce $k$ clusters that are regularly sampled and perceptually coherent from a colour point of view.
:::

![](https://i.imgur.com/gpGsA50.png)
![](https://i.imgur.com/t4AZsqN.png)
![](https://i.imgur.com/TM6AfmQ.png)

## $k$-medoids clustering

> Possible extension to $k$-means

:::warning
Cluster centroids are generally not actual data points $\Rightarrow$ can be problematic
:::

$\Rightarrow$ Replace centroids by medoids (the point with the smallest distance to all other points in the cluster)

![](https://i.imgur.com/JgMr7rx.png)
![](https://i.imgur.com/sgBbLcY.png)

$\Rightarrow$ $k$-medoids algorithm

Overall objective: find $k$ medoids $m_1,...,m_k$ that minimize the partitioning cost

![](https://i.imgur.com/9nQxiRq.png)
![](https://i.imgur.com/Zeasggl.png)

# Fuzzy $c$-means clustering

:::warning
$k$-means is a **hard** clustering method: each data point belongs 100% to a single cluster
:::

:::info
**Soft** clustering methods allow each data point to belong to several clusters with various degrees of membership
:::

![](https://i.imgur.com/6LY1olR.png)

# Gaussian mixture models

> $k$-means on steroids

$k$-means works for spherical clusters, but fails in most other cases $\Rightarrow$ try harder

Model the probability density function $f$ of the data as a mixture of multivariate Gaussians

![](https://i.imgur.com/L9AKAI7.png)
![](https://i.imgur.com/1wFKjes.png)

This curve is a superposition of several Gaussians:

![](https://i.imgur.com/BQPOTaP.png)
![](https://i.imgur.com/agg6coo.png)
![](https://i.imgur.com/otkQSp3.png)

:::danger
We must be able to estimate the mixing proportions of these Gaussians in the sum
:::

## The EM algorithm

### Initialization

- Select $k$ random points as initial means $\hat\mu_1,...,\hat\mu_k$
- Initialize all covariance matrices $\hat\Sigma_1,...,\hat\Sigma_k$ as the covariance matrix $\hat\Sigma$ of the whole data sample
- Set uniform mixture weights $\hat\phi_1=...=\hat\phi_k=\frac{1}{k}$

### Alternate until convergence

**Expectation step**
Compute the membership weight $\hat\gamma_{ij}$ of $x_i$ with respect to the $j^{th}$ component $\mathcal N(x\vert\mu_j,\Sigma_j)$

![](https://i.imgur.com/klkB7x8.png)

**Maximization step**
Update the weights (in that order)

![](https://i.imgur.com/i59ad2z.png)

:::success
Tadaaaa
![](https://i.imgur.com/fxyRwVS.png)
:::

## $k$-means vs GMM

> Let the fight begin!

![](https://i.imgur.com/Z4GXU9M.png)
![](https://i.imgur.com/DypjLho.png)
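To see the difference in practice, here is a minimal sketch (not from the original notes; scikit-learn is assumed, and the stretched synthetic data and all hyperparameters are illustrative choices) that fits both $k$-means and a full-covariance GMM on the same anisotropic blobs:

```python
# Illustrative k-means vs GMM comparison on anisotropic (stretched) blobs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Generate isotropic blobs, then stretch them into elongated ellipses.
X, _ = make_blobs(n_samples=600, centers=3, random_state=0)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# Hard, spherical partition.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Full-covariance GMM fitted by EM: adapts to the elongated cluster shapes.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
gmm_labels = gmm.predict(X)        # hard assignment (argmax of posteriors)
gmm_gamma = gmm.predict_proba(X)   # soft membership weights (the EM "gamma")

print("mixing proportions:", gmm.weights_)  # estimated phi_j
print("component means:\n", gmm.means_)     # estimated mu_j
```

Plotting `kmeans_labels` against `gmm_labels` on such data usually reproduces the pattern shown in the figures above: $k$-means cuts the elongated clusters with straight Voronoi boundaries, while the GMM follows their elliptical shape.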
# Kernel Density Estimation

> Nonparametric estimation

:::info
**Goal** Estimate the probability density function $f$ based only on observations $x_1,...,x_n$, assumed to be drawn from $f$ ~~otherwise wtf are we doing here~~
![](https://i.imgur.com/m6D3SSw.png)
:::

The kernel density estimator with bandwidth $h$ at a given point $x$ is given by

![](https://i.imgur.com/fzYYYkY.png)
![](https://i.imgur.com/raArYI8.png)

## Examples

![](https://i.imgur.com/YaZHrgS.png)

## Mean shift clustering

:::info
Shift each point to the local density maximum of its KDE, and assign to the same cluster all points that converge to the same maximum
![](https://i.imgur.com/dfK2m9G.png)
![](https://i.imgur.com/I8GJKaQ.png)
:::

### Examples

![](https://i.imgur.com/kqlMnY4.png)

The same thing can be done on colour images:

![](https://i.imgur.com/k29t87Y.png)

# DBSCAN

> Density-based spatial clustering of applications with noise

- Divide points into 3 categories (core, boundary, outlier) depending on whether or not there are at least $minPts$ points in their $\epsilon$-neighborhood
- Find the connected components of core points (ignoring all non-core points)
- Assign each non-core point to a nearby cluster if it is less than $\epsilon$ away, otherwise label it as noise

![](https://i.imgur.com/k5bta0x.png)
![](https://i.imgur.com/AbrcJDe.png)

# Spectral clustering

:::info
View the clustering task as a min-cut operation in a graph
:::

![](https://i.imgur.com/Y4Kj0pb.png)

- Compute a similarity graph (but which one?) of the data $x_1,...,x_n$

![](https://i.imgur.com/S6F7mS5.png)

- Compute the (weighted) adjacency matrix $W$, the degree matrix $D$ and the Laplacian matrix $L=D-W$
- Perform the eigendecomposition of $L$ into eigenvectors $E$ and eigenvalues $\Delta$

:::info
**Fact #1** 0 is an eigenvalue of $L$ with multiplicity equal to the number of connected components in the graph; its eigenvectors are the indicator vectors of those connected components
![](https://i.imgur.com/VZycsNT.png)
:::

:::info
**Fact #2** The eigenvector of the smallest non-zero eigenvalue (the Fiedler vector) gives the normalized min-cut of the graph
![](https://i.imgur.com/TTP3aX6.png)
:::

- Perform $k$-means clustering on the rows of the matrix of the $k$ smallest eigenvectors $[e_1,...,e_k]_{n\times k}$

![](https://i.imgur.com/Cpn3550.png)

# Hierarchical clustering

> A very natural way of handling data

:::info
**Goal** Generate a sequence of nested clusters and order them in a hierarchy, represented by a dendrogram
![](https://i.imgur.com/j8dnpgU.png)
:::

- Leaves of the dendrogram = initial data
- Inner nodes of the dendrogram = clusters

## Example

![](https://i.imgur.com/apVnKYK.png)

## Agglomerative vs divisive clustering

Agglomerative clustering: merge clusters from fine to coarse (bottom-up approach)

![](https://i.imgur.com/VJunfOZ.png)

Divisive clustering: split clusters from coarse to fine (top-down approach)

![](https://i.imgur.com/eypuhD0.png)

- Needs some heuristics to avoid exploring the $O(2^n)$ ways of splitting each cluster...
- Not so used in practice

## Bestiary

![](https://i.imgur.com/WKmG55N.png)
![](https://i.imgur.com/SnuNeX5.png)

# Overall comparison of all methods

![](https://i.imgur.com/C0N9moV.png)
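As a hands-on complement to the comparison figure (a sketch, not part of the original notes; the two-moons dataset, the selection of methods and all hyperparameters are illustrative assumptions), several of the methods above can be run side by side with scikit-learn:

```python
# Illustrative side-by-side run of several clustering methods on two half-moons.
from sklearn.cluster import (AgglomerativeClustering, DBSCAN, KMeans,
                             MeanShift, SpectralClustering)
from sklearn.datasets import make_moons

# Two interleaved half-moons: a classic case where k-means struggles.
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

methods = {
    "k-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "mean shift": MeanShift(),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
    "spectral": SpectralClustering(n_clusters=2, affinity="nearest_neighbors", random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2, linkage="single"),
}

for name, model in methods.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # DBSCAN marks noise as -1
    print(f"{name:>13}: {n_clusters} clusters found")
```

On this kind of manifold-shaped data, the density-based and graph-based methods (DBSCAN, spectral, single-linkage agglomerative) typically recover the two moons, while centroid-based $k$-means cuts them with a straight boundary, which is the behaviour summarized in the figure above.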