IML : Unserpervised clustering Why do we care
Group the input data into clusters that share some characteristics
Find pattern in the data (data mining problem)
Visualize the data in a simpler way
Infer some properties of a given data point based on how it relates to other data point (satistical learning)
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Why is it tricky Belongs to unsupervised learning
No grounds truth available to learn/evaluate quality of the algorithm
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
How to assess how much data points are related to each other?
Which criteria (features) are the most relevant
Which metrics make the most sense
How to assess the soundness of the resulting cluster? Is it relevant ?
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Families of clustering approaches
Distance-based clustering
centroid-based approach (k-mean)
connectivity-based approaches (based on distance)
Density-based clustering
Distribution-based clustering
likelihood of point to belong to the same distribution
Fuzzy clustering
Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness metric (fuzzy
-means clustering, … ).
means clustering Partition
observations
int
clusters
where each observations
belongs to the cluster
whose mean
is the closest:
with
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
La croix represente le centre, on veut la plus petite distance depuis un centre pour ajouter un point dans un cluster:
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Minimize within-cluster sum of squares (variance)
Overall optimization problem:
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
NP-hard problem, no guarantee to find the optimal value
Stochastic and very sensitive to initial conditions
Sensitive to outliers (thank you
norm)
Probably the most used clustering algorithm
means and Voronoi tesselation
Voronoi tesselation
parition of the Euclidean space relatively to discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to the cell seed than any other seed
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
-mean provides a way to obtain a Voronoi tesselation of the input space, where seeds are the final cluster means
Alternatively, one case use some pre-computer Voronoi tesselation seeds as initial clusters for
-means
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Determining the optimal number of cluster Combien de clusters a vue de nez pour cette image ?
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
2, 3, 4, 14 …
Compute explained variance for an increasing number of clusters
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Plot and find the bend of the elbow
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Sometimes it does not work :(
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Sometimes,
-means works … But most of the time not as expected.
Probably because the
norm that
-means tries to minimize
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Sensible of curse of dimensionality
Form "normalized Gaussian" clusters
Does not adapt to manifold geometry
Sensible to class imbalance
Sensible to outliers
Simple Linear Iterative Clustering
A kick-ass image segmentation algorithm using
-means
SLIC superpixels uses a modified
-means clustering in the
space to produce
clusters regurlaly sampled and perceptually coherent from a color point of view.
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
k-medoids clustering
Possible extension to
-means
Cluster centroids are not initial data points
can be problematic
Replace centroids by medoid (points with the smallest distance to all other points in the cluster)
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
-medoid algorithm
Overall objective: find
medoids
that minimize the partitioning cost
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Fuzzy
-means clustering
-means is a hard clustering method: each data point 100% belongs to the cluster
Soft clustering methods allow each data points to belong to several clusters with various degrees of membership
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Gaussian mixture models
-means on steroids
-means works for spherical clusters, but fails in any other cases
try harder Model probability density function
of data as a mixture of multivariate Gaussian
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Cette courbe est une superposition de plusieurs Gaussiennes:
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Il faut pouvoir estimer les facteurs de proportions de ces gaussiennes dans la somme
The EM algorithm Initialization
Select
random points as initial means
Init all covariance matrices
as whole data sample covariances matrix
Set uniform mixture weight
Alternate until convergence Expectation step
Compute membership weight
of
with respect to
component
Image Not Showing
Possible Reasons
The image file may be corrupted The server hosting the image is unavailable The image path is incorrect The image format is not supported
Learn More →
Maximization step
Update weights (in that ordre)
Tadaaaa
-means vs GMM
Let the fight begin!
Kernel Density Estimation
Nonparametric estimation
Goal
Estimate probability density function
based on observations
only, assumed to derive from
otherwise wtf are we doing here
The kernel density estimator with bandwith
at given point
is given by
Exemples
Mean shift clustering
shift each point to the the local density maximum of its KDE, and assign to the same cluster all points that lead to the same maximum
Exemples
On peut faire la meme chose sur les images en couleurs:
DBSCAN
Density-base spatial clustering of applications with noise
Divide points into 3 categories (core, boundary, outliers) whether there are at least
in their
-neighborhood or not
Find the connected component of core points (ignore all non-core points)
Assign non-core points to nearby clusters if it is less than
away, otherwise assign to noise
Spectral clustering
View clustering task as a min-cut operation in a graph
Compute similarity graph (but which one?) of data
Compute (weighted) adjacency matrix
, degree matrix
and Laplacian matrix
Perform eigendecomposition of
Fact #1
0 is and eigenvalue of
with multiplicity
connected components in graph, its eigenvectors are identity vectors of those connected components
Fact #2
Eigenvector of smallest non-zero eigenvalue (Fiedler vector) gives the normalized min-cut of graph
Performs
-means clustering of the
smallest eigenvectors
Hierarchical clustering
A very natural way of handling data
Goal
Generate a sequence of nested clusters and ordre them in a hierarchy, represented by a dendogram
Leaves the dendogram = initial data
Inner nodes of the dendogram = clusters
Exemple
Agglomerative vs Divise clustering Agglomerative: merge clusters from fine to coarse (bottom-up approach)
Divisive clustering: split clusters (top-down approach)
Needs some heuristics to avoid the
ways of spitting each cluster …
Not so used in practice
Bestiarity
Overall comparison of all methods