# IML: Unsupervised clustering
# Why do we care
:::info
Group the input data into clusters that share some characteristics
:::
- Find patterns in the data (data mining problem)
- Visualize the data in a simpler way
- Infer some properties of a given data point based on how it relates to other data points (statistical learning)

# Why is it tricky
Belongs to unsupervised learning
- No ground truth available to learn from or to evaluate the quality of the algorithm

How to assess how closely data points are related to each other?
- Which criteria (features) are the most relevant?
- Which metrics make the most sense?

How to assess the soundness of the resulting clusters? Are they relevant?

# Families of clustering approaches
- **Distance-based clustering**
- centroid-based approaches ($k$-means)
- connectivity-based approaches (based on distance)
- **Density-based clustering**
- set of dense points
- **Distribution-based clustering**
- likelihood of points to belong to the same distribution
- **Fuzzy clustering**
- Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness metric (fuzzy $c$-means clustering,...).
# $k$-means clustering
Partition $n$ observations $x_1,...,x_n$ into $k$ clusters $C=\{C_1,...,C_k\}$ where each observation $x_i$ belongs to the cluster $C_{j^*}$ whose mean $\mu_{j^*}$ is the closest: $x_i\in C_{j^*}$ with $j^*=\operatorname{argmin}_j\Vert x_i-\mu_j \Vert_2$


The cross represents the cluster center; a point is added to the cluster whose center is the closest:

- Minimize the within-cluster sum of squares (variance)
- Overall optimization problem: $\arg\min_{C}\sum_{j=1}^{k}\sum_{x_i\in C_j}\Vert x_i-\mu_j\Vert_2^2$

- NP-hard problem, no guarantee to find the optimal value
- Stochastic and very sensitive to initial conditions
- Sensitive to outliers (thank you $L_2$ norm)
- Probably the most used clustering algorithm
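
As a rough illustration, here is a minimal $k$-means sketch using scikit-learn; the blob data and parameter values are illustrative assumptions, not from the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 3 isotropic Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Several random restarts (n_init) mitigate the sensitivity to initial conditions
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_           # cluster index of each observation
means = km.cluster_centers_   # final cluster means mu_1, ..., mu_k
wcss = km.inertia_            # within-cluster sum of squares (the minimized objective)
```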
## $k$-means and Voronoi tessellation
:::info
**Voronoi tessellation**
Partition of the Euclidean space relative to a set of discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to that cell's seed than to any other seed

:::
- $k$-means provides a way to obtain a Voronoi tessellation of the input space, where seeds are the final cluster means
- Alternatively, one can use some pre-computed Voronoi tessellation seeds as initial clusters for $k$-means

## Determining the optimal number of clusters
How many clusters, at first glance, for this image?
> 2, 3, 4, 14....
Compute explained variance for an increasing number of clusters $k$

:::success
Plot and find the bend of the elbow

:::
:::warning
Sometimes it does not work :(

:::
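
A possible way to produce the elbow plot, reusing the illustrative `X` from the $k$-means sketch above (the range of $k$ values is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares for an increasing number of clusters k
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot and look for the bend of the elbow
plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```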
## Sometimes, $k$-means works...
But most of the time not as expected.
:::warning
Probably because of the $L_2$ norm that $k$-means tries to minimize
:::

- Sensitive to the *curse of dimensionality*
- Tends to form "normalized Gaussian" clusters
- Does not adapt to manifold geometry
- Sensitive to class imbalance
- Sensitive to outliers
## Simple Linear Iterative Clustering
> A kick-ass image segmentation algorithm using $k$-means
:::info
SLIC superpixels use a modified $k$-means clustering in the $Labxy$ space to produce $k$ clusters that are regularly sampled and perceptually coherent from a color point of view.
:::
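
For instance with scikit-image (the test image and parameter values are illustrative assumptions):

```python
from skimage import color, data, segmentation

image = data.astronaut()  # any RGB image

# SLIC: modified k-means in the combined Lab color + xy position space
segments = segmentation.slic(image, n_segments=200, compactness=10, start_label=1)

# Visualize each superpixel with its average color
out = color.label2rgb(segments, image, kind="avg")
```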



## $k$-medoids clustering
> Possible extension to $k$-means
:::warning
Cluster centroids are not actual data points $\Rightarrow$ can be problematic
:::
$\Rightarrow$ Replace centroids by medoids (the point with the smallest total distance to all other points in the cluster)


$\Rightarrow$ $k$-medoids algorithm
Overall objective: find $k$ medoids $m_1,\dots,m_k$ that minimize the partitioning cost $\sum_{j=1}^{k}\sum_{x_i\in C_j} d(x_i, m_j)$
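
A naive alternating (Voronoi-iteration) $k$-medoids sketch in NumPy, for illustration only; a PAM-style implementation such as `KMedoids` from scikit-learn-extra would typically be used in practice:

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Naive alternating k-medoids: assign points, then re-pick each medoid."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)         # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assign to the closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            # new medoid = member with the smallest total distance to the other members
            new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels
```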


# Fuzzy $c$-means clustering
:::warning
$k$-means is a **hard** clustering method: each data point belongs 100% to a single cluster
:::
:::info
**Soft** clustering methods allow each data point to belong to several clusters with various degrees of membership
:::
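
A minimal fuzzy $c$-means sketch in NumPy (the fuzzifier $m=2$ and all names are illustrative assumptions):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Soft clustering: membership u[i, j] of point i in cluster j, rows sum to 1."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)                   # random initial memberships
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]  # membership-weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)        # update degrees of membership
    return centers, u
```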

# Gaussian mixture models
> $k$-means on steroids
$k$-means works for spherical clusters but fails in most other cases $\Rightarrow$ try harder: model the probability density function $f$ of the data as a mixture of multivariate Gaussians


This curve is a superposition of several Gaussians:



:::danger
We must be able to estimate the mixing proportions of these Gaussians in the sum
:::
## The EM algorithm
### Initialization
- Select $k$ random points as initial means $\hat\mu_1,...,\hat\mu_k$
- Init all covariance matrices $\hat\Sigma_1,...,\hat\Sigma_k$ to the whole data sample covariance matrix $\hat\Sigma$
- Set uniform mixture weights $\hat\phi_1=...=\hat\phi_k=\frac{1}{k}$
### Alternate until convergence
**Expectation step**
Compute the membership weight $\hat\gamma_{ij}$ of $x_i$ with respect to the $j^{th}$ component $\mathcal N(x\vert\mu_j,\Sigma_j)$:
$$\hat\gamma_{ij}=\frac{\hat\phi_j\,\mathcal N(x_i\vert\hat\mu_j,\hat\Sigma_j)}{\sum_{l=1}^{k}\hat\phi_l\,\mathcal N(x_i\vert\hat\mu_l,\hat\Sigma_l)}$$

**Maximization step**
Update the parameters (in that order):
$$\hat\phi_j=\frac{1}{n}\sum_{i=1}^{n}\hat\gamma_{ij},\qquad \hat\mu_j=\frac{\sum_{i=1}^{n}\hat\gamma_{ij}\,x_i}{\sum_{i=1}^{n}\hat\gamma_{ij}},\qquad \hat\Sigma_j=\frac{\sum_{i=1}^{n}\hat\gamma_{ij}\,(x_i-\hat\mu_j)(x_i-\hat\mu_j)^\top}{\sum_{i=1}^{n}\hat\gamma_{ij}}$$

:::success
Tadaaaa

:::
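
In practice the EM loop above is available off the shelf, e.g. scikit-learn's `GaussianMixture` (the data `X` and the number of components are assumptions):

```python
from sklearn.mixture import GaussianMixture

# EM-fitted mixture of k multivariate Gaussians with full covariance matrices
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)   # most likely component for each point
gamma = gmm.predict_proba(X)   # membership weights (E-step responsibilities)
means, covs, phis = gmm.means_, gmm.covariances_, gmm.weights_
```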
## $k$-means vs GMM
> Let the fight begin!


# Kernel Density Estimation
> Nonparametric estimation
:::info
**Goal**
Estimate probability density function $f$ based on observations $x_1,...,x_n$ only, assumed to derive from $f$ ~~otherwise wtf are we doing here~~

:::
The kernel density estimator with bandwidth $h$ at a given point $x$ is given by
$$\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\!\left(\frac{x-x_i}{h}\right)$$
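
A small usage sketch with scikit-learn's `KernelDensity`; the synthetic observations and the bandwidth value are assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Observations assumed to be drawn from the unknown density f
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(x[:, None])
grid = np.linspace(-5, 5, 500)[:, None]
f_hat = np.exp(kde.score_samples(grid))   # score_samples returns the log-density
```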


## Examples

## Mean shift clustering
:::info
Shift each point to the local density maximum of its KDE, and assign all points that converge to the same maximum to the same cluster


:::
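
For example with scikit-learn (the data `X` and the bandwidth-estimation quantile are assumptions):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Kernel bandwidth estimated from the data itself
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)

labels = ms.labels_            # points converging to the same mode share a label
modes = ms.cluster_centers_    # local density maxima reached by the shifting procedure
```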
### Examples

The same can be done on color images:

# DBSCAN
> Density-based spatial clustering of applications with noise
- Divide points into 3 categories (core, boundary, outliers) depending on whether there are at least $minPts$ points in their $\epsilon$-neighborhood or not
- Find the connected components of core points (ignoring all non-core points)
- Assign each non-core point to a nearby cluster if it is less than $\epsilon$ away, otherwise assign it to noise
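
A minimal usage sketch with scikit-learn (`eps` and `min_samples` values are assumptions to tune per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# eps = radius of the epsilon-neighborhood, min_samples = minPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                        # -1 marks points assigned to noise
is_core = np.zeros(len(labels), dtype=bool)
is_core[db.core_sample_indices_] = True    # core points vs boundary/outliers
```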


# Spectral clustering
:::info
View clustering task as a min-cut operation in a graph
:::

- Compute similarity graph (but which one?) of data $x_1,...,x_n$

- Compute (weighted) adjacency matrix $W$, degree matrix $D$ and Laplacian matrix $L=D-W$
- Perform the eigendecomposition $L=E\Lambda E^\top$ (eigenvectors $E$, eigenvalues $\Lambda$)
:::info
**Fact #1**
0 is an eigenvalue of $L$ with multiplicity equal to the number of connected components in the graph; its eigenvectors are the indicator vectors of those connected components

:::
:::info
**Fact #2**
The eigenvector of the smallest non-zero eigenvalue (Fiedler vector) gives the normalized min-cut of the graph

:::
- Perform $k$-means clustering on the rows of the $k$ smallest eigenvectors $[e_1,...,e_k]_{n\times k}$
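
A possible NumPy sketch of the steps above, assuming a fully connected Gaussian (RBF) similarity graph; the kernel width `sigma` is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

def spectral_clustering(X, k, sigma=1.0):
    """Unnormalized spectral clustering: Laplacian eigenvectors + k-means."""
    # Similarity (weighted adjacency) matrix W with a Gaussian kernel
    W = np.exp(-euclidean_distances(X, squared=True) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    D = np.diag(W.sum(axis=1))          # degree matrix
    L = D - W                           # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)      # eigendecomposition, eigenvalues in ascending order
    E = eigvecs[:, :k]                  # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(E)  # cluster the rows of E
```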

# Hierarchical clustering
> A very natural way of handling data
:::info
**Goal**
Generate a sequence of nested clusters and order them in a hierarchy, represented by a dendrogram

:::
- Leaves of the dendrogram = initial data points
- Inner nodes of the dendrogram = clusters
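
For instance with SciPy, building and cutting a dendrogram with agglomerative (Ward) linkage; the data `X`, the linkage choice, and the cut level are assumptions:

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Agglomerative clustering: Ward linkage merges clusters from fine to coarse
Z = linkage(X, method="ward")

# Dendrogram: leaves are the initial data points, inner nodes are merged clusters
dendrogram(Z)

# Cut the hierarchy to obtain a flat partition into 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```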
## Example

## Agglomerative vs Divisive clustering
Agglomerative: merge clusters from fine to coarse (bottom-up approach)

Divisive clustering: split clusters (top-down approach)

- Needs some heuristics to avoid exploring the $O(2^n)$ ways of splitting each cluster...
- Not used much in practice
## Bestiary


# Overall comparison of all methods
