Try   HackMD

IML : Unserpervised clustering

Why do we care

Group the input data into clusters that share some characteristics

  • Find pattern in the data (data mining problem)
  • Visualize the data in a simpler way
  • Infer some properties of a given data point based on how it relates to other data point (satistical learning)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Why is it tricky

Belongs to unsupervised learning

  • No grounds truth available to learn/evaluate quality of the algorithm

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

How to assess how much data points are related to each other?

  • Which criteria (features) are the most relevant
  • Which metrics make the most sense

How to assess the soundness of the resulting cluster? Is it relevant ?

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Families of clustering approaches

  • Distance-based clustering
    • centroid-based approach (k-mean)
    • connectivity-based approaches (based on distance)
  • Density-based clustering
    • set of dense points
  • Distribution-based clustering
    • likelihood of point to belong to the same distribution
  • Fuzzy clustering
    • Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness metric (fuzzy
      c
      -means clustering,).

k
means clustering

Partition

n observations
x1,...,xn
int
k
clusters
C={C1,...,Ck}
where each observations
xi
belongs to the cluster
Cj
whose mean
μj
is the closest:
xiSj
with
j=argminjxiμj2

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

La croix represente le centre, on veut la plus petite distance depuis un centre pour ajouter un point dans un cluster:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • Minimize within-cluster sum of squares (variance)
  • Overall optimization problem:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • NP-hard problem, no guarantee to find the optimal value
  • Stochastic and very sensitive to initial conditions
  • Sensitive to outliers (thank you
    L2
    norm)
  • Probably the most used clustering algorithm

k
means and Voronoi tesselation

Voronoi tesselation
parition of the Euclidean space relatively to discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to the cell seed than any other seed

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • k
    -mean provides a way to obtain a Voronoi tesselation of the input space, where seeds are the final cluster means
  • Alternatively, one case use some pre-computer Voronoi tesselation seeds as initial clusters for
    k
    -means

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Determining the optimal number of cluster

Combien de clusters a vue de nez pour cette image ?

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

2, 3, 4, 14

Compute explained variance for an increasing number of clusters

k

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Plot and find the bend of the elbow

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Sometimes it does not work :(

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Sometimes,
k
-means works

But most of the time not as expected.

Probably because the

L2 norm that
k
-means tries to minimize

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • Sensible of curse of dimensionality
  • Form "normalized Gaussian" clusters
  • Does not adapt to manifold geometry
  • Sensible to class imbalance
  • Sensible to outliers

Simple Linear Iterative Clustering

A kick-ass image segmentation algorithm using

k-means

SLIC superpixels uses a modified

k-means clustering in the
Labxy
space to produce
k
clusters regurlaly sampled and perceptually coherent from a color point of view.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

k-medoids clustering

Possible extension to

k-means

Cluster centroids are not initial data points

can be problematic

Replace centroids by medoid (points with the smallest distance to all other points in the cluster)
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

k
-medoid algorithm

Overall objective: find

k medoids
m1,...,mk
that minimize the partitioning cost

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Fuzzy
c
-means clustering

k-means is a hard clustering method: each data point 100% belongs to the cluster

Soft clustering methods allow each data points to belong to several clusters with various degrees of membership

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Gaussian mixture models

k-means on steroids

k-means works for spherical clusters, but fails in any other cases
try harder Model probability density function
f
of data as a mixture of multivariate Gaussian

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Cette courbe est une superposition de plusieurs Gaussiennes:
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Il faut pouvoir estimer les facteurs de proportions de ces gaussiennes dans la somme

The EM algorithm

Initialization

  • Select
    k
    random points as initial means
    μ^1,...,μ^k
  • Init all covariance matrices
    ^1,...,^k
    as whole data sample covariances matrix
    ^
  • Set uniform mixture weight
    ϕ^1,...,ϕ^k=1k

Alternate until convergence

Expectation step
Compute membership weight

γ^ij of
xi
with respect to
jth
component
N(x|μj,j)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Maximization step
Update weights (in that ordre)

Tadaaaa

k
-means vs GMM

Let the fight begin!

Kernel Density Estimation

Nonparametric estimation

Goal
Estimate probability density function

f based on observations
x1,...,xn
only, assumed to derive from
f
otherwise wtf are we doing here

The kernel density estimator with bandwith

h at given point
x
is given by

Exemples

Mean shift clustering

shift each point to the the local density maximum of its KDE, and assign to the same cluster all points that lead to the same maximum


Exemples

On peut faire la meme chose sur les images en couleurs:

DBSCAN

Density-base spatial clustering of applications with noise

  • Divide points into 3 categories (core, boundary, outliers) whether there are at least
    minPts
    in their
    ϵ
    -neighborhood or not
  • Find the connected component of core points (ignore all non-core points)
  • Assign non-core points to nearby clusters if it is less than
    ϵ
    away, otherwise assign to noise

Spectral clustering

View clustering task as a min-cut operation in a graph

  • Compute similarity graph (but which one?) of data
    x1,...,xn

  • Compute (weighted) adjacency matrix
    W
    , degree matrix
    D
    and Laplacian matrix
    L=DW
  • Perform eigendecomposition of
    L=(E,)

Fact #1
0 is and eigenvalue of

L with multiplicity
#
connected components in graph, its eigenvectors are identity vectors of those connected components

Fact #2
Eigenvector of smallest non-zero eigenvalue (Fiedler vector) gives the normalized min-cut of graph

  • Performs
    k
    -means clustering of the
    k
    smallest eigenvectors
    [e1,...,ek]n×k

Hierarchical clustering

A very natural way of handling data

Goal
Generate a sequence of nested clusters and ordre them in a hierarchy, represented by a dendogram

  • Leaves the dendogram = initial data
  • Inner nodes of the dendogram = clusters

Exemple

Agglomerative vs Divise clustering

Agglomerative: merge clusters from fine to coarse (bottom-up approach)

Divisive clustering: split clusters (top-down approach)

  • Needs some heuristics to avoid the
    O(2n)
    ways of spitting each cluster
  • Not so used in practice

Bestiarity

Overall comparison of all methods