# IML: Unsupervised clustering
# Why do we care
:::info
Group the input data into clusters that share some characteristics
:::
- Find patterns in the data (data mining problem)
- Visualize the data in a simpler way
- Infer some properties of a given data point based on how it relates to other data points (statistical learning)

# Why is it tricky
Belongs to unsupervised learning
- No ground truth available to learn from or to evaluate the quality of the algorithm

How to assess how closely data points are related to each other?
- Which criteria (features) are the most relevant?
- Which metrics make the most sense?

How to assess the soundness of the resulting clusters? Are they relevant?

# Families of clustering approaches
- **Distance-based clustering**
- centroid-based approaches ($k$-means)
- connectivity-based approaches (based on distance)
- **Density-based clustering**
- set of dense points
- **Distribution-based clustering**
- likelihood of points to belong to the same distribution
- **Fuzzy clustering**
- Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness metric (fuzzy $c$-means clustering,...).
# $k$-means clustering
Partition $n$ observations $x_1,...,x_n$ into $k$ clusters $C=\{C_1,...,C_k\}$ where each observation $x_i$ belongs to the cluster $C_{j^*}$ whose mean $\mu_{j^*}$ is the closest: $x_i\in C_{j^*}$ with $j^*=\operatorname{argmin}_j\Vert x_i-\mu_j \Vert_2$


The cross represents the cluster center; a point is added to the cluster whose center is the closest:

- Minimize the within-cluster sum of squares (variance)
- Overall optimization problem: $\arg\min_{C}\sum_{j=1}^{k}\sum_{x_i\in C_j}\Vert x_i-\mu_j\Vert_2^2$

- NP-hard problem, no guarantee to find the optimal value
- Stochastic and very sensitive to initial conditions
- Sensitive to outliers (thank you $L_2$ norm)
- Probably the most used clustering algorithm
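
As a rough illustration, here is a minimal $k$-means sketch using scikit-learn; the blob data and parameter values are illustrative assumptions, not from the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data: 3 isotropic Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Several random restarts (n_init) mitigate the sensitivity to initial conditions
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_           # cluster index of each observation
means = km.cluster_centers_   # final cluster means mu_1, ..., mu_k
wcss = km.inertia_            # within-cluster sum of squares (the minimized objective)
```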
## $k$-means and Voronoi tessellation
:::info
**Voronoi tessellation**
Partition of the Euclidean space relative to a set of discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to that cell's seed than to any other seed

:::
- $k$-means provides a way to obtain a Voronoi tessellation of the input space, where seeds are the final cluster means
- Alternatively, one can use some pre-computed Voronoi tessellation seeds as initial clusters for $k$-means

## Determining the optimal number of clusters
How many clusters, at first glance, for this image?
> 2, 3, 4, 14....
Compute explained variance for an increasing number of clusters $k$

:::success
Plot and find the bend of the elbow

:::
:::warning
Sometimes it does not work :(

:::
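
A possible way to produce the elbow plot, reusing the illustrative `X` from the $k$-means sketch above (the range of $k$ values is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares for an increasing number of clusters k
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot and look for the bend of the elbow
plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```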
## Sometimes, $k$-means works...
But most of the time not as expected.
:::warning
Probably because of the $L_2$ norm that $k$-means tries to minimize
:::

- Sensitive to the *curse of dimensionality*
- Tends to form "normalized Gaussian" clusters
- Does not adapt to manifold geometry
- Sensitive to class imbalance
- Sensitive to outliers
## Simple Linear Iterative Clustering
> A kick-ass image segmentation algorithm using $k$-means
:::info
SLIC superpixels use a modified $k$-means clustering in the $Labxy$ space to produce $k$ clusters that are regularly sampled and perceptually coherent from a color point of view.
:::
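
For instance with scikit-image (the test image and parameter values are illustrative assumptions):

```python
from skimage import color, data, segmentation

image = data.astronaut()  # any RGB image

# SLIC: modified k-means in the combined Lab color + xy position space
segments = segmentation.slic(image, n_segments=200, compactness=10, start_label=1)

# Visualize each superpixel with its average color
out = color.label2rgb(segments, image, kind="avg")
```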



## $k$-medoids clustering
> Possible extension to $k$-means
:::warning
Cluster centroids are not actual data points $\Rightarrow$ can be problematic
:::
$\Rightarrow$ Replace centroids by medoids (the point with the smallest total distance to all other points in the cluster)


$\Rightarrow$ $k$-medoids algorithm
Overall objective: find $k$ medoids $m_1,\dots,m_k$ that minimize the partitioning cost $\sum_{j=1}^{k}\sum_{x_i\in C_j} d(x_i, m_j)$
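
A naive alternating (Voronoi-iteration) $k$-medoids sketch in NumPy, for illustration only; a PAM-style implementation such as `KMedoids` from scikit-learn-extra would typically be used in practice:

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Naive alternating k-medoids: assign points, then re-pick each medoid."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)         # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assign to the closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            # new medoid = member with the smallest total distance to the other members
            new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels
```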


# Fuzzy $c$-means clustering
:::warning
$k$-means is a **hard** clustering method: each data point belongs 100% to a single cluster
:::
:::info
**Soft** clustering methods allow each data point to belong to several clusters with various degrees of membership
:::
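
A minimal fuzzy $c$-means sketch in NumPy (the fuzzifier $m=2$ and all names are illustrative assumptions):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Soft clustering: membership u[i, j] of point i in cluster j, rows sum to 1."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)                   # random initial memberships
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]  # membership-weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)        # update degrees of membership
    return centers, u
```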

# Gaussian mixture models
> $k$-means on steroids
$k$-means works for spherical clusters but fails in most other cases $\Rightarrow$ try harder: model the probability density function $f$ of the data as a mixture of multivariate Gaussians


This curve is a superposition of several Gaussians:



:::danger
We must be able to estimate the mixing proportions of these Gaussians in the sum
:::
## The EM algorithm
### Initialization
- Select $k$ random points as initial means $\hat\mu_1,...,\hat\mu_k$
- Init all covariance matrices $\hat\Sigma_1,...,\hat\Sigma_k$ to the whole data sample covariance matrix $\hat\Sigma$
- Set uniform mixture weights $\hat\phi_1=...=\hat\phi_k=\frac{1}{k}$
### Alternate until convergence
**Expectation step**
Compute the membership weight $\hat\gamma_{ij}$ of $x_i$ with respect to the $j^{th}$ component $\mathcal N(x\vert\mu_j,\Sigma_j)$:
$$\hat\gamma_{ij}=\frac{\hat\phi_j\,\mathcal N(x_i\vert\hat\mu_j,\hat\Sigma_j)}{\sum_{l=1}^{k}\hat\phi_l\,\mathcal N(x_i\vert\hat\mu_l,\hat\Sigma_l)}$$

**Maximization step**
Update the parameters (in that order):
$$\hat\phi_j=\frac{1}{n}\sum_{i=1}^{n}\hat\gamma_{ij},\qquad \hat\mu_j=\frac{\sum_{i=1}^{n}\hat\gamma_{ij}\,x_i}{\sum_{i=1}^{n}\hat\gamma_{ij}},\qquad \hat\Sigma_j=\frac{\sum_{i=1}^{n}\hat\gamma_{ij}\,(x_i-\hat\mu_j)(x_i-\hat\mu_j)^\top}{\sum_{i=1}^{n}\hat\gamma_{ij}}$$

:::success
Tadaaaa

:::
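
In practice the EM loop above is available off the shelf, e.g. scikit-learn's `GaussianMixture` (the data `X` and the number of components are assumptions):

```python
from sklearn.mixture import GaussianMixture

# EM-fitted mixture of k multivariate Gaussians with full covariance matrices
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)   # most likely component for each point
gamma = gmm.predict_proba(X)   # membership weights (E-step responsibilities)
means, covs, phis = gmm.means_, gmm.covariances_, gmm.weights_
```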
## $k$-means vs GMM
> Let the fight begin!


# Kernel Density Estimation
> Nonparametric estimation
:::info
**Goal**
Estimate probability density function $f$ based on observations $x_1,...,x_n$ only, assumed to derive from $f$ ~~otherwise wtf are we doing here~~

:::
The kernel density estimator with bandwidth $h$ at a given point $x$ is given by
$$\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\!\left(\frac{x-x_i}{h}\right)$$
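
A small usage sketch with scikit-learn's `KernelDensity`; the synthetic observations and the bandwidth value are assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Observations assumed to be drawn from the unknown density f
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(x[:, None])
grid = np.linspace(-5, 5, 500)[:, None]
f_hat = np.exp(kde.score_samples(grid))   # score_samples returns the log-density
```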


## Examples

## Mean shift clustering
:::info
Shift each point to the local density maximum of its KDE, and assign all points that converge to the same maximum to the same cluster


:::
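
For example with scikit-learn (the data `X` and the bandwidth-estimation quantile are assumptions):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Kernel bandwidth estimated from the data itself
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)

labels = ms.labels_            # points converging to the same mode share a label
modes = ms.cluster_centers_    # local density maxima reached by the shifting procedure
```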
### Examples

The same can be done on color images:

# DBSCAN
> Density-based spatial clustering of applications with noise
- Divide points into 3 categories (core, boundary, outliers) depending on whether there are at least $minPts$ points in their $\epsilon$-neighborhood or not
- Find the connected components of core points (ignoring all non-core points)
- Assign each non-core point to a nearby cluster if it is less than $\epsilon$ away, otherwise assign it to noise
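
A minimal usage sketch with scikit-learn (`eps` and `min_samples` values are assumptions to tune per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# eps = radius of the epsilon-neighborhood, min_samples = minPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                        # -1 marks points assigned to noise
is_core = np.zeros(len(labels), dtype=bool)
is_core[db.core_sample_indices_] = True    # core points vs boundary/outliers
```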


# Spectral clustering
:::info
View clustering task as a min-cut operation in a graph
:::

- Compute similarity graph (but which one?) of data $x_1,...,x_n$

- Compute (weighted) adjacency matrix $W$, degree matrix $D$ and Laplacian matrix $L=D-W$
- Perform the eigendecomposition $L=E\Lambda E^\top$ (eigenvectors $E$, eigenvalues $\Lambda$)
:::info
**Fact #1**
0 is an eigenvalue of $L$ with multiplicity equal to the number of connected components in the graph; its eigenvectors are the indicator vectors of those connected components

:::
:::info
**Fact #2**
The eigenvector of the smallest non-zero eigenvalue (Fiedler vector) gives the normalized min-cut of the graph

:::
- Perform $k$-means clustering on the rows of the $k$ smallest eigenvectors $[e_1,...,e_k]_{n\times k}$
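
A possible NumPy sketch of the steps above, assuming a fully connected Gaussian (RBF) similarity graph; the kernel width `sigma` is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

def spectral_clustering(X, k, sigma=1.0):
    """Unnormalized spectral clustering: Laplacian eigenvectors + k-means."""
    # Similarity (weighted adjacency) matrix W with a Gaussian kernel
    W = np.exp(-euclidean_distances(X, squared=True) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    D = np.diag(W.sum(axis=1))          # degree matrix
    L = D - W                           # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)      # eigendecomposition, eigenvalues in ascending order
    E = eigvecs[:, :k]                  # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(E)  # cluster the rows of E
```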

# Hierarchical clustering
> A very natural way of handling data
:::info
**Goal**
Generate a sequence of nested clusters and order them in a hierarchy, represented by a dendrogram

:::
- Leaves of the dendrogram = initial data points
- Inner nodes of the dendrogram = clusters
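
For instance with SciPy, building and cutting a dendrogram with agglomerative (Ward) linkage; the data `X`, the linkage choice, and the cut level are assumptions:

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Agglomerative clustering: Ward linkage merges clusters from fine to coarse
Z = linkage(X, method="ward")

# Dendrogram: leaves are the initial data points, inner nodes are merged clusters
dendrogram(Z)

# Cut the hierarchy to obtain a flat partition into 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
```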
## Example

## Agglomerative vs Divisive clustering
Agglomerative: merge clusters from fine to coarse (bottom-up approach)

Divisive clustering: split clusters (top-down approach)

- Needs some heuristics to avoid exploring the $O(2^n)$ ways of splitting each cluster...
- Not used much in practice
## Bestiary


# Overall comparison of all methods
