# IML : Dimensionality reduction

# Why do we care?

We have at hand $n$ points $x_1,..., x_n$ lying in some $N$-dimensional space, $x_i \in \mathbb{R}^N,\ \forall i = 1, ..., n$, compactly written as an $n \times N$ matrix $X$
- One row of $X$ = one sample
- One column of $X$ = a given feature value for all samples

![](https://i.imgur.com/CmbXcfy.png)

## Example of real high-dimensional data

:::warning
Real world data is very often high-dimensional
:::

### MNIST image classification:

![](https://i.imgur.com/psXUjPa.png)

- Sample $x$: image with 28x28 pixels
- Data set: 60000 samples
- Dimensionality: $x \in \mathbb{R}^{28\times28=784}$

### MUSE hyperspectral image analysis:

![](https://i.imgur.com/afKNwIT.png)

- Sample $x$: pixel with 3600 spectral bands
- Data set: image with 300x300 pixels
- Dimensionality: $x \in \mathbb{R}^{3600}$

> Used for instance to discriminate between galaxies

## The curse of dimensionality

:::danger
High-dimensional spaces suffer from the *curse of dimensionality* (also called Hughes' phenomenon)
:::

### In $\mathbb{R}$

![](https://i.imgur.com/vesSPAR.png)
![](https://i.imgur.com/D1xI2ST.png)

### In $\mathbb{R}^2$

To get back to the same sampling density:
![](https://i.imgur.com/jRNkOX5.png)

### In $\mathbb{R}^3$

![](https://i.imgur.com/7rK9gAn.png)

To get back to the same sampling density:
![](https://i.imgur.com/PUKy7sm.png)

$$
\frac{\nu(\mathbb{B}^n)}{\nu([-1,1]^n)}=\frac{\pi^{\frac{n}{2}}}{2^n\,\Gamma(\frac{n}{2}+1)}\xrightarrow[n\to+\infty]{}0
$$

where $\nu(\mathbb{B}^n)$ is the volume of the unit ball and $\nu([-1,1]^n)=2^n$ that of the enclosing cube.

:::warning
Points uniformly distributed in an $n$-cube of side 2 mostly fall outside of the unit ball!
![](https://i.imgur.com/rh4aYb4.png)
:::

# Why is it tricky?

- We naturally cannot picture anything that is more than 3D in our mind
![](https://i.imgur.com/JLWfSrR.png)
- Picturing something 3D on a 2D flat screen can already be misleading
- Real data naturally lives in (complex) high-dimensional spaces
- Real data is often strongly correlated

:::warning
And somehow, we want to have a good look at our data before feeding it to some machine learning algorithm *(can I use the inherent structure of my data to boost my machine learning performance?)*
:::

# How?

:::info
Dimensionality reduction: transform the data set $X$ with dimensionality $N$ into a new data set $Y$ (an $n \times M$ matrix) with dimensionality $M \lt N$ (ideally $M \ll N$) such that as **little information as possible is lost** in the process. $y_i$ ($i$th row of $Y$) is the low-dimensional counterpart *(the projection)* of $x_i$.
:::

**INFORMATION ???**
![](https://i.imgur.com/OYfsA0d.png)

# Linear approaches

Somehow trying to find a low-dimensional subspace in which the data would not be too distorted after projection.

- Johnson-Lindenstrauss lemma
- Classical scaling
- *(The one and only)* Principal Component Analysis
- And much more...

![](https://i.imgur.com/dugYCAS.png)

## Johnson-Lindenstrauss lemma

> It's not because you *can* that you *will*

:::danger
Let $0\lt\varepsilon\lt1$ and let $x_1,...,x_n$ be $n$ points in $\mathbb{R}^N$. Then there exists a linear map $f:\mathbb{R}^N\to\mathbb{R}^M$ such that for all points $x_i$ and $x_j$
$$
(1-\varepsilon)\Vert x_i-x_j \Vert^2\le \Vert f(x_i)-f(x_j) \Vert^2\le(1+\varepsilon)\Vert x_i-x_j \Vert^2
$$
with $M \ge \dfrac{4\ln(n)}{\varepsilon^2/2 - \varepsilon^3/3}$.
> Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics.
:::

![](https://i.imgur.com/w1g2pBu.png)

:::warning
The catch: the lemma only says that such a map exists; you still have to build it.
:::
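In practice, the classical way to realize such a map is a random Gaussian matrix scaled by $1/\sqrt{M}$ (this is what random-projection implementations such as scikit-learn's `GaussianRandomProjection` rely on). A minimal numpy sketch, with arbitrary toy values for $n$, $N$ and $\varepsilon$, just to check the distance-preservation claim:

```python
# Minimal numpy sketch of the Johnson-Lindenstrauss idea (illustrative values only):
# project N-dimensional points with a scaled random Gaussian matrix and check that
# pairwise squared distances are roughly preserved.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, N, eps = 100, 10_000, 0.2
M = int(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3))   # bound from the lemma

X = rng.normal(size=(n, N))                 # n points in R^N
F = rng.normal(size=(N, M)) / np.sqrt(M)    # random linear map f(x) = x @ F
Y = X @ F                                   # projected points in R^M

ratios = pdist(Y)**2 / pdist(X)**2          # squared-distance distortion per pair
print(M, ratios.min(), ratios.max())        # ratios lie in [1-eps, 1+eps] with high probability
```

Note that $M$ depends only on $n$ and $\varepsilon$, not on the original dimensionality $N$.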
## Classical scaling

Also called Principal Coordinates Analysis (PCoA)

> Lots of formulas here, but you just need to retain the overall idea

:::danger
PCoA: project the data points $X$ onto $Y$ with a linear mapping $M$, such that $Y = XM$ and such that all pairwise distances between points do not change too much before/after projection
:::

If $D$ is the $n \times n$ Euclidean distance matrix with entries $d_{ij} = \Vert x_i - x_j\Vert_2$ and $D^{(2)} = [d_{ij}^2]$, PCoA seeks the linear mapping $M$ that minimizes
$$
\phi(Y)=\sum_{i,j}(d_{ij}^2-\Vert y_i-y_j\Vert^2)
$$
with $y_i = x_iM$ and $\Vert m_i\Vert^2=1\ \forall i$

:::success
Solution: eigendecomposition (= diagonalization) of the Gram matrix $K = XX^T = E\Delta E^T$
:::

$K$ can be obtained by double centering $D^{(2)}$: $K=-\frac{1}{2}C_nD^{(2)}C_n$ with centering matrix $C_n=I_n-\frac{1}{n}ones(n,n)$

Optimal projection onto the first $M$ dimensions: $Y=E_M\Delta_M^{\frac{1}{2}}$ with $E_M$ the matrix of the $M$ leading eigenvectors of $K$ and $\Delta_M$ the corresponding eigenvalues.

## Principal component analysis

> Also known as the Karhunen-Loève transform

Closely related to PCoA, but operates on the covariance matrix $X_c^T X_c$

PCA seeks the linear mapping $M$ that maximizes the projection variance $\mathrm{tr}(M^T \mathrm{cov}(X)M)$ with $\Vert m_i\Vert^2 = 1\ \forall i$.

$$
X=\begin{bmatrix}
\overbrace{x_{11}}^{u_1=\text{column mean}} & \overbrace{x_{12}}^{u_2}\\
\vdots & \vdots\\
x_{n1} & x_{n2}
\end{bmatrix}
\Rightarrow \text{center the data}
$$

$$
X_c=\begin{bmatrix}
x_{11}-u_1 & x_{12}-u_2\\
\vdots & \vdots\\
x_{n1}-u_1 & x_{n2}-u_2
\end{bmatrix}
$$

1. Center the data: $X_c = C_nX$
    - (optional) Scale the data (divide each feature by its standard deviation)
2. Compute the covariance matrix $\Sigma=X_c^TX_c$
3. Perform the eigendecomposition $(E,\Delta)$ of $\Sigma$
4. Project onto the first $M$ principal axes: $Y=X_cE_M$

![](https://i.imgur.com/WJcHD4e.png)
![](https://i.imgur.com/Kw4ZzPC.png)

Data after projection is uncorrelated, but has lost some interpretability

![](https://i.imgur.com/93SANtL.png)
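To make the two recipes concrete, here is a minimal numpy sketch on a small random data set (all names and sizes are arbitrary). Up to sign flips of the axes, PCA and classical scaling of the same centered data yield the same embedding, which is the "closely related" part:

```python
# Minimal numpy sketch of the PCA and PCoA recipes above (toy data, arbitrary sizes).
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 200, 10, 2
X = rng.normal(size=(n, N)) @ rng.normal(size=(N, N))   # correlated toy data

# --- PCA: eigendecomposition of the covariance matrix ---
Xc = X - X.mean(axis=0)                    # center the data
cov = Xc.T @ Xc                            # N x N covariance (up to a 1/n factor)
w, E = np.linalg.eigh(cov)                 # eigenvalues in ascending order
E_M = E[:, ::-1][:, :M]                    # M leading principal axes
Y_pca = Xc @ E_M                           # n x M projection

# --- Classical scaling (PCoA): eigendecomposition of the Gram matrix ---
D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # squared Euclidean distances
Cn = np.eye(n) - np.ones((n, n)) / n                # centering matrix
K = -0.5 * Cn @ D2 @ Cn                             # double centering -> Gram matrix
lam, V = np.linalg.eigh(K)
lam_M, V_M = lam[::-1][:M], V[:, ::-1][:, :M]       # M largest eigenpairs
Y_pcoa = V_M * np.sqrt(lam_M)                       # n x M embedding

print(np.allclose(np.abs(Y_pca), np.abs(Y_pcoa)))   # should print True (up to signs)
```

The PCoA route goes through an $n \times n$ Gram matrix while PCA goes through an $N \times N$ covariance matrix, so which one is cheaper depends on whether you have more samples or more features.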
## Major challenges related to PCA

PCA is probably the most popular and widely used unsupervised linear dimensionality reduction technique, but it comes with a bunch of practical questions, the two main ones being:

1. How to automatically select the right number of dimensions to project onto?
![](https://i.imgur.com/EmehUrr.png)
![](https://i.imgur.com/2Kgc8Gj.png)
2. How to project a new data point on a learned projection subspace?
    - See you in lab session for the answer

# Non-linear approaches

When it is assumed that the data does not live in a Euclidean subspace (why would it anyway?), more advanced techniques are needed.

- Isomap
- Locally linear embedding
- Kernel Principal Component Analysis (aka PCA on steroids)
- Multilayer autoencoders
- And much more...

![](https://i.imgur.com/EEegGRS.png)
![](https://i.imgur.com/r8hJK6Z.png)

## Isomap

> Geodesic distance rocks

Isometric feature mapping: same idea as classical scaling, but using the geodesic distance instead of the Euclidean distance.

![](https://i.imgur.com/OlSpeIx.png)
![](https://i.imgur.com/fBSAufM.png)

1. Compute the k-nearest neighbor graph of the data $x_1,...,x_n$
2. Compute all pairwise geodesic distances (shortest paths in the graph)
3. Apply classical scaling

![](https://i.imgur.com/qc77r5y.png)

### Example

Isomap applied to some images of the digit 2 in the MNIST data

![](https://i.imgur.com/4wnzodY.png)

## Locally linear embedding

Locally linear embedding: the manifold can locally be considered Euclidean

![](https://i.imgur.com/cz5QFIk.png)

For each point $x_i$:
1. Get its k-nearest neighbors $x_j$, $j=1,...,k$
2. Get the weights $w_{ij}$ that best linearly reconstruct $x_i$ from its neighbors $x_j$: minimize $\sum_{i=1}^n\Vert x_i-\sum_j w_{ij}x_j\Vert^2$
![](https://i.imgur.com/9r9RLAB.png)
    - with constraints $\sum_j w_{ij}=1$ (closed-form solution)
3. Low-dimensional embedding $\to$ reconstruct $y_i$ from the $y_j$ with the same weights $w_{ij}$: minimize
$$
\sum_{i=1}^n\Vert y_i-\sum_j w_{ij}y_j\Vert^2
$$
with constraints $\frac{1}{n}\sum_iy_iy_i^T = I$ and $\sum_iy_i=0$ (eigendecomposition of a Gram matrix)

## The kernel trick

> When one actually wants to increase the dimension

Base idea: map $n$ non-linearly separable points to a (possibly infinite-dimensional) feature space where they would be linearly separable, using a mapping $\phi$

- How should we define $\phi$?
- Do we really want to compute stuff in a (possibly infinite-dimensional) feature space?

:::info
Mercer theorem: we do not need to know the mapping $\phi$ explicitly as long as we have a positive semi-definite kernel/Gram matrix $K=[k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$
:::

Widely used kernel functions:
- Polynomial kernel: $k(x_i,x_j)=(x_i^Tx_j+1)^d$
- Gaussian RBF kernel: $k(x_i,x_j)=e^{-\gamma\Vert x_i-x_j\Vert^2}$
- Sigmoid kernel: $k(x_i,x_j)=\tanh(bx_i^Tx_j+c)$

## Kernel PCA

> PCA on steroids

The maths behind it are quite involved, but the following scikit-learn-style recipe works fine:

1. Compute the kernel matrix $K=[k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$ and double-center it: $K_c=C_nKC_n$
2. The eigendecomposition of $K_c$ is strongly related to that of the (intractable) covariance matrix in the feature space $\to$ get the eigenvectors $V$ and corresponding eigenvalues $\Delta$ of $K_c$.
![](https://i.imgur.com/Mcio2jt.png)
3. Keep the first $M$ columns of $V\Delta^{\frac{1}{2}}$ to get the coordinates of the projected data points in the low $M$-dimensional space.
![](https://i.imgur.com/OkbLVXa.png)

But things get nasty when one wants to project a new data point $x$ that was not known when constructing the kernel...
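The non-linear methods above (Isomap, LLE, kernel PCA) all share the same estimator API in scikit-learn. A minimal sketch on the classical Swiss-roll toy data set, with arbitrary parameter values:

```python
# Minimal scikit-learn sketch of the non-linear methods above, on the classical
# Swiss-roll toy data set (parameter values are arbitrary illustrations).
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, random_state=0)   # points in R^3 on a 2D manifold

# Isomap: k-NN graph + geodesic distances + classical scaling
Y_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# LLE: local linear reconstruction weights, then global embedding
Y_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)

# Kernel PCA with a Gaussian RBF kernel (gamma chosen by hand here)
Y_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.002).fit_transform(X)

print(Y_iso.shape, Y_lle.shape, Y_kpca.shape)   # (1500, 2) each
```

Note that scikit-learn's `KernelPCA` keeps the training samples around and exposes a `transform` method for new points (it evaluates the kernel between the new point and the training set), which is one practical workaround for the out-of-sample issue mentioned just above.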
## Non-linear PCA

> Also known as an autoencoder

Overall idea:
1. Train an autoencoder (a neural network with an auto-associative architecture) to perform an identity mapping.
2. Use the output of the bottleneck layer as the low-dimensional code.

![](https://i.imgur.com/s1fuRi5.png)

The bottleneck code is a non-linear combination of the inputs (thanks to the activation functions of the encoder layers) $\to$ the learned mapping is a non-linear PCA.

![](https://i.imgur.com/rlaEDkG.png)

Principal components are generalized from straight lines to curves: the projection subspace described by all the nonlinear components is itself curved.

![](https://i.imgur.com/CtdC82G.png)

# Let's recap

High-dimensional data set $X$ is an $n \times N$ matrix, with $n =$ number of samples and $N =$ dimensionality of the underlying space.

![](https://i.imgur.com/tSi5kMZ.png)

- Parametric $\equiv$ explicit embedding from the high-dimensional space to the low-dimensional one
- For LLE: $p$ is the ratio of non-zero elements in a sparse matrix to the total number of elements
- For NL-PCA: $i$ is the number of iterations and $w$ is the number of weights in the neural network

## t-Distributed Stochastic Neighbor Embedding

t-SNE is a popular method to visualize in 2D or 3D what is going on in a high-dimensional space.

1. Construct a probability distribution $p$ over pairs of points in the high-dimensional space: the more similar (the closer) two points are, the higher the probability
2. Define a second probability distribution $q$ over the points in the low-dimensional space, and move the points around so that the Kullback-Leibler divergence between $p$ and $q$ is minimized

![](https://i.imgur.com/4645qRQ.png)
![](https://i.imgur.com/b4E3cZB.png)
![](https://i.imgur.com/PYnRtgM.png)

- t-SNE is excellent at visualizing well-separated clusters, but fails to preserve the global geometry of the data.
- t-SNE depends on a perplexity parameter, which reflects the scale at which close points are searched for (roughly, the effective number of neighbors of each point).

## Independent component analysis

ICA aims to solve the so-called *cocktail party problem*: retrieving independent sources that got mixed together with unknown scaling coefficients.

![](https://i.imgur.com/2QEhLKX.png)

Goal: estimate the sources $s$ **and** the mixing matrix $A$ from the observations $x = As$.

- Ill-posed $\Rightarrow$ enforce independence of the source components
- Works on higher-order statistics (PCA is limited to order-2 statistics)
- The unknown sources must **not** be Gaussian-distributed

![](https://i.imgur.com/KyAc8m2.png)

Contrary to PCA vectors, ICA vectors are not orthogonal and not ranked by importance, but they are mutually independent.
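To wrap up, a minimal scikit-learn sketch of the last two methods, with arbitrary parameter choices and toy data: t-SNE to look at the 64-dimensional `digits` data set in 2D, then `FastICA` on a synthetic cocktail-party mixture:

```python
# Minimal scikit-learn sketch of t-SNE and ICA (all parameter values are
# arbitrary illustrations).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA
from sklearn.manifold import TSNE

# --- t-SNE: 2D visualization of the 64-dimensional digits data set ---
X, labels = load_digits(return_X_y=True)
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)   # (1797, 2), ready to be scatter-plotted and colored by label

# --- ICA: toy cocktail party, three sources mixed by a random matrix ---
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]  # sources
A = rng.normal(size=(3, 3))                  # unknown mixing matrix
X_mix = S @ A.T                              # observed mixtures x = As

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X_mix)             # estimated sources
A_est = ica.mixing_                          # estimated mixing matrix
```

As noted above, FastICA only recovers the sources up to permutation and scaling, which is inherent to the ICA model.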