
IML : Dimensionality reduction

Why do we care?

We have at hand $n$ points $x_1, \dots, x_n$ lying in some $N$-dimensional space, $x_i \in \mathbb{R}^N$, $i = 1, \dots, n$, compactly written as an $n \times N$ matrix $X$.

  • One row of $X$ = one sample
  • One column of $X$ = a given feature value for all samples


Example of real high-dimensional data

Real-world data is very often high-dimensional.

MNIST image classification:


  • Sample $x$: image with 28x28 pixels
  • Data set: 60,000 samples
  • Dimensionality: $x \in \mathbb{R}^{28 \times 28} = \mathbb{R}^{784}$

MUSE hyperspectral image analysis:


  • Sample $x$: pixel with 3600 spectral bands
  • Data set: image with 300x300 pixels
  • Dimensionality: $x \in \mathbb{R}^{3600}$


The curse of dimensionality

High-dimensional spaces suffer from the curse of dimensionality (also called Hughes' phenomenon).

In $\mathbb{R}$:


In $\mathbb{R}^2$:

To recover the same sampling density:


In $\mathbb{R}^3$:


To recover the same sampling density:


$$\frac{\nu(S_n)}{\nu([-1,1]^n)} = \frac{\pi^{n/2}}{2^n\,\Gamma\!\left(\frac{n}{2}+1\right)} \xrightarrow[n \to +\infty]{} 0$$

Points uniformly distributed in an $n$-cube of side 2 mostly fall outside of the unit sphere!
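As a quick numerical check, a minimal sketch (using `scipy.special.gamma`) of how fast this volume ratio collapses with the dimension $n$:

```python
import numpy as np
from scipy.special import gamma

def ball_to_cube_ratio(n):
    """Volume of the unit n-ball divided by the volume of the cube [-1, 1]^n."""
    return np.pi ** (n / 2) / (2 ** n * gamma(n / 2 + 1))

for n in [1, 2, 3, 5, 10, 20]:
    print(f"n = {n:2d}  ratio = {ball_to_cube_ratio(n):.2e}")

# The ratio is 1 for n = 1, ~0.52 for n = 3, and ~2.5e-8 for n = 20:
# almost all of the cube's volume lies outside the unit ball.
```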


Why is it tricky?

  • We naturally cannot picture anything that is more than 3D in our mind


  • Picturing something 3D on a flat 2D screen can already be misleading
  • Real data naturally lives in (complex) high-dimensional space
  • Real data is often strongly correlated

And somehow, we want to have a good look at our data before feeding it to some machine learning algorithm (can I use the inherent structure of my data to boost my machine learning performance?)

How?

Dimensionality reduction: transform the data set $X$ with dimensionality $N$ into a new data set $Y$ (an $n \times M$ matrix) with dimensionality $M < N$ (hopefully $M \ll N$) such that as little information as possible is lost in the process. $y_i$ (the $i$-th row of $Y$) is the low-dimensional counterpart (the projection) of $x_i$.

INFORMATION ???


Linear approaches

Somehow trying to find a low-dimensional subspace in which the data would not be too distorted after projection.

  • Johnson-Lindenstrauss lemma
  • Classical scaling
  • (The one and only) Principal Component Analysis
  • And much more


Johnson-Lindenstrauss lemma

It’s not because you can that you will

Let $0 < \varepsilon < 1$ and let $x_1, \dots, x_n$ be $n$ points in $\mathbb{R}^N$. Then there exists a linear map $f: \mathbb{R}^N \to \mathbb{R}^M$ such that for every pair of points $x_i$ and $x_j$

$$(1-\varepsilon)\,\|x_i - x_j\|^2 \le \|f(x_i) - f(x_j)\|^2 \le (1+\varepsilon)\,\|x_i - x_j\|^2,$$

with

$$M = \frac{4\log(n)}{\varepsilon^2/2 - \varepsilon^3/3}.$$

Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics.


The catch: you still have to find the matrix that realizes the map $f$.
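In practice the map is usually drawn at random: a Gaussian random projection satisfies the JL bound with high probability, which is what scikit-learn's random projection module relies on. A minimal sketch (with random toy data; the sizes and `eps` value are arbitrary choices):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10_000))   # n = 1000 points in N = 10000 dimensions

# Lower bound on M guaranteeing pairwise distance distortion below eps
eps = 0.2
M = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=eps)
print(M)          # on the order of a thousand dimensions, independent of N

# Random linear map f: R^N -> R^M
Y = GaussianRandomProjection(n_components=M, random_state=0).fit_transform(X)
print(Y.shape)    # (1000, M)
```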

Classical scaling

Also called Principal Coordinates Analysis (PCoA)

Lots of formulas here, but you just need to retain the overall idea.

PCoA: project the data points $X$ onto $Y$ with a linear mapping $M$, $Y = XM$, such that all pairwise distances between points do not change too much before/after projection.

If $D$ is the $n \times n$ Euclidean distance matrix with entries $d_{ij} = \|x_i - x_j\|_2$ and $D^{(2)} = [d_{ij}^2]$, PCoA seeks the linear mapping $M$ that minimizes

$$\phi(Y) = \sum_{i,j} \left( d_{ij}^2 - \|y_i - y_j\|^2 \right)$$

with $y_i = x_i M$ and $\|m_i\|_2 = 1 \;\; \forall i$.

Solution: eigendecomposition (= diagonalization) of the Gram matrix

$$K = XX^T = E \Delta E^T$$

$K$ can be obtained by double centering $D^{(2)}$: $K = -\frac{1}{2} C_n D^{(2)} C_n$, with centering matrix $C_n = I_n - \frac{1}{n}\,\mathrm{ones}(n,n)$.

Optimal projection onto the first $M$ dimensions: $Y = E_M \Delta_M^{1/2}$, with $E_M$ the matrix of the eigenvectors of $K$ associated with its $M$ largest eigenvalues.
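A minimal numpy sketch of this recipe (assuming a small random data matrix `X`; notation follows the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 100, 20, 2
X = rng.normal(size=(n, N))

# Squared Euclidean distance matrix D^(2)
sq_norms = (X ** 2).sum(axis=1)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T

# Double centering: K = -1/2 * C_n D^(2) C_n
C = np.eye(n) - np.ones((n, n)) / n
K = -0.5 * C @ D2 @ C

# Eigendecomposition of the Gram matrix (symmetric => eigh), sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Coordinates in the first M dimensions: Y = E_M * Delta_M^(1/2)
Y = eigvecs[:, :M] * np.sqrt(eigvals[:M])
print(Y.shape)  # (100, 2)
```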

Principal component analysis

Also known as the Karhunen-Loeve transform

Closely related to PCoA, but operates on the covariance matrix $X_c^T X_c$. PCA seeks the linear mapping $M$ that maximizes the projected variance $\mathrm{tr}(M^T \mathrm{cov}(X)\, M)$ with $\|m_i\|_2 = 1 \;\; \forall i$.

Data centering: subtract from each feature (column) its mean $\mu_j$:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots \\ \vdots & \vdots & \\ x_{n1} & x_{n2} & \cdots \end{bmatrix} \quad\longrightarrow\quad X_c = \begin{bmatrix} x_{11}-\mu_1 & x_{12}-\mu_2 & \cdots \\ \vdots & \vdots & \\ x_{n1}-\mu_1 & x_{n2}-\mu_2 & \cdots \end{bmatrix}$$

  1. Center the data: $X_c = C_n X$
     1.b (optional) Scale (standardize) the data
  2. Compute the covariance matrix $\Sigma = X_c^T X_c$
  3. Perform the eigendecomposition $(E, \Delta)$ of $\Sigma$
  4. Project onto the first $M$ principal axes: $Y = X_c E_M$ (see the numpy sketch below)
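A minimal numpy sketch of these four steps (assuming a random data matrix `X`; in practice `sklearn.decomposition.PCA` does the same job):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 100, 20, 2
X = rng.normal(size=(n, N))

# 1. Center the data (optionally also divide by the column std to standardize)
Xc = X - X.mean(axis=0)

# 2. Covariance matrix
Sigma = Xc.T @ Xc / (n - 1)

# 3. Eigendecomposition (Sigma is symmetric => eigh), sorted by decreasing eigenvalue
eigvals, E = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# 4. Project onto the first M principal axes
Y = Xc @ E[:, :M]
print(Y.shape)  # (100, 2)
```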


Data after projection is uncorrelated, but has lost some interpretability.


PCA is probably the most popular and widely used unsupervised linear dimensionality reduction technique, but it comes with a bunch of practical questions, the two main ones being:

  1. How to automatically select the right number of dimensions to project?
  2. How to project a new data point on a learned projection subspace?
    • See you in lab session for the answer

Non-linear approaches

When it is assumed that the data does not live in a Euclidean subspace (why would it anyway?), one must rely on more advanced techniques.

  • Isomap
  • Locally linear embedding
  • Kernel Principal Component Analysis (aka PCA on steroids)
  • Multilayer autoencoders
  • And much more

Isomap

Geodesic distance rocks

Isometric feature mapping: same idea as classical scaling, but using geodesic distance instead of Euclidean distance.


  1. Compute the k-nearest-neighbor graph of the data $x_1, \dots, x_n$
  2. Compute all pairwise geodesic distances (shortest paths in the graph)
  3. Apply classical scaling

Example

Isomap applied to some images of the digit 2 in MNIST data
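A minimal sketch with scikit-learn's `Isomap` on its small built-in digits dataset (used here as a stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# 8x8 digit images, flattened to 64-dimensional vectors
X, y = load_digits(return_X_y=True)

# k-NN graph + geodesic distances + classical scaling, all in one estimator
embedding = Isomap(n_neighbors=10, n_components=2)
Y = embedding.fit_transform(X)
print(Y.shape)  # (1797, 2)
```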

Locally linear embedding

Locally linear embedding: the manifold can be locally considered Euclidean

For each point $x_i$:

  1. Get its k nearest neighbors $x_j$, $j = 1, \dots, k$
  2. Get the weights $w_{ij}$ that best linearly reconstruct $x_i$ from its neighbors $x_j$: minimize $\sum_{i=1}^n \big\| x_i - \sum_j w_{ij}\, x_j \big\|^2$
    • with constraints $\sum_j w_{ij} = 1$ (closed-form solution)
  3. Low-dimensional embedding: reconstruct $y_i$ from the $y_j$ with the same weights $w_{ij}$:

minimize

$$\sum_{i=1}^n \Big\| y_i - \sum_j w_{ij}\, y_j \Big\|^2$$

with constraints

$$\frac{1}{n}\sum_i y_i y_i^T = I \quad \text{and} \quad \sum_i y_i = 0$$

(eigendecomposition of a Gram matrix)
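Again, scikit-learn wraps all of this; a minimal sketch with `LocallyLinearEmbedding` on the classic S-curve toy data:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

# 3-D points lying on a 2-D manifold (an "S"-shaped sheet)
X, color = make_s_curve(n_samples=1000, random_state=0)

# k-NN reconstruction weights + low-dimensional embedding with the same weights
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)
print(Y.shape)  # (1000, 2)
```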

The kernel trick

When one actually wants to increase the dimension

Basic idea: map $n$ non-linearly-separable points to a (possibly infinite-dimensional) feature space where they would be linearly separable, using a function $\phi$.

  • How should we define $\phi$?
  • Do we really want to compute stuff in a (possibly infinite-dimensional) feature space?

Mercer's theorem: we do not need to know the mapping $\phi$ explicitly as long as we have a positive semi-definite kernel/Gram matrix $K = [k(x_i, x_j)] = [\langle \phi(x_i), \phi(x_j) \rangle]$.

Widely used kernel functions:

  • Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + 1)^d$
  • Gaussian RBF kernel: $k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$
  • Sigmoid kernel: $k(x_i, x_j) = \tanh(b\, x_i^T x_j + c)$
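These kernels are readily available in scikit-learn; a small sketch computing the corresponding Gram matrices on random data (the `gamma`/`coef0` values are arbitrary choices):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

K_poly = polynomial_kernel(X, degree=3, gamma=1, coef0=1)  # (x_i^T x_j + 1)^3
K_rbf = rbf_kernel(X, gamma=0.5)                           # exp(-gamma * ||x_i - x_j||^2)
K_sig = sigmoid_kernel(X, gamma=1.0, coef0=0.0)            # tanh(gamma * x_i^T x_j + coef0)

print(K_poly.shape, K_rbf.shape, K_sig.shape)              # three (50, 50) Gram matrices
```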

Kernel PCA

PCA on steroids

The maths behind it are quite hard, but the following scikit-learn recipe works fine:

  1. Compute the kernel matrix $K = [k(x_i, x_j)] = [\langle \phi(x_i), \phi(x_j)\rangle]$ and double-center it: $K_c = C_n K C_n$
  2. The eigendecomposition of $K_c$ is strongly related to that of the (intractable) covariance matrix in the feature space: get the eigenvectors $V$ and corresponding eigenvalues $\Delta$ of $K_c$
  3. Keep the first $M$ columns of $V \Delta^{1/2}$ to get the coordinates of the projected data points in the low $M$-dimensional space (see the sketch below).

But things get nasty when one wants to project a new data point $x$ that was not known when constructing the kernel matrix.
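scikit-learn's `KernelPCA` implements this recipe (and its `transform` method handles new points via the kernel between new and training samples); a minimal sketch on toy data where linear PCA would fail:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Y = kpca.fit_transform(X)     # eigendecomposition of the centered kernel matrix
print(Y.shape)                # (400, 2)

# Projecting new points uses the kernel between new and training samples
Y_new = kpca.transform(X[:5])
```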

Non-linear PCA

Also known as autoencoder

Overall idea:

  1. train an autoencoder (neural network with an autoassociative architecture) to perform an identity mapping.
  2. use the output of the bottleneck layer as low-dimensional code.

The bottleneck code is a non-linear combination of the inputs (thanks to the activation functions on the encoder layers), so the learned mapping is a non-linear PCA.

Principal components are generalized from straight lines to curves: the projection subspace which is described by all nonlinear components is also curved.
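A minimal PyTorch sketch of this idea (assuming 784-dimensional inputs like flattened MNIST digits; the layer sizes and the 2-D bottleneck are arbitrary choices, and random data stands in for real images):

```python
import torch
from torch import nn

# Encoder compresses to a 2-D bottleneck, decoder reconstructs the input
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 2))
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784))
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, 784)          # stand-in for a batch of flattened images

for epoch in range(100):          # identity-mapping training: reconstruct X from X
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)
    loss.backward()
    optimizer.step()

codes = encoder(X)                # low-dimensional (2-D) code from the bottleneck
print(codes.shape)                # torch.Size([256, 2])
```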

Let’s recap

The high-dimensional data set $X$ is an $n \times N$ matrix, with $n$ = number of samples and $N$ = dimensionality of the underlying space.

  • Parametric = explicit embedding from the high-dimensional space to the low-dimensional one
  • For LLE: $p$ is the ratio of non-zero elements in a sparse matrix to the total number of elements
  • For NL-PCA: $i$ is the number of iterations and $w$ is the number of weights in the neural network

t-Distributed Stochastic Neighbor Embedding

t-SNE is a popular method to see in 2D or 3D what is going on in a high-dimensional space.

  1. Construct a probability distribution $p$ over pairs of points in the high-dimensional space: the more similar (the closer) two points are, the higher the probability
  2. Define a second probability distribution $q$ over the points in the low-dimensional space, and place the points such that the distance between $p$ and $q$ is minimized (in the sense of the Kullback-Leibler divergence)

  • t-SNE is excellent at visualizing well-separated clusters, but fails to preserve the global geometry of the data.
  • t-SNE depends on a perplexity parameter, which reflects the scale at which close points are searched for.
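A minimal sketch with scikit-learn's `TSNE` on the digits data (the `perplexity` value here is just the library default, kept explicit):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# 2-D embedding minimizing the KL divergence between high-dim and low-dim similarities
tsne = TSNE(n_components=2, perplexity=30.0, random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)  # (1797, 2)
```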

Independent component analysis

ICA aims to provide a solution to the so-called cocktail party problem: retrieving independent sources that were mixed together with unknown scaling coefficients.

Goal: estimate the sources $s$ and the mixing matrix $A$ from the observations $x = As$.

  • Ill-posed problem: enforce independence of the source components
  • Works on higher-order statistics (PCA is limited to order-2 statistics)
  • The unknown sources must not be Gaussian-distributed

Contrary to PCA vectors, ICA vectors are not orthogonal and not ranked by importance, but they are mutually independent.
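A minimal cocktail-party sketch with scikit-learn's `FastICA` (the synthetic sources and mixing matrix below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two independent, non-Gaussian sources s (a sine and a square wave)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]

# Unknown mixing matrix A and observations x = A s
A = np.array([[1.0, 0.5],
              [0.4, 1.2]])
X = S @ A.T

# Recover estimates of the sources and of the mixing matrix
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # estimated sources (up to scale and permutation)
A_est = ica.mixing_              # estimated mixing matrix
print(S_est.shape, A_est.shape)  # (2000, 2) (2, 2)
```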