IML: Dimensionality reduction
Why do we care?
We have at hand $n$ points lying in some $D$-dimensional space, compactly written as a matrix $X \in \mathbb{R}^{n \times D}$
- One row of $X$ = one sample
- One column of $X$ = a given feature value for all samples
Example of real high-dimensional data
Real-world data is very often high-dimensional.
MNIST image classification:
- Sample: image with 28×28 pixels
- Data set: 60,000 samples
- Dimensionality: 28 × 28 = 784
MUSE hyperspectral image analysis:
- Sample: pixel with 3600 spectral bands
- Data set: image with 300×300 pixels
- Dimensionality: 3600
Typical task: discriminating galaxies based on their spectra.
The curse of dimensionality
High-dimensional spaces suffer from the curse of dimensionality (also called Hughes’ phenomenon).
On $[0,1]$: a few samples are enough to cover the interval densely.
On $[0,1]^2$: to get back to the same sampling density, the number of samples must be squared.
On $[0,1]^3$: to get back to the same sampling density, the number of samples must be cubed.
More generally, the number of samples required to keep a constant sampling density grows exponentially with the dimension.
Points uniformly distributed in a $D$-cube of side 2 mostly fall outside of the unit sphere when $D$ is large!
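This can be checked numerically. Below is a minimal sketch (assuming NumPy; the number of points and the list of dimensions are arbitrary illustrative choices) that estimates the fraction of points drawn uniformly in the cube $[-1,1]^D$ that fall inside the unit sphere:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of random points per dimension (illustrative choice)

for D in (2, 3, 5, 10, 20):
    # Draw points uniformly in the D-cube [-1, 1]^D (side 2, centered at the origin)
    points = rng.uniform(-1.0, 1.0, size=(n, D))
    # Fraction of points falling inside the unit sphere ||x|| <= 1
    inside = np.linalg.norm(points, axis=1) <= 1.0
    print(f"D = {D:2d}: fraction inside unit sphere = {inside.mean():.4f}")
```

The fraction collapses towards zero as $D$ grows, which is exactly the statement above.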
Why is it tricky?
- We naturally cannot picture anything with more than 3 dimensions in our mind
- Picturing something 3D on a flat 2D screen can already be misleading
- Real data naturally lives in (complex) high-dimensional space
- Real data is often strongly correlated
And somehow, we want to have a good look at our data before feeding it to some machine learning algorithm (can I use the inherent structure of my data to improve my machine learning performance?)
How?
Dimensionality reduction: transform a data set $X$ with dimensionality $D$ into a new data set $Y$ ($n \times d$ matrix) with dimensionality $d$ (hopefully $d \ll D$) such that as little information as possible is lost in the process.
$y_i$ (the $i$-th row of $Y$) is the low-dimensional counterpart (the projection) of $x_i$.
INFORMATION ???
Linear approaches
Somehow trying to find a low-dimensional subspace onto which the data can be projected without too much distortion.
- Johnson-Lindenstrauss lemma
- Classical scaling
- (The one and only) Principal Component Analysis
- And much more…
Johnson-Lindenstrauss lemma
Just because you can does not mean you will
Let $0 < \varepsilon < 1$ and let $x_1, \dots, x_n$ be $n$ points in $\mathbb{R}^D$. Then there exists a linear map $f : \mathbb{R}^D \to \mathbb{R}^d$ such that for every pair of points $x_i$ and $x_j$:
$$(1 - \varepsilon)\,\|x_i - x_j\|^2 \;\leq\; \|f(x_i) - f(x_j)\|^2 \;\leq\; (1 + \varepsilon)\,\|x_i - x_j\|^2$$
with $d = O\!\left(\varepsilon^{-2} \log n\right)$.
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics.
The catch: you still have to find the matrix of the linear map $f$.
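In practice, a random Gaussian matrix satisfies the lemma with high probability. Below is a minimal sketch assuming scikit-learn's random projection utilities (`GaussianRandomProjection`, `johnson_lindenstrauss_min_dim`); the synthetic data, `eps`, and the sizes are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10_000))  # 500 synthetic points in a 10,000-dimensional space

eps = 0.2  # tolerated pairwise-distance distortion
d = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=eps)
print(f"Target dimension guaranteed by the lemma: {d}")

# A random Gaussian projection matrix satisfies the bound with high probability
projector = GaussianRandomProjection(n_components=d, random_state=0)
Y = projector.fit_transform(X)
print(Y.shape)
```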
Classical scaling
Also called Principal Coordinates Analysis (PCoA)
Lots of formulas here, but you just need to retain the overall idea.
PCoA: project the data points $x_i$ onto $y_i$ with a linear mapping such that all pairwise distances between points change as little as possible before/after projection.
If $\Delta$ is the Euclidean distance matrix with entries $\delta_{ij} = \|x_i - x_j\|$ and $Y$ is the projected data set, PCoA seeks the linear mapping that minimizes
$$\phi(Y) = \sum_{i,j} \left( \delta_{ij}^2 - \|y_i - y_j\|^2 \right)$$
Solution: eigendecomposition (= diagonalisation) of the Gram matrix $B = X X^\top$.
$B$ can be obtained by double centering the squared distances: $B = -\tfrac{1}{2} J \Delta^{(2)} J$, with centering matrix $J = I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top$ and $\Delta^{(2)}$ the matrix of squared distances $\delta_{ij}^2$.
Optimal projection onto the first $d$ dimensions: $Y = V_d \Lambda_d^{1/2}$, with $V_d$ the matrix of the $d$ largest eigenvectors of $B$ and $\Lambda_d$ the diagonal matrix of the corresponding eigenvalues.
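A minimal NumPy sketch of this recipe (double centering followed by an eigendecomposition); the helper name `classical_scaling` and the toy data are illustrative choices:

```python
import numpy as np

def classical_scaling(X, d=2):
    """Classical scaling (PCoA) from pairwise Euclidean distances."""
    n = X.shape[0]
    # Squared Euclidean distance matrix
    sq_dists = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    # Double centering: B = -1/2 * J * Delta^(2) * J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ sq_dists @ J
    # Eigendecomposition of the Gram matrix (eigh returns ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]
    # Coordinates: largest eigenvectors scaled by the square root of their eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

Y = classical_scaling(np.random.default_rng(0).normal(size=(100, 10)), d=2)
print(Y.shape)  # (100, 2)
```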
Principal component analysis
Also known as the Karhunen-Loève transform.
Closely related to PCoA, but operates on the covariance matrix $\Sigma = \tfrac{1}{n} X^\top X$ (for centered data $X$). PCA seeks the linear mapping $W$ that maximizes the variance of the projected data, $\operatorname{trace}(W^\top \Sigma W)$, with $W^\top W = I$.
1. Center the data: $X \leftarrow X - \bar{X}$
1.b (optional) Standardize the data (divide each feature by its standard deviation)
2. Compute the covariance matrix $\Sigma = \tfrac{1}{n} X^\top X$
3. Perform the eigendecomposition of $\Sigma$
4. Project onto the first $d$ principal axes (the eigenvectors associated with the $d$ largest eigenvalues): $Y = X W_d$ (a minimal NumPy sketch of these steps is given below)
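The sketch below follows the steps above with NumPy (the helper name `pca` and the toy data are illustrative choices; in practice scikit-learn's `PCA` does essentially the same job via an SVD):

```python
import numpy as np

def pca(X, d=2):
    """PCA via eigendecomposition of the covariance matrix (minimal sketch)."""
    # 1. Center the data
    Xc = X - X.mean(axis=0)
    # 2. Compute the covariance matrix
    cov = np.cov(Xc, rowvar=False)
    # 3. Eigendecomposition (eigh: symmetric matrix, ascending eigenvalues)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    # 4. Project onto the first d principal axes
    W = eigvecs[:, order[:d]]
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
print(pca(X, d=2).shape)  # (200, 2)
```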
Data after projection is uncorrelated, but has lost some interpretability.
PCA is probably the most popular and widely used unsupervised linear dimensionality reduction technique, but it comes with a bunch of practical questions, the two principal ones being:
- How to automatically select the right number of dimensions to project onto?
- How to project a new data point onto a learned projection subspace?
- See you in the lab session for the answers
Non-linear approaches
When it is assumed that the data does not live in a Euclidean subspace (why would it anyway?), more advanced techniques must be relied upon.
- Isomap
- Locally linear embedding
- Kernel Principal Component Analysis (aka PCA on steroids)
- Multilayer autoencoders
- And much more…


Isomap
Geodesic distance rocks
Isometric feature mapping: same idea as classical scaling, but using geodesic distance instead of Euclidean distance.


- Compute the k-nearest-neighbor graph of the data
- Compute all pairwise geodesic distances (shortest paths in this graph)
- Apply classical scaling to the geodesic distance matrix (see the sketch below)

Example
Isomap applied to some images of the digit 2 in the MNIST data set.

Locally linear embedding
Locally linear embedding: the manifold can be locally considered Euclidean

For each point $x_i$:
- get its k-nearest neighbors $x_j$, $j \in \mathcal{N}(i)$,
- get the weights $w_{ij}$ that best linearly reconstruct $x_i$ from its neighbors: minimize $\sum_i \big\| x_i - \sum_{j \in \mathcal{N}(i)} w_{ij} x_j \big\|^2$
- with constraints $\sum_j w_{ij} = 1$ (closed-form solution)
- Low-dimensional embedding: reconstruct $y_i$ from its neighbors with the same weights $w_{ij}$:
minimize $\sum_i \big\| y_i - \sum_{j \in \mathcal{N}(i)} w_{ij} y_j \big\|^2$
with constraints $\tfrac{1}{n} \sum_i y_i y_i^\top = I$ and $\sum_i y_i = 0$ (eigendecomposition of a Gram matrix); a minimal sketch is given below
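A minimal sketch assuming scikit-learn's `LocallyLinearEmbedding` (the toy data and parameter values are illustrative choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Reconstruction weights from 10 nearest neighbors, then low-dimensional embedding
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y = lle.fit_transform(X)
print(Y.shape)  # (1000, 2)
```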
The kernel trick
When one actually wants to increase the dimension
Basic idea: map non-linearly-separable points to a (possibly infinite-dimensional) feature space where they would be linearly separable, with a mapping function $\phi$.
- How should we define $\phi$?
- Do we really want to compute stuff in a (possibly infinite-dimensional) feature space?
Mercer's theorem: we do not need to know the mapping $\phi$ explicitly as long as we have a positive semi-definite kernel/Gram matrix $K$ with entries $K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
Widely used kernel functions (see the sketch below for the corresponding Gram matrices):
- Polynomial kernel: $k(x_i, x_j) = (x_i^\top x_j + c)^p$
- Gaussian RBF kernel: $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
- Sigmoid kernel: $k(x_i, x_j) = \tanh(\alpha\, x_i^\top x_j + c)$
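A minimal sketch computing these Gram matrices, assuming scikit-learn's pairwise kernel helpers (the toy points and parameter values are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

X = np.random.default_rng(0).normal(size=(5, 3))  # 5 toy points in 3D

# Gram matrices for the three kernels above (parameter values are illustrative)
K_poly = polynomial_kernel(X, degree=3, coef0=1.0)
K_rbf = rbf_kernel(X, gamma=0.5)            # gamma = 1 / (2 * sigma^2)
K_sig = sigmoid_kernel(X, gamma=0.1, coef0=0.0)
print(K_poly.shape, K_rbf.shape, K_sig.shape)  # all (5, 5)
```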
Kernel PCA
PCA on steroids
The maths behind it are quite involved, but the following recipe (as implemented in scikit-learn) works fine:
- Compute the kernel matrix $K$ and double-center it: $\tilde{K} = J K J$ with $J = I - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top$
- The eigendecomposition of $\tilde{K}$ is strongly related to that of the (intractable) covariance matrix in the feature space: get the eigenvectors $V$ and the corresponding eigenvalues $\Lambda$ of $\tilde{K}$
- Keep the first $d$ columns of $V \Lambda^{1/2}$ to get the coordinates of the projected data points in the low $d$-dimensional space

But things get nasty when one wants to project a new data point that was not known when constructing the kernel…
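A minimal sketch assuming scikit-learn's `KernelPCA` with an RBF kernel (the concentric-circles toy data and the `gamma` value are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Gaussian RBF kernel PCA; gamma = 1 / (2 * sigma^2) is a tuning choice
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Y = kpca.fit_transform(X)
print(Y.shape)  # (400, 2) -- the two circles typically become separable in the first components
```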
Non-linear PCA
Also known as autoencoder
Overall idea:
- train an autoencoder (neural network with an autoassociative architecture) to perform an identity mapping.
- use the output of the bottleneck layer as low-dimensional code.

The bottleneck code is a non-linear combination of the input entries (thanks to the activation functions on the encoder layers), so the learned mapping is a non-linear PCA.

Principal components are generalized from straight lines to curves: the projection subspace, which is described by all non-linear components, is also curved.
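A minimal sketch of such an autoencoder, assuming PyTorch (the architecture sizes, the activation, and the random placeholder batch are illustrative choices, not a prescribed design):

```python
import torch
from torch import nn

D, d = 784, 2  # e.g. flattened 28x28 images down to a 2-dimensional code (illustrative)

encoder = nn.Sequential(nn.Linear(D, 64), nn.Tanh(), nn.Linear(64, d))
decoder = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, D))
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, D)  # placeholder batch; in practice, use real (normalized) data

for epoch in range(10):  # identity mapping: reconstruct the input from the bottleneck code
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    codes = encoder(X)   # low-dimensional code = output of the bottleneck layer
print(codes.shape)       # torch.Size([256, 2])
```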

Let’s recap
A high-dimensional data set $X$ is an $n \times D$ matrix, with $n$ the number of samples and $D$ the dimensionality of the underlying space.

- Parametric: the method provides an explicit embedding from the high-dimensional space to the low-dimensional one
- For LLE: $p$ is the ratio of non-zero elements in a sparse matrix to the total number of elements
- For NL-PCA: $i$ is the number of iterations and $w$ is the number of weights in the neural network
t-Distributed Stochastic Neighbor Embedding
t-SNE is a popular method to visualize in 2D or 3D what is going on in a high-dimensional space.
- Construct a probability distribution $p$ over pairs of points in the high-dimensional space: the more similar (the closer) two points are, the higher the probability
- Define a second probability distribution $q$ over pairs of points in the low-dimensional space, and arrange the points such that the discrepancy between $p$ and $q$ is minimized (in the sense of the Kullback-Leibler divergence); see the sketch below
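A minimal sketch assuming scikit-learn's `TSNE` (the digits data set and the perplexity value are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 samples of 8x8 digit images (64 dimensions)

# Perplexity controls the effective number of neighbors considered per point
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)  # (1797, 2)
```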



- t-SNE is excellent at visualizing well-separated clusters, but fails to preserve the global geometry of the data.
- t-SNE depends on a perplexity parameter, which reflects the scale at which close points are searched for.
Independent component analysis
ICA aims to provide a solution to the so-called cocktail party problem: retrieving independent sources that got mixed together with unknown scaling coefficients.

Goal: estimate the sources $S$ and the mixing matrix $A$ from the observations $X = AS$.
- The problem is ill-posed: enforce independence of the source components
- Works on higher-order statistics (PCA is limited to order-2 statistics)
- The unknown sources must not be Gaussian-distributed

Contrary to PCA vectors, ICA vectors are not orthogonal and not ranked by importance, but they are mutually independent.
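A minimal sketch of the cocktail-party setting, assuming scikit-learn's `FastICA` (the two synthetic sources and the mixing matrix are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent (non-Gaussian) sources: a sine wave and a square wave
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5], [0.4, 1.2]])  # "unknown" mixing matrix (chosen here for illustration)
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # estimated sources
A_est = ica.mixing_                      # estimated mixing matrix
print(S_est.shape, A_est.shape)          # (2000, 2) (2, 2)
```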