# IML : Dimensionality reduction

# Why do we care?

We have at hand $n$ points $x_1,..., x_n$ lying in some $N$-dimensional space, $x_i \in \mathbb{R}^N,\ \forall i = 1, ..., n$, compactly written as an $n \times N$ matrix $X$
- One row of $X$ = one sample
- One column of $X$ = a given feature value for all samples

![](https://i.imgur.com/CmbXcfy.png)

## Example of real high-dimensional data

:::warning
Real world data is very often high-dimensional
:::

### MNIST image classification:

![](https://i.imgur.com/psXUjPa.png)

- Sample $x$: image with 28x28 pixels
- Data set: 60000 samples
- Dimensionality: $x \in \mathbb{R}^{28\times28=784}$

### MUSE hyperspectral image analysis:

![](https://i.imgur.com/afKNwIT.png)

- Sample $x$: pixel with 3600 spectral bands
- Data set: image with 300x300 pixels
- Dimensionality: $x \in \mathbb{R}^{3600}$

> Used for instance to discriminate between galaxies

## The curse of dimensionality

:::danger
High-dimensional spaces suffer from the *curse of dimensionality* (also called Hughes' phenomenon)
:::

### In $\mathbb{R}$

![](https://i.imgur.com/vesSPAR.png)
![](https://i.imgur.com/D1xI2ST.png)

### In $\mathbb{R}^2$

To get back to the same sampling density:
![](https://i.imgur.com/jRNkOX5.png)

### In $\mathbb{R}^3$

![](https://i.imgur.com/7rK9gAn.png)

To get back to the same sampling density:
![](https://i.imgur.com/PUKy7sm.png)

$$
\frac{\nu(\mathbb{B}^n)}{\nu([-1,1]^n)}=\frac{\pi^{\frac{n}{2}}}{2^n\,\Gamma(\frac{n}{2}+1)}\xrightarrow[n\to+\infty]{}0
$$

where $\nu(\mathbb{B}^n)$ is the volume of the unit ball and $\nu([-1,1]^n)=2^n$ that of the enclosing cube.

:::warning
Points uniformly distributed in an $n$-cube of side 2 mostly fall outside of the unit ball!
![](https://i.imgur.com/rh4aYb4.png)
:::

# Why is it tricky?

- We naturally cannot picture anything that is more than 3D in our mind
![](https://i.imgur.com/JLWfSrR.png)
- Picturing something 3D on a 2D flat screen can already be misleading
- Real data naturally lives in (complex) high-dimensional spaces
- Real data is often strongly correlated

:::warning
And somehow, we want to have a good look at our data before feeding it to some machine learning algorithm *(can I use the inherent structure of my data to boost my machine learning performance?)*
:::

# How?

:::info
Dimensionality reduction: transform the data set $X$ with dimensionality $N$ into a new data set $Y$ (an $n \times M$ matrix) with dimensionality $M \lt N$ (ideally $M \ll N$) such that as **little information as possible is lost** in the process. $y_i$ ($i$th row of $Y$) is the low-dimensional counterpart *(the projection)* of $x_i$.
:::

**INFORMATION ???**
![](https://i.imgur.com/OYfsA0d.png)

# Linear approaches

Somehow trying to find a low-dimensional subspace in which the data would not be too distorted after projection.

- Johnson-Lindenstrauss lemma
- Classical scaling
- *(The one and only)* Principal Component Analysis
- And much more...

![](https://i.imgur.com/dugYCAS.png)

## Johnson-Lindenstrauss lemma

> It's not because you *can* that you *will*

:::danger
Let $0\lt\varepsilon\lt1$ and let $x_1,...,x_n$ be $n$ points in $\mathbb{R}^N$. Then there exists a linear map $f:\mathbb{R}^N\to\mathbb{R}^M$ such that for all points $x_i$ and $x_j$
$$
(1-\varepsilon)\Vert x_i-x_j \Vert^2\le \Vert f(x_i)-f(x_j) \Vert^2\le(1+\varepsilon)\Vert x_i-x_j \Vert^2
$$
with $M \ge \dfrac{4\ln(n)}{\varepsilon^2/2 - \varepsilon^3/3}$.
> Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics.
:::

![](https://i.imgur.com/w1g2pBu.png)

:::warning
The catch: the lemma only says that such a map exists; you still have to build it.
:::
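In practice, the classical way to realize such a map is a random Gaussian matrix scaled by $1/\sqrt{M}$ (this is what random-projection implementations such as scikit-learn's `GaussianRandomProjection` rely on). A minimal numpy sketch, with arbitrary toy values for $n$, $N$ and $\varepsilon$, just to check the distance-preservation claim:

```python
# Minimal numpy sketch of the Johnson-Lindenstrauss idea (illustrative values only):
# project N-dimensional points with a scaled random Gaussian matrix and check that
# pairwise squared distances are roughly preserved.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, N, eps = 100, 10_000, 0.2
M = int(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3))   # bound from the lemma

X = rng.normal(size=(n, N))                 # n points in R^N
F = rng.normal(size=(N, M)) / np.sqrt(M)    # random linear map f(x) = x @ F
Y = X @ F                                   # projected points in R^M

ratios = pdist(Y)**2 / pdist(X)**2          # squared-distance distortion per pair
print(M, ratios.min(), ratios.max())        # ratios lie in [1-eps, 1+eps] with high probability
```

Note that $M$ depends only on $n$ and $\varepsilon$, not on the original dimensionality $N$.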
## Classical scaling

Also called Principal Coordinates Analysis (PCoA)

> Lots of formulas here, but you just need to retain the overall idea

:::danger
PCoA: project the data points $X$ onto $Y$ with a linear mapping $M$, such that $Y = XM$ and such that all pairwise distances between points do not change too much before/after projection
:::

If $D$ is the $n \times n$ Euclidean distance matrix with entries $d_{ij} = \Vert x_i - x_j\Vert_2$ and $D^{(2)} = [d_{ij}^2]$, PCoA seeks the linear mapping $M$ that minimizes
$$
\phi(Y)=\sum_{i,j}(d_{ij}^2-\Vert y_i-y_j\Vert^2)
$$
with $y_i = x_iM$ and $\Vert m_i\Vert^2=1\ \forall i$

:::success
Solution: eigendecomposition (= diagonalization) of the Gram matrix $K = XX^T = E\Delta E^T$
:::

$K$ can be obtained by double centering $D^{(2)}$: $K=-\frac{1}{2}C_nD^{(2)}C_n$ with centering matrix $C_n=I_n-\frac{1}{n}ones(n,n)$

Optimal projection onto the first $M$ dimensions: $Y=E_M\Delta_M^{\frac{1}{2}}$ with $E_M$ the matrix of the $M$ leading eigenvectors of $K$ and $\Delta_M$ the corresponding eigenvalues.

## Principal component analysis

> Also known as the Karhunen-Loève transform

Closely related to PCoA, but operates on the covariance matrix $X_c^T X_c$

PCA seeks the linear mapping $M$ that maximizes the projection variance $\mathrm{tr}(M^T \mathrm{cov}(X)M)$ with $\Vert m_i\Vert^2 = 1\ \forall i$.

$$
X=\begin{bmatrix}
\overbrace{x_{11}}^{u_1=\text{column mean}} & \overbrace{x_{12}}^{u_2}\\
\vdots & \vdots\\
x_{n1} & x_{n2}
\end{bmatrix}
\Rightarrow \text{center the data}
$$

$$
X_c=\begin{bmatrix}
x_{11}-u_1 & x_{12}-u_2\\
\vdots & \vdots\\
x_{n1}-u_1 & x_{n2}-u_2
\end{bmatrix}
$$

1. Center the data: $X_c = C_nX$
    - (optional) Scale the data (divide each feature by its standard deviation)
2. Compute the covariance matrix $\Sigma=X_c^TX_c$
3. Perform the eigendecomposition $(E,\Delta)$ of $\Sigma$
4. Project onto the first $M$ principal axes: $Y=X_cE_M$

![](https://i.imgur.com/WJcHD4e.png)
![](https://i.imgur.com/Kw4ZzPC.png)

Data after projection is uncorrelated, but has lost some interpretability

![](https://i.imgur.com/93SANtL.png)
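To make the two recipes concrete, here is a minimal numpy sketch on a small random data set (all names and sizes are arbitrary). Up to sign flips of the axes, PCA and classical scaling of the same centered data yield the same embedding, which is the "closely related" part:

```python
# Minimal numpy sketch of the PCA and PCoA recipes above (toy data, arbitrary sizes).
import numpy as np

rng = np.random.default_rng(0)
n, N, M = 200, 10, 2
X = rng.normal(size=(n, N)) @ rng.normal(size=(N, N))   # correlated toy data

# --- PCA: eigendecomposition of the covariance matrix ---
Xc = X - X.mean(axis=0)                    # center the data
cov = Xc.T @ Xc                            # N x N covariance (up to a 1/n factor)
w, E = np.linalg.eigh(cov)                 # eigenvalues in ascending order
E_M = E[:, ::-1][:, :M]                    # M leading principal axes
Y_pca = Xc @ E_M                           # n x M projection

# --- Classical scaling (PCoA): eigendecomposition of the Gram matrix ---
D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # squared Euclidean distances
Cn = np.eye(n) - np.ones((n, n)) / n                # centering matrix
K = -0.5 * Cn @ D2 @ Cn                             # double centering -> Gram matrix
lam, V = np.linalg.eigh(K)
lam_M, V_M = lam[::-1][:M], V[:, ::-1][:, :M]       # M largest eigenpairs
Y_pcoa = V_M * np.sqrt(lam_M)                       # n x M embedding

print(np.allclose(np.abs(Y_pca), np.abs(Y_pcoa)))   # should print True (up to signs)
```

The PCoA route goes through an $n \times n$ Gram matrix while PCA goes through an $N \times N$ covariance matrix, so which one is cheaper depends on whether you have more samples or more features.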
## Major challenges related to PCA

PCA is probably the most popular and widely used unsupervised linear dimensionality reduction technique, but it comes with a bunch of practical questions, the two main ones being:

1. How to automatically select the right number of dimensions to project onto?
![](https://i.imgur.com/EmehUrr.png)
![](https://i.imgur.com/2Kgc8Gj.png)
2. How to project a new data point on a learned projection subspace?
    - See you in lab session for the answer

# Non-linear approaches

When it is assumed that the data does not live in a Euclidean subspace (why would it anyway?), more advanced techniques are needed.

- Isomap
- Locally linear embedding
- Kernel Principal Component Analysis (aka PCA on steroids)
- Multilayer autoencoders
- And much more...

![](https://i.imgur.com/EEegGRS.png)
![](https://i.imgur.com/r8hJK6Z.png)

## Isomap

> Geodesic distance rocks

Isometric feature mapping: same idea as classical scaling, but using the geodesic distance instead of the Euclidean distance.

![](https://i.imgur.com/OlSpeIx.png)
![](https://i.imgur.com/fBSAufM.png)

1. Compute the k-nearest neighbor graph of the data $x_1,...,x_n$
2. Compute all pairwise geodesic distances (shortest paths in the graph)
3. Apply classical scaling

![](https://i.imgur.com/qc77r5y.png)

### Example

Isomap applied to some images of the digit 2 in the MNIST data

![](https://i.imgur.com/4wnzodY.png)

## Locally linear embedding

Locally linear embedding: the manifold can locally be considered Euclidean

![](https://i.imgur.com/cz5QFIk.png)

For each point $x_i$:
1. Get its k-nearest neighbors $x_j$, $j=1,...,k$
2. Get the weights $w_{ij}$ that best linearly reconstruct $x_i$ from its neighbors $x_j$: minimize $\sum_{i=1}^n\Vert x_i-\sum_j w_{ij}x_j\Vert^2$
![](https://i.imgur.com/9r9RLAB.png)
    - with constraints $\sum_j w_{ij}=1$ (closed-form solution)
3. Low-dimensional embedding $\to$ reconstruct $y_i$ from the $y_j$ with the same weights $w_{ij}$: minimize
$$
\sum_{i=1}^n\Vert y_i-\sum_j w_{ij}y_j\Vert^2
$$
with constraints $\frac{1}{n}\sum_iy_iy_i^T = I$ and $\sum_iy_i=0$ (eigendecomposition of a Gram matrix)

## The kernel trick

> When one actually wants to increase the dimension

Base idea: map $n$ non-linearly separable points to a (possibly infinite-dimensional) feature space where they would be linearly separable, using a mapping $\phi$

- How should we define $\phi$?
- Do we really want to compute stuff in a (possibly infinite-dimensional) feature space?

:::info
Mercer theorem: we do not need to know the mapping $\phi$ explicitly as long as we have a positive semi-definite kernel/Gram matrix $K=[k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$
:::

Widely used kernel functions:
- Polynomial kernel: $k(x_i,x_j)=(x_i^Tx_j+1)^d$
- Gaussian RBF kernel: $k(x_i,x_j)=e^{-\gamma\Vert x_i-x_j\Vert^2}$
- Sigmoid kernel: $k(x_i,x_j)=\tanh(bx_i^Tx_j+c)$

## Kernel PCA

> PCA on steroids

The maths behind it are quite involved, but the following scikit-learn-style recipe works fine:

1. Compute the kernel matrix $K=[k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$ and double-center it: $K_c=C_nKC_n$
2. The eigendecomposition of $K_c$ is strongly related to that of the (intractable) covariance matrix in the feature space $\to$ get the eigenvectors $V$ and corresponding eigenvalues $\Delta$ of $K_c$.
![](https://i.imgur.com/Mcio2jt.png)
3. Keep the first $M$ columns of $V\Delta^{\frac{1}{2}}$ to get the coordinates of the projected data points in the low $M$-dimensional space.
![](https://i.imgur.com/OkbLVXa.png)

But things get nasty when one wants to project a new data point $x$ that was not known when constructing the kernel...
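The non-linear methods above (Isomap, LLE, kernel PCA) all share the same estimator API in scikit-learn. A minimal sketch on the classical Swiss-roll toy data set, with arbitrary parameter values:

```python
# Minimal scikit-learn sketch of the non-linear methods above, on the classical
# Swiss-roll toy data set (parameter values are arbitrary illustrations).
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, random_state=0)   # points in R^3 on a 2D manifold

# Isomap: k-NN graph + geodesic distances + classical scaling
Y_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# LLE: local linear reconstruction weights, then global embedding
Y_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)

# Kernel PCA with a Gaussian RBF kernel (gamma chosen by hand here)
Y_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.002).fit_transform(X)

print(Y_iso.shape, Y_lle.shape, Y_kpca.shape)   # (1500, 2) each
```

Note that scikit-learn's `KernelPCA` keeps the training samples around and exposes a `transform` method for new points (it evaluates the kernel between the new point and the training set), which is one practical workaround for the out-of-sample issue mentioned just above.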
## Non-linear PCA

> Also known as an autoencoder

Overall idea:
1. Train an autoencoder (a neural network with an auto-associative architecture) to perform an identity mapping.
2. Use the output of the bottleneck layer as the low-dimensional code.

![](https://i.imgur.com/s1fuRi5.png)

The bottleneck code is a non-linear combination of the inputs (thanks to the activation functions of the encoder layers) $\to$ the learned mapping is a non-linear PCA.

![](https://i.imgur.com/rlaEDkG.png)

Principal components are generalized from straight lines to curves: the projection subspace described by all the nonlinear components is itself curved.

![](https://i.imgur.com/CtdC82G.png)

# Let's recap

High-dimensional data set $X$ is an $n \times N$ matrix, with $n =$ number of samples and $N =$ dimensionality of the underlying space.

![](https://i.imgur.com/tSi5kMZ.png)

- Parametric $\equiv$ explicit embedding from the high-dimensional space to the low-dimensional one
- For LLE: $p$ is the ratio of non-zero elements in a sparse matrix to the total number of elements
- For NL-PCA: $i$ is the number of iterations and $w$ is the number of weights in the neural network

## t-Distributed Stochastic Neighbor Embedding

t-SNE is a popular method to visualize in 2D or 3D what is going on in a high-dimensional space.

1. Construct a probability distribution $p$ over pairs of points in the high-dimensional space: the more similar (the closer) two points are, the higher the probability
2. Define a second probability distribution $q$ over the points in the low-dimensional space, and move the points around so that the Kullback-Leibler divergence between $p$ and $q$ is minimized

![](https://i.imgur.com/4645qRQ.png)
![](https://i.imgur.com/b4E3cZB.png)
![](https://i.imgur.com/PYnRtgM.png)

- t-SNE is excellent at visualizing well-separated clusters, but fails to preserve the global geometry of the data.
- t-SNE depends on a perplexity parameter, which reflects the scale at which close points are searched for (roughly, the effective number of neighbors of each point).

## Independent component analysis

ICA aims to solve the so-called *cocktail party problem*: retrieving independent sources that got mixed together with unknown scaling coefficients.

![](https://i.imgur.com/2QEhLKX.png)

Goal: estimate the sources $s$ **and** the mixing matrix $A$ from the observations $x = As$.

- Ill-posed $\Rightarrow$ enforce independence of the source components
- Works on higher-order statistics (PCA is limited to order-2 statistics)
- The unknown sources must **not** be Gaussian-distributed

![](https://i.imgur.com/KyAc8m2.png)

Contrary to PCA vectors, ICA vectors are not orthogonal and not ranked by importance, but they are mutually independent.
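To wrap up, a minimal scikit-learn sketch of the last two methods, with arbitrary parameter choices and toy data: t-SNE to look at the 64-dimensional `digits` data set in 2D, then `FastICA` on a synthetic cocktail-party mixture:

```python
# Minimal scikit-learn sketch of t-SNE and ICA (all parameter values are
# arbitrary illustrations).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import FastICA
from sklearn.manifold import TSNE

# --- t-SNE: 2D visualization of the 64-dimensional digits data set ---
X, labels = load_digits(return_X_y=True)
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)   # (1797, 2), ready to be scatter-plotted and colored by label

# --- ICA: toy cocktail party, three sources mixed by a random matrix ---
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]  # sources
A = rng.normal(size=(3, 3))                  # unknown mixing matrix
X_mix = S @ A.T                              # observed mixtures x = As

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X_mix)             # estimated sources
A_est = ica.mixing_                          # estimated mixing matrix
```

As noted above, FastICA only recovers the sources up to permutation and scaling, which is inherent to the ICA model.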