# Chapter 8
# Dimension Reduction
<!-- .slide: style="font-size: 24px;" ,valign="top" -->
<style>
.reveal .slides {
text-align: left;
}
</style>
Hands on Machine Learning
[slides](https://hackmd.io/@yisyuan9887/handsonML_chapt8)
[colab](https://colab.research.google.com/drive/1YcxDoQkG4Tuiqf9u_nbVg8YzwQaZHxd-?usp=sharing)
---
<!-- .slide: style="font-size: 24px;" -->
## Contents:
- Dimension Reduction Intro
- Main Approaches
- PCA
- Kernel PCA
- Locally Linear Embedding
- Other Dimension Reduction Techniques
---
## Why do dimension reduction?
<!-- .slide: style="font-size: 24px;" -->
1. Speed up training
2. Filter out unnecessary details
3. Visualization
Note: We should always first try to train the system with the original data, and only consider dimensionality reduction if training is too slow, since reducing dimensionality does lose some information.
---
### Curse of Dimensionality
<!-- .slide: style="font-size: 24px;" -->
Many things behave differently in high dimensional space.
Consider $$X=(X_1,\dots,X_n) \sim \mathcal{U}[0,1]^n$$
<center><img width="500" height="200" src="https://i.imgur.com/SUJPfRe.png"></center>
1. The more dimensions we consider, the more likely a point is to be extreme along at least one of them.\
$\text{Pr}(\exists\, i \in \{1,\dots,n\}:\ X_i<0.01 \text{ or } X_i>0.99)=1-0.98^n \to 1$ as $n \to \infty$
2. Randomly pick two points and measure the distance between them.\
$E[||X-Y||_2] \to \sqrt{n/6}$ as $n \to \infty$
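A quick Monte Carlo sketch of the average-distance claim (a minimal example assuming NumPy; the sample sizes and dimensions are illustrative, with the largest dimension kept at $10^4$ rather than $10^6$ to stay memory-friendly):
```python
import numpy as np

rng = np.random.default_rng(42)

# Empirically estimate E[||X - Y||_2] for random points in the unit hypercube [0, 1]^n
for n in (1, 2, 3, 10_000):
    X = rng.random((500, n))          # 500 random points
    Y = rng.random((500, n))          # 500 more random points
    avg_dist = np.linalg.norm(X - Y, axis=1).mean()
    print(f"n={n:>6}: empirical ≈ {avg_dist:.2f}, sqrt(n/6) ≈ {np.sqrt(n / 6):.2f}")
```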
Note:
Assume $X$ lies in an n-dimensional unit hypercube.
\
$n=1$: $E[||X-Y||_2] \approx 0.33$
$n=2$: $E[||X-Y||_2] \approx 0.52$
$n=3$: $E[||X-Y||_2] \approx 0.66$
$n=10^6$: $E[||X-Y||_2] \approx 408.25$
A new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. <font color = red>The more dimensions the training set has, the greater the risk of overfitting it.</font>
---
## Main Approaches of Dimensionality Reduction
<!-- .slide: style="font-size: 24px;" -->
- Projection
- Manifold Learning
---
### Projection
<!-- .slide: style="font-size: 24px;" -->
In most real-world problems, training instances are not spread out uniformly across all dimensions.
<img align="left" width="500" height="350" src="https://i.imgur.com/L124k78.png"><img align="right" width="450" height="350" src="https://i.imgur.com/73F1vOO.png">
Note:
1. The left figure shows a dataset lying close to a hyperplane (3D → 2D).
2. The right figure shows the scatter plot of the dataset after projecting it onto that hyperplane (axes $z_1$, $z_2$).
- Projection is not always the best approach to dimensionality reduction.
---
### Manifold Learning
<!-- .slide: style="font-size: 24px;" -->
- A d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane
- A d-dimensional shape is bent or twisted in a higher-dimensional space.
<center><img width="500" height="350" src="https://i.imgur.com/ctHNyJZ.png"></center>
Note: In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but it is rolled in the third dimension.
---
### Manifold Learning
<!-- .slide: style="font-size: 24px;" -->
Manifold Assumption:
1. Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
2. (Implicit) The task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.
<center><img align="center" width="500" height="350" src="https://i.imgur.com/J2j7EKL.png"></center>
Note: Decision boundary may not always be simpler in lower dimension.
---
<!-- .slide: style="font-size: 24px;" -->
- Suppose that we wish to visualize $n$ observations with measurements on a set of $p$ features, $X_1,X_2,\dots,X_p$, as part of an exploratory data analysis
- We would like to find a low-dimensional representation of the data that captures as much of the information as possible
---
## PCA
<!-- .slide: style="font-size: 24px;" -->
- PCA identifies the hyperplane that lies closest to the data, and then projects the data onto it.
- Preserving the Variance:
<center><img align="center" width="600" height="300" src="https://i.imgur.com/MKuLt3K.png"></center>
- Select the axis that preserves the maximum amount of variance in order to lose less information than other projections
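A minimal Scikit-Learn sketch of this idea (the toy 3D dataset below is made up for illustration; only the API calls reflect the library):
```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 3D dataset lying close to a 2D plane (illustrative)
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(t), 0.7 * np.sin(t), 0.3 * np.cos(t) + 0.1 * rng.standard_normal(200)]

pca = PCA(n_components=2)      # keep the two axes that preserve the most variance
X2D = pca.fit_transform(X)     # project the data onto those axes

print(pca.components_)                 # rows are the principal component axes
print(pca.explained_variance_ratio_)   # fraction of variance preserved by each axis
```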
Note:
1. PCA identifies the axis that accounts for the largest amount of variance in the training set.
2. The second axis is orthogonal to the first one and accounts for the largest amount of remaining variance.
---
### Principal Components
<!-- .slide: style="font-size: 24px;" -->
- The $i^{th}$ axis is called the $i^{th}$ principal component (PC)
- How to get the PC?
- PCA finds a sequence of linear combinations of the variables that have maximal variance.
- The first <font color = red>principal component score</font> of a set of features $X_1,X_2,...,X_p$ is the normalized linear combination of the features that has the largest variance.$$Z_1 = \phi_{11}X_1+\phi_{21}X_2+\cdots+\phi_{p1}X_p$$
The loadings are constrained by $\sum_{j=1}^p \phi_{j1}^2=1$; otherwise they could be scaled up to give an arbitrarily large variance
$\phi_{11},\dots,\phi_{p1}$ : the loadings of the 1st principal component
$\Phi_1 = (\phi_{11}\ \phi_{21}\dots\phi_{p1})^T$ : the 1st principal component loading vector (the so-called "PC")
Note: The loadings must be constrained; otherwise setting these elements to arbitrarily large absolute values would lead to an arbitrarily large variance.
---
### Computation of PCA
<!-- .slide: style="font-size: 24px;" -->
- The sample variance of the $z_{i1}$ $$\frac{1}{n}\sum_{i=1}^nz_{i1}^2=\frac{1}{n}\sum_{i=1}^n(\sum_{j=1}^p\phi_{j1}x_{ij})^2$$
- The first principal component loading vector $\Phi_1$ solves the optimization problem: $$\max\limits_{\phi_{11},\dots,\phi_{p1}} \frac{1}{n} \sum_{i=1}^n (\sum_{j=1}^p\phi_{j1}x_{ij})^2 \text{ subject to } \sum_{j=1}^p \phi_{j1}^2 =1$$
Note:
- PC1 $\Phi_1$ with elements $\phi_{11},\phi_{21},\dots,\phi_{p1}$ defines a direction in feature space along which the data vary the most
- The second principal component score $Z_2$ is uncorrelated with $Z_1$, which is equivalent to requiring $\Phi_2$ to be orthogonal (perpendicular) to the direction $\Phi_1$.
- There are at most $\min(n-1,p)$ principal components
---
### Implementation of PCA - By SVD
<!-- .slide: style="font-size: 24px;" -->
- Singular Value Decomposition (SVD) factorizes the training set matrix into the matrix product $X = U \Sigma V^T$
- The sample covariance matrix $$\frac{X^{T}X}{(n-1)} = \frac{V\Sigma U^TU\Sigma V^T}{(n-1)} = V\frac{\Sigma ^2}{(n-1)}V^T = V \Lambda V^T $$
- PC scores for all PCs are $Z = X_{proj}=XV = U\Sigma$
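A minimal NumPy sketch of these identities (the toy data here is an assumption; note that the columns must be centered before taking the SVD):
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.2])  # toy data, p = 3 features

X_centered = X - X.mean(axis=0)        # PCA assumes centered columns

U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
V = Vt.T                               # columns of V are the PC loading vectors

Z = X_centered @ V                     # PC scores Z = XV, identical to U @ diag(s)
assert np.allclose(Z, U * s)

d = 2
X_d_proj = X_centered @ V[:, :d]       # projection onto the first d PCs
```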
Note: connect to colab
---
### Details with PCA
<!-- .slide: style="font-size: 24px;" -->
- The direction of the principal components is not stable
- The sign may flip from one run to another, but the axis it defines stays the same
- Choosing the right number of Dimensions
Note: scree plot, the 0.95 rule of thumb, etc.
---
### Proportion Variance Explained
<!-- .slide: style="font-size: 28px;" -->
- Total Variance present in a data set: $\sum_{j=1}^p Var(X_j) = \sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2$
- Variance explained by the $m^{th}$ principal component: $Var(Z_m) = \frac{1}{n} \sum_{i=1}^n z_{im}^2$
- Total variance = projected variance + approximation error
$\sum_{j=1}^p Var(X_j)=\sum_{m=1}^M Var(Z_m)+\frac{1}{n}\sum_{j=1}^p\sum_{i=1}^n(x_{ij}-\sum_{m=1}^M\phi_{jm}z_{im})^2$
- Proportion Variance Explained for $m^{th}$ PC: $\frac{Var(Z_m)}{\sum_{j=1}^p Var(X_j)}$
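In Scikit-Learn this ratio is exposed as `explained_variance_ratio_`, and passing a float to `n_components` keeps enough PCs to reach that fraction of total variance (the 0.95 rule of thumb from the note above). A sketch on the digits dataset, chosen here only as a convenient example:
```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64-dimensional digit images

pca = PCA().fit(X)
print(pca.explained_variance_ratio_[:5])   # PVE of the first few PCs

# Keep enough components to explain 95% of the total variance
pca95 = PCA(n_components=0.95)
X_reduced = pca95.fit_transform(X)
print(pca95.n_components_, X_reduced.shape)
```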
---
#### Reconstruction Error
<!-- .slide: style="font-size: 24px;" -->
- Reconstruction: decompress the reduced dataset back to $p$ dimensions by applying the inverse transformation of the PCA projection $$X_{recon} = X_{d\text{-}proj}V_d^T = (XV_d)V_d^T \approx X$$ (only an approximation, since $V_dV_d^T$ is a projection matrix and is not the identity unless $d = p$)
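A minimal sketch with Scikit-Learn's `inverse_transform` (the digits dataset and `n_components=30` are illustrative choices):
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=30)
X_reduced = pca.fit_transform(X)                  # compress: X V_d
X_recovered = pca.inverse_transform(X_reduced)    # decompress: (X V_d) V_d^T (+ mean)

print(np.mean((X - X_recovered) ** 2))            # mean squared reconstruction error
```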
Note:link to colab
---
### Incremental PCA
<!-- .slide: style="font-size: 24px;" -->
- Standard PCA requires the whole training set to fit in memory
- Incremental PCA (IPCA) splits the training set into mini-batches and feeds the algorithm one mini-batch at a time (see the sketch below)
- Useful for large training sets
- Also makes it possible to apply PCA online
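A minimal sketch of feeding mini-batches with `partial_fit` (the dataset, batch count, and component count are illustrative):
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y=True)

inc_pca = IncrementalPCA(n_components=16)
for X_batch in np.array_split(X, 50):   # feed one mini-batch at a time
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X)
```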
---
### Randomized PCA
<!-- .slide: style="font-size: 24px;" -->
- Uses a stochastic algorithm to quickly find an approximation of the first $d$ principal components, which speeds up the computation when $d$ is much smaller than $p$
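In Scikit-Learn this is selected via the `svd_solver` argument (a sketch; the dataset and component count are illustrative):
```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# "randomized" uses a stochastic algorithm to approximate the first d PCs
rnd_pca = PCA(n_components=30, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)
```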
Note: Randomized PCA obtains the principal components faster; see the performance/time comparison in the colab.
---
## Kernel PCA
<!-- .slide: style="font-size: 24px;" -->
- Suppose that instead of using the points $x_i$ as is, we wanted to go to some different <font color = red>feature space </font>$\phi(x_i)\in \mathbb{R}^N$
- In the higher dimensional space, we can then do PCA
Note: In Chapter 5, the kernel trick was used to map data into a high-dimensional feature space to perform nonlinear classification and regression with SVMs.
Here we can use a similar approach for PCA.
That is, the kernel implicitly maps the data into a higher-dimensional space, and applying standard PCA there projects the data onto a lower-dimensional subspace, where the samples can often be separated by a linear classifier.
---
## Kernel PCA
<!-- .slide: style="font-size: 24px;" -->
- Similarity Matrix:$$K = \Phi(X)\Phi(X)^T$$
- Linear kernel: $k(x_i,x_j) = x_i^Tx_j$
- RBF kernel: $k(x_i,x_j) = \exp(-\gamma ||x_i-x_j||^2)$
- Sigmoid kernel: $k(x_i,x_j) = \tanh(\gamma x_i^Tx_j+c_0)$
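A minimal Scikit-Learn sketch with the RBF kernel (the Swiss roll data and the value of `gamma` are illustrative assumptions):
```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

# gamma is a hyperparameter of the RBF kernel and usually needs tuning
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
```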
Note: Similarity matrix.
The kernel gives the inner product in the high-dimensional feature space directly from the original inputs, without explicitly computing the mapping (the kernel trick).
RBF = Radial Basis Function
---
### Select a kernel and Tuning parameter
<!-- .slide: style="font-size: 24px;" -->
- Unsupervised learning is not easy to evaluate directly
- Wrap the reduction step in a `Pipeline` and tune it with `GridSearchCV` on a downstream supervised task (see the sketch below)
- Alternatively, minimize the reconstruction error
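A sketch of the grid-search approach (the binary labels `y` derived from the Swiss roll position and the parameter ranges are made-up illustrations):
```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
y = (t > 6.9).astype(int)        # illustrative binary labels

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)  # kernel and gamma giving the best downstream accuracy
```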
Note: See the colab.
---
## Locally Linear Embedding(LLE)
<!-- .slide: style="font-size: 24px;" -->
- Manifold Learning technique
<center><img width="500" height="350" src="https://i.imgur.com/kmNPXw8.jpg"></center>
---
### Locally Linear Embedding(LLE)
<!-- .slide: style="font-size: 28px;" -->
- First step:$$\hat{W} = argmin_{W}\sum_{i=1}^n ||x^{(i)}-\sum_{j=1}^n w_{i,j}x^{(j)}||^2 \\\text{subject to}\begin{cases}
w_{i,j}=0 & \text{if $x^{(j)}$ is not one of the $k$ closest neighbors of $x^{(i)}$}.\\
\sum_{j=1}^n w_{i,j}=1 & \text{for $i = 1,2,\cdots,n$}
\end{cases}$$
- Second step: reduce dimensionality while preserving these local relationships$$\hat{Z} = argmin_Z\sum_{i=1}^n ||z^{(i)}-\sum_{j=1}^n \hat{w}_{i,j}z^{(j)}||^2 $$
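A minimal Scikit-Learn sketch that unrolls a Swiss roll with LLE (`n_neighbors` corresponds to the $k$ closest neighbors in the first step; the values are illustrative):
```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)   # 2D embedding preserving local relationships
```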
Note:
1. First, compute how each data point can be linearly reconstructed from its closest neighbors.
2. Then map the points into a d-dimensional space, preserving these local linear relationships as much as possible after the reduction.
That is, the second step is to map the training instances into a d-dimensional space (where d < n) while preserving these local relationships as much as possible.
---
## Other Dimensionality Reduction Techniques
<!-- .slide: style="font-size: 24px;" -->
- Multidimensional Scaling(MDS) : reduces dimensionality while trying to preserve the distances between the instances
- Isomap: creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve <font color = blue>geodesic distances</font> between the instances
- t-Distributed Stochastic Neighbor Embedding (t-SNE): tries to keep similar instances close and dissimilar instances apart
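All three are available in `sklearn.manifold`; a minimal sketch (the digits subset is an illustrative choice, and MDS/t-SNE are slow on large datasets):
```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS, Isomap, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]                      # keep it small: MDS and t-SNE scale poorly

X_mds = MDS(n_components=2, random_state=42).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
```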
Note:
- Linear Discriminant Analysis (LDA): a classification algorithm that learns the most discriminative axes between the classes, which can be used to define a hyperplane onto which to project the data.
- Benefit: the projection keeps the classes as far apart as possible
---
{"metaMigratedAt":"2023-06-16T16:59:58.838Z","metaMigratedFrom":"YAML","title":"chapter8.Dimension Reduction","breaks":true,"slideOptions":"{\"theme\":\"simple\",\"spotlight\":true}","contributors":"[{\"id\":\"3982f129-8306-4a0f-a02d-74f245f5a8f3\",\"add\":17882,\"del\":6835}]"}