# Principal Component Analysis (PCA)
## What is PCA?
A dimensionality reduction technique that compresses a large dataset into a smaller one while preserving its significant patterns and trends. In other words, PCA helps us summarize large, complex datasets into a simpler form that is easier to analyze, visualize, or interpret.
Imagine a dataset containing 15 different features. There is no way to visualize such data in 3-D, let alone on a 2-D plane. PCA addresses this issue.
## Eigenvalues & Eigenvectors
To understand the math behind PCA, we first need to understand the concept of eigenvalues and eigenvectors.
> Intuition:
Imagine applying a transformation (like stretching or rotating) to a vector space. Most vectors change direction, but eigenvectors are special — they only get stretched or shrunk, not rotated. The amount they are stretched by is called the eigenvalue. So, eigenvectors point in the directions that stay "true" under transformation, and eigenvalues tell us how much they are scaled.
#### Mathematical Definition
If $A$ is an $n \times n$ matrix, then a non-zero vector $x$ is called an eigenvector of $A$ if:
$$
A x = \lambda x
$$
Here:
- $x$ is the eigenvector,
- $\lambda$ is the eigenvalue,
- $A$ is the transformation matrix.
This means applying $A$ to $x$ doesn’t change its direction, only its length.
#### Derivation
$$
A x = \lambda x
$$
$$
A x - \lambda x = 0
$$
$$
(A - \lambda I) x = 0
$$
For a non-trivial solution ($x \ne 0$), the matrix $(A - \lambda I)$ must be non-invertible, so its determinant is zero:
$$
\boxed{\det(A - \lambda I) = 0}
$$
This is called the characteristic equation — solving it gives you the eigenvalues $\lambda$. You can then plug each one back in to find the corresponding eigenvectors $x$.
## Calculating Eigenvalues & Eigenvectors: An Example
Given matrix:
$$
A = \begin{bmatrix} -5 & 2 \\ -7 & 4 \end{bmatrix}
$$
We are tasked with finding its **eigenvalues** and **eigenvectors**.
---
### Step 1: Find Eigenvalues
Use the formula:
$$
\det(A - \lambda I) = 0
$$
where $I$ is the $2 \times 2$ identity matrix:
$$
I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
$$
Substituting:
$$
\det\left(\begin{bmatrix} -5 & 2 \\ -7 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) = 0
$$
This simplifies to:
$$
\det\left(\begin{bmatrix} -5 - \lambda & 2 \\ -7 & 4 - \lambda \end{bmatrix}\right) = 0
$$
Expand the determinant:
$$
(-5 - \lambda)(4 - \lambda) - (-7)(2) = 0
$$
$$
-20 + 5\lambda - 4\lambda + \lambda^2 + 14 = 0
$$
$$
\lambda^2 + \lambda - 6 = 0
$$
$$
(\lambda + 3)(\lambda - 2) = 0
$$
$$
\implies \boxed{\lambda = -3, \quad 2}
$$
The eigenvalues of the matrix $A$ are $-3$ and $2$.
---
### Step 2: Find Eigenvectors
#### Case 1: $\lambda = 2$
Solve:
$$
(A - 2I)x = 0
$$
Substituting:
$$
\left(\begin{bmatrix} -5 & 2 \\ -7 & 4 \end{bmatrix} - \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\right) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0
$$
$$
\begin{bmatrix} -7 & 2 \\ -7 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0
$$
$$
-7x_1 + 2x_2 = 0
$$
$$
x_1 = \frac{2}{7}x_2
$$
Thus, choosing $x_2 = 7$, an eigenvector corresponding to $\lambda = 2$ is:
$$
x = \begin{bmatrix} 2 \\ 7 \end{bmatrix}
$$
---
#### Case 2: $\lambda = -3$
Solve:
$$
(A + 3I)x = 0
$$
Substituting:
$$
\left(\begin{bmatrix} -5 & 2 \\ -7 & 4 \end{bmatrix} + \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}\right) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0
$$
$$
\begin{bmatrix} -2 & 2 \\ -7 & 7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 0
$$
$$
-2x_1 + 2x_2 = 0
$$
$$
x_1 = x_2
$$
Thus, choosing $x_2 = 1$, an eigenvector corresponding to $\lambda = -3$ is:
$$
x = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
$$
---
### Final Answer
- The eigenvalues are $\lambda = -3$ and $\lambda = 2$.
- The corresponding eigenvectors are:
    - For $\lambda = 2$: $\begin{bmatrix} 2 \\ 7 \end{bmatrix}$
    - For $\lambda = -3$: $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$
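We can sanity-check this result numerically; a minimal NumPy sketch using the same matrix $A$ as above:
```python
import numpy as np

A = np.array([[-5, 2],
              [-7, 4]])

# np.linalg.eig returns the eigenvalues and a matrix whose COLUMNS are the
# corresponding unit-length eigenvectors; the order is not guaranteed.
eigenvalues, eigenvectors = np.linalg.eig(A)

print(eigenvalues)     # -3 and 2 (possibly in a different order)
print(eigenvectors)    # columns proportional to [1, 1] and [2, 7]
```
NumPy normalizes each eigenvector to unit length, so the columns come out as scaled versions of the hand-computed $[2, 7]^T$ and $[1, 1]^T$.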
## A thought experiment
Let's consider the below data points on a 2-D plane.

When projecting the data points onto the x-axis, we can see that they are more spread out than when they are projected onto the y-axis. Thus, we retain more information by projecting onto the x-axis.
In other words, the direction of maximum variance is going to retain the most information.
If we can find a vector in the direction of maximum variance and project all the data points onto that vector, we retain the most information while also reducing the dimension from 2 to 1.

From the above image, we can see that the vector in the direction of maximum variance and the vector that follows the trend of the data are the same.
This is called **Maximum Variance Formulation.**
## Maximum Variance Formulation
Let's consider a set of $N$ data points $\{x_n\}$, where $n = 1, 2, \dots, N$, each with $D$ dimensions, and suppose we want to reduce the data to $M$ dimensions ($M < D$).
So the idea here is to find the directions of maximum variance in the $D$-dimensional space and project the data onto them to reduce it to $M$ dimensions.
### Steps to find the direction of max variance:
Step 1: Take any random unit vector $u_1$ in some direction.
Step 2: Project all the data points onto that unit vector $u_1$.
Step 3: Calculate the variance of this projected data.
Step 4: Maximize that variance with respect to $u_1$ to find the direction of maximum variance.
Let's see how to perform these steps mathematically.
#### Step 1: Random unit vector $\mathbf{u}_1$
$$
\mathbf{u}_1^T\mathbf{u}_1=1
$$
> The magnitude of $\mathbf{u}_1$ is 1 → it is a unit vector.
---
#### Step 2: Projection of $\mathbf{x}_n$ (each data point) onto $\mathbf{u}_1$
$$
\mathbf{u}_1^T\mathbf{x}_n
$$
> A simple dot product gives the projection of each data point onto the vector.
---
#### Step 3: Calculate the variance
The formula below calculates the variance of the data projected onto a unit vector $u_1$. It measures how much the projected values deviate from their mean projection.
PCA uses this to find the direction $u_1$ that captures the maximum variance in the data.
$$
\text{Variance} = \frac{1}{N} \sum_{n=1}^{N} \left(u_1^T x_n - u_1^T \bar{x}\right)^2
$$
> Variance is the sum of squared differences between the actual values and their mean, divided by the number of data points.
where,
$u_1^T x_n$ → projection of one data point onto $u_1$
$u_1^T \bar{x}$ → projection of the mean of the data onto $u_1$
$N$ → total number of data points in the dataset
Taking $u_1^T$ out, since it is common to both terms:
$$
\text{Variance} = \frac{1}{N} \sum_{n=1}^{N} \left(u_1^T (x_n - \bar{x}) \right)^2
$$
Let $z_n = x_n - \bar{x}$. Then,
$$
\text{Variance} = \frac{1}{N} \sum_{n=1}^{N} \left(u_1^T z_n\right)^2
$$
We can rewrite the inside terms as,
$$
\left(u_1^T z_n\right)^2 = \left(u_1^T z_n\right)\left(z_n^T u_1\right) = u_1^T z_n z_n^T u_1
$$
Substituting back into the sum:
$$
\text{Variance} = \frac{1}{N} \sum_{n=1}^{N} u_1^T z_n z_n^T u_1
$$
Since $u_1$ does not depend on $n$, it can be pulled out of the sum:
$$
\text{Variance} = u_1^T \left( \frac{1}{N} \sum_{n=1}^{N} z_n z_n^T \right) u_1
$$
By definition of the covariance matrix,
$$
S = \frac{1}{N} \sum_{n=1}^{N} z_n z_n^T = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T
$$
Finally the expression becomes:
$$
\text{Variance} = u_1^T S u_1
$$
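As a quick numerical check, here is a small NumPy sketch on made-up data, confirming that the mean of the squared projections equals $u_1^T S u_1$:
```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))               # toy data, values are illustrative only

u = rng.normal(size=3)
u /= np.linalg.norm(u)                      # make u a unit vector

Z = X - X.mean(axis=0)                      # centered data, z_n = x_n - x_bar
var_direct = np.mean((Z @ u) ** 2)          # (1/N) * sum_n (u^T z_n)^2
S = Z.T @ Z / len(Z)                        # covariance matrix with the 1/N convention
var_quadratic = u @ S @ u                   # u^T S u

print(np.isclose(var_direct, var_quadratic))   # True
```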
---
#### Step 4: Maximize variance w.r.t. the unit vector
We want to maximize the projected variance:
$$
\arg\max_{u_1} \left( u_1^T S u_1 \right)
$$
> Note: $\arg\max$ returns the argument that maximizes the expression; here, that is the direction $u_1$ itself rather than the maximum variance value. NumPy's `argmax` works the same way, returning the index (position) of the maximum value instead of the value itself. Refer to this [Link](https://www.geeksforgeeks.org/numpy-argmax-python/).
There is an issue here: if we just maximize $u_1^T S u_1$ without any constraint, the solution becomes trivial. The bigger we make $\mathbf{u}_1$ (the larger its magnitude), the bigger the variance gets; it can grow without bound.
Since we only care about the direction, not the magnitude, we force $\mathbf{u}_1$ to be a unit vector.
$$
\|u_1\| = 1 \quad \text{or} \quad u_1^T u_1 = 1 \quad\Rightarrow \quad 1 - u_1^T u_1 = 0
$$
> Note: $u_1^T u_1 =||u_1||^2$
This ensures that we are only searching for the best direction. To include this constraint while maximizing, we use a **Lagrange multiplier** $\lambda_1$.
So we rewrite the objective as,
$$
\arg\max_{u_1} \left( u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1) \right)
$$
Now we take the first derivative of this objective with respect to $u_1$, using the vector derivative identities shown in the image below, and equate it to zero.
<center>
<img src="https://hackmd.io/_uploads/HkwteVryeg.png" width="30%">
</center>
$$
2 S u_1 - 2 \lambda_1 u_1 = 0
$$
This can be rewritten as:
$$
\boxed{S u_1 = \lambda_1 u_1}
$$
The above equation is of the form $A x = \lambda x$, so $u_1$ must be an **eigenvector of the covariance matrix $S$**. Left-multiplying by $u_1^T$ gives $u_1^T S u_1 = \lambda_1$, i.e. the projected variance equals the eigenvalue, so the variance is maximized by choosing the eigenvector with the **largest eigenvalue**. This direction of maximum variance is called the **first principal component**.
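This can also be verified numerically. The sketch below (on synthetic correlated data, values chosen only for illustration) checks that the eigenvector of $S$ with the largest eigenvalue achieves at least as much projected variance as any randomly sampled unit direction, and that this variance equals the largest eigenvalue:
```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic correlated 2-D data (illustrative values only)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0],
                                          [1.0, 0.5]])
Z = X - X.mean(axis=0)
S = Z.T @ Z / len(Z)                      # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)      # eigh: ascending eigenvalues for symmetric S
u_top = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

# compare projected variance along u_top against many random unit directions
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
random_variances = np.array([u @ S @ u for u in dirs])

print(u_top @ S @ u_top >= random_variances.max())    # True
print(np.isclose(u_top @ S @ u_top, eigvals[-1]))      # projected variance = largest eigenvalue
```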
---
### Steps to Perform Dimensionality Reduction using PCA
In the above thought experiment, we have reduced the dimension from 2 to 1, so we got only one eigenvector.
To reduce from $D$ dimensions to $M$, we will use the **top M eigenvalues**, and their corresponding eigenvectors will be the directions of maximum variance.
- Compute the **covariance matrix** $S$, of our data.
- Find the **eigenvalues** and **eigenvectors** of the covariance matrix $S$.
- Order the **eigenvalues from largest to smallest**.
- Select the **top M eigenvectors** corresponding to the largest eigenvalues — these are our **principal components**.
- Multiply the **mean-centered data** by the selected **principal components** to get the **new dataset** in the reduced-dimensional space.
## PCA Example
Consider the below dataset.
| Sample | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
| -------- | --------- | --------- | --------- | --------- | --------- |
|1 | 2 | 0 | 1 | 3 | 5|
|2 | 3 | 2 | 2 | 4 | 7|
|3 | 4 | 4 | 3 | 5 | 9|
|4 | 5 | 6 | 4 | 6 | 11|
In matrix form:
$$
X = \begin{bmatrix}
2 & 0 & 1 & 3 & 5 \\
3 & 2 & 2 & 4 & 7 \\
4 & 4 & 3 & 5 & 9 \\
5 & 6 & 4 & 6 & 11
\end{bmatrix}
$$
Computing the covariance matrix using the formula below:
$$
S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T
$$
Where $x_n$ is the $n$-th sample (a row of $X$) and $\bar{x}$ is the vector of column means.
After calculation:
$$
X_c = X - \bar{x} = \begin{bmatrix}
-1.5 & -3.0 & -1.5 & -1.5 & -3.0 \\
-0.5 & -1.0 & -0.5 & -0.5 & -1.0 \\
0.5 & 1.0 & 0.5 & 0.5 & 1.0 \\
1.5 & 3.0 & 1.5 & 1.5 & 3.0
\end{bmatrix}
$$
$$
{X_c}^T X_c = \begin{bmatrix}
5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\
10.0 & 20.0 & 10.0 & 10.0 & 20.0 \\
5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\
5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\
10.0 & 20.0 & 10.0 & 10.0 & 20.0
\end{bmatrix}
$$
$$
\frac{1}{N} \, {X_c}^T X_c = \frac{1}{4} \begin{bmatrix}
5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\
10.0 & 20.0 & 10.0 & 10.0 & 20.0 \\
5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\
5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\
10.0 & 20.0 & 10.0 & 10.0 & 20.0
\end{bmatrix}
$$
$$
S =
\begin{bmatrix}
1.25 & 2.5 & 1.25 & 1.25 & 2.5 \\
2.5 & 5.0 & 2.5 & 2.5 & 5.0 \\
1.25 & 2.5 & 1.25 & 1.25 & 2.5 \\
1.25 & 2.5 & 1.25 & 1.25 & 2.5 \\
2.5 & 5.0 & 2.5 & 2.5 & 5.0
\end{bmatrix}
$$
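This hand computation can be cross-checked with NumPy. One detail to note: `np.cov` divides by $N-1$ by default, so `bias=True` is needed to match the $1/N$ convention used above. A minimal sketch:
```python
import numpy as np

X = np.array([[2, 0, 1, 3, 5],
              [3, 2, 2, 4, 7],
              [4, 4, 3, 5, 9],
              [5, 6, 4, 6, 11]], dtype=float)

X_c = X - X.mean(axis=0)                 # center each column
S = X_c.T @ X_c / X.shape[0]             # 1/N convention, as in the text

# np.cov treats columns as variables when rowvar=False; bias=True uses 1/N
print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))   # True
print(np.round(S, 2))
```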
Calculating the eigenvalues and eigenvectors of $S$ (the eigenvectors are the columns of the matrix below):
Eigenvalues:
$$
\begin{bmatrix}
0.0 \\
13.75 \\
0.0 \\
0.0 \\
0.0
\end{bmatrix}
$$
Eigenvectors:
$$
\begin{bmatrix}
-0.9535 & 0.3015 & 0.0 & 0.0 & 0.0 \\
0.1907 & 0.6030 & 0.7746 & 0.0 & 0.0 \\
0.0953 & 0.3015 & -0.2582 & -0.4082 & -0.8165 \\
0.0953 & 0.3015 & -0.2582 & 0.8816 & -0.2367 \\
0.1907 & 0.6030 & -0.5164 & -0.2367 & 0.5266
\end{bmatrix}
$$
Ordering the eigenvectors (the columns of the matrix above) by their eigenvalues, from largest to smallest:

| Eigenvalue | Eigenvector |
| --- | --- |
| 13.75 | [0.3015, 0.6030, 0.3015, 0.3015, 0.6030] |
| 0.0 | [-0.9535, 0.1907, 0.0953, 0.0953, 0.1907] |
| 0.0 | [0.0, 0.7746, -0.2582, -0.2582, -0.5164] |
| 0.0 | [0.0, 0.0, -0.4082, 0.8816, -0.2367] |
| 0.0 | [0.0, 0.0, -0.8165, -0.2367, 0.5266] |
Since we are converting from 5D to 2D, we choose the top 2 eigenvectors and multiply our centered data $X_c$ by them. Finally, we get:
$$
\text{Reduced Data} =
\begin{bmatrix}
-4.9749 & -0.0 \\
-1.6583 & -0.0 \\
1.6583 & 0.0 \\
4.9749 & 0.0 \\
\end{bmatrix}
$$
## PCA using Python
```python
import numpy as np
# Step 1: Original data
X = np.array([
[2, 0, 1, 3, 5],
[3, 2, 2, 4, 7],
[4, 4, 3, 5, 9],
[5, 6, 4, 6, 11]
])
# Step 2: Center the data
X_centered = X - X.mean(axis=0)
# Step 3: Covariance matrix
S = np.dot(X_centered.T, X_centered) / X.shape[0]
# Step 4: Eigen decomposition
eigenvalues, eigenvectors = np.linalg.eig(S)
# Step 5: Sort eigenvalues and eigenvectors
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = np.round(eigenvalues[idx], 4)
eigenvectors = np.round(eigenvectors[:, idx], 4)
# Step 6: Select top 2 principal components
principal_components = eigenvectors[:, :2]
# Step 7: Project the data to new 2D space
X_reduced = np.round(np.dot(X_centered, principal_components), 4)
# Output
print("Eigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)
print("\nReduced Data:\n", X_reduced)
```
<center>
<img src="https://hackmd.io/_uploads/S1a8W4nJxx.png" width="100%">
</center>
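For comparison, scikit-learn's `PCA` produces the same projection up to the sign of each component (this sketch assumes scikit-learn is installed):
```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2, 0, 1, 3, 5],
              [3, 2, 2, 4, 7],
              [4, 4, 3, 5, 9],
              [5, 6, 4, 6, 11]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # PCA centers the data internally

print(X_reduced)                          # matches the manual result up to sign flips
print(pca.explained_variance_ratio_)      # ~[1.0, 0.0]: one component carries all the variance
```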
## PCA in Action with actual data
I found a GeeksforGeeks article that uses PCA to visualize a cancer dataset. The dimensionality is reduced from 31 to 2.
The data:

After applying PCA to reduce the data to 2 dimensions, they obtain the scatter plot shown below.
<center>
<img src="https://hackmd.io/_uploads/HyVuE42keg.png" width="100%">
</center>
> Reference: {%preview https://www.geeksforgeeks.org/principal-component-analysis-pca/ %}
## Pros & Cons of PCA
### Pros
- **Dimensionality Reduction:** Reduces the number of features while preserving maximum variance.
- **Noise Reduction:** Helps remove noise and redundant information from data.
- **Visualization:** Makes it possible to visualize high-dimensional data in 2D or 3D.
- **Speeds up computation:** Smaller feature sets mean faster model training and inference.
- **Uncorrelated Features:** Transforms features into uncorrelated components, which can benefit some models.
### Cons
- **Loss of Interpretability:** Principal components are linear combinations of original features, so they are harder to interpret.
- **Variance-based only:** PCA captures variance, not necessarily the most important features for prediction.
- **Assumes Linearity:** PCA only captures linear relationships between features.
- **Sensitive to Scaling:** PCA needs standardized data; otherwise, features with larger scales dominate.
- **Can Lose Important Information:** Reducing dimensions can accidentally drop crucial information if not chosen carefully.
## Recent Advancements
### Robust PCA (RPCA)
**What it is:** Enhances classical PCA to handle outliers and corrupt data.
**Recent Use Cases:** Background subtraction in videos, anomaly detection.
**Formulation:** Decomposes a matrix as $X = L + S$ (see the sketch below), where:
- $L$: low-rank matrix (main structure)
- $S$: sparse matrix (outliers)
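As a rough illustration only, here is a minimal sketch of this decomposition solved via principal component pursuit with an inexact augmented Lagrangian scheme; the update rules follow the commonly cited RPCA formulation, and the parameter choices are untuned defaults:
```python
import numpy as np

def robust_pca(X, tol=1e-7, max_iter=500):
    """Split X into a low-rank part L and a sparse part S, so that X ~ L + S."""
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n))              # weight on the sparse term (common default)
    norm_X = np.linalg.norm(X, 'fro')

    def shrink(M, tau):
        # soft-thresholding: proximal operator of the l1 norm
        return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

    spectral = np.linalg.norm(X, 2)             # largest singular value
    Y = X / max(spectral, np.abs(X).max() / lam)    # dual variable initialization
    mu = 1.25 / spectral                        # penalty parameter
    L = np.zeros_like(X)
    S = np.zeros_like(X)

    for _ in range(max_iter):
        # low-rank update: singular value thresholding of (X - S + Y/mu)
        U, sigma, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * shrink(sigma, 1.0 / mu)) @ Vt
        # sparse update: elementwise soft-thresholding of (X - L + Y/mu)
        S = shrink(X - L + Y / mu, lam / mu)
        residual = X - L - S
        Y = Y + mu * residual
        mu *= 1.5                               # gradually increase the penalty
        if np.linalg.norm(residual, 'fro') / norm_X < tol:
            break
    return L, S

# toy usage: a rank-1 matrix corrupted by a few large spikes (hypothetical data)
rng = np.random.default_rng(0)
low_rank = np.outer(rng.normal(size=40), rng.normal(size=30))
spikes = np.zeros_like(low_rank)
spikes[rng.integers(0, 40, 20), rng.integers(0, 30, 20)] = 10.0
L, S = robust_pca(low_rank + spikes)
print(np.linalg.matrix_rank(L, tol=1e-6), int((np.abs(S) > 1e-6).sum()))
```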
### Kernel PCA
**What it is:** An extension of PCA that captures non-linear structure using the kernel trick (example below).
**Applications:** Image de-noising, manifold learning, bioinformatics.
**Recent trends:** Faster kernel approximations using the Nyström method or random features.
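A short sketch using scikit-learn's `KernelPCA` (the dataset and kernel parameters here are purely illustrative): linear PCA cannot unfold two concentric circles, while an RBF-kernel PCA makes them much easier to separate in the projected space.
```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# two concentric circles: a classic non-linear structure
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_proj = PCA(n_components=2).fit_transform(X)
kernel_proj = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# the kernel projection pulls the two circles apart, which linear PCA
# cannot do for this data (it only rotates the original 2-D cloud)
print(linear_proj[:3])
print(kernel_proj[:3])
```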
## Conclusion
Principal Component Analysis (PCA) is a powerful technique for simplifying complex datasets by reducing their dimensions while preserving as much variance as possible. Through mathematical foundations like eigenvalues, eigenvectors, and covariance matrices, PCA identifies the directions of maximum variability in the data. We’ve explored how to compute PCA step-by-step, visualized the reduced data, and discussed its advantages and limitations.