### Principal Component Analysis (PCA)

#### What is PCA?

PCA is a dimensionality-reduction method that compresses a large dataset into a smaller one while preserving the most important patterns and relationships. Essentially, PCA condenses complex, high-dimensional data into a simpler form that is easier to analyze, interpret, or visualize.

For example, if you have a dataset with 20 features, you cannot directly visualize it in 2D or 3D. PCA overcomes this limitation by projecting the data into a lower-dimensional space while retaining as much meaningful structure as possible.

**Principal Components:** To capture the most variance, the first principal component is the eigenvector of the data's covariance matrix corresponding to the largest eigenvalue $(\lambda_{1})$. Subsequent components are the remaining eigenvectors taken in descending order of their eigenvalues, each orthogonal (uncorrelated) to the previous ones.

![images](https://hackmd.io/_uploads/rJbUq_DW-l.png)

The Maximum Variance Formulation is one approach to defining PCA, where the goal is to find a lower-dimensional representation of the data that maximizes the variance of the projected data points.

#### Maximum Variance Formulation

Suppose we have a collection of data points $\{x_i\}$, where $i = 1, 2, \ldots, n$, each with $D$ dimensions, and we want to reduce them to $M$ dimensions. The goal is to find the directions (vectors) that capture the highest variance in the original $D$-dimensional space and project the data onto those $M$ directions.

Steps to find the direction of maximum variance:

* Step 1: Take any random unit vector $\mathbf{u}_1$ in some direction.
* Step 2: Project all data points onto that unit vector $\mathbf{u}_1$.
* Step 3: Calculate the variance of the projected data.
* Step 4: Maximize that variance with respect to $\mathbf{u}_1$ to find the direction of maximum variance.

Let's see how to perform these steps mathematically.

#### Step 1: Random unit vector $\mathbf{u}_1$

$$ \mathbf{u}_1^\top \mathbf{u}_1 = 1 $$

The magnitude of $\mathbf{u}_1$ is 1, so it is a unit vector.

#### Step 2: Projection of $\mathbf{x}_i$ (all data points) onto $\mathbf{u}_1$

$$\mathbf{u}_1^\top \mathbf{x}_i$$

With a simple dot product we get the projection of each data point onto the vector.
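As a quick illustration of Steps 1 and 2 (and a preview of Step 3), here is a small NumPy sketch. The dataset `X` and the variable names are made up purely for this illustration:

```python
import numpy as np

# A made-up 2-D dataset for illustration (10 points, D = 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])

# Step 1: a random unit vector u1 (normalize so that u1.T @ u1 == 1).
u1 = rng.normal(size=2)
u1 = u1 / np.linalg.norm(u1)

# Step 2: project every data point onto u1 with a dot product.
projections = X @ u1          # shape (10,), one scalar per data point

# Step 3 (preview): the variance of the projected values.
print(projections.var())
```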
#### Step 3: Calculate the variance

The formula below gives the variance of the data projected onto the unit vector $\mathbf{u}_1$. It measures how much the projected values deviate from their mean projection; PCA looks for the direction $\mathbf{u}_1$ that maximizes this quantity.

$$\text{Variance} = \frac{1}{n}\sum_{i=1}^{n} \left( \mathbf{u}_1^\top \mathbf{x}_i - \mathbf{u}_1^\top \bar{\mathbf{x}} \right)^2$$

Variance is the sum of squared differences between the actual values and the mean, divided by the number of data points, where

* $\mathbf{u}_1^\top \mathbf{x}_i$ → projection of the $i$-th data point,
* $\mathbf{u}_1^\top \bar{\mathbf{x}}$ → projection of the mean of the dataset,
* $n$ → total number of data points in the dataset.

Taking $\mathbf{u}_1^\top$ out, since it is common:

$$\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{u}_1^\top (\mathbf{x}_i - \bar{\mathbf{x}}) \right)^2$$

Let $\mathbf{z}_i = \mathbf{x}_i - \bar{\mathbf{x}}$. Then,

$$ \text{Variance} = \frac{1}{n}\sum_{i=1}^{n} \left( \mathbf{u}_1^\top \mathbf{z}_i \right)^2 $$

We can rewrite the term inside the sum as

$$ \left( \mathbf{u}_1^\top \mathbf{z}_i \right)^2 = \left( \mathbf{u}_1^\top \mathbf{z}_i \right) \left( \mathbf{z}_i^\top \mathbf{u}_1 \right) = \mathbf{u}_1^\top \mathbf{z}_i \mathbf{z}_i^\top \mathbf{u}_1 $$

Substituting back into the sum:

$$ \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{u}_1^\top \mathbf{z}_i \mathbf{z}_i^\top \mathbf{u}_1 $$

Since $\mathbf{u}_1$ does not depend on $i$:

$$ \text{Variance} = \mathbf{u}_1^\top \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{z}_i \mathbf{z}_i^\top \right) \mathbf{u}_1 $$

By definition of the covariance matrix:

$$ S = \frac{1}{n} \sum_{i=1}^{n} \mathbf{z}_i \mathbf{z}_i^\top = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}}) (\mathbf{x}_i - \bar{\mathbf{x}})^\top $$

Finally, the expression becomes:

$$ \text{Variance} = \mathbf{u}_1^\top S \mathbf{u}_1 $$

#### Step 4: Maximize variance w.r.t. the unit vector

We want to maximize the projected variance:

$$\arg\max_{\mathbf{u}_1} \left( \mathbf{u}_1^\top S \mathbf{u}_1 \right)$$

There is an issue here: if we maximize $\mathbf{u}_1^\top S \mathbf{u}_1$ without any constraint, the solution becomes trivial. The bigger we make $\mathbf{u}_1$ (the larger its magnitude), the bigger the variance gets; it can grow without bound. Since we only care about the direction, not the magnitude, we force $\mathbf{u}_1$ to be a unit vector:

$$\lVert \mathbf{u}_1 \rVert = 1 \quad \text{or} \quad \mathbf{u}_1^\top \mathbf{u}_1 = 1 \quad \Rightarrow \quad 1 - \mathbf{u}_1^\top \mathbf{u}_1 = 0$$

Note: $\mathbf{u}_1^\top \mathbf{u}_1 = \lVert \mathbf{u}_1 \rVert^2$. This ensures that we are only searching for the best direction.

To include this constraint while maximizing, we use a Lagrange multiplier $\lambda_1$:

$$ \arg\max_{\mathbf{u}_1} \left( \mathbf{u}_1^\top S \mathbf{u}_1 + \lambda_1 (1 - \mathbf{u}_1^\top \mathbf{u}_1) \right) $$

Taking the derivative of this function with respect to $\mathbf{u}_1$ (using the vector derivative rules) and setting it equal to zero:

$$ 2 S \mathbf{u}_1 - 2 \lambda_1 \mathbf{u}_1 = 0 $$

This can be rewritten as the standard **eigenvalue equation**:

$$ \boxed{S \mathbf{u}_1 = \lambda_1 \mathbf{u}_1} $$

The equation has the form $Ax = \lambda x$, so the direction of maximum variance $\mathbf{u}_1$ must be an eigenvector of the covariance matrix $S$. Multiplying both sides on the left by $\mathbf{u}_1^\top$ gives $\mathbf{u}_1^\top S \mathbf{u}_1 = \lambda_1$, so the projected variance equals the eigenvalue, and the eigenvector with the largest eigenvalue is the direction of maximum variance. This direction is called the first principal component.

## Steps to Perform Dimensionality Reduction using PCA

In the thought experiment above we projected the data onto a single direction, i.e., reduced it to 1 dimension, so we obtained only one eigenvector. To reduce from $D$ dimensions to $M$, we use the top $M$ eigenvalues; their corresponding eigenvectors are the directions of maximum variance.

* Compute the covariance matrix $S$ of the data.
* Find the eigenvalues and eigenvectors of $S$.
* Order the eigenvalues from largest to smallest.
* Select the top $M$ eigenvectors corresponding to the largest eigenvalues; these are the principal components.
* Multiply the (centered) data with the selected principal components to get the new dataset in the reduced-dimensional space.
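The sketch below is a quick numerical sanity check of the result $S\mathbf{u}_1 = \lambda_1\mathbf{u}_1$ derived above, and of the first two bullets in the list: on a synthetic dataset, no random unit vector achieves a projected variance larger than the top eigenvalue of $S$. The data and variable names are assumptions made up for this illustration.

```python
import numpy as np

# Synthetic, correlated 2-D data purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.4]])

# Covariance matrix S of the centered data (1/n convention, as above).
Z = X - X.mean(axis=0)
S = Z.T @ Z / len(X)

# Top eigenvector of S and its eigenvalue (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(S)
u_top, lam_top = eigvecs[:, -1], eigvals[-1]

# Projected variance u^T S u for many random unit vectors:
# none of them should exceed the top eigenvalue.
us = rng.normal(size=(1000, 2))
us /= np.linalg.norm(us, axis=1, keepdims=True)
random_vars = np.einsum('ij,jk,ik->i', us, S, us)

print(f"top eigenvalue          : {lam_top:.4f}")
print(f"variance along u_top    : {u_top @ S @ u_top:.4f}")
print(f"best random unit vector : {random_vars.max():.4f}")  # <= lam_top
```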
### PCA Example

Consider the dataset below.

| Sample | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
| ------ | --------- | --------- | --------- | --------- | --------- |
| 1      | 2         | 0         | 1         | 3         | 5         |
| 2      | 3         | 2         | 2         | 4         | 7         |
| 3      | 4         | 4         | 3         | 5         | 9         |
| 4      | 5         | 6         | 4         | 6         | 11        |

In matrix form, the data matrix is:

$$ X = \begin{bmatrix} 2 & 0 & 1 & 3 & 5 \\ 3 & 2 & 2 & 4 & 7 \\ 4 & 4 & 3 & 5 & 9 \\ 5 & 6 & 4 & 6 & 11 \end{bmatrix} $$

We compute the covariance matrix using

$$ S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top $$

where $x_i$ is the $i$-th data point (a row of $X$) and $\bar{x}$ is the vector of column means.

First, center the data by subtracting the column means $\bar{x} = [3.5,\; 3,\; 2.5,\; 4.5,\; 8]$:

$$ X_c = X - \mathbf{1}\bar{x}^\top = \begin{bmatrix} -1.5 & -3.0 & -1.5 & -1.5 & -3.0 \\ -0.5 & -1.0 & -0.5 & -0.5 & -1.0 \\ 0.5 & 1.0 & 0.5 & 0.5 & 1.0 \\ 1.5 & 3.0 & 1.5 & 1.5 & 3.0 \end{bmatrix} $$

The covariance matrix can then be computed compactly as

$$ S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top = \frac{1}{n} X_c^\top X_c $$

$$ S = \frac{1}{4} \begin{bmatrix} 5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\ 10.0 & 20.0 & 10.0 & 10.0 & 20.0 \\ 5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\ 5.0 & 10.0 & 5.0 & 5.0 & 10.0 \\ 10.0 & 20.0 & 10.0 & 10.0 & 20.0 \end{bmatrix} = \begin{bmatrix} 1.25 & 2.5 & 1.25 & 1.25 & 2.5 \\ 2.5 & 5.0 & 2.5 & 2.5 & 5.0 \\ 1.25 & 2.5 & 1.25 & 1.25 & 2.5 \\ 1.25 & 2.5 & 1.25 & 1.25 & 2.5 \\ 2.5 & 5.0 & 2.5 & 2.5 & 5.0 \end{bmatrix} $$

Next, we compute the eigenvalues and eigenvectors of $S$. The eigenvalues are:

$$ \Lambda = \begin{bmatrix} 0.0 \\ 13.75 \\ 0.0 \\ 0.0 \\ 0.0 \end{bmatrix} $$

The corresponding eigenvector matrix $V$ (eigenvectors as columns, in the same order as $\Lambda$) is:

$$ V = \begin{bmatrix} -0.9535 & 0.3015 & 0.0 & 0.0 & 0.0 \\ 0.1907 & 0.6030 & 0.7746 & 0.0 & 0.0 \\ 0.0953 & 0.3015 & -0.2582 & -0.4082 & -0.8165 \\ 0.0953 & 0.3015 & -0.2582 & 0.8816 & -0.2367 \\ 0.1907 & 0.6030 & -0.5164 & -0.2367 & 0.5266 \end{bmatrix} $$

Ordering the eigenvectors (the columns of $V$) by their eigenvalues, largest first:

| Eigenvalue | Eigenvector |
| ---------- | ----------- |
| 13.75 | [0.3015, 0.6030, 0.3015, 0.3015, 0.6030] |
| 0.0   | [-0.9535, 0.1907, 0.0953, 0.0953, 0.1907] |
| 0.0   | [0.0, 0.7746, -0.2582, -0.2582, -0.5164] |
| 0.0   | [0.0, 0.0, -0.4082, 0.8816, -0.2367] |
| 0.0   | [0.0, 0.0, -0.8165, -0.2367, 0.5266] |

Since we are reducing from 5 dimensions to 2, we select the top 2 eigenvectors as the projection matrix and multiply the centered data by it. The reduced data matrix after PCA is:

$$ X_{\text{reduced}} = X_c \begin{bmatrix} 0.3015 & -0.9535 \\ 0.6030 & 0.1907 \\ 0.3015 & 0.0953 \\ 0.3015 & 0.0953 \\ 0.6030 & 0.1907 \end{bmatrix} = \begin{bmatrix} -4.9749 & 0.0 \\ -1.6583 & 0.0 \\ 1.6583 & 0.0 \\ 4.9749 & 0.0 \end{bmatrix} $$

Note that the second eigenvalue is 0, so the second component carries no variance: all of the structure in this toy dataset lies along a single direction.

### Pseudocode

```
Algorithm: PCA(X, k)

Input:
  - X: data matrix (n × d)
  - k: number of principal components
Output:
  - X_reduced: transformed data (n × k)
  - W: projection matrix (d × k)
  - explained_variance: variance explained by each PC

1. Center the data:
   X_centered = X - mean(X, axis=0)
2. Compute covariance matrix:
   C = (1/(n-1)) * X_centered^T * X_centered
3. Compute eigendecomposition:
   eigenvalues, eigenvectors = eig(C)
4. Sort by eigenvalues (descending):
   idx = argsort(eigenvalues)[::-1]
   eigenvalues = eigenvalues[idx]
   eigenvectors = eigenvectors[:, idx]
5. Select top k eigenvectors:
   W = eigenvectors[:, 0:k]
6. Project data onto principal components:
   X_reduced = X_centered * W
7. Compute explained variance:
   explained_variance = eigenvalues[0:k] / sum(eigenvalues)
8. Return X_reduced, W, explained_variance
```
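As a check on the worked example, here is a minimal NumPy sketch that follows the pseudocode above and applies it to the 4 × 5 data matrix. It reproduces the projected data up to the arbitrary sign of each eigenvector; the 1/n covariance convention is used to match the example (the pseudocode's 1/(n-1) only rescales the eigenvalues and leaves the components and variance ratios unchanged).

```python
import numpy as np

def pca(X, k):
    """PCA following the pseudocode above: center, covariance, eigendecomposition, project."""
    n = X.shape[0]
    # 1. Center the data.
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix (1/n to match the worked example above).
    C = X_centered.T @ X_centered / n
    # 3. Eigendecomposition (eigh: C is symmetric, so eigenvalues are real).
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # 4. Sort by eigenvalue, descending.
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]
    # 5. Top-k eigenvectors form the projection matrix W (d × k).
    W = eigenvectors[:, :k]
    # 6. Project the centered data.
    X_reduced = X_centered @ W
    # 7. Explained variance ratio of the kept components.
    explained_variance = eigenvalues[:k] / eigenvalues.sum()
    # 8. Return the results.
    return X_reduced, W, explained_variance

# The 4 × 5 example dataset from the table above.
X = np.array([[2, 0, 1, 3, 5],
              [3, 2, 2, 4, 7],
              [4, 4, 3, 5, 9],
              [5, 6, 4, 6, 11]], dtype=float)

X_reduced, W, ev = pca(X, k=2)
print(X_reduced)   # first column ≈ [-4.97, -1.66, 1.66, 4.97]; signs may flip
print(ev)          # ≈ [1.0, 0.0]: the first component explains all the variance
```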
**Example: The Breast Cancer Dataset**

PCA is often used in machine learning to simplify high-dimensional data, such as the breast cancer dataset, which contains 30+ clinical features for each patient. The goal is to predict whether a tumor is malignant or benign.

**How PCA Helps:**

* Standardize: scale all features to have mean 0 and standard deviation 1.
* Transform: PCA converts the 30+ correlated features into new, uncorrelated principal components.
* Reduce: keep only the top 2–3 components that capture most of the variance.

**Result:**

* Visualization: the reduced data can now be plotted in 2D or 3D, revealing clear clusters.
* Modeling: models like logistic regression or SVM train faster and may perform better because redundant noise is removed.

(A minimal scikit-learn sketch of this workflow is given at the end of this note.)

**Pros & Cons of PCA**

**Pros**

* Dimensionality reduction: lowers the number of features while retaining most of the dataset's variance.
* Noise filtering: removes noise and redundant information, improving data quality.
* Better visualization: enables high-dimensional data to be represented in 2D or 3D plots.
* Faster computation: fewer features lead to quicker model training and prediction.
* Uncorrelated components: transforms features into uncorrelated components, which can be beneficial for certain algorithms.

**Cons**

* Reduced interpretability: principal components are combinations of several features, making them difficult to interpret.
* Variance-focused: PCA prioritizes variance, which may not align with the most predictive features.
* Assumes linearity: only captures linear relationships and may miss complex patterns.
* Scale-sensitive: requires standardized data to prevent large-scale features from dominating.

**Conclusion**

PCA provides a systematic way to simplify high-dimensional data by identifying the directions in which the data varies the most and projecting it onto those directions. By reducing dimensions while retaining the majority of the meaningful structure, PCA makes complex datasets easier to visualize, analyze, and interpret. It is a powerful tool for uncovering underlying patterns, removing noise, and preparing data for further machine-learning tasks.
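For completeness, here is the minimal scikit-learn sketch referenced in the breast-cancer example above. The specific choices (two components, a logistic-regression classifier, a 75/25 split) are illustrative assumptions, not part of the original notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 569 samples, 30 features

# Standardize: mean 0, standard deviation 1 for every feature.
X_scaled = StandardScaler().fit_transform(X)

# Transform + Reduce: keep the top 2 uncorrelated components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Modeling on the reduced data (illustrative classifier choice).
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```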