# Principal Component Analysis (PCA)

![1_Z3a77HufPLUqxdXclewd5w](https://hackmd.io/_uploads/HkWeZIDZZg.png)

Principal Component Analysis (PCA) is a dimensionality-reduction technique used to simplify high-dimensional datasets while preserving as much important structure as possible. PCA identifies the **directions of greatest variance** in the data and creates new variables, called **principal components**, aligned with these directions. This makes complex datasets easier to visualize, analyze, and model.

#### Why Do We Need PCA?

Many real datasets—especially in medicine, biology, imaging, or finance—contain dozens or hundreds of measurements per sample. For example, a medical dataset might measure:

* cell size
* roundness
* texture
* smoothness
* compactness
* symmetry

…and many more. This creates high-dimensional data, which is hard to visualize and analyze.

#### PCA helps by:

* finding the directions in which the data varies the most
* rotating the coordinate system to align with these directions
* keeping only the top few components that contain most of the structure

**Intuition: What Does PCA Actually Do?**

PCA:

1. Finds the **axis of maximum variation** → this becomes **PC1**.
2. Finds the next axis, perpendicular to the first, with the next largest variation → **PC2**.
3. Repeats until all components are found.
4. Allows us to keep only the top k components.

The amount of variance explained by PC1 and PC2 depends on the dataset. In some domains with highly correlated features—such as genomics, spectroscopy, and certain imaging or macroeconomic datasets—the first two principal components may capture a large portion of the total variance. However, this is not universal; many datasets have a more spread-out variance structure. This clarification aligns with findings from PCA survey papers such as **Jolliffe & Cadima (2016)** and **Bro & Smilde (2014)**.

## PCA Algorithm (with Pseudocode)

```
Pseudocode: Step-by-Step PCA

Step 1: Compute mean and std of each feature
    mu    -> mean of each column of X
    sigma -> standard deviation of each column of X

Step 2: Standardize data
    X_standardized = (X - mu) / sigma

Step 3: Compute covariance matrix
    Sigma = covariance_matrix(X_standardized)

Step 4: Compute eigenvalues and eigenvectors of Sigma
    [lambda_i, v_i] = eig(Sigma)

Step 5: Sort eigenvectors by descending eigenvalues
    Sort v_i in order of lambda_i (largest → smallest)

Step 6: Select top k eigenvectors
    V_k = [v_1, v_2, ..., v_k]

Step 7: Project standardized data onto top k eigenvectors
    X_reduced = X_standardized * V_k

Step 8: Return X_reduced
```

## Explanation of PCA Pseudocode Steps

1. **Standardize the data**
   Ensures all features contribute equally. Without scaling, variables with large ranges dominate PCA.
2. **Compute the covariance matrix**
   Shows how features vary together. PCA uses this structure to find relationships between variables.
3. **Calculate eigenvalues and eigenvectors**
   Eigenvectors = the new directions (principal components); eigenvalues = the amount of variance captured in each direction.
4. **Sort and choose top components**
   We keep the components that explain the most variance and drop the rest.
5. **Project data onto selected components**
   Transforms the original data into a lower-dimensional space aligned with the principal components.
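To make the pseudocode concrete, here is a minimal NumPy sketch of the same steps. The function name `pca_from_scratch` and the synthetic demo data are illustrative choices of mine, not part of any particular library.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Follow the pseudocode above: standardize, eigendecompose the covariance, project."""
    # Steps 1-2: standardize each feature (zero mean, unit variance)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    X_std = (X - mu) / sigma

    # Step 3: covariance matrix of the standardized data
    cov = np.cov(X_std, rowvar=False)

    # Step 4: eigenvalues and eigenvectors (eigh is meant for symmetric matrices like cov)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 5: sort by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Steps 6-7: keep the top k eigenvectors and project the standardized data onto them
    V_k = eigvecs[:, :k]
    X_reduced = X_std @ V_k

    # Step 8: return the reduced data (plus the fraction of variance each component explains)
    return X_reduced, eigvals[:k] / eigvals.sum()

# Tiny demo on synthetic data: 100 samples, 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

X_2d, explained = pca_from_scratch(X, k=2)
print(X_2d.shape)   # (100, 2)
print(explained)    # fraction of total variance captured by PC1 and PC2
```

`np.linalg.eigh` is used instead of `np.linalg.eig` because the covariance matrix is symmetric, which makes `eigh` both faster and guaranteed to return real eigenvalues.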
## Visual Example: PCA on Breast Cancer Dataset

PCA is useful when our dataset has many features (30, 1000, or even more) and we want to see patterns that are impossible to visualize directly. For example, imagine we collect images of breast cells and measure 30 properties for each cell:

* size
* roundness
* texture
* smoothness
* compactness
* symmetry
* …etc.

This means every cell is represented by 30 numbers — a 30-dimensional point. We want to check whether benign and cancerous cells look different in these measurements, but we cannot plot 30-dimensional data.

**PCA helps by**:

* finding the most important directions of variation
* compressing the 30 dimensions into just 2 numbers per cell
* allowing us to plot the data on a simple 2D scatter plot

When we apply PCA to the Breast Cancer Wisconsin dataset, the 2D PCA plot clearly shows that cancerous and benign cells form separate clusters. This means PCA successfully reveals structure that is hard to see in the original high-dimensional data.

![image](https://hackmd.io/_uploads/HyNtKUQ-Ze.png)

## PCA Example 2: Height and Weight

Consider a simple dataset of 10 people:

| Person | Height (cm) | Weight (kg) |
| ------ | ----------- | ----------- |
| 1      | 150         | 50          |
| 2      | 160         | 55          |
| 3      | 165         | 65          |
| 4      | 170         | 68          |
| 5      | 155         | 52          |
| 6      | 172         | 70          |
| 7      | 158         | 54          |
| 8      | 168         | 66          |
| 9      | 162         | 60          |
| 10     | 175         | 72          |

**Step 1**: Plot the Raw Data

![heightweight1 (1)](https://hackmd.io/_uploads/rJkNCImZZe.gif)

**Scatter plot**: Height vs Weight

To perform PCA, we first find the axis of greatest variation, which here resembles a line of best fit through the data. This line is the first principal component.

**Blue line** = PC1, the direction of maximum variance.

![heightweight2](https://hackmd.io/_uploads/HkBrCLm-Wx.gif)

**Red dots** = projected points along PC1.

![PCA1-smaller-smaller (1)](https://hackmd.io/_uploads/Skb8CUQbbe.gif)

This reduces the 2D data to 1D while retaining the main information (the taller → heavier trend).
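Below is a short scikit-learn sketch of this 2D → 1D reduction on the ten-person table above (assuming `scikit-learn` is installed; the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# The ten-person height/weight table from above
heights = [150, 160, 165, 170, 155, 172, 158, 168, 162, 175]
weights = [ 50,  55,  65,  68,  52,  70,  54,  66,  60,  72]
X = np.column_stack([heights, weights])

# Standardize, then keep only the first principal component
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=1)
scores = pca.fit_transform(X_std)   # one number per person: the position along PC1

print("PC1 direction:", pca.components_[0])            # roughly equal loadings on height and weight
print("Variance explained by PC1:", pca.explained_variance_ratio_[0])
print("1D representation:", scores.ravel().round(2))
```

The same recipe with `PCA(n_components=2)` applied to `sklearn.datasets.load_breast_cancer()` produces a 2D scatter like the breast-cancer plot shown earlier.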
## Covariance

Covariance measures how two features vary together:

$$
\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}) (Y_i - \bar{Y})
$$

- **Positive covariance** → both features tend to increase or decrease together
- **Negative covariance** → one feature increases while the other decreases
- **Zero covariance** → features are (linearly) uncorrelated

![image](https://hackmd.io/_uploads/HJ0cUkHZ-x.png)

## Symmetric Property

A covariance matrix is always **symmetric** because:

$$
\text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i)
$$

For a dataset with features $(X_1, X_2, \dots, X_p)$, the covariance matrix $\Sigma$ is:

$$
\Sigma =
\begin{bmatrix}
\text{Cov}(X_1, X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_p) \\
\text{Cov}(X_2, X_1) & \text{Cov}(X_2, X_2) & \dots & \text{Cov}(X_2, X_p) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_p, X_1) & \text{Cov}(X_p, X_2) & \dots & \text{Cov}(X_p, X_p)
\end{bmatrix}
$$

- **Diagonal elements** → variance of each feature
- **Off-diagonal elements** → covariance between features

---

## Example: Height–Weight Data

Sample data (3 people):

| Person | Height (cm) | Weight (kg) |
|--------|-------------|-------------|
| 1      | 150         | 50          |
| 2      | 160         | 55          |
| 3      | 170         | 68          |

### Step 1: Compute means

$$
\bar{H} = 160, \quad \bar{W} \approx 57.67
$$

### Step 2: Center the data

$$
H - \bar{H} = [-10, 0, 10], \quad W - \bar{W} = [-7.67, -2.67, 10.33]
$$

### Step 3: Compute variances and covariance

#### Sample Variance Formula

$$
\mathrm{Var}(X) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
$$

#### Sample Covariance Formula

$$
\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
$$

#### Applied to the given data

$$
\mathrm{Var}(H) = \frac{(-10)^2 + 0^2 + 10^2}{3 - 1} = 100
$$

$$
\mathrm{Var}(W) = \frac{(-7.67)^2 + (-2.67)^2 + 10.33^2}{3 - 1} \approx 86.33
$$

$$
\mathrm{Cov}(H, W) = \frac{(-10)(-7.67) + 0 \cdot (-2.67) + 10 \cdot 10.33}{3 - 1} = 90
$$

### Step 4: Covariance Matrix

The covariance matrix for Height ($H$) and Weight ($W$) is:

$$
\Sigma =
\begin{bmatrix}
\mathrm{Var}(H) & \mathrm{Cov}(H, W) \\
\mathrm{Cov}(H, W) & \mathrm{Var}(W)
\end{bmatrix}
=
\begin{bmatrix}
100 & 90 \\
90 & 86.33
\end{bmatrix}
$$

- **Diagonal elements** → variance of each feature:
  - $100$ → Height variance
  - $86.33$ → Weight variance
- **Off-diagonal elements** → covariance between features:
  - $90$ → covariance between Height and Weight

This matrix is **symmetric** because:

$$
\mathrm{Cov}(H, W) = \mathrm{Cov}(W, H)
$$
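As a quick sanity check, `np.cov` reproduces this matrix directly from the raw columns; it uses the same \(n - 1\) denominator as the sample formulas above.

```python
import numpy as np

# The three-person height/weight example from above
H = np.array([150, 160, 170])
W = np.array([50, 55, 68])

# np.cov treats each input as one variable and uses the (n - 1) denominator
Sigma = np.cov(H, W)
print(Sigma)
# ≈ [[100.    90.  ]
#    [ 90.    86.33]]  -> Var(H), Cov(H, W), Var(W), matching the hand computation
```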
## Linear Transformations and Eigenvectors

To understand PCA, it’s helpful to know a bit about linear transformations.

### What is a Linear Transformation?

A linear transformation is a rule that moves points in space. It can be represented by a **matrix**. Multiplying a point \(p\) by a matrix \(A\) moves the point:

$$
p' = A \cdot p
$$

**Example:**

$$
A = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}
$$

Multiplying any point by this matrix **rotates it 180° around the origin**.

![rotate-1 (1)](https://hackmd.io/_uploads/ByA2uCN--l.gif)

---

## Shearing Transformation

Another example is **shearing**, which “slants” the points:

$$
A = \begin{bmatrix} 1 & 0.4 \\ 0.4 & 1 \end{bmatrix}
$$

When points are multiplied by this matrix, they shift in a particular direction. Some points, however, **stay on the same line after transformation**.

![animated_eig2](https://hackmd.io/_uploads/By4lY0N-We.gif)

---

## Eigenvectors and Eigenvalues

Notice something interesting in the shearing transformation: any point that starts on the line \(y = x\) ends up on the **same line** after the transformation.

Examples: \((1,1), (2,2), (42,42)\) … all stay on \(y = x\).

This line is an **eigenvector** of the matrix.

**Definition:**

- **Eigenvector:** a direction that the transformation does **not rotate**; points along it are only stretched or shrunk
- **Eigenvalue:** the factor by which points are stretched or shrunk along that direction

**Example:**

- For the matrix above, the point \((1,1)\) ends up at \((1.4, 1.4)\) → eigenvalue = 1.4

**Intuition:** Think of the matrix as a “journey plan”:

- **Eigenvectors** = the direction you travel
- **Eigenvalues** = how far you move along that direction

## PCA and Singular Value Decomposition (SVD)

While PCA can be introduced using eigenvectors of the covariance matrix, in practice it is usually computed using **Singular Value Decomposition (SVD)**. SVD is **more stable, faster, and works well for high-dimensional or large datasets**.

### What is SVD?

For a centered data matrix \(X\) (mean-subtracted), SVD factorizes it as:

$$
X = U \Sigma V^T
$$

Where:

- \(U\) — left singular vectors, representing the directions of the data points in the original space
- \(\Sigma\) — diagonal matrix of singular values, showing how much variance is captured by each direction
- \(V\) — right singular vectors, representing the principal components (directions of maximum variance)

### How SVD Produces PCA

- Columns of \(V\) = principal components
- Squared singular values divided by \(n - 1\) = eigenvalues of the covariance matrix
- Projected data (scores): \(X_{\text{PCA}} = U \Sigma\)

This avoids computing the covariance matrix explicitly, which is especially useful when the number of features is greater than the number of samples or the dataset is very large.

### Intuition Behind SVD

SVD decomposes a centered data matrix \(X\) as:

$$
X = U \Sigma V^T
$$

Intuitively, each component means:

- **Vᵀ**: rotates the original data into the new coordinate system defined by the principal components.
- **Σ**: scales the data along each principal component according to the amount of variance it captures.
- **U**: positions the data points in the rotated and scaled space.

Multiplying **U Σ** gives the coordinates of the data in principal component space:

$$
X_{\text{PCA}} = U \Sigma
$$

### Why SVD is Preferred for PCA

- **Numerically stable** — avoids rounding errors
- **Handles high dimensions** — works even when there are more features than samples
- **Efficient for large datasets** — avoids computing huge covariance matrices
- **Provides variance explained** — singular values indicate the variance captured by each component
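The short NumPy sketch below (illustrative code of mine, not a fixed recipe) checks both ideas numerically: the 2×2 matrix from the transformation example has an eigenvector along \(y = x\), and PCA computed via SVD matches PCA computed via the covariance matrix.

```python
import numpy as np

# 1) Eigenvectors of the 2x2 matrix used in the transformation example
A = np.array([[1.0, 0.4],
              [0.4, 1.0]])
eigvals, eigvecs = np.linalg.eigh(A)
print(eigvals)                    # [0.6 1.4] -> the largest eigenvalue, 1.4, belongs to y = x
print(eigvecs[:, 1])              # ≈ ±[0.707, 0.707], i.e. the direction (1, 1)
print(A @ np.array([1.0, 1.0]))   # [1.4 1.4] -> the point (1, 1) stays on y = x

# 2) PCA via SVD agrees with PCA via the covariance matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                  # center the data (mean-subtract)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
evals_svd = S**2 / (len(X) - 1)          # eigenvalues of the covariance matrix
scores_svd = U * S                       # same as Xc @ Vt.T (the PCA scores)

evals_cov, evecs_cov = np.linalg.eigh(np.cov(Xc, rowvar=False))
evals_cov = evals_cov[::-1]              # sort descending to match the SVD order

print(np.allclose(evals_svd, evals_cov)) # True (up to numerical precision)
```

Individual components may come out with opposite signs in the two computations; this is harmless, because a principal component direction is only defined up to sign.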
## Applications of PCA

**1. Medical Imaging (MRI, CT, X-ray)**
PCA reduces noise and compresses large images by keeping only meaningful visual structures. It helps highlight important regions like tumors or fractures.

**2. Genomics & Bioinformatics**
Gene expression datasets are huge (thousands of genes). PCA reveals major biological patterns and helps cluster patients or identify disease subtypes.

**3. Finance & Investment**
PCA reduces many correlated financial indicators into a few key factors that represent market movement, risk, or volatility.

**4. Image Compression**
Large images can be rebuilt using only the leading principal components, reducing storage while preserving essential details.

**5. Machine Learning Preprocessing**
Reduces dimensionality before training models, which improves speed, reduces overfitting, and removes noise.

**6. Customer & Marketing Analytics**
PCA finds major behavioral patterns, such as “budget shoppers” vs. “premium customers”, by grouping correlated purchasing behaviors.

## Limitations of PCA & How to Mitigate Them

**1. PCA is linear** — it cannot capture curved or nonlinear patterns.
**Mitigation**: Use Kernel PCA or Autoencoders for nonlinear dimensionality reduction.

**2. Very sensitive to feature scaling.**
**Mitigation**: Always standardize or normalize the data before PCA (see the short demonstration at the end of this article).

**3. Sensitive to outliers.**
**Mitigation**:
- Use Robust PCA (RPCA)
- Remove or replace extreme outliers (IQR, z-score)
- Use median-based scaling

**4. PCA components are hard to interpret.**
**Mitigation**:
- Use Sparse PCA to get cleaner, simpler components
- Apply Varimax rotation to make component loadings easier to understand

**5. High variance does not always mean “important” in real-world terms.**
**Mitigation**: Combine PCA with domain knowledge or feature selection for better interpretation.

**6. PCA loses the original feature meaning.**
**Mitigation**:
- Use feature selection instead of PCA when interpretability is important
- Use Autoencoders to capture features with less information loss

## Conclusion

PCA is a fundamental data science tool that reduces dimensionality, uncovers hidden patterns, and simplifies visualization. It:

- uses covariance matrices, eigenvalues, and eigenvectors (or, in practice, SVD)
- preserves the most meaningful variation
- compresses high-dimensional data into a smaller set of principal components

Even in simple examples like height and weight, PCA captures the main trend and compresses the dataset into a single meaningful dimension. For high-dimensional datasets, PCA makes analysis feasible, interpretable, and visualizable.
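As a closing illustration of limitation 2 (sensitivity to feature scaling), here is a minimal scikit-learn sketch; the two toy features are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Two illustrative features on very different scales:
# income in dollars (tens of thousands) and a satisfaction score from 1 to 10
rng = np.random.default_rng(42)
income = rng.normal(50_000, 15_000, size=300)
satisfaction = rng.normal(5, 2, size=300)
X = np.column_stack([income, satisfaction])

# Without scaling, the large-range feature dominates PC1 almost completely
print(PCA(n_components=2).fit(X).explained_variance_ratio_)
# -> roughly [1.0, 0.0]: PC1 is essentially just "income"

# After standardization, both features contribute to the components
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_std).explained_variance_ratio_)
# -> roughly [0.5, 0.5] for these independent features
```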