**Chapter 1: Introduction to Principal Component Analysis**

Principal Component Analysis (PCA) is a widely used technique in multivariate statistics and machine learning for dimensionality reduction. It transforms complex datasets by re-expressing correlated features as a smaller set of uncorrelated components, known as principal components. These components are ordered by the amount of variance they capture from the original data, with the first principal component (PC₁) retaining the maximum possible information, followed by PC₂, PC₃, and so on. The transformation is linear and orthogonal, ensuring that the new axes are mutually uncorrelated.

The need for dimensionality reduction arises from a fundamental challenge in high-dimensional data analysis: as the number of features increases, models become more prone to overfitting, computational inefficiency, and loss of interpretability. This phenomenon is known as the curse of dimensionality, where excessive features introduce noise and spurious patterns that degrade model performance. Contrary to the intuition that more data always improves learning, feeding a model too many features can lead it to memorize irrelevant fluctuations rather than generalize meaningful trends. Dimensionality reduction techniques mitigate this problem by preserving the most informative aspects of the data while discarding low-variance directions.

Among these techniques, PCA stands out as a method of feature extraction: the original feature space is mapped to a lower-dimensional space through linear combinations of the input variables. This transformation retains as much variance as possible, ensuring that the reduced dataset remains representative of the original structure.

Mathematically, PCA is closely tied to the Singular Value Decomposition (SVD), a robust and efficient matrix factorization method. Given a standardized data matrix $X$, SVD decomposes it into $X = U \Sigma V^\top$, where the columns of $V$ are the principal directions (the eigenvectors of the covariance matrix) and the singular values in $\Sigma$ quantify the variance captured along each direction. This decomposition provides both a computational pathway and a geometric interpretation of PCA: the data is rotated into a new basis aligned with its intrinsic variance structure.

**Chapter 2: Mathematical Steps of PCA**

This chapter formalizes Principal Component Analysis as a stepwise procedure, focusing on definitions, linear-algebraic operations, and criteria for selecting and interpreting components.

**Step 1: Standardize the Data**

Different features in a dataset often have different units and scales. If left unadjusted, features with larger numerical ranges dominate the analysis, even if they are not more informative. To ensure fair comparison, PCA begins by standardizing the data: each feature is transformed so that it has a mean of 0 and a standard deviation of 1, making variance comparable across all features.

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \quad Z = \mathrm{standardize}(X), \quad \mu_j = \frac{1}{n}\sum_{i=1}^n x_{ij}, \quad \sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \mu_j)^2}$$

where:

- $x_{ij}$ is the value of feature $j$ for observation $i$
- $\mu_j$ is the mean of feature $j$
- $\sigma_j$ is the standard deviation of feature $j$

**Result:** The standardized dataset $Z$ has features with mean 0 and variance 1. This prevents scale dominance and makes variance a meaningful measure of information.
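A minimal NumPy sketch of this standardization step; the example numbers are made up purely for illustration and the function name is our own:

```python
import numpy as np

def standardize(X):
    """Scale each column (feature) to mean 0 and unit sample variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)  # sample standard deviation, matching the formula above
    return (X - mu) / sigma

# Made-up example: five observations of three features on very different scales
X = np.array([[150.0, 45.2, 12.0],
              [230.0, 51.0, 30.0],
              [310.0, 60.4, 41.0],
              [120.0, 38.7,  9.0],
              [280.0, 55.1, 35.0]])
Z = standardize(X)
print(Z.mean(axis=0))           # approximately 0 for every feature
print(Z.std(axis=0, ddof=1))    # exactly 1 for every feature
```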
**Step 2: Compute the Covariance Matrix**

Once the data is standardized, PCA calculates the covariance matrix, which summarizes how features vary together.

- The diagonal entries represent the variance of each feature.
- The off-diagonal entries represent the covariance between pairs of features, showing whether they increase together (positive covariance), move in opposite directions (negative covariance), or show no linear relationship (near-zero covariance).

$$S = \frac{1}{n-1} Z^\top Z$$

where $S \in \mathbb{R}^{m \times m}$ and $m$ is the number of features.

**Result:** The covariance matrix encodes the relationships among features. Highly correlated features show large covariance, while linearly unrelated features show values close to zero. This matrix is the foundation of PCA, because it identifies the directions in which the data varies most.

**Step 3: Eigen Decomposition of the Covariance Matrix**

The next step is to extract the principal directions of variation by performing eigen decomposition on the covariance matrix.

- Eigenvectors represent directions in feature space.
- Eigenvalues represent the amount of variance captured along those directions.
- PCA orders these directions by descending eigenvalue, so the first principal component captures the maximum variance.

$$S v_i = \lambda_i v_i, \quad i = 1, \dots, m$$

where:

- $v_i$ are the eigenvectors (principal axes)
- $\lambda_i$ are the eigenvalues (variance explained by each axis)

**Result:**

- The eigenvector with the largest eigenvalue becomes the first principal component, PC₁, capturing the maximum variance.
- Subsequent eigenvectors form PC₂, PC₃, and so on, each orthogonal to the previous ones and capturing progressively less variance.
- Together, they define a rotated coordinate system aligned with the intrinsic structure of the data.

**Step 4: Select the Top Principal Components**

After eigen decomposition, we obtain a full set of eigenvalues and eigenvectors. Not all of them are equally important: some capture large amounts of variance, while others capture very little. The goal is to select only the most informative components.

- Eigenvalues $\lambda_i$ indicate how much variance each principal component explains.
- Components are ranked in descending order of eigenvalues.
- We decide how many components $k$ to retain based on explained variance and interpretability.

Variance explained by component $i$:

$$\text{Var}(\text{PC}_i) = \lambda_i$$

Explained variance ratio (EVR):

$$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j}$$

Cumulative explained variance ratio (CEVR):

$$\text{CEVR}_k = \sum_{i=1}^{k} \text{EVR}_i$$

Choose $k$ such that $\text{CEVR}_k$ reaches a threshold (commonly 80–95%). The selected eigenvectors form the matrix

$$V_k = [\,v_1 \; v_2 \; \cdots \; v_k\,] \in \mathbb{R}^{m \times k}$$

This defines the reduced feature space.
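The following NumPy sketch strings Steps 2–4 together for an already-standardized matrix `Z`; the function name and the 90% threshold are illustrative choices, not part of a fixed recipe:

```python
import numpy as np

def covariance_and_selection(Z, threshold=0.90):
    """Steps 2-4 for an already-standardized (n x m) matrix Z."""
    n = Z.shape[0]
    S = (Z.T @ Z) / (n - 1)                  # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # Step 3: eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    evr = eigvals / eigvals.sum()            # explained variance ratio (EVR)
    cevr = np.cumsum(evr)                    # cumulative EVR
    k = int(np.searchsorted(cevr, threshold)) + 1   # Step 4: smallest k with CEVR >= threshold
    return S, eigvals, eigvecs[:, :k], cevr, k
```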
**Step 5: Project the Data into Principal Component Space**

Once the top $k$ components are selected, the original standardized data is transformed into this new reduced space.

- Each observation is expressed as a linear combination of the selected eigenvectors.
- The result is a new dataset with $k$ dimensions instead of $m$, while still retaining most of the original variance.

$$T = Z V_k$$

where:

- $Z$ is the standardized data matrix $(n \times m)$
- $V_k$ is the matrix of the top $k$ eigenvectors $(m \times k)$
- $T$ is the transformed dataset $(n \times k)$

**Result:**

- Rows of $T$ represent observations in the reduced principal component space.
- Columns of $T$ are the principal component scores, showing how much each observation aligns with each principal axis.
- This reduced dataset is easier to visualize, analyze, and feed into machine learning models without suffering from the curse of dimensionality.

In summary, Principal Component Analysis (PCA) begins by standardizing the dataset so that each feature has mean 0 and variance 1. From this standardized data, PCA computes the covariance matrix, which encodes how features vary together. By performing eigen decomposition of this matrix, PCA obtains eigenvectors (directions of maximum variation) and eigenvalues (the variance captured along each direction). The components with the largest eigenvalues are then selected to retain most of the variance while reducing dimensionality. Finally, the data is projected into this reduced space using the transformation $T = Z V_k$, simplifying analysis while preserving the most informative structure. The whole procedure fits in a short function:

```python
import numpy as np

def pca(X, k):
    # Step 1: Standardize the data (mean 0, unit variance per feature)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    Z = (X - mu) / sigma

    # Step 2: Covariance matrix
    n = X.shape[0]
    S = (Z.T @ Z) / (n - 1)

    # Step 3: Eigen decomposition (eigh, since S is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(S)

    # Step 4: Sort eigenvectors by descending eigenvalue
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[idx], eigenvectors[:, idx]

    # Step 5: Select the top k eigenvectors
    Vk = eigenvectors[:, :k]

    # Step 6: Project the data: T = Z V_k
    T = Z @ Vk
    return T, Vk, eigenvalues
```

**Chapter 3: PCA with an Example of Cricket Batters' Stats**

Imagine a galaxy of cricket batters, each shining with Runs, Average, Strike Rate, and Boundaries. At first glance the constellation looks scattered and complex. Principal Component Analysis (PCA) helps rotate our view, revealing the brightest patterns: one axis showing overall dominance, another highlighting contrasting styles of play.

A player's batting statistics span multiple dimensions. Runs measure productivity, Average reflects consistency, Strike Rate captures tempo, and Boundaries highlight aggression. Taken together, these four features form a four-dimensional space where each player is a point. Applying PCA:

- **First Component (PC1):** captures maximum variance, often overall batting dominance.
- **Second Component (PC2):** highlights secondary contrasts, such as stylistic differences between accumulators and aggressors.

By projecting players onto these new axes, PCA transforms scattered statistics into interpretable patterns.

Here is the dataset of batting statistics of 15 batters from the recently concluded Women's World Cup 2025. The goal is to find the top 3 batters of the tournament.

![image](https://hackmd.io/_uploads/H1JtsWu-We.png)
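For readers who want to reproduce the walkthrough below, here is a hedged sketch that loads the table and applies the `pca` function from Chapter 2. The CSV file name and the `Player` column label are assumptions for illustration; the four feature columns match the list that follows. Note that eigenvector signs are arbitrary, so the scores may come out sign-flipped relative to the tables below.

```python
import pandas as pd

# Hypothetical file name: assumes the table above has been saved as a CSV
df = pd.read_csv("wwc2025_batting.csv")
features = ["Runs", "Average", "Strike_rate", "Boundaries"]
X = df[features].to_numpy(dtype=float)

# pca() as defined in Chapter 2; k=2 keeps PC1 and PC2
T, Vk, eigenvalues = pca(X, k=2)

print(eigenvalues / eigenvalues.sum())        # explained variance ratios
scores = pd.DataFrame(T, columns=["PC1", "PC2"], index=df["Player"])  # "Player" column assumed
print(scores.sort_values("PC1", ascending=False).head(3))             # top 3 batters by PC1
```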
We select four key features to describe the batting statistics:

- Runs: total runs
- Average: batting average
- Strike_rate: runs per 100 balls
- Boundaries: sum of fours and sixes

![image](https://hackmd.io/_uploads/HkuPaYcWZg.png)

**Step 1: Standardize the data**

Feature means and standard deviations:

![Standardize](https://hackmd.io/_uploads/BkAjDT7f-e.png)

| Feature | Mean | Std Dev |
|-------------|--------|---------|
| Runs | 287.73 | 70.19 |
| Average | 51.99 | 15.47 |
| Strike Rate | 94.45 | 18.47 |
| Boundaries | 38.60 | 12.11 |

Standardized z-scores:

![image](https://hackmd.io/_uploads/Hk2KVc9-bl.png)

**Step 2: Covariance Matrix**

Because the features are standardized, the covariance matrix is the same as the correlation matrix:

![correlation](https://hackmd.io/_uploads/ryCaw6QzWl.png)

| | Runs | Average | SR | Boundaries |
|------------|-------|---------|-------|------------|
| Runs | 1.000 | 0.636 | 0.419 | 0.910 |
| Average | 0.636 | 1.000 | 0.752 | 0.626 |
| SR | 0.419 | 0.752 | 1.000 | 0.587 |
| Boundaries | 0.910 | 0.626 | 0.587 | 1.000 |

![image](https://hackmd.io/_uploads/r1MDHo7Mbx.png)

**Step 3: Eigenvalues and Explained Variance**

![Eigendecomposition](https://hackmd.io/_uploads/ry8bdTQG-e.png)

| Component | Eigenvalue | Explained Variance |
|-----------|------------|--------------------|
| PC1 | 2.9722 | 74.30% |
| PC2 | 0.7193 | 17.98% |
| PC3 | 0.2620 | 6.55% |
| PC4 | 0.0465 | 1.16% |

PC1 + PC2 together explain ~92.3% of the variance.

![image](https://hackmd.io/_uploads/rJXBIimMZg.png)

![image](https://hackmd.io/_uploads/rJKoUnQM-x.png)

**Step 4: Eigenvectors (signs flipped for positive interpretability)**

| Feature | PC1 | PC2 | PC3 | PC4 |
|------------|--------|--------|--------|--------|
| Runs | 0.5047 | 0.5498 | -0.1513 | 0.6481 |
| Average | 0.5054 | -0.3627 | -0.7391 | -0.2583 |
| SR | 0.4575 | -0.6482 | 0.5207 | 0.3152 |
| Boundaries | 0.5297 | 0.3820 | 0.3996 | -0.6433 |

:::info
**Note:** Eigenvectors in PCA represent directions, not fixed arrows. If $v$ is an eigenvector, then $-v$ is equally valid because both point along the same line. Flipping the sign does not change the variance captured or the geometry of the transformation; it simply reverses the orientation of the axis. In practice, analysts often flip signs to make interpretation more intuitive (e.g., ensuring Runs, Average, and Boundaries load positively on PC1 so it clearly represents batting strength).
:::

PC1 loads positively on all four features → overall batting strength.
PC2 contrasts Strike Rate vs Average → batting style.

**Step 5: Player PC Scores**

![Project on PCs](https://hackmd.io/_uploads/SkBVd67Gbe.png)

| Player | PC1 | PC2 | PC3 | PC4 |
|-------------------|-------|-------|--------|--------|
| Laura Wolvaardt | 3.178 | 1.879 | -0.049 | -0.073 |
| Ashleigh Gardner | 2.434 | -1.439 | -0.304 | 0.139 |
| Alyssa Healy | 2.083 | -1.161 | 0.128 | -0.359 |
| Smriti Mandhana | 1.583 | 0.823 | 0.243 | 0.351 |
| Phoebe Litchfield | 1.011 | -0.103 | 0.904 | -0.131 |
| Jemimah Rodrigues | 0.485 | -1.037 | -0.459 | -0.039 |
| Pratika Rawal | -0.184 | 0.835 | -0.403 | -0.214 |
| Sophie Devine | -0.405 | -0.075 | -0.824 | 0.215 |
| Heather Knight | -0.415 | 0.352 | -0.109 | 0.005 |
| Nat Sciver-Brunt | -1.103 | 0.011 | -0.120 | 0.256 |
| Harmanpreet Kaur | -1.378 | -0.118 | 0.575 | 0.257 |
| Amy Jones | -1.537 | -0.011 | 0.365 | -0.198 |
| Brooke Halliday | -1.557 | -0.311 | -0.356 | 0.058 |
| Tazmin Brits | -1.643 | -0.244 | 0.886 | 0.028 |
| Sharmin Akhter | -2.553 | 0.599 | -0.478 | -0.297 |

To visualize how players cluster based on these components, we project them onto the PC1–PC2 plane; a plotting sketch follows.
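A short matplotlib sketch of how such a projection plot could be produced; it assumes the `scores` DataFrame built in the loading sketch earlier in this chapter:

```python
import matplotlib.pyplot as plt

# `scores` is the PC-score DataFrame from the loading sketch above (index = player names)
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(scores["PC1"], scores["PC2"])
for player, (pc1, pc2) in scores.iterrows():
    ax.annotate(player, (pc1, pc2), xytext=(4, 4), textcoords="offset points", fontsize=8)
ax.axhline(0, linewidth=0.5)
ax.axvline(0, linewidth=0.5)
ax.set_xlabel("PC1 (overall batting strength)")
ax.set_ylabel("PC2 (style: tempo vs consistency)")
ax.set_title("Players on PC1 vs PC2")
plt.show()
```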
![image](https://hackmd.io/_uploads/HynwIYufbx.png)

***PCA Scatterplot: Players on PC1 vs PC2***

This 2D projection shows how players are distributed based on overall batting strength (PC1) and stylistic contrast (PC2). Players in the top-right quadrant combine dominance with tempo, while those in other regions reflect different batting profiles.

![image](https://hackmd.io/_uploads/SJp_vhQz-g.png)

**Ranking by PC1 (Top 3 Batters)**

| Rank | Player | PC1 Score |
|------|-------------------|-----------|
| 1 | Laura Wolvaardt | 3.18 |
| 2 | Ashleigh Gardner | 2.43 |
| 3 | Alyssa Healy | 2.08 |

![image](https://hackmd.io/_uploads/H193EnXzWx.png)

**Result interpretation:**

- Laura Wolvaardt dominates PC1 due to high Runs + Boundaries + Average.
- Ashleigh Gardner and Alyssa Healy climb into the top 3 thanks to their aggressive strike rates and strong averages.
- PC1 is the overall batting-strength axis, while PC2 adds stylistic nuance (tempo vs consistency).

**Chapter 4: Practical Applications of PCA and Limitations**

**Linear Algebra Perspective**

PCA is essentially a rotation of the coordinate system to align with the directions of maximum variance. The eigenvectors form an orthonormal basis, ensuring that components are uncorrelated and non-redundant. This perspective highlights PCA as a geometric transformation: reshaping the data space into axes that maximize interpretability.

**Practical Use Case: Image Compression**

Each grayscale image can be represented as a vector of pixel intensities. Images are high-dimensional (e.g., 32×32 pixels = 1024 features). PCA reduces dimensionality by retaining only the top principal components: instead of storing all 1024 pixel values, far fewer principal component scores are stored, and approximate images can be reconstructed from them with minimal loss.

**Strengths of PCA**

- **Noise Reduction:** Removes low-variance components often associated with random fluctuations.
- **Efficiency:** Speeds up algorithms by reducing feature count, especially in machine learning pipelines.
- **Generalization:** Provides a compact representation that often improves model robustness.

**Limitations of PCA**

1. **Linearity:** PCA only captures linear relationships; nonlinear structures may be lost.
   - *Example:* If the data lies on a curved manifold (e.g., handwritten digits or word embeddings in natural language processing), PCA flattens these relationships and may miss meaningful clusters.
   - *Preferred alternative:* t-SNE (t-distributed Stochastic Neighbor Embedding) is better for visualizing clusters in high-dimensional data, such as distinguishing handwritten digits in the classic MNIST dataset (a benchmark collection of 70,000 images of digits 0–9). While PCA may only separate digits by broad visual traits such as stroke thickness or overall intensity, t-SNE reveals clear clusters for each digit identity, showing how nonlinear methods can uncover groupings that PCA alone might miss (see the sketch after this list).
2. **Interpretability:** Principal components are linear combinations of features, not the original features themselves, which makes them harder to explain intuitively.
   - *Example:* A principal component might combine "Runs" and "Strike Rate" with opposite signs, making it unclear whether it represents "aggressiveness" or "consistency." Analysts often flip signs or rename PCs to make them more intuitive.
3. **Scaling Sensitivity:** PCA is sensitive to feature scaling and requires standardized features; otherwise, large-scale features dominate the components.
   - *Example:* If "Runs" are measured in hundreds and "Strike Rate" in decimals, unscaled PCA will be dominated by "Runs." Standardization ensures each metric contributes fairly.
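To make the linearity point concrete, here is a small scikit-learn sketch that projects the bundled `digits` dataset (a small MNIST-like benchmark of 8×8 digit images, not full MNIST) to two dimensions with both PCA and t-SNE; the plot layout and parameter choices are illustrative only:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()          # 1,797 8x8 grayscale digit images, labels 0-9

# Linear projection (PCA) vs nonlinear embedding (t-SNE), both to 2 dimensions
X_pca = PCA(n_components=2).fit_transform(digits.data)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, emb, title in [(axes[0], X_pca, "PCA (linear)"),
                       (axes[1], X_tsne, "t-SNE (nonlinear)")]:
    points = ax.scatter(emb[:, 0], emb[:, 1], c=digits.target, cmap="tab10", s=8)
    ax.set_title(title)
fig.colorbar(points, ax=axes, label="digit")
plt.show()
```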
**Chapter 5: Conclusion**

Principal Component Analysis (PCA) is a powerful technique for simplifying complex datasets while preserving their essential structure. By transforming correlated variables into orthogonal components, PCA highlights dominant trends and reduces dimensionality without losing interpretive depth.

**Key Takeaways**

- PCA reduces dimensionality by capturing maximum variance in fewer components, making complex data easier to interpret.
- Orthogonal principal components are mutually uncorrelated, allowing clearer separation of underlying patterns.
- The workflow of standardization, covariance (correlation) matrix, eigen decomposition, and score calculation provides a reproducible path from raw data to insight.