# Principal Component Analysis (PCA)
>[!Important] **Math & Logic Termination Project**
### Author
**Roshan Ravishankar Naidu**
Luddy School of Informatics, Computing, and Engineering
Indiana University, Indianapolis
---
### Instructor
**Leon Johnson**
### Teaching Assistant
**Jorge Puga Hernandez**
---
### **Course**
Math & Logic
---
## **Abstract**
Principal Component Analysis (PCA) is a fundamental technique in mathematics, statistics, and machine learning used to reduce the dimensionality of high-dimensional data while preserving as much variability as possible.
This writeup presents a complete mathematical and computational exploration of PCA, beginning with its theoretical foundations and moving toward practical implementation.
Here, PCA is applied to a generated dataset and the results are validated using the Scikit-Learn PCA module. Topics include explained-variance analysis, visualization of principal components, and reconstruction of the original dataset from reduced dimensions. These analyses highlight PCA’s effectiveness in compression, noise reduction, and feature extraction.
The results confirm that PCA is mathematically elegant, computationally efficient, and practically powerful for discovering latent structure in complex datasets. This project therefore offers both rigorous theoretical insight and hands-on data science application.
---
## Introduction
Modern datasets often contain dozens, hundreds, or even thousands of variables. While high-dimensional data can encode rich information, it also introduces several challenges:
- Increased computational cost
- Difficulty in visualization
- Presence of multicollinearity
- Noise amplification
- Risk of overfitting in Machine Learning models
To address these issues, dimensionality reduction techniques play a crucial role in mathematical modeling, statistics, and data science.
Among them, **Principal Component Analysis (PCA)** stands as one of the most elegant and widely used methods. PCA transforms high-dimensional data into a new coordinate system whose axes, called **principal components**, are chosen to maximize variance and reveal the most informative structure in the data.
PCA is not only a practical algorithm but also a concept grounded in deep mathematical principles: covariance matrices, eigenvalues, eigenvectors, orthogonality, and linear transformations. Thus, it bridges linear algebra, optimization, statistics, and geometry.
---
## Importance of PCA
PCA is extensively used in:
- Image compression
- Noise reduction
- Visualization of high-dimensional datasets
- Preprocessing for machine learning models
- Identifying latent structure or hidden patterns
- Feature extraction and decorrelation
Because of its ability to simplify data without losing significant information, PCA remains a foundational tool in exploratory data analysis and dimensionality reduction.
---
## **Theoretical Foundations of PCA**
Principal Component Analysis (PCA) is a linear transformation technique rooted in statistics and linear algebra. PCA identifies directions (principal components) in which the data varies the most and projects the data onto those directions.
### A. Preliminary Mathematical Concepts
#### 1. Vectors and Matrices
A dataset with $n$ observations and $p$ features can be represented as a matrix:
$$
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np} \\
\end{bmatrix}
$$
Each **row** is an observation, each **column** is a feature.
#### 2. Mean Centering
To prepare data for PCA, we subtract the mean of each feature so that the data is centered around 0:
$$
\tilde{X} = X - \mu
$$
Where:
$$
\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}
$$
#### **3. Variance**
Variance measures the spread of a variable:
$$
Var(X_j) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \mu_j)^2
$$
#### **4. Covariance**
Covariance measures how two features (columns) change together: if both tend to increase together, the covariance is positive; if one increases while the other decreases, it is negative:
$$
Cov(X_j, X_k)
= \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \mu_j)(x_{ik} - \mu_k)
$$
#### **5. Covariance Matrix**
For all feature pairs, we form the covariance matrix, which captures the pairwise covariances between all features:
$$
\Sigma = \frac{1}{n-1} \tilde{X}^T\tilde{X}
$$
It is a square $p \times p$ matrix, symmetric and positive semi-definite.
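As a quick numerical illustration of the centering and covariance formulas above, the minimal NumPy sketch below mean-centers a small random matrix and builds the covariance matrix from the formula; the dataset and variable names are illustrative, and `np.cov` is used only as a cross-check.
```python
import numpy as np

# Illustrative dataset: 6 observations, 3 features (values are arbitrary)
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(6, 3))

mu = X.mean(axis=0)          # feature means mu_j
X_tilde = X - mu             # mean-centered data
n = X.shape[0]

# Covariance matrix from the formula: Sigma = (1 / (n - 1)) * X_tilde^T X_tilde
Sigma = X_tilde.T @ X_tilde / (n - 1)

# Cross-check against NumPy's estimator (rowvar=False means columns are features)
assert np.allclose(Sigma, np.cov(X, rowvar=False))
print(Sigma)
```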
#### **6. Eigenvalues and Eigenvectors**
**Eigenvectors ($\mathbf{v}$)** are the directions that do not change direction when the matrix $A$ is applied; the matrix merely scales each eigenvector by its corresponding eigenvalue.
**Eigenvalues ($\lambda$)** measure how much an eigenvector is scaled when the transformation is applied. For a covariance matrix, a large $\lambda$ means the eigenvector points along a direction of high variance, while a small $\lambda$ (close to zero) indicates a low-variance direction.
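A minimal sketch of the eigenvalue relation, using `np.linalg.eigh` on a small hand-picked symmetric matrix (the matrix `A` is illustrative, not taken from the project data):
```python
import numpy as np

# Illustrative 2x2 symmetric (covariance-like) matrix
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# eigh is intended for symmetric matrices; eigenvalues are returned in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Verify the defining relation A v = lambda v for each eigenpair
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(f"lambda = {lam:.3f}, residual A v - lambda v = {A @ v - lam * v}")
```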
### B. Deriving PCA from First Principles
The goal of PCA is to find a direction (vector) $\mathbf{w}$ such that when we project data onto that direction, the variance of the projections is maximized.
#### **Eigenvector & Eigenvalue:**
For a square matrix $(A)$, if:
$$
A\mathbf{v} = \lambda \mathbf{v}
$$
then:
- $\mathbf{v}$ is an **eigenvector**
- $\lambda$ is the **eigenvalue**
The largest eigenvalues correspond to directions of maximum variance.
Once the eigenvalues and eigenvectors of the covariance matrix are computed, they are sorted in descending order of eigenvalue magnitude. The principal components are the top eigenvectors: the first principal component is the eigenvector with the largest eigenvalue, the second corresponds to the second-largest eigenvalue, and so on. These principal components are orthogonal, so the projected features are uncorrelated. The matrix formed from the selected eigenvectors is used to project the original data into a lower-dimensional space, yielding the reduced dataset.
#### **Objective function:**
The objective is to maximize the variance of the data projected onto $\mathbf{w}$:
$$
\max_{\mathbf{w}} \ \ Var(\mathbf{w}^T \tilde{X})
$$
#### **Rewriting using covariance matrix:**
$$
Var(\mathbf{w}^T \tilde{X}) = \mathbf{w}^T \Sigma \mathbf{w}
$$
#### **Optimization problem:**
Without a constraint, the variance can be made arbitrarily large simply by scaling $\mathbf{w}$. To **prevent this trivial solution** and focus on direction rather than magnitude, we constrain the solution to the unit sphere, i.e., $\| \mathbf{w} \| = 1$:
$$
\max_{\mathbf{w}} \ \mathbf{w}^T\Sigma\mathbf{w} \quad
\text{subject to} \quad \mathbf{w}^T\mathbf{w} = 1
$$
#### **Using Lagrange multipliers:**
$$
L(\mathbf{w}, \lambda) = \mathbf{w}^T\Sigma\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)
$$
Taking the derivative with respect to $\mathbf{w}$ and setting it to zero gives $2\Sigma\mathbf{w} - 2\lambda\mathbf{w} = 0$, i.e.:
$$
\Sigma \mathbf{w} = \lambda \mathbf{w}
$$
>[!Note]
>$\mathbf{w}$ is an eigenvector of the covariance matrix $\Sigma$.
>$\lambda$ is the corresponding eigenvalue.
Thus, **principal components are eigenvectors** of the covariance matrix.
The **largest eigenvalue gives the first principal component**, and so on.
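A hedged numerical check of this result on synthetic data (all names and the data itself are illustrative): the variance of the data projected onto the top eigenvector of $\Sigma$ equals the largest eigenvalue, and a random unit direction does not exceed it.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data
X_tilde = X - X.mean(axis=0)
Sigma = X_tilde.T @ X_tilde / (len(X) - 1)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)          # ascending order
w1 = eigenvectors[:, -1]                                   # eigenvector of the largest eigenvalue

# Variance of the projections onto w1 equals the largest eigenvalue
print(np.var(X_tilde @ w1, ddof=1), eigenvalues[-1])       # the two values agree

# A random unit direction never captures more variance than w1
w = rng.normal(size=4)
w /= np.linalg.norm(w)
print(np.var(X_tilde @ w, ddof=1) <= eigenvalues[-1] + 1e-9)
```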
#### **Reconstruction of the Original Dataset:**
After transforming the data to a lower-dimensional subspace using PCA, we can approximate the original dataset by projecting the reduced data back into the original space using the principal components. This reconstruction process is essential to understand how much of the original information is preserved in the reduced space.
When performing PCA, we obtain a lower-dimensional representation \(Z\) of the original data $\tilde{X}$:
$$
Z = \tilde{X} W_k
$$
where $W_k$ is the matrix of the top $k$ principal components. To reconstruct the data, we use:
$$
\tilde{X}_{\text{reconstructed}} = Z W_k^T
$$
This equation approximates the original dataset $\tilde{X}$ using only the first $k$ principal components, where $k$ is the number of dimensions retained after PCA.
Reconstruction is useful for data compression, noise reduction, data visualization, and more. Understanding how well PCA can reconstruct the data helps assess how much information is lost in dimensionality reduction and informs the decision of how many components to retain.
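As a hedged sketch of projection and reconstruction (the 3-feature dataset, the choice $k = 2$, and all variable names are illustrative), the centered data is projected onto the top $k$ eigenvectors and then mapped back to the original space:
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.8, 0.2],
                                         [0.0, 1.0, 0.1],
                                         [0.0, 0.0, 0.3]])   # correlated 3-feature data
mu = X.mean(axis=0)
X_tilde = X - mu
Sigma = X_tilde.T @ X_tilde / (len(X) - 1)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]                        # sort descending

k = 2
W_k = eigenvectors[:, order][:, :k]    # top-k principal directions (p x k)
Z = X_tilde @ W_k                      # reduced representation (n x k)
X_rec = Z @ W_k.T + mu                 # reconstruction in the original feature space

# Frobenius-norm reconstruction error; it decreases as k grows and is ~0 when k = p
print(np.linalg.norm(X - X_rec))
```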
### **PCA via Covariance Matrix**
Steps to perform PCA using eigen-decomposition:
1. Center the data:
$\tilde{X} = X - \mu$
2. Compute covariance matrix:
$\Sigma = \frac{1}{n-1} \tilde{X}^T\tilde{X}$
3. Compute eigenvalues and eigenvectors of $\Sigma$
4. Sort eigenvectors in decreasing order of eigenvalues
5. Select top $k$ components to form projection matrix:
$$
W_k = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k]
$$
6. Transform and project the original data onto the new lower-dimensional subspace:
$$
Z = \tilde{X} W_k
$$
>[!Note]
>SVD-based PCA is more numerically stable and is preferred for large datasets.
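
As a hedged illustration of this note (random data, illustrative names), the sketch below computes the principal directions with `np.linalg.svd` on the centered data and checks that they agree, up to sign, with the eigenvectors of $\Sigma$:
```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
X_tilde = X - X.mean(axis=0)

# Route 1: eigen-decomposition of the covariance matrix
Sigma = X_tilde.T @ X_tilde / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # descending order

# Route 2: SVD of the centered data, X_tilde = U S V^T; the columns of V are the principal directions
U, S, Vt = np.linalg.svd(X_tilde, full_matrices=False)
eigvals_svd = S**2 / (len(X) - 1)                      # singular values map to eigenvalues

print(np.allclose(eigvals, eigvals_svd))               # same explained variances
print(np.allclose(np.abs(Vt.T), np.abs(eigvecs)))      # same directions up to a sign flip
```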
---
## **PCA Pseudocode**
Below is a clean, language-agnostic pseudocode representation of the PCA algorithm:
```text
PCA(X, k):
    Input:
        - X: data matrix of size (n × p)
        - k: number of principal components to keep

    Step 1 (optional): Standardize each feature of X
        For each column j:
            X[:, j] = (X[:, j] - mean(X[:, j])) / std(X[:, j])

    Step 2: Center the data (a no-op if Step 1 was applied, since X is already zero-mean)
        X_centered = X - column_means(X)

    Step 3: Compute covariance matrix
        Sigma = (1 / (n - 1)) * (X_centeredᵀ × X_centered)

    Step 4: Compute eigenvalues and eigenvectors of Sigma
        (eigenvalues, eigenvectors) = eig(Sigma)

    Step 5: Sort eigenvectors by descending eigenvalues
        indices = argsort(eigenvalues, descending=True)
        eigenvectors_sorted = eigenvectors[:, indices]

    Step 6: Choose top k eigenvectors
        W_k = eigenvectors_sorted[:, 0:k]

    Step 7: Transform the data
        Z = X_centered × W_k

    Output:
        - Z: reduced representation of X
        - W_k: principal directions
        - eigenvalues: variance explained by each component
```
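To make the pseudocode concrete, here is a minimal NumPy translation of the same eigen-decomposition steps (the function name `pca_eig` and the demo data are illustrative; standardization, Step 1, is left to the caller):
```python
import numpy as np

def pca_eig(X, k):
    """Eigen-decomposition PCA following the pseudocode above (illustrative sketch)."""
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)                    # Step 2: mean-center each feature
    Sigma = X_centered.T @ X_centered / (n - 1)        # Step 3: covariance matrix (p x p)
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)  # Step 4: eigh suits symmetric matrices
    order = np.argsort(eigenvalues)[::-1]              # Step 5: sort by descending eigenvalue
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    W_k = eigenvectors[:, :k]                          # Step 6: top-k eigenvectors (p x k)
    Z = X_centered @ W_k                               # Step 7: project onto the subspace
    return Z, W_k, eigenvalues

# Tiny usage example on random data (10 samples, 4 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 4))
Z, W_k, eigvals = pca_eig(X_demo, k=2)
print(Z.shape, W_k.shape, eigvals)
```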
## Practical Implementation
This section demonstrates how Principal Component Analysis (PCA) is implemented using **Scikit-Learn’s PCA module**
on a small synthetic dataset of student test scores. It covers generating the data, standardizing the values with `StandardScaler`, computing the components, eigenvalues, and explained variance, visualizing scree plots, explained variance, 2D/3D scatter plots, and biplots, and finally provides a detailed explanation of the PCA implementation and visualizations.
```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler # Import StandardScaler
# 2. Generate a synthetic dataset based on a real-world scenario (Student Test Scores)
# 20 students and 5 subjects: Math, Science, English, History, Art
num_students = 20
num_subjects = 5
# Generated scores with some correlation to simulate real-world data
# For example, Math and Science scores might be related, and English and History
np.random.seed(42) # for reproducibility
# Base scores (e.g., average performance for each student)
base_scores = np.random.rand(num_students, 1) * 20 + 60 # scores between 60 and 80
# Subject specific variations and correlations
# Introduced some correlation: Math/Science and English/History
math_science_common_factor = np.random.randn(num_students, 1) * 8
english_history_common_factor = np.random.randn(num_students, 1) * 8
math_scores = base_scores + math_science_common_factor + np.random.randn(num_students, 1) * 5
science_scores = base_scores + math_science_common_factor * 0.7 + np.random.randn(num_students, 1) * 5
english_scores = base_scores + english_history_common_factor + np.random.randn(num_students, 1) * 5
history_scores = base_scores + english_history_common_factor * 0.6 + np.random.randn(num_students, 1) * 5
art_scores = base_scores + np.random.randn(num_students, 1) * 7 # less correlation with others
# Combined into a DataFrame for better readability and to name features
data = np.hstack([math_scores, science_scores, english_scores, history_scores, art_scores])
feature_names = ['Math', 'Science', 'English', 'History', 'Art']
X_df = pd.DataFrame(data, columns=feature_names)
# Ensured scores are within a reasonable range (e.g., 0-100)
X_df = X_df.clip(0, 100)
X = X_df.values # Used the numpy array for PCA
print("Synthetic Dataset (Student Scores - X) shape:", X.shape)
print("First 5 rows of the dataset:")
print(X_df.head())
# 3. Standardize the data before applying PCA
# Scaling is important for PCA because PCA is sensitive to the variance of the features.
# Features with larger variances will dominate the principal components if not scaled.
# StandardScaler transforms the data such that its mean is 0 and standard deviation is 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 4. Instantiate the PCA class
pca = PCA()
# 5. Fit the PCA model to the *scaled* synthetic dataset X
pca.fit(X_scaled)
# 6. Extracted the principal components, eigenvalues (explained variance), and explained variance ratio
components = pca.components_
eigenvalues = pca.explained_variance_
explained_variance_ratio = pca.explained_variance_ratio_
print("\nPrincipal Components (shape):", components.shape)
print("\nEigenvalues (Explained Variance) after scaling:")
print(eigenvalues)
print("\nExplained Variance Ratio after scaling:")
print(explained_variance_ratio)
```

### Scree Plot and Explained Variance
Scree plot to show eigenvalues vs. number of components and an explained variance plot to display cumulative variance retained.
- The scree plot visualizes the ordered eigenvalues and shows how variance decreases across components.
- A clear "elbow" indicates diminishing returns from additional components.
- The cumulative explained variance plot shows how much total variance is retained as more components are included.
- Selecting the first $m$ components corresponds to projecting the data onto the subspace spanned by the top $m$ eigenvectors.
```python
import matplotlib.pyplot as plt
# Get the number of components
num_components = len(eigenvalues)
components_range = np.arange(1, num_components + 1)
# Calculate cumulative explained variance
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# Create a figure with two subplots
plt.figure(figsize=(14, 6))
# Subplot 1: Scree Plot
plt.subplot(1, 2, 1)
plt.plot(components_range, eigenvalues, marker='o', linestyle='-', color='b')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.xticks(components_range)
plt.grid(True)
# Subplot 2: Cumulative Explained Variance Plot
plt.subplot(1, 2, 2)
plt.plot(components_range, cumulative_explained_variance, marker='o', linestyle='-', color='r')
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.axhline(y=1, color='gray', linestyle='--', label='Total Variance')
plt.xticks(components_range)
plt.yticks(np.arange(0, 1.1, 0.1))
plt.grid(True)
plt.legend()
# Adjust layout and display plots
plt.tight_layout()
plt.show()
```

### 2D/3D Scatter Plot
Project the data onto the first few principal components (PC1, PC2, PC3) and generate 2D and 3D scatter plots.
- Each axis (PC1, PC2, PC3) corresponds to an eigenvector direction.
- Each point represents an observation projected onto the principal component axes.
- Greater spread along an axis indicates higher variance captured by that component.
```python
import matplotlib.pyplot as plt
import plotly.express as px
# 1. Transform the *scaled* dataset X_scaled using the fitted PCA model
X_pca = pca.transform(X_scaled)
# 2D plot using matplotlib:
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('2D PCA Scatter Plot (PC1 vs. PC2)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
```

```python
# Interactive 3D plot using Plotly:
# Calculate total score for each student to use for coloring
total_scores = X_df.sum(axis=1)
# Create a DataFrame containing PCA components, total scores, and original features for hover data
pca_plotly_df = pd.DataFrame({
'PC1': X_pca[:, 0],
'PC2': X_pca[:, 1],
'PC3': X_pca[:, 2],
'Total Score': total_scores
})
# Add original subject scores to the DataFrame for hover data
pca_plotly_df = pd.concat([pca_plotly_df, X_df], axis=1)
fig_3d = px.scatter_3d(
pca_plotly_df, # Pass the combined DataFrame
x='PC1',
y='PC2',
z='PC3',
color='Total Score', # Color points by total score, now a column in pca_plotly_df
color_continuous_scale=px.colors.sequential.Viridis, # Use a sequential color scale
title='Interactive 3D PCA Scatter Plot (PC1 vs. PC2 vs. PC3) - Colored by Total Score',
labels={'PC1': 'Principal Component 1', 'PC2': 'Principal Component 2', 'PC3': 'Principal Component 3', 'Total Score': 'Total Score'},
hover_name=pca_plotly_df.index.astype(str), # Use student index as hover name
hover_data=X_df.columns.tolist() + ['Total Score'], # Show all original scores and total score on hover
height=700,
template='plotly_dark'
)
fig_3d.update_layout(scene=dict(xaxis_title='Principal Component 1', yaxis_title='Principal Component 2', zaxis_title='Principal Component 3'))
fig_3d.show()
```

### Biplots (Scores and Loadings)
Biplots visualize both the data points and the loading vectors, showing the relationships between the original variables and the principal components.
- Combines the scatter plot of the data points (projected onto the first two principal components) with vectors representing the original features (loadings).
- Variables pointing in similar directions are positively correlated; opposite directions indicate negative correlation.
- Projecting a point onto a variable arrow approximates that variable’s value.
```python
import matplotlib.pyplot as plt
# Create a figure and an axes object for the biplot
plt.figure(figsize=(10, 8))
# Plot the PCA-transformed data (X_pca[:, 0] against X_pca[:, 1])
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7, label='Data Points')
# Calculate scaled loading vectors for display:
# pca.components_ has shape (n_components, n_features), so transpose it to get one row per feature.
# Multiplying by np.sqrt(eigenvalues) scales each direction by the standard deviation it explains.
# A constant factor (e.g., 2) adjusts the arrow length for better visualization.
scaling_factor = 2 # Adjust as needed for better visualization
loadings = pca.components_.T * np.sqrt(pca.explained_variance_) * scaling_factor
# Iterate through the loading vectors and draw arrows
for i, (x, y) in enumerate(loadings[:, :2]): # Only consider first two principal components for 2D biplot
plt.arrow(0, 0, x, y, color='red', alpha=0.8, head_width=0.05, head_length=0.05)
plt.text(x * 1.1, y * 1.1, feature_names[i], color='darkred', ha='center', va='center')
# Set the title of the biplot
plt.title('Biplot of PC1 vs. PC2')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
# Add a grid to the plot
plt.grid(True)
plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)
plt.legend()
# Display the plot
plt.show()
```

### Summary:
#### Data Analysis Key Findings
* A synthetic dataset consisting of **20 samples** and 5 features was successfully generated for the PCA, simulating student test scores in 5 subjects.
* The Principal Component Analysis (PCA) identified 5 principal components.
* The eigenvalues (explained variance) for the components were: **[239.24, 110.73, 45.93, 25.82, 14.28]** (rounded to two decimal places).
* The explained variance ratios were: **[0.5487, 0.2540, 0.1053, 0.0592, 0.0327]** for PC1 through PC5, respectively. This indicates that the first principal component (PC1) explains approximately 54.87% of the total variance, followed by PC2 explaining about 25.40%.
* The cumulative explained variance plot showed that using the first three principal components (PC1, PC2, and PC3) captures approximately **90.80%** (54.87% + 25.40% + 10.53%) of the total variance in the dataset.
* Visualizations included a scree plot, a cumulative explained variance plot, 2D and 3D scatter plots of the data projected onto the principal components, and a biplot illustrating the relationships between original features and the first two principal components.
#### Insights or Next Steps
* **Dimensionality Reduction Potential:** Given that the first three principal components explain over 90% of the variance, it is highly feasible to reduce the dimensionality of this dataset from 5 features to 3 features while retaining a significant amount of information. This reduction can simplify subsequent analysis or modeling without substantial loss of information.
* **Feature Contribution Analysis:** The biplot indicates how each original feature contributes to the principal components. By observing the direction and length of the loading vectors, we can infer which original features are most strongly correlated with PC1 or PC2, providing insights into the underlying structure of the data and identifying potential groupings or relationships among the subjects. For instance, features pointing in similar directions are positively correlated, and those pointing in opposite directions are negatively correlated.
### Explained PCA Implementation:
#### Detailed Explanation of PCA Implementation and Visualizations
This section provides a comprehensive overview of the Principal Component Analysis (PCA) implementation, from data generation to the interpretation of various visualizations.
##### 1. Dataset Generation
A synthetic dataset simulating **20 students' scores across 5 subjects (Math, Science, English, History, Art)** was created using `numpy`. The data was designed to include some underlying correlations (e.g., Math and Science scores, English and History scores) to mimic real-world scenarios. Each student's scores were generated based on a `base_score` with subject-specific variations and common factors, ensuring a range of scores between 0 and 100. This dataset, `X_df`, was then converted to a NumPy array `X` for PCA.
##### 2. PCA Implementation Steps
The `sklearn.decomposition.PCA` class was used to perform PCA:
* **Instantiation**: `pca = PCA()` initializes the PCA model without specifying the number of components, meaning all possible components (equal to the number of features or samples, whichever is less) are computed.
* **Fitting**: `pca.fit(X_scaled)` calculates the principal components, eigenvalues, and explained variance from the standardized dataset `X_scaled`. Note that `sklearn.decomposition.PCA` centers the data internally but does not standardize it, which is why `StandardScaler` is applied beforehand so that features with larger scales do not dominate.
* **Extraction of Results**:
* `pca.components_`: These are the principal components (eigenvectors), representing the directions of maximum variance in the data. Each row corresponds to a principal component, and each column corresponds to an original feature.
* `pca.explained_variance_`: These are the eigenvalues, representing the amount of variance explained by each principal component. They are ordered from largest to smallest.
* `pca.explained_variance_ratio_`: This shows the proportion of total variance explained by each principal component.
##### 3. Interpretation of Visualizations
###### a. Scree Plot
* **Description**: The scree plot displays the eigenvalues (explained variance) against the number of principal components.
* **Interpretation**: We look for an "elbow" point in the plot, where the curve sharply changes slope before leveling off. Components before the elbow are typically considered significant, as they explain a substantial amount of variance. In our plot, the first component (PC1) has a significantly higher eigenvalue, followed by PC2, and then the values drop off more gradually. This suggests that the first two or three components capture most of the variance.
###### b. Cumulative Explained Variance Plot
* **Description**: This plot shows the cumulative sum of the explained variance ratio as more principal components are added.
* **Interpretation**: This plot helps in deciding the optimal number of principal components to retain for dimensionality reduction. We aim to select a number of components that collectively explain a high percentage of the total variance (e.g., 80% or 90%). In our case, the plot clearly shows that approximately 90.80% of the total variance is explained by the first three principal components, making PC1, PC2, and PC3 good candidates for reducing dimensionality while retaining most of the information.
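One practical shortcut, shown as a hedged sketch below, is to pass a float between 0 and 1 as `n_components`; Scikit-Learn then keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch assumes `X_scaled` from the implementation section is in scope.
```python
from sklearn.decomposition import PCA

# Keep the smallest number of components explaining at least 90% of the variance
pca_90 = PCA(n_components=0.90)
X_reduced = pca_90.fit_transform(X_scaled)   # X_scaled: standardized data from above

print("Components kept:", pca_90.n_components_)
print("Cumulative explained variance:", pca_90.explained_variance_ratio_.sum())
```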
###### c. 2D and 3D Scatter Plots
* **Description**: The standardized data `X_scaled` is transformed into the principal component space using `X_pca = pca.transform(X_scaled)`. The 2D plot visualizes the data points projected onto the first two principal components (PC1 vs. PC2), and the 3D plot visualizes them on the first three principal components (PC1 vs. PC2 vs. PC3).
* **Interpretation**: These plots allow us to visually inspect the data's structure in a reduced-dimensional space. We can observe clusters, outliers, or patterns that might not be evident in the high-dimensional original data. For instance, if points form distinct groups in the 2D or 3D plots, it suggests underlying clusters in the original data. The spread of the points indicates the variance captured by those components.
###### d. Biplot
* **Description**: A biplot combines the scatter plot of the data points (projected onto the first two principal components) with vectors representing the original features (loadings). The loading vectors are scaled versions of the principal components, indicating the direction and strength of the relationship between the original variables and the principal components.
* **Interpretation**:
* **Data Points**: Similar to the 2D scatter plot, the positions of the data points show their relationships in the PC space.
* **Feature Vectors**:
* **Direction**: Features pointing in similar directions are positively correlated. Features pointing in opposite directions are negatively correlated. Features at a 90-degree angle are largely uncorrelated.
* **Length**: The length of a vector indicates how much that feature contributes to the principal component. Longer vectors suggest a stronger influence on the component.
* **Relationship to PCs**: Features that align closely with PC1 (horizontal axis) strongly influence PC1, and similarly for PC2 (vertical axis).
* **Example**: If "Math" and "Science" vectors point in roughly the same direction and are long, it suggests they are highly correlated and contribute significantly to PC1 or PC2. If "Art" points in a different direction, it might be less correlated with the other subjects or might be captured by a different principal component. This visualization helps in understanding which original features drive the separation or patterns observed in the principal components.
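To complement the biplot reading, here is a small hedged sketch that tabulates the loadings as a table (it assumes the `pca`, `feature_names`, and standard imports from the implementation section are in scope; the column labels are illustrative):
```python
import numpy as np
import pandas as pd

# Loadings: weight of each original feature on each principal component,
# scaled by the standard deviation that component explains
loadings_df = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=feature_names,
    columns=[f"PC{i + 1}" for i in range(len(pca.explained_variance_))],
)
print(loadings_df.round(2))
```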
---
### **PCA Visualizations**
For analysis, PCA results are visualized using:
- **Scree Plot**: eigenvalues vs number of components
- **Explained Variance Plot**: cumulative variance retained
- **2D/3D Scatter Plot**: projecting data onto PC1, PC2, PC3
- **Biplots**: combined point + loading vector visualization
---
## **Results and Analysis**
In this section, we analyze the results produced by Principal Component Analysis (PCA) on our synthetic dataset.
We evaluate:
- Variance captured by each principal component
- The loading (weight) structure of eigenvectors
- Scatter plots of data projected into lower dimensions
- Interpretation of patterns revealed by PCA
- Comparison between original and reduced feature spaces
---
## **Discussion**
Principal Component Analysis (PCA) has demonstrated its strength as a powerful dimensionality-reduction technique throughout this project.
Here, we reflect on the results, interpret their significance, and connect the mathematical foundations with the computational outcomes.
### **Limitations of PCA**
While PCA is extremely useful, it has important limitations:
#### **1. Linearity**
PCA assumes that the data can be represented as linear combinations of the eigenvectors of the covariance matrix. Each principal component is a straight axis, and these axes must align with the directions of maximum variance. If the data has a curved or otherwise complex structure that cannot be captured by linear axes, PCA may produce misleading results.
In many real-world datasets, for example in image processing and speech recognition, the relationships between variables are nonlinear.
#### **2. Scaling sensitivity**
Features with large variance dominate unless standardized.
#### **3. Interpretability**
Principal components are linear combinations of the original variables, so they can be difficult to interpret as real-world concepts.
#### **4. Sensitivity to outliers**
Outliers can distort the covariance structure significantly.
#### **5. Data must be continuous**
Categorical variables require preprocessing (e.g., one-hot encoding).
Understanding these limitations helps clarify when PCA is applicable and when alternative methods (t-SNE, UMAP, Autoencoders) may be better suited.
### **Strengths of PCA (Balanced Perspective)**
- **Fast and computationally efficient**
- **Deterministic** (unlike stochastic methods like t-SNE)
- **Mathematically interpretable** through eigen-decomposition
- **Effective at noise reduction**
- **Widely used and easy to integrate** into ML workflows
For structured, correlated datasets like the one in this project, PCA is an excellent choice.
---
## Conclusion
Principal Component Analysis is one of the most fundamental and widely used tools in data science and machine learning. It combines mathematical elegance with practical usefulness, allowing us to simplify large datasets while still preserving the core patterns and relationships within them.
From the work in this project, we conclude:
- PCA reduces dimensionality effectively while retaining most of the important variance.
- It reveals hidden structure and correlations that are not obvious in the raw data.
- A small number of principal components are often enough to represent the dataset.
- Standardizing the data before PCA is critical for meaningful results.
In summary, PCA provides a powerful way to make high-dimensional data more understandable and manageable. It enhances visualization, improves performance of downstream models, and offers deep insight into the underlying structure of a dataset, making it an essential technique for any data scientist or analyst.
---
# Appendix
This appendix summarizes all notation, symbols, terminology, and mathematical constructs used throughout the PCA project. It includes a complete legend of variables, definitions, formulas, and supporting details.
## A. Symbol Table (Legend)
| Symbol | Meaning |
|--------|---------|
| $X$ | Original data matrix of size $n \times p$ |
| $n$ | Number of observations |
| $p$ | Number of features |
| $x_{ij}$ | Value of feature $j$ for sample $i$ |
| $\mu_j$ | Mean of feature $j$ |
| $\sigma_j$ | Standard deviation of feature $j$ |
| $\tilde{X}$ | Mean-centered data matrix |
| $X'$ | Standardized data matrix |
| $\Sigma$ | Covariance matrix |
| $\Sigma_{jk}$ | Covariance between features $j$ and $k$ |
| $\lambda_i$ | Eigenvalue of covariance matrix |
| $\mathbf{v}_i$ | Eigenvector corresponding to $\lambda_i$ |
| $V$ | Matrix of eigenvectors |
| $W_k$ | Projection matrix of top $k$ principal components |
| $Z$ | PCA-transformed data (scores) |
| $Z_k$ | Reduced representation using top $k$ components |
| $\hat{X}$ | Reconstructed data |
| $U, \Sigma_{svd}, V^T$ | SVD matrices such that $\tilde{X} = U\Sigma_{svd}V^T$ |
| $EVR$ | Explained Variance Ratio |
| $\|\cdot\|_F$ | Frobenius norm |
| $k$ | Number of principal components |
## B. Terminology & Definitions
### **Observation / Sample**
One row of the dataset.
### **Feature / Variable**
One column of the dataset.
### **Mean-Centering**
Subtracting the mean from each feature to achieve zero mean.
### **Standardization**
$$
x' = \frac{x - \mu}{\sigma}
$$
### **Covariance**
Measure of how two variables change together.
### **Covariance Matrix**
A $p \times p$ symmetric matrix containing pairwise covariances.
### **Eigenvalue**
Variance captured by a principal component.
### **Eigenvector**
Direction of maximum variance.
### **Principal Component**
Eigenvector of the covariance matrix.
### **Scores (Projected Data)**
$Z = \tilde{X} W_k$
### **Dimensionality Reduction**
Mapping data to fewer dimensions while preserving variability.
### **Reconstruction**
$\hat{X} = ZW_k^T + \mu$
## C. Matrix Dimensions Reference
| Object | Size | Description |
|--------|------|-------------|
| $X$ | $n \times p$ | Original dataset |
| $\tilde{X}$ | $n \times p$ | Mean-centered data |
| $\Sigma$ | $p \times p$ | Covariance matrix |
| $V$ | $p \times p$ | Eigenvectors |
| $W_k$ | $p \times k$ | Top $k$ eigenvectors |
| $Z_k$ | $n \times k$ | Reduced representation |
| $\hat{X}$ | $n \times p$ | Reconstructed data |
| $U$ | $n \times p$ | Left singular vectors |
| $\Sigma_{svd}$ | $p \times p$ | Singular values |
| $V^T$ | $p \times p$ | Right singular vectors |
## D. PCA Formulas (Complete)
### **1. Mean**
$$
\mu_j = \frac{1}{n} \sum_{i=1}^n x_{ij}
$$
### **2. Standardization**
$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$
### **3. Centered Data**
$$
\tilde{X} = X - \mu
$$
### **4. Covariance Matrix**
$$
\Sigma = \frac{1}{n-1}\tilde{X}^T \tilde{X}
$$
### **5. Eigenvalue Equation**
$$
\Sigma \mathbf{v}_i = \lambda_i \mathbf{v}_i
$$
### **6. Explained Variance Ratio**
$$
EVR_i = \frac{\lambda_i}{\sum_j \lambda_j}
$$
### **7. Projection Matrix**
$$
W_k = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k]
$$
### **8. PCA Scores**
$$
Z = \tilde{X} W_k
$$
### **9. Reconstruction**
$$
\hat{X} = Z W_k^T + \mu
$$
### **10. SVD Relationship**
$$
\tilde{X} = U \Sigma_{svd} V^T
$$
## E. PCA Algorithm (Summary)
1. Standardize data
2. Mean-center
3. Compute covariance matrix
4. Compute eigenvalues and eigenvectors
5. Sort eigenvalues descending
6. Select top $k$ eigenvectors
7. Compute scores $Z = \tilde{X}W_k$
## F. Plot Reference
| Plot | Purpose |
|-------|---------|
| Scree Plot | Eigenvalues |
| Cumulative Variance Plot | How many PCs to keep |
| PC1–PC2 Scatter | 2D visualization |
| 3D PCA Plot | 3D visualization |
## G. Additional Notes
### **1. Sign Ambiguity**
Eigenvectors are only defined up to sign; flipping the sign of a principal component does not change the subspace it spans or the variance it explains, so PCA results are unaffected.
### **2. Linearity Assumption**
PCA is linear.
Nonlinear methods include:
- t-SNE
- UMAP
- Autoencoders
### **3. PCA as Rotation**
It rotates coordinate axes to align with maximal variance.
## H. Glossary
| Term | Definition |
|------|------------|
| Latent Variable | Hidden factor generating the data |
| Orthogonality | Perpendicular directions; in PCA the component scores are also uncorrelated |
| Rank | Number of non-zero eigenvalues |
| Loading | Feature weight in a PC |
| Projection | Mapping into PC space |
| Low-Rank Approximation | Approximation of the data using fewer dimensions than the original |