---
title: 'ML Group 15 Project'
disqus: hackmd
---
ROBPCA: A New Approach to Robust Principal Component Analysis
===
Group 15
---
> Honor Code: We shall be honest in our efforts and will make our parents proud.
Shikha Bhat (2019A7PS0063G)
Yash Trivedi (2019B4AA0834G)
Viraj Sharma (2020A7PS1011G)
## Motivation
While solving any ML problem, the data we analyze and feed to the model is of great importance. Principal Component Analysis (PCA) is widely used for the analysis of high-dimensional data: it represents the dataset in a lower-dimensional space through linear combinations of the original variables.
PCA often allows for interpretation and a better understanding of the different sources of variation. However, the classical approach to PCA is sensitive to outliers (abnormally high or low observations) because it depends on the variance-covariance matrix. The first principal component of a set of $p$ variables is the derived variable, formed as a linear combination of the original variables, that explains the **most** variance.
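In symbols, the first principal component direction $\mathbf{w}_1$ is the unit vector that maximizes the variance of the projected data:

$$
\mathbf{w}_1 = \underset{\lVert \mathbf{w} \rVert = 1}{\arg\max}\ \operatorname{Var}(X\mathbf{w}),
$$

with each subsequent component maximizing the remaining variance subject to being orthogonal to the earlier ones.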
Outliers inflate the variance of the data. The principal components may then be distorted so as to fit the outliers, which leads to a misleading interpretation of the results. Thus, the authors of this paper aim to introduce a new, robust method of PCA that addresses this problem. They call it ROBPCA.
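To see this effect concretely, here is a minimal sketch using NumPy and scikit-learn on a synthetic dataset (the specific numbers are illustrative, not from the paper): a few extreme points are enough to pull the first principal component away from the true direction of variation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Clean 2-D data whose main direction of variation is the x-axis
clean = rng.normal(size=(100, 2)) * [5.0, 1.0]

# A handful of outliers placed far away along the y-axis
outliers = rng.normal(loc=[0.0, 30.0], scale=1.0, size=(5, 2))
data = np.vstack([clean, outliers])

pc_clean = PCA(n_components=1).fit(clean).components_[0]
pc_all = PCA(n_components=1).fit(data).components_[0]

print("First PC without outliers:", pc_clean)  # roughly (±1, 0)
print("First PC with outliers:   ", pc_all)    # pulled toward the outliers
```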
## Problem Definition
1. To develop a robust method for accurately applying PCA to high-dimensional data containing outliers.
2. To derive from this robust PCA a diagnostic plot that can be used to detect and classify outliers.
## Methodology
### The ROBPCA Method
Previous efforts replaced the classical covariance matrix with robust covariance estimators, but these either cannot resist many outliers or are limited to small-to-moderate dimensions. Projection pursuit (PP) techniques, which can handle higher dimensions, have also been used in the past. The ROBPCA method attempts to combine the advantages of both approaches (a simplified code sketch follows the two steps below) -
1. The projection pursuit technique is used for the initial dimension reduction of the data.
2. Some ideas based on the MCD estimator (minimum covariance determinant) are then applied to this lower-dimensional data space.
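The sketch below mimics this two-stage pipeline in Python. It is **not** the authors' full algorithm (which draws directions through pairs of data points, first projects the data onto the subspace spanned by the observations, and uses the FAST-MCD algorithm with reweighting); it only conveys the core idea: rank points by a projection-pursuit outlyingness measure, then estimate the PCA subspace from the $h$ least outlying points.

```python
import numpy as np

def robpca_sketch(X, k, h=None, n_dirs=250, seed=0):
    """Very simplified ROBPCA-style estimator (illustrative only):
    projection-pursuit outlyingness, then PCA on the h least outlying points."""
    n, p = X.shape
    if h is None:
        h = int(0.75 * n)           # size of the presumed-clean subset
    rng = np.random.default_rng(seed)

    # Stage 1 (projection pursuit): score each observation's outlyingness
    # as its worst robustly standardized deviation over random directions.
    # (The paper instead uses directions through pairs of data points.)
    dirs = rng.normal(size=(n_dirs, p))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = X @ dirs.T                                    # shape (n, n_dirs)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0) + 1e-12  # robust scale
    outlyingness = np.max(np.abs(proj - med) / mad, axis=1)

    # Stage 2 (robust covariance): eigendecompose the covariance of the
    # h least outlying points -- a crude stand-in for the MCD-based step.
    keep = np.argsort(outlyingness)[:h]
    center = X[keep].mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X[keep], rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]                # top-k components
    return center, eigvals[order], eigvecs[:, order]
```

On the synthetic data from the Motivation section, this sketch recovers a first component close to the x-axis, since the far-away points receive large outlyingness scores and are excluded from the covariance estimate.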
### Diagnostic Plots (Outlier Maps)
Diagnostic plots help distinguish regular observations from three types of outliers - orthogonal outliers, good leverage points, and bad leverage points - by plotting each observation's orthogonal distance from the PCA subspace against its robust score distance.
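Concretely, following the paper's definitions: if $t_i$ is the robust score vector of observation $x_i$, $l_1 \ge \dots \ge l_k$ are the robust eigenvalues, $\hat{\boldsymbol{\mu}}$ is the robust center, and $P$ is the loading matrix, the two distances are

$$
\mathrm{SD}_i = \sqrt{\sum_{j=1}^{k} \frac{t_{ij}^2}{l_j}}, \qquad
\mathrm{OD}_i = \left\lVert x_i - \hat{\boldsymbol{\mu}} - P\,t_i \right\rVert .
$$

Regular observations score low on both distances; good leverage points have a large $\mathrm{SD}_i$ but small $\mathrm{OD}_i$; orthogonal outliers have a large $\mathrm{OD}_i$ but small $\mathrm{SD}_i$; and bad leverage points are large on both.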
## Results
ROBPCA was applied to several real datasets from chemometrics and engineering, and the resulting diagnostic plots were compared with those of four other PCA methods. For simplicity, the table below summarizes how ROBPCA compares to classical PCA (CPCA) on 3 of those datasets (the diagnostic plots themselves appear in the paper) -
| Dataset | Dimensions (observations × features) | No. of components retained | Results |
| ------------- | ------------------------------------ | --------------------------- | ------- |
| Car | 111 × 11 | 2 | Both CPCA and ROBPCA detected the same set of outliers, but the subspace found by CPCA is attracted toward the bad leverage points. |
| Octane | 39 × 226 | 2 | ROBPCA detected 6 outliers, while CPCA detected only 1. The first principal component from CPCA is clearly attracted by the six outliers, yielding a classical eigenvalue of 0.13; in contrast, the first robust eigenvalue $l_1$ is only 0.01. |
| Glass spectra | 180 × 750 | 3 | CPCA does not find any important outliers. ROBPCA clearly distinguishes 2 major groups in the data, a smaller group of bad leverage points, a few orthogonal outliers, and 1 isolated case. |
## Discussion
The results show that ROBPCA is a promising method that gives robust estimates even when the data contain outliers. The associated outlier maps are very useful for visualizing and classifying the different outliers. A side-by-side comparison of these plots for the different PCA methods, as shown in the paper, demonstrates the superiority of ROBPCA: it identifies outliers that CPCA fails to detect. As shown in the results, the first eigenvalue under CPCA was more than ten times the robust one, which reflects the extent to which the CPCA subspace is attracted to outliers while the ROBPCA subspace is not.
## Conclusion
In this study, the authors construct ROBPCA, a fast and robust algorithm for applying PCA to high-dimensional data. They apply PP techniques and use those results to project the observations onto a lower-dimensional subspace, within which they apply ideas from robust covariance estimation. The results were promising and definitely a step up from the CPCA approach. The ROBPCA method thus opens a door to practical robust multivariate calibration and to the analysis of regression data with both outliers and multicollinearity.