---
title: PCA
tags: teach:MF
---

# Principal Component Analysis (PCA)

## Agenda

1. Lecture
2. Lab: [eig](https://colab.research.google.com/drive/1gp1aMLQvT13FWAnejPK8gXxsOH0QYKTN#scrollTo=fMFzTsszEnK2), [svd](https://colab.research.google.com/drive/1fqoOBjIRORVfqLWzF7p6NIzQBgxsw-cg#scrollTo=7MrysX2IODKG), [pca from scratch](https://colab.research.google.com/drive/1vf-aP35VEIE-0WV2OPpLFRqtkkwZPDLU#scrollTo=yDXwY4nuTCX9), [pca demo wine](https://colab.research.google.com/drive/1D8c7A1nreKv8o5eSh5mPfI8j-dBHX9dW#scrollTo=jH7H5XfaUY-F), [read data in Colab](https://colab.research.google.com/drive/1dSyHSwdRtNTYmcFwVn1ndgd2jxlGuknA#scrollTo=pwM-xtAfhwXf)
3. [Jamboard](https://jamboard.google.com/d/1AHdHugUxfKAbzLCyeRv0My-XKd9DuRIYVqM1MS9dTqQ/viewer?f=1)
4. Homework 6

## Overview

### Objectives of PCA

1. Principal Component Analysis (PCA) captures the intrinsic variability in the data.
2. It helps to reduce the dimensionality of a dataset:
   - to ease interpretation,
   - to avoid overfitting,
   - to prepare for subsequent analysis.

### Applications in Quantitative Finance

PCA can be used to extract the most representative features of a dataset. These can then be used in:
1. model building
   - portfolio construction
   - risk management
2. covariance estimation

> In quantitative finance, principal component analysis can be directly applied to the risk management of interest rate derivative portfolios. Trading multiple swap instruments which are usually a function of 30–500 other market quotable swap instruments is sought to be reduced to usually 3 or 4 principal components, representing the path of interest rates on a macro basis. Converting risks to be represented as those to factor loadings (or multipliers) provides assessments and understanding beyond that available to simply collectively viewing risks to individual 30–500 buckets. [(PCA, Wiki)](https://en.wikipedia.org/wiki/Principal_component_analysis)

### Referenced links

1. [Lesson 6 of Prof. Jia Li](https://newonlinecourses.science.psu.edu/stat508/lesson/6)
2. [Lec 23 of Prof. David Sontag](https://people.csail.mit.edu/dsontag/courses/ml16/slides/lecture23.pdf)
3. [How to Calculate Principal Component Analysis (PCA) from Scratch in Python](https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/)

## Formulation of PCA

Suppose a random vector $\chi\sim N_p(0,\Sigma)$. We would like to find a unit vector $w$ that maximizes the variance of $\chi'w$:
$$Var(\chi'w) = w'\Sigma w,$$
subject to $w$ being a unit vector, i.e., $w'w=1$.

### Solution

Let the eigendecomposition of $\Sigma$ be $\Sigma = V\Lambda V'$, where $\Lambda = diag(\lambda_1,\ldots,\lambda_p)$ is a diagonal matrix with positive diagonal elements $\lambda_1>\lambda_2>\cdots>\lambda_p$. Write $\tilde{w} = V'w$; since $V$ is orthogonal, $\tilde{w}'\tilde{w} = w'VV'w = w'w = 1$, so $\tilde{w}$ is also a unit vector. Then,
\begin{eqnarray*}
w'\Sigma w &=& w'V\Lambda V' w\\
&=& \tilde{w}' \Lambda \tilde{w}= \lambda_1 \tilde{w}_1^2 + \lambda_2 \tilde{w}_2^2+\cdots+\lambda_p \tilde{w}_p^2\\
&\leq & \lambda_1 \tilde{w}_1^2 + \lambda_1 \tilde{w}_2^2+\cdots+\lambda_1 \tilde{w}_p^2\\
&= & \lambda_1( \tilde{w}_1^2 + \tilde{w}_2^2+\cdots+ \tilde{w}_p^2)\\
&=& \lambda_1.
\end{eqnarray*}
Therefore, $Var(\chi'w)$ attains its maximum $\lambda_1$ when $\tilde{w}_1=1$ and $\tilde{w}_2=\cdots=\tilde{w}_p=0$, i.e., when $w = V\tilde{w} = v_1$, the eigenvector associated with the largest eigenvalue.
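This result is easy to check numerically. Below is a minimal sketch, assuming NumPy and an arbitrary example covariance matrix $\Sigma$ (not from the lecture): $w'\Sigma w$ over random unit vectors never exceeds $\lambda_1$, and the bound is attained at $w=v_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# An example covariance matrix (symmetric positive definite), chosen only for illustration.
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)

# Eigendecomposition Sigma = V Lambda V'; eigh returns eigenvalues in ascending order.
eigvals, V = np.linalg.eigh(Sigma)
lam1, v1 = eigvals[-1], V[:, -1]    # largest eigenvalue and its eigenvector

# w' Sigma w for many random unit vectors never exceeds lambda_1 ...
W = rng.standard_normal((4, 1000))
W /= np.linalg.norm(W, axis=0)
print((W * (Sigma @ W)).sum(axis=0).max() <= lam1 + 1e-10)   # True

# ... and the bound lambda_1 is attained at w = v_1.
print(np.isclose(v1 @ Sigma @ v1, lam1))                     # True
```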
## With the data input matrix $X$

The input matrix $X$ has dimension $N\times p$:
$$X = \left(\begin{array}{cccc}x_{1,1}&x_{1,2}&\cdots & x_{1,p}\\ x_{2,1}&x_{2,2}&\cdots & x_{2,p}\\ \vdots & \vdots & \vdots & \vdots\\ x_{N,1} & x_{N,2} & \cdots & x_{N,p} \end{array}\right)$$
The rows of $X$ represent the cases or observations; the columns represent the variables (characteristics) measured on each case. Assume that the columns of $X$ are centered, i.e., the *estimated* column mean is subtracted from each column.

### The Singular Value Decomposition (SVD)

The SVD of the $N\times p$ matrix $X$ has the form $X = UDV'$.
- $U=(u_1,\ldots,u_N)$ is an $N\times N$ orthogonal matrix. $\{u_1,\ldots,u_N\}$ form an orthonormal basis for the space spanned by the column vectors of $X$.
- $V=(v_1,\ldots,v_p)$ is a $p\times p$ orthogonal matrix. $\{v_1,\ldots,v_p\}$ form an orthonormal basis for the space spanned by the row vectors of $X$.
- $D$ is an $N\times p$ rectangular matrix whose only nonzero entries lie on the diagonal of its upper $p\times p$ block, $diag(d_1,d_2,\ldots, d_p)$, where $d_1\geq d_2\geq \cdots\geq d_p\geq 0$ are the singular values of $X$ (assuming $N>p$).

Let the prime ($'$) denote matrix and vector transpose. With the SVD of $X$ (i.e., $X=UDV'$), the eigendecomposition of $X'X$ is
\begin{eqnarray*}
X' X &=& (UDV')'(UDV')\\
&=& V D' U' U D V' \\
&=& V D' D V'\\
&=& V D^2 V'.
\end{eqnarray*}
Here, $D^2 = D' D$. If you have the SVD, you already have the eigendecomposition of $X'X$.
- The columns of $V$ (i.e., $v_j$, $j=1,\ldots,p$) are the eigenvectors of $X' X$. They are called the *principal component directions* of $X$.
- The diagonal values of $D$ (i.e., $d_j$, $j=1,\ldots,p$) are the square roots of the eigenvalues of $X'X$.

### Principal components

Projecting $X$ onto the *principal component directions* gives the *principal components*.
$$z_1 = X v_1 = \left(\begin{array}{c}x_{1,1}v_{1,1}+x_{1,2}v_{1,2}+\cdots+x_{1,p}v_{1,p}\\ x_{2,1}v_{1,1}+x_{2,2}v_{1,2}+\cdots+x_{2,p}v_{1,p}\\\vdots \\ x_{N,1}v_{1,1}+x_{N,2}v_{1,2}+\cdots+x_{N,p}v_{1,p}\end{array}\right). $$
With the SVD of $X$, it can be seen that $z_1 = Xv_1 = d_1u_1$ by direct matrix calculation:
\begin{eqnarray*}
z_1 &= & Xv_1 = (UDV')v_1\\
& =& UD\left(\begin{array}{c}v_1'\\v_2'\\\vdots\\v_p'\end{array}\right)v_1 = UD\left(\begin{array}{c}1\\0\\\vdots\\0\end{array}\right)\\
& = & U \left(\begin{array}{c}d_1 \\0 \\ \vdots \\ 0 \end{array}\right) = d_1 u_1
\end{eqnarray*}
The first principal component of $X$, $z_1$, has the largest sample variance among all normalized linear combinations of the columns of $X$. The variance of $z_1$ is $z_1'z_1/N$:
\begin{eqnarray*}
z_1'z_1/N = (d_1u_1)'(d_1u_1)/N= d_1^2 u_1'u_1/N = d_1^2/N.
\end{eqnarray*}
The principal components of $X$ are $z_j = X v_j$, for $j=1,\ldots,p$. Similarly, it is easy to see
$$z_j = Xv_j =d_j u_j.$$
Each subsequent principal component $z_j$ has maximum variance $d_j^2/N$, subject to being orthogonal to the earlier ones.

![](https://i.imgur.com/vgTpKZp.gif)
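These identities are easy to verify numerically. Below is a minimal sketch, assuming NumPy and a simulated centered data matrix $X$ (not from the lecture), checking that $z_j = Xv_j = d_ju_j$ and that the sample variance of $z_j$ equals $d_j^2/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 5

# Simulated data matrix with correlated columns, then centered column-by-column.
X = rng.standard_normal((N, p)) @ rng.standard_normal((p, p))
X = X - X.mean(axis=0)

# Thin SVD: X = U D V', with U of shape (N, p), d of shape (p,), Vt of shape (p, p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# z_j = X v_j equals d_j u_j, column by column.
Z = X @ V
print(np.allclose(Z, U * d))                      # True

# The sample variance of each principal component is d_j^2 / N.
print(np.allclose(np.var(Z, axis=0), d**2 / N))   # True
```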
## Approach 1: PCA using SVD

1. Demean $X$ (so that each column has zero mean).
2. Obtain the SVD of $X$: $X = UDV'$, where the columns of $V$ are the eigenvectors of $X'X$.
3. The first principal component direction $v_1$ and the $k$-th principal component direction $v_k$:
   - $v_1$ is the eigenvector associated with the largest eigenvalue, $d_1^2$, of $X'X$.
   - $z_1=Xv_1$ has the largest sample variance among all normalized linear combinations of the columns of $X$.
   - $z_1$ is called the first principal component of $X$, and we have $Var(z_1)=z_1'z_1/N=d_1^2/N$.
   - The second principal component direction $v_2$ is the eigenvector corresponding to the second largest eigenvalue, $d_2^2$, of $X'X$.
   - In general, the eigenvector for the $k$-th largest eigenvalue gives the $k$-th principal component direction $v_k$.
   - The $k$-th principal component of $X$, $z_k$, has maximum variance $d_k^2/N$, subject to being orthogonal to the earlier ones.

## Approach 2: Estimate $\Sigma$ and Eigendecomposition

1. Demean $X$ (so that each column has zero mean).
2. Obtain the sample covariance matrix of $X$: $S = X'X/N$.
3. Obtain the eigenvectors and eigenvalues using the eigendecomposition $S = V\Lambda V'$.
4. Obtain the $k$-th principal component $z_k=X v_k$.

Up to the sign of each direction, the two approaches give the same principal components; a numerical sketch comparing them appears at the end of these notes.

## How many components to use?

We use a scree plot to decide the number of components to use, plotting the cumulative variance explained against the number of principal components. The choice of how many components to extract is fairly *arbitrary*.

## Other issues

### Why do dimensionality reduction?

- Computational: compress data for time/space efficiency.
- Statistical: fewer dimensions can improve generalization.
- Visualization: understand the structure of the data.
- Anomaly detection: describe normal data, detect outliers.

### High-dimensional data

When faced with high-dimensional data, it is natural to consider projecting the data onto a lower-dimensional subspace without losing important information.
- Variable selection, also called **feature selection**.
- Shrinkage: ridge regression and the lasso.
- Creating a reduced set of linear or nonlinear transformations of the input variables, also called **feature extraction**, e.g., PCA.

## HW 6

Review linear algebra:
1. SVD.
2. Eigendecomposition.

## HW 7

1. Textbook: p. 417, 10. (Note: Use Python instead of R!)
2. Use the iris dataset with *four* features.
   (a) Draw the 2D plot using the first two principal components. Use different colors to indicate different classes.
   (b) Implement PCA using SVD and report the eigenvectors. Plot the first two principal components. Is the plot the same as (a)?
   (c) Implement PCA using eigendecomposition and report the eigenvectors. Plot the first two principal components. Is the plot the same as (a)?
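For reference, here is a minimal sketch of the comparison mentioned under Approach 2. It assumes NumPy and the iris data bundled with scikit-learn (as in HW 7) and is meant only as a starting point, not a full solution: it checks that the SVD-based and covariance-eigendecomposition-based components agree up to sign, and computes the cumulative proportion of variance explained that a scree plot would display.

```python
import numpy as np
from sklearn.datasets import load_iris

# Iris data: 150 observations, 4 features; demean each column.
X = load_iris().data
X = X - X.mean(axis=0)
N = X.shape[0]

# Approach 1: SVD of the centered data matrix.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z_svd = X @ Vt.T                          # principal components z_j = X v_j

# Approach 2: eigendecomposition of the sample covariance S = X'X / N.
S = X.T @ X / N
eigvals, V = np.linalg.eigh(S)            # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # re-sort to descending
eigvals, V = eigvals[order], V[:, order]
Z_eig = X @ V

# The two sets of components agree up to the sign of each column,
# and the eigenvalues of S equal d_j^2 / N.
print(np.allclose(np.abs(Z_svd), np.abs(Z_eig)))   # True
print(np.allclose(eigvals, d**2 / N))              # True

# Cumulative proportion of variance explained (the quantity a scree plot shows).
print(np.cumsum(eigvals) / eigvals.sum())
```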