# Singular Value Decomposition
### Big Idea
The **Singular Value Decomposition (SVD)** states that any $m \times n$ matrix $A$ can be factored into $A = P \Sigma Q^T$. This decomposition is built upon the properties of the symmetric matrices $A^T A$ and $A A^T$, whose eigenvalues lead to the *singular values* $\sigma_i$. The SVD provides orthonormal bases for all four *fundamental subspaces* of $A$ and is essential for stable, high-accuracy matrix computations.
## SVD Construction
The construction of the SVD begins with two key observations from the *Spectral Theorem* applied to the matrices $A^T A$ (which is $n \times n$ symmetric) and $A A^T$ (which is $m \times m$ symmetric):
1. $A^T A$ and $A A^T$ are orthogonally diagonalizable: $A A^T = P D_1 P^T$ and $A^T A = Q D_2 Q^T$, where $P$ and $Q$ are orthogonal matrices of eigenvectors, and $D_1$ and $D_2$ are real diagonal matrices of eigenvalues.
2. Both $A^T A$ and $A A^T$ have non-negative eigenvalues. They also share the *same set of positive eigenvalues*.
:::info
**Definition (Singular Values)**
The *singular values* $\sigma_i$ of $A$ are the square roots of the positive eigenvalues $\lambda_i$ shared by $A^T A$ and $A A^T$:
$$
\sigma_i = \sqrt{\lambda_i} \ \ , \ \ i=1,\dots,r
$$
where $r$ is the number of positive eigenvalues, sorted in decreasing order $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$.
:::
:::danger
**Theorem (The SVD Theorem)**
Any $m \times n$ matrix $A$ can be decomposed as $A = P \Sigma Q^T$.
* $P$ ($m \times m$) and $Q$ ($n \times n$) are orthogonal matrices.
* $\Sigma$ ($m \times n$) is a diagonal matrix containing the singular values $\sigma_i$:
$$
\Sigma =
\left[
\begin{array}{ccc|c}
\sigma_1 & & & \\
& \ddots & & \mathbf{0} \\
& & \sigma_r & \\ \hline
& \mathbf{0} & & \mathbf{0}
\end{array} \right]_{m \times n}
$$
:::
:::info
**Definition** In the singular value decomposition $A = P \Sigma Q^T$:
* The columns of $P$ are the *left singular vectors* of $A$ (eigenvectors of $A A^T$).
* The columns of $Q$ are the *right singular vectors* of $A$ (eigenvectors of $A^T A$).
:::
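As a quick numerical check, here is a minimal sketch using NumPy on the $2 \times 3$ matrix from the example below; `np.linalg.svd` returns the three factors, with `U`, `s`, `Vt` playing the roles of $P$, the diagonal of $\Sigma$, and $Q^T$:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0]])

# np.linalg.svd returns P (called U), the singular values s, and Q^T (called Vt).
P, s, Qt = np.linalg.svd(A)            # P is 2x2, Qt is 3x3, s has length min(m, n)

# Rebuild the 2x3 matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(s)                               # approx [1.732, 1.0], i.e. sqrt(3) and 1
print(np.allclose(A, P @ Sigma @ Qt))  # True: A = P Sigma Q^T
```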
## Algorithm for Computing the SVD
The standard approach for computing the SVD is to diagonalize $A^T A$ (or $A A^T$ if $m < n$):
1. **Find Eigenvalues of $A^T A$:** Calculate the eigenvalues $\lambda_i$ (all real and non-negative) and the corresponding orthonormal eigenvectors $\mathbf{q}_i$ of $A^T A$. The singular values are $\sigma_i = \sqrt{\lambda_i}$ for the $r$ positive eigenvalues.
2. **Construct $\Sigma$ and $Q$:** Form the diagonal matrix $\Sigma$ from the singular values and the orthogonal matrix $Q$ from the sorted eigenvectors $\mathbf{q}_i$ ($Q = [\mathbf{q}_1 \mid \cdots \mid \mathbf{q}_n]$).
3. **Construct $P$:** The first $r$ columns of $P$ ($\mathbf{p}_i$) are found directly using the relationship between $A$, $\mathbf{q}_i$, and $\sigma_i$: $$\mathbf{p}_i = \frac{1}{\sigma_i} A \mathbf{q}_i, \quad i = 1, \dots, r$$ The orthonormal set $\{\mathbf{p}_1, \dots, \mathbf{p}_r\}$ is then extended (using Gram-Schmidt) to a full orthonormal basis $\{\mathbf{p}_1, \dots, \mathbf{p}_m\}$, which forms the matrix $P$.
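These steps translate almost directly into NumPy. The sketch below is illustrative and assumes $\mathrm{rank}(A) = m \le n$, so the vectors $\mathbf{p}_i = \frac{1}{\sigma_i} A \mathbf{q}_i$ already fill out all of $P$ and no Gram-Schmidt extension is needed:

```python
import numpy as np

def svd_via_eigh(A, tol=1e-12):
    """Build A = P Sigma Q^T from the eigendecomposition of A^T A.
    Assumes rank(A) = m <= n, so the columns p_i = A q_i / sigma_i already fill P."""
    m, n = A.shape
    # Step 1: eigenvalues/eigenvectors of the symmetric matrix A^T A.
    lam, V = np.linalg.eigh(A.T @ A)          # ascending order, orthonormal columns
    order = np.argsort(lam)[::-1]             # re-sort into decreasing order
    lam, Q = lam[order], V[:, order]
    sigma = np.sqrt(np.clip(lam, 0.0, None))  # singular values sigma_i = sqrt(lambda_i)
    r = int(np.sum(sigma > tol))              # number of positive singular values
    # Step 2: the m x n matrix Sigma.
    Sigma = np.zeros((m, n))
    Sigma[:r, :r] = np.diag(sigma[:r])
    # Step 3: p_i = (1/sigma_i) A q_i for i = 1, ..., r.
    P = A @ Q[:, :r] / sigma[:r]
    return P, Sigma, Q

A = np.array([[1.0, 0.0, 1.0], [-1.0, 1.0, 0.0]])
P, Sigma, Q = svd_via_eigh(A)
print(np.allclose(A, P @ Sigma @ Q.T))        # True
```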
:::warning
**Example** Consider $A = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 1 & 0 \end{bmatrix}$.
1. **Eigenvalues/Singular Values:** $A^T A = \begin{bmatrix} 2 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$ has eigenvalues $\lambda_1 = 3$, $\lambda_2 = 1$, $\lambda_3 = 0$.
* The singular values are $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$.
* The $\Sigma$ matrix is $\Sigma = \begin{bmatrix} \sqrt{3} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$.
2. **Right Singular Vectors $Q$:** The eigenvectors of $A^T A$ are:
* $\mathbf{q}_1 = \frac{1}{\sqrt{6}} \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}$ ($\lambda_1=3$),
* $\mathbf{q}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}$ ($\lambda_2=1$),
* $\mathbf{q}_3 = \frac{1}{\sqrt{3}} \begin{bmatrix} -1 \\ -1 \\ 1 \end{bmatrix}$ ($\lambda_3=0$).
* $Q = \begin{bmatrix} \mathbf{q}_1 & \mathbf{q}_2 & \mathbf{q}_3 \end{bmatrix}$
3. **Left Singular Vectors $P$:** Calculated from $\mathbf{p}_i = \frac{1}{\sigma_i} A \mathbf{q}_i$:
* $\mathbf{p}_1 = \frac{1}{\sigma_1} A \mathbf{q}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
* $\mathbf{p}_2 = \frac{1}{\sigma_2} A \mathbf{q}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}$
* $P = \begin{bmatrix} \mathbf{p}_1 & \mathbf{p}_2 \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}$
The Singular Value Decomposition is $A = P \Sigma Q^T$.
:::
### Corollaries of SVD
The SVD directly reveals the dimension and orthonormal bases for all four fundamental subspaces of $A$:
:::danger
**Theorem** Let $A = P \Sigma Q^T$, and let $r$ be the number of positive singular values $\sigma_i$.
* $\mathrm{rank}(A) = r$.
* $\mathcal{N}(A) = \mathrm{span} \{ \mathbf{q}_{r+1},\dots,\mathbf{q}_n \}$ (spanned by the last $n-r$ right singular vectors).
* $\mathcal{R}(A^T) = \mathrm{span} \{ \mathbf{q}_1,\dots,\mathbf{q}_r \}$ (spanned by the first $r$ right singular vectors).
* $\mathcal{R}(A) = \mathrm{span} \{ \mathbf{p}_1,\dots,\mathbf{p}_r \}$ (spanned by the first $r$ left singular vectors).
* $\mathcal{N}(A^T) = \mathrm{span} \{ \mathbf{p}_{r+1},\dots,\mathbf{p}_m \}$ (spanned by the last $m-r$ left singular vectors).
:::
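A minimal NumPy sketch of reading these bases off a full SVD (the tolerance `1e-12` used to count the positive singular values is an arbitrary choice):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0]])
m, n = A.shape

P, s, Qt = np.linalg.svd(A)            # full SVD: P is m x m, Qt is n x n
Q = Qt.T
r = int(np.sum(s > 1e-12))             # rank(A) = number of positive singular values

row_space  = Q[:, :r]                  # basis of R(A^T): first r right singular vectors
null_space = Q[:, r:]                  # basis of N(A):   last n - r right singular vectors
col_space  = P[:, :r]                  # basis of R(A):   first r left singular vectors
left_null  = P[:, r:]                  # basis of N(A^T): last m - r left singular vectors

print(r == np.linalg.matrix_rank(A))   # True
print(np.allclose(A @ null_space, 0))  # columns of null_space lie in N(A)
```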
### Norm and Condition Number
The SVD links the singular values to the norms and condition number of $A$:
* The $\ell_2$ matrix norm is the largest singular value: $\| A \| = \sigma_1$.
* If $A$ is square and invertible, the norm of its inverse is determined by the smallest singular value $\sigma_n$: $\| A^{-1} \| = 1/\sigma_n$.
* The condition number of an invertible matrix $A$ is the ratio of the largest to the smallest singular value: $\mathrm{cond}(A) = \frac{\sigma_1}{\sigma_n}$.
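These identities are easy to confirm numerically; the sketch below uses an arbitrary invertible $2 \times 2$ matrix chosen for illustration:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

s = np.linalg.svd(A, compute_uv=False)   # singular values, largest first

print(np.isclose(s[0], np.linalg.norm(A, 2)))                       # ||A|| = sigma_1
print(np.isclose(1 / s[-1], np.linalg.norm(np.linalg.inv(A), 2)))   # ||A^{-1}|| = 1/sigma_n
print(np.isclose(s[0] / s[-1], np.linalg.cond(A, 2)))               # cond(A) = sigma_1/sigma_n
```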
:::success
**Exercise**
1. The singular value decomposition $A = P \Sigma Q^{T}$ is unique. (True or False)
2. Find the singular value decomposition of $A = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 2 & -1 \\ 1 & 0 & -1 \end{bmatrix}.$
3. Find the singular value decomposition of $A = \begin{bmatrix} 1 & 2 & -1 \\ 2 & 1 & 4 \end{bmatrix}.$
4. Suppose that the matrix $A$ has the following singular value decomposition: $$A = \begin{bmatrix}0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \end{bmatrix} \begin{bmatrix} \dfrac{1}{\sqrt{2}} & \dfrac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \\ -\dfrac{1}{\sqrt{2}} & \dfrac{1}{\sqrt{2}} & 0 \end{bmatrix},$$
where $\sigma_1 \ge \sigma_2 > 0$.
* a. What are $\|A\|$ and $\text{rank}(A)$?
* b. Write down an orthonormal basis for $\mathcal{N}(A)$ and an orthonormal basis for $\mathcal{R}(A)$.
* c. What are the eigenvalues and eigenvectors of $A^{\ast}A$ and $AA^{\ast}$?
:::
## Principal Component Analysis
*Principal Component Analysis (PCA)* is a technique used to reduce the dimensionality of data by transforming it into a new coordinate system. The principal components are the new coordinate axes, ordered so that the first component captures the largest possible variance in the data, the second component captures the second largest, and so on. The *Singular Value Decomposition (SVD)* provides the analytical solution for PCA, revealing that the weight vectors are the right singular vectors of the data matrix.
### Definitions and Construction
Assume the $n \times p$ data matrix $X$ is *normalized* (the mean of each column is 0). We view each row $\mathbf{x}_i$ as a $p$-dimensional data vector.
* **Weight Vector $\mathbf{w}_k$:** A unit vector that defines a new axis of projection.
* **Principal Component:** The projection coefficient $\langle \mathbf{x}_i , \mathbf{w}_k \rangle$ of a data vector $\mathbf{x}_i$ onto the axis $\mathbf{w}_k$.
The goal of PCA is to find the *first weight vector $\mathbf{w}_1$* that maximizes the total squared length of the orthogonal projections of all data points onto that direction:
$$
\text{Maximize } \sum_{i=1}^n \langle \mathbf{x}_i , \mathbf{w}_1 \rangle^2 = \| X \mathbf{w}_1 \|^2
$$
The *$k$th weight vector $\mathbf{w}_k$* is the unit vector, *orthogonal* to all previous weight vectors $\{\mathbf{w}_1, \dots, \mathbf{w}_{k-1}\}$, that maximizes $\| X \mathbf{w}_k \|^2$. Equivalently, it captures the largest possible variance remaining once the first $k-1$ components have been accounted for.

### PCA and SVD
The problem of maximizing $\| X \mathbf{w} \|^2$ is equivalent to finding the eigenvectors of the covariance matrix $X^T X$. The *Singular Value Decomposition* provides the necessary link to the solution:
:::danger
**Theorem**
* If $X = P \Sigma Q^T$ is the SVD of $X$, and $\mathbf{q}_1, \dots, \mathbf{q}_p$ are the columns of $Q$ corresponding to the ordered singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p$, then: $$\mathbf{w}_k = \mathbf{q}_k$$
* The vector $\mathbf{w}_1$ corresponds to the largest singular value, $\sigma_1$, meaning it is the direction of maximum variance in the data.
:::
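A minimal sketch of this correspondence in NumPy, using a small randomly generated data matrix (the variable names are illustrative): the first column of $Q$ attains the maximal value $\sigma_1^2$ of $\|X\mathbf{w}\|^2$, and no other unit direction exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
X = X - X.mean(axis=0)                   # normalize: each column has mean 0

P, s, Qt = np.linalg.svd(X, full_matrices=False)
W = Qt.T                                 # weight vectors w_k = q_k (columns of Q)

# The first weight vector maximizes ||X w||^2 over unit vectors w;
# the maximum equals sigma_1^2, and a random unit direction never beats it.
print(np.isclose(np.linalg.norm(X @ W[:, 0])**2, s[0]**2))    # True
w_random = rng.standard_normal(3)
w_random /= np.linalg.norm(w_random)
print(np.linalg.norm(X @ w_random)**2 <= s[0]**2 + 1e-9)      # True

scores = X @ W                           # principal components <x_i, w_k> for all rows
```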
:::warning
**Example** Consider the data shown in the scatter plot.

The data matrix $X$ is normalized (mean of each column is 0): $$X = \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ -1 & 1 \\ 0 & 0 \\ 1 & -1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
1. **Calculate the Covariance Matrix $X^T X$:** $$X^{\mathsf{T}} X = \begin{bmatrix} 12 & 8 \\ 8 & 12 \end{bmatrix}$$
2. **Find Eigenvalues:** The eigenvalues of $X^T X$ are the roots of the characteristic polynomial:$$C_{X^{\mathsf{T}} X}(\lambda) = \lambda^2 - 24\lambda + 80 = (\lambda - 20)(\lambda - 4)$$ This yields the eigenvalues $\lambda_1 = 20$ and $\lambda_2 = 4$.
3. **Find Weight Vectors ($\mathbf{w}_k$):** The weight vectors are the normalized eigenvectors (right singular vectors) of $X^T X$:
* The *first weight vector* $\mathbf{w}_1$, corresponding to the maximum eigenvalue $\lambda_1=20$, points in the direction of maximum variance:$$\mathbf{w}_1 = \mathbf{q}_1 = \dfrac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
* The *second weight vector* $\mathbf{w}_2$, corresponding to $\lambda_2=4$, is orthogonal to $\mathbf{w}_1$:$$\mathbf{w}_2 = \mathbf{q}_2 = \dfrac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
:::
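The example can be checked numerically with a short NumPy sketch; note that `np.linalg.eigh` returns eigenvalues in increasing order and determines eigenvectors only up to sign:

```python
import numpy as np

X = np.array([[-2, -2], [-1, -1], [-1, 1], [0, 0],
              [1, -1], [1, 1], [2, 2]], dtype=float)

print(X.T @ X)                       # [[12., 8.], [8., 12.]]

lam, V = np.linalg.eigh(X.T @ X)     # eigenvalues in ascending order
print(lam)                           # [ 4. 20.]
print(V[:, 1])                       # eigenvector for lambda = 20: +/- [1, 1]/sqrt(2) = w_1
print(V[:, 0])                       # eigenvector for lambda = 4:  +/- [1, -1]/sqrt(2) = w_2

# The same directions are the right singular vectors of X:
_, s, Qt = np.linalg.svd(X)
print(s**2)                          # [20., 4.]  (squared singular values = eigenvalues)
```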
:::success
**Exercise**
5. Determine whether the statement is True or False.
* a. Let $X$ be an $n \times p$ (normalized) data matrix, let $\mathbf{x}_i$ be a row of $X$ and let $\mathbf{w}_1$ be the first weight vector. If $\|\mathbf{x}_i\| > 0$, then $|\mathbf{x}_i \cdot \mathbf{w}_1| > 0$.
* b. Let $X$ be an $n \times p$ (normalized) data matrix, let $\mathbf{x}_i$ be a row of $X$ and let $\mathbf{w}_1$ be the first weight vector. Then $|\mathbf{x}_i \cdot \mathbf{w}_1| \le \|\mathbf{x}_i\|$.
* c. Let $X$ be an $n \times p$ (normalized) data matrix, let $\mathbf{x}_i$ be a row of $X$ and let $\mathbf{w}_1$ be the first weight vector. Then $\mathbf{x}_i \cdot \mathbf{w}_1 \ge 0$.
* d. Let $X$ be an $n \times p$ (normalized) data matrix with $p > 2$, let $\mathbf{x}_i$ be a row of $X$, let $\mathbf{w}_1$ be the first weight vector, and let $\mathbf{w}_2$ be the second weight vector. Then $|\mathbf{x}_i \cdot \mathbf{w}_2| \le |\mathbf{x}_i \cdot \mathbf{w}_1|$.
6. Determine whether the statement is True or False.
* a. Let $X$ be an $n \times p$ (normalized) data matrix and let $\mathbf{x}_i$ and $\mathbf{x}_j$ be two different rows of $\mathbf{X}$ such that $\|\mathbf{x}_i\| < \|\mathbf{x}_j\|$. If $\mathbf{w}_1$ is the first weight vector of $X$, then $|\mathbf{x}_i \cdot \mathbf{w}_1| < |\mathbf{x}_j \cdot \mathbf{w}_1|$.
* b. Let $X$ be an $n \times p$ (normalized) data matrix and let $\mathbf{x}_i$ and $\mathbf{x}_j$ be two different rows of $X$ such that $\mathbf{x}_i \cdot \mathbf{x}_j = 0$. If $\mathbf{w}_1$ is the first weight vector of $X$ and $\mathbf{x}_i \cdot \mathbf{w}_1 = 0$, then $\mathbf{x}_j \cdot \mathbf{w}_1 = 0$.
* c. Let $X$ be an $n \times 2$ data matrix and let $Y$ be the matrix obtained from $X$ by swapping its two columns. If $X$ and $Y$ represent the same set of data points, then all the singular values of $X$ are equal.
* d. Let $X$ be an $n \times 2$ data matrix and let $Y$ be the matrix obtained from $X$ by swapping its two columns. If $X$ and $Y$ represent the same set of data points, then $\mathbf{w}_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}$.
7. Find the weight vectors for the data matrix $X$ representing the points in the scatter plot. 
8. Suppose $X$ is a $100 \times 4$ data matrix such that $X^T X$ is given.
$$X^T X = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 1.5 & 0 & 0 \\ 0 & 0 & 2 & 1 \\ 0 & 0 & 1 & 2 \end{bmatrix}$$ Find all the weight vectors of $X$.
:::
## Pseudoinverse and Least Squares
The *pseudoinverse* $A^+$ generalizes the concept of the matrix inverse to any matrix (even non-square or singular matrices) using the SVD. It provides a universal solution for the *least squares approximation* of $A\mathbf{x} \approx \mathbf{b}$ and is crucial for stable data analysis and dimensionality reduction.
:::info
**Definition** Let $A$ be an $m \times n$ matrix with $\mathrm{rank}(A) = r$, and let $A = P \Sigma Q^T$ be its SVD. The *pseudoinverse* of $A$ (also known as the Moore-Penrose inverse) is defined as $A^+ = Q \Sigma^+ P^T$ with
$$
\Sigma^+ =
\left[ \begin{array}{ccc|c}
\sigma_1^{-1} & & & \\
& \ddots & & \mathbf{0} \\
& & \sigma_r^{-1} & \\ \hline
& \mathbf{0} & & \mathbf{0}
\end{array} \right]_{n \times m}
$$
:::
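A minimal sketch of building $A^+ = Q \Sigma^+ P^T$ from the SVD and comparing it against NumPy's built-in `np.linalg.pinv`, using the $2 \times 3$ matrix from the earlier example:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0]])
m, n = A.shape

P, s, Qt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))

Sigma_plus = np.zeros((n, m))          # note the transposed shape: n x m
Sigma_plus[:r, :r] = np.diag(1.0 / s[:r])

A_plus = Qt.T @ Sigma_plus @ P.T       # A^+ = Q Sigma^+ P^T
print(np.allclose(A_plus, np.linalg.pinv(A)))   # True
```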
### Properties and Applications
The pseudoinverse provides the direct solution to the least squares problem:
:::danger
**Theorem**
* For any matrix $A$ with full column rank ($\mathrm{rank}(A) = n$), the unique least squares solution of $A\mathbf{x} \approx \mathbf{b}$ is $\mathbf{x}_{\text{LS}} = A^+ \mathbf{b}$.
* The matrix $AA^+$ is the *orthogonal projection matrix* onto the range (column space) $\mathcal{R}(A)$.
* The matrix $A^+A$ is the *orthogonal projection matrix* onto the range of the transpose (row space) $\mathcal{R}(A^T)$.
* The pseudoinverse satisfies the key generalized inverse properties: $AA^+A = A$ and $A^+AA^+=A^+$.
:::
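A short numerical check of these properties, using a small full-column-rank matrix chosen for illustration:

```python
import numpy as np

# A tall matrix with full column rank: Ax = b has no exact solution in general.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

A_plus = np.linalg.pinv(A)
x_ls = A_plus @ b                                  # least squares solution A^+ b
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)      # reference least squares solution
print(np.allclose(x_ls, x_ref))                    # True

# Generalized inverse identities and the projection property of A A^+.
print(np.allclose(A @ A_plus @ A, A))              # A A^+ A = A
print(np.allclose(A_plus @ A @ A_plus, A_plus))    # A^+ A A^+ = A^+
print(np.allclose((A @ A_plus) @ (A @ A_plus), A @ A_plus))  # A A^+ is idempotent
```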
## SVD Expansion
The SVD provides an additive breakdown of the matrix $A$, which is essential for understanding and manipulating data.
:::danger
**Theorem (SVD Rank Expansion)** The SVD can be written as a sum of rank-one matrices, where $r$ is the rank of $A$:$$A = \sum_{i=1}^r \sigma_i \mathbf{p}_i \mathbf{q}_i^T = \sigma_1 \mathbf{p}_1 \mathbf{q}_1^T + \cdots + \sigma_r \mathbf{p}_r \mathbf{q}_r^T$$Each term $\sigma_i \mathbf{p}_i \mathbf{q}_i^T$ is a rank-one matrix. Because the singular values are ordered ($\sigma_1 \ge \sigma_2 \ge \cdots$), the first few terms account for most of the matrix's information.
:::
:::info
**Definition (Truncated SVD)** The truncated SVD expansion of rank $k$ ($A_k$) is the best rank-$k$ approximation of $A$:
$$
A_k = \sum_{i=1}^k \sigma_i \mathbf{p}_i \mathbf{q}_i^T = \sigma_1 \mathbf{p}_1 \mathbf{q}_1^T + \cdots + \sigma_k \mathbf{p}_k \mathbf{q}_k^T
$$
By keeping only the largest $k$ singular value terms, we filter out noise and capture the main structure.
:::
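A minimal sketch of the rank-$k$ expansion in NumPy, on a randomly generated matrix: summing all $r$ rank-one terms reproduces $A$, and truncating at rank $k$ leaves a 2-norm error of exactly $\sigma_{k+1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 5))

P, s, Qt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    """Sum of the first k rank-one terms sigma_i p_i q_i^T."""
    return sum(s[i] * np.outer(P[:, i], Qt[i, :]) for i in range(k))

# Keeping all terms reproduces A exactly ...
print(np.allclose(rank_k_approx(len(s)), A))                       # True
# ... and the rank-k truncation errs by sigma_{k+1} in the 2-norm.
k = 2
print(np.isclose(np.linalg.norm(A - rank_k_approx(k), 2), s[k]))   # True
```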
### Application: Noise Reduction (Regularization)
When solving $A \mathbf{x} = \mathbf{b}$ with noise $\mathbf{e}$ added to the right side ($A \hat{\mathbf{x}} = \mathbf{b} + \mathbf{e}$), the direct solution $\hat{\mathbf{x}} = A^{-1}\mathbf{b} + A^{-1}\mathbf{e}$ can be corrupted. This "inverted noise" $A^{-1}\mathbf{e}$ is amplified by the presence of small singular values, $\sigma_i$, in the SVD expansion of $A^{-1}$ (where they appear as large terms, $\sigma_i^{-1}$).
To counteract this, we use the *truncated pseudoinverse* $A_k^+$:
$$
A_k^+ = \sum_{i=1}^k \frac{1}{\sigma_i} \mathbf{q}_i \mathbf{p}_i^T
$$
By using $A_k^+$, we intentionally omit the terms associated with the smallest singular values, $\sigma_{k+1}, \dots, \sigma_r$. This stabilizes the solution by avoiding the extreme amplification of noise, leading to a better solution $\hat{\mathbf{x}} \approx A^+_k \mathbf{b} + A^+_k\mathbf{e}$. This technique is a form of *regularization* used widely in image processing (e.g., deblurring) and inverse problems (e.g., computed tomography).
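The effect is easy to see on a small synthetic system; everything below is made up for illustration (a $5 \times 5$ matrix built from random orthogonal factors with one tiny singular value):

```python
import numpy as np

rng = np.random.default_rng(2)

# A 5x5 matrix with one tiny singular value (condition number about 1e10).
P, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sigma = np.array([10.0, 5.0, 2.0, 1.0, 1e-9])
A = P @ np.diag(sigma) @ Q.T

x_true = np.ones(5)
b = A @ x_true
e = 1e-4 * rng.standard_normal(5)            # small noise on the right-hand side

x_naive = np.linalg.solve(A, b + e)          # inverted noise blown up by 1/sigma_5

# Truncated pseudoinverse: keep only the k = 4 largest singular values.
k = 4
A_k_plus = Q[:, :k] @ np.diag(1.0 / sigma[:k]) @ P[:, :k].T
x_reg = A_k_plus @ (b + e)

print(np.linalg.norm(x_naive - x_true))      # huge (order ||e|| / sigma_5)
print(np.linalg.norm(x_reg - x_true))        # far smaller: the unstable direction is dropped
```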
:::success
**Exercise**
9. Find the rank-2 pseudoinverse $A_2^{+} = \frac{1}{\sigma_1} \mathbf{q}_1 \mathbf{p}_1^T + \frac{1}{\sigma_2} \mathbf{q}_2 \mathbf{p}_2^T$ of the matrix $A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & -2 \\ 1 & -1 & 1 \end{bmatrix}$.
10. Let $A$ be an $m \times n$ matrix with SVD $A = P \Sigma Q^{\mathsf{T}}$. Let $k < \min\{m, n\}$ and $A_k$ be the rank-$k$ SVD expansion. Describe the singular value decomposition of $A - A_k$.
11. Identify the correct expressions for the norms:
* a. The rank $k$ SVD expansion of $A$ is given by $A_k = \sum_{i=1}^{k} \sigma_i \, \mathbf{p}_i \, \mathbf{q}_i^T$. Therefore, $\|A_k\| = \ldots$
* b. The rank-$k$ pseudoinverse of $A$ is given by $A_k^{+} = \sum_{i=1}^{k} \frac{1}{\sigma_i} \, \mathbf{q}_i \, \mathbf{p}_i^T$. Therefore, $\|A_k^{+}\| = \ldots$
* c. The pseudoinverse of $A$ is given by $A^{+} = \sum_{i=1}^{r} \frac{1}{\sigma_i} \, \mathbf{q}_i \, \mathbf{p}_i^T$. Therefore, $\|A^{+}\| = \ldots$
12. Suppose we want to solve the system $A \mathbf{x} = \mathbf{b}$. A small change $\Delta \mathbf{b}$ produces a change in the solution such that $A(\mathbf{x} + \Delta \mathbf{x}) = \mathbf{b} + \Delta \mathbf{b}$. Describe the unit vector $\Delta \mathbf{b}$ that will produce the largest change $\|\Delta \mathbf{x}\|$.
:::