# Best-Fit Subspaces and Singular Value Decomposition (SVD)
## Introduction
Singular Value Decomposition (SVD) is a fundamental technique in linear algebra that reveals the intrinsic structure of data by identifying the most significant directions of variation. It achieves this by finding best-fitting subspaces: subspaces that minimize the sum of squared distances of the data points to them or, equivalently, maximize the sum of squared projections onto them.
Key concepts covered in this document include:
* **Best-fit subspaces:** The core idea of finding the optimal line or hyperplane that best represents a dataset, minimizing the overall error.
* **SVD:** A powerful matrix factorization technique that decomposes any matrix $A$ into a product of three matrices: $A=UDV^T$, where $U$ and $V$ have orthonormal columns, and $D$ is diagonal with positive singular values.
* **Applications:** SVD's broad utility across various domains, such as Principal Component Analysis (PCA), clustering, ranking, and optimization problems, showcasing its real-world impact.
By understanding the theoretical foundations and practical applications of SVD, you'll gain valuable insights into how to effectively analyze and interpret complex data structures, making it an essential tool in your data science toolkit.
## Preliminaries
This section provides a foundation for understanding the concept of best-fit subspaces.
### Projection and Distance
The relationship between the projection of a point onto a line and its distance to the line, rooted in the Pythagorean theorem, is crucial.
$$
a_{i1}^2 + a_{i2}^2 + \dots + a_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2
$$
where $\mathbf{a}_i = (a_{i1}, a_{i2}, \dots, a_{id})$ is a point in $d$-dimensional space.

### Minimization and Maximization
There exists an equivalence: minimizing the sum of squared distances to a line is the same as maximizing the sum of squared projection lengths onto it. This extends to best-fit subspaces.
### Interpretations of Best-Fit Subspace
* It minimizes the sum of squared distances of data points to it. (Least-squares fit)
* It maximizes the sum of squared projections of data points onto it. (Captures most data content)
### Choice of Objective Function
Squared distances are used due to their mathematical properties, notably the equivalence between minimizing them and maximizing squared projections.
## Singular Vectors
In the context of an $n \times d$ matrix $A$, consider its rows as $n$ points in a $d$-dimensional space. The goal is to find the best-fit line through the origin that minimizes the sum of squared distances of these points to the line.
* **First Right-Singular Vector and Value:** The first right-singular vector $\mathbf{v}_1$ of $A$ is defined as the unit vector that maximizes the length of the transformed vector $A\mathbf{v}$: $$\mathbf{v}_1 = \mathop{\arg \max}_{|\mathbf{v}|=1} |A\mathbf{v}|$$ The corresponding first singular value $\sigma_1(A)$ represents the magnitude of this transformation: $$\sigma_1(A) = |A \mathbf{v}_1|$$ Furthermore, $\sigma_1^2$ can be interpreted as the sum of squared lengths of the projections of the data points onto the line defined by $\mathbf{v}_1$: $$\sigma_1^2 = \sum_{i=1}^n (\mathbf{a}_i \cdot \mathbf{v}_1)^2$$ where $\mathbf{a}_i$ is the $i$-th row of $A$.
* **Subsequent Right-Singular Vectors and Values:** To identify the best-fit 2-dimensional subspace, we seek a unit vector $\mathbf{v}_2$, orthogonal to $\mathbf{v}_1$, that maximizes $|A \mathbf{v}|$; this defines the second right-singular vector and the second singular value $\sigma_2(A) = |A \mathbf{v}_2|$. The process is iterated, each time maximizing $|A \mathbf{v}|$ over unit vectors orthogonal to all previously found singular vectors, and it stops at step $r$ when this maximum becomes zero; $r$ is the rank of $A$.
* **Left-Singular Vectors:** The left-singular vectors $\mathbf{u}_i$ are obtained by normalizing the transformed vectors $A \mathbf{v}_i$: $$\mathbf{u}_i = \frac{1}{\sigma_i(A)} A \mathbf{v}_i$$
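These definitions can be checked numerically. Below is a minimal NumPy sketch (the random matrix, seed, and number of trial vectors are arbitrary illustrations, not from the text): `np.linalg.svd` returns the singular values in decreasing order together with the left- and right-singular vectors, and the first right-singular vector does maximize $|A\mathbf{v}|$ over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))              # n = 6 points (rows) in d = 4 dimensions

# Rows of Vt are the right-singular vectors; columns of U are the left-singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1, sigma1, u1 = Vt[0], s[0], U[:, 0]

assert np.isclose(np.linalg.norm(A @ v1), sigma1)        # sigma_1 = |A v_1|
assert np.allclose(A @ v1 / sigma1, u1)                  # u_1 = A v_1 / sigma_1
assert np.isclose(np.sum((A @ v1) ** 2), sigma1 ** 2)    # sum of squared projections onto v_1

# No randomly drawn unit vector beats v_1.
for _ in range(1000):
    v = rng.normal(size=4)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= sigma1 + 1e-12
```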
### Greedy Algorithm
The theorem "The Greedy Algorithm Works" assures us that the subspaces spanned by the first $k$ singular vectors indeed represent the best-fit $k$-dimensional subspaces for the matrix $A$.
**Theorem (The Greedy Algorithm Works)** Let $A$ be an $n \times d$ matrix with singular vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_r$. For $1 \leq k \leq r$, let $V_k$ be the subspace spanned by $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k$. For each $k$, $V_k$ is the best-fit $k$-dimensional subspace for $A$.
**Proof.**
The statement is clearly true for $k=1$ (the best-fit $1$-dimensional subspace is simply the line spanned by the first singular vector). We will now proceed with the proof by induction on $k$.
* **Base Case ($k = 2$):**
* Let $W$ be a best-fit 2-dimensional subspace for $A$.
* Choose an orthonormal basis $(\mathbf{w}_1, \mathbf{w}_2)$ of $W$ such that $\mathbf{w}_2$ is perpendicular to $\mathbf{v}_1$: if the projection of $\mathbf{v}_1$ onto $W$ is nonzero, take $\mathbf{w}_2$ orthogonal to that projection; otherwise any orthonormal basis of $W$ will do. Recall that $\mathbf{v}_1$ is the first right-singular vector of $A$, corresponding to the largest singular value.
* By the definitions of $\mathbf{v}_1$ and $\mathbf{v}_2$ (which span $V_2$), we have the following inequalities:
* $|A \mathbf{w}_1|^2 \leq |A \mathbf{v}_1|^2$ (since $\mathbf{v}_1$ maximizes $|A \mathbf{v}|$ among all unit vectors)
* $|A \mathbf{w}_2|^2 \leq |A \mathbf{v}_2|^2$ (since $\mathbf{v}_2$ maximizes $|A \mathbf{v}|$ among all unit vectors orthogonal to $\mathbf{v}_1$)
* Combining these inequalities, we get: $$|A \mathbf{w}_1|^2 + |A\mathbf{w}_2|^2 \leq |A \mathbf{v}_1|^2 + |A\mathbf{v}_2|^2$$
* This implies that the sum of squared projections of the rows of $A$ onto $V_2$ is at least as large as the sum of squared projections onto $W$. Since $W$ was assumed to be a best-fit 2-dimensional subspace (i.e., it maximizes the sum of squared projections), $V_2$ is also a best-fit 2-dimensional subspace.
* **Inductive Step:**
* Assume the statement holds for $k-1$; that is, $V_{k-1}$ is a best-fit $(k-1)$-dimensional subspace for $A$.
* Now let $W$ be a best-fit $k$-dimensional subspace for $A$.
* Choose an orthonormal basis $(\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_k)$ of $W$ such that $\mathbf{w}_k$ is orthogonal to $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_{k-1}$. Such a choice is possible because the orthogonal complement of $\text{span}\{\mathbf{v}_1, \dots, \mathbf{v}_{k-1}\}$ has dimension $d-k+1$ and therefore intersects the $k$-dimensional subspace $W$ nontrivially.
* Since $\text{span}\{\mathbf{w}_1, \dots, \mathbf{w}_{k-1}\}$ is a $(k-1)$-dimensional subspace, the induction hypothesis gives: $$|A \mathbf{w}_1|^2 + |A \mathbf{w}_2|^2 + \dots + |A\mathbf{w}_{k-1}|^2 \leq |A \mathbf{v}_1|^2 + |A \mathbf{v}_2|^2 + \dots + |A\mathbf{v}_{k-1}|^2$$
* Furthermore, because $\mathbf{w}_k$ is orthogonal to $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_{k-1}$, the definition of $\mathbf{v}_k$ tells us that: $|A \mathbf{w}_k|^2 \leq |A \mathbf{v}_k|^2$
* Combining these inequalities, we get: $$|A \mathbf{w}_1|^2 + |A \mathbf{w}_2|^2 + \dots + |A\mathbf{w}_{k-1}|^2 + |A \mathbf{w}_k|^2 \leq |A \mathbf{v}_1|^2 + |A \mathbf{v}_2|^2 + \dots + |A\mathbf{v}_{k-1}|^2 + |A \mathbf{v}_k|^2$$
* This inequality demonstrates that the sum of squared projections of the rows of $A$ onto $V_k$ is at least as large as the sum of squared projections onto $W$. As $W$ was assumed to be a best-fit $k$-dimensional subspace, we conclude that $V_k$ is also a best-fit $k$-dimensional subspace.
$\Box$
### Frobenius Norm
The Frobenius norm of $A$, denoted by $||A||_F$, serves as a measure of the matrix's overall magnitude. It is defined as the square root of the sum of squares of all its entries:
$$
||A||_F = \sqrt{\sum_{j,k} a_{jk}^2}
$$
**Lemma** For any matrix $A$, the sum of squares of the singular values equals the square of the Frobenius norm. That is,
$$
\sum_{i=1}^r \sigma_i^2(A) = ||A||_F^2
$$
This underscores the fact that the singular values encapsulate the complete "magnitude" of the matrix $A$.
**Proof.**
Let $V = \text{span}\{\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_r\}$ be the row space of $A$.
* **Expressing rows in terms of the basis:** Any vector $\mathbf{v}$ orthogonal to $V$ satisfies $A\mathbf{v} = \mathbf{0}$ (by the definition of $r$, the maximum of $|A\mathbf{v}|$ over unit vectors orthogonal to $\mathbf{v}_1, \dots, \mathbf{v}_r$ is zero), so every row of $A$ is orthogonal to every such $\mathbf{v}$: $$\mathbf{a}_j \cdot \mathbf{v} = 0 \text{ for all } \mathbf{v} \perp V$$ This implies that $\mathbf{a}_j$ can be expressed as a linear combination of the basis vectors of $V$: $$\mathbf{a}_j = \sum_{i=1}^{r} c_{ji} \mathbf{v}_i$$ where $c_{ji}$ are scalar coefficients.
* **Calculating the squared norm of a row:** Now, let's calculate the squared norm of $\mathbf{a}_j$:
$$
\begin{split}
|\mathbf{a}_j|^2
&= \left| \sum_{i=1}^{r} c_{ji} \mathbf{v}_i \right|^2
= \left( \sum_{i=1}^{r} c_{ji} \mathbf{v}_i \right) \cdot \left( \sum_{k=1}^{r} c_{jk} \mathbf{v}_k \right) \\
&= \sum_{i=1}^{r} \sum_{k=1}^{r} c_{ji} c_{jk} (\mathbf{v}_i \cdot \mathbf{v}_k)
= \sum_{i=1}^{r} c_{ji}^2 |\mathbf{v}_i|^2 \\
&= \sum_{i=1}^{r} c_{ji}^2
= \sum_{i=1}^{r} (\mathbf{a}_j \cdot \mathbf{v}_i)^2
\end{split}
$$
* **Summing over all rows:** Summing the squared norms of all rows, we get the squared Frobenius norm:
$$
\begin{split}
||A||_F^2 &= \sum_{j=1}^{n} |\mathbf{a}_j|^2 = \sum_{i=1}^{r} \sum_{j=1}^{n} (\mathbf{a}_j \cdot \mathbf{v}_i)^2 \\
&= \sum_{i=1}^{r} |A \mathbf{v}_i|^2
= \sum_{i=1}^{r} \sigma_i^2(A)
\end{split}
$$
This completes the proof.
$\Box$
## Singular Value Decomposition (SVD)
This section formally introduces the Singular Value Decomposition (SVD), a central matrix factorization technique with profound implications for data analysis and representation.
### Theorem (SVD)
Let $A$ be an $n \times d$ matrix with right-singular vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_r$, left-singular vectors $\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_r$, and corresponding singular values $\sigma_1, \sigma_2, \dots, \sigma_r$. Then, $A$ can be expressed as a sum of rank-1 matrices:
$$
A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T
$$
*Bonus proof.*
This equation provides a geometric interpretation: each row of $A$ is decomposed into its components along the $r$ orthogonal directions defined by the $\mathbf{v}_i$'s.
In a more compact matrix notation, the SVD is represented as:
$$
A = UDV^T
$$
where:
* $U$ is an $n \times r$ matrix whose columns are the left-singular vectors $\mathbf{u}_i$.
* $D$ is an $r \times r$ diagonal matrix with the singular values $\sigma_i$ on its diagonal.
* $V^T$ is an $r \times d$ matrix whose rows are the right-singular vectors $\mathbf{v}_i^T$; equivalently, $V$ is a $d \times r$ matrix whose columns are the $\mathbf{v}_i$.
The SVD theorem provides a powerful tool for understanding and manipulating matrices, enabling dimensionality reduction, data compression, and noise filtering, among other applications.
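A short NumPy check of the compact form (the example matrix is an arbitrary illustration): the factors returned by `np.linalg.svd` reconstruct $A$ both as $UDV^T$ and as the sum of rank-1 terms, and the Frobenius-norm lemma from the previous section can be verified along the way.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))

# Compact ("economy") SVD: U is n x r, D is r x r, Vt is r x d (here r = min(n, d)).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(s)

assert np.allclose(A, U @ D @ Vt)                                      # A = U D V^T
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
assert np.allclose(A, A_sum)                                           # A = sum_i sigma_i u_i v_i^T

# Frobenius-norm lemma: sum of squared singular values equals ||A||_F^2.
assert np.isclose(np.sum(s ** 2), np.linalg.norm(A, "fro") ** 2)
```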
## Best Rank-$k$ Approximations
Given an $n \times d$ matrix $A$ and its Singular Value Decomposition (SVD):
$$
A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T
$$
We can construct a rank-$k$ approximation $A_k$ by truncating the SVD after the first $k$ terms:
$$
A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^T
$$
where $k$ is any integer between $1$ and $r$ (the rank of the original matrix $A$).
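In NumPy, $A_k$ is obtained by simply truncating the factors returned by `np.linalg.svd`; a minimal sketch follows (matrix, seed, and $k$ are arbitrary). As a consequence of the Frobenius-norm lemma applied to $A - A_k = \sum_{i>k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$, the squared error $||A-A_k||_F^2$ equals the sum of the discarded $\sigma_i^2$, which the snippet checks.

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation A_k obtained by truncating the SVD after k terms."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = rank_k_approx(A, k)
assert np.linalg.matrix_rank(A_k) == k

# ||A - A_k||_F^2 equals the sum of the discarded squared singular values.
err2 = np.linalg.norm(A - A_k, "fro") ** 2
assert np.isclose(err2, np.sum(s[k:] ** 2))
```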
### Key Properties
**Lemma (Projection onto Subspace)** The rows of $A_k$ are the projections of the rows of $A$ onto the subspace $V_k$ spanned by the first $k$ singular vectors of $A$.
**Proof.**
* Let $\mathbf{a}$ be an arbitrary row vector of $A$.
* Since the right-singular vectors $\mathbf{v}_i$ form an orthonormal basis for $V_k$, the projection of $\mathbf{a}$ onto $V_k$ is given by: $$\sum_{i=1}^{k}(\mathbf a\cdot\mathbf{v}_i)\mathbf{v}_i^T$$
* To obtain the matrix whose rows are the projections of the rows of $A$ onto $V_k$, we apply this projection to each row of $A$. This can be expressed in matrix form as: $$\sum_{i=1}^{k} A\mathbf{v}_i {\mathbf{v}_i}^T$$
* Now let's simplify this expression. Recall that from the definition of left-singular vectors, we have $A\mathbf {v}_i = \sigma_i \mathbf{u}_i$. Substituting this into the previous expression, we get: $$\sum_{i=1}^{k} A\mathbf{v}_i \mathbf{v}_i^T = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$
* Finally, observe that the right-hand side of this equation is exactly how we defined the rank-$k$ approximation $A_k$.
$\Box$
**Theorem (Optimal Approximation)** For any matrix $B$ of rank at most $k$, $$||A-A_k||_F \leq ||A-B||_F$$ that is, $A_k$ is the best rank-$k$ approximation to $A$ in the Frobenius norm.
In simpler words, if you aim to approximate a matrix $A$ by a matrix of rank at most $k$, then $A_k$ (obtained by truncating the SVD of $A$) is an optimal choice when the error is measured in the Frobenius norm (and, as shown in the next section, also in the 2-norm).
**Proof.**
* Let $B$ be a matrix of rank at most $k$ that minimizes $||A-B||_F^2$. Let $V$ be the subspace spanned by the rows of $B$.
* The dimension of $V$ is at most $k$ because the rank of a matrix is equal to the dimension of its row space.
* We claim that each row of $B$ must be the projection of the corresponding row of $A$ onto $V$.
* Suppose, for the sake of contradiction, that there is a row of $B$ that is not the projection of the corresponding row of $A$ onto $V$.
* We can replace this row of $B$ with the projection of the corresponding row of $A$ onto $V$. This operation maintains the row space of $B$ within $V$, ensuring that the rank of $B$ remains at most $k$.
* However, this replacement strictly reduces $||A-B||_F^2$, contradicting our initial assumption that $B$ minimizes this quantity.
* Therefore, our supposition must be false, and each row of $B$ is indeed the projection of the corresponding row of $A$ onto $V$.
* Given that each row of $B$ is the projection of the corresponding row of $A$ onto $V$, it follows that $||A-B||_F^2$ represents the sum of squared distances of the rows of $A$ to the subspace $V$.
* By the theorem "The Greedy Algorithm Works", $V_k$ minimizes the sum of squared distances of the rows of $A$ over all subspaces of dimension at most $k$, and by the projection lemma above this minimum equals $||A-A_k||_F^2$.
* Consequently, since $V$ has dimension at most $k$, we have: $$||A-A_k||_F \leq ||A-B||_F$$
$\Box$
### Practical Significance
This has profound implications for scenarios where numerous matrix-vector multiplications involving the same matrix are required. By approximating the original matrix with a lower-rank counterpart, computational efficiency can be significantly enhanced while maintaining a controlled level of error. This trade-off between accuracy and computational cost is invaluable in various real-world applications, particularly when dealing with large-scale datasets.
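To make the trade-off concrete: once the truncated factors are stored, a matrix-vector product costs roughly $(n+d)k$ multiplications instead of $nd$. A hedged NumPy sketch (the sizes, the planted rank, and the noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 800, 600, 20

# A matrix that is (nearly) rank-20: low-rank structure plus a little noise.
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))
x = rng.normal(size=d)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]        # keep only the top-k factors

y_exact = A @ x                              # ~ n*d multiplications
y_approx = Uk @ (sk * (Vtk @ x))             # ~ (n + d + 1)*k multiplications, applied right to left

rel_err = np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact)
print(f"relative error of the rank-{k} matrix-vector product: {rel_err:.2e}")
```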
## Left Singular Vectors
This section focuses on the left-singular vectors $\mathbf{u}_i$, which are derived from the right-singular vectors $\mathbf{v}_i$ and singular values $\sigma_i$:
$$
\mathbf{u}_i = \frac{1}{\sigma_i} A \mathbf{v}_i
$$
### Orthogonality
Similar to the right-singular vectors, the left-singular vectors are also pairwise orthogonal. This orthogonality property is fundamental to the SVD decomposition and its various applications.
**Theorem** The left singular vectors are pairwise orthogonal.
*Bonus proof.*
### Best Rank-$k$ 2-norm Approximation
Beyond being the best approximation in terms of the Frobenius norm, $A_k$, the rank-$k$ approximation obtained from SVD, is also proven to be the best in terms of the 2-norm.
**Lemma** $$||A-A_k||_2^2 = \sigma_{k+1}^2$$
*Bonus proof.*
**Theorem** Let $A$ be an $n \times d$ matrix. For any matrix $B$ of rank at most $k$,
$$
||A−A_k||_2 \leq ||A−B||_2
$$
where $||A||_2 = \max_{|\mathbf{x}| \leq 1} |A \mathbf{x}|$ is the 2-norm or the spectral norm (it is also equal to the largest singular value $\sigma_1(A)$ of $A$).
*Bonus proof.*
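A quick numerical check of the lemma above (arbitrary example matrix): the spectral norm of $A - A_k$ comes out equal to $\sigma_{k+1}$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(7, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# ||A - A_k||_2 = sigma_{k+1}  (s is 0-indexed, so sigma_{k+1} is s[k]).
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```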
### Relationship with Eigenvectors
The following lemma highlights the connection between left and right-singular vectors, mediated by singular values. It also establishes a link between SVD and eigenvector decomposition for square symmetric matrices.
**Lemma (Analog of eigenvalues and eigenvectors)**
$$
A \mathbf{v}_i = \sigma_i \mathbf{u}_i \quad \text{and} \quad A^T \mathbf{u}_i = \sigma_i \mathbf{v}_i
$$
*Bonus proof.*
In essence, this section underscores the significance of left-singular vectors in SVD and their relationship with eigenvectors, providing further insights into the underlying structure and properties of the SVD decomposition.
## Power Method for Singular Value Decomposition
The Power Method is an iterative algorithm designed for efficiently computing the Singular Value Decomposition (SVD) of a matrix, particularly advantageous when dealing with large and sparse matrices. It capitalizes on the properties of the matrix $B = A^T A$ to estimate the dominant singular vectors and values of the original matrix $A$.
### Core Principles
1. **Eigen-decomposition of B:**
* The matrix $B = A^T A$ is square and symmetric, so its left- and right-singular vectors coincide; they are exactly the right-singular vectors of the original matrix $A$.
* The eigenvectors of $B$ correspond to the right-singular vectors of $A$.
* The eigenvalues of $B$ are the squares of the singular values of $A$: $$B \mathbf{v}_j = \sigma_j^2 \mathbf{v}_j$$
2. **Power Iteration:**
* Since $B = \sum_{i=1}^r \sigma_i^2 \mathbf{v}_i \mathbf{v}_i^T$, raising $B$ to the $k$-th power and using the orthonormality of the $\mathbf{v}_i$ gives: $$B^k = \sum_{i=1}^r \sigma_i^{2k} \mathbf{v}_i \mathbf{v}_i^T$$
* Assuming $\sigma_1$ (the largest singular value) is strictly greater than $\sigma_2$, the first term in the summation dominates as $k$ increases: $$B^k \approx \sigma_1^{2k} \mathbf{v}_1 \mathbf{v}_1^T$$
* Normalizing the first column of $B^k$ provides an approximation of the first right-singular vector $\mathbf{v}_1$ of $A$.
3. **Efficient Computation:**
* Instead of computing the full matrix $B^k$, we can multiply $B^k$ by a random vector $\mathbf{x}$.
* Writing $\mathbf{x} = \sum_{i=1}^d c_i \mathbf{v}_i$ in an orthonormal basis that extends $\mathbf{v}_1, \dots, \mathbf{v}_r$ leads to: $$B^k \mathbf{x} \approx (\sigma_1^{2k} \mathbf{v}_1 \mathbf{v}_1^T) \left(\sum^d_{i=1} c_i \mathbf{v}_i\right) = \sigma_1^{2k} c_1 \mathbf{v}_1$$
* Normalizing this result yields an estimate of $\mathbf{v}_1$.
* $B^k \mathbf{x}$ is computed efficiently using a series of matrix-vector multiplications: $$B^k \mathbf{x} = (A^T A) \dots (A^T A) \mathbf{x}$$
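The steps above translate directly into a few lines of NumPy. The sketch below (random test matrix, iteration count, and seed are arbitrary) estimates the top right-singular vector and singular value by repeatedly applying $A^TA$ to a random starting vector, and compares the result with `np.linalg.svd`.

```python
import numpy as np

def power_method_top_singular(A, iters=200, seed=0):
    """Estimate (v_1, sigma_1) by repeatedly applying B = A^T A to a random unit vector."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=A.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A.T @ (A @ x)              # one application of B, as two matrix-vector products
        x /= np.linalg.norm(x)         # renormalize to avoid overflow/underflow
    return x, np.linalg.norm(A @ x)    # sigma_1 ~ |A v_1|

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 30))
v1_est, s1_est = power_method_top_singular(A)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("sigma_1: exact", round(s[0], 4), "vs power method", round(s1_est, 4))
print("|<v1_est, v1>| =", round(abs(v1_est @ Vt[0]), 6))   # close to 1 (up to sign)
```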
### Addressing Ties and Convergence
The Power Method's effectiveness is contingent on certain conditions, such as the separation between the largest and second-largest singular values and the initial choice of the random vector $\mathbf{x}$.
**Theorem** Let $A$ be an $n \times d$ matrix and $\mathbf{x}$ a unit length vector in $\mathbb{R}^d$ with $|\mathbf{x}^T \mathbf{v}_1| \geq \delta$, where $\delta > 0$. Let $V$ be the space spanned by the right singular vectors of $A$ corresponding to singular values greater than $(1 − \epsilon) \sigma_1$. Let $\mathbf{w}$ be the unit vector after $k = \frac{\ln(1 / \epsilon \delta)}{2 \epsilon}$ iterations of the power method, namely
$$
\mathbf{w} = \frac{(A^T A)^k \mathbf{x}}{|(A^T A)^k \mathbf{x}|}
$$
Then $\mathbf{w}$ has a component of at most $\epsilon$ perpendicular to $V$.
*Bonus proof.*
**Lemma** Let $\mathbf{y} \in \mathbb{R}^n$ be a random vector with the unit-variance spherical
Gaussian as its probability density. Normalize $\mathbf{y}$ to be a unit length vector by setting $\mathbf{x} = \mathbf{y}/|\mathbf{y}|$. Let $\mathbf{v}$ be any unit length vector. Then
$$
\text{Prob}\left( |\mathbf{x}^T \mathbf{v}| \leq \frac{1}{20 \sqrt{d}} \right) \leq \frac{1}{10} + 3 e^{-d / 96}
$$
*Bonus proof.*
In conclusion, the Power Method offers a computationally efficient approach to estimating the dominant singular vectors and values of a matrix, particularly valuable for large-scale datasets. Its reliance on matrix-vector multiplications makes it well-suited for sparse matrices, where traditional SVD algorithms might be computationally prohibitive.
## Singular Vectors and Eigenvectors
This section explores the intricate relationship between singular vectors, derived from Singular Value Decomposition (SVD), and eigenvectors, obtained through eigenvalue decomposition.
### Key Connection
* For a square matrix $B$, if $B \mathbf{x} = \lambda \mathbf{x}$, then $\mathbf{x}$ is an eigenvector and $\lambda$ is its corresponding eigenvalue.
* In the context of SVD, where $B = A^TA$, the right-singular vectors $\mathbf{v}_j$ of $A$ also serve as eigenvectors of $B$, with their corresponding eigenvalues being the squares of the singular values, $\sigma_j^2$.
* Similarly, the left-singular vectors $\mathbf{u}_j$ of $A$ are eigenvectors of the matrix $AA^T$, again with eigenvalues $\sigma_j^2$.
### Positive Semi-Definite Matrices
* A matrix $B$ is classified as positive semi-definite if the quadratic form $\mathbf{x}^T B \mathbf{x}$ is non-negative for all vectors $\mathbf{x}$.
* Matrices constructed in the form $A^TA$ are inherently positive semi-definite.
* Any positive semi-definite matrix $B$ can be decomposed into a product $A^TA$. Furthermore, its eigenvalue decomposition can be systematically derived from the SVD of the matrix $A$.
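These relationships are easy to verify numerically; a small sketch (arbitrary matrix and seed) checks that the eigenvalues of $A^TA$ are the squared singular values of $A$ and that its eigenvectors agree with the right-singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = A.T @ A                                    # square, symmetric, positive semi-definite

eigvals, eigvecs = np.linalg.eigh(B)           # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

assert np.allclose(eigvals, s ** 2)            # eigenvalues of A^T A are sigma_i^2
for i in range(len(s)):                        # eigenvectors match v_i up to sign
    assert np.isclose(abs(eigvecs[:, i] @ Vt[i]), 1.0)
```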
## Applications of Singular Value Decomposition
This section explores the practical applications of Singular Value Decomposition (SVD) across various domains.
### Centering Data
Centering data, the process of subtracting the centroid (mean) from each data point, is crucial in Principal Component Analysis (PCA). It ensures that the analysis focuses on how data varies relative to its mean.
* **Best-fitting subspace vs. affine space:** SVD directly finds the best-fitting subspace through the origin. Centering data allows us to find the best-fitting affine space (a translated subspace) by applying SVD to the centered data.

#### Key Lemmas
* The best-fit line minimizing the sum of squared perpendicular distances passes through the centroid.
* The $k$-dimensional affine space minimizing the sum of squared perpendicular distances also passes through the centroid.
*Bonus proof.*
These lemmas imply that centering data simplifies the problem, enabling direct application of SVD to find the best-fitting subspace.
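Putting this into practice, a hedged PCA sketch in NumPy (the synthetic data set, its dimensions, and the noise level are made up for illustration): the data is centered at its centroid, and the SVD of the centered matrix yields the principal directions.

```python
import numpy as np

rng = np.random.default_rng(7)
# 200 synthetic points in R^3 lying near a 2-D plane that does NOT pass through the origin.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, -0.5]])
X = latent @ mixing + np.array([5.0, -3.0, 2.0]) + 0.05 * rng.normal(size=(200, 3))

centroid = X.mean(axis=0)                           # center the data first ...
Xc = X - centroid
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # ... then apply SVD

print("singular values:", np.round(s, 2))   # two large values, one small (noise) value
components = Vt[:2]                          # principal directions
scores = Xc @ components.T                   # coordinates of the points in the best-fit plane
```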
### Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a cornerstone technique in dimensionality reduction and feature extraction, harnessing the power of SVD to unveil the most influential underlying factors or patterns within data.
#### Illustrative Scenario: Movie Recommendations
Consider a scenario where you have a matrix $A$ representing customer preferences for movies. Each entry $a_{ij}$ signifies how much customer $i$ enjoys movie $j$.
* **Underlying Factors:** It's reasonable to assume that a small number of factors, such as genre preferences (comedy, drama, action), influence these preferences.
* **Low-Rank Approximation:** SVD comes into play by finding the best rank-$k$ approximation $A_k$ of the original matrix $A$. This approximation can be expressed as the product of two matrices: $U$ (capturing customer affinities for these underlying factors) and $V$ (reflecting the extent to which each factor is present in each movie).
* **Interpretation and Prediction:** The dot product of a row in $U$ (representing a specific customer's preferences) and a column in $V$ (representing a specific movie's factor composition) provides an estimate of how much that particular customer will appreciate that particular movie.
* **Noise:** The discrepancy between the original matrix $A$ and its low-rank approximation $A_k$ is regarded as noise, encapsulating the variability in preferences not explained by the identified factors.
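A hedged illustration of this factor model (all sizes, names, and numbers below are invented): a small preference matrix is generated from two hidden "genre" factors plus noise, and a rank-2 truncated SVD recovers the structure and predicts individual customer-movie scores.

```python
import numpy as np

rng = np.random.default_rng(8)
n_customers, n_movies, n_factors = 20, 12, 2

# Hidden structure: customer affinities for factors and factor loadings of movies.
affinity = rng.uniform(0, 1, size=(n_customers, n_factors))
loading = rng.uniform(0, 1, size=(n_factors, n_movies))
A = affinity @ loading + 0.05 * rng.normal(size=(n_customers, n_movies))  # preferences + noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = n_factors
Uk = U[:, :k] * s[:k]     # customer-to-factor matrix (scaled by the singular values)
Vk = Vt[:k]               # factor-to-movie matrix

A_k = Uk @ Vk             # best rank-k approximation; A - A_k is treated as noise
print("||A - A_k||_F =", round(np.linalg.norm(A - A_k, "fro"), 3))

# Predicted preference of customer 3 for movie 7: dot product of the factor vectors.
print("predicted:", round(Uk[3] @ Vk[:, 7], 3), " actual:", round(A[3, 7], 3))
```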
#### Addressing Incomplete Data
In many real-world scenarios, like movie recommendations, we might only have partial information about the matrix $A$. Not all customers rate all movies, leading to missing entries. PCA, in conjunction with collaborative filtering techniques, can effectively estimate these missing values, empowering personalized recommendations or targeted advertising strategies.
### Clustering a Mixture of Spherical Gaussians
This subsection elucidates how Singular Value Decomposition (SVD) can be strategically employed to cluster data points originating from a mixture of spherical Gaussians. Clustering, a fundamental task in unsupervised learning, involves partitioning data points into groups (clusters) based on their inherent similarity. The challenge lies in determining the optimal number of clusters and assigning points to these clusters accurately.
#### Mixture Models and Gaussians
We begin by introducing the concept of mixture models, where a probability density function is represented as a weighted sum of simpler component densities. In this specific context, we focus on mixtures of spherical Gaussians, mathematically expressed as:
$$
f = w_1 p_1 + w_2 p_2 + \dots + w_k p_k
$$
where:
* $p_1, \dots, p_k$ are the component probability densities, each representing a spherical Gaussian distribution.
* $w_1, \dots, w_k$ are the mixture weights, which are positive real numbers that sum to 1, signifying the contribution of each Gaussian to the overall mixture.
#### Clustering Challenge and Inter-center Separation
The clustering task becomes notably challenging when the component Gaussians in the mixture have centers that are in close proximity to each other. We delve into the critical role of inter-center separation, measured in units of standard deviation, in ensuring successful clustering. The following key insights are highlighted:
* If $\mathbf{x}$ and $\mathbf{y}$ are two independent samples drawn from the same spherical Gaussian with standard deviation $\sigma$, then their squared distance is approximately: $$|\mathbf{x} - \mathbf{y}|^2 \approx 2 \left(\sqrt{d} \pm O(1) \right)^2 \sigma^2$$
* Conversely, if $\mathbf{x}$ and $\mathbf{y}$ are samples from different spherical Gaussians, each characterized by standard deviation $\sigma$ and means separated by distance $\Delta$, then their squared distance is approximately: $$|\mathbf{x} - \mathbf{y}|^2 \approx 2 \left(\sqrt{d} \pm O(1) \right)^2 \sigma^2 + \Delta^2$$
To achieve effective clustering, it is imperative that points originating from the same Gaussian are closer to each other than points from distinct Gaussians. This translates into the following requirement for the inter-center separation:
$$
\Delta > c d^{1/4}
$$
where $c$ is a suitable constant.
#### SVD for Clustering
The central idea behind utilizing SVD for clustering is to identify the subspace spanned by the $k$ centers of the Gaussians. Projecting the data points onto this subspace effectively reduces the complexity of the clustering problem.
#### Key Points
* The projection of a spherical Gaussian onto a subspace retains its spherical Gaussian nature with the same standard deviation.
* The best-fit $k$-dimensional subspace for $k$ spherical Gaussians is precisely the subspace that encompasses their centers.
* SVD possesses the capability to identify this subspace, even when provided with a limited number of samples from the mixture.
* Compared to traditional distance-based clustering methods, SVD-based clustering necessitates a smaller inter-mean separation (approximately one standard deviation), rendering it advantageous in specific scenarios.
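Filling in the details with a minimal NumPy sketch (the dimension, sample sizes, separation, seed, and the simple Lloyd's-style clustering loop are all illustrative choices, not prescribed by the text): points from two spherical Gaussians are projected onto the subspace spanned by the top two right-singular vectors, and the projected points are then clustered.

```python
import numpy as np

rng = np.random.default_rng(9)
d, n_per, sigma, delta = 200, 300, 1.0, 4.0   # dimension, points per Gaussian, std dev, center separation

# Two spherical Gaussians in R^d whose centers are delta apart along the first axis.
mu2 = np.zeros(d); mu2[0] = delta
X = np.vstack([sigma * rng.normal(size=(n_per, d)),
               mu2 + sigma * rng.normal(size=(n_per, d))])
labels = np.array([0] * n_per + [1] * n_per)

# Best-fit 2-dimensional subspace via the top right-singular vectors; project the data onto it.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
proj = X @ Vt[:k].T

# Cluster the projected (2-D) points with a few Lloyd's iterations,
# initialized with a far-apart pair of points.
far = np.argmax(((proj - proj[0]) ** 2).sum(axis=1))
centers = proj[[0, far]]
for _ in range(20):
    assign = np.argmin(((proj[:, None, :] - centers[None]) ** 2).sum(axis=-1), axis=1)
    centers = np.array([proj[assign == j].mean(axis=0) for j in range(k)])

# Labels are recovered only up to permutation, so take the better of the two matchings.
accuracy = max(np.mean(assign == labels), np.mean(assign != labels))
print(f"clustering accuracy after SVD projection: {accuracy:.3f}")
```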
### Ranking Documents and Web Pages
Singular Value Decomposition (SVD) extends its utility to the realm of ranking documents and web pages, enabling us to assess their relevance or importance within a collection or network.
#### Document Ranking
In the context of a document collection, the "intrinsic relevance" of a document can be quantified as its projection onto the best-fit direction, which corresponds to the top left-singular vector of the term-document matrix. This direction encapsulates the maximum sum of squared projections of the entire collection, effectively serving as a representative synthetic term-document vector. By ranking documents based on their projection onto this direction, we obtain a measure of their relevance to the collection as a whole.
#### Web Page Ranking (HITS Algorithm)
The HITS (Hyperlink-Induced Topic Search) algorithm employs SVD to discern authoritative and hub pages within the vast expanse of the World Wide Web.

* **Authorities:** Web pages recognized as the most prominent and reliable sources of information on a particular topic.
* **Hubs:** Web pages that serve as curators, identifying and linking to authoritative pages on a specific topic.
* **Circular Definition:** The definitions of hubs and authorities are inherently intertwined: a hub gains significance by pointing to numerous authorities, while an authority derives its authority from being referenced by many hubs.
* **Hub and Authority Weights:** The HITS algorithm assigns hub weights (vector $\mathbf{u}$) and authority weights (vector $\mathbf{v}$) to each node (web page) within the web graph, represented by its adjacency matrix $A$.
* **Power Iteration:** The algorithm starts from an initial vector and iteratively refines both the hub vector $\mathbf{u}$ and the authority vector $\mathbf{v}$ using the following formulas: $$\mathbf{u} = \frac{A \mathbf{v}}{|A \mathbf{v}|}, \quad \mathbf{v} = \frac{A^T \mathbf{u}}{|A^T \mathbf{u}|}$$
* **Convergence to Singular Vectors:** This iterative process converges to the top left-singular and right-singular vectors of the adjacency matrix $A$.
* **Ranking:** Upon convergence, the left vector $\mathbf{u}$ embodies the hub weights, while the authority ranking is established by projecting each column of $A$ onto $\mathbf{u}$. This projection is mathematically equivalent to ranking based on the values within the right-singular vector $\mathbf{v}$.
*Bonus implementation:*
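A minimal sketch of the iteration in NumPy (the toy graph, iteration count, and variable names are invented for illustration); the resulting hub and authority weights match the top left- and right-singular vectors of $A$ up to sign.

```python
import numpy as np

def hits(A, iters=100):
    """HITS power iteration on an adjacency matrix A; returns (hub, authority) weight vectors."""
    u = np.ones(A.shape[0]) / np.sqrt(A.shape[0])   # hub weights
    v = np.ones(A.shape[1]) / np.sqrt(A.shape[1])   # authority weights
    for _ in range(iters):
        u = A @ v; u /= np.linalg.norm(u)           # a good hub points to good authorities
        v = A.T @ u; v /= np.linalg.norm(v)         # a good authority is pointed to by good hubs
    return u, v

# Toy web graph: a_ij = 1 means page i links to page j (purely illustrative).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

u, v = hits(A)
U, s, Vt = np.linalg.svd(A)
print("hub weights       :", np.round(u, 3), " top left-singular vector :", np.round(np.abs(U[:, 0]), 3))
print("authority weights :", np.round(v, 3), " top right-singular vector:", np.round(np.abs(Vt[0]), 3))
```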
#### PageRank
Another prominent web page ranking algorithm, PageRank, operates on the principle of random walks on the web graph. Its intricacies will be explored in greater detail in a subsequent chapter.
### An Illustrative Application of SVD: Manifold Dimensionality in Deep Neural Networks
This subsection showcases a practical and insightful application of SVD within the domain of deep neural networks. Specifically, it demonstrates how SVD can be leveraged to estimate the dimensionality of manifolds residing in the activation space of these networks.
#### Context
* Deep neural networks are adept at classifying input images into various categories, such as "cat," "dog," or "car."
* These images are internally mapped to a high-dimensional activation space, which can potentially consist of thousands of dimensions.
* It is hypothesized that images belonging to a particular category, such as "cat" images, might lie on a lower-dimensional manifold embedded within this vast activation space.
#### Challenge
Determining the precise dimension of this manifold presents a formidable challenge due to several factors:
* The availability of images is typically limited. For instance, we might only have access to a thousand cat images.
* These images tend to be scattered widely across the activation space, making it difficult to discern the underlying manifold structure.
* Estimating the tangent subspace at a specific activation vector necessitates the presence of numerous nearby cat activation vectors, which might not be readily available.
#### SVD's Role
SVD steps in to address these challenges through a clever two-step process:
1. **Generating Nearby Images:**
* SVD is applied to each image, represented as a matrix (e.g., 1000 x 1000).
* A few small singular values are selectively zeroed out. This subtle modification alters the image slightly without significantly impacting its visual appearance.
* Crucially, these modified images have activation vectors that lie in close proximity to the original image's activation vector in the activation space.
* This process is repeated to generate a multitude of such images, effectively creating a denser sampling around the original image's activation vector.
2. **Estimating Manifold Dimension:**
* A matrix is constructed using these generated activation vectors.
* SVD is then performed on this matrix.
* The number of "large" singular values serves as an indicator of the dimension of the tangent subspace at the original cat activation vector. This, in turn, provides an approximation of the dimension of the underlying cat manifold itself.
* The threshold for determining "large" singular values can be established by examining the cumulative sum of squared singular values or by observing a pronounced drop in their magnitudes.
**In essence,** SVD empowers us to generate a denser cloud of points surrounding the original image's activation vector. This facilitates the estimation of the tangent subspace's dimension, which ultimately sheds light on the dimensionality of the manifold itself. This knowledge is invaluable for comprehending the inherent structure of the data represented within the activation space of a neural network, contributing to a deeper understanding of its inner workings.
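Since no trained network is at hand here, the sketch below only illustrates the second step on simulated data (all sizes, the noise level, and the 99% energy threshold are assumptions): activation vectors scattered around a base point along a low-dimensional tangent subspace are generated, and the number of singular values needed to capture most of the squared energy recovers that dimension.

```python
import numpy as np

rng = np.random.default_rng(10)
ambient_dim, tangent_dim, n_samples = 512, 7, 300     # illustrative sizes only

# Simulated "activation vectors of slightly modified images": a fixed base activation
# plus displacements lying (almost) in a tangent_dim-dimensional tangent subspace.
base = rng.normal(size=ambient_dim)
tangent_basis = np.linalg.qr(rng.normal(size=(ambient_dim, tangent_dim)))[0]
coords = rng.normal(size=(n_samples, tangent_dim))
activations = base + coords @ tangent_basis.T + 1e-3 * rng.normal(size=(n_samples, ambient_dim))

# Center the activations and inspect the singular value spectrum.
M = activations - activations.mean(axis=0)
s = np.linalg.svd(M, compute_uv=False)

# Estimated dimension: smallest k whose singular values carry 99% of the squared energy.
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
est_dim = int(np.searchsorted(energy, 0.99) + 1)
print("estimated tangent-subspace dimension:", est_dim)   # close to tangent_dim
```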
### An Application of SVD to a Discrete Optimization Problem
This subsection illustrates the application of SVD in tackling a discrete optimization challenge, specifically the maximum cut problem in the context of a directed graph.
#### The Maximum Cut Problem
In essence, the maximum cut problem seeks to partition the nodes of a directed graph into two distinct subsets, denoted as $S$ and its complement $\bar{S}$, such that the number of edges traversing from $S$ to $\bar{S}$ is maximized.

* **Adjacency Matrix Representation:** We employ the adjacency matrix $A$ to represent the graph, where $a_{ij}=1$ signifies the presence of an edge from node $i$ to node $j$, and $0$ otherwise.
* **Indicator Variables:** To facilitate the mathematical formulation, we associate an indicator variable $x_i$ with each vertex $i$. If $i$ belongs to the subset $S$, then $x_i = 1$; otherwise, $x_i = 0$.
* **Objective Function:** The count of edges crossing the cut (from $S$ to $\bar{S}$) is mathematically expressed as: $$\sum_{i,j} x_i (1 - x_j) a_{ij}$$
* **Optimization Problem:** The maximum cut problem can be formally stated as the following optimization task: $$\begin{split}\max \ &\mathbf{x}^T A (\mathbf{1} - \mathbf{x}) \\ \text{ s.t. } &x_i \in \{0, 1\}\end{split}$$ where $\mathbf{1}$ denotes a vector consisting entirely of $1$'s. This problem is known to be NP-hard, implying that finding an exact solution in polynomial time is computationally intractable.
#### SVD as an Approximation Tool
For dense graphs, characterized by a substantial number of edges, SVD offers a pathway to obtain a near-optimal solution in polynomial time. The strategic approach involves the following steps:
1. **SVD of A:** Compute the Singular Value Decomposition of the adjacency matrix $A$.
2. **Low-Rank Approximation:** Replace the original matrix $A$ with its rank-$k$ approximation $A_k$, derived from the SVD: $$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$$
3. **Modified Optimization Problem:** Solve the modified optimization problem, where the original adjacency matrix $A$ is substituted with its low-rank approximation $A_k$: $$\begin{split}\max \ &\mathbf{x}^T A_k (\mathbf{1} - \mathbf{x}) \\ \text{ s.t. } &x_i \in \{0, 1\}\end{split}$$
#### Key Results
1. **Error Bound:** The discrepancy between the objective function values for the original and modified problems is subject to a bound: $$|\mathbf{x}^T A (\mathbf{1} - \mathbf{x}) - \mathbf{x}^T A_k (\mathbf{1} - \mathbf{x})| \leq \frac{n^2}{\sqrt{k+1}}$$ This bound ensures that the solution obtained from the modified problem remains near-optimal with respect to the original problem.
2. **Efficient Solution:** Leveraging the low-rank structure of $A_k$, a near-optimal solution for the modified problem can be computed efficiently in polynomial time (for a constant value of k) using dynamic programming techniques.
**Theorem** Given a directed graph $G(V,E)$, a cut of size at least the maximum cut minus $O\left(\frac{n^2}{\sqrt{k}}\right)$ can be computed in time polynomial in $n$ for any fixed $k$.
*Bonus proof and implementation:*
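The proof is omitted here. As a stand-in for the implementation, the sketch below is not the polynomial-time dynamic-programming algorithm of the theorem; it simply brute-forces a small random directed graph (sizes, edge density, and seed are arbitrary) to show that maximizing the rank-$k$ objective yields a cut whose true value is close to the maximum cut.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(11)
n, k = 12, 3
A = (rng.random((n, n)) < 0.4).astype(float)       # dense-ish random directed graph
np.fill_diagonal(A, 0)

U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # rank-k approximation of the adjacency matrix

def cut_value(M, x):
    """x^T M (1 - x): total weight of edges from S = {i : x_i = 1} to its complement."""
    return x @ M @ (1 - x)

best_exact, best_approx = (-1.0, None), (-1.0, None)
for bits in product([0.0, 1.0], repeat=n):         # brute force over all 2^n cuts (small n only)
    x = np.array(bits)
    exact, approx = cut_value(A, x), cut_value(A_k, x)
    if exact > best_exact[0]:
        best_exact = (exact, x)
    if approx > best_approx[0]:
        best_approx = (approx, x)

x_hat = best_approx[1]                             # maximizer of the rank-k objective
print("maximum cut                           :", best_exact[0])
print("cut found by maximizing A_k objective :", cut_value(A, x_hat))
print("error bound n^2 / sqrt(k+1)           :", round(n * n / np.sqrt(k + 1), 1))
```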
## Reference
* Chapter 3 of Blum A, Hopcroft J, Kannan R. Foundations of Data Science. Cambridge University Press; 2020.
## Further Reading
### Principal Component Analysis
* James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. (See Chapter 12.1: Principal Components, 12.2: Higher-order Principal Components, and 12. Py: Principal Components) [link1](https://youtu.be/kpuQqOzQXfM?si=sxS2egqpmb5_DHFy) [link2](https://youtu.be/O30nHhyBiAs?si=-vXwsoypzWcboFng) [link3](https://youtu.be/3zpISnhksqQ?si=HY8E2SjZshtWSd7O)
* Scikit-learn User Guide: 2.5. Decomposing signals in components (matrix factorization problems) [link](https://scikit-learn.org/stable/modules/decomposition.html)
### Mixtures of Spherical Gaussians
* Scikit-learn User Guide: 2.1. Gaussian mixture models [link](https://scikit-learn.org/stable/modules/mixture.html)
* Learning mixtures of spherical Gaussians: moment methods and spectral decompositions by Jonathan Wilder Lavington: https://www.cs.ubc.ca/labs/lci/mlrg/slides/20201028.pdf
* Hsu, D., & Kakade, S. M. (2013). Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science (ITCS '13). Association for Computing Machinery, New York, NY, USA, 11–20. https://doi.org/10.1145/2422436.2422439
* Tosh, C., & Dasgupta, S. (2017). Maximum likelihood estimation for mixtures of spherical Gaussians is NP-hard. Journal of Machine Learning Research, 18(1), 6399–6409
### Recommender system
* [Wikipedia: Recommender system](https://en.wikipedia.org/wiki/Recommender_system)
* Park, K., & Konstan, J. A. (2015). Practical Recommender Systems. Manning Publications Co.
### HITS
* [Wikipedia: HITS algorithm](https://en.wikipedia.org/wiki/HITS_algorithm)
* [Link analysis](https://nlp.stanford.edu/IR-book/html/htmledition/link-analysis-1.html) section in "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
* [Towards Data Science: HITS Algorithm: Link Analysis Explanation and Python Implementation from Scratch](https://towardsdatascience.com/hits-algorithm-link-analysis-explanation-and-python-implementation-61f0762fd7cf)
* Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5), 604-632.
### Maximum Cut
* [Wikipedia: Maximum cut](https://en.wikipedia.org/wiki/Maximum_cut)
* [maximum-cut](https://github.com/dwave-examples/maximum-cut)
* Crescenzi, P., Kann, V., Halldórsson, M., & Karpinski, M. A compendium of NP optimization problems. (See the [Network Design](https://www.csc.kth.se/~viggo/wwwcompendium/node69.html) section)