# Thoughts Behind the Equation $(A^TA+\lambda I_n)^{-1}A^T=A^T(AA^T+\lambda I_m)^{-1}$
We often encounter the following fact in the context of [Ridge Regression](https://en.wikipedia.org/wiki/Tikhonov_regularization):
Let $A\in\mathbb{R}^{m\times n}$ be a matrix and $\lambda>0$ be a positive scalar. Then $$(A^TA+\lambda I_n)^{-1}A^T=A^T(AA^T+\lambda I_m)^{-1},\tag{1}$$ where $I_n$ and $I_m$ denote the identity matrices on $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively.
The proof is straightforward: one shows that $$A^T(AA^T+\lambda I_m)=(A^TA+\lambda I_n)A^T,$$
and that both $AA^T+\lambda I_m$ and $A^TA+\lambda I_n$ are invertible (but why?).
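Spelled out, the first step is a single line of algebra: $$A^T(AA^T+\lambda I_m)=A^TAA^T+\lambda A^T=(A^TA+\lambda I_n)A^T.$$ Multiplying this identity by $(A^TA+\lambda I_n)^{-1}$ on the left and by $(AA^T+\lambda I_m)^{-1}$ on the right yields $(1)$. The invertibility question is precisely what the SVD picture below makes transparent: every eigenvalue of $A^TA+\lambda I_n$ (and of $AA^T+\lambda I_m$) is either $\sigma_j^2+\lambda$ or $\lambda$, all of which are positive.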
In this note, I would like to approach this equation in a different way, with the help of the Singular Value Decomposition (SVD).
---
## Facts
Let me first build up some facts that will be useful later.
Let $M\in\mathbb{R}^{m\times n}$ be any matrix of rank $r$. We write down the SVD of $M$ in both matrix form and vector form: $$M=U\Sigma V^T = \sum_{j\le r} \sigma_j u_j v_j^T,$$ where $\sigma_j>0$ for each $j\le r$, $\{u_j\}_{j\le m}$ is an orthonormal basis of $\mathbb{R}^m$ (i.e., the columns of $U$), and $\{v_j\}_{j\le n}$ is an orthonormal basis of $\mathbb{R}^n$ (i.e., the columns of $V$).
Then the transpose $M^T$ and the inverse $M^{-1}$ (which exists when $m=n=r$) are $$M^T=\sum_{j\le r} \sigma_j v_ju_j^T\tag{2}$$ and $$M^{-1}=\sum_{j\le r} \frac{1}{\sigma_j}v_ju_j^T\tag{3}$$ These results will be used in the next section.
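As a quick numerical sanity check of $(2)$ and $(3)$ with NumPy (a small sketch; the matrix size and random seed are arbitrary choices):
```python=
import numpy as np

rng = np.random.default_rng(0)
m = n = 5                          # square, so the inverse in (3) exists (m = n = r)
M = rng.standard_normal((m, n))    # a random Gaussian matrix is almost surely invertible

U, s, Vt = np.linalg.svd(M)        # M = U @ diag(s) @ Vt; columns of U are u_j, rows of Vt are v_j^T

# Rebuild M^T and M^{-1} from the rank-one sums in (2) and (3).
MT = sum(s[j] * np.outer(Vt[j], U[:, j]) for j in range(len(s)))
Minv = sum(np.outer(Vt[j], U[:, j]) / s[j] for j in range(len(s)))

assert np.allclose(MT, M.T)
assert np.allclose(Minv, np.linalg.inv(M))
```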
---
## Dissect
Now let $A\in\mathbb{R}^{m\times n}$ be an arbitrary matrix of rank $r$, and fix a positive scalar $\lambda\in\mathbb{R}_{+}$. We express the SVD of $A$ as $$A=\sum_{j\le r}\sigma_ju_jv_j^T.$$
We start by computing the term $(A^TA+\lambda I_n)^{-1}A^T$.
Using the observations built up in the previous section, together with the orthonormality $u_j^Tu_k=\delta_{jk}$, we find that $$A^TA=\sum_{j\le r} \sigma_j^2v_jv_j^T.$$
Hence, since $I_n=\sum_{j\le n}v_jv_j^T$, $$A^TA+\lambda I_n = \sum_{j\le r}(\sigma_j^2+\lambda)v_jv_j^T+\sum_{j>r}\lambda v_jv_j^T.$$
Furthermore, by $(3)$, we have $$(A^TA+\lambda I_n)^{-1} = \sum_{j\le r}{1\over \sigma_j^2+\lambda}v_jv_j^T+\sum_{j>r}{1\over \lambda}v_jv_j^T$$
and, by $(2)$ together with $v_j^Tv_k=\delta_{jk}$, $$(A^TA+\lambda I_n)^{-1}A^T = \sum_{j\le r}{\sigma_j\over \sigma_j^2+\lambda}v_j u_j^T.$$
This completes the computation of $(A^TA+\lambda I_n)^{-1}A^T$.
To derive $A^T(AA^T+\lambda I_m)^{-1}$, we use the same argument as above, and the results are written below: $$AA^T=\sum_{j\le r} \sigma_j^2u_ju_j^T.$$ $$AA^T+\lambda I_m = \sum_{j\le r}(\sigma_j^2+\lambda)u_ju_j^T+\sum_{j>r}\lambda u_ju_j^T.$$ $$(AA^T+\lambda I_m)^{-1} = \sum_{j\le r}{1\over \sigma_j^2+\lambda}u_ju_j^T+\sum_{j>r}{1\over \lambda}u_ju_j^T$$ $$A^T(AA^T+\lambda I_m)^{-1} = \sum_{j\le r}{\sigma_j\over \sigma_j^2+\lambda}v_j u_j^T.$$
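Both computations collapse to the same expression, which proves $(1)$. As a quick numerical check of this closed form (again a small sketch; the sizes and seed are arbitrary choices):
```python=
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 6, 4, 0.5
A = rng.standard_normal((m, n))    # almost surely of full rank r = min(m, n)

U, s, Vt = np.linalg.svd(A)        # A = U @ diag(s) @ Vt
r = len(s)

# sum_{j<=r} sigma_j / (sigma_j^2 + lambda) * v_j u_j^T
closed_form = sum(s[j] / (s[j] ** 2 + lam) * np.outer(Vt[j], U[:, j]) for j in range(r))

LHS = np.linalg.inv(A.T @ A + lam * np.identity(n)) @ A.T
RHS = A.T @ np.linalg.inv(A @ A.T + lam * np.identity(m))

assert np.allclose(closed_form, LHS)
assert np.allclose(closed_form, RHS)
```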
---
## A Weapon to Reduce Computation Cost
In Ridge Regression, the optimization problem we seek to solve is $$\underset{w}{\text{minimize}}\quad\|Aw-y\|_2^2+\lambda\|w\|_2^2,$$ where $y\in\mathbb{R}^m$ is the target vector. Its optimal solution is $\hat{w}=(A^TA+\lambda I_n)^{-1}A^Ty$, which we can derive directly by setting the gradient of the objective function to zero.
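Spelling out that gradient step: $$\nabla_w\left(\|Aw-y\|_2^2+\lambda\|w\|_2^2\right)=2A^T(Aw-y)+2\lambda w=0 \quad\Longrightarrow\quad (A^TA+\lambda I_n)\hat{w}=A^Ty,$$ and solving this linear system gives the expression above.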
However, when the number of columns of $A$ is much larger than the number of its rows (e.g., the number of features is much larger than the number of samples), evaluating $(A^TA+\lambda I_n)^{-1}A^T$ becomes expensive (for instance, obtaining $A^TA$ alone requires about $mn^2$ operations). In that case, computing $\hat{w}=A^T(AA^T+\lambda I_m)^{-1}y$ instead, which is justified by equation $(1)$, gives a lower time complexity (it requires only about $m^2n$ operations to get $AA^T$).
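To make the comparison concrete with the sizes used in the experiment below ($m=123$, $n=4567$): forming $A^TA$ takes about $mn^2\approx 2.6\times10^{9}$ operations and inverting the resulting $n\times n$ matrix about $n^3\approx 9.5\times10^{10}$, whereas forming $AA^T$ takes about $m^2n\approx 6.9\times10^{7}$ operations and inverting the resulting $m\times m$ matrix only about $m^3\approx 1.9\times10^{6}$.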
To validate the above argument, here is a simple experiment using Python.
```python=
import time
import numpy as np
if __name__ == "__main__":
    n_samples = 123      # m: the number of rows of A
    n_features = 4567    # n: the number of columns of A (n >> m)
    lambda_ = 89

    A = np.random.randn(n_samples, n_features)

    # Left-hand side of (1): invert the large (n x n) matrix A^T A + lambda * I_n.
    start = time.time()
    LHS = np.linalg.inv(A.T @ A + lambda_ * np.identity(n_features)) @ A.T
    end = time.time()
    print(f"Elapsed time of computing LHS: {end - start:.4f}s.")

    # Right-hand side of (1): invert the much smaller (m x m) matrix A A^T + lambda * I_m.
    start = time.time()
    RHS = A.T @ np.linalg.inv(A @ A.T + lambda_ * np.identity(n_samples))
    end = time.time()
    print(f"Elapsed time of computing RHS: {end - start:.4f}s.")

    # Both are (n x m) matrices and should agree numerically.
    assert LHS.shape == (n_features, n_samples)
    assert RHS.shape == (n_features, n_samples)
    assert np.allclose(LHS, RHS)
```
The output on my machine looks as follows:
```bash=
Elapsed time of computing LHS: 3.1054s.
Elapsed time of computing RHS: 0.0052s.
```
Additionally, in Ridge Regression, the expression $\hat{w}=A^T(AA^T+\lambda I_m)^{-1}y$ reveals that the prediction function is a linear combination of inner products between the new sample and the training samples, which motivates the so-called kernel trick.
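To make this explicit, let $a_i^T$ denote the $i$-th row of $A$ (the $i$-th sample) and write $\alpha=(AA^T+\lambda I_m)^{-1}y$. Then, for a new sample $x\in\mathbb{R}^n$, $$\hat{y}(x)=x^T\hat{w}=x^TA^T\alpha=\sum_{i\le m}\alpha_i\langle a_i,x\rangle,$$ and the entries of $AA^T$ are themselves the inner products $\langle a_i,a_k\rangle$, so both training and prediction touch the samples only through inner products, which may then be replaced by kernel evaluations.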
Examined through these interpretations, equation $(1)$ becomes a vivid object rather than merely a piece of rigid mathematical notation.
###### tags: `singular-value-decomposition` `linear-algebra` `tikhonov-regularization`