# SVD, EVD and PCA
###### tags: `Machine Learning` `Principal Component Analysis` `PCA`
## What does a singular value mean?
- For an $n\times d$ matrix $A$, whose rows $\mathbf{x}_i$ are data points in $\mathbb{R}^d$
- The best-fit line through the origin minimizes the sum of squared perpendicular distances from the points to the line

By the Pythagorean theorem, for each point $\mathbf{x}_i$,
$$
x_{i1}^2+ x_{i2}^2+\cdots +x_{id}^2=(length~of~projection)^2+(distance~of~point~to~line)^2
$$
That is
$$
(distance~of~point~to~line)^2 = x_{i1}^2+ x_{i2}^2+\cdots +x_{id}^2-(length~of~projection)^2
$$
Summing over all points, $\sum_{i=1}^n \left( x_{i1}^2+ x_{i2}^2+\cdots +x_{id}^2 \right)$ is a constant independent of the line, so minimizing the total squared distance of the points to the line is equivalent to maximizing the total squared projection length. For a unit vector $\mathbf{v}$ along the line, the projection length of $\mathbf{x}_i$ is $|\mathbf{x}_i \cdot \mathbf{v}|$, so we want
$$
\max_{|\mathbf{v}|=1} \sum_{i=1}^n (\mathbf{x}_i \cdot \mathbf{v})^2 = \max_{|\mathbf{v}|=1}|A\mathbf{v}|^2
$$
Define the *first singular vector*, $\mathbf{v}_1$, of $A$ as
$$
\mathbf{v}_1 = \arg \max_{|\mathbf{v}|=1}|A\mathbf{v}|
$$
The value $\sigma _1(A)=|A\mathbf{v}_1|$ is called the *first singular value* of $A$
:::info
Here we can see that $\mathbf{v}_1$ is the direction that maximizes the total squared projection length of all the points onto it.
Note : $A=\begin{pmatrix} \mathbf{x}_1 \\\ \mathbf{x}_2 \\\ \vdots \\\ \mathbf{x}_n\end{pmatrix}$
then $|A\mathbf{v}_1|=\sigma_1$
:::
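As a quick numerical check of this definition, here is a minimal NumPy sketch (the data matrix and variable names are made up for illustration) comparing the best $|A\mathbf{v}|$ found over many random unit vectors with the first singular value returned by `np.linalg.svd`.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))           # n x d data matrix, one point per row

# Brute-force search: |Av| over many random unit vectors v
V = rng.normal(size=(3, 20000))
V /= np.linalg.norm(V, axis=0)          # normalize each candidate direction
best = np.linalg.norm(A @ V, axis=0).max()

# First singular value from the SVD
sigma1 = np.linalg.svd(A, compute_uv=False)[0]

print(best, sigma1)                     # best is close to (and never exceeds) sigma1
```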
The *second singular vector*, $\mathbf{v}_2$ , is defined by the best fit line perpendicular to $\mathbf{v}_1$
$$
\mathbf{v}_2 =\arg \max_{\mathbf{v} \perp \mathbf{v}_1, |\mathbf{v}|=1}{|A \mathbf{v} |}
$$
The value $\sigma _2(A)=|A\mathbf{v}_2|$ is called the *second singular value* of $A$ and so on.
The process stops when we have found
$$
\mathbf{v}_1, \mathbf{v}_2, \cdots , \mathbf{v}_r
$$
as singular vectors and
$$
\max_{\mathbf{v} \perp \mathbf{v}_1, \mathbf{v}_2, \cdots , \mathbf{v}_r, \\ |\mathbf{v}|=1}{|A \mathbf{v} |} = 0
$$
:::info
This means that $rank(\mathbf{A})=r$; any further vector $\mathbf{v}_{r+1}$ orthogonal to $\mathbf{v}_1, \cdots , \mathbf{v}_r$ lies in the null space of $A$.
:::
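A small NumPy sketch of this stopping condition, assuming a deliberately rank-deficient matrix: the third singular value is (numerically) zero and its right singular vector lies in the null space of $A$.
```python
import numpy as np

rng = np.random.default_rng(1)
# Build an n x 3 matrix whose third column is a combination of the first two,
# so rank(A) = 2 even though d = 3.
B = rng.normal(size=(50, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                  # third singular value is ~0
v3 = Vt[2]                                # right singular vector for sigma_3
print(np.linalg.norm(A @ v3))             # ~0: v3 is in the null space of A
```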
## Singular Value Decomposition
The vectors $\mathbf{v}_1, \mathbf{v}_2, \cdots , \mathbf{v}_r$ span the space of the rows of $A$. Thus, for each row $\mathbf{a}_j$ of $A$,
$$
\sum_{j=1}^n|\mathbf{a}_j|^2=\sum_{j=1}^n \sum_{i=1}^r (\mathbf{a}_j \mathbf{v}_i)^2 =\sum_{i=1}^r \sum_{j=1}^n (\mathbf{a}_j \mathbf{v}_i)^2 =\sum_{i=1}^r|A\mathbf{v}_i |^2=\sum_{i=1}^r \sigma_i ^2(A)
$$
Now, for any vector $\mathbf{v}$, $A\mathbf{v}$ is a linear combination of $A\mathbf{v}_1, A\mathbf{v}_2, \cdots , A\mathbf{v}_r$. So $A\mathbf{v}_1, A\mathbf{v}_2, \cdots , A\mathbf{v}_r$ form a fundamental set of vectors associated with $A$. We normalize them to length one by
$$
\mathbf{u}_i=\frac{1}{\sigma_i(A)}A\mathbf{v}_i
$$
The vectors $\mathbf{u}_1, \mathbf{u}_2, \cdots , \mathbf{u}_r$ are called the *left singular vectors* of $A$. The $\mathbf{v}_i$ are called the *right singular vectors*.
:::info
This can be read as $A\mathbf{v}_i=\sigma_i \mathbf{u}_i$. Recall that $\sigma _1(A)=|A\mathbf{v}_1|$:
the combined length of the projections of the rows $\mathbf{a}_j$ of $A$ onto $\mathbf{v}_i$ is $\sigma_i$, and the corresponding direction is $\mathbf{u}_i$.
:::
Therefore,
$$
A=\sum_{i=1}^r \sigma_i(A)\mathbf{u}_i\mathbf{v}_i^T
$$
:::success
:bulb: Recall that when computing the SVD, we form $A^TA$ to obtain the right singular vectors $\mathbf{v}_i$,
and $AA^T$ to obtain the left singular vectors $\mathbf{u}_i$. In practice we usually care only about $\mathbf{v}_i$ and $\sigma_i$, which represent the projection directions and their importance, respectively.
:::
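To make the decomposition concrete, the following NumPy sketch (with an arbitrary random matrix) rebuilds $A$ from $\sum_i \sigma_i\mathbf{u}_i\mathbf{v}_i^T$, checks that each $\mathbf{v}_i$ is an eigenvector of $A^TA$, and verifies the norm identity derived above.
```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-r reconstruction: A = sum_i sigma_i * u_i * v_i^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))                          # True

# Each right singular vector is an eigenvector of A^T A with eigenvalue sigma_i^2
print(np.allclose(A.T @ A @ Vt[0], s[0] ** 2 * Vt[0]))    # True

# Norm identity: sum of squared row norms equals sum of sigma_i^2
print(np.isclose((A ** 2).sum(), (s ** 2).sum()))         # True
```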
## SVD v.s. EVD
Eigenvalue decomposition can only be applied to a square matrix.
For a square matrix $A$, we can find vectors $\mathbf{v}_i$ such that
$$
A\mathbf{v}_i = \sigma_i \mathbf{v}_i
$$
Comparing with the SVD, this is the special case in which $\mathbf{v}_i = \mathbf{u}_i$.
Therefore,
$$
A=\sum_{i=1}^r \sigma_i(A)\mathbf{v}_i\mathbf{v}_i^T
$$
When the matrix $A$ is symmetric and positive semi-definite, the SVD coincides with the EVD.
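A short NumPy check of this statement, assuming a random symmetric positive semi-definite matrix: the singular values match the eigenvalues, and the left and right singular vectors coincide.
```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
A = M @ M.T                               # symmetric positive semi-definite

U, s, Vt = np.linalg.svd(A)
eigvals, eigvecs = np.linalg.eigh(A)      # eigh: for symmetric matrices, ascending order

print(np.allclose(s, eigvals[::-1]))      # singular values equal the (sorted) eigenvalues
print(np.allclose(np.abs(U), np.abs(Vt.T)))  # left and right singular vectors span the same directions
```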
## PCA v.s. SVD
PCA常用於特徵降維,其目的是找到最能夠代表所有特徵的方向並進行投影,同時提供使用者重要性來評估要保留多少特徵。
如何評估最能夠代表特徵的方向呢?PCA使用共變異矩陣(Covariance matrix)來判斷。

Intuitively, a candidate direction $\mathbf{v}$ that the data points are far from is not the best projection direction; the best projection direction $\mathbf{v}'$ is the one the points lie closest to.
From the above we can see that the idea is very similar to SVD. The main difference is that **PCA first computes the covariance matrix and then applies SVD**.
The covariance between two features $\mathbf{a}$ and $\mathbf{b}$ (two columns of the data matrix), each with $n$ samples, is
$$
cov(\mathbf{a},\mathbf{b})=\frac{1}{n-1}\sum_{i=1}^n(a_i-\mu_a)^*(b_i-\mu_b)
$$
Collecting these values for every pair of columns of $A$ gives the covariance matrix $cov(A)=\frac{1}{n-1}(A-\mu)^T(A-\mu)$.
:::info
In general, the measured data $A$ are recorded as
$$
A = \begin{pmatrix} \mathbf{y}_1 \\\ \mathbf{y}_2 \\\ \vdots \\\ \mathbf{y}_n \end{pmatrix},~where~\mathbf{y}_i~is~1\times d
$$
but the variance of the data is computed per measured feature, i.e. per column.
:::
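As a small illustration (assuming NumPy and a made-up data matrix stored one sample per row), the per-column covariance matrix can be computed directly and matches `np.cov` with `rowvar=False`.
```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(100, 3))             # n samples (rows) x d features (columns)

mu = A.mean(axis=0)                       # per-feature mean
C_manual = (A - mu).T @ (A - mu) / (A.shape[0] - 1)

C_numpy = np.cov(A, rowvar=False)         # rowvar=False: each column is a variable
print(np.allclose(C_manual, C_numpy))     # True
```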
Therefore, the PCA algorithm applies SVD to the covariance matrix,
$$
cov(A) = \sum_{i=1}^r\sigma_i \mathbf{u}_i\mathbf{v}_i^T
$$
Also, since $cov(A)$ is a symmetric, positive semi-definite matrix,
$$
\mathbf{v}_i=\mathbf{u}_i
$$
Here, $\sigma_i$ is the importance of the $i$-th principal component and $\mathbf{v}_i$ is its direction. That is,
$$
PCA(X)=SVD(X^TX)=EVD(X^TX),~where~each~row~of~X~is~a~centered~data~point.
$$
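This relation can be sanity-checked numerically. The sketch below (NumPy, with a hypothetical two-feature data matrix) compares the EVD and the SVD of $X^TX$ for centered data, and relates it to the covariance matrix.
```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])   # correlated features
Xc = X - X.mean(axis=0)                       # center the data first
G = Xc.T @ Xc                                 # X^T X (proportional to the covariance matrix)

# EVD of X^T X (eigh returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(G)

# SVD of X^T X
U, s, Vt = np.linalg.svd(G)

print(np.allclose(s, eigvals[::-1]))                             # same spectrum
print(np.allclose(np.abs(U), np.abs(eigvecs[:, ::-1])))          # same directions (up to sign)

# The principal directions are these eigenvectors; dividing by (n - 1)
# recovers the covariance matrix used by PCA.
print(np.allclose(np.cov(Xc, rowvar=False), G / (len(Xc) - 1)))  # True
```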
## Discussion
:question: Why can we use $\mathbf{v}$ to project the original data? This $\mathbf{v}$ is obtained from the covariance matrix!
:bulb: Key concept : find the direction along which the projection has **maximum variance**
$$
variance = \sigma^2 =\frac{1}{n} \sum_{i=1}^n(\mathbf{v}^Tx_i-\mu)^2=\frac{1}{n} \sum_{i=1}^n(\mathbf{v}^Tx_i)^2,~if~\mu=0
\\ Therefore,~\sigma^2=\frac{1}{n} \sum_{i=1}^n(\mathbf{v}^Tx_i)(\mathbf{v}^Tx_i)^T=\frac{1}{n} \sum_{i=1}^n \mathbf{v}^Tx_ix_i^T\mathbf{v}=\mathbf{v}^T\Big(\frac{1}{n}\sum_{i=1}^n x_ix_i^T\Big)\mathbf{v}
\\ =\mathbf{v}^TC\mathbf{v},~where~C~is~the~covariance~matrix.
$$
So our objective function becomes
$$
\mathbf{v}=\arg\max_{||\mathbf{v}||=1}\mathbf{v}^TC\mathbf{v}
\\ f(\mathbf{v}, \lambda)=\mathbf{v}^TC\mathbf{v}-\lambda(||\mathbf{v}||^2-1)
\\ ~~~~~~~~~~~~=\mathbf{v}^TC\mathbf{v}-\lambda(\mathbf{v}^T\mathbf{v}-1)
$$
Setting the partial derivatives of $f$ to zero gives
$$
C\mathbf{v} = \lambda\mathbf{v}
\\ ||\mathbf{v}||=1
$$
:::info
Notice that this is the standard form of an eigenvalue problem. If we find the eigenvectors of the covariance matrix, we find the projection directions that give the projected data maximum variance.
:::
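As a sanity check on this result, the following sketch (NumPy, made-up data) compares $\mathbf{v}^TC\mathbf{v}$ for the top eigenvector of $C$ against many random unit vectors; no random direction exceeds it.
```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # features with different spreads
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)
v_top = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

# Projected variance v^T C v for many random unit vectors
V = rng.normal(size=(3, 10000))
V /= np.linalg.norm(V, axis=0)
random_var = (V * (C @ V)).sum(axis=0).max()

print(v_top @ C @ v_top, random_var)        # the top eigenvector attains the maximum
```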
## Example
For a matrix $\mathbf{A}$
$$
A = \begin{pmatrix} 1&2 \\\ 3&4 \\\ 5&6 \\\ 7&8 \end{pmatrix}
$$
The PCA result is
$$
[coeff,score,latent]=pca(\mathbf{A}),\\\
coeff= \begin{pmatrix} 0.7071&0.7071 \\\ 0.7071&-0.7071\end{pmatrix},\\\
score= \begin{pmatrix} -4.243&4.484\times10^{-16} \\\ -1.414&-1.623\times10^{-16} \\\ 1.414&1.623\times10^{-16} \\\ 4.243&3.402\times10^{-16} \end{pmatrix},\\\
latent = \begin{pmatrix} 13.333 \\\ 1.232\times10^{-31}\end{pmatrix}
$$
The SVD of the covariance matrix $\mathbf{C}=\frac{1}{N-1}(\mathbf{A}-\mu)^T(\mathbf{A}-\mu)$ is
$$
[\mathbf{U},\mathbf{S},\mathbf{V}] = svd(\mathbf{C}),\\\
\mathbf{S}=\begin{pmatrix} 13.333&0 \\\ 0&7.626\times10^{-17}\end{pmatrix},\\\
\mathbf{V}=\begin{pmatrix} 0.7071&0.7071 \\\ 0.7071&-0.7071\end{pmatrix}=\mathbf{U}\\\
$$
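The example above uses MATLAB-style calls; a NumPy sketch that reproduces the same quantities (the names `coeff`, `score`, `latent` simply mirror the MATLAB outputs) might look like this.
```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)

# PCA on the centered data (MATLAB's pca centers the data by default)
Ac = A - A.mean(axis=0)
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)

coeff = Vt.T                                # principal directions (columns), up to sign
score = Ac @ coeff                          # projected data
latent = s ** 2 / (A.shape[0] - 1)          # variance explained by each component

print(coeff)        # ~[[0.7071, 0.7071], [0.7071, -0.7071]] up to column signs
print(score[:, 0])  # ~[-4.243, -1.414, 1.414, 4.243] up to sign
print(latent)       # ~[13.333, 0]
```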