主成份分析 (Principle Component Analysis)

--- tags: 機器學習 (Machine Learning) --- # 主成份分析 (Principle Component Analysis) ## 降維 (Dimensional Reduction) 為什麼需要降維？ 1. 在原始的高維度空間中，常常包含了不必要的噪音訊息 2. 將高維度空間壓縮至較低維度的空間後，可以有效減少機器學習訓練縮需的時間 3. 數據可視化：因為人類可辨識的維度有限，如果可以將維度降到較低維度，我們比較有機會找出資料中有用之訊息。 4. 維度災難：維度若是呈現線性上升，但樣本所需卻為指數上升。舉例而言，如果一個維度需要100個點來估計機器學習所需的參數，當維度為20維時，我們需要$100^{20}$個樣本點，這個數量級的數據資料是非常難取得的；若沒有增加樣本點，則在高維度空間中，樣本數變得非常稀疏。為了緩解維度過高所造成的上述問題，一個重要的方法稱為降維，即通過某種轉換，將一個原本屬於高維度的資料點降到較低維度，且資料特性不能差太多。 :::info * **Feature selection** is the process of identifying and selecting relevant features for your sample. * **Feature engineering** is manually generating new features from existing features, by applying some transformation or performing some operation on them. ::: ## Principle Componet Analysis (PCA) ### 矩陣符號與運算 1. $X_{n \times p}$為我們拿到之資料，總共有n個樣本點，每個樣本點有p個變項，例如身高、體重、性別等。 $$ X_{n \times p} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} $$ 2. 假設有一個矩陣$V_{p \times q}$，若$q$個行向量之間互為獨立。則 $$ V^T V = \begin{pmatrix} \cdots & v_{(1)}^T & \cdots \\ \cdots & v_{(2)}^T & \cdots \\ \cdots & \vdots & \cdots \\ \cdots & v_{(q)}^T & \cdots \\ \end{pmatrix} \begin{pmatrix} \vdots & \vdots & \vdots & \vdots \\ v_{(1)} & v_{(2)} & \cdots & v_{(q)} \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix} =I_{q \times q} $$ 因為 $$ v_{(i)}^Tv_{(j)} = \begin{cases} 1 & \text{for all $i$} \\ 0 & \text{for all $i \not= j$} \end{cases} $$ 但$$VV^T \not = I.$$ 3. 協方差矩陣(Covairance Matrix)：假設X經過中心(centralized) $$ Var(X)=\frac{1}{n} X^T X= \frac{1}{n} \begin{pmatrix} \sum_{i=1}^n x_{i1}^2 & \sum_{i=1}^n x_{i1}x_{i2} & \cdots & \sum_{i=1}^n x_{i1}x_{ip} \\ \sum_{i=1}^n x_{i2}x_{i1} & \sum_{i=1}^n x_{i2}x_{i2} & \cdots & \sum_{i=1}^n x_{i2}x_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^n x_{ip}x_{i1} & \sum_{i=1}^n x_{ip}x_{i2} & \cdots & \sum_{i=1}^n x_{ip}x_{ip} \end{pmatrix} $$ 4. Rayleigh定理^[[Rayleigh定理證明](https://ccjou.wordpress.com/2010/03/16/hermitian-%E7%9F%A9%E9%99%A3%E7%89%B9%E5%BE%B5%E5%80%BC%E7%9A%84%E8%AE%8A%E5%8C%96%E7%95%8C%E5%AE%9A/)]：若$A$為一個$n\times n$的Hermitian矩陣，則 \begin{align*} \max_{x} \frac{x^T A x}{x^T x}&=\lambda_1 \\ \min_{x} \frac{x^T A x}{x^T x}&=\lambda_n \end{align*} 也就是說Rayleigh商的上界與下界分別為A的最大特徵值$\lambda_1$與最小特徵值$\lambda_n$，且Rayleigh商的最大值與最小值分別發生於對應$\lambda_1$與$\lambda_n$的特徵向量$\mathbf{u}_1$和 $\mathbf{u}_n$。 ### 推導 1. 我們將資料($x$)轉換到較低的維度點(z)上，假設將x轉換到z的函數為 $$ f(Z)=\mu + V_q Z $$ 其中$\mu$是偏移量，$V_q$是一個$p \times q$的矩陣($p>q$)，且$q$個行向量皆獨立，記為 $$ v_{(i)}^Tv_{(j)} = \begin{cases} 1 & \text{for all $i=j$} \\ 0 & \text{for all $i \not= j$} \end{cases} $$ :::success 透過$f(Z)$將原本為$p$維度的點$x_{1 \times p}$透過$V_q$轉換到$q$維度上的點$z_{1 \times q}$ ::: 2. 我們首先必須思考什麼是==資料特性不能差太多==?一種量測的方法即為使用$L_2$-norm，透過數學子表示為 $$ L(V_q)=\sum_{i=1}^n \| x_i - \mu - V_q z_i \|_2^2 $$ 利用最小化$L(V_q)$找到最適合的$\mu,V_q$和$z_i$。 3. 分別對$\mu$和$z_i$偏微分得到$\hat{\mu}$和$\hat{z}_i$為 \begin{align*} \hat{\mu}&=\frac{1}{n} \sum_{i=1}^n x_i = \bar{x}\\ \hat{z}_i&=V_q (x_i-\bar{x}) \end{align*} 將所得到的$\hat{\mu}$和$\hat{z}_i$帶回$L(V_q)$中， \begin{align*} \min_{V_q} &\sum_{i=1}^n \| (x_i-\bar{x})-V_q V_q^T (x_i-\bar{x}) \|^2_2\\ &= \min_{V_q} \left\{ \sum_{i=1}^n (x_i-\bar{x})^T (x_i-\bar{x}) - \sum_{i=1}^n (x_i-\bar{x})^T V_q V_q^T (x_i-\bar{x}) \right\} \\ &= \sum_{i=1}^n (x_i-\bar{x})^T (x_i-\bar{x}) - \max_{V_q} tr \left( \sum_{i=1}^n (x_i-\bar{x})^T V_q V_q^T (x_i-\bar{x}) \right) \\ &= \sum_{i=1}^n (x_i-\bar{x})^T (x_i-\bar{x}) - \max_{V_q} tr \left( V_q^T \left[ \sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T \right] V_q \right)\\ &= \sum_{i=1}^n (x_i-\bar{x})^T (x_i-\bar{x}) - \max_{V_q} tr \left( V_q^T \Sigma V_q \right) \end{align*} 4. 利用Rayleigh定理：將 $$ \max_{V_q} V_q^T \Sigma V_q $$ 改寫為 $$ V_q^T Var(X) V_q = \begin{pmatrix} \cdots & v_{(1)}^T & \cdots \\ \cdots & v_{(2)}^T & \cdots \\ \cdots & \vdots & \cdots \\ \cdots & v_{(q)}^T & \cdots \\ \end{pmatrix} \Sigma \begin{pmatrix} \vdots & \vdots & \vdots & \vdots \\ v_{(1)} & v_{(2)} & \cdots & v_{(q)} \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix} $$ 此時我們轉換成最大化 $$ \mathop{\arg \max}_{v_{(i)} \in \mathbb{R}^q, \|v_{(i)}\|=1} v_{(i)}^T \Sigma v_{(i)} \text{ $\forall$ $i=1,...,q$} $$ 其中$v_{(i)} \perp v_{(j)}$ for all $i \not= j$。由Rayleigh定理可得知，最佳上界即為$\Sigma$的最大特徵值$\lambda_1$，且$\lambda_1$對應的特徵向量為$\mathbf{u_1}$，則 $$ \hat{v}_{(1)}=\mathbf{u_1} $$ 依此類推即可得轉換空間至$V_q$中最後一個維度$q$，即為$\Sigma$之第$q$個大的特徵值$\lambda_q$($\lambda_1 \geq \lambda_2 \geq...\geq \lambda_q \geq ... \geq \lambda_p$)，所對應之特徵向量為$\mathbf{u_q}$ $$ \hat{v}_{(q)}=\mathbf{u_q} $$ 最後 $$ \hat{V}_q = \begin{pmatrix} \vdots & \vdots & \vdots & \vdots \\ \mathbf{u_1} & \mathbf{u_2} & \cdots & \mathbf{u_q} \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix} $$ :::info 我們透過矩陣的線性轉換($f$)試圖讓樣本轉換到另一個較低維度的同時，並不會失去過多的資訊。我們利用$L_2$-norm去衡量損失的資訊，經過轉換後，可以將$L(V_q)$改寫為找到一個投影向量讓投影後的資料變異量($Var$)最大。因此出現最大方差理論，認為在信號處理中認為信號具有較大的方差，噪聲有較小的方差。 ::: :::success The Rayleigh Quotient $$ \max \frac{x^T A x}{x^T x} $$ where $A$ is symmetric. Let $x^T x =1$. Then, The Lagrangian is $$ L(x)= x^T A x + \lambda (x^T x-1) $$ Taking the derivative with respect to $x$: \begin{align*} \frac{\partial L}{\partial x} &= x^T (A+A^T) + 2 \lambda x^T\\ \frac{\partial L}{\partial \lambda} &= x^T x - 1 \end{align*} Set this two equation as zero. \begin{align*} &x^T (A+A^T) = -2 \lambda x^T \\ \Rightarrow& A x = \tilde{\lambda}x \end{align*} where $\tilde{\lambda}=-2\lambda$. Hence the maximum and the minimum are obtained for $x$ an eigenvector of $A$ (the Lagrange multipliers provide a necessary condition. The extremum is indeed obtained because $x^T A x$ is a continuous function and the unit sphere is a compact set). For $x$ an eigenvector of $A$ with unit norm, $$ x^T A x = x^T \lambda x = \lambda x^T x = \lambda. $$ Therefore the maximum is obtained at the eigenvector corresponding to the largest eigenvalue of $A$. ::: ### PCA優點 1. 在$L_2$-norm的衡量下，盡量減少信息丟失的基礎上，對高維變量空間進行降維處理，提高模型效率 2. 將具有一定相關性的多個指標化簡為少數幾個具有代表性意義的綜合指標 ### PCA限制由推導中我可以得知PCA需要的假設 1. 屬於線性降維 (假設$f(Z)=\mu+V_q Z$） 2. 轉換後的空間變項必須互相獨立 (假設$v_{(i)} \perp v_{(j)}$) ### 怎麼選擇主成份的個數($q$)^[[https://www.zhihu.com/question/21980732](https://www.zhihu.com/question/21980732)]：在上面的推導中我們假設$q$為已知，但在實務上，選擇$q$大約可分為 1. 如果是為數據可視化，可以分別降到1维(線)，2维(平面)，或者3维(立體)。 2. 看PCA解釋了多少百分比的方差，常用的例如$99\%, 95\%, 90\%$ 3. 類似elbow method的方法，畫出主成份的個數和解釋變異數百分比的曲線，並找出手肘的那個點當作主成份的個數。 ###### tags: `Machine Learning`, `Unsupervised Learning`, `PCA`