Principal Component Analysis

# Principal Component Analysis 大概介紹一下 PCA ，從最簡單的用 best fit line 理解，再用 SVM 實作，最後帶入更高維度並且用 matplotlib 呈現最後寫出跟 sklearn 一樣的 PCA 物件 Github: https://github.com/1tangerine1day/PCA-with-numpy ``` import matplotlib.pyplot as plt import numpy as np ``` ## PCA by best fit line 從 https://www.youtube.com/watch?v=FgakZw6K1QQ 上面看到的做法，才發現原來也可以用 best fit line 方式帶入 eigenvector 和 eigen value 首先是資料: | | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | |--- |--- |---|---|---|---|---| |feature 1 |10|11|8|3|2|1| |feature 2 |6|4|5|3|2.8|1| 我們先將 sample1 畫在座標上吧: ``` feature1 = [10, 11, 8, 3, 2, 1] feature2 = [6, 4, 5, 3, 2.8, 1] plt.scatter(feature1, feature2) plt.show() ``` ![](https://i.imgur.com/qUX2LDf.png) 為了我們方便觀看與計算，我們先將整個圖形平移(不會影響點之間的關係) ``` av_1 = sum(feature1) / len(feature1) av_2 = sum(feature2) / len(feature2) feature1 = [x - av_1 for x in feature1] feature2 = [x - av_2 for x in feature2] plt.scatter(feature1, feature2) plt.axhline(color='black', lw=0.5) plt.axvline(color='black', lw=0.5) plt.show() ``` ![](https://i.imgur.com/eB1WwJB.png) PCA 的想法就是想要將資料投影到更低維度，並且希望在上面的資料的變異數越大越好(可以劃分更明確) 因此我們想要投影到一 y = a * x + b，且投影至上面的 |y| 總和最大那這跟 best fit line 有什麼關係呢，我們可以看下面這張圖 ![](https://i.imgur.com/NB6QwtU.png) 藉由畢氏定理，我們想要 c = |y| 最大，相當於要找最小 b ，就可以套用我們的 best fit line 公式找出 a 和 b ``` def best_fit(X, Y): xbar = sum(X)/len(X) ybar = sum(Y)/len(Y) n = len(X) # or len(Y) numer = sum([xi*yi for xi,yi in zip(X, Y)]) - n * xbar * ybar denum = sum([xi**2 for xi in X]) - n * xbar**2 a = numer / denum b = ybar - a * xbar print('best fit line:\ny = {:.2f}x + {:.2f}'.format(a, b)) return a, b a, b = best_fit(feature1, feature2) ``` ``` best fit line: y = 0.34x + -0.00 ``` 公式的來源也很簡單，將所有點的 xi 代入值與自己的 yi 相減平方，求總和最小 $f = \sum(y_i-(a * x_i + b))^2 = \sum(y_i - a * x_i - b)^2$ 求最小最直覺想法就是微分，只是這題我們是要求 a 和 b，因此用到偏微分: $\frac{d(a)}{d(f)}=\sum(\frac{d(a)}{d(f)}(y_i - a * x_i - b)^2)=\sum(2(y_i - a * x_i - b)-2a)=0$ $\frac{d(b)}{d(f)}=\sum(\frac{d(b)}{d(f)}(y_i - a * x_i - b)^2)=\sum(2(y_i - a * x_i - b)-1)=0$ 解聯立再化減即可將 best fit line 畫出來: ``` plt.scatter(feature1, feature2) plt.axhline(color='black', lw=0.5) plt.axvline(color='black', lw=0.5) PC1 = [a * xi + b for xi in feature1] plt.plot(feature1, PC1) plt.show() ``` ![](https://i.imgur.com/Kxbyryu.png) 斜率 0.34 = 17/50 經由畢氏定理: ``` x = (50**2 + 17**2)**0.5 print(x) print(x/x, 50/x, 17/x) ``` ``` 52.81098370604357 1.0 0.9467727448197127 0.32190273323870233 ``` 斜率我們可以想成...一種線性組合: 0.94 * feature1 + 0.32 * feature2 [0.94, 0.32] 是 PC1 的一組 eigenvector !!!(很跳痛XD) 這邊想法有點 tricky 影片也是直接帶過，我自己的想法是這樣: 先定義一下 PC1 ， PC1 是一個 function，他將 x, y 所有的點投影至 y = 0.34x 的線上，這也是我們的目的將更高維的座標化減成低維的 (2 -> 1)，透過 best fit line 我們找出一個可以使資料投影在這上面變異數最大的 function PC 我們稱之為 PC1 再來我們看看 eigenvector 定義: $AV = λV$ or $T(V) = λV$, T is linear opearator 因為 PC1 是投影 function ，是一種 linear opearator 也可以寫成投影矩陣(公式我忘了) 而 [0.94, 0.32] 是平行 y = 0.34x 線上的一個向量，必可以寫成 $PC1( [0.94, 0.32] ) = λ [0.94, 0.32]$ 所以可以說 [0.94, 0.32] 是 PC1 的一組 eigenvector (這樣理解應該沒錯吧?) 但下一步我就不會解釋了 ![](https://i.imgur.com/WU2UnN6.png) sum of square distance (on PC1): $d$ $d$ is eigen value for PC1 $d^{0.5}$ is singular value for PC1 而變異數則是 ![](https://i.imgur.com/gLcrwy5.png) code 部分則套用投影公式(這邊的不用記，記之後的就好)，計算長度總和 ``` def sum_of_square_distance(a, b, c, X, Y): sum = 0 for i, j in zip(X,Y): x_proj = i - a*(a*i + b*j + c) / ((a**2) + (b**2)) y_proj = j - b*(a*i + b*j + c) / ((a**2) + (b**2)) # sum += (x_proj**2 + y_proj**2)**0.5 sum += x_proj**2 + y_proj**2 return sum e1 = sum_of_square_distance(-a, 1, b, feature1, feature2) print("eigen value for PC1:", e1) print("singular value for PC1:", e1** 0.5) ``` ``` eigen value for PC1: 106.40117345669768 singular value for PC1: 10.315094447298952 ``` 我們找到了一個 eigen value 最大的 PC ，那還有其他的 PC 嗎? 有的，因為這是 2 dim 我們可以直接取垂直線當作另外一組 eigenvector 形成 PC2 取 -0.32 * feature1 + 0.94 * feature2 -> [-0.32, 0.94] is eigenvector of PC2 ``` plt.scatter(feature1, feature2) plt.axhline(color='black', lw=0.5) plt.axvline(color='black', lw=0.5) PC1 = [b + a * xi for xi in feature1] PC2 = [b + (-1/a) * xi for xi in feature1] plt.plot(feature1, PC1) plt.plot(feature1, PC2) max_value = max(feature1 + feature2)+1 plt.xlim(-max_value, max_value) plt.ylim(-max_value, max_value) plt.gca().set_aspect('equal', adjustable='box') plt.show() ``` ![](https://i.imgur.com/YWEA1Ox.png) 比較一下兩組 eigen vale 和 variation ``` e1 = sum_of_square_distance(-a, 1, b, feature1, feature2) e2 = sum_of_square_distance(1/a, 1, b, feature1, feature2) print("eigen value of PC1:" , e1) print("eigen value of PC2:" , e2) print("") print("singular value of PC1:" , e1** 0.5) print("singular value of PC2:" , e2** 0.5) variation_pc1 = e1 / (len(feature1)-1) variation_pc2 = e2 / (len(feature1)-1) total_variation = variation_pc1 + variation_pc2 print("") print("variation for PC1:", variation_pc1) print("variation for PC2:", variation_pc2) print("") print("PC1 accounts for {:f}% of the total variation around PCs ".format(variation_pc1/ total_variation * 100)) print("PC2 accounts for {:f}% of the total variation around PCs ".format(variation_pc2/ total_variation * 100)) ``` ``` eigen value of PC1: 106.40117345669768 eigen value of PC2: 4.0654932099689916 singular value of PC1: 10.315094447298952 singular value of PC2: 2.0163068243620543 variation for PC1: 21.280234691339537 variation for PC2: 0.8130986419937983 PC1 accounts for 96.319710% of the total variation around PCs PC2 accounts for 3.680290% of the total variation around PCs ``` 這邊可以想成要將資料分散的話 PC1 佔了這個分散動作的 96.319710% (暫時這樣解釋XD) 再繼續講之前先來看看 SVD 的做法 ## PCA by SVD ### SVD 這邊就先不多做解釋，大概複習一下 ![](https://miro.medium.com/max/440/0*8Eigag8mZn9gTFp2) ``` A = U * S * VT (m, n) = (m, m) (m, n) (n,n) ``` or ``` A = U * S * VT (m, n) = (m, r) (r, r) (r,n) ``` s is singular value v is eigenvectors ### 實作 ``` matrix = np.array([feature1,feature2]) matrix = np.transpose(matrix) u, s, vt = np.linalg.svd(matrix, full_matrices=False) v = np.transpose(vt) print(matrix.shape, u.shape, s.shape, vt.shape) print("singular value:", s[0], s[1]) print("eigen value:", s[0]**2, s[1]**2) print("eigenvector:", v[0], v[1]) ``` ``` (6, 2) (6, 2) (2,) (2, 2) singular value: 10.316009946380492 2.011617620932341 eigen value: 106.42006121382123 4.046605452845491 eigenvector: [-0.94171069 -0.33642381] [-0.33642381 0.94171069] ``` 哈哈哈其實有點尷尬，eigenvector 算出來好像和前面些微不同XD(撇除 "-" 號方向) SVD : [-0.94171069 -0.33642381] beast fit line: [0.9467727448197127 0.32190273323870233] 不過後面有檢察 sklearn 的答案，SVD 是正確的，beast fit line 因該是前面畢氏定理除以斜邊的浮點數影響 SVD : 106.42006121382123 4.046605452845491 beast fit line: 106.40117345669768 4.0654932099689916 但 eigen value 前面是套 a, b 公式計算(沒有用除以斜邊的向量)，結果就相同接下來一樣畫圖看看吧! ``` plt.scatter(feature1, feature2) plt.axhline(color='black', lw=0.5) plt.axvline(color='black', lw=0.5) PC1 = [0 + v[0][1]/v[0][0] * xi for xi in feature1] PC2 = [0 + v[1][1]/v[1][0] * xi for xi in feature1] plt.plot(feature1, PC1) plt.plot(feature1, PC2) max_value = max(feature1 + feature2)+1 plt.xlim(-max_value, max_value) plt.ylim(-max_value, max_value) plt.gca().set_aspect('equal', adjustable='box') plt.show() print("eigen value of PC1:" , s[0]**2) print("eigen value of PC2:" , s[1]**2) print("") variation_pc1 = s[0]**2/(len(feature1)-1) variation_pc2 = s[1]**2/(len(feature1)-1) print("variation for PC1:", variation_pc1) print("variation for PC2:", variation_pc2) print("") total_variation = variation_pc1 + variation_pc2 print("PC1 accounts for {:f}% of the total variation around PCs ".format(variation_pc1/ total_variation * 100)) print("PC2 accounts for {:f}% of the total variation around PCs ".format(variation_pc2/ total_variation * 100)) ``` ![](https://i.imgur.com/qAj1SH4.png) 一樣無誤，接下來我們真的要把資料投影到 PC1 PC2 上，並且當作新的 X Y 軸視覺上就把頭轉成跟上面一樣的角度就好，但程式上很難阿阿阿要計算投影到 PC1 PC2 上的長度，當作新的 xi yi，還要確定方向總之想了很久之後，會用到以下三個 function ``` def get_project(v1, v2): v2_norm = np.sqrt(sum(v2**2)) proj_of_v1_on_v2 = (np.dot(v1, v2)/v2_norm**2)*v2 return proj_of_v1_on_v2 def get_distance(v): return np.sqrt(sum(v**2)) def get_direction(v1, v2): bias = np.dot(np.array([1]*len(v2)), v2) if np.dot(v1, v2) >= 0: return 1*bias else: return -1*bias ``` * get_project(v1, v2): 是計算 v1 投影到 v2 上的向量，公式為(這個就要記了): * $u_1 = \frac{v_i . v_2}{||v2||_2} * v_2$ * get_distance(v): 將算出來的 $u_1$ 帶入求距離的 * get_direction(v1, v2): 剛開始有距離後第一個想法就是，乾方向怎麼辦，尤其是 4 維以上的空間，根本想像不到正負阿，後來複習之後才想到內積的正負可以判斷是否同向，就先判斷 v2 在空間中的方向(和[1,1,...1]比)當作基準在和要投影的 v1 比 Code就剩迴圈代入的苦差活(自認寫的很爛)，這邊不講解了下面class有，總之畫出來結果(code): ![](https://i.imgur.com/iUSgqGe.png) 可能有人會想說這樣一樣是二維阿，沒錯我們是畫了二維，但仔細想想 PC0(對應前面的 PC1 這邊我改 index 重 0 開始) 就佔了 96% 的"分散資料"功勞，其實用一維 PC0 就可以代表整體資料的差距了 ![](https://i.imgur.com/mmWOnmh.png) 找出部分重要的 PC ，在用找出的 PC 形成新的座標體系，就是我們最後要的 ## Complete code ``` import matplotlib.pyplot as plt import numpy as np class PCA(): def __init__(self, matrix): self.matrix = self.__shift(matrix) u, s, vt = np.linalg.svd(self.matrix, full_matrices=False) variations = s**2/(len(f1)-1) v = np.transpose(vt) self.singular_values = s self.eigenvectors = v[:len(s)] self.eigen_values = s**2 self.PCs = variations/sum(variations) def __shift(self, matrix): new_matrix = [] for m in matrix: av_m = sum(m)/len(m) new_row = [] for i in m: new_row.append(i - av_m) new_matrix.append(new_row) new_matrix = np.transpose(new_matrix) return new_matrix def __get_project(self, v1, v2): v2_norm = np.sqrt(sum(v2**2)) proj_of_v1_on_v2 = (np.dot(v1, v2)/v2_norm**2)*v2 return proj_of_v1_on_v2 def __get_distance(self, v): return np.sqrt(sum(v**2)) def __get_direction(self, v1, v2): bias = np.dot(np.array([1]*len(v2)), v2) if np.dot(v1, v2) >= 0: return 1*bias else: return -1*bias def show_PC_variations(self): x = np.arange(len(self.PCs)) plt.bar(x, (self.PCs * 100).tolist()) plt.xticks(x, ["PC"+str(i) for i in x]) plt.xlabel('PCA') plt.ylabel('principal components(%)') plt.show() def show_2_PCA(self, pc1, pc2): PC1_proj = [self.__get_project(np.array([i for i in w]), self.eigenvectors[pc1]) for w in self.matrix] PC1_direct = [self.__get_direction(np.array([i for i in w]), self.eigenvectors[pc1]) for w in self.matrix] PC1_distance = [d*self.__get_distance(v) for v,d in zip(PC1_proj, PC1_direct)] PC2_proj = [self.__get_project(np.array([i for i in w]), self.eigenvectors[pc2]) for w in self.matrix] PC2_direct = [self.__get_direction(np.array([i for i in w]), self.eigenvectors[pc2]) for w in self.matrix] PC2_distance = [d*self.__get_distance(v) for v,d in zip(PC2_proj, PC2_direct)] plt.axhline(color='black', lw=0.5) plt.axvline(color='black', lw=0.5) plt.scatter(PC1_distance, PC2_distance) for n in range(len(PC1_distance)): plt.annotate(str(n), (PC1_distance[n], PC2_distance[n])) plt.xlabel("PC" + str(pc1) + " (" + str(round(self.PCs[pc1]*100, 2)) +"%)") plt.ylabel("PC" + str(pc2) + " (" + str(round(self.PCs[pc2]*100, 2)) +"%)") plt.show() def show_1_PCA(self, pc1): PC1_proj = [self.__get_project(np.array([i for i in w]), self.eigenvectors[pc1]) for w in self.matrix] PC1_direct = [self.__get_direction(np.array([i for i in w]), self.eigenvectors[pc1]) for w in self.matrix] PC1_distance = [d*self.__get_distance(v) for v,d in zip(PC1_proj, PC1_direct)] plt.axhline(color='black', lw=0.5) plt.scatter(PC1_distance, [0]*len(PC1_distance)) for n in range(len(PC1_distance)): plt.annotate(str(n), (PC1_distance[n], 0)) plt.xlabel("PC" + str(pc1) + " (" + str(round(self.PCs[pc1]*100, 2)) +"%)") plt.show() ``` 這邊就舉個四維的例子 ``` # sample0 ~ sample5 f0 = [10, 11, 8, 3, 2, 1] f1 = [6, 4, 5, 3, 0.7, 1] f2 = [8, 2, 9, 1.5, 5, 5] f3 = [7, 6, 5, 4, 8, 8.5] matrix = np.array([f0,f1,f2,f3]) pca = PCA(matrix) ``` 求出的 eigenvectors 和 eigen values ``` pca.eigenvectors pca.eigen_values ``` ``` array([[-0.86889937, 0.24093226, 0.33542894, -0.27286069], [-0.40900889, -0.02568878, -0.35886838, 0.83860915], [-0.25039745, -0.92977389, -0.15847275, -0.21842166], [ 0.12258356, -0.27715023, 0.85649711, 0.41782021]]) array([120.08515057, 44.27951779, 13.30470227, 1.38896271]) ``` 看看不同 PC 所佔的比例 ``` pca.show_PC_variations() ``` ![](https://i.imgur.com/xLtWCVm.png) 可以觀察出來其實我們取 PC0 PC1 來形成一個新的座標體系，就能表達大部分的資料差距程度了 ``` pca.show_2_PCA(0,1) ``` ![](https://i.imgur.com/wxQs8zo.png) 也可以觀察個別 PC ， PC3 的 eigenvector [0.12258356, -0.27715023, 0.85649711, 0.41782021] 可以看出參考 f2 f3 比較重尤其是 f2 (線性組合權重) ``` pca.show_1_PCA(3) ``` ![](https://i.imgur.com/qDZPl6c.png) 因此: sample4 sample5 在 f2 f3 都很接近，所以投影上去幾乎黏在一起 sample1 sample3 雖然在 f2 都很接近，但因為 f0 差太多(畢竟還有 0.12258356) 所以很分散(? sample0 sample2 在 f2 f3 f1 也都蠻接近總之之後就自己觀察吧 ## Compare 和 sklearn 比較一下答案，基本上都是一樣的，我們 PCA() 作法應該無誤 ### Our PCA() ``` f1 = [10, 11, 8, 3, 2, 1] f2 = [6, 4, 5, 3, 0.7, 1] f3 = [8, 2, 9, 1.5, 5, 5] f4 = [7, 6, 5, 4, 8, 8.5] matrix = np.array([f1,f2,f3,f4]) pca = PCA(matrix) print(pca.singular_values) print(pca.PCs) pca.show_2_PCA(0,1) ``` ![](https://i.imgur.com/0hcgznj.png) ### Sklearn PCA() ``` from sklearn.decomposition import PCA as SK_PCA f1 = [10, 11, 8, 3, 2, 1] f2 = [6, 4, 5, 3, 0.7, 1] f3 = [8, 2, 9, 1.5, 5, 5] f4 = [7, 6, 5, 4, 8, 8.5] matrix = np.array([f1,f2,f3,f4]) matrix = np.transpose(matrix) pca = SK_PCA(n_components=4) pca.fit(matrix) print(pca.singular_values_) print(pca.explained_variance_ratio_) ``` ![](https://i.imgur.com/xWr304O.png)