# 體驗主成份分析
Understanding the principal component analysis

This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
$\newcommand{\trans}{^\top}
\newcommand{\adj}{^{\rm adj}}
\newcommand{\cof}{^{\rm cof}}
\newcommand{\inp}[2]{\left\langle#1,#2\right\rangle}
\newcommand{\dunion}{\mathbin{\dot\cup}}
\newcommand{\bzero}{\mathbf{0}}
\newcommand{\bone}{\mathbf{1}}
\newcommand{\ba}{\mathbf{a}}
\newcommand{\bb}{\mathbf{b}}
\newcommand{\bc}{\mathbf{c}}
\newcommand{\bd}{\mathbf{d}}
\newcommand{\be}{\mathbf{e}}
\newcommand{\bh}{\mathbf{h}}
\newcommand{\bp}{\mathbf{p}}
\newcommand{\bq}{\mathbf{q}}
\newcommand{\br}{\mathbf{r}}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\by}{\mathbf{y}}
\newcommand{\bz}{\mathbf{z}}
\newcommand{\bu}{\mathbf{u}}
\newcommand{\bv}{\mathbf{v}}
\newcommand{\bw}{\mathbf{w}}
\newcommand{\tr}{\operatorname{tr}}
\newcommand{\nul}{\operatorname{null}}
\newcommand{\rank}{\operatorname{rank}}
%\newcommand{\ker}{\operatorname{ker}}
\newcommand{\range}{\operatorname{range}}
\newcommand{\Col}{\operatorname{Col}}
\newcommand{\Row}{\operatorname{Row}}
\newcommand{\spec}{\operatorname{spec}}
\newcommand{\vspan}{\operatorname{span}}
\newcommand{\Vol}{\operatorname{Vol}}
\newcommand{\sgn}{\operatorname{sgn}}
\newcommand{\idmap}{\operatorname{id}}
\newcommand{\am}{\operatorname{am}}
\newcommand{\gm}{\operatorname{gm}}
\newcommand{\mult}{\operatorname{mult}}
\newcommand{\iner}{\operatorname{iner}}$
```python
from lingeo import random_int_list
```
## Main idea
In statistics, data is often stored as a matrix (a two-dimensional table) such that
each row represents a sample, while
each column represents a feature.
For example, the rows can be the students and the columns can be their scores in different subjects.
Let $X$ be an $N\times d$ matrix for some data and $\bx_1,\ldots,\bx_N\in\mathbb{R}^d$ its rows.
Then
$$
\mu = \frac{1}{N} \sum_{i = 1}^N \bx_i
$$
is the center of these data points.
One may **center** the data by creating a new matrix $X_0$ whose rows are $\bx_1 - \mu, \ldots, \bx_N - \mu$.
Let $J$ be the $N\times N$ all-ones matrix.
Then one way to describe this is $X_0 = (I - \frac{1}{N}J)X$.
The **principal component analysis** (or **PCA**) is
a method of finding the most important directions of the data.
These directions (vectors) are called the **principal components** of the data.
##### Algorithm (principal component analysis)
Input: data represented by an $N\times d$ matrix $X$, and the desired number $k$ of principal components
Output: the principal components represented by the column vectors of a matrix $P$
1. Let $J$ be the $N\times N$ all-ones matrix.
Define $X_0 = (I - \frac{1}{N}J)X$.
2. Let $U\Sigma V\trans$ be the singular value decomposition of $X_0$.
3. Let $P$ be the $d\times k$ matrix formed by the first $k$ columns of $V$.
Once the principal components are found,
one may project all data points onto the subspace spanned by the principal components.
This can be done by the $N\times k$ matrix $Y = X_0P$,
whose rows are the new data points containing the essential information of the original data.
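As an illustration, here is a minimal NumPy sketch of the algorithm; the data matrix and the choice $k = 1$ below are made up for demonstration.
```python
import numpy as np

np.random.seed(0)
N, d, k = 20, 3, 1
X = np.random.randn(N, d)          # made-up data: N samples with d features

# Step 1: center the data
J = np.ones((N, N))
X0 = (np.eye(N) - J / N).dot(X)    # equivalent to X - X.mean(axis=0)

# Step 2: singular value decomposition of the centered data
u, s, vh = np.linalg.svd(X0)

# Step 3: the first k columns of V are the principal components
P = vh.T[:, :k]

# project the centered data onto the principal components
Y = X0.dot(P)                      # an N x k matrix
print("P =\n", P)
print("shape of Y =", Y.shape)
```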
## Side stories
- hidden words
## Experiments
##### Exercise 1
執行以下程式碼。
<!-- eng start -->
Run the code below.
<!-- eng end -->
```python
### code
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(0)
print_ans = False
mu = np.random.randint(-3,4, (2,))
cov = np.array([[1, 1.9],
                [1.9, 4]])
X = np.random.multivariate_normal(mu, cov, (20,))
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:1]
plt.axis('equal')
plt.scatter(*X.T)
plt.scatter(*mu, c="red")
plt.arrow(*mu, *(P[:,0]), head_width=0.3, color="red")
xs = X0.dot(P)
ys = np.zeros_like(xs)
plt.scatter(xs, ys)
pretty_print(LatexExpr(r"V^\top ="))
print(vh)
if print_ans:
    print("red point: center")
    print("red arrow: first principal component")
    print("orange points: projection of blue points onto the red arrow")
    print("red arrow = first row of VT = first column of V")
```
##### Exercise 1(a)
若藍點為給定的資料。
改變不同的 `seed`,說明紅點、紅箭頭、橘點的意思。
<!-- eng start -->
Let the blue points be the given data. Try different values of `seed` and explain the meaning of the red point, the red arrow, and the orange points.
<!-- eng end -->
---
##### Exercise 1(a) - answer here
With `seed = 0`, the code above produces a picture in which:

- The red dot is the center of the blue dots.
- The red arrow indicates the direction along which the blue dots spread the most.
- The orange dots are the projections of the blue points onto the red arrow.
---
##### Exercise 1(b)
令 $X_0$ 的列向量為將藍點資料置中過後的資料點。
若 $X_0 = U\Sigma V\trans$ 為其奇異值分解。
說明紅箭頭與 $V\trans$ 的關係。
<!-- eng start -->
Let $X_0$ be the matrix whose rows are the blue points, centered at their mean. If $X_0 = U\Sigma V\trans$ is the singular value decomposition, describe the relation between the red arrow and $V\trans$.
<!-- eng end -->
---
##### Exercise 1(b) - answer here
The red arrow is the first row of $V\trans$, that is, the first column of $V$.
---
:::info
What do the experiments try to tell you? (open answer)
...
:::
## Exercises
##### Exercise 2
令 $X$ 為一 $N\times d$ 矩陣,而其列向量為一筆資料。
令 $J$ 為 $N\times N$ 的全一矩陣。
<!-- eng start -->
Let $X$ be an $N\times d$ matrix whose rows are the data points. Let $J$ be the $N\times N$ all-ones matrix.
<!-- eng end -->
##### Exercise 2(a)
說明 $\frac{1}{N}JX$ 每列的意義。
<!-- eng start -->
Describe the meaning of the rows of $\frac{1}{N}JX$.
<!-- eng end -->
---
##### Exercise 2(a) - answer here
It means that we project the center onto the approriate plane while $X$ is a data.
$\frac{1}{N}JX$ represents the result of performing matrix operations on the matrices $J$and $X$, and then scaling the result by $\frac{1}{N}$.
:::warning
Need revision: Each row of $\frac{1}{N}JX$ is the center of all data points in $X$.
:::
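A quick numerical check of this fact, with made-up data:
```python
import numpy as np

np.random.seed(0)
N, d = 5, 2
X = np.random.randint(-3, 4, (N, d))
J = np.ones((N, N))

# every row of (1/N) J X equals the center of the data points
print(J.dot(X) / N)
print("mu =", X.mean(axis=0))
```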
---
##### Exercise 2(b)
說明 $(I - \frac{1}{N}J)X$ 每列的意義。
<!-- eng start -->
Describe the meaning of the rows of $(I - \frac{1}{N}J)X$.
<!-- eng end -->
---
##### Exercise 2(b) - answer here
Let
$$
X = \begin{bmatrix}
 - & \br_1 & - \\
 ~ & \vdots & ~ \\
 - & \br_N & -
\end{bmatrix}.
$$
The $i$th row of $(I - \frac{1}{N}J)X$ is, taking $i=1$ as an example,
$$
\frac{N-1}{N}\br_1 - \frac{1}{N}(\br_2 + \br_3 + \cdots + \br_N) =
\br_1 - \frac{1}{N}(\br_1 + \br_2 + \cdots + \br_N).
$$
Thus, the $i$th row of $(I - \frac{1}{N}J)X$ is equal to the $i$th row of $X$ minus the average of all the rows of $X$.
In other words, multiplying by $I - \frac{1}{N}J$ centers the data: it translates the data so that its center is at the origin.
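Again, a quick numerical check with made-up data:
```python
import numpy as np

np.random.seed(0)
N, d = 5, 2
X = np.random.randint(-3, 4, (N, d))
J = np.ones((N, N))

# centering by I - (1/N) J agrees with subtracting the average row
X0 = (np.eye(N) - J / N).dot(X)
print(np.allclose(X0, X - X.mean(axis=0)))  # True
```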
---
##### Exercise 3
執行以下程式碼。
<!-- eng start -->
Run the code below.
<!-- eng end -->
```python
### code
import numpy as np
np.random.seed(0)
X = np.random.randint(-3, 4, (5,2))
mu = X.mean(axis=0)
X0 = X - mu
print("X =\n", X)
print("mu =\n", mu)
print("X0 =\n", X0)
```
##### Exercise 3(a)
解釋 `mu = X.mean(axis=0)` 的意義。
<!-- eng start -->
Describe the meaning of `mu = X.mean(axis=0)`.
<!-- eng end -->
##### Exercise 3(b)
解釋 `X0 = X - mu` 的意義。
<!-- eng start -->
Describe the meaning of `X0 = X - mu`.
<!-- eng end -->
##### Exercise 4
執行以下程式碼。
<!-- eng start -->
Run the code below.
<!-- eng end -->
```python
### code
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
np.random.seed(0)
mu = np.random.randint(-3,4, (3,))
cov = np.array([[3, -0.9, -1.9],
                [-0.9, 11, -9.9],
                [-1.9, -9.9, 12]])
X = np.random.multivariate_normal(mu, cov, (100,))
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:2]
ax = plt.axes(projection='3d')
ax.set_xlim(-5,5)
ax.set_ylim(-5,5)
ax.set_zlim(-5,5)
ax.scatter(*X.T)
ax.scatter(*mu, c="red")
ax.quiver(*(mu[:,np.newaxis] + np.zeros((2,))), *P, color="red")
```
```python
%matplotlib inline
Y = X0.dot(P)
plt.axis('equal')
plt.scatter(*Y.T)
```
##### Exercise 4(a)
說明兩段程式碼產出的圖的關係。
<!-- eng start -->
Describe the relation between the two graphs in the output.
<!-- eng end -->
##### Exercise 4(b)
使用 `X.shape` 來觀察一個陣列的形狀、
並逐句解釋下方程式碼所做的事情。
```python
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:2]
```
<!-- eng start -->
You may use `X.shape` to check the shape of an array. Then explain the meaning of each line in the code.
```python
mu = X.mean(axis=0)
X0 = X - mu
u,s,vh = np.linalg.svd(X0)
P = vh.T[:,:2]
```
<!-- eng end -->
##### Exercise 5
**挑戰**
以下程式碼讀進一筆資料。
這筆資料實際上是貼在高維度中的一個二維平面,
且它在這個平面上排出一個一個英文字。
請找出這個英文字是什麼。
<!-- eng start -->
**Challenge**
The following code loads some data from a file. The data lies on a $2$-dimensional plane in a higher-dimensional space, and on this plane the data points spell out an English word.
Try to find out this word.
<!-- eng end -->
```python
import numpy as np

X = np.genfromtxt('hidden_text.csv', delimiter=',')
print("shape of X =", X.shape)
```
---
##### Exercise 5 - answer here
Center the data, compute its singular value decomposition, and project the centered data onto the first two principal components; since the data lies on a $2$-dimensional plane, this projection recovers the picture drawn on that plane.
The picture shows the word "Taiwan", upside down.
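A minimal sketch of how the picture can be obtained, following the algorithm in the Main idea (assuming `hidden_text.csv` is in the working directory):
```python
import numpy as np
import matplotlib.pyplot as plt

X = np.genfromtxt('hidden_text.csv', delimiter=',')

# center the data
mu = X.mean(axis=0)
X0 = X - mu

# project the centered data onto the first two principal components
u, s, vh = np.linalg.svd(X0)
P = vh.T[:, :2]
Y = X0.dot(P)

# the projected points draw the hidden word (possibly rotated or flipped)
plt.axis('equal')
plt.scatter(*Y.T)
```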
---
##### Exercise 6
令 $X_0$ 及 $P$ 為主成份分析演算法中的矩陣。
說明 $X_0P$ 與 $X_0PP\trans$ 的列向量所代表的意義。
<!-- eng start -->
Let $X_0$ and $P$ be the matrices in the principal component analysis. Explain the meanings of $X_0P$ and $X_0PP\trans$.
<!-- eng end -->
:::info
collaboration: 2
3 problems: 1
moderator: 1
qc: 1
:::