用 R 做主成分分析 PCA：套件篇

手上的數據如果有許多變數，又想快速了解數據之間、變數之間的關係，那麼使用主成分分析可以達到這個效果。

主成分分析（principle component analysis, PCA），使用線性變換中的正交變換（orthogonal transformation）方法，對變數測值做線性變換，投影出線性不相關的變數，稱為主成分（principle components）。我們希望能透過主成分分析，將多個變數用比較少的主成分來解釋（降維），最大化變異數的同時又盡可能最小化流失的訊息。

首先建立數據的變異數—共變異數矩陣（variance-covariance matrix），之後進行特徵分解（eigendecomposition）或奇異值分解（singular value decomposition, SVD），最後得到特徵向量（主成分向量的方向）與特徵值（主成分向量的大小），之後就可以選擇數個（通常是 2 至 3 個）主成分將數據視覺化。
現在我們可以使用 R 的內建或者特定套件功能來進行主成分分析以及後續的視覺化。

用內建函數做 PCA

R 有兩個內件函數可以用來做主成分分析，分別是 princomp() 與 prcomp()，兩者有些不同，而目前官方是建議以 prcomp() 函數來做主成分分析。

princomp() 函數：使用特徵值分解，計算共變異數的分母為
$N$ 。
以下為官方說明：
The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp.
Note that the default calculation uses divisor
$N$ for the covariance matrix.
prcomp() 函數：使用奇異值分解，計算共變異數的分母為
$N - 1$ 。
以下為官方說明：
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy. The print method for these objects prints the results in a nice format and the plot method produces a scree plot.
Unlike princomp, variances are computed with the usual divisor
$N - 1$ .

兩個函數的返回值也不太一樣，ATHDA 網站整理出來，這邊簡單譯成中文：

`prcomp()`	`princomp()`	中文	原文含解釋
sdev	sdev	主成分的標準差	the standard deviations of the principal components
rotation	loadings	主成分負荷量	the matrix of variable loadings (columns are eigenvectors)
center	center	變數的平均	the variable means (means that were substracted)
scale	scale	變數的標準差	the variable standard deviations (the scaling applied to each variable )
x	scores	主成分座標	The coordinates of the individuals (observations) on the principal components.

用其他套件做 PCA

除了 R 內建的函數外，也可以使用其他套件的函數來做主成分分析。常用的有多維度分析套件 amap 的 pca()、心理學領域常用套件 psych中的 principal()、生醫領域常結合 DESeq2 套件進行表現量分析流程後續分析與做圖 PCAtools 套件中的 pca()、多變量統計常用套件 FactoMiner 的 PCA()，以及生態學領域常用套件 ade4 與 vegan 中的 dudi.pca() 和 rda() 函數等等。可以看到在不同研究領域都會用到主成分分析，因此我們可以在許多不同的套件看到相同功能的函數。

因為手邊比較多環境與生態學的資料，因此常用的是 ade4 與 vegan 的 PCA 函數。值得一提的是 vegan 中 rda() 函數是用來做冗餘分析（redundancy analysis）的，那為什麼可以用來做 PCA 呢？
在這篇教學中，作者解釋把 rda() 的限制性排序（constrained ordination，亦即使用共變數或其他預測條件限制排序軸）參數關掉，就等同於 PCA。
（關於這些排序分析，又可以整理成另外一篇了。）

分析做完之後要交代或呈現的

資料剔除（culling）或轉換（transformation）方法。
PCA 對於資料正規化（normalization）或其他資料縮放法敏感，所以需要註明。（常用的正規化方法是數據減去平均值，再除以標準差，即 z 分數的算法：
$z = \frac{(x - μ ）}{σ}$ ）
PCA 是依據變異數—共變異數矩陣或者相關矩陣進行分析的。
即 prcomp() 函數中 scale=FALSE 或者 scale=TRUE。
呈現每個主成分可解釋的變異數特徵值圖（scree plot，或稱陡坡圖、碎石圖），同時要呈現用以選擇主成分數量的標準（如 eigenvalue 大於 1）。
選用的每個主成分負荷量表（table of loadings），需要標示出每個主成分中最重要的負荷量。

本篇簡單列舉幾個在 R 中進行 PCA 與後續視覺化的套件與函數，之後會用一些現實世界的資料做 PCA 與視覺化。
🐕‍🦺2022.11.25

用內建函數做 PCA

用其他套件做 PCA

分析做完之後要交代或呈現的

Read more

用 base R 簡單分析資料與做圖：散點圖與簡單線性迴歸分析

用R做網頁爬蟲+風花圖

如何取代 MS office 中的萬用字元

【新】用bash寫個由學名找TaiCOL物種資訊小功能