Simple data analysis and plotting with base R: scatter plots and regression analysis

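R has a great many packages that can produce attractive or complex visualizations, but when you only want to show the relationship between variables simply, the built-in graphics system is enough, with no need to load any additional package.
Here we use base R to draw scatter plots and run a regression analysis.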

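First, prepare the data to analyze.
R ships with many built-in data sets; data() lists them along with a short description of each.
To list the data sets in all installed packages, use data(package = .packages(all.available = TRUE)).
Here we use the "trees" data set.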


data(trees)
head(trees)

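The "trees" data set has three columns: "Girth", "Height", and "Volume". According to its documentation, it records the diameter (in inches), height (in feet), and timber volume (in cubic feet) of 31 black cherry trees.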
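Beyond head(), a few other base R functions give a quick overview of a data set. The following is a minimal sketch of that kind of first look, not part of the original walkthrough:

```r
data(trees)

dim(trees)      # number of rows and columns
str(trees)      # column names, types, and the first few values
summary(trees)  # minimum, quartiles, mean, and maximum of each column

# The data set's own documentation is available with: ?trees
```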
Next, a scatter plot gives a quick picture of how these three columns relate to one another:


plot(trees)

The scatter plot function is plot(). Calling it on the whole data frame immediately produces a $3 \times 3$ matrix of plots:

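The plot also shows that diameter and volume have a fairly strong linear relationship, whereas diameter and height, or height and volume, are much less linear.
Consider the formula for the volume of a cylinder:

$$\text{Volume} = (\text{radius}^{2} \times \pi) \times \text{height}$$

In theory, volume should be proportional to the square of the diameter and to the height.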
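The visual impression from the scatter plot matrix can also be checked numerically. As a small sketch, cor() returns the pairwise Pearson correlations of the three columns (the name cor_mat is only illustrative):

```r
data(trees)

# Pairwise Pearson correlation coefficients between all columns
cor_mat <- cor(trees)
round(cor_mat, 3)
```

A correlation close to 1 for a pair of columns corresponds to a panel where the points fall near a straight line.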

Next, we use R's built-in lm() function to fit a simple linear regression of volume on height and check this:


lm(Volume ~ Height, data = trees)          # prints only the call and the estimated coefficients
summary(lm(Volume ~ Height, data = trees)) # prints the full regression summary

First compare the output of the two lines above:

> lm(Volume ~ Height, data = trees)

Call:
lm(formula = Volume ~ Height, data = trees)

Coefficients:
(Intercept)       Height
    -87.124        1.543

> summary(lm(Volume ~ Height, data = trees))

Call:
lm(formula = Volume ~ Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max
-21.274  -9.894  -2.894  12.068  29.852

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236    29.2731  -2.976 0.005835 **
Height        1.5433     0.3839   4.021 0.000378 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.4 on 29 degrees of freedom
Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784

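Only summary() gives the statistical report we want.
From the Coefficients table, the simple linear regression of volume on height gives the fitted line

$$\text{Volume} = -87.1236 + 1.5433 \times \text{Height}$$

with an adjusted R² of 0.3358, so height alone explains only about a third of the variation in volume. The model's p-value is 0.0003784, which strongly rejects the null hypothesis that volume is unrelated to height (that is, a zero slope, or equivalently R² = 0).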
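The numbers in the report do not have to be read off the printed output. As a sketch, they can be extracted from the fitted model and its summary object; the names fit_height, fit_summary, and p_value are illustrative and do not appear in the original text:

```r
fit_height  <- lm(Volume ~ Height, data = trees)
fit_summary <- summary(fit_height)

coef(fit_height)            # intercept and slope as a named vector
fit_summary$coefficients    # estimates, standard errors, t values, p-values
fit_summary$adj.r.squared   # adjusted R-squared

# Overall p-value of the model, computed from the F statistic
f_stat  <- fit_summary$fstatistic   # named vector: value, numdf, dendf
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)
p_value
```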

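How can we check whether volume is linearly related to the square of the diameter?
Taking logarithms helps: the natural log of volume is linear in the natural log of diameter, with an expected slope of 2:

$$
\begin{aligned}
\text{Volume} &= (\text{radius}^{2} \times \pi) \times \text{height} \propto \text{diameter}^{2} \\
\Rightarrow \ln \text{Volume} &= 2 \ln \text{diameter} + \ln \frac{\pi}{4} + \ln \text{height}
\end{aligned}
$$

In R, log() computes the natural logarithm, so it is easy to fit a simple linear regression of log volume on log diameter: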


summary(fm1 <- lm(log(Volume) ~ log(Girth), data = trees))

The fit is as follows:

Call:
lm(formula = log(Volume) ~ log(Girth), data = trees)

Residuals:
      Min        1Q    Median        3Q       Max
-0.205999 -0.068702  0.001011  0.072585  0.247963

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.35332    0.23066  -10.20 4.18e-11 ***
log(Girth)   2.19997    0.08983   24.49  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.115 on 29 degrees of freedom
Multiple R-squared: 0.9539, Adjusted R-squared: 0.9523
F-statistic: 599.7 on 1 and 29 DF, p-value: < 2.2e-16

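The simple linear regression is again highly significant, and the coefficient of log(Girth) is about 2.2, meaning that in this data set volume is roughly proportional to diameter^2.2, broadly in line with the expected relationship between volume and diameter.
Here the fitted model is stored in a variable named "fm1", which can then be used for plotting.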
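One way to judge how close the estimated exponent is to the theoretical value of 2 is to look at a confidence interval for the slope. The sketch below uses confint(); this step is an addition for illustration, not part of the original analysis:

```r
fm1 <- lm(log(Volume) ~ log(Girth), data = trees)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fm1, level = 0.95)
ci["log(Girth)", ]
```

If the interval lies slightly above 2, one plausible reading is that taller trees also tend to be thicker, so part of the height effect is absorbed into the diameter exponent when height is left out of the model.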

A few built-in R functions are enough to draw the scatter plot together with the fitted trend line:


plot(log(trees$Volume)~log(trees$Girth))
abline(fm1)

Here "fm1" supplies the slope and intercept to abline(), which draws the trend line on top of the existing scatter plot.
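To make explicit what abline() takes from the model, the intercept and slope can be passed by hand through its a and b arguments. This is a minimal sketch that should draw the same line; the variable name b_hat is illustrative:

```r
fm1   <- lm(log(Volume) ~ log(Girth), data = trees)
b_hat <- coef(fm1)   # b_hat[1] is the intercept, b_hat[2] is the slope

plot(log(Volume) ~ log(Girth), data = trees)
abline(a = b_hat[1], b = b_hat[2], col = "red")   # same line as abline(fm1)
```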


Besides abline(), the lines() function can also draw lines. Given a series of x and y coordinates, it connects them in order, which is useful when the line to draw is not straight.
Additional arguments change the size, shape, and color of the points and lines.


plot(y = log(trees$Volume), x = log(trees$Girth),
     cex = 1.5, pch = 21, col = "darkgreen", bg = "pink",
     main = "ln(Volume) vs. ln(Girth)")
lines(log(trees$Girth), predict(fm1),
      lty = 2, lwd = 2, col = "cyan3")

In this example plot() is called as plot(y = ..., x = ...), which states explicitly what goes on each axis; this is the more commonly used form.
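The case where lines() is really needed is a curve. As a sketch, the log-log fit can be transformed back to the original scale and drawn over a plot of Volume against Girth. The names girth_grid and volume_hat are illustrative, and exponentiating the predicted logs is a simple back-transformation that ignores any bias correction:

```r
fm1 <- lm(log(Volume) ~ log(Girth), data = trees)

# Evenly spaced, sorted x values so that lines() draws a smooth curve
girth_grid <- data.frame(
  Girth = seq(min(trees$Girth), max(trees$Girth), length.out = 100)
)

# Predict on the log scale, then transform back to cubic feet
volume_hat <- exp(predict(fm1, newdata = girth_grid))

plot(Volume ~ Girth, data = trees, pch = 21, bg = "pink")
lines(girth_grid$Girth, volume_hat, lty = 2, lwd = 2, col = "cyan3")
```

lines() joins the points in the order given, so the x values should be sorted before plotting a curve.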


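If you are not sure which arguments are available, run ?plot or ?par; the help pages list the parameters that can be adjusted.
The official help pages also document the available point and line styles. For example, the following code, taken from the examples in ?points, draws the available point shapes (the pch argument):


pchShow <-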
  function(extras = c("*",".", "o","O","0","+","-","|","%","#"),
           cex = 3, ## good for both .Device=="postscript" and "x11"
           col = "red3", bg = "gold", coltext = "brown", cextext = 1.2,
           main = paste("plot symbols :  points (...  pch = *, cex =",
                        cex,")"))
  {
    nex <- length(extras)
    np  <- 26 + nex
    ipch <- 0:(np-1)
    k <- floor(sqrt(np))
    dd <- c(-1,1)/2
    rx <- dd + range(ix <- ipch %/% k)
    ry <- dd + range(iy <- 3 + (k-1)- ipch %% k)
    pch <- as.list(ipch) # list with integers & strings
    if(nex > 0) pch[26+ 1:nex] <- as.list(extras)
    plot(rx, ry, type = "n", axes  =  FALSE, xlab = "", ylab = "", main = main)
    abline(v = ix, h = iy, col = "lightgray", lty = "dotted")
    for(i in 1:np) {
      pc <- pch[[i]]
      ## 'col' symbols with a 'bg'-colored interior (where available) :
      points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex)
      if(cextext > 0)
        text(ix[i] - 0.3, iy[i], pc, col = coltext, cex = cextext)
    }
  }

pchShow()


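This has been a brief introduction to drawing scatter plots and running a first simple linear regression in base R. These tools already work for many data sets when you want a quick look at, and a first visualization of, the relationship between two variables.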
🐕‍🦺2024.01.30
