
Simple data analysis and plotting with base R: scatter plots and regression

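R has a large number of packages that produce polished or elaborate visualizations. When you only need a quick look at how variables relate to each other, though, the built-in graphics system is enough, and no extra package has to be loaded.
Here we use base R to draw scatter plots and run a regression analysis.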

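First, prepare the data to analyze.
R ships with many built-in datasets; data() lists them along with a short description of each.
To list the datasets of every installed package, use data(package = .packages(all.available = TRUE)).
Here we use the "trees" dataset.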

data(trees)
head(trees)

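The "trees" dataset has three columns: "Girth", "Height", and "Volume". According to the dataset description, it records the diameter (in inches), height (in feet), and timber volume (in cubic feet) of 31 black cherry trees.
Next, a scatter plot gives a quick view of how these three columns relate to each other: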
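As a quick check of the description above, the following sketch looks up "trees" in the listing returned by data() and then inspects the data frame. The variable names `listing` and `tree_row` are illustrative and not part of the original note.

```r
# List the datasets shipped with the 'datasets' package.
# data() returns an object whose $results matrix has the columns
# Package, LibPath, Item and Title.
listing <- data(package = "datasets")$results
tree_row <- listing[listing[, "Item"] == "trees", c("Item", "Title"), drop = FALSE]
print(tree_row)

# Inspect the data frame itself.
data(trees)
str(trees)      # column names and types
dim(trees)      # number of rows and columns
summary(trees)  # quartiles and mean for each column
```

str() and summary() are often a more compact first look than head() when a dataset has many rows.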

plot(trees)

plot() is the scatter plot function. Calling it on the whole data frame produces the figure right away: a 3×3 scatter plot matrix.

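The plot matrix also shows that the points for girth versus volume follow a fairly strong linear trend, whereas girth versus height and height versus volume are much less linear.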
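To put rough numbers on what the scatter plot matrix suggests, one option is the pairwise Pearson correlation. This is only a sketch to accompany the visual impression; `cor_matrix` is an illustrative name.

```r
data(trees)

# Pearson correlation for every pair of columns
cor_matrix <- cor(trees)
print(round(cor_matrix, 3))

# The Girth-Volume entry can be compared with the other two pairs
cor_matrix["Girth", "Volume"]
cor_matrix["Girth", "Height"]
cor_matrix["Height", "Volume"]
```

A correlation close to 1 corresponds to points lying near a straight line, so the Girth and Volume pair should stand out here.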
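Consider the formula for the volume of a cylinder, $V = (r^2 \times \pi) \times h$, where $r$ is the trunk radius (half the diameter) and $h$ is the height. In theory, volume should be proportional to the square of the girth and to the height.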

Next, we use R's built-in lm() function to fit a simple linear regression of volume on height and check this:

lm(Volume ~ Height, data = trees)           # returns only the fitted coefficients
summary(lm(Volume ~ Height, data = trees))  # returns the full regression results

Compare the output of the two lines above:

> lm(Volume ~ Height, data = trees)

Call:
lm(formula = Volume ~ Height, data = trees)

Coefficients:
(Intercept)       Height  
    -87.124        1.543  

> summary(lm(Volume ~ Height, data = trees))

Call:
lm(formula = Volume ~ Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.274  -9.894  -2.894  12.068  29.852 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.1236    29.2731  -2.976 0.005835 ** 
Height        1.5433     0.3839   4.021 0.000378 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.4 on 29 degrees of freedom
Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784

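Only after calling summary() do we get the full statistical report we want.
From the Coefficients table, the simple linear regression of volume on height gives

$\text{Volume} = -87.1236 + 1.5433 \times \text{Height}$

The adjusted $R^2$ is 0.3358 and the model's p-value is 0.0003784, so the null hypothesis that volume is unrelated to height (that is, $R^2 = 0$) is strongly rejected.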
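The numbers in the report can also be pulled out of the fitted object instead of being read off the printout. The sketch below uses standard accessors; `fit_height`, `fit_summary`, and `p_value` are illustrative names.

```r
data(trees)
fit_height <- lm(Volume ~ Height, data = trees)
fit_summary <- summary(fit_height)

coef(fit_height)            # intercept and slope as a named vector
fit_summary$coefficients    # estimates, standard errors, t and p values
fit_summary$adj.r.squared   # adjusted R-squared

# Overall F test p-value, computed from the stored F statistic
f <- fit_summary$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(p_value)
```

Working with the stored values avoids copying errors, such as dropping the minus sign of the intercept.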

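How can we check whether volume is linearly related to the square of the girth?
Logarithms help here: taking the natural log of both sides turns the power relationship into a linear one, so ln(volume) is linear in ln(girth):

$V = (r^2 \times \pi) \times h \quad\Rightarrow\quad \ln V = 2 \times \ln r + \ln \pi + \ln h$

Here $r$ is the trunk radius, which is proportional to Girth, and $h$ is the height.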
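Because Girth in "trees" records a diameter rather than a radius, it may help to spell out why the slope on the log scale is still expected to be 2. The derivation below assumes an ideal cylinder and uses my own labels: $r$ for radius, $d$ for diameter, $h$ for height, and $V$ for volume.

```latex
\begin{aligned}
V     &= \pi r^2 h, \qquad r = \tfrac{d}{2} \\
\ln V &= 2 \ln r + \ln \pi + \ln h \\
      &= 2 \ln d + \ln \tfrac{\pi}{4} + \ln h
\end{aligned}
```

Switching from radius to diameter only changes the constant term, so the coefficient of the log diameter stays at 2.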

In R, log() computes the natural logarithm, so fitting a simple linear regression of log volume on log girth is straightforward:

summary(fm1 <- lm(log(Volume) ~ log(Girth), data = trees))

The fit is as follows:

Call:
lm(formula = log(Volume) ~ log(Girth), data = trees)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.205999 -0.068702  0.001011  0.072585  0.247963 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.35332    0.23066  -10.20 4.18e-11 ***
log(Girth)   2.19997    0.08983   24.49  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.115 on 29 degrees of freedom
Multiple R-squared: 0.9539, Adjusted R-squared: 0.9523
F-statistic: 599.7 on 1 and 29 DF, p-value: < 2.2e-16

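This regression is also highly significant, and the coefficient of log(Girth) is about 2.2, meaning that in this dataset volume is roughly proportional to girth raised to the power 2.2 ($\text{Volume} \propto \text{Girth}^{2.2}$), which broadly matches the expected relationship between volume and girth.
Here the fit is stored in a variable named "fm1", which can be reused for plotting.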
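One way to judge how close the estimated exponent is to the theoretical value of 2 is to look at a confidence interval for the slope. This is a sketch using confint(); `ci`, `slope`, and `slope_ci` are illustrative names.

```r
data(trees)
fm1 <- lm(log(Volume) ~ log(Girth), data = trees)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fm1, level = 0.95)
print(ci)

# The slope estimate and its interval, for comparison with 2
slope <- coef(fm1)[["log(Girth)"]]
slope_ci <- ci["log(Girth)", ]
print(slope)
print(slope_ci)
```

The interval is the estimate plus or minus a t quantile times the standard error shown in the summary table, so it can be checked by hand against the printout.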

With a few built-in R functions we can draw the scatter plot together with the fitted trend line:

plot(log(trees$Volume) ~ log(trees$Girth))
abline(fm1)

Here "fm1" supplies the slope and intercept to abline(), which then draws the trend line on top of the existing scatter plot.
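abline() can be called in a few other ways as well. The sketch below shows the same trend line drawn from an explicit intercept and slope, plus horizontal and vertical reference lines; the colors and line types are arbitrary choices.

```r
data(trees)
fm1 <- lm(log(Volume) ~ log(Girth), data = trees)

# Formula interface with a data argument, which gives shorter axis labels
plot(log(Volume) ~ log(Girth), data = trees)

# abline() accepts a fitted model, or an explicit intercept (a) and slope (b)
abline(fm1)
abline(a = coef(fm1)[1], b = coef(fm1)[2], col = "red", lty = 3)

# Horizontal and vertical reference lines at the sample means
abline(h = mean(log(trees$Volume)), v = mean(log(trees$Girth)), col = "gray")
```

The two gray lines cross on the fitted line, because a least-squares line with an intercept always passes through the point of means.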


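Besides abline(), lines() can also draw lines. Given a series of x, y coordinates, it connects them in order, which is what you need for anything other than a straight line.
The size, shape, and color of points and lines can be changed through additional arguments: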

plot(y = log(trees$Volume), x = log(trees$Girth),
     cex = 1.5, pch = 21, col = "darkgreen", bg = "pink",
     main = "ln(Volume) versus ln(Girth)")
lines(log(trees$Girth), predict(fm1), lty = 2, lwd = 2, col = "cyan3")

In this example the call takes the form plot(y = ..., x = ...), which states explicitly what goes on each axis and is the more commonly used form.
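Since lines() is mainly useful for curves, here is a sketch that draws the fitted log-log model back on the original scale, where it becomes a power curve. It uses a simple exp() back-transform and ignores retransformation bias; `girth_grid`, `log_pred`, and `volume_pred` are illustrative names.

```r
data(trees)
fm1 <- lm(log(Volume) ~ log(Girth), data = trees)

# A fine grid of girth values spanning the observed range
girth_grid <- seq(min(trees$Girth), max(trees$Girth), length.out = 100)

# Predict on the log scale, then back-transform with exp()
log_pred <- predict(fm1, newdata = data.frame(Girth = girth_grid))
volume_pred <- exp(log_pred)

# Scatter plot on the original scale with the fitted power curve
plot(Volume ~ Girth, data = trees, pch = 19, col = "darkgreen")
lines(girth_grid, volume_pred, lwd = 2, col = "cyan3")
```

The grid matters here: lines() joins the points it is given with straight segments, so a dense and sorted set of x values is what makes the curve look smooth.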


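If you are not sure which arguments are available, run ?plot; the help page lists the parameters that can be adjusted.
The official help pages also document point and line styles. For example, the following code draws the available point shapes (the pch parameter):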

pchShow <-
  function(extras = c("*",".", "o","O","0","+","-","|","%","#"),
           cex = 3, ## good for both .Device=="postscript" and "x11"
           col = "red3", bg = "gold", coltext = "brown", cextext = 1.2,
           main = paste("plot symbols : points (... pch = *, cex =",
                        cex,")"))
  {
    nex <- length(extras)
    np <- 26 + nex
    ipch <- 0:(np-1)
    k <- floor(sqrt(np))
    dd <- c(-1,1)/2
    rx <- dd + range(ix <- ipch %/% k)
    ry <- dd + range(iy <- 3 + (k-1)- ipch %% k)
    pch <- as.list(ipch) # list with integers & strings
    if(nex > 0) pch[26+ 1:nex] <- as.list(extras)
    plot(rx, ry, type = "n", axes = FALSE, xlab = "", ylab = "", main = main)
    abline(v = ix, h = iy, col = "lightgray", lty = "dotted")
    for(i in 1:np) {
      pc <- pch[[i]]
      ## 'col' symbols with a 'bg'-colored interior (where available) :
      points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex)
      if(cextext > 0)
        text(ix[i] - 0.3, iy[i], pc, col = coltext, cex = cextext)
    }
  }

pchShow()
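For a shorter reference chart, the following sketch puts the numeric symbols 0 to 25 in one row and the line types 1 to 6 underneath. It is a compact alternative, not a replacement for the help-page example; the layout values and the names `line_types` and `y_pos` are arbitrary.

```r
# Symbols 0 to 25 in one row; bg only affects pch 21 to 25
plot(0:25, rep(1, 26), pch = 0:25, cex = 2, col = "red3", bg = "gold",
     ylim = c(0, 1.5), axes = FALSE, xlab = "pch", ylab = "")
axis(1, at = 0:25, cex.axis = 0.7)

# Line types 1 to 6 as horizontal lines below the symbols
line_types <- 1:6
y_pos <- seq(0.1, 0.7, length.out = length(line_types))
abline(h = y_pos, lty = line_types)
text(0, y_pos + 0.04, paste("lty =", line_types), adj = 0, cex = 0.7)
```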


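That concludes this short introduction to drawing scatter plots and running a first simple linear regression in base R. These tools already apply to many datasets when you want to show the relationship between two variables and get an initial visualization.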
🐕‍🦺2024.01.30

Further reading

  1. Simple base R plots. In: Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau (2023). An Introduction to R