R筆記 - HackMD

# R筆記 ###### tags: `R` `programing` >紀錄一些R語言 ## :memo: Compiler-RStudio ![](https://i.imgur.com/t5rbqMj.jpg) #### :heavy_check_mark: User Interface - 左上：程式碼 - 左下：顯示變數、Import dataset (選from text(bese)，路徑要英文) - 右上：Console (顯示結果) - 右下：可顯示plot、package #### :heavy_check_mark: Execute 將要執行的部分反白，使用快捷鍵Ctrl+Enter，或是點選上方Run #### :heavy_check_mark: Setting - Tools -> Global Options -> Appearance：設定背景、字體 - Tools -> Install Packages ->Packages：可下載library，如 ggplot2 --- ## :memo: Fundamental #### :heavy_check_mark: 運算 ```R= > 2+7.5 #相加 > 2*2 #相乘 > 10/5 #相除 > 4^2 #次方 > sqrt(100) #開根號 ``` #### :heavy_check_mark: 資料型態 - 常用型態 - integer (整數) - number (實數) - logic (布林) - character (字串) - factor (類別變數) - vector (陣列) - list (列表) - matrix (矩陣) - data frame (資料框) :::info :bulb: **Hint：** R預設的資料型態是number ::: - 賦予變數 ```R= a <- 3 #賦予變數a為3 型態是number ``` - 資料型態轉換: ```R= a <- 3 as.integer(a) #轉成int ``` - 查看資料型態 ```R= a <- 3 str(a) ## num 3 ``` - 確認資料型態 ```R= a <- FALSE is.integer(a) #確認是否為整數回傳:Fasle b <- "Dr.Lee" # b的資料型態是character professor <- b # b把b存到professor裡 str(professor) # 確認professor資料型態回傳:chr "Dr.Lee" ``` #### :heavy_check_mark: vector - 宣告定義vector，使用c()函式。**R比較特別，index是從1開始** ```R= a <- c(5,10,15,20,25) a[1] ## 5 ``` - 取值 ```R= a[3] # Ans: 15 (取第3個element) a[1:3] # Ans: 5 10 15 (取第1~第3個element) a[c(2,4)] # Ans: 10 20 (取第2和第4個element) ``` - 查看數據集資訊 ?[dataset] ```R= ?cars #cars為內建資料集 ``` - 特殊規則若是把number和character同時放入vector裡，R會自動將所有元素的型態，**轉變成character** ```R== a <- c(1, "john", 3) a ``` ```R= ## [1] "1" "john" "3" ``` logic和number在vector裡的話，T和F會被自動轉換成1和0，變成數字的vector ```R= b <- c(T, 3, F) b ``` ```R= ## [1] 1 3 0 ``` #### :heavy_check_mark: factor 一種變數型態，主要用來表示有**存在哪些類別** ```R== gender <- c("boy", "girl", "boy", "boy", "girl") gender <- factor(gender) # 轉換成factor型態 gender ``` ```R= ## [1] boy girl boy boy girl ## Levels: boy girl ``` Levels的屬性代表: 在這個變數裡面，存在哪些類別 #### :heavy_check_mark: list - list可以存放「任何型態」的變數，包括vector及list ```R= Dr.Lee <- list(gender="man", age=18, hobby=c("sleep", "music")) Dr.Lee ``` ```R= $gender [1] "man" $age [1] 18 $hobby [1] "sleep" "music" ``` - 查看list ```R= str(Dr.Lee) ``` ```R= List of 3 $ gender: chr "man" $ age : num 18 $ hobby : chr [1:2] "sleep" "music" ``` - 取值使用 **$** ```R= Dr.Lee$hobby # Dr.Lee的嗜好 Dr.Lee$age # Dr.Lee的年紀 ``` ```R= ## [1] "sleep" "music" ## [1] 18 ``` - 取值使用 **索引** ```R= Dr.Lee[[3]] Dr.Lee[3] ``` ```R= [1] "sleep" "music" #使用兩個中括號，取出來的資料是vector [1] "sleep" "music" #使用一個中括號，取出來的資料是list ``` #### :heavy_check_mark: matrix ```R= a <- matrix(c(1:6), nrow=3, ncol=2) #建立一個3x2的矩陣 a ``` ```R= ## [,1] [,2] ## [1,] 1 4 ## [2,] 2 5 ## [3,] 3 6 ``` - 查看陣列 ```R= a[2,2] ## 5 b[1, ] ##3,5,7 ``` #### :heavy_check_mark: dataframe - 使用dataframe() ```R= tmp <- data.frame(Student_ID=c(1,2,3,4,5), name=c("Helen", "Lun", "Leon", "Kevin", "Tommy"), score=c(80,36, 88.9, 97.5, 60)) tmp ``` ```R= ## Student_ID name score ## 1 1 Helen 80.0 ## 2 2 Lun 36.0 ## 3 3 Leon 88.9 ## 4 4 Kevin 97.5 ## 5 5 Tommy 60.0 ``` - 取值使用 **$** ```R= tmp$name ``` ```R= ## [1] Helen Lun Leon Kevin Tommy ## Levels: Helen Kevin Leon Lun Tommy ``` - 取值使用**索引** [row,col] ```R= tmp[4,3] ## [1] 97.5 tmp[1, ] #第一個row所有值:1 Helen 80 tmp[, 3] #第三個col所有值:80.0 36.0 88.9 97.5 60.0 ``` - 取值特定資訊 ```R= tmp[tmp$name == "Leon", ] ``` ```R= ## Student_ID name score ## 3 3 Leon 88.9 ``` - 判別 ```R= tmp$name == "Leon" ## [1] FALSE FALSE TRUE FALSE FALSE ``` --- ## :memo: Statistics Function #### :heavy_check_mark: seq() ```R= seq(0,9) ##[1] 0 1 2 3 4 5 6 7 8 9 seq(9,0) ##[1] 9 8 7 6 5 4 3 2 1 0 ``` #### :heavy_check_mark: 統計函式 - mean()：平均值 - var()：變異數 - sd()：標準差 - median()：中位數 - max()：最大值 - min()：最小值 - sum()：相加 - quantile()：分位數 - range()：全距(第一分位到第四分位) #### :heavy_check_mark: 分布函式 - 常態分佈亂數：rnorm(a,b,c) a為亂數個數 b為亂數母體的平均數 c為亂數母體的標準差 ![](https://i.imgur.com/nnlufY3.png) - 二項分布機率密度函數：dbinom(x; n, p) Example:投擲一枚公正骰子 ```R= x <- 0:100 #0,1,2,...,99,100 y <- dbinom(x,size=100, prob=0.5) #二項分布的機率密度函數 class(x) ##integer class(y) ##numeric plot(x,y,type="l",ylab="Probility") #type = "l" 為折線圖 ``` **結果** ![](https://i.imgur.com/OIjdY5N.png) - 查看分布圖 ```R= plot(density(rnorm(10000,0,10))) ``` ![](https://i.imgur.com/oDtUWNI.png) #### :heavy_check_mark: 眾數 ```R= x <- factor(c("a", "b", "c", "c", "c", "d","d")) table (x) #根據因數各水準，求次數 which.max(table(x)) #出現最多次數值的索引 (如果有多個，則印回傳第一個) ``` ```R= table (x) x a b c d 1 1 3 2 which.max(table(x)) c 3 ``` #### :heavy_check_mark: 取樣(Sampling) ```R= sample(1:10,5) #取後不放回 sample (1: 10, 5, replace=TRUE) #取後放回 #取後放回依照順序加權重:1有一個, 2有二個, ..., 10有十個 sample (1: 10, 5, replace=TRUE, prob=1:10) ``` #### :heavy_check_mark: 相關分析 - 皮爾森相關係數: cor() 其值介於[-1,1]，通常0.7-1.0為高度相關 Practice ```R= cor(1:10, 1:10) # 1 cor(1:10, 1:10*2) # 1 x<-1:10 y<-x^3 cor(x, y) ## 0.928 ``` **以下使用<a href="#iris">Iris資料集</a>** 花萼長度與寬度的皮爾森相關係數： ```R= cor(iris$Sepal.Width, iris$Sepal.Length) ## -0.1175698 ``` 在鳶尾花資料集中，針對除Species外的所有列計算皮爾森相關係數： ```R= cor(iris[, 1:4]) ``` ```R= Sepal.Length Sepal.Width Petal.Length Sepal.Length 1.0000000 -0.1175698 0.8717538 Sepal.Width -0.1175698 1.0000000 -0.4284401 Petal.Length 0.8717538 -0.4284401 1.0000000 Petal.Width 0.8179411 -0.3661259 0.9628654 Petal.Width Sepal.Length 0.8179411 Sepal.Width -0.3661259 Petal.Length 0.9628654 Petal.Width 1.0000000 ``` - 皮爾森相關係數**視覺化**: corrgram() 安裝套件 ```R= install.packages("corrgram") ``` 繪圖 ```R= library(corrgram) #將鳶尾花相關係數置於圖形的右上端:(upper.panel=panel.conf) corrgram(iris, upper.panel=panel.conf) ``` ![](https://i.imgur.com/uNv5zbT.png) 左下方: 紅色代表負相關，藍色代表正相關右上方: 粗體字為皮爾森相關係數 - 相關係數檢定 ```R= #使用皮爾森相關係數 cor.test(c(1,2,3,4,5), c(1,0,3,4,5), method = "pearson") ``` ```R= Pearson's product-moment correlation data: c(1, 2, 3, 4, 5) and c(1, 0, 3, 4, 5) t = 3.9279, df = 3, p-value = 0.02937 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.1697938 0.9944622 sample estimates: cor 0.9149914 ``` :::info :bulb: **Hint：** $H_0$:虛無假設都用等號；$H_1$:對立假設 ::: #### :heavy_check_mark: 推論檢定 - T檢定: t.test() 平均數為0，變異數為1，抽30個樣本 ```R= x <- rnorm(30) #default:平均數為0 變異數為1，抽30個樣本 t.test(x) ``` ```R= One Sample t-test data: x t = 1.0458, df = 29, p-value = 0.3043 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.1613034 0.4988908 sample estimates: mean of x 0.1687937 ``` 平均數為10，變異數為1，抽30個樣本 ```R= x <- rnorm(30, mean=10) t.test(x, mu=10) #H0 u=10;H1 u!=10 ``` ```R= One Sample t-test data: x t = 0.65888, df = 29, p-value = 0.5152 alternative hypothesis: true mean is not equal to 10 95 percent confidence interval: 9.759037 10.470003 sample estimates: mean of x 10.11452 ``` - T檢定: tapply() 使用內建資料集:sleep ```R= sleep ``` ```R= extra group ID 1 0.7 1 1 2 -1.6 1 2 ... 20 3.4 2 10 ``` 使用tapply()函數計算每種安眠藥使患者睡眠時間增加量的平均數 ```R= tapply(sleep$extra, sleep$group, mean) ``` ``` R= 1 2 0.75 2.33 ``` - F檢定 var.test() 使用內建資料集: sleep 假設$H_1$: $σ_1^2$ = $σ_2^2$ ; $H_2$: $σ_1^2$ != $σ_2^2$ Step1. 先用F檢定 ```R= var.test (extra~group, sleep) ``` ```R= F test to compare two variances data: extra by group F = 0.79834, num df = 9, denom df = 9, p-value = 0.7427 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.198297 3.214123 sample estimates: ratio of variances 0.7983426 ``` 此處p-value=0.7427 大於α值0.05，因此Do not reject $H_0$ -> 接受虛無假設 Step2. 用F檢定，檢查變異數是否一致後，再用T檢定 >paired=FALSE 表示兩獨立樣本檢定 (不成對、不相依) -> **獨立樣本t檢定** >var.equal 用以設定兩群體的母體變異數是否一致 ```R= t.test(extra~group, data=sleep, paired=FALSE, var.equal=TRUE) ``` ```R= Two Sample t-test data: extra by group t = -1.8608, df = 18, p-value = 0.07919 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.363874 0.203874 sample estimates: mean in group 1 mean in group 2 0.75 2.33 ``` 此處p-value = 0.7919，大於α值0.05，因此Do not reject $H_0$ -> 接受虛無假設 - 成對t檢定: 使用sleep資料集 ```R= with(sleep, t.test(extra[group == 1], extra[group == 2], paired=TRUE)) ``` ```R= Paired t-test data: extra[group == 1] and extra[group == 2] t = -4.0621, df = 9, p-value = 0.002833 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -2.4598858 -0.7001142 sample estimates: mean of the differences -1.58 ``` 使用Iris資料集 ```R= with(iris, var.test(Sepal.Width, Sepal.Length)) ``` ```R= F test to compare two variances data: Sepal.Width and Sepal.Length F = 0.27706, num df = 149, denom df = 149, p-value = 3.595e-14 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.2007129 0.3824528 sample estimates: ratio of variances 0.2770617 ``` p-value非常小 -> Sepal.Width與Sepal.Length有顯著差異 :::info :bulb: **Hint：** T檢定: 常用於平均數；F檢定: 常用於變異數 ::: --- ## :memo: Plot practice(使用ggplot2) R語言提供了多種繪圖系統，可用以繪製散布圖(scatter plot)、折線圖(graph of broken line)、長條圖(bar graph)、盒形圖(box plot)。R的繪圖功能主要由graphics、lattice、ggplot等三個套件提供。 - 需安裝ggplot2套件 ```R= library(ggplot2) #匯入library ``` #### :heavy_check_mark: 使用內建Dataset: CO2，繪製盒鬚圖 - x, y為座標名稱, color為plot使用的顏色 ```R= ggplot(data=CO2) + #此行只讀入資料，執行只有空白圖層 geom_boxplot(data=CO2,aes(x=conc,y=uptake,color=Plant)) #匯出 ``` \begin{gather*} 上下程式碼是相同的 \end{gather*} ```R= ggplot(CO2, aes(x=conc, y=uptake, colour=Plant))+ #此行讀入資料，並標記x y軸 geom_boxplot() #匯出 ``` - 結果 ![](https://i.imgur.com/N26PTkk.png) #### :heavy_check_mark: 使用內建Dataset: cars，繪製點圖 ```R= library (ggplot2) ggplot(cars, aes(x=speed, y=dist))+ geom_point() #使用 "點" 方式繪圖 ``` ![](https://i.imgur.com/idndj5z.png) #### :heavy_check_mark: 使用Iris Dataset Description <span id="iris"></span> - [Download Iris](http://archive.ics.uci.edu/ml/datasets/Iris) ```R= require(ggplot2) ggplot(data=iris) + geom_point(aes(x=Petal.Length,y=Petal.Width)) + theme_bw() #白色背景 ``` - 結果 ![](https://i.imgur.com/HxBTedp.png) - 新增:標記不同種類 color=Species ```R= require(ggplot2) ggplot(data=iris) + geom_point(aes(x=Petal.Length,y=Petal.Width,color=Species)) + theme_bw() ``` - 結果 ![](https://i.imgur.com/Vr9jTzf.png) ## :memo: Data preprocessing #### :heavy_check_mark: 處理紕漏值在R裡面，紕漏值表示NA(not available)，可以使用is.na確認資料是否有紕漏值存在 ```R= tmp <- c(1,2,3,NA,5) is.na(tmp) ## [1] FALSE FALSE FALSE TRUE FALSE ``` 計算紕漏值數量 ```R= sum(is.na(tmp)) ## 1 ``` - 1. 直接移除NA ```R= rm.data <- data[complete.case(data), ] ``` - 2. 使用**平均數**填補 ```R= mean.data <- data mean.1 <- mean(mean.data[, 1], na.rm = T) # 第一欄位的平均數 na.rows <- is.na(mean.data[, 1]) # 第一欄位中，有遺漏值存在的資 mean.data[na.rows, 1] <- mean.1 # 用第一欄位的平均數，填補第一欄位的遺漏值 ``` - 3. 用K-Nearest Neighbours填補遺漏值看有NA的資料，與哪一筆資料的值較相近，利用那一筆資料的平均值等手法來填補 ```R= require(DMwR) imputeData <- knnImputation(data) head(imputeData) ``` - 4. 用MICE填補 ## :memo: Data processing #### :heavy_check_mark: 查看資料集 ```R= require(datassets) str(iris) summary(iris) ``` ## :memo: Regression #### :heavy_check_mark: 回歸分析使用cars資料集，查看cars內容 ```R= cars ``` ```R= speed dist 1 4 2 2 4 10 3 7 4 4 7 22 ... 48 24 93 49 24 120 50 25 85 ``` - 產生線性回歸模型: lm() ```R= m<-lm(dist~speed, data=cars) summary(m) ``` ```R= lm(formula = dist ~ speed, data = cars) Residuals: Min 1Q Median 3Q Max -29.069 -9.525 -2.272 9.215 43.201 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -17.5791 6.7584 -2.601 0.0123 * speed 3.9324 0.4155 9.464 1.49e-12 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 15.38 on 48 degrees of freedom Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438 F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12 ``` >Call指出使用哪種公式進行線性迴歸 Residuals顯示從實際資料觀測到的殘差 Coefficients顯示模型係數，以及係數的統計顯著性判定係數(R-squared)與調整後判定係數(Adjusted R-squared)代表模型對資料分散的解釋程度 F統計量(F-statistic)描述模型具有的統計意義 - 變異數分析: anova() ```R= m<-lm(dist~speed, data=cars) anova(m) ``` ```R= Analysis of Variance Table Response: dist Df Sum Sq Mean Sq F value Pr(>F) speed 1 21186 21185.5 89.567 1.49e-12 *** Residuals 48 11354 236.5 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` - 模型診斷圖形 ```R= m<-lm(dist~speed, data=cars) plot(m) ``` ![](https://i.imgur.com/4M5dwSD.png) 誤差服從平均數為0的常態分布理想狀態：一條斜率為0的直線 ![](https://i.imgur.com/1XH1OMP.png) 查看殘差是否符合常態分布理想狀態：一條斜率為1的直線 ![](https://i.imgur.com/vhdHGjl.png) x軸為Y值，Y軸為標準化殘差 ![](https://i.imgur.com/Q6MNCUU.png) 右上與右下位置查找對迴歸直線形狀影響很大的點 - 回歸直線視覺化 ```R= m<-lm(dist~speed, data=cars) plot(cars$speed,cars$dist) abline(coef(m)) ``` ![](https://i.imgur.com/4onzMZ9.png) - 多元迴歸分析使用Iris資料集，預測Sepal.Length(Y) 以Sepal.Width、Petal.Length、Petal.Width為自變數(X)，進行迴歸分析 ```R= model <- lm(formula= Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris) summary(model) ``` ```R= Call: lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris) Residuals: Min 1Q Median 3Q Max -0.82816 -0.21989 0.01875 0.19709 0.84570 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.85600 0.25078 7.401 9.85e-12 *** Sepal.Width 0.65084 0.06665 9.765 < 2e-16 *** Petal.Length 0.70913 0.05672 12.502 < 2e-16 *** Petal.Width -0.55648 0.12755 -4.363 2.41e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.3145 on 146 degrees of freedom Multiple R-squared: 0.8586, Adjusted R-squared: 0.8557 F-statistic: 295.5 on 3 and 146 DF, p-value: < 2.2e-16 ``` p-value非常小，表示三個自變數(X)對Y都表示顯著 R-squared: 0.8586 ； Adj R-squared: 0.8557，表示模型預測能力不錯 #### :heavy_check_mark: 殘差分析 - 殘差模型-盒鬚圖 ```R= model <- lm(formula= Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris) boxplot(resid(model)) ``` ![](https://i.imgur.com/5pxeBFk.png) - 殘差基本假設 >常態性(Normality) >獨立性(Independence) >變異數同質性(Homogeneity of Variance) **Step1.** 使用name()，查看基本資訊 ```R= model <- lm(formula= Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris) boxplot(resid(model)) names(model) ``` ```R= [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model" ``` **Step2.** 常態性假設 ```R= model$residual ``` ```R= 1 2 3 4 5 0.0845842387 0.2100028184 -0.0492514176 -0.2259940935 -0.0804994772 6 7 8 9 10 0.0228063193 -0.2946837793 -0.0212452413 -0.2249134657 0.0183576405 11 12 13 14 15 0.1835036110 -0.2921584372 0.0543545524 -0.2329058599 0.6009920509 16 17 18 19 20 ... ``` **Step3:** 檢驗殘差的常態性假設: shapiro.test() ```R= shapiro.test(model$residual) ``` ```R= Shapiro-Wilk normality test data: model$residual W = 0.99559, p-value = 0.9349 ``` >虛無假設H0: 殘差服從常態分配因為p-value > 0.05，代表無法拒絕H0 **Step4:** 檢驗殘差的獨立性假設: durbinWatsonTest() ```R= require(car) #需先安裝car套件 durbinWatsonTest(model) ``` ```R= lag Autocorrelation D-W Statistic p-value 1 -0.03992126 2.060382 0.864 Alternative hypothesis: rho != 0 ``` >由於虛無假設H0:殘差間相互獨立，因為p-value > 0.05，代表不會拒絕H0 **Step5:** 檢測變異數同質性假設: ncvTest() ```R= require(car) ncvTest(model) ``` ```R= Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare = 4.448612 Df = 1 p = 0.03492962 ``` >由於虛無假設H0:殘差變異數具有同質性，因為p-value < 0.05，代表拒絕H0。(這表示上面的線性模型無法使用) **Step6:** 預測放入data ```R= new.iris <- data.frame(Sepal.Width=3.456, Petal.Length=1.535, Petal.Width=0.341) new.iris ``` ```R= Sepal.Width Petal.Length Petal.Width 1 3.456 1.535 0.341 ``` 預測 ```R= predict(model, new.iris) ``` ```R= 1 5.004048 ```