---
GA: UA-159972578-2
---

###### tags: `R` `Statistics` `統計`

# Statistical Analysis Methods in R

Reference: http://ccckmit.wikidot.com/st:test1

![The probability functions in R](https://i.imgur.com/fVa0hUG.png)

+ d = density function (PDF for continuous distributions, PMF for discrete ones)
+ p = cumulative distribution function
+ q = quantile function (the inverse of `p`)
+ r = random number generation

```
# Cumulative normal probability
pnorm(175, mean = 170, sd = 5) - pnorm(170, 170, 5)
# The area under the normal curve between 170 and 175 (mean: 170, sd: 5)

# Probability mass function: dbinom(successes, trials, probability)
dbinom(5, 10, 0.5)
# Probability of exactly 5 heads in 10 tosses with success probability 0.5
```

```
rbinom(n, size, prob_success)
```

+ n: how many results you want
+ size: number of trials per experiment
+ prob_success: probability of success
+ returns: n counts of successes out of size trials (e.g. 50 counts of heads from 3 coin tosses each)

## Common scaling methods

+ Standardization
    + Z-score: mean 0, standard deviation 1 (the values themselves are not bounded to a fixed range)
+ Normalization
    + Range: 0 to 1 (equivalent to a proportion)
    + Center the row (x - x_mean)
    + i / rowSum(i)
+ Taking the logarithm (log)
+ Regularization:
    + L1 regularization
    + L2 regularization

## Normality tests

+ The most common way to test normality in R is the Shapiro-Wilk test; it is built into R, so no extra package is needed.
+ Besides the Shapiro-Wilk test, the `nortest` package provides several other methods dedicated to normality testing.
+ With a large sample size (n > 50, as a rule of thumb), the Shapiro-Wilk test very easily yields a low p-value from small deviations from normality.
+ The Lilliefors test, based on the K-S test, is often considered a more powerful test for large samples.
+ Reference: https://officeguide.cc/r-normality-test-tutorial/

```{r}
# H0: the sample is drawn from a normal distribution
# If p-value < 0.05, reject H0: the data are not normally distributed
shapiro.test(mtcars$mpg)
ks.test(scale(mtcars$mpg), "pnorm")
nortest::lillie.test(faithful$waiting)
nortest::ad.test(faithful$waiting)
```

```{r}
library(ggplot2)
# normal vs uniform
twoDists = data.frame("data" = c(rnorm(100, 0, 1), runif(100, -3, 3)),
                      "dist" = factor(c(rep("normal", 100), rep("uniform", 100))))
# rnorm(100, 0, 1): 100 random samples from a normal distribution with mean 0 and sd 1
# runif(100, -3, 3): 100 random samples from a uniform distribution over [-3, 3] (mean 0)
ggplot(twoDists, aes(x = data, fill = dist)) +
  geom_density(alpha = 0.3) + xlim(-5, 5)

twoDists_2c = split(twoDists$data, twoDists$dist)
ks.test(twoDists_2c$normal, twoDists_2c$uniform)
# The Kolmogorov-Smirnov test (K-S test) is a popular nonparametric test
# of the equality of continuous distributions.
```

+ So, what should we do to verify data distributions if we deal with big datasets?
    + Talk to domain experts (could be yourself) first. They may have a better understanding of how the data is supposed to be distributed.
    + Still, run the distribution-check tests (e.g. package nortest for more normality tests). Note that some of them are not reliable!
    + Plot your data. A density plot along with a normal quantile-quantile plot (q-q plot) may give you a visual sense of the distribution and normality. Check out packages for visualizing big datasets (e.g. hexbin, tabplot and bigvis). Also be aware that your eyes can be fooled by figures, especially by the scales of features!

![](https://i.imgur.com/4P6lNiP.png)
![](https://i.imgur.com/qnucCJk.png)

+ Precision: consistent control, well trained; accuracy: actually hitting the target.
+ Top right: a hit means high accuracy, but very unstable (too little practice), only one hit. e.g. a gifted but under-practiced pitcher throwing with eyes open.
+ Bottom left: steady control but no hits at all. e.g. an experienced veteran pitcher throwing with eyes closed.
+ We have been using some statistical methods, such as the Chi-squared test of independence and Fisher's exact test, to verify the relationships between two random variables.
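The visual checks suggested above (a look at the data's shape plus a normal q-q plot) can be sketched in a few lines of base R. This is a minimal illustration on simulated data; the sample size and the normal parameters are arbitrary assumptions, not values from any dataset in this note.

```r
set.seed(42)
x <- rnorm(200, mean = 170, sd = 5)  # simulated sample; parameters are illustrative only

# Normal q-q plot: the points should hug the reference line if x is normal
qqnorm(x)
qqline(x, col = "red")

# Back the plot up with a formal test
shapiro.test(x)
```

With a genuinely large sample, remember the caveat above: Shapiro-Wilk flags even tiny deviations from normality, so treat the plot as the primary evidence.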
## Chi-squared test

+ One of the nonparametric tests (often used when the sample is small, n < 30)
+ Homogeneity test (whether the proportions of several populations are equal / significantly different)
+ Independence test (whether the proportions of two populations are independent of each other)
+ Goodness of fit (whether a single population's proportions match a specified ratio)

```{r}
# Consider that we roll a die 100 times. Is it a fair die?
pointFreq1 = c(31, 22, 17, 13, 9, 8)
# Chi-squared goodness-of-fit test
chisq.test(x = pointFreq1, p = rep(1/6, 6))
```

## t-test

+ A parametric test (an inferential method: it estimates population parameters)
+ Tests whether the mean μ of one population equals a hypothesized value, or whether the means of two populations are equal
+ A t-test tests a null hypothesis about two means; most often, it tests the hypothesis that two means are equal, or that the difference between them is zero.
+ Fisher's LSD

```{r}
# Parametric, assuming the distributions of day1/day2 are normal
t.test(dlfest$day1, dlfest$day2, paired = T)
# Equivalent to the 1-sample t-test below
diff = dlfest$day1 - dlfest$day2
t.test(diff)
```

```{r}
OnlineMember$Birth <- as.numeric(as.character(OnlineMember$Birth))
MobileMember$Birth <- as.numeric(as.character(MobileMember$Birth))
Mage = 2018 - MobileMember$Birth
Oage = 2018 - OnlineMember$Birth
t.test(Mage, Oage)
# p-value < 2.2e-16
```

## Proportion test

```{r}
table(MobileMember$Gender)
#     F     M     Z
# 14395 15533    72
table(OnlineMember$Gender)
#     F     M     Z
# 17426 12388   186
prop.test(x = c(14395, 17426), n = c(30000, 30000))
# p-value < 2.2e-16
```

![](https://i.imgur.com/7QCroc0.png)

Reading ability as a function of age: the left figure defies common sense.

```{r}
# Non-parametric: with paired = T this is the Wilcoxon signed-rank test;
# the unpaired version is the rank-sum test (better known as the Mann-Whitney U test)
wilcox.test(dlfest$day1, dlfest$day2, paired = T)
# paired handles the situation above (but only works when observations can be
# matched in pairs; three or more groups cannot be paired this way)
```

```{r}
# Consider the built-in dataset MASS::UScrime
ggplot(UScrime, aes(x = Prob, fill = factor(So))) + geom_density(alpha = 0.3)
# Compare Southern and non-Southern states (So) on the probability of
# imprisonment (Prob) without assuming equal variances across "So"
t.test(formula = Prob ~ So, data = UScrime)
```

+ In the case that your datasets don't seem to meet the parametric assumptions of the t-test (i.e.
a normally-distributed outcome), a nonparametric approach, the Wilcoxon rank-sum test (WRS), may help.
+ Note that such nonparametric approaches do have assumptions. For example, we assume the dataset is randomly sampled and independently observed.

```{r}
# Normality tests
library("nortest")
Map(function(f) f(UScrime$Prob), c(shapiro.test, lillie.test, ad.test))
# Wilcoxon rank-sum test
wilcox.test(Prob ~ So, data = UScrime)
```

## F-test

+ Tests whether the variances of two populations are equal
+ Also extends to multiple populations

## Poisson & Exponential

+ Poisson counts occurrences (e.g. the number of customers within a time period)
+ Exponential measures the time between events (e.g. the length of the interval between two phone calls)
+ Their means μ are reciprocals of each other (a Poisson rate of λ events per unit time implies a mean waiting time of 1/λ)

## ANOVA

+ Assumptions:
    1. Randomly sampled
    2. Independently observed
+ Completely Randomized Design
+ Randomized Block Design
+ Two-Factor Design

```{r}
library(multcomp)  # provides the "cholesterol" dataset
qplot(trt, response, data = cholesterol, geom = "boxplot", xlab = "Treatment")
# H0: all group means are equal
cho_aov = aov(response ~ trt, data = cholesterol); summary(cho_aov)
# ANOVA table
#             Df Sum Sq Mean Sq F value   Pr(>F)
# trt          4 1351.4   337.8   32.43 9.82e-13 ***
# Residuals   45  468.8    10.4
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Equivalent to lm()
cho_lm = lm(response ~ trt, data = cholesterol); car::Anova(cho_lm, type = 3)
# A trained model can also be used to "predict"
predict(cho_lm, newdata = data.frame(trt = c("1time", "drugD")))
```

```{r}
usc_lm = lm(Prob ~ So, data = UScrime)
car::Anova(usc_lm, type = 3)
predict(usc_lm, newdata = data.frame(So = c(0, 1)))
# An ANOVA-style model can also predict;
# the result here is just the two group means
```

## Computing a Correlation Matrix

```{r}
Hmisc::rcorr(as.matrix(df[, c(4:5, 7, 10:11)]), type = "pearson")
```

```
         house  pop male female marriage born death
house     1.00 0.98 0.98   0.98     0.93 0.90  0.76
pop       0.98 1.00 1.00   1.00     0.93 0.92  0.76
male      0.98 1.00 1.00   0.99     0.93 0.92  0.77
female    0.98 1.00 0.99   1.00     0.93 0.92  0.75
marriage  0.93 0.93 0.93   0.93     1.00 0.92  0.69
born      0.90 0.92 0.92   0.92     0.92 1.00  0.65
death     0.76 0.76 0.77   0.75     0.69 0.65  1.00

n
         house   pop  male female marriage  born death
house    93924 93924 93924  93924    84364 84364 84364
pop      93924 93924 93924  93924    84364 84364 84364
male     93924 93924 93924  93924    84364 84364 84364
female   93924 93924 93924  93924    84364 84364 84364
marriage 84364 84364 84364  84364    84364 84364 84364
born     84364 84364 84364  84364    84364 84364 84364
death    84364 84364 84364  84364    84364 84364 84364

P
         house pop male female marriage born death
house           0    0     0        0     0    0
pop       0          0     0        0     0    0
male      0     0          0        0     0    0
female    0     0    0              0     0    0
marriage  0     0    0     0              0    0
born      0     0    0     0        0          0
death     0     0    0     0        0     0
```

## Nonparametric tests (for when normality fails)

```{r}
wilcox.test(Prob ~ So, data = UScrime)
```

## Modeling approaches

+ EDA (Exploratory Data Analysis):
    + Unsupervised learning (data mining)
    + The observation stage
    + No forced-in variables
    + Throw everything in, no matter how bad the data are (how high the p-values)
    + Advantage: you may uncover hidden phenomena, e.g. the classic beer-and-diapers association
    + Disadvantage: you may find meaningless relationships, e.g. "users who have given birth are all female"
    + Letting the computer pick variables:
        + Recursive Feature Elimination (RFE): start with fitting a model with all variables, then remove the variable with the largest p-value or lowest variable importance.
        + Backward selection
+ CDA (Confirmatory Data Analysis):
    + Supervised learning
    + EDA leaves you with a conjecture; CDA is the stage that verifies it
    + Similar in spirit to traditional statistics (keep the variables with small p-values)
    + Confirms or refutes the hypothesis; there is a definite answer to measure against
    + The results should be close to those of the EDA; otherwise something may be wrong
    + VOI (Variable of Interest)
    + Letting the computer pick variables:
        + Forward selection: begin with identifying variables of interest. These variables may always be "forced in"; then add the variable that results in the lowest error.
        + Hybrid selection

Correlation does not imply causation. e.g. in an image-analysis project, every image that contained a ruler also showed skin cancer.

<style>
body { font-family:微軟正黑體;}
</style>
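The forward, backward, and hybrid selection strategies above can be sketched with base R's `step()`. Note one deviation from the text: `step()` compares models by AIC rather than by p-values or prediction error. The use of the built-in `mtcars` dataset here is purely an assumption for illustration.

```r
# Backward selection: start from the full model and drop one term at a time
full <- lm(mpg ~ ., data = mtcars)
backward <- step(full, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add one term at a time
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Hybrid ("both"): consider both adding and dropping at every step
hybrid <- step(null, scope = formula(full), direction = "both", trace = 0)

formula(backward)  # the predictors that survive backward elimination
```

Whichever direction is used, the selected model should still be checked against domain knowledge, per the EDA/CDA discussion above.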