R
Statistics
Reference: http://ccckmit.wikidot.com/st:test1
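# Cumulative normal distribution
pnorm(175, mean = 170, sd = 5) - pnorm(170, 170, 5)
# Area under the normal curve between 170 and 175 (mean = 170, sd = 5)
# Binomial probability mass function: dbinom(x, size, prob)
dbinom(5, 10, 0.5) # exactly 5 heads in 10 tosses, success probability 0.5
# Random binomial draws: rbinom(n, size, prob)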
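As a quick sanity check of the distribution functions above, the sketch below recomputes the same normal area on the standard (z) scale and confirms that the binomial probabilities over all outcomes sum to one. The variable names are illustrative and do not come from the original notes.

```r
# Area between 170 and 175 for a normal distribution with mean 170, sd 5
area <- pnorm(175, mean = 170, sd = 5) - pnorm(170, mean = 170, sd = 5)

# Same area after standardizing: z = (x - mean) / sd, so 170 -> 0 and 175 -> 1
area_z <- pnorm(1) - pnorm(0)

# Binomial pmf for 10 fair coin tosses, over every possible head count 0..10
pmf <- dbinom(0:10, size = 10, prob = 0.5)

print(round(area, 4))            # about 0.3413
print(all.equal(area, area_z))   # the two computations agree
print(sum(pmf))                  # probabilities over all outcomes sum to 1
print(pmf[6])                    # P(exactly 5 heads) = choose(10, 5) / 2^10
```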
# H0: sample is taken from a normal distribution
# If p-value < 0.05, reject H0: the sample is not normally distributed
shapiro.test(mtcars$mpg)
ks.test(scale(mtcars$mpg), "pnorm")
nortest::lillie.test(faithful$waiting)
nortest::ad.test(faithful$waiting)
# Normal vs. uniform
library(ggplot2)
twoDists = data.frame(
  data = c(rnorm(100, 0, 1), runif(100, -3, 3)),
  dist = factor(c(rep("normal", 100), rep("uniform", 100)))
)
# rnorm(100, 0, 1): 100 random draws from a normal distribution with mean 0 and sd 1
# runif(100, -3, 3): 100 random draws from a uniform distribution on [-3, 3] (mean 0, sd = sqrt(3), about 1.73)
ggplot(twoDists, aes(x = data, fill = dist)) +
  geom_density(alpha = 0.3) + xlim(-5, 5)
twoDists_2c = split(twoDists$data, twoDists$dist)
ks.test(twoDists_2c$normal,twoDists_2c$uniform)
# The Kolmogorov-Smirnov test (K-S test) is a popular nonparametric test of the equality of continuous distributions.
# Suppose we roll a die 100 times. Is it a fair die?
pointFreq1 = c( 31, 22, 17, 13, 9, 8)
# Chi-squared goodness of fit test
chisq.test(x = pointFreq1, p = rep(1/6, 6))
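The goodness-of-fit statistic can also be computed by hand, which makes clear what `chisq.test()` is measuring. The following is a minimal sketch using the same counts as above; the names `observed`, `expected`, `stat`, and `p_value` are illustrative.

```r
observed <- c(31, 22, 17, 13, 9, 8)        # counts for faces 1..6, from the notes
expected <- sum(observed) * rep(1/6, 6)    # a fair die expects 100/6 per face

# Pearson statistic: sum of (O - E)^2 / E, with k - 1 = 5 degrees of freedom
stat <- sum((observed - expected)^2 / expected)
p_value <- pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)

fit <- chisq.test(x = observed, p = rep(1/6, 6))
print(stat)                                        # 22.88
print(all.equal(stat, unname(fit$statistic)))      # matches chisq.test()
print(all.equal(p_value, fit$p.value))
```

Since the statistic is well above the 5-degree-of-freedom critical value at the 0.05 level, the fair-die hypothesis is rejected for these counts.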
+ A parametric test (an inferential method that estimates population parameters)
+ One- and two-population variance tests
+ A t-test tests a null hypothesis about two means; most often, it tests the hypothesis that two means are equal, or that the difference between them is zero.
+ Fisher's LSD
# Parametric, assuming the distributions of day1/day2 are normal
t.test(dlfest$day1, dlfest$day2, paired = TRUE)
# Equivalent to the one-sample t-test below
diff = dlfest$day1 - dlfest$day2
t.test(diff)
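The equivalence between the paired t-test and a one-sample t-test on the differences can be checked numerically. The `dlfest` data are not included in these notes, so this sketch uses simulated scores; `day1`, `day2`, and the seed are assumptions for illustration only.

```r
set.seed(1)                                      # arbitrary seed, for reproducibility only
day1 <- rnorm(30, mean = 2.0, sd = 0.5)          # hypothetical scores on day 1
day2 <- day1 - rnorm(30, mean = 0.3, sd = 0.4)   # same 30 subjects on day 2

paired  <- t.test(day1, day2, paired = TRUE)
one_smp <- t.test(day1 - day2)                   # H0: mean difference is 0

# Same t statistic, degrees of freedom, and p-value
print(all.equal(unname(paired$statistic), unname(one_smp$statistic)))
print(all.equal(unname(paired$parameter), unname(one_smp$parameter)))
print(all.equal(paired$p.value, one_smp$p.value))
```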
OnlineMember$Birth<-as.numeric(as.character(OnlineMember$Birth))
MobileMember$Birth<-as.numeric(as.character(MobileMember$Birth))
Mage = 2018 - MobileMember$Birth
Oage = 2018 - OnlineMember$Birth
t.test(Mage,Oage) # p-value < 2.2e-16
table(MobileMember$Gender)
# F M Z
# 14395 15533 72
table(OnlineMember$Gender)
# F M Z
# 17426 12388 186
prop.test(x = c(14395,17426), n = c(30000,30000)) # p-value < 2.2e-16
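For intuition, the two-sample proportion test is closely related to the pooled two-proportion z-test: without the continuity correction, the `prop.test()` statistic equals z squared. A minimal sketch using the counts above (note that `prop.test()` applies the correction by default, so `correct = FALSE` is set here on purpose):

```r
x <- c(14395, 17426)     # number of "F" members in each group, from the tables above
n <- c(30000, 30000)     # group sizes

p_hat  <- x / n                       # sample proportions
p_pool <- sum(x) / sum(n)             # pooled proportion under H0: p1 == p2
z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n))

fit <- prop.test(x = x, n = n, correct = FALSE)
print(z^2)
print(all.equal(z^2, unname(fit$statistic)))   # X-squared equals z^2
```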
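# Nonparametric alternative: with paired = TRUE this is the Wilcoxon signed-rank test
# (the unpaired Wilcoxon rank-sum test is more popularly known as the Mann–Whitney U test)
wilcox.test(dlfest$day1, dlfest$day2, paired = TRUE) # paired = TRUE handles the paired design above (pairing only works for two samples; three or more cannot be paired)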
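In R, `wilcox.test()` with `paired = TRUE` runs the signed-rank test on the pairwise differences, while the default unpaired call runs the rank-sum (Mann–Whitney) test. The sketch below shows both on simulated data; `before`, `after`, and the seed are hypothetical and not part of the original notes.

```r
set.seed(2)                                       # arbitrary seed
before <- rnorm(20, mean = 10, sd = 2)            # hypothetical first measurement
after  <- before + rnorm(20, mean = 1, sd = 1)    # same subjects, second measurement

signed_rank <- wilcox.test(before, after, paired = TRUE)   # uses pairwise differences
rank_sum    <- wilcox.test(before, after)                  # treats samples as independent

print(signed_rank$method)
print(rank_sum$method)

# The paired test is the same as a one-sample signed-rank test on the differences
one_sample <- wilcox.test(before - after)
print(all.equal(signed_rank$p.value, one_sample$p.value))
```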
# Consider the built-in dataset MASS::UScrime
library(MASS); library(ggplot2)
ggplot(UScrime, aes(x = Prob, fill = factor(So))) +
  geom_density(alpha = 0.3)
# Compare Southern and non-Southern states (So) on the probability of imprisonment (Prob), without assuming equal variances in the two groups (Welch's t-test, the default)
t.test(formula = Prob ~ So, data = UScrime)
If your data do not appear to meet the parametric assumptions of the t-test (e.g. a normally distributed outcome), a nonparametric approach such as the Wilcoxon rank-sum test (WRS) may help.
Note that nonparametric approaches still have assumptions. For example, we assume the observations are randomly sampled and independent.
# Normality tests
library("nortest")
Map(function(f) f(UScrime$Prob), c(shapiro.test, lillie.test, ad.test))
# Wilcoxon rank-sum test
wilcox.test(Prob ~ So, data = UScrime)
library(multcomp) # Get "cholesterol" dataset
qplot(trt,response,data = cholesterol,geom = "boxplot",xlab = "Treatment")
# H0: all group means are equal
cho_aov = aov(response ~ trt, data = cholesterol); summary(cho_aov) # ANOVA table
#             Df Sum Sq Mean Sq F value   Pr(>F)
# trt          4 1351.4   337.8   32.43 9.82e-13 ***
# Residuals   45  468.8    10.4
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Equivalent to lm()
cho_lm = lm(response ~ trt, data = cholesterol);
car::Anova(cho_lm,type = 3)
# A trained model can also be used to "predict"
predict(cho_lm ,newdata = data.frame(trt=c("1time","drugD")))
usc_lm = lm(Prob ~ So, data = UScrime)
car::Anova(usc_lm, type = 3)
predict(cho_lm, newdata = data.frame(trt = c("1time", "drugD")))
# An ANOVA model can also be used for prediction; the result here is just the two group means
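To see why predictions from a one-way model are just group means, the sketch below fits `lm()` on the built-in `PlantGrowth` dataset (used here only because it ships with base R; it is not part of the original notes) and compares `predict()` with the per-group averages.

```r
# One-way model: plant weight explained by treatment group
fit <- lm(weight ~ group, data = PlantGrowth)

# Predict for two of the three groups (levels are "ctrl", "trt1", "trt2")
new_groups <- data.frame(group = c("ctrl", "trt2"))
pred <- predict(fit, newdata = new_groups)

# Plain per-group sample means for comparison
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

print(pred)
print(group_means[c("ctrl", "trt2")])
print(all.equal(as.numeric(pred), as.numeric(group_means[c("ctrl", "trt2")])))
```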
Hmisc::rcorr(as.matrix(df[,c(4:5,7,10:11)]), type = "pearson")
         house  pop male female marriage born death
house     1.00 0.98 0.98   0.98     0.93 0.90  0.76
pop       0.98 1.00 1.00   1.00     0.93 0.92  0.76
male      0.98 1.00 1.00   0.99     0.93 0.92  0.77
female    0.98 1.00 0.99   1.00     0.93 0.92  0.75
marriage  0.93 0.93 0.93   0.93     1.00 0.92  0.69
born      0.90 0.92 0.92   0.92     0.92 1.00  0.65
death     0.76 0.76 0.77   0.75     0.69 0.65  1.00

n
         house   pop  male female marriage  born death
house    93924 93924 93924  93924    84364 84364 84364
pop      93924 93924 93924  93924    84364 84364 84364
male     93924 93924 93924  93924    84364 84364 84364
female   93924 93924 93924  93924    84364 84364 84364
marriage 84364 84364 84364  84364    84364 84364 84364
born     84364 84364 84364  84364    84364 84364 84364
death    84364 84364 84364  84364    84364 84364 84364

P
         house pop male female marriage born death
house            0    0      0        0    0     0
pop          0        0      0        0    0     0
male         0   0           0        0    0     0
female       0   0    0               0    0     0
marriage     0   0    0      0             0     0
born         0   0    0      0        0          0
death        0   0    0      0        0    0
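Correlation does not imply causation. For example, in image analysis, a model can end up predicting skin cancer whenever a ruler appears in the image.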
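A small simulation can illustrate how a strong correlation arises without any causal link. In this sketch, a hidden variable `z` drives both `x` and `y`; all names, the sample size, and the seed are assumptions chosen for illustration.

```r
set.seed(3)                       # arbitrary seed
n <- 1000
z <- rnorm(n)                     # hidden common cause (confounder)
x <- z + rnorm(n, sd = 0.5)       # x depends on z only
y <- z + rnorm(n, sd = 0.5)       # y depends on z only, never on x

r_xy <- cor(x, y)
print(r_xy)                       # strongly positive despite no causal link

# Once z is included in the model, the coefficient on x shrinks toward 0
adj <- coef(lm(y ~ x + z))
print(adj)
```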