Statistical Analysis Methods in R
Reference: http://ccckmit.wikidot.com/st:test1
- d = density function (probability mass function for discrete distributions)
- p = cumulative distribution function (for both continuous and discrete distributions)
- q = quantile function
- r = random number generation
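- n: number of results to return
- size: number of trials in each experiment
- prob_success: probability of success on each trial
- return: n values, each the number of successes in size trials (e.g. 50 counts of heads, each from tossing a coin 3 times)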
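The argument list above matches base R's `rbinom(n, size, prob)` (the notes call the last argument `prob_success`). Assuming that is the function meant, the four prefixes can be sketched on the binomial distribution as follows. The seed and the numbers are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# d: probability of exactly 2 heads in 3 fair coin tosses
dbinom(2, size = 3, prob = 0.5)   # 0.375

# p: probability of at most 2 heads in 3 tosses (cumulative)
pbinom(2, size = 3, prob = 0.5)   # 0.875

# q: smallest number of heads k such that P(X <= k) >= 0.5
qbinom(0.5, size = 3, prob = 0.5)

# r: 50 simulated experiments, each counting heads in 3 tosses
heads <- rbinom(n = 50, size = 3, prob = 0.5)
table(heads)
```

The same prefixes work for other distributions, e.g. `dnorm`, `pnorm`, `qnorm`, `rnorm` for the normal distribution.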
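Common scaling methods
- Standardization
  - Range: centered on 0 and unbounded (values are expressed in units of standard deviations)
  - Z-score: mean 0, standard deviation 1
- Normalization
  - Range: 0~1 (equivalent to a proportion)
  - Center the row (x - x_mean)
  - i/rowSum(i)
  - Log transform (log)
- Regularization:
  - L1 regularization
  - L2 regularization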
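A minimal base-R sketch of the scaling operations listed above. The matrix `m` and all variable names are made up for illustration, and min-max rescaling is assumed to be what the 0~1 range refers to.

```r
# Hypothetical 3 x 3 matrix of positive values (rows = samples)
m <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 10), nrow = 3, byrow = TRUE)

# Z-score standardization: each column gets mean 0 and sd 1
z <- scale(m)

# Min-max normalization: each column rescaled to the range 0..1
minmax <- apply(m, 2, function(x) (x - min(x)) / (max(x) - min(x)))

# Row centering: subtract each row's mean (x - x_mean)
row_centered <- m - rowMeans(m)

# Row proportions: each value divided by its row sum, so each row sums to 1
row_prop <- m / rowSums(m)

# Log transform (log1p computes log(1 + x), which tolerates zeros)
logged <- log1p(m)
```

Note that `scale` works column by column, while `rowMeans` and `rowSums` work row by row, so it helps to decide first whether samples or features are being rescaled.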
Normality tests
- The most common way to test normality in R is the Shapiro-Wilk test. It is a built-in statistical test, so no package needs to be installed.
- To try other tests besides Shapiro-Wilk, install the nortest package, which provides several methods designed specifically for normality testing.
- With a large sample (n > 50, as a rule of thumb), the Shapiro-Wilk test easily returns a low p-value for even small deviations from normality.
- The Lilliefors test is based on the K-S (Kolmogorov-Smirnov) test and is often considered a more powerful test for large samples.
- Reference: https://officeguide.cc/r-normality-test-tutorial/
- So how should we verify the distribution of the data when dealing with big datasets?
  - Talk to domain experts (which could be yourself) first. They may have a better understanding of how the data are supposed to be distributed.
  - Still, run the distribution checks (e.g. the nortest package offers more normality tests). Note that some of them are not reliable!
  - Plot your data. A density plot along with a normal quantile-quantile plot (Q-Q plot) gives a visual sense of the distribution and its normality. Check out packages for visualizing big datasets (e.g. hexbin, tabplot and bigvis). Also be aware that figures can fool your eyes, especially through the scales of the features.
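The checks above can be sketched in base R as follows. The two samples are simulated and purely hypothetical. The Lilliefors call uses `lillie.test` from nortest and only runs if that optional package is installed.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical data: a small normal sample and a large, mildly skewed one
small_normal <- rnorm(30)
large_skewed <- rnorm(5000) + 0.5 * rexp(5000)  # shapiro.test accepts 3 to 5000 values

# Shapiro-Wilk test (built into R); H0: the data are normally distributed
sw_small <- shapiro.test(small_normal)
sw_large <- shapiro.test(large_skewed)
sw_small$p.value
sw_large$p.value

# Visual checks: density plot and normal Q-Q plot (interactive sessions only)
if (interactive()) {
  plot(density(large_skewed), main = "Density")
  qqnorm(large_skewed)
  qqline(large_skewed)
}

# Lilliefors test, only if the optional nortest package is installed
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::lillie.test(large_skewed))
}
```

Comparing the two p-values with the plots is a quick way to see how sample size affects the test result.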
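- Precision: good ball control, well trained. Accuracy: hitting the target.
- Upper right: a hit means high accuracy, but the throws are very unstable (too little training) and only one hits. e.g. a talented pitcher who lacks practice, throwing with eyes open
- Lower left: stable control, but no hits. e.g. an experienced veteran pitcher throwing with eyes closed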
- We have been using statistical methods such as the Chi-squared test of independence and Fisher's exact test to verify the relationship between two random variables.
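Chi-squared test
- A nonparametric test (n < 30)
- Homogeneity test (multiple populations: are the proportions equal or significantly different?)
- Independence test (two populations: are the proportions independent of each other?)
- Goodness-of-fit test (one population: do the proportions match the specified proportions?)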
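A minimal sketch of these tests with base R's `chisq.test` and `fisher.test`. The counts are invented for illustration.

```r
# Hypothetical 2 x 2 table of counts: group (rows) by outcome (columns)
counts <- matrix(c(30, 20,
                   15, 35), nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))

# Independence / homogeneity: the same call on a contingency table
indep <- chisq.test(counts)
indep$p.value

# Expected counts under H0; when some are small (below about 5),
# Fisher's exact test is a common alternative
indep$expected
fisher <- fisher.test(counts)
fisher$p.value

# Goodness of fit: do the observed counts follow the specified proportions?
observed <- c(48, 35, 17)
gof <- chisq.test(observed, p = c(0.5, 0.3, 0.2))
gof$statistic
gof$p.value
```

For 2 x 2 tables `chisq.test` applies Yates' continuity correction by default; pass `correct = FALSE` to turn it off.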
t-test
+ A parametric test (an inferential statistical method used to estimate the population)
+ One-/two-population variance tests
+ A t-test tests a null hypothesis about two means; most often, it tests the hypothesis that two means are equal, or that the difference between them is zero.
+ Fisher's LSD
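A short sketch of one-sample and two-sample t-tests with base R's `t.test`. The two groups are simulated, and their means and sizes are arbitrary assumptions.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical measurements for two independent groups
group_a <- rnorm(40, mean = 10, sd = 2)
group_b <- rnorm(40, mean = 11, sd = 2)

# One-sample t-test; H0: the mean of group_a equals 10
one_sample <- t.test(group_a, mu = 10)
one_sample$p.value

# Two-sample t-test; H0: the two means are equal (their difference is zero)
# The default is Welch's test, which does not assume equal variances
two_sample <- t.test(group_a, group_b)
two_sample$p.value
two_sample$conf.int  # confidence interval for the difference in means
```

Passing `var.equal = TRUE` gives the classical pooled-variance version, and `paired = TRUE` handles paired measurements.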
p-test
Reading ability changes with age; the figure on the left contradicts common sense
- If your data do not seem to meet the parametric assumptions of the t-test (i.e. a normally distributed outcome), a nonparametric approach, the Wilcoxon rank-sum test (WRS), may help.
- Note that such nonparametric approaches do have assumptions. For example, we assume the data are randomly sampled and independently observed.
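A minimal sketch of the Wilcoxon rank-sum test with base R's `wilcox.test`, on simulated skewed data (the samples and rates are hypothetical).

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical skewed (non-normal) outcomes for two independent groups
x <- rexp(25, rate = 1)
y <- rexp(25, rate = 0.5)

# Wilcoxon rank-sum test (also known as the Mann-Whitney U test)
# H0: neither group tends to produce larger values than the other
wrs <- wilcox.test(x, y)
wrs$statistic
wrs$p.value
```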
F-test
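Assuming the F-test here means the test for equality of two variances, base R's `var.test` is one way to run it. The samples below are simulated and hypothetical.

```r
set.seed(11)  # arbitrary seed for reproducibility

# Hypothetical samples with different spreads
a <- rnorm(30, sd = 1)
b <- rnorm(30, sd = 2)

# F-test; H0: the ratio of the two population variances is 1
f_res <- var.test(a, b)
f_res$statistic  # the ratio var(a) / var(b)
f_res$p.value
```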
Poisson & Exponential
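- Poisson models counts (e.g. the number of customers arriving in a period of time)
- Exponential models the time between events (e.g. the length of the interval between two phone calls)
- Their means (μ) are reciprocals of each other: with an average of λ events per unit time, the mean waiting time between events is 1/λ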
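The reciprocal relationship can be checked with a small simulation. The rate of 4 arrivals per hour is an arbitrary assumption.

```r
set.seed(5)  # arbitrary seed for reproducibility
rate <- 4    # hypothetical: 4 customers per hour on average

# Poisson: number of arrivals in each of 10000 one-hour windows
arrivals <- rpois(10000, lambda = rate)
mean(arrivals)  # close to rate

# Exponential: waiting time (in hours) between consecutive arrivals
gaps <- rexp(10000, rate = rate)
mean(gaps)      # close to 1 / rate
```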
ANOVA
- Assumptions:
  - Randomly sampled
  - Independently observed
- Randomized experimental design
- Randomized block design
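A sketch of both designs with base R's `aov`, on a simulated experiment (treatments, blocks, and effect sizes are all hypothetical). The last call runs unadjusted pairwise t-tests with a pooled standard deviation, which is how Fisher's LSD comparisons are commonly carried out in R; treat that mapping as an assumption.

```r
set.seed(9)  # arbitrary seed for reproducibility

# Hypothetical experiment: 3 treatments, each applied once in 6 blocks
d <- expand.grid(treatment = factor(c("A", "B", "C")),
                 block = factor(1:6))
d$yield <- 10 + as.integer(d$treatment) + rnorm(nrow(d))

# Randomized design without blocking: one-way ANOVA on treatment only
fit_crd <- aov(yield ~ treatment, data = d)
summary(fit_crd)

# Randomized block design: block enters as an additional factor
fit_rbd <- aov(yield ~ treatment + block, data = d)
summary(fit_rbd)

# Unadjusted pairwise t-tests with pooled sd (Fisher's LSD style)
lsd <- pairwise.t.test(d$yield, d$treatment, p.adjust.method = "none")
lsd$p.value
```

Adding `block` removes block-to-block variation from the residual, which is why the residual degrees of freedom drop from 15 to 10.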
Correlation matrix computation
Nonparametric tests (check whether the data are normal)
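A minimal sketch of a correlation matrix with base R's `cor`, using the built-in `mtcars` dataset. The choice of columns is arbitrary.

```r
# mtcars ships with R; pick a few numeric columns
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Pearson correlation matrix (the default method)
cor_pearson <- cor(vars)
round(cor_pearson, 2)

# Spearman (rank-based) correlation, which does not assume normality
cor_spearman <- cor(vars, method = "spearman")
round(cor_spearman, 2)

# Significance test for a single pair of variables
pair_test <- cor.test(mtcars$mpg, mtcars$wt)
pair_test$p.value
```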
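Modeling approaches
- EDA (Exploratory Data Analysis):
  - Unsupervised learning (data mining)
  - Observation stage
  - No forced-in variables
  - Put every variable in, however poor the data (however high the p-value)
  - Advantage: may reveal hidden patterns, e.g. the beer and diapers example
  - Disadvantage: may find meaningless relationships, e.g. all users who have given birth are female
  - Variable selection by computer
    - Recursive Feature Elimination (RFE): start by fitting a model with all variables, then remove the variable with the largest p-value or lowest variable importance.
    - Backward selection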
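- CDA (Confirmatory Data Analysis):
  - Supervised learning
  - EDA produces a hypothesis; CDA is the stage that verifies it
  - Similar to traditional statistics (keep the variables with small p-values)
  - Confirms or rejects a hypothesis; there is a precise answer to measure
  - Results should be close to those of EDA; otherwise something may be wrong
  - VOI (Variable of Interest)
  - Variable selection by computer
    - Forward selection: begin by identifying the variables of interest. These variables may always be forced in; then add the variable that results in the lowest error.
    - Hybrid selection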
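The three selection strategies can be sketched with base R's `step`. Note that `step` ranks candidate models by AIC rather than by p-value, variable importance, or prediction error, so this is a stand-in for the procedures described above, not the same criterion. The dataset (`mtcars`), the candidate variables, and the forced-in variable `wt` are arbitrary assumptions.

```r
# mtcars ships with R; mpg is the outcome, a few columns are candidates
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)

# Backward selection: start from all variables and drop one at a time
backward <- step(full, direction = "backward", trace = 0)

# Forward selection with a forced-in variable of interest (wt, chosen arbitrarily)
forced <- lm(mpg ~ wt, data = mtcars)
scope <- list(lower = ~ wt,
              upper = ~ wt + hp + disp + qsec + drat)
forward <- step(forced, scope = scope, direction = "forward", trace = 0)

# Hybrid selection: variables may be added or dropped at each step
hybrid <- step(forced, scope = scope, direction = "both", trace = 0)

formula(backward)
formula(forward)
formula(hybrid)
```

The `lower` formula keeps `wt` in every candidate model, which mirrors the idea of forced-in variables of interest.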
Correlation does not imply causation, e.g. in image analysis, every image that contains a ruler is classified as skin cancer