# Hypothesis Test, Power Analysis and Randomization
###### tags: `R` `t-test` `Mann Whitney U Test` `statistics` `power analysis` `randomization` `reseasrch`
## Heart Dataset
**Data Set Information:** This dataset contains the medical records of **303 patients who had heart failure**, collected during their follow-up period, where each patient profile has **13 clinical features**.
* age: age in years
* sex: sex
* Value 0 = female
* Value 1 = male
* cp: chest pain type
* Value 0: typical angina
* Value 1: atypical angina
* Value 2: non-anginal pain
* Value 3: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg: resting electrocardiographic results
* Value 0: normal
* Value 1: having ST-T wave abnormality
* Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise induced angina
* Value 0 = no
* Value 1 = yes
* oldpeak = Continuous value for ST depression induced by exercise relative to rest
* slope: the slope of the peak exercise ST segment
* Value 0: upsloping
* Value 1: flat
* Value 2: downsloping
* ca: number of major vessels (0-3) colored by flourosopy
* thal: thalassemia
* Value 1 = normal
* Value 2 = fixed defect
* Value 3 = reversible defect
* target: heart disease
* Value 0 = no
* Value 1 = yes
**Original dataset version:** Tanvir Ahmad, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza: "Survival analysis of heart failure patients: a case study". PLoS ONE 12(7), 0181001 (2017).
<br/>
## Student's t-test and Mann Whitney U Test
Let compare the serum cholesterol levels of male and female
```r=
library(dplyr) # for data transformation
library(ggplot2) # for graph
# import heart.txt file and name it data1
data1 <- read.table("data/heart.txt", header = T, sep = "\t")
# first seperate male and female
male <- filter(data1, sex == 1)
female <- filter(data1, sex == 0)
# perform t-test
t.test(male$chol, female$chol)
# perform Mann Whitney U Test
wilcox.test(male$chol, female$chol)
# the better way
t.test(chol~sex, data = data1)
wilcox.test(chol~sex, data = data1)
# let's plot the cholsterol
ggplot(data = data1, aes(x = sex, y = chol)) +
geom_dotplot(binaxis='y', stackdir='center', stackratio=0.5, dotsize=0.3)
# what happened ????
class(data1$sex)
# We need to change the sex into categorical variables
data1$sex <- factor(data1$sex)
# Dot plot
ggplot(data = data1, aes(x = sex, y = chol)) +
geom_dotplot(binaxis='y', stackdir='center', stackratio=0.5, dotsize=0.3) +
scale_x_discrete(labels=c("0" = "Female", "1" = "male"))
# Box Plot
ggplot(data = data1, aes(x = sex, y = chol)) + geom_boxplot() +
scale_x_discrete(labels=c("0" = "Female", "1" = "male"))
```
<br/>
The plots
:::spoiler
Before converting sex into factor

After converting

:::
<br/>
Which statistical test should we use?
Which plot should we use?
```r=
# let find the mean and SD of chol
data1 %>% group_by(sex) %>% summarize(chol_mean = mean(chol), sd = sd(chol), n= n())
# calculate the quantile
quantile(male$chol)
quantile(female$chol)
# QQ plot
ggplot(data = data1, aes(sample = chol)) + stat_qq() + stat_qq_line()
ggplot(data = data1, aes(sample = chol, color=sex)) + stat_qq() + stat_qq_line()
# The Shapiro-Wilk Test For Normality
shapiro.test(data1$chol)
shapiro.test(male$chol)
shapiro.test(female$chol)
```
## Power Analysis
**The pwr package**
:::info
Assume the blood pressure of the population is normally distributed. ==The standard deviation== of the sampling BP is ==15 mmHg==. We want to identify if a medication can decrease the blood pressure of our target population. We want to have a alpha value = 0.05 and power = 0.8. How many samples do we need for a difference ==>10 mmHg? >15 mmHg? >20 mmHg?== What if the data is not normally distributed?
:::
```r=
# install the package (only first time)
install.packages("pwr")
# use the pwr library
library(pwr)
# pwr.t.test(n=, d=, sig.level=, power=, type= "two.sample")
# We already know sig.level= 0.05, power= 0.8
# We need to obtain d
# Write a function for effect size (es)
# effSize(x,y) x = standard deviation, y = the difference
effSize <- function(SD, dif){
return (dif/SD)
}
# assign standard deviation (sd)
sd <- 15
# calculate effect size
es.10 <- effSize(sd, 10)
es.15 <- effSize(sd, 15)
es.20 <- effSize(sd, 20)
# calculate the n
pwr.t.test(d = es.10, sig.level = 0.05, power = 0.8, type = "two.sample")
pwr.t.test(d = es.15, sig.level = 0.05, power = 0.8, type = "two.sample")
pwr.t.test(d = es.20, sig.level = 0.05, power = 0.8, type = "two.sample")
```
---
**The results of our power analysis**
:::spoiler

:::
<br/>
## Randomization
**The randomizeR-package**
:::info
Let us generate a list of complete randomization with sample size N = 15 and 2 groups.
How about three groups?
:::
```r=
# install randomizeR package
install.packages("randomizeR")
# load the package
library("randomizeR")
# crPar(N, K = 2), K group number
# creating sample object: object representing Complete Randomization with sample size n = 15
n <- 15
k <- 2
sample <- crPar(n, k)
# use genSeq() to generate a single randomization sequence
genSeq(sample)
# stored the results in the object "ramdom_sample"
ramdom_sample <- genSeq(sample)
# obtain the randomization sequence stored in ramdom_sample
getRandList(ramdom_sample)
# save the randomization sequence in "myRandList.csv"
saveRand(ramdom_sample, file="myRandList.csv")
```
<br/>
## Review
**Let us review some R codes from last time…**
```r=
# install openxlsx
install.packages("openxlsx")
# load library
library(openxlsx)
library(dplyr)
library(dslabs)
library(ggplot2)
# show only 3 digits after decimal point
options(digits = 3)
# load heights data from dslabs
data(heights)
# calculate mean, SD, quantile
mean(heights$height)
sd(heights$height)
quantile(heights$height)
summary(heights)
# The Shapiro-Wilk Test For Normality
shapiro.test(heights$height)
# seperate the data by sex
# %>% is pipe
female <- heights %>% filter(sex =="Female" )
male <- heights %>% filter(sex =="Male" )
# calculate mean, SD, quantile
mean(male$height)
mean(female$height)
sd(male$height)
sd(female$height)
summary(male$height)
summary(female$height)
# The Shapiro-Wilk Test For Normality
shapiro.test(male$height)
shapiro.test(female$height)
# Summary
sex <- c("Male", "Female", "Total")
count <- c(length(male$height),length(female$height), length(heights$height) )
mean <- c(round(mean(male$height), digits = 2), round(mean(female$height), digits = 2), round (mean(heights$height), digits = 2))
SD <- c(round(sd(male$height), digits = 2), round(sd(female$height), digits = 2), round(sd(heights$height), digits = 2))
my.table <- cbind(Sex = sex, Count = count, Mean = mean, SD = SD)
my.table
# Make a function (this is better)
mySummary <- function(x){
return(c( Count = length(x), Mean = round(mean(x), digits = 2), SD = round(sd(x), digits = 2)))
}
male.sum <- mySummary(male$height)
female.sum <- mySummary(female$height)
total.sum <- mySummary(heights$height)
my.table2 <- rbind(Male = male.sum, Female = female.sum, Total = total.sum )
my.table2
# write to xlsx file
write.xlsx(my.table2, file = "table2.xlsx", colNames = T, rowNames = T)
# dotplot
ggplot(data=heights, aes(x = sex, y = height)) + geom_dotplot(binaxis='y', stackdir='center', stackratio=0.5, dotsize=0.3)
```
**The Graph**
:::spoiler

:::
<br/>
# The End