# Hypothesis Test, Power Analysis and Randomization ###### tags: `R` `t-test` `Mann Whitney U Test` `statistics` `power analysis` `randomization` `reseasrch` ## Heart Dataset **Data Set Information:** This dataset contains the medical records of **303 patients who had heart failure**, collected during their follow-up period, where each patient profile has **13 clinical features**. * age: age in years * sex: sex * Value 0 = female * Value 1 = male * cp: chest pain type * Value 0: typical angina * Value 1: atypical angina * Value 2: non-anginal pain * Value 3: asymptomatic * trestbps: resting blood pressure (in mm Hg on admission to the hospital) * chol: serum cholesterol in mg/dl * fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) * restecg: resting electrocardiographic results * Value 0: normal * Value 1: having ST-T wave abnormality * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria * thalach: maximum heart rate achieved * exang: exercise induced angina * Value 0 = no * Value 1 = yes * oldpeak = Continuous value for ST depression induced by exercise relative to rest * slope: the slope of the peak exercise ST segment * Value 0: upsloping * Value 1: flat * Value 2: downsloping * ca: number of major vessels (0-3) colored by flourosopy * thal: thalassemia * Value 1 = normal * Value 2 = fixed defect * Value 3 = reversible defect * target: heart disease * Value 0 = no * Value 1 = yes **Original dataset version:** Tanvir Ahmad, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza: "Survival analysis of heart failure patients: a case study". PLoS ONE 12(7), 0181001 (2017). <br/> ## Student's t-test and Mann Whitney U Test Let compare the serum cholesterol levels of male and female ```r= library(dplyr) # for data transformation library(ggplot2) # for graph # import heart.txt file and name it data1 data1 <- read.table("data/heart.txt", header = T, sep = "\t") # first seperate male and female male <- filter(data1, sex == 1) female <- filter(data1, sex == 0) # perform t-test t.test(male$chol, female$chol) # perform Mann Whitney U Test wilcox.test(male$chol, female$chol) # the better way t.test(chol~sex, data = data1) wilcox.test(chol~sex, data = data1) # let's plot the cholsterol ggplot(data = data1, aes(x = sex, y = chol)) + geom_dotplot(binaxis='y', stackdir='center', stackratio=0.5, dotsize=0.3) # what happened ???? class(data1$sex) # We need to change the sex into categorical variables data1$sex <- factor(data1$sex) # Dot plot ggplot(data = data1, aes(x = sex, y = chol)) + geom_dotplot(binaxis='y', stackdir='center', stackratio=0.5, dotsize=0.3) + scale_x_discrete(labels=c("0" = "Female", "1" = "male")) # Box Plot ggplot(data = data1, aes(x = sex, y = chol)) + geom_boxplot() + scale_x_discrete(labels=c("0" = "Female", "1" = "male")) ``` <br/> The plots :::spoiler Before converting sex into factor ![](https://i.imgur.com/tQHqdgH.png) After converting ![](https://i.imgur.com/2Y7UVHM.png) ::: <br/> Which statistical test should we use? Which plot should we use? ```r= # let find the mean and SD of chol data1 %>% group_by(sex) %>% summarize(chol_mean = mean(chol), sd = sd(chol), n= n()) # calculate the quantile quantile(male$chol) quantile(female$chol) # QQ plot ggplot(data = data1, aes(sample = chol)) + stat_qq() + stat_qq_line() ggplot(data = data1, aes(sample = chol, color=sex)) + stat_qq() + stat_qq_line() # The Shapiro-Wilk Test For Normality shapiro.test(data1$chol) shapiro.test(male$chol) shapiro.test(female$chol) ``` ## Power Analysis **The pwr package** :::info Assume the blood pressure of the population is normally distributed. ==The standard deviation== of the sampling BP is ==15 mmHg==. We want to identify if a medication can decrease the blood pressure of our target population. We want to have a alpha value = 0.05 and power = 0.8. How many samples do we need for a difference ==>10 mmHg? >15 mmHg? >20 mmHg?== What if the data is not normally distributed? ::: ```r= # install the package (only first time) install.packages("pwr") # use the pwr library library(pwr) # pwr.t.test(n=, d=, sig.level=, power=, type= "two.sample") # We already know sig.level= 0.05, power= 0.8 # We need to obtain d # Write a function for effect size (es) # effSize(x,y) x = standard deviation, y = the difference effSize <- function(SD, dif){ return (dif/SD) } # assign standard deviation (sd) sd <- 15 # calculate effect size es.10 <- effSize(sd, 10) es.15 <- effSize(sd, 15) es.20 <- effSize(sd, 20) # calculate the n pwr.t.test(d = es.10, sig.level = 0.05, power = 0.8, type = "two.sample") pwr.t.test(d = es.15, sig.level = 0.05, power = 0.8, type = "two.sample") pwr.t.test(d = es.20, sig.level = 0.05, power = 0.8, type = "two.sample") ``` --- **The results of our power analysis** :::spoiler ![](https://i.imgur.com/O8rjGgd.png) ::: <br/> ## Randomization **The randomizeR-package** :::info Let us generate a list of complete randomization with sample size N = 15 and 2 groups. How about three groups? ::: ```r= # install randomizeR package install.packages("randomizeR") # load the package library("randomizeR") # crPar(N, K = 2), K group number # creating sample object: object representing Complete Randomization with sample size n = 15 n <- 15 k <- 2 sample <- crPar(n, k) # use genSeq() to generate a single randomization sequence genSeq(sample) # stored the results in the object "ramdom_sample" ramdom_sample <- genSeq(sample) # obtain the randomization sequence stored in ramdom_sample getRandList(ramdom_sample) # save the randomization sequence in "myRandList.csv" saveRand(ramdom_sample, file="myRandList.csv") ``` <br/> ## Review **Let us review some R codes from last time…** ```r= # install openxlsx install.packages("openxlsx") # load library library(openxlsx) library(dplyr) library(dslabs) library(ggplot2) # show only 3 digits after decimal point options(digits = 3) # load heights data from dslabs data(heights) # calculate mean, SD, quantile mean(heights$height) sd(heights$height) quantile(heights$height) summary(heights) # The Shapiro-Wilk Test For Normality shapiro.test(heights$height) # seperate the data by sex # %>% is pipe female <- heights %>% filter(sex =="Female" ) male <- heights %>% filter(sex =="Male" ) # calculate mean, SD, quantile mean(male$height) mean(female$height) sd(male$height) sd(female$height) summary(male$height) summary(female$height) # The Shapiro-Wilk Test For Normality shapiro.test(male$height) shapiro.test(female$height) # Summary sex <- c("Male", "Female", "Total") count <- c(length(male$height),length(female$height), length(heights$height) ) mean <- c(round(mean(male$height), digits = 2), round(mean(female$height), digits = 2), round (mean(heights$height), digits = 2)) SD <- c(round(sd(male$height), digits = 2), round(sd(female$height), digits = 2), round(sd(heights$height), digits = 2)) my.table <- cbind(Sex = sex, Count = count, Mean = mean, SD = SD) my.table # Make a function (this is better) mySummary <- function(x){ return(c( Count = length(x), Mean = round(mean(x), digits = 2), SD = round(sd(x), digits = 2))) } male.sum <- mySummary(male$height) female.sum <- mySummary(female$height) total.sum <- mySummary(heights$height) my.table2 <- rbind(Male = male.sum, Female = female.sum, Total = total.sum ) my.table2 # write to xlsx file write.xlsx(my.table2, file = "table2.xlsx", colNames = T, rowNames = T) # dotplot ggplot(data=heights, aes(x = sex, y = height)) + geom_dotplot(binaxis='y', stackdir='center', stackratio=0.5, dotsize=0.3) ``` **The Graph** :::spoiler ![](https://i.imgur.com/nOow7GK.png) ::: <br/> # The End