
# Chi Square and Fisher’s Exact Test

###### tags: `R` `Statistics` `Chi-Square` `Fisher's Exact Test` `p-value` `Alpha Inflation`

## Heart Dataset

**Data Set Information:**

This dataset contains the medical records of **303 patients who had heart failure**, collected during their follow-up period, where each patient profile has **13 clinical features**.

* age: age in years
* sex: sex
    * Value 0 = female
    * Value 1 = male
* cp: chest pain type
    * Value 0: typical angina
    * Value 1: atypical angina
    * Value 2: non-anginal pain
    * Value 3: asymptomatic
* trestbps: resting blood pressure (in mm Hg on admission to the hospital)
* chol: serum cholesterol in mg/dl
* fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
* restecg: resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalach: maximum heart rate achieved
* exang: exercise-induced angina
    * Value 0 = no
    * Value 1 = yes
* oldpeak: ST depression induced by exercise relative to rest (continuous value)
* slope: the slope of the peak exercise ST segment
    * Value 0: upsloping
    * Value 1: flat
    * Value 2: downsloping
* ca: number of major vessels (0-3) colored by fluoroscopy
* thal: thalassemia
    * Value 1 = normal
    * Value 2 = fixed defect
    * Value 3 = reversible defect
* target: heart disease
    * Value 0 = no
    * Value 1 = yes

**Original dataset version:** Tanvir Ahmad, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza: "Survival analysis of heart failure patients: a case study". PLoS ONE 12(7), 0181001 (2017).

## Chi Square and Fisher's Exact Test

* We will first import the heart file (`heart.txt`).
* First, put your heart file into the **"data" folder**.

```r=
# load library
library(ggplot2) # for graphs

# import the heart.txt file and name it data
data <- read.table("data/heart.txt", header = TRUE, sep = "\t")

# check the structure of the dataset
str(data)
```

The sex ratio in humans is about 1:1, although the natural ratio between males and females at birth is slightly biased towards males. The sex ratio for the entire world population is **101 males to 100 females (2018 est.)**. Let us use the chi-squared goodness-of-fit test to check whether the sex ratio in our data fits that of the world population.

```r=
# Goodness-of-fit test:
# chisq.test(x = observed counts, p = expected probabilities)

# count males (sex is coded 1 = male)
male <- sum(data$sex)
# count females (sex is coded 0 = female)
female <- length(data$sex) - male

# calculate the expected probabilities
male_p <- 101 / (101 + 100)
female_p <- 100 / (101 + 100)

# perform the chi-square goodness-of-fit test
chisq.test(x = c(male, female), p = c(male_p, female_p))
```

* To perform a chi-square **test of independence** with `chisq.test()`, we first need to make a contingency table.
```r=
# make a table manually from a matrix
Sex <- matrix(c(1, 2, 3, 4), ncol = 2, byrow = TRUE)
colnames(Sex) <- c("E", "L")
rownames(Sex) <- c("M", "F")
Sex <- as.table(Sex)
Sex

# make a table from the data
my_table1 <- table(data$sex, data$cp)
my_table1

# make a table with defined labels
my_table2 <- table(Sex = data$sex, Angina_type = data$cp)
my_table2

# change the row names
rownames(my_table2) <- c("female", "male")
my_table2

# change the column names
colnames(my_table2) <- c("Typical", "Atypical", "Non-anginal pain", "Asymptomatic")
my_table2

# perform the chi-square test of independence
chisq.test(my_table1)
chisq.test(my_table2)

# the shortcut
chisq.test(table(data$sex, data$cp))

# perform Fisher's Exact Test
fisher.test(my_table2)

# let us make a graph
# treat sex and cp as categorical so that fill colors and the
# discrete x scale (with its labels) apply correctly
data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
ggplot(data = data, aes(x = cp, fill = sex)) +
  geom_bar()

# mild adjustment: side-by-side bars and readable x-axis labels
ggplot(data = data, aes(x = cp, fill = sex)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(
    labels = c("Typical", "Atypical", "Non-anginal", "Asymptomatic")
  )
```

## Alpha Inflation

* In multiple comparisons, a test is performed several times on the means of the experimental conditions.
* For example, comparing three groups requires three tests: group A versus group B, group B versus group C, and group A versus group C.
* The set of comparisons is called a **family**.
* The type I error that occurs across the comparisons in a family is called the **family-wise error (FWE)**.
* Inflated α: $\alpha_{\text{inflated}} = 1 - (1 - \alpha)^N$, where $N$ is the number of hypotheses tested (assuming independent tests).
* If we perform 20 hypothesis tests, is a per-test significance threshold of 0.05 still acceptable?
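To make the inflation formula concrete, it can be evaluated directly for a few family sizes. The following is a minimal sketch, assuming independent tests and a per-test α of 0.05; the helper name `inflated_alpha` is hypothetical and only used for illustration.

```r
# Family-wise error rate for n_tests independent tests, each at level alpha.
# `inflated_alpha` is an illustrative helper, not a base R function.
inflated_alpha <- function(alpha, n_tests) {
  1 - (1 - alpha)^n_tests
}

# family sizes to compare: a single test, the three-group example, and 20 tests
n_tests <- c(1, 3, 20)
fwe <- inflated_alpha(alpha = 0.05, n_tests = n_tests)

# show the number of tests next to the inflated alpha
data.frame(n_tests = n_tests, inflated_alpha = round(fwe, 3))
```

With 20 tests the inflated α is roughly 0.64, so at least one false positive is more likely than not under these assumptions.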
```r=
# Let us use p.adjust() to adjust p-values
# p.adjust(p, method = p.adjust.methods, n = number of comparisons)
# p.adjust.methods: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr"

p.adjust(0.05, method = "holm", n = 20)
p.adjust(0.05, method = "hochberg", n = 20)
p.adjust(0.05, method = "hommel", n = 20)
p.adjust(0.05, method = "bonferroni", n = 20)

p.adjust(0.01, method = "bonferroni", n = 20)
p.adjust(0.005, method = "bonferroni", n = 20)
p.adjust(0.0025, method = "bonferroni", n = 20)
```
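In practice, `p.adjust()` is usually given the whole vector of p-values from a family of tests rather than one value at a time. Here is a small sketch of that usage; the p-values are hypothetical numbers chosen for illustration, not results from the heart dataset.

```r
# Hypothetical raw p-values from a family of five tests (illustrative only)
p_values <- c(0.001, 0.012, 0.030, 0.048, 0.200)

# n defaults to length(p_values), so it does not need to be passed here
adj_bonferroni <- p.adjust(p_values, method = "bonferroni")
adj_holm <- p.adjust(p_values, method = "holm")
adj_bh <- p.adjust(p_values, method = "BH")

# compare the raw and adjusted p-values side by side
data.frame(
  raw = p_values,
  bonferroni = adj_bonferroni,
  holm = adj_holm,
  BH = adj_bh
)

# indices of the tests that remain significant at 0.05 after Holm adjustment
which(adj_holm < 0.05)
```

The adjusted p-values are compared against the original α (here 0.05), so the threshold itself does not need to be changed.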