# Correlation ###### tags: `R` `statistics` `correlation` `Simpson’s paradox` ## Dataset **Data Set Information:** ** This dataset contains the medical records of **66 patients who had Hepatic portal venousgas(HPVG)**. Each patient profile has **13 clinical variables**. <br/> **Attribute Information:** * Patient No: id * Sex (0:Female; 1:Male): sex * Age (years): age * Symptom onset to ED presentation (hours): p_time * Body temperature (℃): bt * Pulse rate (bpm): pr * Respiratory rate (breaths/min):rr * Mean arterial pressure (mmHg): bp * Rapid Acute Physiology Score: RAPS * Rapid Emergency Medicine Score: REMS * Modified Early Warning Score: MEWS * Management (0:Conservative; 1:Surgery): management * ED presentation to operation (hours): op_time * End outcome (0:Survival; 1:Death): outcome [PLoS ONE12(9):e0184813.](https://doi.org/10.1371/journal.pone.0184813) ## Correlation We will use the PLOS one dataset to do the analysis. First, we will create correlation matrix. Correlation matrix allows you to determine which pairs of variables require further investigation. Then, we will perform correlation test individually. ```r= # load the library library(tidyverse) library(PerformanceAnalytics) # load the dataset data <- read.table("data/pone_data.txt", header = T, sep = "\t") # create correlation matrix chart.Correlation(data, histogram=TRUE, method = "spearman", pch=19) ``` <br/> **The Correlation Matrix** :::spoiler ![](https://i.imgur.com/TJ1v5Tg.png) ::: Based on the correlation matrix, we can identify the paired variables which are related. ```r= # first, plot the graphe plot(data$RAPS, data$REMS) plot(data$RAPS, data$MEWS) plot(data$REMS, data$MEWS) # Pearson correlation cor.test(data$RAPS, data$REMS, method = "pearson") cor.test(data$RAPS, data$MEWS, method = "pearson") cor.test(data$REMS, data$MEWS, method = "pearson") # Spearman correlation cor.test(data$RAPS, data$REMS, method = "spearman") cor.test(data$RAPS, data$MEWS, method = "spearman") cor.test(data$REMS, data$MEWS, method = "spearman") ``` **Pearson or Spearman?** ## Review **Let us review some R codes from last time…** Let's make the table 1... ```r= # load the library library(tidyverse) library(PerformanceAnalytics) library(openxlsx) # load the dataset data <- read.table("data/pone_data.txt", header = T, sep = "\t") # separate the dataset into different groups data_m <- filter(data, sex == 1) survivor <- filter(data, outcome == 0) survivor_m <- filter(survivor, sex == 1) non_survivor <- filter(data, outcome == 1) non_survivor_m <- filter(non_survivor, sex == 1) ``` Then, we need to make the table row-by-row ![](https://i.imgur.com/0tNesTL.png) <br/> ```r= #=============================================================================== # number of patient number <- c(length(data$id), length(non_survivor$id), length(survivor$id), NA) #=============================================================================== # sex % table_mf <- table(data$sex, data$outcome) c_test <- chisq.test(table_mf) total_male <- paste(length(data_m$id), "(",round(100 * (length(data_m$id)/length(data$id)),digits = 2), ")") non_surv_male <- paste(length(non_survivor_m$id), "(",round(100 * (length(non_survivor_m$id)/length(non_survivor$id)),digits = 2), ")") surv_male <- paste(length(survivor_m$id), "(",round(100 * (length(survivor_m$id)/length(survivor$id)),digits = 2), ")") sex_p <- round(c_test$p.value, digits = 4 ) male <- c(total_male, non_surv_male, surv_male, sex_p ) #=============================================================================== # Age # normal distribution ? shapiro.test(data$age) my_parametric <- function(x, y, z){ test <- t.test(y, z) x_r <- paste(round(mean(x),digits = 3 ), "(", round(sd(x),digits = 3), ")") y_r <- paste(round(mean(y),digits = 3 ), "(", round(sd(y),digits = 3), ")") z_r <- paste(round(mean(z),digits = 3 ), "(", round(sd(z),digits = 3), ")") p_v <- round(test$p.value,digits = 4 ) return(c(x_r, y_r, z_r, p_v)) } # use the function age <- my_parametric(data$age, non_survivor$age,survivor$age ) #=============================================================================== #p_time # normal distribution ? shapiro.test(data$p_time) # Write a function for non_parametric data my_non_parametric <- function(x, y, z){ test_2 <- wilcox.test(y, z) x_r <- paste(median(x), "(", quantile(x)[[3]],"-",quantile(x)[[2]], ")") y_r <- paste(median(y), "(", quantile(y)[[3]],"-",quantile(y)[[2]], ")") z_r <- paste(median(z), "(", quantile(z)[[3]],"-",quantile(z)[[2]], ")") p_v <- round(test_2$p.value,digits = 4 ) return(c(x_r, y_r, z_r, p_v)) } # use the function p_time <- my_non_parametric(data$p_time, non_survivor$p_time,survivor$p_time) #=============================================================================== # combine the rows into a table and name the rows my_final_table <- rbind(Number = number, Male = male, Age = age, P_time = p_time ) # rename the column colnames(my_final_table) <- c( "Total", "Non-survivors","Survivors", "p-Value" ) # save into a file write.xlsx(my_final_table, file = "table1.xlsx", colNames = T, rowNames = T) ```