# Correlation
###### tags: `R` `statistics` `correlation` `Simpson’s paradox`
## Dataset
**Data Set Information:**
** This dataset contains the medical records of **66 patients who had Hepatic portal venousgas(HPVG)**. Each patient profile has **13 clinical variables**.
<br/>
**Attribute Information:**
* Patient No: id
* Sex (0:Female; 1:Male): sex
* Age (years): age
* Symptom onset to ED presentation (hours): p_time
* Body temperature (℃): bt
* Pulse rate (bpm): pr
* Respiratory rate (breaths/min):rr
* Mean arterial pressure (mmHg): bp
* Rapid Acute Physiology Score: RAPS
* Rapid Emergency Medicine Score: REMS
* Modified Early Warning Score: MEWS
* Management (0:Conservative; 1:Surgery): management
* ED presentation to operation (hours): op_time
* End outcome (0:Survival; 1:Death): outcome
[PLoS ONE12(9):e0184813.](https://doi.org/10.1371/journal.pone.0184813)
## Correlation
We will use the PLOS one dataset to do the analysis. First, we will create correlation matrix. Correlation matrix allows you to determine which pairs of variables require further investigation. Then, we will perform correlation test individually.
```r=
# load the library
library(tidyverse)
library(PerformanceAnalytics)
# load the dataset
data <- read.table("data/pone_data.txt", header = T, sep = "\t")
# create correlation matrix
chart.Correlation(data, histogram=TRUE, method = "spearman", pch=19)
```
<br/>
**The Correlation Matrix**
:::spoiler

:::
Based on the correlation matrix, we can identify the paired variables which are related.
```r=
# first, plot the graphe
plot(data$RAPS, data$REMS)
plot(data$RAPS, data$MEWS)
plot(data$REMS, data$MEWS)
# Pearson correlation
cor.test(data$RAPS, data$REMS, method = "pearson")
cor.test(data$RAPS, data$MEWS, method = "pearson")
cor.test(data$REMS, data$MEWS, method = "pearson")
# Spearman correlation
cor.test(data$RAPS, data$REMS, method = "spearman")
cor.test(data$RAPS, data$MEWS, method = "spearman")
cor.test(data$REMS, data$MEWS, method = "spearman")
```
**Pearson or Spearman?**
## Review
**Let us review some R codes from last time…**
Let's make the table 1...
```r=
# load the library
library(tidyverse)
library(PerformanceAnalytics)
library(openxlsx)
# load the dataset
data <- read.table("data/pone_data.txt", header = T, sep = "\t")
# separate the dataset into different groups
data_m <- filter(data, sex == 1)
survivor <- filter(data, outcome == 0)
survivor_m <- filter(survivor, sex == 1)
non_survivor <- filter(data, outcome == 1)
non_survivor_m <- filter(non_survivor, sex == 1)
```
Then, we need to make the table row-by-row

<br/>
```r=
#===============================================================================
# number of patient
number <- c(length(data$id), length(non_survivor$id), length(survivor$id), NA)
#===============================================================================
# sex %
table_mf <- table(data$sex, data$outcome)
c_test <- chisq.test(table_mf)
total_male <- paste(length(data_m$id), "(",round(100 * (length(data_m$id)/length(data$id)),digits = 2), ")")
non_surv_male <- paste(length(non_survivor_m$id), "(",round(100 * (length(non_survivor_m$id)/length(non_survivor$id)),digits = 2), ")")
surv_male <- paste(length(survivor_m$id), "(",round(100 * (length(survivor_m$id)/length(survivor$id)),digits = 2), ")")
sex_p <- round(c_test$p.value, digits = 4 )
male <- c(total_male, non_surv_male, surv_male, sex_p )
#===============================================================================
# Age
# normal distribution ?
shapiro.test(data$age)
my_parametric <- function(x, y, z){
test <- t.test(y, z)
x_r <- paste(round(mean(x),digits = 3 ), "(", round(sd(x),digits = 3), ")")
y_r <- paste(round(mean(y),digits = 3 ), "(", round(sd(y),digits = 3), ")")
z_r <- paste(round(mean(z),digits = 3 ), "(", round(sd(z),digits = 3), ")")
p_v <- round(test$p.value,digits = 4 )
return(c(x_r, y_r, z_r, p_v))
}
# use the function
age <- my_parametric(data$age, non_survivor$age,survivor$age )
#===============================================================================
#p_time
# normal distribution ?
shapiro.test(data$p_time)
# Write a function for non_parametric data
my_non_parametric <- function(x, y, z){
test_2 <- wilcox.test(y, z)
x_r <- paste(median(x), "(", quantile(x)[[3]],"-",quantile(x)[[2]], ")")
y_r <- paste(median(y), "(", quantile(y)[[3]],"-",quantile(y)[[2]], ")")
z_r <- paste(median(z), "(", quantile(z)[[3]],"-",quantile(z)[[2]], ")")
p_v <- round(test_2$p.value,digits = 4 )
return(c(x_r, y_r, z_r, p_v))
}
# use the function
p_time <- my_non_parametric(data$p_time, non_survivor$p_time,survivor$p_time)
#===============================================================================
# combine the rows into a table and name the rows
my_final_table <- rbind(Number = number, Male = male, Age = age, P_time = p_time )
# rename the column
colnames(my_final_table) <- c( "Total", "Non-survivors","Survivors", "p-Value" )
# save into a file
write.xlsx(my_final_table, file = "table1.xlsx", colNames = T, rowNames = T)
```