###### tags: `R` `cancer` `oncology` `survival analysis` `progression-free survival` `overall survival` `disease-free survival` `Kaplan-Meier` `log rank test` `Cox proportional hazards regression` `hazard ratio` `relevel()` `coxph()` `lubridate::time_length()` `HNSCC` `BRAF` `CTLA4` `CD155` `CPI` `checkpoint blockade immunotherapy` `Treg`
# Survival analysis in cancer immunology with R
---
## Using the R function in next section to get cut points for continuous predicting variables (predictors), dichotomise the predictors and run univariable cox regression
* Download a survival data file from QIMR L drive
* L:/Lab_MarkS/lunC/work/Immunohistochemistry_images/data_output/AP_Exp108.1_PeterMac-lungCancer-CD8-PD1/analysis-results/PeterMac-lung-cancer-cohort-1-DFS-marker-counts.tsv
* Create a new R script file. Copy the code chunk and modify the path of input, output files or folders
* Run the `source()` function to load the R function to the working environment.
* Modify the file path within the double quotes. Make sure backword slashes \ are replaced by forward slashes / when the path is copied from Windows address bar.
* In R, place your cursor anywhere in the line with the `source()` and run this function by pressing the Ctrl and Enter keys at the same time.
* Modify the values that will be taken by the following arguments in the function `CutpointUnivariableSurvivalAnalysis()`
* `input.survival.data.file.path=` This is the full path of a survival data file. This file should be tab-separated and have at least four columns in the order of:
* column 1 : A grouping variable that uniquely identifies rows in the file (e.g., cohort_TMA_row_col)
* column 2 : A survival outcome variable (e.g., overall survival in years)
* column 3 : A censoring variable
* column 4-last : Predictors (e.g., number of CD8+ cells per image, number of CD226+ cells per image, number of CD8+CD226+ cells per image, CD226 ratio)
* `column.position.ID=1` By default the function reads column 1 as the grouping variable. This can be ignored when calling the function unless your grouping variable is at a position other than column 1.
* `column.position.survival.outcome=2` By default the function reads column 2 as the survival outcome. This can be ignored when calling the function unless your survival outcome is at a position other than column 2.
* `column.position.censoring.variable=3` By default the function reads column 3 as the censoring variable. This can be ignored when calling the function unless your censoring variable is at a position other than column 3.
* `column.positions.predictors=c(4:7)` By default the function reads column 4 to 7 as the predictoring variables. This can be ignored when calling the function unless you have more/less than 4 variables.
* `output.folder.path=` This is the folder path where the result file will be exported. Manually create this folder and replace the folder path. Do not end the path with /
```r!
source("E:/Lab_MarkS/lunC_work/Immunohistochemistry_images/scripts/RFunction_get-cutpoints_run-univariable-surivival-analysis.R")
CutpointUnivariableSurvivalAnalysis( input.survival.data.file.path="E:/Lab_MarkS/lunC_work/Immunohistochemistry_images/data_output/AP_Exp108.1_PeterMac-lungCancer-CD8-PD1/analysis-results/PeterMac-lung-cancer-cohort-1-DFS-marker-counts.tsv"
,output.folder.path="E:/Lab_MarkS/lunC_work/Immunohistochemistry_images/data_output/AP_Exp108.1_PeterMac-lungCancer-CD8-PD1/analysis-results")
```
* The R code above is run in the R script file `survival-analysis_Exp108v1_CD8-PD1.R` from QIMR L drive `L:/Lab_MarkS/lunC/work/Immunohistochemistry_images/scripts`
---
## A R function to get cut points for continuous predicting variables (predictors), dichotomise the predictors and run univariable cox regression
* Install java software from [QIMR KACE downloads](http://kace01s.adqimr.ad.lan/userui/software_library.php)
* Download the R script file `RFunction_get-cutpoints_run-univariable-surivival-analysis.R` from QIMR L drive `L:/Lab_MarkS/lunC/work/Immunohistochemistry_images/scripts`
* The input survival data file should be
* tab-separated
* having at least four columns in the order of:
* column 1 : A grouping variable that uniquely identifies rows in the file (e.g., cohort_TMA_row_col)
* column 2 : A survival outcome variable (e.g., overall survival in years)
* column 3 : A censoring variable
* column 4-last : Predictors (e.g., number of CD8+ cells per image, number of CD226+ cells per image, number of CD8+CD226+ cells per image, CD226 ratio)
```r!
# Check if required packages are already installed
if (!"dplyr" %in% rownames(installed.packages())) install.packages("dplyr")
if (!"maxstat" %in% rownames(installed.packages())) install.packages("maxstat")
if (!"survminer" %in% rownames(installed.packages())) install.packages("survminer")
# Load packages
library(dplyr)
CutpointUnivariableSurvivalAnalysis<- function( input.survival.data.file.path="E:/Lab_MarkS/lunC_work/Immunohistochemistry_images/data_output/AP_Exp108.1_PeterMac-lungCancer-CD8-PD1/analysis-results/PeterMac-lung-cancer-cohort-1-DFS-marker-counts.tsv"
,column.position.ID=1
,column.position.survival.outcome=2
,column.position.censoring.variable=3
,column.positions.predictors=c(4:7) # A number range (e.g., c(4:6))
,output.folder.path="E:/Lab_MarkS/lunC_work/Immunohistochemistry_images/data_output/AP_Exp108.1_PeterMac-lungCancer-CD8-PD1/analysis-results"){
# Check file existence
if(file.exists(input.survival.data.file.path)!=TRUE){
cat("Input survival data file could not be found")
} else {
# Import data
data <- read.delim(file=input.survival.data.file.path, header = TRUE, sep = "\t") %>%
dplyr::select(c( column.position.ID
,column.position.survival.outcome
,column.position.censoring.variable
,column.positions.predictors)) %>%
dplyr::mutate(row.numb= c(1:nrow(.))) # dim(data) 368 8
#------------------------
# Dichotimise a predictor
#------------------------
# Get column names for predictor, survival outcome and censoring variable
ID <- names(data)[column.position.ID]
predictors <- names(data)[column.positions.predictors] # length(predictors) 4
# Make survival time and censor variable names
survival.time.varname <- names(data)[column.position.survival.outcome]
survival.event.varname <- names(data)[column.position.censoring.variable]
# Make an equation
formulas.MSLRS <- sapply(predictors
,function(x) as.formula(paste( 'survival::Surv('
,survival.time.varname
,','
,survival.event.varname
, ')~'
, x))) # class(formulas.MSLRS) "list"
# Fit LogRank test models using the equation (if there are > 1 predictors)
models.MSLRS <- lapply(formulas.MSLRS
,function(x){maxstat::maxstat.test(x
,data = data
,smethod="LogRank"
,pmethod="exactGauss"
,abseps=0.01)}) # class(models.MSLRS) "list"
# Extract data from the modelling results into a list
results.MSLRS <- lapply(models.MSLRS
,function(x){
model.equation <- x$data.name # class(predictors)
# Extract values from the object maxtest
statistic <- signif(x$statistic) # class(statistic) [1] "numeric"
p.value <- signif(x$p.value, digits=2)
estimated.cutpoint <- x$estimate # class(estimated.cutpoint) [1] "numeric"
res <-c( survival.time.varname
,statistic
,p.value
,estimated.cutpoint
,model.equation)
names(res)<-c( "survival.time.varname"
,"statistic"
,"p.value"
,"cutpoint"
,"model.equation")
return(res)
}
) # class(results.MSLRS) "list"
# Convert the list to a matrix and then to a df
results.MSLRS.matrix <- t(as.data.frame(results.MSLRS, check.names = FALSE))
results.MSLRS.df <- as.data.frame(results.MSLRS.matrix) # dim(results.MSLRS.df) 1 5
# Add predictor name
results.MSLRS.df$predictor <- row.names(results.MSLRS.df)
# Remove row.names
row.names(results.MSLRS.df) <- NULL
# Reorder columns
column.order.cutpoints <- c("survival.time.varname","predictor","statistic","p.value","cutpoint","model.equation")
results.MSLRS.df <- results.MSLRS.df %>% dplyr::select_(.dots = column.order.cutpoints) # dim(results.MSLRS.df) 4 6
# Categorise predictor values into 0 (low) and 1 (high) using cutpoints
.predictors.categorised <- survminer::surv_cutpoint( data=data
,time= survival.time.varname
,event= survival.event.varname
,variables = predictors
,minprop = 0.1
,progressbar = TRUE) %>%
# Dichotimise continuous predictor values into 0 (i.e., low) and 1 (i.e., high)
survminer::surv_categorize(x=., labels = c(0,1)) # dim(.predictors.categorised) 368 6
# Change the class of data attribute to data.frame rahter than two types ("data.frame", "surv_categorize")
attributes(.predictors.categorised)$class <- "data.frame"
# Create row serial number as merging key
.predictors.categorised <- .predictors.categorised %>%
dplyr::mutate(row.numb=c(1:nrow(.))) %>%
dplyr::select(row.numb,everything())
#.predictors.categorised$row.numb <- c(1:nrow(.predictors.categorised)) # dim(.predictors.categorised) 368 7
# Make a new column name for categorised predictor
position.last.element.predictors.categorised <- length(names(.predictors.categorised))
predictor.new <- paste0(names(.predictors.categorised)[4:position.last.element.predictors.categorised],"_cat.num")
.predictors.categorised.2 <- .predictors.categorised
colnames(.predictors.categorised.2) <- c( names(.predictors.categorised)[1]
,names(.predictors.categorised)[2]
,names(.predictors.categorised)[3]
,predictor.new)
# Merge categorised predictors back with predictor value columns
predictors.categorised <- dplyr::left_join( x= data[,c("row.numb", ID, survival.time.varname, survival.event.varname, predictors)] # dim(data[,c("row.numb", ID, survival.time.varname, survival.event.varname, predictors)]) 368 8
,y= .predictors.categorised.2 # dim(.predictors.categorised.2) 368 7
,by=c( "row.numb"
,quo_name(survival.time.varname)
,quo_name(survival.event.varname)
)) # dim(predictors.categorised) 368 12
#---------------------------------------------------------------------------------------------------------
# Run univariable cox regression of single predictors on time to survival event
# Check assumption of proportional hazard
# Plot scaled Schoenfeld residuals against survival time
# Number of predictors per model: 1
#---------------------------------------------------------------------------------------------------------
uni.cox.model.base.data <- data.frame()
uni.cox.assumption.test <- data.frame()
count <- 0
data <- predictors.categorised
# get column names of categorised predictors
predictors.varnames <- grep(pattern = "_cat.num", x=names(predictors.categorised), value = TRUE)
# Loop thru each predictor
for(j in 1:length(predictors.varnames)){
count <- count+1
#-------------------
# Create an equation
#-------------------
covariate.varname <- predictors.varnames[j]
formula <- as.formula(paste("survival::Surv(", survival.time.varname, ", ", survival.event.varname, ") ~ ", covariate.varname, sep = ""))
formula.as.character <- paste("survival::Surv(", survival.time.varname, ", ", survival.event.varname, ") ~ ", covariate.varname, sep = "")
#---------------------------------------
# Fit a univariable cox regression model
#---------------------------------------
cox <- survival::coxph(formula = formula, data=data)
# Get regression coefficients (effect sizes), exponentiated beta (hazard ratio), standard error of beta, z score, and p values
cox.coefficients <- as.data.frame(summary(cox)$coefficients) %>%
dplyr::mutate( survival.time=survival.time.varname
,survival.event=survival.event.varname
,predictor=row.names(.)) %>%
dplyr::select(survival.time, survival.event, predictor, coef,`exp(coef)`, `se(coef)` , z,`Pr(>|z|)`) # class(cox.coefficients) "data.frame" # dim(predictors.categorised) 368 12
# Get confidence intervals of hazard ratios
cox.conf.int <- as.data.frame(summary(cox)$conf.int) %>%
dplyr::mutate(survival.time=survival.time.varname
,survival.event=survival.event.varname
,predictor=row.names(.)) %>%
dplyr::select(survival.time, survival.event, predictor,`lower .95`,`upper .95`) # class(cox.conf.int) "data.frame" # dim(cox.conf.int) 1 5
cox.estimates <- dplyr::left_join( x=cox.coefficients
,y=cox.conf.int
,by=c("survival.time", "survival.event", "predictor")) %>%
dplyr::mutate(model.number=count
,hazard.ratio.95CI=paste0( signif(`exp(coef)`, digits = 2)
, " ("
, signif(`lower .95`, digits = 2)
, " - "
,signif(`upper .95`, digits = 2)
, ")")
,formula= formula.as.character) %>%
dplyr::select(model.number, dplyr::everything()) # dim(cox.estimates) 1 13
# Append cox reg estimates to the base data
uni.cox.model.base.data <- dplyr::bind_rows(uni.cox.model.base.data, cox.estimates)
#-------------------------------------------------------------------------------
# Check proportional hazard assumption using Schoenfeld residuals
## p value > 0.05 provides evidence in favour of the hazards being proportional
#-------------------------------------------------------------------------------
.assum.proportionality <- survival::cox.zph(cox)
# Convert matrix to data.frame
assum.proportionality <- as.data.frame(.assum.proportionality$table, check.names=F) %>%
dplyr::mutate( model.number=count
,survival.time=survival.time.varname
,survival.event=survival.event.varname
,predictor=row.names(.)
,formula= formula.as.character) %>%
dplyr::select(model.number, survival.time, survival.event, predictor, dplyr::everything())
# Append assumption testing results to the base data
uni.cox.assumption.test <- dplyr::bind_rows(uni.cox.assumption.test, assum.proportionality)
#-------------------------------------------------------------------------------
# Plot the scaled Schoenfeld residuals against the transformed time for a predictor
## p value > 0.05 provides evidence in favour of the hazards being proportional
## x axis: time to event variable
## y axis: beta of a predictor from a linear regression of y on x
#-------------------------------------------------------------------------------
assum.proportionality.ggcoxzph <- survminer::ggcoxzph(.assum.proportionality, font.main = 13) # class(assum.proportionality.ggcoxzph) # [1] "ggcoxzph" "ggsurv" "list"
# Create file path for output plot file
assumption.output.folder.path <- file.path(output.folder.path,"proportional-hazard-assumption")
if(dir.exists(assumption.output.folder.path)==FALSE){
dir.create(assumption.output.folder.path)
}
assumption.output.file.path <- file.path(assumption.output.folder.path
,paste0("graphical-test-proportional-hazard"
,"_survival-time=",survival.time.varname
,"_predictor=",covariate.varname
,".png"))
#ggsurvplot() returns a list of ggplots containing survival curves and optionally the risk table. You can't directly save the list using ggsave(), but you can save the output of print(ggsurvplot).
## Reference [Saving ggsurvplot()](https://github.com/kassambara/survminer/issues/152)
ggplot2::ggsave( filename = assumption.output.file.path
,plot= print(assum.proportionality.ggcoxzph)
,device = "png")
} # End the for loop
#------------------------------------------------------------------------------
# Find significant predictors that don't violate assumptions of proportionality
#------------------------------------------------------------------------------
uni.cox.assumption.met <- uni.cox.assumption.test %>%
dplyr::filter(p>0.05 & predictor != "GLOBAL") %>%
dplyr::select(model.number,survival.time,survival.event, predictor, p) # dim(uni.cox.assumption.met) 3 5
uni.cox.assumption.met.model.numbers <- uni.cox.assumption.met$model.number # class(uni.cox.assumption.met.model.numbers)
uni.cox.model.assumption.met.predictors.signi <- uni.cox.model.base.data %>%
dplyr::filter(model.number %in% uni.cox.assumption.met.model.numbers & `Pr(>|z|)` < 0.05) # dim(uni.cox.model.assumption.met.predictors.signi) 1 13
#----------------------------------------------------------------------------------------------------------
# Output these data to an Excel file
## results.MSLRS.df (cutpoints for each predictor)
## predictors.categorised (2 list elements)
## uni.cox.model.base.data
## uni.cox.assumption.test
## uni.cox.model.assumption.met.predictors.signi
#----------------------------------------------------------------------------------------------------------
attributes(results.MSLRS.df)$class <- "data.frame"
attributes(predictors.categorised)$class <- "data.frame"
attributes(uni.cox.model.base.data)$class <- "data.frame"
attributes(uni.cox.assumption.test)$class <- "data.frame"
attributes(uni.cox.model.assumption.met.predictors.signi)$class <- "data.frame"
#---------------------------------------------
# Output the cutpoints to an Excel worksheet
#---------------------------------------------
survival.output.file.path.XLSX <- file.path(output.folder.path, paste0("survival-analysis_",survival.time.varname,".xlsx"))
# Export data to first spread sheet
# Output estimated cutpoints for predictors
xlsx::write.xlsx2( x= results.MSLRS.df
,file = survival.output.file.path.XLSX
,sheetName = "cutpoints"
,col.names = TRUE
,row.names = FALSE # Changing rown.names to FALSE export the df as 1 line
,append = FALSE)
# Append data to other sheets
# Output ordered categorised predictors
xlsx::write.xlsx2( x= predictors.categorised
,file = survival.output.file.path.XLSX
,sheetName = "predictors.cat"
,col.names = TRUE
,row.names = FALSE # Changing rown.names to FALSE export the df as 1 line
,append = TRUE
)
# Output univariable cox assumption test
xlsx::write.xlsx2( x= uni.cox.assumption.test
,file = survival.output.file.path.XLSX
,sheetName = "assumption.test"
,col.names = TRUE
,row.names = FALSE # Changing rown.names to FALSE export the df as 1 line
,append = TRUE )
# Output univariable cox regression estimates
xlsx::write.xlsx2( x= uni.cox.model.base.data
,file = survival.output.file.path.XLSX
,sheetName = "cox.reg.estimates"
,col.names = TRUE
,row.names = FALSE # Changing rown.names to FALSE export the df as 1 line
,append = TRUE )
# Output univariable cox regression estimates from predictors that did not violate assumptions and are significant
xlsx::write.xlsx2( x= uni.cox.model.assumption.met.predictors.signi
,file = survival.output.file.path.XLSX
,sheetName = "significant.predictors.assumption.met"
,col.names = TRUE
,row.names = FALSE # Changing rown.names to FALSE export the df as 1 line
,append = TRUE )
} # End the else()
} # End the main function
```
---
## Abbreviations
**CD** Cluster of differentiation
**HSNCC** Head and neck squamous cell carcinoma
**PD-1** Programmed cell death protein 1
**TIGIT** T‐cell immunoglobulin and ITIM domain
---
## T cells
**Fate of CD8 þ T cells after antigen stimulation** After naïve T cells are activated by the corresponding antigens, they become effector T (T E ) cells and die after attack to tumor cells or infected cells. Part of the activated T cells became memory T cells. Memory T cells further differentiated into several subsets after re-stimulation, as follows: stem cell memory (T SCM ), central memory T (T CM ), effector memory T (T EM ) cells, and resident memory T (T RM ) cells. T RM stay at local sites to respond immediately to secondary infection. T SCM and T CM cells have a high potential for self-renewal and produce T EM and T E cells after re-stimulation, while killing activity for target cells is high in T EM and T E cells compared with T SCM and T CM cells. T EM , T E , and T RM cells have a limited potential for population expansion, and tend to become terminally differentiated and subsequently die or exhausted (Figure 1). Genes, markers, and metabolism are listed in this figure, and they are characteristic in early and late memory status [Memory T cell, exhaustion, and tumor immunity](https://www.researchgate.net/publication/337898327_Memory_T_cell_exhaustion_and_tumor_immunity).

---
## Biomarkers, genes
**BRAF** BRAF is a human gene that encodes a protein called B-Raf. The gene is also referred to as proto-oncogene B-Raf and v-Raf murine sarcoma viral oncogene homolog B, while the protein is more formally known as serine/threonine-protein kinase B-Raf.
**BRAFi** BRAF inhibitors (BRAFi)
**CD8**
**CD96**
**CD155** CD155 (PVR/necl5/Tage4), a member of the nectin-like family of adhesion molecules, is highly upregulated on tumor cells across multiple cancer types and has been associated with worse patient outcomes.
**CD226**
**CTLA4** CTLA4 or CTLA-4 (cytotoxic T-lymphocyte-associated protein 4), also known as CD152 (cluster of differentiation 152), is a protein receptor that functions as an immune checkpoint and downregulates immune responses. CTLA4 is constitutively expressed in regulatory T cells but only upregulated in conventional T cells after activation – a phenomenon which is particularly notable in cancers.[4] It acts as an "off" switch when bound to CD80 or CD86 on the surface of antigen-presenting cells.
**Tregs** regulatory T cells (also called Tregs) are T cells which have a role in regulating or suppressing other cells in the immune system. Tregs control the immune response to self and foreign particles (antigens) and help prevent autoimmune disease. Tregs produced by a normal thymus are termed ‘natural’. Treg formed by differentiation of naïve T cells outside the thymus, i.e. the periphery, or in cell culture are called ‘adaptive’. [Regulatory T Cells (Tregs)](https://www.immunology.org/public-information/bitesized-immunology/cells/regulatory-t-cells-tregs)
**TIGIT**

[TIGIT and CD96: new checkpoint receptor targets for cancer immunotherapy](https://onlinelibrary.wiley.com/doi/full/10.1111/imr.12518)
---
## Survival analysis in cancer studies
Survival analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur (i.e. time to event).
Survival analysis is used in a variety of field such as:
* Cancer studies for patients survival time analyses,
* Sociology for “event-history analysis”, and in
* engineering for “failure-time analysis”.
In cancer studies, typical research questions are like:
* What is the impact of certain clinical characteristics on patient’s survival?
* What is the probability that an individual survives 3 years?
* Are there differences in survival between groups of patients?
Survival analyses use the following methods:
* Kaplan-Meier plots to visualize survival curves
* Log-rank test to compare the survival curves of two or more groups
* Cox proportional hazards regression to describe the effect of variables on survival.
[Survival Analysis Basics](http://www.sthda.com/english/wiki/survival-analysis-basics)
---
## Types of events in cancer studies
### Recurrence (Relapse)
**Relapse or recurrence** If cancer is found after treatment, and after a period of time when the cancer couldn’t be detected, it’s called a cancer recurrence. There are different types of cancer recurrence:
1. **Local recurrence** means that the cancer has come back in the same place it first started.
2. **Regional recurrence** means that the cancer has come back in the lymph nodes near the place it first started.
3. **Distant recurrence** means the cancer has come back in another part of the body, some distance from where it started (often the lungs, liver, bone, or brain).
### Progression
**Tumor progression** is the third and last phase in tumor development. This phase is characterised by increased growth speed and invasiveness of the tumor cells. As a result of the progression, phenotypical changes occur and the tumor becomes more aggressive and acquires greater malignant potential. Together with the progression, more and more aneuploidy occurs. This may be evident as nuclear polymorphism.
### Death
**Death**
---
## Time-to-event endpoints
Time-to-event endpoints are the "time" to experiencing an event, where the event can be cancer recurrence, tumor progression, death.
**Progression-free survival (PFS)** is the period after a (drug) treatment, which could not eliminate a disease, when the disease remains stable (i.e., patients with tumors, it does not progress, it does not get worse). In oncology, PFS usually refers to situations in which a tumor is present, as demonstrated by laboratory testing, radiologic testing, or clinically. The survival outcome time to PFS is calculated as date of progression minus date of primary diagnosis if progression dates are not missing, or as date of followup minus date of primary diagnosis if progression dates are missing.
**Progression** can be categorized as local progression, regional progression, locoregional progression, and metastatic progression.
**Disease-free survival** is the period after curative treatment, which eliminates a disease, when no disease can be detected (e.g., when patients have had operations and are left with no detectable disease).
**Disease-specific survival (DSS)** The patients current conditions are grouped into died of disease (DOD), alive and disease free (ADF), alive with disease (AWD), died of disease (DOD) and "died of other causes (DOC) or died independent of disease (DID). The censor is coded as 1 if DOD (i.e., patients experienced the event of death), or as 0 if ADF, AWD, or DOC (i.e., patients haven't experienced the event of death upon follow-up). Time to DSS is calculated as date of followup minus date of primary diagnosis.
**Metastasis-free survival (MFS) or distant metastasis–free survival (DMFS)** the period until metastasis is detected
**Overall survival (OS)** Patients with a certain disease (for example, colorectal cancer) can die directly from that disease or from an unrelated cause (for example, a car accident). When the precise cause of death is not specified, this is called the overall survival rate or observed survival rate. Doctors often use mean overall survival rates to estimate the patient's prognosis. This is often expressed over standard time periods, like one, five, and ten years. For example, prostate cancer has a much higher one-year overall survival rate than pancreatic cancer, and thus has a better prognosis.
```r!
d1.xlsx <- readxl::read_excel( path = filepath.source.clinical.data.HNSCC.1.xlsx
,sheet = "RBWH 2008 - 2016"
,na=c("N/A"," ")
,col_types = .d1.data.types.xlsx
,guess_max = 327
,col_names = TRUE) %>%
# Recalcuate time to event variables, as these are rounded in the input data
dplyr::mutate( Time_to_Recurrence_years= as.numeric(difftime(time1= Recurrence, time2 = `Primary Tx`, unit = "week"))/52.25
,Time_to_followUp_years=as.numeric(difftime(time1=`Follow-up`, time2 = `Primary Tx`, unit = "week"))/52.25
# Compute Recurrence free survival in years using date of primary tx, date of recurrence, date of follow-up. In HNSCC data, patients received surgery to remove tumor. The event of interest is therefore the recurrence of tumor. If patients receive drug treatment (with tumor), the event of interest is tumor growing. Then progression-free survival is calculated as time diff between progression and primary diagnosis (same as RFS, just different event)
,RFS_years=dplyr::case_when(
!is.na(Recurrence) ~ lubridate::time_length(Recurrence - `Primary Tx`, unit = "years")
,is.na(Recurrence) ~ lubridate::time_length(`Follow-up` - `Primary Tx`, unit = "years")
)
,RFS_censor= dplyr::case_when(
!is.na(Recurrence) ~ 1 # The event is recurrence, 1 means a patient has the event of recurrence recorded by date
,is.na(Recurrence) ~ 0 # A patient has no recurrence recorded, coded as 0
)
,DSS_years=lubridate::time_length(`Follow-up` - `Primary Tx`, unit = "years")
,DSS_censor= dplyr::case_when(
DSS== "DOD" ~ 1 # DOD (die of disease) indicates a patient has the event of death, censor coded as 1
,(DSS=="AWD" | DSS=="DID" | DSS=="ADF") ~ 0 # AWD (alive with disease), died of disease (DID), alive and disease-free (ADF) indicates a patient has the event of death, censor coded as 0
)
)
```
---
## Censoring
Survival analysis focuses on the expected duration of time until occurrence of an event of interest. The status variable (e.g. PFS_censor, OS_censor) is usually coded as 0 for patients who did not experience the event (e.g., tumor progression) during the followup and as 1 for patients who have experienced the event. Patients with censor=0 are right-censored because the event was not observed for these individuals within the study time period, producing the so-called censored observations.
Censoring may arise in the following ways:
* a patient has not (yet) experienced the event of interest, such as relapse or death, within the study time period;
* a patient is lost to follow-up during the study period;
* a patient experiences a different event that makes further follow-up impossible.
This type of censoring, named right censoring, is handled in survival analysis.
---
## The survival function
In survival analysis, we use information on event status and follow up time to estimate a survival function. Consider a 20 year prospective study of patient survival following a myocardial infarction. In this study, the outcome is all-cause mortality and the survival function (or survival curve) might be as depicted in the figure below.

The horizontal axis represents time in years, and the vertical axis shows the probability of surviving or the proportion of people surviving.
* At time zero, the survival probability is 1.0 (or 100% of the participants are alive).
* At 2 years, the probability of survival is approximately 0.83 or 83%.
* At 10 years, the probability of survival is approximately 0.55 or 55%.
* The median survival is approximately 11 years.
* A flat survival curve (i.e. one that stays close to 1.0) suggests very good survival, whereas a survival curve that drops sharply toward 0 suggests poor survival.
The figure above shows the survival function as a smooth curve. In most applications, the survival function is shown as a step function rather than a smooth curve (see the next page.) [Survival Analysis](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Survival/BS704_Survival_print.html)
---
## Fit a univariable cox model in R
`survival::Surv(time, time2, event)`
**time**
for right censored data, this is the follow up time. For interval data, the first argument is the starting time for the interval.
**event**
The status indicator, normally 0=alive (patients haven't experienced the event, say death), 1=dead (patients have experienced the event, say death, during the follow-up). Other choices are TRUE/FALSE (TRUE = death) or 1/2 (2=death). For interval censored data, the status indicator is 0=right censored, 1=event at time, 2=left censored, 3=interval censored. For multiple enpoint data the event variable will be a factor, whose first level is treated as censoring. Although unusual, the event indicator can be omitted, in which case all subjects are assumed to have an event.
**time2**
ending time of the interval for interval censored or counting process data only. Intervals are assumed to be open on the left and closed on the right, (start, end]. For counting process data, event indicates whether an event occurred at the end of the interval.
**type**
character string specifying the type of censoring. Possible values are "right", "left", "counting", "interval", "interval2" or "mstate".
origin
for counting process data, the hazard function origin. This option was intended to be used in conjunction with a model containing time dependent strata in order to align the subjects properly when they cross over from one strata to another, but it has rarely proven useful.
x
any R object.
---
## Kaplan-Meier curves versus cox proportional hazard (PH) regression
Kaplan-Meier curves are good for visualizing differences in survival between two categorical groups, but they don’t work well for assessing the effect of quantitative variables like age, gene expression, leukocyte count, etc. Cox PH regression can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once
[Cox Proportional-Hazards Model](http://www.sthda.com/english/wiki/cox-proportional-hazards-model)
---
## Interpreting results
### Lack of convergence
**Non-convergence** Lack of convergence is an indication that the data do not fit the model well, because there are too many poorly fitting observations. When the model does not converge, an error "coxph ran out of iterations and did not converge" appears and infinite estimates (Inf) are generated in the upper bound (`upper .95`) of 95% confidence intervals of hazard ratios (`exp(coef)` ). For example,
```r!
# Get survival data
g <- read.csv(file="C:/Lab_MarkS/lunC/Coursera_Survival-Analysis-in-R-for-Public-Health/course-data/6AiBbg-BEem6Gg6vVM6M8A_e872b4600f8111e9b2f4133a1edfbb40_simulated-HF-mort-data-for-GMPH-_1K_-final-_2_.csv") # dim(g) 1000 31
# Run a multiple Cox model that doesn't converge
cox <- coxph(Surv(fu_time, death) ~ age + gender + copd + factor(quintile) + factor(ethnicgroup), data = g)
summary(cox)
# Call:
# coxph(formula = Surv(fu_time, death) ~ age + gender + copd + factor(quintile) + factor(ethnicgroup), data = g)
#
# n= 951, number of events= 468 (49 observations deleted due to missingness)
# coef exp(coef) se(coef) z Pr(>|z|)
# age 6.213e-02 1.064e+00 5.606e-03 11.082 < 2e-16 ***
# gender -3.104e-01 7.332e-01 9.695e-02 -3.201 0.00137 **
# copd 6.439e-02 1.067e+00 1.092e-01 0.590 0.55548
# factor(quintile)1 1.422e+01 1.501e+06 9.177e+02 0.015 0.98764
# factor(quintile)2 1.391e+01 1.094e+06 9.177e+02 0.015 0.98791
# factor(quintile)3 1.431e+01 1.638e+06 9.177e+02 0.016 0.98756
# factor(quintile)4 1.427e+01 1.579e+06 9.177e+02 0.016 0.98759
# factor(quintile)5 1.430e+01 1.631e+06 9.177e+02 0.016 0.98756
# factor(ethnicgroup)2 1.402e-02 1.014e+00 3.261e-01 0.043 0.96570
# factor(ethnicgroup)3 -7.728e-01 4.617e-01 4.167e-01 -1.854 0.06369 .
# factor(ethnicgroup)9 5.211e-01 1.684e+00 3.623e-01 1.438 0.15038
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# exp(coef) exp(-coef) lower .95 upper .95
# age 1.064e+00 9.398e-01 1.0525 1.0759
# gender 7.332e-01 1.364e+00 0.6063 0.8866
# copd 1.067e+00 9.376e-01 0.8610 1.3211
# factor(quintile)1 1.501e+06 6.660e-07 0.0000 Inf
# factor(quintile)2 1.094e+06 9.141e-07 0.0000 Inf
# factor(quintile)3 1.638e+06 6.107e-07 0.0000 Inf
# factor(quintile)4 1.579e+06 6.332e-07 0.0000 Inf
# factor(quintile)5 1.631e+06 6.131e-07 0.0000 Inf
# factor(ethnicgroup)2 1.014e+00 9.861e-01 0.5352 1.9217
# factor(ethnicgroup)3 4.617e-01 2.166e+00 0.2040 1.0450
# factor(ethnicgroup)9 1.684e+00 5.939e-01 0.8277 3.4254
#
# Concordance= 0.67 (se = 0.013 )
# Likelihood ratio test= 169.6 on 11 df, p=<2e-16
# Wald test = 142.4 on 11 df, p=<2e-16
# Score (logrank) test = 143.1 on 11 df, p=<2e-16
```
These steps can be taken to fix the problem of non-convergence.
1. Change the reference category in a categorical independent variable
By default, `R` sets the first (lowest) category to be the reference. With quintile taking on values from 0 to 5 and 0 having very few patients in it, 0 is a bad choice of reference category. The first thing to try is to set the reference one to something else. As quintile measures socio-economic status or deprivation, and you're usually interested in the effect of lower status (or greater deprivation in UK terminology) compared with higher status, it makes sense to set higher status - quintile = 1 - as the reference category.
```r!
attach(g)
# Only four patients have quintile zero. This means invalid quintile, for instance when the postcode (zip code) can't be mapped to a small geographical area and therefore to a socio-economic status measure. Actually, four patients in a category can sometimes be enough to get the model to work, but there's another problem- Of those four patients with quintile zero, no one died (0 in death=1)
table(quintile, exclude = NULL)
# 0 1 2 3 4 5 <NA>
# 4 138 205 211 220 216 6
# Of those four patients with quintile zero, no one died. That itself might not be a problem, but we've let R choose the reference category by default, and it's chosen quintile zero. All the other five hazard ratios are relative to this tiny group of patients in which no one died. It's not surprising that the algorithm couldn't come up with sensible HR estimates.
t <- table(quintile, death)
# death
# quintile 0 1
# 0 4 0
# 1 60 78
# 2 111 94
# 3 105 106
# 4 116 104
# 5 109 107
round(100*prop.table(t,1),digits=1) # row %s
# death
# quintile 0 1
# 0 100.0 0.0
# 1 43.5 56.5
# 2 54.1 45.9
# 3 49.8 50.2
# 4 52.7 47.3
# 5 50.5 49.5
attach(g)
# Change the reference category for quintile to make it 1 rather than 0.
quintile <- as.factor(quintile)
quintile <- relevel(quintile, ref = 1) # quintile =1 as the reference category
cox <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile + ethnicgroup)
# Warning message:
# In fitter(X, Y, istrat, offset, init, control, weights = weights, :
# Loglik converged before variable 4,5,6,7,8 ; coefficient may be infinite.
summary(cox)
# It still didn't converge. Can you spot where the problem is? It's with quintile zero: look at its huge standard error (1.208e+03) and infinite confidence interval. You need to try something else.
```
2. Combine categories
If changing the reference category doesn't work, then consider combining categories if it makes sense to do so. Quintile zero is the an artificial category meaning that the patient's postcode (zip code) or other small geographical area identifier was missing or could not be linked to the national socio-economic status file, i.e. the patient's status is unknown. These patients are often different from the others, e.g. they are from overseas or are homeless, and that's why they have no postcode or geographical area. It doesn't make too much sense to combine them with any of the other five categories. However, in this case, this quintile zero group is so small compared with the other categories that combining them will have a negligible impact on the results.
```r!
# Combine categories (combine quintile 0 with quintile 5). To combine the quintile 0 people with the quintile 5 people, you could make a new variable, which I've called quintile_5groups, and just populate its five categories with quintile's values:
quintile_5groups <- g[,"quintile"] # best start again with the original data set, not from the existing object called "quintile"
quintile_5groups[quintile_5groups==0] <- 5 # This picks the individuals with quintile=0 (note the double equals sign) and sets them to 5
quintile_5groups <- factor(quintile_5groups) # lastly, tell R that this is a categorical variable and not a continuous one
table(quintile_5groups, exclude = NULL)
# quintile_5groups
# 1 2 3 4 5 <NA>
# 138 205 211 220 220 6
# now run the model with this new variable
cox <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile_5groups + ethnicgroup)
summary(cox)
# This time it has behaved well and has converged. However, combining the quintile zero people with the quintile=5 (most disadvantaged) people is probably not a good idea because the two sets of people are probably quite different.
```
3. Exclude the patients
This is the best option if combining categories doesn't make sense and there are only a few patients in the problematic category.
```r!
# Drop the quintile 0 patients. The next option is to drop the quintile zero patients. If you just want to run the model and not any other descriptive analyses, then you could turn the zeroes into NAs. Then coxph will simply exclude those patients because of their missing values:
quintile_5groups <- g[,'quintile']
quintile_5groups[quintile_5groups==0] <- NA # set the zeroes to missing
quintile_5groups <- factor(quintile_5groups)
table(quintile_5groups, exclude=NULL)
# quintile_5groups
# 1 2 3 4 5 <NA>
# 138 205 211 220 216 10
# Run the same model
cox <- coxph(Surv(fu_time, death) ~ age + gender + copd + quintile_5groups + ethnicgroup)
summary(cox)
# There are ten missing observations this time. The hazard ratio for quintile 5 has gone from 1.08 to 1.11. Not a big change, but it's quite surprising that the HR has changed at all given that it's only a question of four patients.
```
4. Drop the offending variable
This will be the best option if combining categories doesn't make sense and if there are too many few patients in the problematic category for us to be comfortable dropping them all.
```r!
# Drop the quintile variable. This is the simplest and sometimes best option. Just don't put quintile in the list of predictors in the model:
cox <- coxph(Surv(fu_time, death) ~ age + gender + copd + ethnicgroup)
summary(cox)
# In this case, this is rather like taking a massive hammer to crack a nut. I'd just drop the four quintile zero patients entirely as they comprise such a small percentage of the sample. You have to judge each situation separately, though. How much information is being lost or potentially distorted with each option?
```
In practice, you may have to trade off one of these concerns against another, for example having to choose between combining categories that don't fit well together and dropping an important variable from the model.
Remember that the problem of non-convergence can happen in any kind of regression and that these simple tricks can also work there.
---
### Hazard ratio
**Hazard ratio (HR)**
Hazard ratio (HR) is a measure of an effect of an intervention on an outcome of interest over time. Hazard ratio is reported most commonly in time-to-event analysis or survival analysis (i.e. when we are interested in knowing how long it takes for a particular event/outcome to occur).
Hazard Ratio (i.e. the ratio of hazards) = Hazard in the intervention group ÷ Hazard in the control group
Because Hazard Ratio is a ratio, then when:
HR = 0.5: at any particular time, half as many patients in the treatment group are experiencing an event compared to the control group.
HR = 1: at any particular time, event rates are the same in both groups. HR=1 means lack of association between the intervention (or explanatory variable/predictor) and the outcome
HR = 2: at any particular time, twice as many patients in the treatment group are experiencing an event compared to the control group.
In survival analysis, the hazard ratio (HR) is the ratio of the hazard rates corresponding to the conditions described by two levels of an explanatory variable. For example, in a drug study, the treated population may die at twice the rate per unit time as the control population. The hazard ratio would be 2, indicating higher hazard of death from the treatment. Or in another study, men receiving the same treatment may suffer a certain complication ten times more frequently per unit time than women, giving a hazard ratio of 10 [Hazard ratio](https://en.wikipedia.org/wiki/Hazard_ratio). A hazard ratio of 1 means lack of association