```{r setup, include=FALSE, message=FALSE, warning=FALSE, echo=FALSE}
#this chunk sets global options for Rmarkdown
knitr::opts_knit$set(root.dir=params$work_dir)
knitr::opts_chunk$set(echo = TRUE)
options(knitr.table.format="html")
setwd(params$work_dir)
library(ButchR)
library(ggplot2)
library(viridis)
library(ComplexHeatmap)
library(gage)
library(umap)
library(tibble)
library(cowplot)
library(dplyr)
library(DT)
library(knitr)
```
# NMF on Gene expression and Chromatin accessibility data
In this tutorial, we will be working with a dataset combining both bulk RNA-seq and ATAC-seq from sorted blood cells, which was published by Corces et al. in 2016
Corces, M.R., Buenrostro, J.D., Wu, B., Greenside, P.G., Chan, S.M., Koenig, J.L., Snyder, M.P., Pritchard, J.K., Kundaje, A., Greenleaf, W.J., Majeti, R., Chang, H.Y., 2016. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature Genetics 48, 1193–1203. [https://doi.org/10.1038/ng.3646](https://doi.org/10.1038/ng.3646).
Will we working with the RNA-seq and ATAC-seq dataset over multiple sorted blood cell populations, and verify if we can obtain specific signatures from each modality, and combine the modalities to obtain multiomics signatures.
Hence, we will perform both **single-NMF** on each modality, and then combined **integrative NMF** combining both modalities. Eventually, we will extract unsupervised signatures, and try to identify what these signatures correspond to.
First, we define a range of factorization ranks for the decomposition, as we do not know *a priori* what the right number is:
```{r factorank}
##----------------------------------------------------------------------------##
## Factorization ranks for all NMFs ##
##----------------------------------------------------------------------------##
factorization_ranks <- 6:10
```
## Gene expression (RNAseq)
### Data loading
Read normalized gene expression matrix. The raw counts have been normalized by library size (using the EstimateLibrarySize function from DESeq2), and then log-normalized.
```{r rna_dataloading, results="asis", message=FALSE}
setwd('~/ETBII2023')
##----------------------------------------------------------------------------##
## Read normalized data ##
##----------------------------------------------------------------------------##
# read normalized matrix
rna_norm_mat <- readRDS("data/rnaseq/rnaseq_normalized_counts.RDS")
rna_annotation <- readRDS("data/rnaseq/rnaseq_annotation.RDS")
##----------------------------------------------------------------------------##
## Print dataset dimension ##
##----------------------------------------------------------------------------##
data.frame(dim(rna_norm_mat), row.names = c("features", "samples"))
```
### Applying NMF
Applying Non-Negative Matrix Factorization (NMF) to normalized transcriptome data (RNAseq). We perform the factorization over a range of ranks (6 to 10), and will check later which seems more appropriate. Given that NMF is initialized randomly, we run the procedure 10 times, with up to 10^4 iterations each time, and a convergence threshold of 40, meaning that the iterations will stop if the error function does not change for this number of iterations.
**Warning: do not worry about possible error/warning message due to library loading the first time you run the decomposition....**
```{r rna_NMF_run, message=FALSE, warning=FALSE, cache=TRUE, error=FALSE}
##----------------------------------------------------------------------------##
## run NMF ##
##----------------------------------------------------------------------------##
rna_nmf_exp <- run_NMF_tensor(X = rna_norm_mat,
ranks = factorization_ranks,
method = "NMF",
n_initializations = 10,
iterations = 10^4,
convergence_threshold = 40,
extract_features = TRUE)
rna_nmf_exp
## Normalize NMF: W matrix is normalized column-wise, and H is correspondingly rescaled
rna_norm_nmf_exp <- normalizeW(rna_nmf_exp)
```
### Factorization quality metrics and optimal K
Based on the results of the factorization quality metrics, an optimal number of signatures (k) must be chosen. Several criteria can be used to determine a possible optimal number of signatures. Remember however that **THE** optimal number often does not exist, and that several factorization ranks can give you different byt useful information about the dataset!
```{r rna_NMF_optK, results='hide',fig.keep='all', message=FALSE, warning=FALSE}
## Plot K stats
gg_plotKStats(rna_norm_nmf_exp)
```
In order to find the optimal factorization ranks, you should:
minimize
* the [Frobenius error](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm)
* the [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation)
* the mean [Amari distance](https://en.wikipedia.org/wiki/Amari_distance)
maximize
* the sum and mean [silhouette width](https://en.wikipedia.org/wiki/Silhouette_(clustering))
* the [cophenic coefficient](https://en.wikipedia.org/wiki/Cophenetic_correlation).
Another good criteria is to look at the river plot, which indicates how signatures at different ranks are related to each other.
```{r rna_NMF_riverplot, results='hide',fig.keep='all', message=FALSE, warning=FALSE}
## Generate river plot
plot(generateRiverplot(rna_norm_nmf_exp), plot_area=1, yscale=0.6, nodewidth=0.5)
```
> **Beware:** there seems to be a bug with the way this function displays the leftmost signatures...
We can now plot the exposure matrices H for all the factorization ranks, adding the cell type annotation to the columns of the H matrix.
```{r rna_Hmatrix_Wnorm, fig.width=8, fig.height=5.5, out.width="90%", results='asis', warning=FALSE, message=FALSE}
##----------------------------------------------------------------------------##
## H matrix heatmap annotation ##
##----------------------------------------------------------------------------##
#Annotation for H matrix heatmap
# Define color vector
type.colVector <- list(Celltype = setNames(rna_annotation$color[match(levels(rna_annotation$Celltype), rna_annotation$Celltype)],
levels(rna_annotation$Celltype))
)
# Build Heatmap annotation
heat.anno <- HeatmapAnnotation(df = data.frame(Celltype = rna_annotation$Celltype),
col = type.colVector,
show_annotation_name = TRUE, na_col = "white")
##----------------------------------------------------------------------------##
## Generate H matrix heatmap, W normalized ##
##----------------------------------------------------------------------------##
for(ki in factorization_ranks) {
#plot H matrix
tmp.hmatrix <- HMatrix(rna_norm_nmf_exp, k = ki)
h.heatmap <- Heatmap(tmp.hmatrix,
col = viridis(100),
name = "Exposure",
clustering_distance_columns = 'pearson',
show_column_dend = TRUE,
heatmap_legend_param =
list(color_bar = "continuous", legend_height=unit(2, "cm")),
top_annotation = heat.anno,
show_column_names = FALSE,
show_row_names = FALSE,
cluster_rows = FALSE)
print(h.heatmap)
}
```
By visual inspection, we can find association of signatures to certain cell-types. We can do this analysis systematically using the recovery function:
```{r}
tmp.hmatrix <- HMatrix(rna_norm_nmf_exp, k = 7)
recovery_plot(tmp.hmatrix, rna_annotation$Celltype)
```
### Gene set enrichment analysis
Using the feature exposure extracted from the W matrix, a gene set enrichment analysis is performed against all MSigDB terms
The optimal factorization rank selected was: **K = 7**
```{r rna_gsea, results="asis", message=FALSE}
##----------------------------------------------------------------------------##
## W matrix Z scores ##
##----------------------------------------------------------------------------##
k.opt = 7
rna_wmatrix <- WMatrix(rna_norm_nmf_exp, k = k.opt)
#Zscore for each signature
rna_wmatrix.zscores <- apply(rna_wmatrix, MARGIN=2, function(wmat_score){
(wmat_score - median(wmat_score)) / mad(wmat_score)
})
colnames(rna_wmatrix.zscores) <- paste0("Signature", 1:k.opt)
```
We now apply an enrichment analysis using the GAGE package. As a reference, we are using the collection of genesets
from MSigDB corresponding to experimentally/computationally identified groups of genes.
```{r gage}
##----------------------------------------------------------------------------##
## GAGE (Generally Applicable Gene-set Enrichment) analysis ##
##----------------------------------------------------------------------------##
#Infer gene sets tha are significantly pertubed relative to all genes considered
#load precompiled GSEA MSigDB gene sets
## this is the full collection of genesets (17800 genesets)
#gs.msigDB <- readList("db/msigdb.v6.2.symbols.gmt")
## this is the set of genesets related to immunological processes
gs.msigDB <- readList("db/c7.all.v2022.1.Hs.symbols.gmt")
# head(gs.msigDB)
#run GAGE analysis
rna_msigDB_enrichment <- gage(rna_wmatrix.zscores, gsets=gs.msigDB, same.dir=TRUE)
#Drop NAs for upregulated
rna_msigDB_enrichment <- as.data.frame(rna_msigDB_enrichment$greater)
rna_msigDB_enrichment <- rna_msigDB_enrichment[!is.na(rna_msigDB_enrichment$p.geomean),]
rna_msigDB_enrichment <- rna_msigDB_enrichment[, paste0("Signature", 1:k.opt)]
# Select only more enriched terms in one signature compared to the others
idx <- apply(rna_msigDB_enrichment, 1, function(term){
term <- -log10(term)
# Change 0 to small value to avoid NAs
term[term == 0] <- 1e-40
# find if this term is more enriched in one signature compared to others
is.enrich <- sapply(term, function(x){
# p-value 5 times greater than at least 5 other signatures
sum(x/term > 5) > 5
})
any(is.enrich)
})
rna_msigDB_enrichment <- rna_msigDB_enrichment[idx,]
# Print table
datatable(rna_msigDB_enrichment, filter="top",
extensions = 'Buttons',
options = list(dom = 'Bfrtip',
buttons = list(list(extend = 'collection',
buttons = c('excel', 'csv'),
text = 'DOWNLOAD DATA')))) %>%
formatSignif(columns=colnames(rna_msigDB_enrichment), digits=3)
```
> Questions:
> * do the enriched term correspond to the recovery plot analysis?
> * redo the analysis with the `msigdb.v6.2.symbols.gmt` collection
## Integrative RNAseq & ATACseq
**Gene expression (RNAseq) & Chromatin accessibility (ATACseq)**
Here, we want to learn signatures by integrating RNA-seq and ATAC-seq, using the integrative NMF method, allowing to extract the **homogeneous** as well as **heterogeneous** part of the datasets. The parameter $\lambda$ allows to balance these 2 contributions (homogeneous/heterogeneous), and needs to be estimated (see below).
Only the those samples with RNAseq and ATACseq data were used in the integrative analysis.
### Data loading
Read normalized gene expression matrix and chromatin accessibility matrix...
```{r integrative_dataloading, results="asis", message=FALSE}
##----------------------------------------------------------------------------##
## Read normalized data ##
##----------------------------------------------------------------------------##
# read normalized matrix
int_norm_mat <- readRDS("data/multiview/multiview_norm_mat_list.RDS")
int_annotation <- readRDS("data/multiview/multiview_annotation.RDS")
##----------------------------------------------------------------------------##
## Print dataset dimension ##
##----------------------------------------------------------------------------##
## RNA-seq
data.frame(dim(int_norm_mat$rna), row.names = c("features", "samples"))
## ATAC-seq
data.frame(dim(int_norm_mat$atac), row.names = c("features", "samples"))
```
### integrative NMF Lambda tuning
In iNMF the lambda parameter controls if the factorization should shift towards extracting more homogeneous effect or heterogeneous effect.
Based on Yang and Michailidis, 2016, to tune the value of the parameter lambda for the integrative NMF (iNMF),
the objectives values of join NMF (jNMF) are compared to
single view NMFs (sNMF), the principle is that **join NMF**
represents complete **homogeneity** and **single view NMF** represents complete **heterogeneity**.
To avoid overfitting, the best lambda can be selected by plotting
the difference of the unsquared residual quantities of
jNMF and iNMF (Ri - Rj) over multiple values of lambda,
and compare it to the difference of the unsquared residual quantities of
sNMF and jNMF c*(Rj - Rs).
The optimal lambda usually is the first lambda in which
(Ri - Rj) < c*(Rj - Rs) where c is a constant >= 2.
In other words, the optimal lambda should achieve an optimal balance between the homogeneous and heterogeneous part
of the data.
For more Help please run *?iNMF_lambda_tuning.*
The following function helps to select the optimal lamba for a fixed factorization rank (k = 8).
```{r iNMF_tune, message=FALSE, warning=FALSE, cache=TRUE}
##----------------------------------------------------------------------------##
## Run integrative NMF ##
##----------------------------------------------------------------------------##
opt_lambda = iNMF_lambda_tuning(matrix_list = int_norm_mat,
lambdas = seq(0.02, 0.22, 0.05),
thr_cons = 5,
Output_type = "residuals",
rank = 8,
n_initializations = 5,
iterations = 10^4,
convergence_threshold = 40)
opt_lambda
```
> Spend some time to understand why the curves behave as they do...
The optimal lambda is the smallest value before overfitting of iNMF (around 0.1 or so in this case...).
However, in order to enhance the specific contribution
### Applying integrative NMF
Applying Integrative Non-Negative Matrix Factorization (NMF) to normalized Gene expression (RNAseq) and Chromatin accessibility data (ATACseq)
```{r iNMF_run, message=FALSE, warning=FALSE, cache=TRUE}
##----------------------------------------------------------------------------##
## Run integrative NMF ##
##----------------------------------------------------------------------------##
inmf_exp <- run_iNMF_tensor(matrix_list = int_norm_mat,
ranks = 8,
n_initializations = 5,
iterations = 10^4,
convergence_threshold = 40,
Sp = 0,
lamb = 0.07,
extract_features = FALSE)
inmf_exp
```
### Factorization quality metrics and optimal K
Based on the results of the factorization quality metrics, an optimal number of signatures (k) must be chosen:
```{r iNMF_optK, results='hide',fig.keep='all', message=FALSE, warning=FALSE}
## Plot K stats
gg_plotKStats(inmf_exp)
```
Minize the Frobenius error, the coefficient of variation and the mean Amari distance, while maximizing the sum and mean silhouette width and the cophenic coefficient.
### H Matrices: {.tabset}
Here, we can define some functions to plot the shared, specific or total part of the H matrix:
```{r iNMF_Hmatrix, fig.width=12, fig.height=4, out.width="90%", results='asis', warning=FALSE, message=FALSE}
##----------------------------------------------------------------------------##
## Plot H matrix heatmap integrative NMF ##
##----------------------------------------------------------------------------##
plotH_oneView <- function(int_nmf, viewID, k, heat.anno, main_hist, col=viridis(100),
scale_color=TRUE, displayID = viewID){
sharedH <- HMatrix(int_nmf, k = k, type = "shared")
Hview <- HMatrix(int_nmf, k = k, type = "viewspec", view_id = viewID)
# hs <- HMatrix(inmf_exp, k = ki, type = "shared")
# hv_r <- HMatrix(inmf_exp, k = ki, type = "viewspec", view_id = "rna")
# hv_a <- HMatrix(inmf_exp, k = ki, type = "viewspec", view_id = "atac")
# Define total H matrix
totalH <- sharedH + Hview
# Color Function
if (scale_color) {
colf <- circlize::colorRamp2(seq(0, max(totalH), length.out = 100), col)
} else {
colf <- col
}
#main_hist <- hclust(as.dist(1 - cor(totalH, method = "pearson")))
tH.heatmap <- Heatmap(totalH,
col = colf,
name = "Total Exposure",
column_title = "Total H matrix",
cluster_columns = main_hist,
show_column_dend = TRUE,
top_annotation = heat.anno,
show_column_names = FALSE,
show_row_names = FALSE,
cluster_rows = FALSE)
sH.heatmap <- Heatmap(sharedH,
col = colf,
name = "Shared Exposure",
column_title = "Shared H matrix",
cluster_columns = main_hist,
show_column_dend = TRUE,
top_annotation = heat.anno,
show_column_names = FALSE,
show_row_names = FALSE,
cluster_rows = FALSE)
vH.heatmap <- Heatmap(Hview,
col = colf,
name = "View Specific Exposure",
column_title = "View specific H matrix",
cluster_columns = main_hist,
show_column_dend = TRUE,
top_annotation = heat.anno,
show_column_names = FALSE,
show_row_names = FALSE,
cluster_rows = FALSE)
#print(tH.heatmap + sH.heatmap + vH.heatmap)
# stiff_cor_heat <- as_ggplot(grid.grabExpr(draw(stiff_cor_heat)))
# stiff_cor_heat
#ht_global_opt(heatmap_legend_grid_height = unit(.25, "cm"))
ht_list <- tH.heatmap + sH.heatmap + vH.heatmap
draw(ht_list, row_title = displayID)
ht_global_opt(RESET = TRUE)
}
```
Define functions to add annotations
```{r}
##----------------------------------------------------------------------------##
## H matrix heatmap annotation ##
##----------------------------------------------------------------------------##
#Annotation for H matrix heatmap
# Define color vector
type.colVector <- list(Celltype = setNames(int_annotation$color[match(levels(int_annotation$Celltype), int_annotation$Celltype)],
levels(int_annotation$Celltype))
)
# Build Heatmap annotation
heat.anno <- HeatmapAnnotation(df = data.frame(Celltype = int_annotation$Celltype),
col = type.colVector,
show_annotation_name = FALSE, na_col = "white")
my_Heatmap <- function(my_h, heat.anno, matrix_id){
hm <- Heatmap(my_h,
col = viridis(100),
name = matrix_id,
column_title = matrix_id,
cluster_columns = main_hist,
show_column_dend = TRUE,
top_annotation = heat.anno,
show_column_names = FALSE,
show_row_names = FALSE,
cluster_rows = FALSE)
}
```
We can now plot for the different factorization ranks the specific part of the H matrix both for RNA-seq and ATAC-seq:
```{r}
##----------------------------------------------------------------------------##
## Generate H matrix heatmap, W normalized ##
##----------------------------------------------------------------------------##
ki = 8
sharedH <- HMatrix(inmf_exp, k = ki, type = "shared")
main_hist <- hclust(as.dist(1 - cor(sharedH, method = "pearson")))
plotH_oneView(inmf_exp, k = ki, heat.anno = heat.anno, main_hist = main_hist,
scale_color = FALSE, viewID = "rna", displayID = "RNAseq")
plotH_oneView(inmf_exp, k = ki, heat.anno = heat.anno, main_hist = main_hist,
scale_color = FALSE, viewID = "atac", displayID = "ATACseq")
```
> Looking at the view specific H matrix, you can see some specificity brought by each modality to each signature.
> Try to spot some differences between the ATAC-seq and RNA-seq for some of the signatures!
> Beware that the scale of the shared and specific H matrix depends on the chouce of lambda. A smaller lambda will increase the specific part. You can try running the iNMF again with a smaller lambda.
### Gene set enrichment analysis
Using the feature exposure extracted from the W matrix, a gene set enrichment analysis is perform agains all MSigDB terms
The optimal factorization rank selected was: **k = 8**
```{r iNMF_gsea, results="asis", message=FALSE, warning=FALSE}
##----------------------------------------------------------------------------##
## W matrix Z scores ##
##----------------------------------------------------------------------------##
int_rna_wmatrix <- WMatrix(inmf_exp, k = 8, view_id = "rna",)
#Zscore for each signature
int_rna_wmatrix.zscores <- apply(int_rna_wmatrix, MARGIN=2, function(wmat_score){
(wmat_score - median(wmat_score)) / mad(wmat_score)
})
colnames(int_rna_wmatrix.zscores) <- paste0("Signature", 1:8)
```
We can also extract the X matrix for the atac-seq data:
```{r iNMF_atac_gsea, results="asis", message=FALSE, warning=FALSE}
##----------------------------------------------------------------------------##
## W matrix Z scores ##
##----------------------------------------------------------------------------##
int_atac_wmatrix <- WMatrix(inmf_exp, k = 8, view_id = "atac",)
# Zscore for each signature
int_atac_wmatrix.zscores <- apply(int_atac_wmatrix, MARGIN=2, function(wmat_score){
(wmat_score - median(wmat_score)) / mad(wmat_score)
})
colnames(int_atac_wmatrix.zscores) <- paste0("Signature", 1:8)
```
> Can you identify the top 10 genes and the top 10 ATAC-seq peaks for a particular signature?
Again, we perform the enrichment with gage on the RNA-seq W-matrix:
```{r gage_inmf}
##----------------------------------------------------------------------------##
## GAGE (Generally Applicable Gene-set Enrichment) analysis ##
##----------------------------------------------------------------------------##
#Infer gene sets tha are significantly pertubed relative to all genes considered
#load precompiled GSEA MSigDB gene sets
gs.msigDB <- readList("db/c7.all.v2022.1.Hs.symbols.gmt")
# head(gs.msigDB)
#run GAGE analysis
int_rna_msigDB_enrichment <- gage(int_rna_wmatrix.zscores, gsets=gs.msigDB, same.dir=TRUE)
#Drop NAs for upregulated
int_rna_msigDB_enrichment <- as.data.frame(int_rna_msigDB_enrichment$greater)
int_rna_msigDB_enrichment <- int_rna_msigDB_enrichment[!is.na(int_rna_msigDB_enrichment$p.geomean),]
int_rna_msigDB_enrichment <- int_rna_msigDB_enrichment[, paste0("Signature", 1:9)]
# Select only more enriched terms in one signature compared to the others
idx <- apply(int_rna_msigDB_enrichment, 1, function(term){
term[term == 0] <- 1e-40
term <- -log10(term)
# Change 0 to small value to avoid NAs
# find if this term is more enriched in one signature compared to others
is.enrich <- sapply(term, function(x){
# p-value 5 times greater than at least 5 other signatures
sum(x/term > 5) > 5
})
any(is.enrich)
})
# table(idx)
# head(idx)
# head(int_rna_msigDB_enrichment)
int_rna_msigDB_enrichmentf <- int_rna_msigDB_enrichment[idx,]
# Print table
datatable(int_rna_msigDB_enrichmentf, filter="top",
extensions = 'Buttons',
options = list(dom = 'Bfrtip',
buttons = list(list(extend = 'collection',
buttons = c('excel', 'csv'),
text = 'DOWNLOAD DATA')))) %>%
formatSignif(columns=colnames(int_rna_msigDB_enrichment), digits=3)
```
## If you have time...
You can perform the single NMF just on the ATAC-seq data!
## Chromatin accessibility (ATACseq)
### Data loading
Read normalized chromatin accessibility matrix. As for RNA-seq, the data has been normalized by library size estimated using DESeq2, and log-normalized.
```{r atac_dataloading, results="asis", message=FALSE}
##----------------------------------------------------------------------------##
## Read normalized data ##
##----------------------------------------------------------------------------##
# read normalized matrix
atac_norm_mat <- readRDS("data/atacseq/atacseq_normalized_counts.RDS")
atac_annotation <- readRDS("data/atacseq/atacseq_annotation.RDS")
##----------------------------------------------------------------------------##
## Print dataset dimension ##
##----------------------------------------------------------------------------##
cat("Dimension of Chromatin accessibility dataset (ATACseq): \n\n ")
kable(data.frame(dim(atac_norm_mat), row.names = c("features", "samples")),
col.names = "")
```
### Applying NMF
Applying Non-Negative Matrix Factorization (NMF) to normalized Chromatin accessibility data (ATACseq)
```{r atac_NMF_run, message=FALSE, warning=FALSE, cache=TRUE}
##----------------------------------------------------------------------------##
## run NMF ##
##----------------------------------------------------------------------------##
atac_nmf_exp <- run_NMF_tensor(X = atac_norm_mat,
ranks = factorization_ranks,
method = "NMF",
n_initializations = 8,
iterations = 10^4,
convergence_threshold = 40,
extract_features = FALSE)
atac_nmf_exp
## Normalize NMF
atac_norm_nmf_exp <- normalizeW(atac_nmf_exp)
```
### Factorization quality metrics and optimal K
Based on the results of the factorization quality metrics, an optimal number of signatures (k) must be chosen:
```{r atac_NMF_optK, results='hide',fig.keep='all', message=FALSE, warning=FALSE}
## Plot K stats
gg_plotKStats(atac_norm_nmf_exp)
## Generate river plot
#plot(generateRiverplot(atac_norm_nmf_exp), plot_area=1, yscale=0.6, nodewidth=0.5)
```
Again, we need to minize the Frobenius error, the coefficient of variation and the mean Amari distance, while maximizing the sum and mean silhouette width and the cophenic coefficient.
### H Matrix, W normalized: {.tabset}
```{r atac_Hmatrix_Wnorm, fig.width=8, fig.height=5.5, out.width="90%", results='asis', warning=FALSE, message=FALSE}
##----------------------------------------------------------------------------##
## H matrix heatmap annotation ##
##----------------------------------------------------------------------------##
#Annotation for H matrix heatmap
# Define color vector
type.colVector <- list(Celltype = setNames(atac_annotation$color[match(levels(atac_annotation$Celltype), atac_annotation$Celltype)],
levels(atac_annotation$Celltype))
)
# Build Heatmap annotation
heat.anno <- HeatmapAnnotation(df = data.frame(Celltype = atac_annotation$Celltype),
col = type.colVector,
show_annotation_name = TRUE, na_col = "white")
##----------------------------------------------------------------------------##
## Generate H matrix heatmap, W normalized ##
##----------------------------------------------------------------------------##
for(ki in factorization_ranks) {
cat("\n")
cat(" \n#### H matrix for k=", ki, " {.tabset} \n ")
#plot H matrix
tmp.hmatrix <- HMatrix(atac_norm_nmf_exp, k = ki)
h.heatmap <- Heatmap(tmp.hmatrix,
col = viridis(100),
name = "Exposure",
clustering_distance_columns = 'pearson',
show_column_dend = TRUE,
heatmap_legend_param =
list(color_bar = "continuous", legend_height=unit(2, "cm")),
top_annotation = heat.anno,
show_column_names = FALSE,
show_row_names = FALSE,
cluster_rows = FALSE)
print(h.heatmap)
cat(" \n Recovery plots for k=", ki, " \n ")
#recovery_plot(tmp.hmatrix, atac_annotation, "Celltype")
}
gc()
```
> Question: can you compare the H matrices obtained using RNA-seq and ATAC-seq? Do you see differences?
> Are some cell types separated better in one or the other modality?
### UMAP representation
First RNA-seq
```{r umap_rna}
##----------------------------------------------------------------------------##
## UMAP H matrix ##
##----------------------------------------------------------------------------##
hmatrix_norm <- HMatrix(rna_norm_nmf_exp, k = 8)
umapView <- umap(t(hmatrix_norm))
umapView_df <- as.data.frame(umapView$layout)
colnames(umapView_df) <- c("UMAP1", "UMAP2")
type_colVector <- rna_annotation %>%
dplyr::select(Celltype, color) %>%
arrange(Celltype) %>%
distinct() %>%
deframe()
umapView_df %>%
rownames_to_column("sampleID") %>%
left_join(rna_annotation, by = "sampleID") %>%
mutate(Celltype = factor(Celltype, levels = c("HSC", "MPP", "LMPP",
"CMP", "GMP", "MEP",
"CLP", "CD4", "CD8",
"NK", "Bcell", "Mono"))) %>%
ggplot(aes(x=UMAP1, y=UMAP2, color = Celltype)) +
geom_point(size = 2.5, alpha = 0.95) +
scale_color_manual(values = type_colVector) +
theme_cowplot()
```
Now with ATAC-seq.
```{r umap_atac}
##----------------------------------------------------------------------------##
## UMAP H matrix ##
##----------------------------------------------------------------------------##
hmatrix_norm <- HMatrix(atac_norm_nmf_exp, k = 8)
umapView <- umap(t(hmatrix_norm))
umapView_df <- as.data.frame(umapView$layout)
colnames(umapView_df) <- c("UMAP1", "UMAP2")
type_colVector <- atac_annotation %>%
dplyr::select(Celltype, color) %>%
arrange(Celltype) %>%
distinct() %>%
deframe()
umapView_df %>%
rownames_to_column("sampleID") %>%
left_join(atac_annotation, by = "sampleID") %>%
mutate(Celltype = factor(Celltype, levels = c("HSC", "MPP", "LMPP",
"CMP", "GMP", "MEP",
"CLP", "CD4", "CD8",
"NK", "Bcell", "Mono"))) %>%
ggplot(aes(x=UMAP1, y=UMAP2, color = Celltype)) +
geom_point(size = 2.5, alpha = 0.95) +
scale_color_manual(values = type_colVector) +
theme_cowplot()
```
> Question: can you understand the differences?
## How to go further?
To evaluate the potential of the integrative analysis, one could look at the joined signatures obtained from the iNMF analysis. These signatures contain both **genes** and **genomics regions** (or ATAC peaks), which could be investigated further
* are the peaks and the genes inside a signature in genomic proximity?
* what are the motifs enriched in the peaks of a specific signature? Can we identify specific transcription factors?
* if the peaks of a signature are enriched in specific motifs (i.e. transcription factors), are the genes in the same signature enriched for known targets?