NMF on Gene expression and Chromatin accessibility data

```{r setup, include=FALSE, message=FALSE, warning=FALSE, echo=FALSE} #this chunk sets global options for Rmarkdown knitr::opts_knit$set(root.dir=params$work_dir) knitr::opts_chunk$set(echo = TRUE) options(knitr.table.format="html") setwd(params$work_dir) library(ButchR) library(ggplot2) library(viridis) library(ComplexHeatmap) library(gage) library(umap) library(tibble) library(cowplot) library(dplyr) library(DT) library(knitr) ``` # NMF on Gene expression and Chromatin accessibility data In this tutorial, we will be working with a dataset combining both bulk RNA-seq and ATAC-seq from sorted blood cells, which was published by Corces et al. in 2016 Corces, M.R., Buenrostro, J.D., Wu, B., Greenside, P.G., Chan, S.M., Koenig, J.L., Snyder, M.P., Pritchard, J.K., Kundaje, A., Greenleaf, W.J., Majeti, R., Chang, H.Y., 2016. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature Genetics 48, 1193–1203. [https://doi.org/10.1038/ng.3646](https://doi.org/10.1038/ng.3646). Will we working with the RNA-seq and ATAC-seq dataset over multiple sorted blood cell populations, and verify if we can obtain specific signatures from each modality, and combine the modalities to obtain multiomics signatures. Hence, we will perform both **single-NMF** on each modality, and then combined **integrative NMF** combining both modalities. Eventually, we will extract unsupervised signatures, and try to identify what these signatures correspond to. First, we define a range of factorization ranks for the decomposition, as we do not know *a priori* what the right number is: ```{r factorank} ##----------------------------------------------------------------------------## ## Factorization ranks for all NMFs ## ##----------------------------------------------------------------------------## factorization_ranks <- 6:10 ``` ## Gene expression (RNAseq) ### Data loading Read normalized gene expression matrix. The raw counts have been normalized by library size (using the EstimateLibrarySize function from DESeq2), and then log-normalized. ```{r rna_dataloading, results="asis", message=FALSE} setwd('~/ETBII2023') ##----------------------------------------------------------------------------## ## Read normalized data ## ##----------------------------------------------------------------------------## # read normalized matrix rna_norm_mat <- readRDS("data/rnaseq/rnaseq_normalized_counts.RDS") rna_annotation <- readRDS("data/rnaseq/rnaseq_annotation.RDS") ##----------------------------------------------------------------------------## ## Print dataset dimension ## ##----------------------------------------------------------------------------## data.frame(dim(rna_norm_mat), row.names = c("features", "samples")) ``` ### Applying NMF Applying Non-Negative Matrix Factorization (NMF) to normalized transcriptome data (RNAseq). We perform the factorization over a range of ranks (6 to 10), and will check later which seems more appropriate. Given that NMF is initialized randomly, we run the procedure 10 times, with up to 10^4 iterations each time, and a convergence threshold of 40, meaning that the iterations will stop if the error function does not change for this number of iterations. **Warning: do not worry about possible error/warning message due to library loading the first time you run the decomposition....** ```{r rna_NMF_run, message=FALSE, warning=FALSE, cache=TRUE, error=FALSE} ##----------------------------------------------------------------------------## ## run NMF ## ##----------------------------------------------------------------------------## rna_nmf_exp <- run_NMF_tensor(X = rna_norm_mat, ranks = factorization_ranks, method = "NMF", n_initializations = 10, iterations = 10^4, convergence_threshold = 40, extract_features = TRUE) rna_nmf_exp ## Normalize NMF: W matrix is normalized column-wise, and H is correspondingly rescaled rna_norm_nmf_exp <- normalizeW(rna_nmf_exp) ``` ### Factorization quality metrics and optimal K Based on the results of the factorization quality metrics, an optimal number of signatures (k) must be chosen. Several criteria can be used to determine a possible optimal number of signatures. Remember however that **THE** optimal number often does not exist, and that several factorization ranks can give you different byt useful information about the dataset! ```{r rna_NMF_optK, results='hide',fig.keep='all', message=FALSE, warning=FALSE} ## Plot K stats gg_plotKStats(rna_norm_nmf_exp) ``` In order to find the optimal factorization ranks, you should: minimize * the [Frobenius error](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm) * the [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) * the mean [Amari distance](https://en.wikipedia.org/wiki/Amari_distance) maximize * the sum and mean [silhouette width](https://en.wikipedia.org/wiki/Silhouette_(clustering)) * the [cophenic coefficient](https://en.wikipedia.org/wiki/Cophenetic_correlation). Another good criteria is to look at the river plot, which indicates how signatures at different ranks are related to each other. ```{r rna_NMF_riverplot, results='hide',fig.keep='all', message=FALSE, warning=FALSE} ## Generate river plot plot(generateRiverplot(rna_norm_nmf_exp), plot_area=1, yscale=0.6, nodewidth=0.5) ``` > **Beware:** there seems to be a bug with the way this function displays the leftmost signatures... We can now plot the exposure matrices H for all the factorization ranks, adding the cell type annotation to the columns of the H matrix. ```{r rna_Hmatrix_Wnorm, fig.width=8, fig.height=5.5, out.width="90%", results='asis', warning=FALSE, message=FALSE} ##----------------------------------------------------------------------------## ## H matrix heatmap annotation ## ##----------------------------------------------------------------------------## #Annotation for H matrix heatmap # Define color vector type.colVector <- list(Celltype = setNames(rna_annotation$color[match(levels(rna_annotation$Celltype), rna_annotation$Celltype)], levels(rna_annotation$Celltype)) ) # Build Heatmap annotation heat.anno <- HeatmapAnnotation(df = data.frame(Celltype = rna_annotation$Celltype), col = type.colVector, show_annotation_name = TRUE, na_col = "white") ##----------------------------------------------------------------------------## ## Generate H matrix heatmap, W normalized ## ##----------------------------------------------------------------------------## for(ki in factorization_ranks) { #plot H matrix tmp.hmatrix <- HMatrix(rna_norm_nmf_exp, k = ki) h.heatmap <- Heatmap(tmp.hmatrix, col = viridis(100), name = "Exposure", clustering_distance_columns = 'pearson', show_column_dend = TRUE, heatmap_legend_param = list(color_bar = "continuous", legend_height=unit(2, "cm")), top_annotation = heat.anno, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE) print(h.heatmap) } ``` By visual inspection, we can find association of signatures to certain cell-types. We can do this analysis systematically using the recovery function: ```{r} tmp.hmatrix <- HMatrix(rna_norm_nmf_exp, k = 7) recovery_plot(tmp.hmatrix, rna_annotation$Celltype) ``` ### Gene set enrichment analysis Using the feature exposure extracted from the W matrix, a gene set enrichment analysis is performed against all MSigDB terms The optimal factorization rank selected was: **K = 7** ```{r rna_gsea, results="asis", message=FALSE} ##----------------------------------------------------------------------------## ## W matrix Z scores ## ##----------------------------------------------------------------------------## k.opt = 7 rna_wmatrix <- WMatrix(rna_norm_nmf_exp, k = k.opt) #Zscore for each signature rna_wmatrix.zscores <- apply(rna_wmatrix, MARGIN=2, function(wmat_score){ (wmat_score - median(wmat_score)) / mad(wmat_score) }) colnames(rna_wmatrix.zscores) <- paste0("Signature", 1:k.opt) ``` We now apply an enrichment analysis using the GAGE package. As a reference, we are using the collection of genesets from MSigDB corresponding to experimentally/computationally identified groups of genes. ```{r gage} ##----------------------------------------------------------------------------## ## GAGE (Generally Applicable Gene-set Enrichment) analysis ## ##----------------------------------------------------------------------------## #Infer gene sets tha are significantly pertubed relative to all genes considered #load precompiled GSEA MSigDB gene sets ## this is the full collection of genesets (17800 genesets) #gs.msigDB <- readList("db/msigdb.v6.2.symbols.gmt") ## this is the set of genesets related to immunological processes gs.msigDB <- readList("db/c7.all.v2022.1.Hs.symbols.gmt") # head(gs.msigDB) #run GAGE analysis rna_msigDB_enrichment <- gage(rna_wmatrix.zscores, gsets=gs.msigDB, same.dir=TRUE) #Drop NAs for upregulated rna_msigDB_enrichment <- as.data.frame(rna_msigDB_enrichment$greater) rna_msigDB_enrichment <- rna_msigDB_enrichment[!is.na(rna_msigDB_enrichment$p.geomean),] rna_msigDB_enrichment <- rna_msigDB_enrichment[, paste0("Signature", 1:k.opt)] # Select only more enriched terms in one signature compared to the others idx <- apply(rna_msigDB_enrichment, 1, function(term){ term <- -log10(term) # Change 0 to small value to avoid NAs term[term == 0] <- 1e-40 # find if this term is more enriched in one signature compared to others is.enrich <- sapply(term, function(x){ # p-value 5 times greater than at least 5 other signatures sum(x/term > 5) > 5 }) any(is.enrich) }) rna_msigDB_enrichment <- rna_msigDB_enrichment[idx,] # Print table datatable(rna_msigDB_enrichment, filter="top", extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = list(list(extend = 'collection', buttons = c('excel', 'csv'), text = 'DOWNLOAD DATA')))) %>% formatSignif(columns=colnames(rna_msigDB_enrichment), digits=3) ``` > Questions: > * do the enriched term correspond to the recovery plot analysis? > * redo the analysis with the `msigdb.v6.2.symbols.gmt` collection ## Integrative RNAseq & ATACseq **Gene expression (RNAseq) & Chromatin accessibility (ATACseq)** Here, we want to learn signatures by integrating RNA-seq and ATAC-seq, using the integrative NMF method, allowing to extract the **homogeneous** as well as **heterogeneous** part of the datasets. The parameter $\lambda$ allows to balance these 2 contributions (homogeneous/heterogeneous), and needs to be estimated (see below). Only the those samples with RNAseq and ATACseq data were used in the integrative analysis. ### Data loading Read normalized gene expression matrix and chromatin accessibility matrix... ```{r integrative_dataloading, results="asis", message=FALSE} ##----------------------------------------------------------------------------## ## Read normalized data ## ##----------------------------------------------------------------------------## # read normalized matrix int_norm_mat <- readRDS("data/multiview/multiview_norm_mat_list.RDS") int_annotation <- readRDS("data/multiview/multiview_annotation.RDS") ##----------------------------------------------------------------------------## ## Print dataset dimension ## ##----------------------------------------------------------------------------## ## RNA-seq data.frame(dim(int_norm_mat$rna), row.names = c("features", "samples")) ## ATAC-seq data.frame(dim(int_norm_mat$atac), row.names = c("features", "samples")) ``` ### integrative NMF Lambda tuning In iNMF the lambda parameter controls if the factorization should shift towards extracting more homogeneous effect or heterogeneous effect. Based on Yang and Michailidis, 2016, to tune the value of the parameter lambda for the integrative NMF (iNMF), the objectives values of join NMF (jNMF) are compared to single view NMFs (sNMF), the principle is that **join NMF** represents complete **homogeneity** and **single view NMF** represents complete **heterogeneity**. To avoid overfitting, the best lambda can be selected by plotting the difference of the unsquared residual quantities of jNMF and iNMF (Ri - Rj) over multiple values of lambda, and compare it to the difference of the unsquared residual quantities of sNMF and jNMF c*(Rj - Rs). The optimal lambda usually is the first lambda in which (Ri - Rj) < c*(Rj - Rs) where c is a constant >= 2. In other words, the optimal lambda should achieve an optimal balance between the homogeneous and heterogeneous part of the data. For more Help please run *?iNMF_lambda_tuning.* The following function helps to select the optimal lamba for a fixed factorization rank (k = 8). ```{r iNMF_tune, message=FALSE, warning=FALSE, cache=TRUE} ##----------------------------------------------------------------------------## ## Run integrative NMF ## ##----------------------------------------------------------------------------## opt_lambda = iNMF_lambda_tuning(matrix_list = int_norm_mat, lambdas = seq(0.02, 0.22, 0.05), thr_cons = 5, Output_type = "residuals", rank = 8, n_initializations = 5, iterations = 10^4, convergence_threshold = 40) opt_lambda ``` > Spend some time to understand why the curves behave as they do... The optimal lambda is the smallest value before overfitting of iNMF (around 0.1 or so in this case...). However, in order to enhance the specific contribution ### Applying integrative NMF Applying Integrative Non-Negative Matrix Factorization (NMF) to normalized Gene expression (RNAseq) and Chromatin accessibility data (ATACseq) ```{r iNMF_run, message=FALSE, warning=FALSE, cache=TRUE} ##----------------------------------------------------------------------------## ## Run integrative NMF ## ##----------------------------------------------------------------------------## inmf_exp <- run_iNMF_tensor(matrix_list = int_norm_mat, ranks = 8, n_initializations = 5, iterations = 10^4, convergence_threshold = 40, Sp = 0, lamb = 0.07, extract_features = FALSE) inmf_exp ``` ### Factorization quality metrics and optimal K Based on the results of the factorization quality metrics, an optimal number of signatures (k) must be chosen: ```{r iNMF_optK, results='hide',fig.keep='all', message=FALSE, warning=FALSE} ## Plot K stats gg_plotKStats(inmf_exp) ``` Minize the Frobenius error, the coefficient of variation and the mean Amari distance, while maximizing the sum and mean silhouette width and the cophenic coefficient. ### H Matrices: {.tabset} Here, we can define some functions to plot the shared, specific or total part of the H matrix: ```{r iNMF_Hmatrix, fig.width=12, fig.height=4, out.width="90%", results='asis', warning=FALSE, message=FALSE} ##----------------------------------------------------------------------------## ## Plot H matrix heatmap integrative NMF ## ##----------------------------------------------------------------------------## plotH_oneView <- function(int_nmf, viewID, k, heat.anno, main_hist, col=viridis(100), scale_color=TRUE, displayID = viewID){ sharedH <- HMatrix(int_nmf, k = k, type = "shared") Hview <- HMatrix(int_nmf, k = k, type = "viewspec", view_id = viewID) # hs <- HMatrix(inmf_exp, k = ki, type = "shared") # hv_r <- HMatrix(inmf_exp, k = ki, type = "viewspec", view_id = "rna") # hv_a <- HMatrix(inmf_exp, k = ki, type = "viewspec", view_id = "atac") # Define total H matrix totalH <- sharedH + Hview # Color Function if (scale_color) { colf <- circlize::colorRamp2(seq(0, max(totalH), length.out = 100), col) } else { colf <- col } #main_hist <- hclust(as.dist(1 - cor(totalH, method = "pearson"))) tH.heatmap <- Heatmap(totalH, col = colf, name = "Total Exposure", column_title = "Total H matrix", cluster_columns = main_hist, show_column_dend = TRUE, top_annotation = heat.anno, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE) sH.heatmap <- Heatmap(sharedH, col = colf, name = "Shared Exposure", column_title = "Shared H matrix", cluster_columns = main_hist, show_column_dend = TRUE, top_annotation = heat.anno, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE) vH.heatmap <- Heatmap(Hview, col = colf, name = "View Specific Exposure", column_title = "View specific H matrix", cluster_columns = main_hist, show_column_dend = TRUE, top_annotation = heat.anno, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE) #print(tH.heatmap + sH.heatmap + vH.heatmap) # stiff_cor_heat <- as_ggplot(grid.grabExpr(draw(stiff_cor_heat))) # stiff_cor_heat #ht_global_opt(heatmap_legend_grid_height = unit(.25, "cm")) ht_list <- tH.heatmap + sH.heatmap + vH.heatmap draw(ht_list, row_title = displayID) ht_global_opt(RESET = TRUE) } ``` Define functions to add annotations ```{r} ##----------------------------------------------------------------------------## ## H matrix heatmap annotation ## ##----------------------------------------------------------------------------## #Annotation for H matrix heatmap # Define color vector type.colVector <- list(Celltype = setNames(int_annotation$color[match(levels(int_annotation$Celltype), int_annotation$Celltype)], levels(int_annotation$Celltype)) ) # Build Heatmap annotation heat.anno <- HeatmapAnnotation(df = data.frame(Celltype = int_annotation$Celltype), col = type.colVector, show_annotation_name = FALSE, na_col = "white") my_Heatmap <- function(my_h, heat.anno, matrix_id){ hm <- Heatmap(my_h, col = viridis(100), name = matrix_id, column_title = matrix_id, cluster_columns = main_hist, show_column_dend = TRUE, top_annotation = heat.anno, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE) } ``` We can now plot for the different factorization ranks the specific part of the H matrix both for RNA-seq and ATAC-seq: ```{r} ##----------------------------------------------------------------------------## ## Generate H matrix heatmap, W normalized ## ##----------------------------------------------------------------------------## ki = 8 sharedH <- HMatrix(inmf_exp, k = ki, type = "shared") main_hist <- hclust(as.dist(1 - cor(sharedH, method = "pearson"))) plotH_oneView(inmf_exp, k = ki, heat.anno = heat.anno, main_hist = main_hist, scale_color = FALSE, viewID = "rna", displayID = "RNAseq") plotH_oneView(inmf_exp, k = ki, heat.anno = heat.anno, main_hist = main_hist, scale_color = FALSE, viewID = "atac", displayID = "ATACseq") ``` > Looking at the view specific H matrix, you can see some specificity brought by each modality to each signature. > Try to spot some differences between the ATAC-seq and RNA-seq for some of the signatures! > Beware that the scale of the shared and specific H matrix depends on the chouce of lambda. A smaller lambda will increase the specific part. You can try running the iNMF again with a smaller lambda. ### Gene set enrichment analysis Using the feature exposure extracted from the W matrix, a gene set enrichment analysis is perform agains all MSigDB terms The optimal factorization rank selected was: **k = 8** ```{r iNMF_gsea, results="asis", message=FALSE, warning=FALSE} ##----------------------------------------------------------------------------## ## W matrix Z scores ## ##----------------------------------------------------------------------------## int_rna_wmatrix <- WMatrix(inmf_exp, k = 8, view_id = "rna",) #Zscore for each signature int_rna_wmatrix.zscores <- apply(int_rna_wmatrix, MARGIN=2, function(wmat_score){ (wmat_score - median(wmat_score)) / mad(wmat_score) }) colnames(int_rna_wmatrix.zscores) <- paste0("Signature", 1:8) ``` We can also extract the X matrix for the atac-seq data: ```{r iNMF_atac_gsea, results="asis", message=FALSE, warning=FALSE} ##----------------------------------------------------------------------------## ## W matrix Z scores ## ##----------------------------------------------------------------------------## int_atac_wmatrix <- WMatrix(inmf_exp, k = 8, view_id = "atac",) # Zscore for each signature int_atac_wmatrix.zscores <- apply(int_atac_wmatrix, MARGIN=2, function(wmat_score){ (wmat_score - median(wmat_score)) / mad(wmat_score) }) colnames(int_atac_wmatrix.zscores) <- paste0("Signature", 1:8) ``` > Can you identify the top 10 genes and the top 10 ATAC-seq peaks for a particular signature? Again, we perform the enrichment with gage on the RNA-seq W-matrix: ```{r gage_inmf} ##----------------------------------------------------------------------------## ## GAGE (Generally Applicable Gene-set Enrichment) analysis ## ##----------------------------------------------------------------------------## #Infer gene sets tha are significantly pertubed relative to all genes considered #load precompiled GSEA MSigDB gene sets gs.msigDB <- readList("db/c7.all.v2022.1.Hs.symbols.gmt") # head(gs.msigDB) #run GAGE analysis int_rna_msigDB_enrichment <- gage(int_rna_wmatrix.zscores, gsets=gs.msigDB, same.dir=TRUE) #Drop NAs for upregulated int_rna_msigDB_enrichment <- as.data.frame(int_rna_msigDB_enrichment$greater) int_rna_msigDB_enrichment <- int_rna_msigDB_enrichment[!is.na(int_rna_msigDB_enrichment$p.geomean),] int_rna_msigDB_enrichment <- int_rna_msigDB_enrichment[, paste0("Signature", 1:9)] # Select only more enriched terms in one signature compared to the others idx <- apply(int_rna_msigDB_enrichment, 1, function(term){ term[term == 0] <- 1e-40 term <- -log10(term) # Change 0 to small value to avoid NAs # find if this term is more enriched in one signature compared to others is.enrich <- sapply(term, function(x){ # p-value 5 times greater than at least 5 other signatures sum(x/term > 5) > 5 }) any(is.enrich) }) # table(idx) # head(idx) # head(int_rna_msigDB_enrichment) int_rna_msigDB_enrichmentf <- int_rna_msigDB_enrichment[idx,] # Print table datatable(int_rna_msigDB_enrichmentf, filter="top", extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = list(list(extend = 'collection', buttons = c('excel', 'csv'), text = 'DOWNLOAD DATA')))) %>% formatSignif(columns=colnames(int_rna_msigDB_enrichment), digits=3) ``` ## If you have time... You can perform the single NMF just on the ATAC-seq data! ## Chromatin accessibility (ATACseq) ### Data loading Read normalized chromatin accessibility matrix. As for RNA-seq, the data has been normalized by library size estimated using DESeq2, and log-normalized. ```{r atac_dataloading, results="asis", message=FALSE} ##----------------------------------------------------------------------------## ## Read normalized data ## ##----------------------------------------------------------------------------## # read normalized matrix atac_norm_mat <- readRDS("data/atacseq/atacseq_normalized_counts.RDS") atac_annotation <- readRDS("data/atacseq/atacseq_annotation.RDS") ##----------------------------------------------------------------------------## ## Print dataset dimension ## ##----------------------------------------------------------------------------## cat("Dimension of Chromatin accessibility dataset (ATACseq): \n\n ") kable(data.frame(dim(atac_norm_mat), row.names = c("features", "samples")), col.names = "") ``` ### Applying NMF Applying Non-Negative Matrix Factorization (NMF) to normalized Chromatin accessibility data (ATACseq) ```{r atac_NMF_run, message=FALSE, warning=FALSE, cache=TRUE} ##----------------------------------------------------------------------------## ## run NMF ## ##----------------------------------------------------------------------------## atac_nmf_exp <- run_NMF_tensor(X = atac_norm_mat, ranks = factorization_ranks, method = "NMF", n_initializations = 8, iterations = 10^4, convergence_threshold = 40, extract_features = FALSE) atac_nmf_exp ## Normalize NMF atac_norm_nmf_exp <- normalizeW(atac_nmf_exp) ``` ### Factorization quality metrics and optimal K Based on the results of the factorization quality metrics, an optimal number of signatures (k) must be chosen: ```{r atac_NMF_optK, results='hide',fig.keep='all', message=FALSE, warning=FALSE} ## Plot K stats gg_plotKStats(atac_norm_nmf_exp) ## Generate river plot #plot(generateRiverplot(atac_norm_nmf_exp), plot_area=1, yscale=0.6, nodewidth=0.5) ``` Again, we need to minize the Frobenius error, the coefficient of variation and the mean Amari distance, while maximizing the sum and mean silhouette width and the cophenic coefficient. ### H Matrix, W normalized: {.tabset} ```{r atac_Hmatrix_Wnorm, fig.width=8, fig.height=5.5, out.width="90%", results='asis', warning=FALSE, message=FALSE} ##----------------------------------------------------------------------------## ## H matrix heatmap annotation ## ##----------------------------------------------------------------------------## #Annotation for H matrix heatmap # Define color vector type.colVector <- list(Celltype = setNames(atac_annotation$color[match(levels(atac_annotation$Celltype), atac_annotation$Celltype)], levels(atac_annotation$Celltype)) ) # Build Heatmap annotation heat.anno <- HeatmapAnnotation(df = data.frame(Celltype = atac_annotation$Celltype), col = type.colVector, show_annotation_name = TRUE, na_col = "white") ##----------------------------------------------------------------------------## ## Generate H matrix heatmap, W normalized ## ##----------------------------------------------------------------------------## for(ki in factorization_ranks) { cat("\n") cat(" \n#### H matrix for k=", ki, " {.tabset} \n ") #plot H matrix tmp.hmatrix <- HMatrix(atac_norm_nmf_exp, k = ki) h.heatmap <- Heatmap(tmp.hmatrix, col = viridis(100), name = "Exposure", clustering_distance_columns = 'pearson', show_column_dend = TRUE, heatmap_legend_param = list(color_bar = "continuous", legend_height=unit(2, "cm")), top_annotation = heat.anno, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE) print(h.heatmap) cat(" \n Recovery plots for k=", ki, " \n ") #recovery_plot(tmp.hmatrix, atac_annotation, "Celltype") } gc() ``` > Question: can you compare the H matrices obtained using RNA-seq and ATAC-seq? Do you see differences? > Are some cell types separated better in one or the other modality? ### UMAP representation First RNA-seq ```{r umap_rna} ##----------------------------------------------------------------------------## ## UMAP H matrix ## ##----------------------------------------------------------------------------## hmatrix_norm <- HMatrix(rna_norm_nmf_exp, k = 8) umapView <- umap(t(hmatrix_norm)) umapView_df <- as.data.frame(umapView$layout) colnames(umapView_df) <- c("UMAP1", "UMAP2") type_colVector <- rna_annotation %>% dplyr::select(Celltype, color) %>% arrange(Celltype) %>% distinct() %>% deframe() umapView_df %>% rownames_to_column("sampleID") %>% left_join(rna_annotation, by = "sampleID") %>% mutate(Celltype = factor(Celltype, levels = c("HSC", "MPP", "LMPP", "CMP", "GMP", "MEP", "CLP", "CD4", "CD8", "NK", "Bcell", "Mono"))) %>% ggplot(aes(x=UMAP1, y=UMAP2, color = Celltype)) + geom_point(size = 2.5, alpha = 0.95) + scale_color_manual(values = type_colVector) + theme_cowplot() ``` Now with ATAC-seq. ```{r umap_atac} ##----------------------------------------------------------------------------## ## UMAP H matrix ## ##----------------------------------------------------------------------------## hmatrix_norm <- HMatrix(atac_norm_nmf_exp, k = 8) umapView <- umap(t(hmatrix_norm)) umapView_df <- as.data.frame(umapView$layout) colnames(umapView_df) <- c("UMAP1", "UMAP2") type_colVector <- atac_annotation %>% dplyr::select(Celltype, color) %>% arrange(Celltype) %>% distinct() %>% deframe() umapView_df %>% rownames_to_column("sampleID") %>% left_join(atac_annotation, by = "sampleID") %>% mutate(Celltype = factor(Celltype, levels = c("HSC", "MPP", "LMPP", "CMP", "GMP", "MEP", "CLP", "CD4", "CD8", "NK", "Bcell", "Mono"))) %>% ggplot(aes(x=UMAP1, y=UMAP2, color = Celltype)) + geom_point(size = 2.5, alpha = 0.95) + scale_color_manual(values = type_colVector) + theme_cowplot() ``` > Question: can you understand the differences? ## How to go further? To evaluate the potential of the integrative analysis, one could look at the joined signatures obtained from the iNMF analysis. These signatures contain both **genes** and **genomics regions** (or ATAC peaks), which could be investigated further * are the peaks and the genes inside a signature in genomic proximity? * what are the motifs enriched in the peaks of a specific signature? Can we identify specific transcription factors? * if the peaks of a signature are enriched in specific motifs (i.e. transcription factors), are the genes in the same signature enriched for known targets?