BI345 Lab 8 - HackMD

# BI345 Lab 8 Gen Notes: We are learning single cell analyses this week using a subset of the M. pharaonis dataset from this paper. I took 3000 cells from each replicate of the castes Gyne and Male. We will be using the R package Seurat. This is a popular package for scRNA-seq analysis, that also has many other capabilities. The basic vignette is here. I’m not sure how slow your analysis will be because even after subsetting, these data take up a lot of memory in your R environment. You may want to team up with another student and walk through this tutorial together. Alternatively, you can download the data onto your personal machine and install R/ Seurat onto it and run through the analysis there. 1) Load the library and read in the dataset. To save space, the Seurat object is stored as an Rds object (R Data Set). ``` library(Seurat) setwd("/courses/bi345/Course_Materials/wk_11") small.Mpha <- readRDS("lab_08.Mpha.rds") setwd("/courses/bi345/sdivit25/Lab_8") ``` 2. Let’s explore this. ``` ncol(small.Mpha) nrow(small.Mpha) head(colnames(small.Mpha)) # appear to be barcodes head(rownames(small.Mpha)) # appear to be genes ``` ^ Note: The Seurat object contains barcodes/ libraries/ single-cells as columns and genes/ features as rows. The set up is very similar to all other types of RNA-seq datasets. 3. The metadata are tored in the object ``` small.Mpha[[]] # this is the metadata colnames(small.Mpha[[]]) ``` 4. The object also contains slots for analyses that will be done to it. ``` slotNames(small.Mpha) ``` 5. Let’s visualize typical QC metrics. These were stored as part of your metadata. ``` VlnPlot(small.Mpha, features=c("nFeature_RNA","nCount_RNA", "percent.mt"), group.by="caste", pt.size=0) FeatureScatter(small.Mpha, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by="caste", shuffle=T) FeatureScatter(small.Mpha, feature1 = "nCount_RNA", feature2 = "percent.mt", group.by="caste", shuffle=T) ``` Note: nCount are all the molecules (UMI), and nFeature are all the genes. Output: ![](https://i.imgur.com/XJGbgBO.png) ![](https://i.imgur.com/Ml4satm.png) ![](https://i.imgur.com/Z1j8Jlf.png) 6. We can also check for each gene (row), how many cells (columns) counts are present. ``` plot(sort(Matrix::rowSums(GetAssayData(small.Mpha) >= 3), decreasing = TRUE), xlab="gene rank", ylab="number of cells", main="Cells per genes (reads/gene >= 3 )") ``` Output: ![](https://i.imgur.com/jSYw7px.png) Note: You’ll see that a lot of genes are present only in a few cells. 7. Let’s summarize the metadata. This will show us that these data have already been filtered. ``` summary(small.Mpha[[]]) ``` Output: ![](https://i.imgur.com/eB6wZrw.png) Note: This QC filtering is described in the methods of the research paper. It is typical to remove cells with too high percent of mitochondrial genes, and too many of features (likely doublet, or more than one cell sequenced at a time). 8. Let’s normalize the data. The default method that Seurat uses is to normalize feature expression by total expression, multiply 10,000, then log-transform. ``` small.Mpha <- NormalizeData(small.Mpha) ``` Now that we’ve done this, both sets of data (raw counts and normalized counts) are stored in the object. 9. For visualization, we are going to identify highly variable genes. I’m going to give you two options. The first is the default, and the second is to find genes with some minimal expression present across at least 100 cells. We'll stick with default. ``` # default option; finds 2000 small.Mpha <- FindVariableFeatures(object = small.Mpha) top10 <- head(VariableFeatures(small.Mpha), 10) LabelPoints(plot=VariableFeaturePlot(small.Mpha), points=top10, repel=T) rm(top10) # alternate option, minimally expressed genes num.cells <- Matrix::rowSums(GetAssayData(small.Mpha, slot = "count") > 2) genes.use <- names(num.cells[which(num.cells >= 100)]) length(genes.use) #VariableFeatures(small.Mpha) <- genes.use rm(num.cells, genes.use) ``` Output: ![](https://i.imgur.com/myDwnMr.png) 10. Now we can run a PCA on these highly variable genes. (Note that this means we can only visualize on a UMAP or t-SNE plot one of these highly variable genes.) ``` # first scale data for input into PCA small.Mpha <- ScaleData(small.Mpha) # run PCA and generate 50 principal components small.Mpha <- RunPCA(object = small.Mpha, npcs = 50) ``` We only generated 50 PCs but typically you would generate many more. 11. Let’s visualize the PCs to see how many we should include in further dimensionality-reduced visualizations. ``` # there are too many PCs for this to be useful but you can do pairwise plotting like this DimPlot(small.Mpha, reduction="pca", group.by = "caste", shuffle=T) DimPlot(small.Mpha, reduction="pca", dims=c(2,3), group.by = "caste", shuffle=T) # instead, these heatmaps give you a better idea and show the top 200 most variable cells for each PC DimHeatmap(small.Mpha, dims=1:12, cells=200, balanced=T) DimHeatmap(small.Mpha, dims=13:24, cells=200, balanced=T) # People also use these Elbow Plots to see where additional PCs start giving you diminishing returns ElbowPlot(small.Mpha, ndims=50) ``` We are going to move forward with 30 PCs. Output: ![](https://i.imgur.com/gu9gv8s.png) ![](https://i.imgur.com/boCCHyI.png) ![](https://i.imgur.com/kZpzwXZ.png) ![](https://i.imgur.com/kBPPe2o.png) ![](https://i.imgur.com/osjfPwT.png) 12. Let’s find clusters of cells with similar expression patterns. First we have to find neighbors. The default method is to use “annoy” (approximate nearest neighbors oh yeah!). ``` small.Mpha <- FindNeighbors(object = small.Mpha, reduction = "pca", dims=1:30, nn.method="annoy") ``` 13. From the neighbors, we can find clusters using Louvain’s method. We are going to try a few different resolutions. In the paper, they used a resolution of 1.5 but we are going to try more. ``` small.Mpha <- FindClusters(object = small.Mpha, resolution=seq(1, 3, 0.5), verbose=T) head(small.Mpha[[]]) ``` The clusters have been stored in their own metadata columns. The different resolutions will have found 26-45 clusters. This is where you would be working closey with a Biologist that has some idea of how many types of genes you expect and how they should differ among sample types. 14. Your object now has a new metadata column too. in it, the most recent cluster identities have been saved. ``` table(small.Mpha@active.ident) table(small.Mpha[["seurat_clusters"]]) ``` 15. Now let’s run t-SNE and UMAP so we can visualize our clusters. These will each take a few minutes so feel free to just do one. ``` small.Mpha <- RunTSNE(object = small.Mpha, reduction = "pca", dims=1:30, do.fast=T, check_duplicates=F) small.Mpha <- RunUMAP(small.Mpha, reduction="pca", dims=1:30) ``` t-SNE and UMAP procedures are NOT linked to cluster generation. Remember that not all the genes go into these analyses anyway. They are purely for visualization and it is highly discouraged for you to use these methods to find clusters. 16. Let’s visualize the clusters we found at the different resolutions. You can substitute reduction="umap" if you chose it instead. ``` # show them at once DimPlot(object = small.Mpha, group.by=c("RNA_snn_res.1","RNA_snn_res.1.5","RNA_snn_res.2","RNA_snn_res.2.5","RNA_snn_res.3"), ncol=3 , pt.size=1.0, reduction = "tsne", label = T) # show them one at a time, but split by caste DimPlot(object = small.Mpha, group.by=c("RNA_snn_res.1.5"), pt.size=1.0, reduction = "tsne", label=T, split.by = "caste") DimPlot(object = small.Mpha, group.by=c("RNA_snn_res.2"), pt.size=1.0, reduction = "tsne", label=T, split.by = "caste") DimPlot(object = small.Mpha, group.by=c("RNA_snn_res.2.5"), pt.size=1.0, reduction = "tsne", label=T, split.by = "caste") ``` Outputs: ![](https://i.imgur.com/lX2Lsmx.png) ![](https://i.imgur.com/HerFYrK.png) ![](https://i.imgur.com/TbVx61w.png) ![](https://i.imgur.com/2s07J4S.png) 17. Choose one of the resolutions to keep, and get rid of the others. I chose to keep 1.5 and ran FindClusters again to write over the default stored cluster identities. ``` small.Mpha[["RNA_snn_res.1"]] <- NULL small.Mpha[["RNA_snn_res.2"]] <- NULL small.Mpha[["RNA_snn_res.2.5"]] <- NULL small.Mpha[["RNA_snn_res.3"]] <- NULL small.Mpha <- FindClusters(object = small.Mpha, resolution=1.5, verbose=T) head(small.Mpha[[]]) ``` 18. Now we can find markers (genes) that define (are DE etc.) each cluster. This takes a while so you might want to consider just doing the specific versions below. ``` #cluster.markers <- FindAllMarkers(small.Mpha, test.use='wilcox', only.pos=T) #head(cluster.markers) #table(cluster.markers$cluster) ``` 19. We can find genes that define a single cluster vs. all others. Let’s say I want to do this for cluster 19. We can plot the results too, at least for the genes that overlap with our highly variable features. ``` markers.19 <- FindMarkers(small.Mpha, ident.1="19", only.pos=T) head(markers.19) plot.19 <- row.names(markers.19)[row.names(markers.19) %in% VariableFeatures(small.Mpha)] FeaturePlot(small.Mpha, features=plot.19[1:4]) ``` Output: ![](https://i.imgur.com/KZdrtCC.png) 20. We can also find genes that are different between clusters. ``` markers.0to4 = FindMarkers(small.Mpha, ident.1=c("0","1","2"), ident.2=c("3","4"), only.pos = T) head(markers.0to4) plot.0to4 <- row.names(markers.0to4)[row.names(markers.0to4) %in% VariableFeatures(small.Mpha)] FeaturePlot(small.Mpha, features = plot.0to4[1:4]) ``` Output: ![](https://i.imgur.com/OzdKUH1.png) 21. We can also visualize specific genes for specific clusters. This shouldn’t be limited to highly variable genes. ``` VlnPlot(small.Mpha, features = c("Ac78C"), idents = c("0","1","2","3","4"), split.by = "caste") ``` Output: ![](https://i.imgur.com/zFWPFml.png)