Single Cell Gene Expression Analysis 實作 with R

# Single Cell Gene Expression Analysis 實作 with R ###### tags: `NGS` ,`Gene Expression`, `R`,`實作筆記` # 1. Seurat安装 `install.packages("Seurat")` # 2. 數據下載 Peripheral Blood Mononuclear Cells (PBMC) 是10X Genomics dataset page提供的一個資料，包含2700個單細胞，出自Illumina NextSeq 500平臺。 PBMCs是來自健康供體具有相對少量RNA（around 1pg RNA/cell）的原代細胞。在Illumina NextSeq 500平臺，檢測到2700個單細胞，每個細胞獲得69000 reads。 ``` tar -xvf pbmc3k_filtered_gene_bc_matrices.tar 文件夹下包含3个文件 barcodes.tsv genes.tsv matrix.mtx ``` >matrix.mtx：matrix.mtx 是 MatrixMarket格式文件 >https://math.nist.gov/MatrixMarket/formats.html >• 檔中儲存非零值; >• 注釋使用%標記； >• 第一行包含檔中總行數，總列數，總的記錄數 >• 每行中提供記錄的所處的行號和列號，已經記錄的內容 # 3. 數據匯入 ## 3.1 Read10X()函數可以自動讀入和解析資料 #### Load the PBMC dataset ```+= s = "R syntax highlighting"; library(dplyr) library(Seurat) library(patchwork) # Load the PBMC dataset pbmc.data <- Read10X(data.dir = "C:/Users/Christine Kao/Desktop/demo/pbmc3k_filtered_gene_bc_matrices") ``` #### 查看data ```+= # 維度 dim(pbmc.data) [1] 32738 2700 # 32,738 genes and 2,700 single cells pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30] 3 x 30 sparse Matrix of class "dgCMatrix" [[ suppressing 30 column names ‘AAACATACAACCAC-1’, ‘AAACATTGAGCTAC-1’, ‘AAACATTGATCAGC-1’ ... ]] CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5 TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . . MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . . #.表示0 summary(Matrix::colSums(pbmc.data)) Min. 1st Qu. Median Mean 3rd Qu. Max. 548 1758 2197 2367 2763 15844 slotNames(pbmc) [1] "assays" "meta.data" "active.assay" "active.ident" "graphs" "neighbors" [7] "reductions" "images" "project.name" "misc" "version" "commands" [13] "tools" ``` #### 查看mitochondria genes ```+= mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc.data), value = TRUE) length(mito.genes) [1] 13 ``` #### check out the meta data ```+= head(pbmc@meta.data) ``` ![](https://i.imgur.com/XMNySUh.png) #### Add some more meta data ```+= pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito") head(pbmc@meta.data) ``` ![](https://i.imgur.com/jrog6EN.png) #### 查看每個細胞有多少基因被檢測到 ```+= s = "R syntax highlighting"; at_least_one <- apply(pbmc.data, 2, function(x) sum(x>0)) hist(at_least_one, breaks = 100, main = "Distribution of detected genes", xlab = "Genes with at least one tag") ``` ![](https://i.imgur.com/Xo21FNE.png) ```+= s = "R syntax highlighting"; hist(Matrix::colSums(pbmc.data), breaks = 100, main = "Expression sum per cell", xlab = "Sum expression") ``` ![](https://i.imgur.com/oz5IGn1.png) ## 3.2 產生Seurat物件 ```+= s = "R syntax highlighting"; pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc An object of class Seurat 13714 features across 2700 samples within 1 assay Active assay: RNA (13714 features, 0 variable features) pbmc.data 32738 x 2700 sparse Matrix of class "dgCMatrix" head(pbmc$RNA@data[,1:5]) 6 x 5 sparse Matrix of class "dgCMatrix" AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 AL627309.1 . . . . AP006222.2 . . . . RP11-206L10.2 . . . . RP11-206L10.9 . . . . LINC00115 . . . . NOC2L . . . . AAACCGTGTATGCG-1 AL627309.1 . AP006222.2 . RP11-206L10.2 . RP11-206L10.9 . LINC00115 . NOC2L . ``` # 4. DATA Preprocess 這部分是基於資料質控方法，標準化和檢測到的變化基因**對資料進行篩選**。 ## 4.1 細胞的篩選 * 單個細胞中檢測到單個基因的數目 * 低品質的細胞以及空泡油滴中一般檢測到很少的基因 * 包含多個細胞的油滴會檢測到異常多的基因 * 利用在一個單細胞中的基因總數檢測到線粒體的基因數目百分比在做計算 * 低品質/垂死細胞通常會有線粒體污染 * 使用PercentageFeatureSet函數函數可以計算線粒體QC，計算線粒體基因所占百分比 * 基因名以MT- 開始的基因定義為線粒體基因 #### 線粒體基因占比計算 ```+= s = "R syntax highlighting"; # QC and selecting cells for further analysis # The [[ operator can add columns to object metadata. This is a great place to stash QC stats pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Show QC metrics show(pbmc@meta.data) # Show QC metrics for the first 5 cells head(pbmc@meta.data, 5) ``` ![](https://i.imgur.com/sETWxkE.png) ![](https://i.imgur.com/zu8UDyq.png) #### 視覺化查看基因數目, UMI數目, 線粒體基因占比 ```+= s = "R syntax highlighting"; # Visualize QC metrics as a violin plot VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) ``` ![](https://i.imgur.com/Wa64FLb.png) > A couple of cells have high mitochondrial percentage which may indicate lost of cytoplasmic RNA. #### 查看基因數目, 線粒體基因占比與UMI數目的**關係** ```+= s = "R syntax highlighting"; # FeatureScatter is typically used to visualize feature-feature relationships, but can be used # for anything calculated by the object, i.e. columns in object metadata, PC scores etc. plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 + plot2 ``` ![](https://i.imgur.com/Nhbevjc.png) The `GenePlot()` function can also be used to visualise **gene-gene relationships** as well as any columns in the seurat object. Below we use the plotting function to spot cells that have a high percentage of mitochondrial RNA and to plot the relationship between the number of unique molecules and the number of genes captured. ```+= par(mfrow = c(1, 2)) GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito", pch.use = '.') GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene", pch.use = '.') ``` #### 篩選 • 篩選掉檢測到基因數目超過2500或低於200的細胞 • 單個細胞中線粒體基因數目占比超過>5% ```+= s = "R syntax highlighting"; # In the example below, we visualize QC metrics, and use these to filter cells. # We filter cells that have unique feature counts over 2,500 or less than 200 # We filter cells that have >5% mitochondrial counts pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) ``` ```+= table(pbmc$nFeature_RNA > 200 & pbmc$nFeature_RNA < 2500 & pbmc$percent.mt < 5) FALSE TRUE 62 2638 ``` ## 4.2 Normalizing the data #### 標準化前，每個細胞總的表達量 ```+= hist(Matrix::colSums(pbmc$RNA@data), breaks = 100, main = "Total expression before normalisation", xlab = "Sum of expression") ``` ![](https://i.imgur.com/GREEq62.png) #### 標準化預設使用資料標準化方法是LogNormalize, 每個細胞總的表達量都標準化到10000，然後log取對數；結果存放於pbmc[["RNA"]]@data。 ```+= pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) ``` #### 標準化後，每個細胞總的表達量 ```+= hist(Matrix::colSums(pbmc$RNA@data), breaks = 100, main = "Total expression before normalisation", xlab = "Sum of expression") ``` ![](https://i.imgur.com/3lOdNkO.png) ## 4.3 Identification of highly variable features (feature selection) 變化基因鑒定鑒定在細胞間表達高度變化的基因，後續研究需要集中于這部分基因。 Seurat內置的FindVariableFeatures()函數，首先計算每一個基因的均值和方差，並且直接模擬其關係。默認返回2000個基因。 ```+= pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) Calculating gene variances 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| Calculating feature variances of standardized and clipped values 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| ``` #### Identify the 10 most highly variable genes ```+= top10 <- head(VariableFeatures(pbmc), 10) top10 [1] "PPBP" "S100A9" "IGLL5" "LYZ" "GNLY" "FTL" "PF4" "FTH1" [9] "FCER1A" "GNG11" ``` #### 畫出表達變化的基因，從而觀察其分佈 ```+= # plot variable features with and without labels plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) plot1 plot2 plot1 + plot2 ``` plot1 ![](https://i.imgur.com/nm0vLtk.png) plot2 ![](https://i.imgur.com/f1u5K3e.png) plot1 + plot2 ![](https://i.imgur.com/6H7onMT.png) ## 4.4 Scaling the data Seurat建構一線性模型去預測基於使用者定義的變數去預測基因的表達情形，並移除變異數差異較大的數值 > The example provided in the tutorial used the number of detected molecules per cell and the percentage mitochondrial RNA to build a linear model. > > The scaled z-scored residuals, i.e. how much the actual expression differs from the linear model, are stored in the scale.data slot, which are used for dimensionality reduction and clustering. ScaleData()函數可以實現線性轉換縮放資料。最終每個基因均值為0，方差為1。結果存放於pbmc[["RNA"]]@scale.data。 ```+= # slot is empty before running ScaleData() all.genes <- rownames(pbmc) > pbmc <- ScaleData(pbmc, features = all.genes) Centering and scaling data matrix |=======================================================================| 100% class(pbmc@scale.data) [1] "matrix" "array" pbmc[["RNA"]]@scale.data[1:6, 1:6] AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AL627309.1 -0.05744997 -0.05744997 -0.05744997 AP006222.2 -0.03318769 -0.03318769 -0.03318769 RP11-206L10.2 -0.04118635 -0.04118635 -0.04118635 RP11-206L10.9 -0.03325679 -0.03325679 -0.03325679 LINC00115 -0.08128413 -0.08128413 -0.08128413 NOC2L -0.31545160 -0.31545160 -0.31545160 AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 AL627309.1 -0.05744997 -0.05744997 -0.05744997 AP006222.2 -0.03318769 -0.03318769 -0.03318769 RP11-206L10.2 -0.04118635 -0.04118635 -0.04118635 RP11-206L10.9 -0.03325679 -0.03325679 -0.03325679 LINC00115 -0.08128413 -0.08128413 -0.08128413 NOC2L -0.31545160 -0.31545160 -0.31545160 ``` 設置參數features是因為ScaleData默認處理前面鑒定的差異基因。這一步怎麼做都不會影響到後續pca和聚類，但是會影響做熱圖。 #### 移除影響方差的因素 ```+= # remove unwanted sources of variation pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt") Regressing out percent.mt |=======================================================================| 100% Centering and scaling data matrix |=======================================================================| 100% ``` # 5.線性降維分析 ## 5.1 Perform linear dimensional reduction (PCA) One goal of Principal Component Analysis (PCA) is to find the direction/s (usually the first two principal components) in which there is the most variance. The RunPCA() function performs the PCA on genes in the @var.genes slot by default and this can be changed using the pc.genes parameter. PCA was only performed on the most variable genes, which is a subset of the dataset. The ProjectPCA step scores each gene in the dataset based on their correlation with the calculated components. This is useful because there may be genes that were not detected as variable genes in the variable gene selection step, which are still strongly correlated with cellular heterogeneity. 對縮放後的資料進行PCA分析，預設使用前面鑒定表達變化大的基因。使用features參數可以重新定義資料集。 Seurat基於PCA scores進行細胞聚類。每個PC代表一個包含關聯基因的資訊的基因集（metagene）。納入多少PC進入下游分析是重要的一步 ```+= pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) PC_ 1 Positive: MALAT1, LTB, IL32, CD2, ACAP1, STK17A, CTSW, CD247, CCL5, GIMAP5 AQP3, GZMA, CST7, TRAF3IP3, GZMK, MAL, HOPX, MYC, ITM2A, ETS1 BEX2, LYAR, LDLRAP1, GIMAP7, NKG7, KLRG1, ZAP70, RORA, RIC3, SAMD3 Negative: CST3, TYROBP, LST1, AIF1, FTL, FCN1, LYZ, FTH1, S100A9, FCER1G TYMP, CFD, LGALS1, CTSS, SERPINA1, S100A8, LGALS2, SPI1, IFITM3, PSAP COTL1, IFI30, CFP, SAT1, S100A11, NPC2, LGALS3, GSTP1, NCF2, PYCARD PC_ 2 Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA, HLA-DQB1, LINC00926, CD79B, HLA-DRB1, CD74 HLA-DPB1, HLA-DMA, HLA-DQA2, HLA-DRB5, HLA-DPA1, HLA-DMB, HVCN1, FCRLA, LTB, BLNK KIAA0125, P2RX5, IRF8, IGLL5, SWAP70, ARHGAP24, SMIM14, PPP1R14A, C16orf74, FCRL2 Negative: NKG7, PRF1, GZMA, CST7, GZMB, FGFBP2, CTSW, GNLY, GZMH, SPON2 CCL4, CCL5, FCGR3A, CD247, XCL2, CLIC3, AKR1C3, HOPX, SRGN, CTSC TTC38, S100A4, IL32, ANXA1, IGFBP7, ID2, ACTB, XCL1, APOBEC3G, SAMD3 PC_ 3 Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, CD74, HLA-DPA1, MS4A1, HLA-DRB1, HLA-DRB5 HLA-DRA, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, HVCN1, FCRLA, IRF8, BLNK KIAA0125, SMIM14, PLD4, IGLL5, TMSB10, P2RX5, SWAP70, LAT2, MALAT1, IGJ Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 NGFRAP1, F13A1, RUFY1, SEPT5, MPP1, CMTM5, TSC22D1, MYL9, RP11-367G6.3, GP1BA PC_ 4 Positive: HLA-DQA1, CD79A, CD79B, HIST1H2AC, HLA-DQB1, MS4A1, PF4, CD74, SDPR, PPBP HLA-DPB1, GNG11, HLA-DQA2, SPARC, HLA-DRB1, HLA-DPA1, TCL1A, GP9, HLA-DRA, LINC00926 HLA-DRB5, RGS18, NRGN, PTCRA, CD9, AP001189.4, CA2, CLU, TUBB1, ITGA2B Negative: VIM, S100A8, S100A6, S100A4, TMSB10, S100A9, IL32, GIMAP7, S100A10, LGALS2 MAL, RBP7, FCN1, FYB, CD2, LYZ, S100A12, AQP3, MS4A6A, GIMAP4 S100A11, ANXA1, FOLR3, GIMAP5, AIF1, MALAT1, IL8, TRABD2A, IFI6, TMSB4X PC_ 5 Positive: GZMB, FGFBP2, NKG7, GNLY, CCL4, PRF1, CST7, SPON2, GZMA, XCL2 GZMH, CLIC3, CTSW, TTC38, AKR1C3, IGFBP7, XCL1, S100A8, CCL3, CCL5 TYROBP, HOPX, S100A9, LGALS2, CD160, HAVCR2, FCER1G, PTGDR, RBP7, GSTP1 Negative: LTB, VIM, AQP3, PPA1, KIAA0101, MAL, CD2, IL32, TYMS, CYTIP HN1, TUBA1B, CORO1B, PTGES3, ANXA5, TRADD, GPR183, ZWINT, FYB, ATP5C1 COTL1, RRM2, TNFAIP8, ITM2A, PRDX1, TRAF3IP3, ABRACL, ACTG1, ATP1A1, BIRC5 ``` The PrintPCA() function outputs a set of genes that most strongly define a set of principal components. #### 查看對每個主成分影響比較大的基因集 ```+= # Examine and visualize PCA results a few different ways print(pbmc[["pca"]], dims = 1:5, nfeatures = 5) PC_ 1 Positive: MALAT1, LTB, IL32, CD2, ACAP1 Negative: CST3, TYROBP, LST1, AIF1, FTL PC_ 2 Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA Negative: NKG7, PRF1, GZMA, CST7, GZMB PC_ 3 Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 Negative: PPBP, PF4, SDPR, SPARC, GNG11 PC_ 4 Positive: HLA-DQA1, CD79A, CD79B, HIST1H2AC, HLA-DQB1 Negative: VIM, S100A8, S100A6, S100A4, TMSB10 PC_ 5 Positive: GZMB, FGFBP2, NKG7, GNLY, CCL4 Negative: LTB, VIM, AQP3, PPA1, KIAA0101 ``` #### 視覺化對每個主成分影響比較大的基因集 visualise top genes associated with principal components `VizDimReduction`, `DimPlot`, 和 `DimHeatmap`可以從基因或細胞角度視覺化pca結果 ```+= VizDimLoadings(pbmc, dims = 1:2, reduction = "pca") ``` ![](https://i.imgur.com/5dEJ3V2.png) #### `DimPlot`兩個主成分的展示 ```+= DimPlot(pbmc, reduction = "pca") ``` ![](https://i.imgur.com/MLnMUEm.png) #### `DimHeatmap` DimHeatmap繪製基於單個主成分的熱圖，細胞和基因的排序都是基於他們的主成分分數。對於資料異質性的探索是很有幫助的，可以幫助用戶選擇用於下游分析的主成分維度。可以針對異質性來源進行簡單探索，這對決定用哪個PC進行下游分析很有用。根據PCA評分（PCA scores）對細胞和基因進行排序。 ```+= DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE) ``` ![](https://i.imgur.com/rPIoeOT.png) 展示多個主成分 ```+= DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE) ``` ![](https://i.imgur.com/bmyrXlT.png) The next steps are to determine how many principal components to use in downstream analyses, which is an **important step** for Seurat. The tutorial goes through two methods: 1. one uses a statistical test based on a random null model which is time-consuming for large datasets due to the resampling and may not return a clear cutoff . The statistical test is carried out by the JackStraw() function, which randomly permutes a subset of data, and calculates projected PCA scores for these “random” genes. Then compares the PCA scores for the “random” genes with the observed PCA scores to determine statistical significance. 2. The other is a commonly used heuristic.End result is a p-value for each gene’s association with each principal component. This resampling test was inspired by the jackstraw procedure. Significant principal components are those with a strong enrichment of low p-value genes. PC的選擇，確定資料集真實維度，對Seurat是重要的一步，也是最有挑戰性的一步。建議以下3種方法： 1.更加監督性，通過探索PC尋找異質性的相關來源，可用于聯合GSEA分析。 2.基於隨機空白模型進行test，對於大資料集來說比較費時，且可能不能呈現比較好的PCcutoff值。 3.常用的啟發式演算法，可以快速計算。例子中得到的結果基本一致，但在PC7-10之間選擇哪個作為cutoff需要判斷。最終選擇了JackStraw的結果，因為PC熱圖的結果很好的解釋了PC的意義。儘管cutoff小範圍的變動並不影響結果，但強烈建議對所選PC進行探索。 ## 5.2 Determine the 'dimensionality' of the dataset 數據維度為了避免單個基因影響，Seurat聚類細胞時使用pca結果。首先需要確定的是使用多少個主成分用於後續分析。常用有兩種方法，一種是基於零分佈的統計檢驗方法，這種方法耗時且可能不會返回明確結果。另一種是主成分分析常用的啟發式評估。在`JackStraw()`函數中, 使用基於零分佈的置換檢驗方法。隨機抽取一部分基因(默認1%)然後進行pca分析得到pca分數，將這部分基因的pca分數與先前計算的pca分數進行比較得到顯著性p-Value。根據主成分（pc）所包含基因的p-value進行判斷選擇主成分。最終的結果是每個基因與每個主成分的關聯的p-Value。保留下來的主成分是那些富集小的p-Value基因的主成分。處理大資料時會花費大量時間； ```+= # NOTE: This process can take a long time for big datasets, comment out for expediency. More # approximate techniques such as those implemented in ElbowPlot() can be used to reduce # computation time pbmc <- JackStraw(pbmc, num.replicate = 100) |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03m 53s |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s pbmc <- ScoreJackStraw(pbmc, dims = 1:20) ``` `JackStrawPlot()`函數提供視覺化方法，用於比較每一個主成分的p-value的分佈，虛線是均勻分佈；顯著的主成分富集有小p-Value基因，實線位於虛線左上方。下圖表明保留10個pca主成分用於後續分析是比較合理的。 The JackStrawPlot() function provides a visualisation tool for comparing the distribution of p-values for each principal component with a uniform distribution (dashed line). “Significant” principal components will show a strong enrichment of genes with low p-values (solid curve above the dashed line). 用jackStraw進行重採樣的test，隨機轉化了一個子集（默認1%），重新進行PCA，構建了一個gene scores的“空白分佈”（null distribution），重複進行。 ![](https://i.imgur.com/BOWv671.png) `ElbowPlot()`內置了一些其它的方法可以減少執行時間。 ```+= ElbowPlot(pbmc) ``` ![](https://i.imgur.com/RSJyT88.png) 啟發式評估方法生成一個Elbow plot圖。在圖中展示了每個主成分對資料方差的解釋情況（百分比表示），並進行排序。根據自己需要選擇主成分，圖中發現第9個主成分是一個拐點，後續的主成分(PC)變化都不大了。 # 6. Cluster the cells Seurat v3應用基於圖形的聚類方法，例如KNN方法。具有相似基因表達模式的細胞之間繪製邊緣，然後將他們劃分為一個內聯群體。在PhenoGraph中，首先基於pca維度中（先前計算的pca資料）計算歐式距離（the euclidean distance），然後根據兩個細胞在局部的重合情況（Jaccard 相似係數）優化兩個細胞之間的邊緣權值。此步驟內置於FindNeighbors函數，輸入時先前確定的pc資料。為了聚類細胞，接下來應用模組化優化技術反覆運算將細胞聚集在一起。`FindClusters()`函數實現這一功能，其中需要注意resolution參數，該參數設置下游聚類分析的“granularity”，更大的resolution會導致跟多的細胞類群。 3000左右的細胞量，設置resolution為0.4-1.2是比較合適的。細胞資料集越大，需要更大的resolution參數, 會獲得更多的細胞聚類。 ```+= ### step 7 :　Cluster the cells pbmc <- FindNeighbors(pbmc, dims = 1:10) Computing nearest neighbor graph Computing SNN pbmc <- FindClusters(pbmc, resolution = 0.5) Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck Number of nodes: 2700 Number of edges: 99220 Running Louvain algorithm... 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| Maximum modularity in 10 random starts: 0.8737 Number of communities: 9 Elapsed time: 0 seconds ``` #### 查看細胞屬於那個類群可以使用函數Idents ```+= # Look at cluster IDs of the first 5 cells head(Idents(pbmc), 5) ## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 ## 1 3 1 2 ## AAACCGTGTATGCG-1 ## 6 ## Levels: 0 1 2 3 4 5 6 7 8 ``` # 7. Run non-linear dimensional reduction (UMAP/tSNE) Seurat提供了一些非線性降維度分析的方法，如tSNE和UMAP，以視覺化和探索這些資料集；這一步使用的資料建議與聚類分析使用的pc維度一致。 #### UMAP ```+= # install UMAP： reticulate::py_install(packages ='umap-learn') pbmc <- RunUMAP(pbmc, dims = 1:10) ``` ```+= # note that you can set `label = TRUE` or use the LabelClusters function to help label # individual clusters DimPlot(pbmc, reduction = "umap") ``` ![](https://i.imgur.com/47qKRMX.png) #### tSNE ```+= pbmc <- RunTSNE(object = pbmc, + dims.use = 1:10, + do.fast = TRUE) DimPlot(pbmc, reduction = "tsne",label = TRUE) ``` ![](https://i.imgur.com/1SKNUvN.png) #### 添加細胞類群的標籤 ```+= DimPlot(pbmc, reduction = "umap",label = TRUE) LabelClusters(DimPlot(pbmc, reduction = "umap"),id = 'ident') ``` 此時可以保存資料，方便下次直接導入資料修改圖形。 ```+= saveRDS(pbmc, file = "../output/pbmc_tutorial.rds") ``` # 8. Finding differentially expressed features (cluster biomarkers) 尋找差異表達基因 (cluster biomarkers) Seurat可以通過差異表達分析尋找不同細胞類群的標記基因。 `FindMarkers()`函數可以進行此操作，但是默認尋找單個類群(參數ident.1)與其他所有類群陽性和陰性標記基因。 The FindMarkers() function identifies positive and negative markers by comparing genes in cells of one cluster against genes in all other cells. `FindAllMarkers()`函數會自動尋找每個類群和其他每個類群之間的標記基因。 The FindAllMarkers() function automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells. min.pct參數:設定在兩個細胞群中任何一個被檢測到的百分比，通過此設定不檢測很少表達基因來縮短程式執行時間。默認0.1 The min.pct argument requires a gene to be detected at a minimum percentage in either of the two groups of cells thresh.test參數：設定在兩個細胞群中基因差異表達量。可以設置為0 ，程式執行時間會更長。 the thresh.test argument requires a gene to be differentially expressed (on average) by some amount between the two groups max.cells.per.ident參數：每個類群細胞抽樣設置；也可以縮短程式執行時間。 ```+= # find all markers of cluster 1 cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25) For a more efficient implementation of the Wilcoxon Rank Sum Test, (default method for FindMarkers) please install the limma package -------------------------------------------- install.packages('BiocManager') BiocManager::install('limma') -------------------------------------------- After installation of limma, Seurat will automatically use the more efficient implementation (no further action necessary). This message will be shown once per session |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=08s head(cluster1.markers, n = 5) p_val avg_logFC pct.1 pct.2 p_val_adj LDHB 5.735126e-91 0.7395570 0.967 0.596 7.865152e-87 LTB 5.301131e-89 0.8051416 0.978 0.632 7.269971e-85 IL32 4.264847e-86 0.7202524 0.937 0.459 5.848812e-82 CD3D 2.819575e-77 0.6523211 0.913 0.422 3.866765e-73 AQP3 3.012777e-67 0.8711659 0.419 0.103 4.131723e-63 ``` ```+= # find all markers distinguishing cluster 5 from clusters 0 and 3 cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=15s head(cluster5.markers, n = 5) p_val avg_logFC pct.1 pct.2 p_val_adj FCGR3A 2.895591e-198 2.979114 0.968 0.036 3.971013e-194 CFD 2.334400e-191 2.367825 0.943 0.033 3.201396e-187 IFITM3 5.079026e-191 2.694139 0.981 0.046 6.965376e-187 RP11-290F20.3 9.296979e-181 1.902260 0.847 0.016 1.274988e-176 CD68 5.694805e-179 2.072228 0.911 0.035 7.809855e-175 ``` ```+= # find markers for every cluster compared to all remaining cells, report only the positive ones > pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) Calculating cluster 0 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s Calculating cluster 1 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s Calculating cluster 2 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12s Calculating cluster 3 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s Calculating cluster 4 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s Calculating cluster 5 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=16s Calculating cluster 6 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=14s Calculating cluster 7 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=15s Calculating cluster 8 |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=12s > head(pbmc.markers) p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene RPS6 1.309478e-126 0.4592044 0.997 0.995 1.795819e-122 0 RPS6 RPL32 1.468610e-126 0.4280191 0.998 0.995 2.014052e-122 0 RPL32 RPS12 1.571531e-122 0.4825202 1.000 0.990 2.155197e-118 0 RPS12 RPS27 1.385371e-118 0.4712589 0.997 0.992 1.899897e-114 0 RPS27 RPS25 1.325691e-115 0.5187368 0.995 0.974 1.818053e-111 0 RPS25 RPL31 3.526601e-111 0.5170980 0.997 0.963 4.836381e-107 0 RPL31 pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC) Registered S3 method overwritten by 'cli': method from print.boxx spatstat # A tibble: 18 x 7 # Groups: cluster [9] p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr> 1 6.45e- 77 0.895 0.439 0.119 8.84e- 73 0 CCR7 2 1.65e- 37 0.697 0.323 0.119 2.26e- 33 0 PRKCQ-AS1 3 3.01e- 67 0.871 0.419 0.103 4.13e- 63 1 AQP3 4 2.17e- 59 0.814 0.629 0.239 2.97e- 55 1 CD2 5 0. 3.84 0.994 0.213 0. 2 S100A9 6 0. 3.82 0.966 0.12 0. 2 S100A8 7 0. 2.97 0.934 0.043 0. 3 CD79A 8 8.49e-274 2.48 0.619 0.022 1.16e-269 3 TCL1A 9 1.04e-210 2.13 0.96 0.231 1.43e-206 4 CCL5 10 4.29e-201 2.20 0.609 0.051 5.88e-197 4 GZMK 11 9.88e-179 2.30 0.968 0.136 1.35e-174 5 FCGR3A 12 9.81e-123 2.08 1 0.315 1.35e-118 5 LST1 13 8.89e-269 3.34 0.961 0.07 1.22e-264 6 GZMB 14 1.28e-177 3.43 0.935 0.133 1.76e-173 6 GNLY 15 2.33e-215 2.69 0.806 0.011 3.20e-211 7 FCER1A 16 6.99e- 21 2.00 1 0.508 9.59e- 17 7 HLA-DPB1 17 6.19e-182 4.94 0.933 0.011 8.49e-178 8 PF4 18 4.24e-116 5.87 1 0.024 5.81e-112 8 PPBP ``` Seurat可以通過參數test.use設定檢驗差異表達的方法(詳情見 DE vignett)。 ```+= cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE) |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s head(cluster1.markers, n = 5) myAUC avg_diff power avg_logFC pct.1 pct.2 RPS6 0.814 0.4592044 0.628 0.4592044 0.997 0.995 RPL32 0.814 0.4280191 0.628 0.4280191 0.998 0.995 RPS12 0.809 0.4825202 0.618 0.4825202 1.000 0.990 RPS27 0.804 0.4712589 0.608 0.4712589 0.997 0.992 RPS25 0.800 0.5187368 0.600 0.5187368 0.995 0.974 ``` Seurat有多種方法視覺化標記基因的方法 • VlnPlot: 基於細胞類群的基因表達概率分佈 • FeaturePlot：在tSNE 或 PCA圖中畫出基因表達情況 • RidgePlot，CellScatter，DotPlot ```+= VlnPlot(pbmc, features = c("MS4A1", "CD79A")) ``` ![](https://i.imgur.com/Fqin5uw.png) ```+= # you can plot raw counts as well VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE) ``` ![](https://i.imgur.com/DpjzHk0.png) ```+= FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")) ``` ![](https://i.imgur.com/wdfLhtb.png) DoHeatmap為指定的細胞和基因花表達熱圖。每個類群預設展示top 20標記基因。 ```+= top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC) DoHeatmap(pbmc, features = top10$gene) + NoLegend() ``` ![](https://i.imgur.com/BBOU6Lh.png) # 9.Assigning cell type identity to clusters 根據已知細胞類型與基因標記的對應關係，可以為細胞類群找到對應的細胞類型。 ```+= new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet") names(new.cluster.ids) <- levels(pbmc) pbmc <- RenameIdents(pbmc, new.cluster.ids) DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend() ``` ![](https://i.imgur.com/Dl7Pl68.png) ```+= saveRDS(pbmc, file = "../output/pbmc3k_final.rds") ``` [來源出自: _eason_生物信息学、统计学、Machine Learning](https://www.jianshu.com/p/fef17a1babc2) ![](https://i.imgur.com/x2WbBNo.png) # Cell Ranger主要流程(指令)： - cellranger mkfastq 這個指令主要功能是 demultiplex，將 Illumina 平台定序後的 raw base call (BCL) 轉檔成 FASTQ 檔案。當然，如果你拿到的檔案是 FASTQ 檔，可以跳過這一步。 - cellranger count 這個指令是最核心的步驟！這一步會比對樣本 FASTQ 跟參考序列的位置 (目前官方只提供人或老鼠的參考序列)，並決定哪些 barcode 是背景值、哪些 barcode 是 cell-associated [3]，產生下游分析的基因表現量表。早期的Cell Ranger 版本 (Cell Ranger v2.x and earlier)是假設 cell-associated barcode 應該會比背景 barcode 具有較多的 transcript counts，所以會設定一個閾值直接區分背景值及真實細胞barcode。而新版的 Cell Ranger (Cell Ranger v3.x) 則會分為兩步驟： (1) 跟舊版一樣的假設，先找出 high RNA content 的 barcode，這些先被歸類為真實細胞。 (2) 將一群「可能是空的 GEMs (Gel Bead in Emulsions)」挑出來，利用圖靈估計 (Simple Good-Turing smoothing) 對其基因建立一個 background model [4]。第一步剩下的 GEMs 的基因就跟這個 background model 比較，與這個 model 差異較大的就被歸類為真實細胞。而這個做法其實就是要挑出 low RNA content 的細胞。下圖是 10x 官網提供的比較 [3] (左：Cell Ranger 2.2，右：Cell Ranger 3.0)，綠色/藍色代表被判定為真實細胞，灰色代表背景值，樣本是混合 300 high RNA content 293T cells + 2000 low RNA content PBMC cells。圖是排序過的值，可以這樣解讀：有多少個 X(軸) barcode 帶有 Y(軸) 的UMI count。下圖可以看到，新版的 Cell Ranger 能夠將 low RNA content 細胞也挑出來，所以會比舊版細胞數還要多。 ![](https://i.imgur.com/VygS9mD.png) - cellranger aggr 當實驗設計為多樣本或多重複時才會需要這個步驟，必須將每個樣本庫 (library)先個別做完 cellranger count 再整合。這一步值得注意的參數是 “normalize”，預設為 mapped，計算方式是對每一個要合併的樣本取得「每顆細胞平均mapped到多少Transcriptome」，再以最小值當作基準，依比例重新計算每個合併樣本的 UMI count [5]。若下游分析會進行其他 normalize 處理，可以在這一步設置為 none。另外，cellranger aggr 也提供了一個 batch correction 功能，不過要提醒大家的是，這裡官方有特別強調，此功能主要是針對「樣本所採用的不同試劑版本」做校正，其他我們所認知的，例如重複實驗的校正是不適用的。如果需要其他類型的校正，需要使用第三方套件 [6]。 - cellranger reanalyze 最後這個功能比較進階，讓使用者可以彈性的重新計算、改變參數設定，例如有一群特定想看的細胞、想要排除的基因；或是聚類分析的參數，例如設定多少個 PC (principle components)、幾個 k-means clusters…等的。但是這個步驟目前不支援 Feature Barcoding analysis，所以 output 並不會看到一個新的基因表現量表，唯一能夠體現差異的是附帶的可視化 Loupe 檔案，結果可能會跟前面不同。 1. 拆分数据 mkfastq->得到fastq数据 ![](https://i.imgur.com/cwJwGXb.png) 目的：将每个flowcell 的Illumina sequencer's base call files (BCLs)转为fastq文件特色：它借鉴了Illumina出品的bcl2fastq，另外增加了：将10X 样本index名称与四种寡核苷酸对应起来，比如A1孔是样本SI-GA-A1，然后对应的寡核苷酸是GGTTTACT, CTAAACGG, TCGGCGTC, and AACCGTAA ，那么程序就会去index文件中将存在这四种寡核苷酸的fastq组合到A1这个样本提供质控结果，包括barcode 质量、总体测序质量如Q30、R1和R2的Q30碱基占比、测序reads数等可以使用10X简化版的样本信息表它的示意流程： ![](https://i.imgur.com/ga8THF2.png) 3. 细胞定量 count、 4. 定量组合 aggr、 5. 调参reanalyze， 6. 还有一些小工具比如mkref、mkgtf、upload、sitecheck、mat2csv、vdj、mkvdjref、testrun # File Generally, all single-cell RNA-seq datasets, regardless of technology or pipeline, will contain three files: * a file with the gene IDs, representing all genes quantified `barcodes`.tsv: cellular barcodes present in dataset ![](https://i.imgur.com/e1utprE.png) * a file with the cell IDs, representing all cells quantified `features.tsv`: IDs of quantified genes ![](https://i.imgur.com/OD2V0XQ.png) * a matrix of counts per gene for every cell `matrix.mtx`: a matrix of count values, where rows are associated with the gene IDs above and columns correspond to the cellular barcodes. Note that there are many zero values in this matrix. ![](https://i.imgur.com/vxJkmi0.png) ![](https://i.imgur.com/NI2wQL6.png) ![](https://i.imgur.com/plqXAzl.png)