Single Cell/Nuclei RNA-Seq Analysis Workflow

# Single Cell/Nuclei RNA-Seq Analysis Workflow ###### tags: `Single Cell` ## Level1 WDL: Produce Fastq files from BCL [Reference Level1 WDL](https://portal.firecloud.org/?return=terra#methods/scrna/makefastq/9) 1. Upload compressed BCL files from CCG to your google bucket. ``` gsutil cp [bcl.tar.gz] gs://[google bucket]/ ``` 2. Upload Makefastq index csv file to your google bucket. ``` gsutil cp [index_csv] gs://[google bucket]/meta/[index_csv] ``` ``` cat [index_csv] Lane,Sample,Index *,SampleName1,Sample1Index[Provided by CCG] *,SampleName2,Sample2Index[Provided by CCG] ..,..,.. ``` **Input:** **`input_csv_file`:** Google bucket URL of the index file. - e.g. `gs://[google bucket]/meta/[index_csv]` **`input_bcl_directory`:** Compressed/folder of BCL file - e.g. `gs://[google bucket]/[bcl.tar.gz]` **`output_directory`:** Google bucket to store fastq files - e.g. `gs://[google bucket]/Bcl_folder/ **docker_registry**: `jingxin` **cellranger_version**: `"6.0.1"` ## Level2 WDL: Generate preprocessed and normalized count matrix [Reference Level2 WDL](https://portal.firecloud.org/?return=terra#methods/scrna/level2_cellbender/13) > !!! Important: > Must put all additonal column names in the `input_file` that you want to include in your downstream analysis to the `append_attributes` parameter. > e.g. > If I want to perform batch removal on a customize batch infomration, I will > 1. Add additional column into the `input_file` to indicate batch information > 2. Assign column name of that column to`append_attributes`. > 3. If there are multiple columns you want to added in your final aggregated obj, you can use `,` to seperate column names in the `append_attributes` parameter. ### Generate raw count matrix We implement [Cellrange Count](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count) methond into WDL([scrna/count](https://portal.firecloud.org/?return=terra#methods/scrna/count/5/wdl))to measure the UMI counts per gene in each cell. To be able to run it in parallel, we expected a sample sheet in the integrated [level2 wdl](https://portal.firecloud.org/?return=terra#methods/scrna/level2_cellbender/13). | Sample | Flowcell | Reference | ..| | -------- | -------- | -------- |--------| | sample1 | gs_url | GRCh38_v3.0.0 |..| *Note:* `gs_url` is the google bucket location of a file containing a list of fastq files or a folder. e.g: `gs://de232da/fastq_list.txt` or `gs://de232da/sample1` ``` cat fastq_list.txt gs://hash123testsample1/test1_L001_I1_001.fastq.gz gs://hash123testsample1/test1_L001_R1_001.fastq.gz gs://hash123testsample1/test1_L001_R2_001.fastq.gz gs://hash123testsample1/test1_L002_I1_001.fastq.gz gs://hash123testsample1/test1_L002_R1_001.fastq.gz gs://hash123testsample1/test1_L002_R2_001.fastq.gz gsutil ls gs://de232da/sample1 gs://de232da/sample1/R1.fastq.gz gs://de232da/sample1/R2.fastq.gz ... ``` ### Preprocess and normalize raw count matrix #### Remove ambient RNA For each sample (`h5` from `CellRanger count`), we will use [CellBender](https://github.com/broadinstitute/CellBender) to remove ambient RNA molecules with the expected amount of cells equaling to the estimate from previous `CellRanger count` step. > **Note:** > Due to the function in the [WDL file](https://portal.firecloud.org/?return=terra#methods/cellbender/remove-background/11) developed by the CellBender group cannot be imported into other WDL files, we changed it a little bit to enable the import ([Modified CellBender WDL](https://portal.firecloud.org/?return=terra#methods/scrna/remove-background/4)). #### Filter out low quality cells and doublets This step will at first generates three plots to visualize the qc metrics before and after the filtering. 1. Percentage of Mitochondria reads $\frac{R_{mito}}{R_{all}};R = aligned\ reads$. Cells with the percentage higher than `pct_counts_m` will be filtered out. - Default `pct_counts_m` : 5 3. Number of expressed genes. Only cells with number of expressed genes within the range (`min_genes`,`max_genes`)will be kept. - Default `min_genes`: 500 - Default `max_genes`: 6000 5. Number of UMIs Next, we will identify robust genes in the filtered data and perform log normalization. Based on the normalized data, we will idenfity doublets per sample (`channel`). - Robust Genes: Genes that are expressed in at least 5 percent of cells - Log normalization: 1. For each cell, we normalized it to have the number of counts from robust genes equaling 100,000(TP100K). $(y_1, . . . , y_G) = \alpha \cdot (c_1, . . . , c_G), such\ that\ \sum_{i\in R} \alpha \cdot c_i = 100,000$ We have $G$ genes in total and $R$ represents the set of robust genes. 3. Transform the nomralized count matrix into the nature log space by repalcing the expression value $y$ into $log(y+1)$. ```python= pg.qc_metrics(data, percent_mito=pct_counts_m, min_genes=min_genes, max_genes=max_genes) pg.qcviolin(data, plot_type='mito', dpi=100) plt.savefig(f'{output}_qc_mito.pdf',bbox_inches='tight', dpi=150) pg.qcviolin(data, plot_type='gene', dpi=100) plt.savefig(f'{output}_qc_gene.pdf',bbox_inches='tight', dpi=150) pg.qcviolin(data, plot_type='count', dpi=100) plt.savefig(f'{output}_qc_count.pdf',bbox_inches='tight', dpi=150) pg.filter_data(data) pg.identify_robust_genes(data) pg.log_norm(data) ``` #### Normalize count matrix After having also filtered out doublets from the data, we will idenfity robust gene and perform log normalization again. The final object will be saved as `h5ad` obj. > **Note:** > 1. `pg.write_output(*.h5ad)` stores normalized data into `h5ad` obj if the `pegasus` obj has both raw and normalized `X` matrix. ```python= pg.infer_doublets(data, channel_attr=channel) pg.mark_doublets(data) pg.qc_metrics(data, percent_mito=pct_counts_m, min_genes=min_genes, max_genes=max_genes,select_singlets=True) doublets_stats = data.obs['demux_type'].value_counts().to_frame() pg.filter_data(data) pg.identify_robust_genes(data) pg.log_norm(data) doublets_stats.to_csv(f'{output}_doublets_stats.csv',sep=',') pg.write_output(data, f'{output}_filtered.h5ad') ``` ## Level3 WDL: Clustering and cell type annotation [Reference Level3 WDL](https://portal.firecloud.org/?return=terra#methods/scrna/level3/3) ### InferCNV We integrated `inferCNV` python version ([infercnvpy](https://github.com/icbi-lab/infercnvpy)) into the single cell workflow ([infercnv wdl](https://portal.firecloud.org/?return=terra#methods/scrna/infercnv/5)). Briefly, preprocessed expression matrix (log transformed `X`) will be inputted to the `infercnvpy.tl.infercnv` function to estimate single cel cnv profile (only include chromosome 1-22), while average expression level among all cells will be considered as the reference. Then, cnv profile will be used to cluster cells in the unsupervised manner. Finally, `cnv_score` will be calculated based on their `UMAP` from cnv profile. ```python= import scanpy as sc import infercnvpy as cnv adata = sc.read(adata_path) cnv.io.genomic_position_from_gtf(gtf_file=gtf_path, adata=adata) exclude_chromosomes = pd.Index(adata.var.chromosome.unique()).difference([ 'chr%d' % x for x in range(1, 23)]) cnv.tl.infercnv(adata, exclude_chromosomes=exclude_chromosomes) cnv.tl.pca(adata) cnv.pp.neighbors(adata) cnv.tl.leiden(adata) cnv.tl.umap(adata) cnv.tl.cnv_score(adata) adata.write(output, compression='gzip') ``` #### Batch removal and unsupervised clustering > [Issue] pegasus cumulus: seems that it does not consider group information? We select highly variable genes (HVGs) only from robust genes. First of all, we calculate mean($\bar\mu_g$) and variance($\bar\sigma_g^2$) for each robust gene and fit it into LOESS curve of degree 2 to get estimated $\hat \sigma_g^2$. Then, we will rank genes based on $\bar\sigma_g^2 - \hat \sigma_g^2$ and $\frac{\bar\sigma_g^2}{ \hat \sigma_g^2}$ separately. Lastly, we define the overall ranking as the sum of the two rankings and select top 2000 genes as the HVGs. > Note: > The variant ($\bar\sigma_g^2$) of each gene will be adjusted if we consider batch. Briefly, we can decompose the variance ($\bar\sigma_g^2$) into three components—within-batch variance ($\bar\sigma_{g_1}^2$), between-batch variance ($\bar\sigma_{g_2}^2$) and between-group variance ($\bar\sigma_{g_3}^2$). We remove the variance term ($\bar\sigma_{g_2}^2$) due to batch effects by redefining ($\bar\sigma_g^2$) as: $\bar\sigma_g^2 := \bar\sigma_{g_1}^2 + \bar\sigma_{g_3}^2$ and plug in the new redefined variance term to the previously described procedure to select HVGs. We then: 1. conduct PCA on HVGs 2. perform [Harmony](https://portals.broadinstitute.org/harmony/articles/quickstart.html) algorithm to remove batch on PCs 3. construct a kNN graph on the adjusted PCs 4. apply leiden algorithm with `leiden_resolution (defualt = 2.0)` resolution on the graph to find clusters. 5. visualize clusters by uniform manifold approximation and projection (UMAP) ```python= data.obs['Channel'] = data.obs[channel] pg.highly_variable_features(data, consider_batch=True) pg.pca(data) pca_key = pg.run_harmony(data) pg.neighbors(data, rep=pca_key) pg.leiden(data, rep=pca_key, resolution=leiden_resolution) pg.umap(data, rep=pca_key) ``` ### Distinguish normal cells and malignant cells (Level1 annotation) We perform K-means clustering with 2 centriod on **cnv_score** estimated from the InferCNV section. Then, we assign the cluster with a high centroid (cnv_score) to malignant cells and the one with a low centroid (cnv_score) to the normal cells. ![](https://i.imgur.com/hWBLEyw.png =300x) Next, we calculate the percentage of normal cells ($P_{in}$) and malignant ($P_{im}$) cells in each leiden cluster ($C_i$). Leiden clusters ($L$) are then annotated to the cell type ($Leve1\_Anno_{C_i}$) with the highest fraction of cells. $Leve1\_Anno_{C_i} = Label(max(P_{im},P_{in})) , C_i \in L$ ```python= kmeans = KMeans(n_clusters=2).fit(data.obs[cnv_col].values.reshape(-1, 1)) data.obs['Kmeans_Cluster'] = kmeans.predict( data.obs[cnv_col].values.reshape(-1, 1)) if kmeans.cluster_centers_[0, 0] < kmeans.cluster_centers_[1, 0]: cluster_map = {0: 'Immune Cell', 1: 'Malignant Cell'} else: cluster_map = {1: 'Immune Cell', 0: 'Malignant Cell'} data.obs['Kmeans_Cluster'] = data.obs['Kmeans_Cluster'].map(cluster_map) ``` #### Cell type scoring on normal cells (Level2 annotation) We then perform deferential gene expression analysis on normal cells. Briefly, we consider one cluster at one time, comparing gene expression level2 on cells within the cluster with all the others, and determining up and down regulated genes of the cluster. Next, we load predefined marker genes for each cell type. e.g: ```json= { "title": "Human immune cell markers", "cell_types": [ { "name": "T cell", "markers": [ { "genes": [ "CD3D+", "CD3E+", "CD3G+" ], "weight": 0.75, "comment": "CD3" }, { "genes": [ "TRAC+" ], "weight": 0.25, "comment": "T cell receptor" } ], "subtypes": { "title": "T cell subtype markers", "cell_types": [ { "name": "CD4+ T cell", "markers": [ { "genes": [ "CD4+" ], "weight": 1.0, "comment": "CD4+ T cell" } ] }, .. ] }, ... ] } ``` We use [Pegasus](https://pegasus.readthedocs.io/en/stable/api/pegasus.de_analysis.html#pegasus.de_analysis) to calculate putative score of each cell type $C$ for each cluster and annotate each cluster with the cell type with the highest score. In short, Pegasus recalibrates weights by only considering markers expressed in cells ($M$). ```python= new_w = old_w / len(markers) * len(expressed _markers) ``` For positive marker (with + sign), Pegasus assigns impact value ($V$) to be 0 if the marker is not upregulated. Otherwise, if the fold-change ($FC$) in the percentage of cells expressing this marker (within versus outside the cluster) satisfies $FC \geq 1.5$, it has an impact value of 2. If $FC < 1.5$, it has an impact value of $1+\frac{FC−1}{0.5}$. For a negative marker, if it is upregulated, its impact value is 0. If it is neither upregulated nor downregulated, its impact value is 1. Otherwise, if $\frac{1}{FC} \geq 1.5$, it has an impact value of 2. If $\frac{1}{FC} < 1.5$, it has an impact value of $1+\frac{\frac{1}{FC}−1}{0.5}$. The overall score of a cell type for a cluster is calculated as the ratio between the sum of impact values and the sum of weights multiplied by 2 from all expressed markers. $score_j=\frac{\sum_i^{i\in M_j}\sum_g^{g\in M_{ij}}V_{ijg}\cdot W_{ij}}{2\sum_i^{i\in M_j} W_{ij}}, j\in C$ Note that clusters with the highest cell type score lower than `0.5` will remain unannotated. We will also distinguish clusters with the same cell type annotation by adding the suffix `_[0-9]*`.