scRNA-seq Data analysis 2023

--- tags: Reports --- # scRNA-seq Data analysis 2023 - Francis Hopkins and Marie Larsson *********************** - [System Update](#System-Update) - [CellRanger](#CellRanger) - [bcl2fastq](#bcl2fastq) - [De-multiplexing](#De-multiplexing) - [Run CellRanger Multi](#Run-CellRanger-Multi) - [CellRanger Multi Results](#CellRanger-Multi-Results) - [Run CellRanger Aggr](#Run-CellRanger-Aggr) - [CellRanger Aggr Results](#CellRanger-Aggr-Results) - [Other resources](#Other-Resources) *********************** ## System Update ### CellRanger CellRanger was updated to version [7.2.0](https://www.10xgenomics.com/support/software/cell-ranger/downloads 'download') with module: ```bash #%Module1.0#################################################################### proc ModulesHelp { } { global _module_name puts stderr "\tThis module sets up the environment for Cell Ranger, version 6.1.1" } set _module_name [module-info name] set approot /opt/sw/bioinfo-tools/sources/cellranger-7.2.0 prepend-path PATH $approot #prepend-path LD_LIBRARY_PATH $approot/lib64:$approot/lib64/R/lib #prepend-path MANPATH $approot/share/man ``` In order to test the pipeline, run the following command: ```bash cellranger testrun --id=tiny ``` We obtained the following log: ```bash Martian Runtime - v4.0.11 2023-10-16 13:54:33 [jobmngr] WARNING: configured to use 453GB of local memory, but only 441.7GB is currently available. Serving UI at http://z6g4:47451?auth=f5M__k0Jdv-2DnTLn_HzxFyPKLOdyFSWhHTgf-H1myc Running preflight checks (please wait)... Checking sample info... Checking FASTQ folder... Checking reference... Checking reference_path (/opt/sw/bioinfo-tools/sources/cellranger-7.2.0/external/cellranger_tiny_ref) on z6g4... Checking optional arguments... mro: v4.0.11 mrp: v4.0.11 Anaconda: Python 3.10.11 numpy: 1.24.3 scipy: 1.10.1 pysam: 0.21.0 h5py: 3.6.0 pandas: 1.5.3 STAR: 2.7.2a samtools: samtools 1.16.1 Outputs: - Run summary HTML: /mnt/WD2/SC_ML/tiny/outs/web_summary.html - Run summary CSV: /mnt/WD2/SC_ML/tiny/outs/metrics_summary.csv - BAM: /mnt/WD2/SC_ML/tiny/outs/possorted_genome_bam.bam - BAM BAI index: /mnt/WD2/SC_ML/tiny/outs/possorted_genome_bam.bam.bai - BAM CSI index: null - Filtered feature-barcode matrices MEX: /mnt/WD2/SC_ML/tiny/outs/filtered_feature_bc_matrix - Filtered feature-barcode matrices HDF5: /mnt/WD2/SC_ML/tiny/outs/filtered_feature_bc_matrix.h5 - Unfiltered feature-barcode matrices MEX: /mnt/WD2/SC_ML/tiny/outs/raw_feature_bc_matrix - Unfiltered feature-barcode matrices HDF5: /mnt/WD2/SC_ML/tiny/outs/raw_feature_bc_matrix.h5 - Secondary analysis output CSV: /mnt/WD2/SC_ML/tiny/outs/analysis - Per-molecule read information: /mnt/WD2/SC_ML/tiny/outs/molecule_info.h5 - CRISPR-specific analysis: null - Antibody aggregate barcodes: null - Loupe Browser file: /mnt/WD2/SC_ML/tiny/outs/cloupe.cloupe - Feature Reference: null - Target Panel File: null - Probe Set File: null Waiting 6 seconds for UI to do final refresh. Pipestance completed successfully! 2023-10-16 13:56:21 Shutting down. Saving pipestance info to "tiny/tiny.mri.tgz" ``` The file named *websummary.html* stores a graphical overview of the analysis results. ### bcl2fastq bcl2fastq was updated to version 2.20.0.422: ```bash mamba create -c bih-cubi -n bcl2fastq2 bcl2fastq2 ``` with module: ```bash #%Module1.0#################################################################### proc ModulesHelp { } { global _module_name puts stderr "\tThis module sets up the environment for bcl2fastq, version 2.20.0.422" } set _module_name [module-info name] set approot /opt/sw/bioinfo-tools/sources/anaconda3/envs/bcl2fastq2/bin/ prepend-path PATH $approot #prepend-path LD_LIBRARY_PATH $approot/lib64:$approot/lib64/R/lib #prepend-path MANPATH $approot/share/man ``` ## De-multiplexing This step is normally performed starting from bcl files using cellranger mkfastq or bcl2fastq: 1. [mkfastq](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq 'mkfastq'). 2. [bcl2fastq](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/bcl2fastq-direct 'bcl2fastq'). The CSV file was converted to the reverse-complemented indexes (i5) obtaining the final file renamed as *SampleSheet_v2_template_FH_fixed_rc.csv*: ```bash= [Header] FileFormatVersion,2 RunName,230825_colon_24 [Reads] Read1Cycles,28 Read2Cycles,90 Index1Cycles,10 Index2Cycles,10 [Data] Sample_ID,index,index2 S01-M-24-GEX,GTAACATGCG,AGGTAACACT S02-F-24-GEX,GTGGATCAAA,CAGGGTTGGC S03-C-24-GEX,CACTACGAAA,ATCAGTCTAA S04-M-24-CMX,GCCTTGTCAA,TGATTGGATC S05-F-24-CMX,AACTTCTTTG,TACCGCTTAT S06-C-24-CMX,TGGTTATAGA,TTATGAGCTC ``` CellRanger *mkfastq* was then run as: ```bash cellranger mkfastq --id=SI-TT-A3b_ATCAGTCTAA --run=/mnt/WD2/SC_ML/230825_VH01167_15_AAC77JWHV --samplesheet=/mnt/WD2/SC_ML/230825_VH01167_15_AAC77JWHV/SampleSheet_v2_template_FH_fixed_rc.csv ``` it generated fastq files from both lanes plus I1 and I2 fastq ([index files](https://www.biostars.org/p/374581/ 'Merge fastq files')). We merge the fastq files for forward and reverse coming from different lanes for the same sample: ```bash= cat S01-M-24-GEX_S1_L001_R1_001.fastq.gz S01-M-24-GEX_S1_L002_R1_001.fastq.gz > S01-M-24-GEX_S1_R1_001.fastq.gz cat S01-M-24-GEX_S1_L001_R2_001.fastq.gz S01-M-24-GEX_S1_L002_R2_001.fastq.gz > S01-M-24-GEX_S1_R2_001.fastq.gz cat S02-F-24-GEX_S2_L001_R1_001.fastq.gz S02-F-24-GEX_S2_L002_R1_001.fastq.gz > S02-F-24-GEX_S1_R1_001.fastq.gz cat S02-F-24-GEX_S2_L001_R2_001.fastq.gz S02-F-24-GEX_S2_L002_R2_001.fastq.gz > S02-F-24-GEX_S1_R2_001.fastq.gz cat S03-C-24-GEX_S3_L001_R1_001.fastq.gz S03-C-24-GEX_S3_L002_R1_001.fastq.gz > S03-C-24-GEX_S1_R1_001.fastq.gz cat S03-C-24-GEX_S3_L001_R2_001.fastq.gz S03-C-24-GEX_S3_L002_R2_001.fastq.gz > S03-C-24-GEX_S1_R2_001.fastq.gz cat S04-M-24-CMX_S4_L001_R1_001.fastq.gz S04-M-24-CMX_S4_L002_R1_001.fastq.gz > S04-M-24-CMX_S4_R1_001.fastq.gz cat S04-M-24-CMX_S4_L001_R2_001.fastq.gz S04-M-24-CMX_S4_L002_R2_001.fastq.gz > S04-M-24-CMX_S4_R2_001.fastq.gz cat S05-F-24-CMX_S5_L001_R1_001.fastq.gz S05-F-24-CMX_S5_L002_R1_001.fastq.gz > S05-F-24-CMX_S5_R1_001.fastq.gz cat S05-F-24-CMX_S5_L001_R2_001.fastq.gz S05-F-24-CMX_S5_L002_R2_001.fastq.gz > S05-F-24-CMX_S5_R2_001.fastq.gz cat S06-C-24-CMX_S6_L001_R1_001.fastq.gz S06-C-24-CMX_S6_L002_R1_001.fastq.gz > S06-C-24-CMX_S6_R1_001.fastq.gz cat S06-C-24-CMX_S6_L001_R2_001.fastq.gz S06-C-24-CMX_S6_L002_R2_001.fastq.gz > S06-C-24-CMX_S6_R2_001.fastq.gz ``` At this stage we found that an high amount of reads went into the undetermined fastq files. The following table shows the top three unknown barcodes eating up most of the reads: |Sequence| Lane 1 Count | Lane 2 Count | |---|---|---| | GGGGGGGGGG+AGATCTCGGT | 318,228,900 | 257,874,580 | | GGGGGGGGGG+CAGGGTTGGC | 1,134,240 | 3,693,200 | | GGGGGGGGGG+AGGTAACACT | 839,900 | 1,418,180 | We tried a Blast alignment to check if it was due to rRNA enrichment but the sequence is pretty short and couldn't find any match even with relaxed parameters. The lab is investigating how this could have happened. ## Run CellRanger Multi CellRanger needs [fastq](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input 'fastq') files usually coming from mkfastq. There are different algorithms within CellRanger and the appropriate algorithm must be picked based on the chemistry as reported [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger 'CellRanger count/multi'). Click for more info about [Targeted GEX](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-targeted 'Targeted GEX') and [Cell Multiplexing (CellPlex)](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cellplex 'CellPlex'). For this specific experiment, the [Cell Multiplexing Kit](https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.10xgenomics.com%2Fsupport%2Fsingle-cell-gene-expression%2Fdocumentation%2Fsteps%2Fsample-prep%2Fchromium-next-gem-single-cell-3-v-3-1-cell-multiplexing&data=05%7C01%7Cmassimiliano.volpe%40liu.se%7C23198af79bd04320a6af08dbcfef2a67%7C913f18ec7f264c5fa816784fe9a58edd%7C0%7C0%7C638332398329630275%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uqRtvCKUjWlnTHQ5IYFNA4yEXEv%2BUH%2BbP32MGaoaes4%3D&reserved=0 'Technical Note') has been used. More resources about Cell Multiplexing [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cellplex 'CellPlex') and how to run within CellRanger [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/multi 'Multi'). The tech note for the Dual Indexes Kit is available [here](https://www.10xgenomics.com/support/single-cell-gene-expression/documentation/steps/library-prep/chromium-single-cell-3-reagent-kits-user-guide-v-3-1-chemistry-dual-index 'Dual Index'). NOTE: [17/10/2023] The bioinformatics analysis is now in standby since I lack the proper informataion to run *cellranger multi* in oreder to make the count matrix. UPDATE: [19/10/2023] Eventually we found that what distinguishes the different sampples is the Cell Multiplexing Oligo (CMO) that will be read by *cellranger multi*. We are waiting for Francis to find the correct CMO assignment to his samples. UPDATE: [23/10/2023] Marie and Francis decided to go ahead with the analysis. To run *cellranger multi* I prepared three different csv files for separate analyses that will be lately merged with *cellranger aggr*: **M-24.csv** ```bash= [gene-expression] reference,/mnt/WD2/BIN21P006_ML/refdata-gex-GRCh38-2020-A [libraries] fastq_id,fastqs,lanes,physical_library_id,feature_types,subsample_rate S01-M-24-GEX,/mnt/WD2/SC_ML/SI-TT-A3b_ATCAGTCTAA/outs/fastq_path/fastq/S01-M-24-GEX,any,S01-M-24-GEX,gene expression, S04-M-24-CMX,/mnt/WD2/SC_ML/SI-TT-A3b_ATCAGTCTAA/outs/fastq_path/fastq/S04-M-24-CMX,any,S04-M-24-CMX,Multiplexing Capture, [samples] sample_id,cmo_ids,description M1,CMO301,M1 M2,CMO302,M2 M3,CMO303,M3 M4,CMO304,M4 M5,CMO305,M5 ``` **F-24.csv** ```bash= [gene-expression] reference,/mnt/WD2/BIN21P006_ML/refdata-gex-GRCh38-2020-A [libraries] fastq_id,fastqs,lanes,physical_library_id,feature_types,subsample_rate S02-F-24-GEX,/mnt/WD2/SC_ML/SI-TT-A3b_ATCAGTCTAA/outs/fastq_path/fastq/S02-F-24-GEX,any,S02-F-24-GEX,gene expression, S05-F-24-CMX,/mnt/WD2/SC_ML/SI-TT-A3b_ATCAGTCTAA/outs/fastq_path/fastq/S05-F-24-CMX,any,S05-F-24-CMX,Multiplexing Capture, [samples] sample_id,cmo_ids,description F1,CMO301,F1 F2,CMO302,F2 F3,CMO303,F3 F4,CMO304,F4 F5,CMO305,F5 ``` **C-24.csv** ```bash= [gene-expression] reference,/mnt/WD2/BIN21P006_ML/refdata-gex-GRCh38-2020-A [libraries] fastq_id,fastqs,lanes,physical_library_id,feature_types,subsample_rate S03-C-24-GEX,/mnt/WD2/SC_ML/SI-TT-A3b_ATCAGTCTAA/outs/fastq_path/fastq/S03-C-24-GEX,any,S03-C-24-GEX,gene expression, S06-C-24-CMX,/mnt/WD2/SC_ML/SI-TT-A3b_ATCAGTCTAA/outs/fastq_path/fastq/S06-C-24-CMX,any,S06-C-24-CMX,Multiplexing Capture, [samples] sample_id,cmo_ids,description C1,CMO301,C1 C2,CMO302,C2 C3,CMO303,C3 C4,CMO304,C4 C5,CMO305,C5 ``` For these analyses I ran three commands generating three different folders: ```bash= cellranger multi --id=M-24_results --csv=./M-24.csv cellranger multi --id=F-24_results --csv=./F-24.csv cellranger multi --id=C-24_results --csv=./C-24.csv ``` I double checked what I was doing with the 10x support and they confirmed my strategy. ### CellRanger Multi Results The cellranger multi run unfortunately resulted into a lot of unassigned cells. We got almost nothing for pretty much any of the samples. Here is a summary of the results: |Sample |Cells| Median reads per cell| Median genes per cell| Total genes detected| Median UMI counts per cell| |---|---|---|---|---|---| |M1 |22 |15,531 |520 |7,555 |1,333| |M2 |15 |30,779 |902 |6,016 |3,013| |M3 |149| 31,879 |720 |14,555 |3,325| |M4 |508| 33,742 |680 |18,038 |3,350| |M5 |0 |0 |0 |0 |0| |F1 |0 |0 |0 |0 |0| |F2 |0 |0 |0 |0 |0| |F3 |0 |0 |0 |0 |0| |F4 |0 |0 |0 |0 |0| |F5 |0 |0 |0 |0 |0| |C1 |364| 23,063 |947 |18,445 |4,605| |C2 |556| 25,060 |1,185 |19,434 |5,037| |C3 |133| 29,352 |1,111 |14,837 |6,141| |C4 |370| 27,386 |1,159 |19,434 |5,537| |C5 |0 |0 |0 |0 |0| NOTE: [30/10/2023] I have opened a ticket with the 10x support and they think there could be a library prep issue. Theu asked me for the log files from the de-multiplexing and cellranger multi steps and they are looking into them. UPDATE: [31/10/2023] The 10x support confirmed the issue with the library prep. ## Run CellRanger Aggr To aggregate the outputs from multiple runs of *cellranger count* or *cellranger multi* we need [*cellranger aggr*](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/aggregate 'CellRanger aggr'). After *cellranger multi*, I tried to aggregate the results anyway by running `cellranger aggr` with the following command: ```bash cellranger aggr --id=aggr_normOFF --csv=aggr.csv --normalize=none ``` By using parameter `--normalize=none`, we have decided to do not perform the [normalization of reads between libraries](https://kb.10xgenomics.com/hc/en-us/articles/360004559691--Alert-Low-Post-Normalization-Read-Depth 'cellranger aggr subsampling') within cellranger because it leads to a massive loss of reads due to [subsampling](https://kb.10xgenomics.com/hc/en-us/articles/115004217543-How-does-cellranger-aggr-normalize-for-sequencing-depth-among-multiple-libraries- 'cellranger aggr subsampling'). Instead, [normalizing within Seurat](https://github.com/satijalab/seurat/issues/672 'Cell Ranger “Aggr” read-depth normalization vs. Seurat NormalizeData') seems a better alternative (we will ask if this is the case with ScarfWeb too) since it does not sacrifice any data. More information about how to run `cellranger aggr` can be found [here](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/aggregate 'cellranger aggr'). More info about how to identify 10x library sources in the aggr matrix [here](https://kb.10xgenomics.com/hc/en-us/articles/360004524851-How-can-I-identify-the-10x-library-sources-when-analyzing-an-aggr-matrix- 'How can I identify the 10x library sources when analyzing an aggr matrix? '). ### CellRanger Aggr Results The aggregation can not be performed with empty samples, as we got the following error: ```bash Martian Runtime - v4.0.11 2023-10-31 11:25:15 [jobmngr] WARNING: configured to use 453GB of local memory, but only 447.3GB is currently available. Serving UI at http://z6g4:35219?auth=bAo7rRfYsKhFxWOXI8SgOP6HgEUbOnNvc9VsugOjxTQ Running preflight checks (please wait)... 2023-10-31 11:25:15 [runtime] (ready) ID.aggr_normOFF.SC_RNA_AGGREGATOR_CS.PARSE_AGGR_CSV 2023-10-31 11:25:15 [runtime] (run:local) ID.aggr_normOFF.SC_RNA_AGGREGATOR_CS.PARSE_AGGR_CSV.fork0.chnk0.main 2023-10-31 11:25:16 [runtime] (chunks_complete) ID.aggr_normOFF.SC_RNA_AGGREGATOR_CS.PARSE_AGGR_CSV [error] Cannot aggregate file because it contains no data: /mnt/WD2/SC_ML/2_cellranger_multi/M-24_results/outs/per_sample_outs/M5/count/sample_molecule_info.h5. Please remove this file from the aggregation and try again. 2023-10-31 11:25:21 Shutting down. Saving pipestance info to "aggr_normOFF/aggr_normOFF.mri.tgz" For assistance, upload this file to 10x Genomics by running: cellranger upload <your_email> "aggr_normOFF/aggr_normOFF.mri.tgz" ``` Since there is a library prep issue, as suggested by the 10x support, the aggregation cannot be performed and so the matrix cannot be obtained. For this reason I have to stop the analysis here. ## Other Resources 1. [10x Glossary](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/glossary#:~:text=Cell%20Multiplexing%3A%20The%20labeling%20of,in%20a%20single%20GEM%20well. 'Glossary') 2. [Galaxy scRNA-seq pipeline](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/scrna-preprocessing-tenx/tutorial.html 'Galaxy scRNA-seq pipeline') 3. [Old scRNA-seq pipeline](https://ppapasaikas.github.io/BC2_SingleCell/ 'Old scRNA-seq pipeline')