# NGS Analysis - NF-Core RNA-SEQ

## Background

Using the nf-core rna-seq pipeline.

![image](https://hackmd.io/_uploads/Hk3g9i526.png)

# STAR Pipeline

## Samplesheet

From tumor data used for the haplotype caller:

```
sample,fastq_1,fastq_2,strandedness
sample_NA18627,/scratch/work/courses/BI7653/hw2.2024/ERR240369_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/ERR240369_2.filt.fastq.gz,auto
sample_HG00149,/scratch/work/courses/BI7653/hw2.2024/ERR156634_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/ERR156634_2.filt.fastq.gz,auto
sample_HG00243,/scratch/work/courses/BI7653/hw2.2024/ERR162846_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/ERR162846_2.filt.fastq.gz,auto
sample_HG00151,/scratch/work/courses/BI7653/hw2.2024/SRR766045_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/SRR766045_2.filt.fastq.gz,auto
```

From older rnaseq workflows (fastq hw8 2023):

```
sample,fastq_1,fastq_2,strandedness
PDAC253,/scratch/kk4764/rnaseq/fastqs/PDAC253_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC253_2PE.fastq.gz,auto
PDAC266,/scratch/kk4764/rnaseq/fastqs/PDAC266_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC266_2PE.fastq.gz,auto
PDAC273,/scratch/kk4764/rnaseq/fastqs/PDAC273_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC273_2PE.fastq.gz,auto
PDAC282,/scratch/kk4764/rnaseq/fastqs/PDAC282_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC282_2PE.fastq.gz,auto
PDAC286,/scratch/kk4764/rnaseq/fastqs/PDAC286_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC286_2PE.fastq.gz,auto
PDAC306,/scratch/kk4764/rnaseq/fastqs/PDAC306_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC306_2PE.fastq.gz,auto
PDAC316,/scratch/kk4764/rnaseq/fastqs/PDAC316_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC316_2PE.fastq.gz,auto
PDAC318,/scratch/kk4764/rnaseq/fastqs/PDAC318_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC318_2PE.fastq.gz,auto
```

## Params YAML

```
params:
  config_profile_description: 'NYU NGS Analysis - SALMON RNA-Seq'
  max_time: 1.d
  skip_trimming: true
  skip_alignment: true
  custom_genomes: "/scratch/kk4764/rnaseq/GRCh38"
  genomes:
    'GRCh38':
      fasta: "${params.custom_genomes}/genome.fa"
      star: "${params.custom_genomes}/STARIndex/"
      gtf: "${params.custom_genomes}/genes.gtf"
      bed12: "${params.custom_genomes}/genes.bed"
cleanup: true
```

```
params {
    config_profile_description = 'NYU NGS Analysis - STAR RNA-Seq'

    // limit resources
    max_memory = 40.GB
    max_cpus = 2
    max_time = 1.d

    // main options
    aligner = 'star_rsem'
    skip_trimming = true
    skip_pseudo_alignment = true
    //skip_alignment // parameter used to run salmon in isolation

    // path for genomes
    custom_genomes = "/scratch/kk4764/rnaseq/GRCh38"
    genomes {
        'GRCh38' {
            fasta = "${params.custom_genomes}/genome.fa"
            star = "${params.custom_genomes}/STARIndex/"
            gtf = "${params.custom_genomes}/genes.gtf"
            bed12 = "${params.custom_genomes}/genes.bed"
        }
    }
}

// remove intermediate files
cleanup = true
```

## Command

```
module load nextflow/23.04.1

nextflow run nf-core/rnaseq \
    --input star_samplesheet.csv \
    --outdir res \
    --genome GRCh38 \
    -profile nyu_hpc \
    -params-file star_rnaseq.yaml
```

# Results

## Notes:

1. The default run uses `star_salmon`.
    * Even with pseudo-alignment skipped, the pipeline still runs Salmon quantification for QC, because Salmon Quant is used to infer strandedness:

    > If you set the strandedness value to auto the pipeline will sub-sample the input FastQ files to 1 million reads, use Salmon Quant to infer the strandedness automatically and then propagate this information to the remainder of the pipeline.

2. Use `star_rsem` to run without Salmon.
    * Update 1: errors out at RSEM.
3. Using pre-downloaded files and no memory/CPU limits, the run took 1.5 hours.

## From "older" test data

### DESeq2 Plots

![image](https://hackmd.io/_uploads/H1euHRn2T.png)

![image](https://hackmd.io/_uploads/SJPFS03nT.png)

![image](https://hackmd.io/_uploads/ryO9SC33T.png)

### Table

| sample  | PC1: 46% variance  | PC2: 18% variance  |
|---------|--------------------|--------------------|
| PDAC253 | -1.73213122108297  | 5.48930476722174   |
| PDAC266 | -1.5766202211133   | -0.636814297658562 |
| PDAC273 | -1.33553536881398  | -2.35334579678229  |
| PDAC306 | -1.2963299556117   | -0.170925978562663 |
| PDAC318 | -1.24587345304559  | -0.139170943017021 |
| PDAC286 | -1.12688468668933  | -1.52028323916075  |
| PDAC282 | -0.964848258704149 | -0.923356049081707 |
| PDAC316 | 9.27822316506105   | 0.254591537041304  |

### Comparing with my old assessment (2022)

Seems right. I assume the data has changed between 2022 and 2023.

![image](https://hackmd.io/_uploads/BJtSF0n2T.png)

# Salmon-only Pipeline

<img src="https://hackmd.io/_uploads/H10MzZJ6p.png" alt="image" style="width:50%;">

## Notes

* Runtime: 1h 3m (no resource limits)
* Uses the same samplesheet as the STAR pipeline
* Need to set `--pseudo_aligner salmon` and `--skip_alignment` to skip STAR [[Alignment Options](https://nf-co.re/rnaseq/3.14.0/docs/usage#alignment-options)]

## Config

```
params {
    config_profile_description = 'NYU NGS Analysis - SALMON RNA-Seq'

    // limit resources
    //max_memory = 40.GB
    //max_cpus = 2
    max_time = 1.d

    // main options
    skip_trimming = true
    skip_alignment = true
    pseudo_aligner = 'salmon'

    // path for genomes
    custom_genomes = "/scratch/kk4764/rnaseq/GRCh38"
    genomes {
        'GRCh38' {
            fasta = "${params.custom_genomes}/genome.fa"
            star = "${params.custom_genomes}/STARIndex/"
            gtf = "${params.custom_genomes}/genes.gtf"
            bed12 = "${params.custom_genomes}/genes.bed"
        }
    }
}

// remove intermediate files
cleanup = true
```

## Command

```
nextflow run nf-core/rnaseq \
    --input star_samplesheet.csv \
    --outdir salmon_res \
    --genome GRCh38 \
    -profile nyu_hpc \
    -c salmon_rnaseq.config
```

## Results

The groups clearly cluster differently, but the samples still sit loosely together, similar to the STAR alignment.

#### Plots from res > deseq2

![image](https://hackmd.io/_uploads/HkwSWKypT.png)

![image](https://hackmd.io/_uploads/ryOIWF1pa.png)
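
## Appendix: building the samplesheet (sketch)

Both the STAR and Salmon runs read the same four-column samplesheet (`sample,fastq_1,fastq_2,strandedness`). Below is a minimal sketch of how that file could be generated instead of typed by hand. It is not part of nf-core/rnaseq: the function name `build_samplesheet` and the regular expression are hypothetical. It assumes mates are named `<sample>_1PE.fastq.gz` and `<sample>_2PE.fastq.gz`, as in the PDAC samplesheet, so other naming schemes (for example `_1.filt.fastq.gz`) would need a different pattern.

```python
import csv
import io
import re
from pathlib import PurePosixPath

# Assumption: mates are named <sample>_1PE.fastq.gz / <sample>_2PE.fastq.gz,
# as in the PDAC samplesheet. Other naming schemes need a different pattern.
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])PE\.fastq\.gz$")


def build_samplesheet(fastq_paths, strandedness="auto"):
    """Return samplesheet CSV text for a list of paired-end FASTQ paths."""
    pairs = {}
    for path in fastq_paths:
        match = MATE_PATTERN.match(PurePosixPath(path).name)
        if match is None:
            continue  # skip files that do not follow the assumed naming
        pairs.setdefault(match["sample"], {})[match["mate"]] = path

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for sample in sorted(pairs):
        mates = pairs[sample]
        if set(mates) != {"1", "2"}:
            # Refuse to emit a row for an incomplete pair.
            raise ValueError(f"{sample}: expected both mates, found {sorted(mates)}")
        writer.writerow([sample, mates["1"], mates["2"], strandedness])
    return out.getvalue()


if __name__ == "__main__":
    demo_paths = [
        "/scratch/kk4764/rnaseq/fastqs/PDAC253_1PE.fastq.gz",
        "/scratch/kk4764/rnaseq/fastqs/PDAC253_2PE.fastq.gz",
    ]
    print(build_samplesheet(demo_paths), end="")
```

The function takes plain path strings rather than scanning a directory, so it can be fed from `glob` output on the cluster or checked offline. Rows are sorted by sample name to keep the file stable between runs.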