# NGS Analysis - NF-Core RNA-SEQ
## Background
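Notes on running the nf-core/rnaseq pipeline on the NYU HPC in two configurations: STAR alignment and Salmon-only pseudo-alignment.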

# STAR Pipeline
## Samplesheet
From the tumor data previously used for HaplotypeCaller:
```
sample,fastq_1,fastq_2,strandedness
sample_NA18627,/scratch/work/courses/BI7653/hw2.2024/ERR240369_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/ERR240369_2.filt.fastq.gz,auto
sample_HG00149,/scratch/work/courses/BI7653/hw2.2024/ERR156634_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/ERR156634_2.filt.fastq.gz,auto
sample_HG00243,/scratch/work/courses/BI7653/hw2.2024/ERR162846_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/ERR162846_2.filt.fastq.gz,auto
sample_HG00151,/scratch/work/courses/BI7653/hw2.2024/SRR766045_1.filt.fastq.gz,/scratch/work/courses/BI7653/hw2.2024/SRR766045_2.filt.fastq.gz,auto
```
From older RNA-seq workflows (FASTQ files from hw8, 2023):
```
sample,fastq_1,fastq_2,strandedness
PDAC253,/scratch/kk4764/rnaseq/fastqs/PDAC253_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC253_2PE.fastq.gz,auto
PDAC266,/scratch/kk4764/rnaseq/fastqs/PDAC266_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC266_2PE.fastq.gz,auto
PDAC273,/scratch/kk4764/rnaseq/fastqs/PDAC273_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC273_2PE.fastq.gz,auto
PDAC282,/scratch/kk4764/rnaseq/fastqs/PDAC282_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC282_2PE.fastq.gz,auto
PDAC286,/scratch/kk4764/rnaseq/fastqs/PDAC286_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC286_2PE.fastq.gz,auto
PDAC306,/scratch/kk4764/rnaseq/fastqs/PDAC306_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC306_2PE.fastq.gz,auto
PDAC316,/scratch/kk4764/rnaseq/fastqs/PDAC316_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC316_2PE.fastq.gz,auto
PDAC318,/scratch/kk4764/rnaseq/fastqs/PDAC318_1PE.fastq.gz,/scratch/kk4764/rnaseq/fastqs/PDAC318_2PE.fastq.gz,auto
```
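A samplesheet like the second one above can be generated rather than typed by hand. The following is a minimal sketch, assuming paired files are named `<sample>_1PE.fastq.gz` and `<sample>_2PE.fastq.gz` as in the PDAC rows; `make_samplesheet` is a hypothetical helper name, not part of nf-core.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an nf-core/rnaseq samplesheet for paired-end FASTQs.
# Assumes files are named <sample><r1_suffix> and <sample><r2_suffix>.
make_samplesheet() {
    local fastq_dir="$1"
    local r1_suffix="${2:-_1PE.fastq.gz}"
    local r2_suffix="${3:-_2PE.fastq.gz}"
    local r1 r2 sample

    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$fastq_dir"/*"$r1_suffix"; do
        # Skip the unexpanded glob when no files match.
        [ -e "$r1" ] || continue
        r2="${r1%"$r1_suffix"}${r2_suffix}"
        sample="$(basename "$r1" "$r1_suffix")"
        if [ ! -e "$r2" ]; then
            echo "warning: no mate found for $r1, skipping" >&2
            continue
        fi
        # Strandedness is left as auto, matching the samplesheets above.
        echo "${sample},${r1},${r2},auto"
    done
}
```

Usage would look like `make_samplesheet /scratch/kk4764/rnaseq/fastqs > star_samplesheet.csv`. The first samplesheet cannot be produced this way because its sample names do not match the FASTQ file names.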
## Params YAML
```
params:
config_profile_description: 'NYU NGS Analysis - SALMON RNA-Seq'
max_time: 1.d
skip_trimming: true
skip_alignment: true
custom_genomes: "/scratch/kk4764/rnaseq/GRCh38"
genomes:
'GRCh38':
fasta: "${params.custom_genomes}/genome.fa"
star: "${params.custom_genomes}/STARIndex/"
gtf: "${params.custom_genomes}/genes.gtf"
bed12: "${params.custom_genomes}/genes.bed"
cleanup: true
```
```
params {
    config_profile_description = 'NYU NGS Analysis - STAR RNA-Seq'

    // limit resources
    max_memory = 40.GB
    max_cpus   = 2
    max_time   = 1.d

    // main options
    aligner               = 'star_rsem'
    skip_trimming         = true
    skip_pseudo_alignment = true
    //skip_alignment // parameter used to run salmon in isolation

    // path for genomes
    custom_genomes = "/scratch/kk4764/rnaseq/GRCh38"
    genomes {
        'GRCh38' {
            fasta = "${params.custom_genomes}/genome.fa"
            star  = "${params.custom_genomes}/STARIndex/"
            gtf   = "${params.custom_genomes}/genes.gtf"
            bed12 = "${params.custom_genomes}/genes.bed"
        }
    }
}

// remove intermediate files
cleanup = true
```
## Command
```
module load nextflow/23.04.1
nextflow run nf-core/rnaseq \
--input star_samplesheet.csv \
--outdir res \
--genome GRCh38 \
-profile nyu_hpc \
-params-file star_rnaseq.yaml
```
# Results
## Notes
1. By default the pipeline runs `star_salmon`.
    * Even with `skip_pseudo_alignment` set, Salmon quantification still runs for QC, because Salmon Quant is used to infer strandedness:
    > If you set the strandedness value to auto the pipeline will sub-sample the input FastQ files to 1 million reads, use Salmon Quant to infer the strandedness automatically and then propagate this information to the remainder of the pipeline.
2. Using `star_rsem` to run without Salmon.
    * Update 1: errors out at RSEM.
3. Using pre-downloaded files and no memory/CPU limits, the run took 1.5 hours.
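Since the Salmon step is triggered by `auto` in the strandedness column, one option is to state the strandedness explicitly when it is already known. Below is a rough sketch, assuming the four values the samplesheet accepts are `auto`, `forward`, `reverse`, and `unstranded`. `set_strandedness` is a hypothetical helper, and I have not verified that this removes every Salmon step in a `star_rsem` run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the strandedness column (4th field) of a samplesheet.
# Prints the modified samplesheet to stdout; the input file is left unchanged.
set_strandedness() {
    local samplesheet="$1"
    local value="$2"

    # Reject anything outside the values assumed to be valid for nf-core/rnaseq.
    case "$value" in
        auto|forward|reverse|unstranded) ;;
        *) echo "error: unsupported strandedness '$value'" >&2; return 1 ;;
    esac

    # Keep the header line as-is; replace field 4 on every data row.
    awk -F',' -v OFS=',' -v s="$value" \
        'NR == 1 { print; next } { $4 = s; print }' "$samplesheet"
}
```

For example, `set_strandedness star_samplesheet.csv reverse > star_samplesheet.stranded.csv`. Here `reverse` is only a placeholder; the actual strandedness of these libraries is not recorded in these notes.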
## From "older" test data
### DESeq2 Plots



### Table
| sample | PC1: 46% variance | PC2: 18% variance |
|---------|---------------------|---------------------|
| PDAC253 | -1.73213122108297 | 5.48930476722174 |
| PDAC266 | -1.5766202211133 | -0.636814297658562 |
| PDAC273 | -1.33553536881398 | -2.35334579678229 |
| PDAC306 | -1.2963299556117 | -0.170925978562663 |
| PDAC318 | -1.24587345304559 | -0.139170943017021 |
| PDAC286 | -1.12688468668933 | -1.52028323916075 |
| PDAC282 | -0.964848258704149 | -0.923356049081707 |
| PDAC316 | 9.27822316506105 | 0.254591537041304 |
### Comparing with my old assessment (2022)
Seems right. I assume the data has changed between 2022 and 2023.

# Salmon-only Pipeline
<img src="https://hackmd.io/_uploads/H10MzZJ6p.png" alt="image" style="width:50%;">
## Notes
* Runtime: 1h 3m (no resource limits)
* Uses the same samplesheet as the STAR pipeline
* Need to set `--pseudo_aligner salmon` and `--skip_alignment` to skip STAR [[Alignment Options](https://nf-co.re/rnaseq/3.14.0/docs/usage#alignment-options)]
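The two options can also be passed on the command line instead of through the config. The sketch below assembles the command as a bash array and only prints it (a dry run), so the flags can be checked before launching. It reuses the file names from these notes; passing the flags on the CLI alongside `-c salmon_rnaseq.config` is my assumption about how one might combine them, with the config still supplying the custom genome paths.

```shell
#!/usr/bin/env bash
# Assemble the Salmon-only command as an array so it can be inspected before running.
salmon_cmd=(
    nextflow run nf-core/rnaseq
    --input star_samplesheet.csv
    --outdir salmon_res
    --genome GRCh38
    --pseudo_aligner salmon       # pipeline parameter: double dash
    --skip_alignment              # pipeline parameter: skips STAR
    -profile nyu_hpc              # Nextflow option: single dash
    -c salmon_rnaseq.config       # still supplies the custom genome paths
)

# Dry run: print the command instead of executing it.
printf '%s ' "${salmon_cmd[@]}"
echo

# To launch for real, replace the printf/echo pair with: "${salmon_cmd[@]}"
```

Pipeline parameters take two dashes (`--skip_alignment`), while options for Nextflow itself take one (`-profile`, `-c`).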
## Config
```
params {
    config_profile_description = 'NYU NGS Analysis - SALMON RNA-Seq'

    // limit resources
    //max_memory = 40.GB
    //max_cpus = 2
    max_time = 1.d

    // main options
    skip_trimming  = true
    skip_alignment = true
    pseudo_aligner = 'salmon'

    // path for genomes
    custom_genomes = "/scratch/kk4764/rnaseq/GRCh38"
    genomes {
        'GRCh38' {
            fasta = "${params.custom_genomes}/genome.fa"
            star  = "${params.custom_genomes}/STARIndex/"
            gtf   = "${params.custom_genomes}/genes.gtf"
            bed12 = "${params.custom_genomes}/genes.bed"
        }
    }
}

// remove intermediate files
cleanup = true
```
## Command
```
nextflow run nf-core/rnaseq \
--input star_samplesheet.csv \
--outdir salmon_res \
--genome GRCh38 \
-profile nyu_hpc \
-c salmon_rnaseq.config
```
## Results
The groups are clearly separated from one another but still cluster loosely together, similar to what the STAR alignment showed.
### Plots from res > deseq2

