# NGS Analysis: Mutect2
**Purpose:** Adapt the data from the GATK Mutect2 tutorial [here](https://gatk.broadinstitute.org/hc/en-us/articles/360047232772--Notebook-Intro-to-using-Mutect2-for-somatic-data) so it can be run with nf-core/Sarek.
## Data
The tutorial data is stored in the GATK Google Cloud bucket [here](https://console.cloud.google.com/storage/browser/gatk-tutorials/workshop_2002/3-somatic;tab=objects?project=broad-dsde-outreach&organizationId=548622027621&pli=1&prefix=&forceOnObjectsSortingFiltering=false). Install the software needed to download it in your own conda environment (optional). The files used here are:
* Reference: `ref/Homo_sapiens_assembly38.fasta`
* Tumor: `bams/tumor.bam`
* Normal: `bams/normal.bam`
* Panel of normals (PoN): `resources/chr17_m2pon.vcf.gz`
### Setting up Conda Environment (OPTIONAL)
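1. Create an empty environment.
`conda create -n <your_env_name>`
2. Activate the environment by name (or by its full path, e.g. `/location/<your_env_name>`, if it lives outside the default `envs` directory).
`conda activate <your_env_name>`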
### Downloading the Data
1. Activate the conda environment (optional). [Setup Instructions](#Setting-up-Conda-Environment)
2. Request a compute node with `srun`.
3. Install `gsutil`, the Google Cloud Storage command-line tool used in the next step.
```
pip install gsutil
```
4. Download the data.
```
gsutil -m cp -r gs://gatk-tutorials/workshop_2002/3-somatic/bams . &&
gsutil -m cp -r gs://gatk-tutorials/workshop_2002/3-somatic/ref . &&
gsutil -m cp -r gs://gatk-tutorials/workshop_2002/3-somatic/resources . &&
gsutil -m cp -r gs://gatk-tutorials/workshop_2002/3-somatic/mutect2_precomputed .
```
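After the download finishes, a quick check that the files listed in the Data section are present can save a failed pipeline launch. The sketch below is a minimal helper under stated assumptions: the function name `check_inputs` is hypothetical, and the file list is copied from the Data section above.

```shell
# List any expected tutorial input that is missing (or empty) under a base
# directory. check_inputs is a hypothetical helper, not part of any tool.
check_inputs() {
    local base="${1:-.}"
    local missing=0
    local f
    for f in \
        ref/Homo_sapiens_assembly38.fasta \
        bams/tumor.bam \
        bams/normal.bam \
        resources/chr17_m2pon.vcf.gz
    do
        # -s is true only if the file exists and is non-empty
        if [ ! -s "$base/$f" ]; then
            echo "MISSING: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as `check_inputs .` from the directory that holds `bams/`, `ref/`, and `resources/`; a non-zero exit status means at least one file is missing.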
## NF-Core/Sarek
I want to run Sarek in somatic mode. We previously ran the germline workflow with HaplotypeCaller.
This time we try Strelka, Mutect2, and VEP annotation (see `tools` in the config).
### Command
```
nextflow run nf-core/sarek -r 3.4.0 --input ./samplesheet.csv --outdir res -profile nyu_hpc -c ./somatic.config
```
### Sample Sheet (csv)
```
patient,status,sample,bam,bai
patient1,1,tumor_sample,/scratch/kk4764/mutect2/bams/tumor.bam,/scratch/kk4764/mutect2/bams/tumor.bai
patient1,0,normal_sample,/scratch/kk4764/mutect2/bams/normal.bam,/scratch/kk4764/mutect2/bams/normal.bai
```
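For paired somatic calling, each patient needs one row with status `1` (tumor) and one with status `0` (normal), as in the sheet above. The following is a minimal sketch of a pre-flight check, assuming the column order `patient,status,sample,bam,bai` shown above; the function name `check_pairs` is hypothetical.

```shell
# Report patients that lack a tumor (status 1) or a normal (status 0) row.
# Assumes the column order patient,status,sample,bam,bai.
check_pairs() {
    awk -F',' '
        NR == 1 || NF < 2 { next }          # skip the header and blank lines
        $2 == 0 { normal[$1]++ }
        $2 == 1 { tumor[$1]++ }
        { seen[$1] = 1 }
        END {
            for (p in seen) {
                if (!(p in normal)) { print p ": no normal sample"; bad = 1 }
                if (!(p in tumor))  { print p ": no tumor sample";  bad = 1 }
            }
            exit bad
        }
    ' "$1"
}
```

`check_pairs ./samplesheet.csv` prints nothing and exits with status 0 when every patient has both sample types. Tumor-only runs are a legitimate Sarek use case, so treat a warning as a prompt to double-check rather than a hard error.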
### Config
```
params {
    config_profile_description = 'NYU NGS Analysis'
    // limit resources
    max_memory = 16.GB
    max_cpus = 2
    max_time = 1.d
    // main options
    tools = 'strelka,mutect2,vep'
    skip_tools = null
    split_fastq = 25000000
    step = 'markduplicates' // needs to start at markduplicates
    joint_mutect2 = true
    // note: the path given as dbsnp is the GATK known-indels VCF
    dbsnp = "/scratch/kk4764/ngs/nextflow/GATKBundle/Homo_sapiens_assembly38.known_indels.vcf.gz"
    dbsnp_tbi = "/scratch/kk4764/ngs/nextflow/GATKBundle/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
    fasta = "/scratch/kk4764/mutect2/ref/Homo_sapiens_assembly38.fasta"
    fasta_fai = "/scratch/kk4764/mutect2/ref/Homo_sapiens_assembly38.fasta.fai"
    dict = "/scratch/kk4764/mutect2/ref/Homo_sapiens_assembly38.dict"
    intervals = "/scratch/kk4764/mutect2/resources/chr17plus.interval_list"
}
```
## Troubleshooting
### Join Error
```
Join mismatch for the following entries:
- key=[patient:patient1, sample:normal_sample, sex:NA, status:0, id:normal_sample, data_type:cram] values=
- key=[patient:patient1, sample:tumor_sample, sex:NA, status:1, id:tumor_sample, data_type:cram] values=
```
#### Error from .nextflow.log
```
error [nextflow.exception.ProcessUnrecoverableException]: Process `NFCORE_SAREK:SAREK:BAM_BASERECALIBRATOR:GATK4_BASERECALIBRATOR` input file name collision -- There are multiple input files for each of the following file names: Homo_sapiens_assembly38.known_indels.vcf.gz
error [java.lang.InterruptedException]: java.lang.InterruptedException
error [java.lang.InterruptedException]: java.lang.InterruptedException
error [java.lang.InterruptedException]: java.lang.InterruptedException
error [java.lang.InterruptedException]: java.lang.InterruptedException
error [java.lang.IllegalStateException]: Cannot obtain the semaphore to fork operator's body.
```
#### Problem and correction:
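`dbsnp` and `dbsnp_tbi` were set to the same file. Nextflow stages each input into the task directory under its file name, so two inputs with the same name collide. Fixed by pointing `dbsnp_tbi` at the `.tbi` index.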
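A collision like this can be spotted before launching by looking for repeated file names among the paths in the config. The sketch below is a rough check under stated assumptions: `dup_basenames` is a hypothetical helper, it only sees double-quoted absolute paths written in the file you pass it, and it cannot see defaults supplied by the pipeline or the profile.

```shell
# Print any file name that appears more than once among the double-quoted
# absolute paths in a Nextflow config file (hypothetical helper).
dup_basenames() {
    grep -o '"/[^"]*"' "$1" |   # extract quoted absolute paths
        tr -d '"' |             # drop the surrounding quotes
        sed 's|.*/||' |         # keep only the file name
        sort | uniq -d          # print names that occur more than once
}
```

`dup_basenames ./somatic.config` prints nothing when every path has a distinct file name.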
### Waiting for file transfers to complete Error
The run timed out while waiting for file transfers to complete. The `.nextflow.log` shows files being downloaded from a remote location (Amazon, by the look of it).
The error also points to a file name collision. [Possible Fix](https://nextflow.io/docs/latest/process.html#multiple-input-files)
#### STDOUT
`slurmstepd: error: *** JOB 42921725 ON cs076 CANCELLED AT 2024-02-14T00:54:03 DUE TO TIME LIMIT ***`
#### Full Error
```
Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:GATHERPILEUPSUMMARIES_NORMAL (1)'
Caused by:
Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:GATHERPILEUPSUMMARIES_NORMAL` input file name collision -- There are multiple input files for each of the following file names: normal_sample.pileups.table
```
#### Possible Correction
[2/14/2024] This may be an intervals issue that arises when the pileup summaries are created. Next attempt: remove the intervals file, set `--no_intervals`, and start from `variant_calling` to reach the root cause faster. Also added the `-with-dag` option to help trace which processes are run.
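The rerun described above could look like the sketch below. It reuses the original command and adds the three changes from the note. The output name `dag.html` and the function name `rerun_cmd` are assumptions. Parameters given on the command line take precedence over those in the config file, so `--step variant_calling` overrides `step = 'markduplicates'`; the `intervals` line still has to be removed from `somatic.config` by hand.

```shell
# Print the troubleshooting rerun so it can be reviewed before launching.
# rerun_cmd and dag.html are illustrative names.
rerun_cmd() {
    echo nextflow run nf-core/sarek -r 3.4.0 \
        --input ./samplesheet.csv --outdir res \
        -profile nyu_hpc -c ./somatic.config \
        --step variant_calling --no_intervals \
        -with-dag dag.html
}

rerun_cmd
```

Printing first keeps the sketch side-effect free; paste the printed line into the shell (or an `sbatch` script) to start the run.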
### java.lang.OutOfMemoryError
[2/14/2024] This looks like a GATK GetPileupSummaries error, but the process list in STDOUT shows:
```
[- ] process > NFCORE_SAREK:SAREK:MULTIQC [ 0%] 0 of 1
```
The "Waiting for file transfers to complete" messages also appear again.
#### Nextflow error output (includes the `.command.sh` command)
```
ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:GETPILEUPSUMMARIES_NORMAL (normal_sample)'
Caused by:
Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:GETPILEUPSUMMARIES_NORMAL (normal_sample)` terminated with an error exit status (1)
Command executed:
gatk --java-options "-Xmx9830M -XX:-UsePerfData" \
GetPileupSummaries \
--input normal_sample.converted.cram \
--variant af-only-gnomad.hg38.vcf.gz \
--output normal_sample.pileups.table \
--reference Homo_sapiens_assembly38.fasta \
--intervals af-only-gnomad.hg38.vcf.gz \
--tmp-dir . \
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_MUTECT2:GETPILEUPSUMMARIES_NORMAL":
gatk4: $(echo $(gatk --version 2>&1) | sed 's/^.*(GATK) v//; s/ .*$//')
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
Using GATK jar /usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx9830M -XX:-UsePerfData -jar /usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar GetPileupSummaries --input normal_sample.converted.cram --variant af-only-gnomad.hg38.vcf.gz --output normal_sample.pileups.table --reference Homo_sapiens_assembly38.fasta --intervals af-only-gnomad.hg38.vcf.gz --tmp-dir .
07:36:33.336 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
07:36:33.386 INFO GetPileupSummaries - ------------------------------------------------------------
07:36:33.390 INFO GetPileupSummaries - The Genome Analysis Toolkit (GATK) v4.4.0.0
07:36:33.390 INFO GetPileupSummaries - For support and documentation go to https://software.broadinstitute.org/gatk/
07:36:33.390 INFO GetPileupSummaries - Executing as kk4764@cm001.hpc.nyu.edu on Linux v4.18.0-372.26.1.el8_6.x86_64 amd64
07:36:33.390 INFO GetPileupSummaries - Java runtime: OpenJDK 64-Bit Server VM v17.0.3-internal+0-adhoc..src
07:36:33.390 INFO GetPileupSummaries - Start Date/Time: February 14, 2024 at 7:36:33 AM GMT
07:36:33.390 INFO GetPileupSummaries - ------------------------------------------------------------
07:36:33.390 INFO GetPileupSummaries - ------------------------------------------------------------
07:36:33.391 INFO GetPileupSummaries - HTSJDK Version: 3.0.5
07:36:33.391 INFO GetPileupSummaries - Picard Version: 3.0.0
07:36:33.391 INFO GetPileupSummaries - Built for Spark Version: 3.3.1
07:36:33.391 INFO GetPileupSummaries - HTSJDK Defaults.COMPRESSION_LEVEL : 2
07:36:33.391 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
07:36:33.392 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
07:36:33.392 INFO GetPileupSummaries - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
07:36:33.392 INFO GetPileupSummaries - Deflater: IntelDeflater
07:36:33.392 INFO GetPileupSummaries - Inflater: IntelInflater
07:36:33.392 INFO GetPileupSummaries - GCS max retries/reopens: 20
07:36:33.392 INFO GetPileupSummaries - Requester pays: disabled
07:36:33.392 INFO GetPileupSummaries - Initializing engine
07:36:34.136 INFO FeatureManager - Using codec VCFCodec to read file file://af-only-gnomad.hg38.vcf.gz
07:36:34.306 INFO FeatureManager - Using codec VCFCodec to read file file://af-only-gnomad.hg38.vcf.gz
07:45:23.348 WARN IntelInflater - Zero Bytes Written : 0
07:45:39.250 INFO GetPileupSummaries - Shutting down engine
[February 14, 2024 at 7:45:39 AM GMT] org.broadinstitute.hellbender.tools.walkers.contamination.GetPileupSummaries done. Elapsed time: 9.10 minutes.
Runtime.totalMemory()=10309599232
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3512)
at java.base/java.util.Arrays.copyOf(Arrays.java:3481)
at java.base/java.util.ArrayList.toArray(ArrayList.java:369)
at java.base/java.util.ArrayList.addAll(ArrayList.java:670)
at org.broadinstitute.hellbender.utils.IntervalUtils.parseIntervalArguments(IntervalUtils.java:319)
at org.broadinstitute.hellbender.utils.IntervalUtils.loadIntervals(IntervalUtils.java:239)
at org.broadinstitute.hellbender.cmdline.argumentcollections.IntervalArgumentCollection.parseIntervals(IntervalArgumentCollection.java:200)
at org.broadinstitute.hellbender.cmdline.argumentcollections.IntervalArgumentCollection.getTraversalParameters(IntervalArgumentCollection.java:180)
at org.broadinstitute.hellbender.cmdline.argumentcollections.IntervalArgumentCollection.getIntervals(IntervalArgumentCollection.java:111)
at org.broadinstitute.hellbender.engine.GATKTool.initializeIntervals(GATKTool.java:525)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:728)
at org.broadinstitute.hellbender.engine.LocusWalker.onStartup(LocusWalker.java:128)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:147)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Work dir:
/scratch/kk4764/mutect2/work/07/9373193f393eb5e20c2673395e84e5
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
```
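The stack trace above shows the heap running out inside `IntervalUtils.parseIntervalArguments`, that is, while GATK was loading the `--intervals` argument. In this command that argument is the full `af-only-gnomad.hg38.vcf.gz`, which looks like what the pipeline passes when no intervals file is supplied (an inference from this command line, not something checked in the Sarek source). The sketch below pulls the relevant frame out of a log such as `.command.err` in the work dir; the function name `oom_frame` is hypothetical.

```shell
# Print the first GATK (org.broadinstitute) stack frame that follows a
# java.lang.OutOfMemoryError line in the given log file.
oom_frame() {
    awk '
        /java.lang.OutOfMemoryError/ { found = 1; next }
        found && /org\.broadinstitute/ {
            sub(/^[ \t]*at /, "")   # strip the leading "at "
            print
            exit
        }
    ' "$1"
}
```

For the run above, `oom_frame /scratch/kk4764/mutect2/work/07/9373193f393eb5e20c2673395e84e5/.command.err` should point at the interval-parsing frame, which suggests looking at the intervals input before raising memory.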
#### Possible Correction
1. Change the memory settings, as discussed for WSL2 or Docker environments? See https://gatk.broadinstitute.org/hc/en-us/community/posts/20185757419163-Out-of-memory-error-in-GetPileupSummaries
2. Find a way to skip `GETPILEUPSUMMARIES`.
3. Address the input name clash through the `ext.prefix` settings of the process:
```
withName: 'GETPILEUPSUMMARIES.*' {
    ext.prefix = { meta.num_intervals <= 1 ? "${meta.id}.mutect2" : "${meta.id}.mutect2.${intervals.simpleName}" }
    publishDir = [
        mode: 'copy',
        path: { "${params.outdir}/variant_calling/" },
        pattern: "*.table",
        saveAs: { meta.num_intervals > 1 ? null : "mutect2/${meta.id}/${it}" }
    ]
}
withName: 'GETPILEUPSUMMARIES_.*' {
    ext.prefix = { meta.num_intervals <= 1 ? "${meta.id}.mutect2" : "${meta.id}.mutect2.${intervals.simpleName}" }
    publishDir = [
        mode: 'copy',
        path: { "${params.outdir}/variant_calling/" },
        pattern: "*.table",
        saveAs: { meta.num_intervals > 1 ? null : "mutect2/${meta.tumor_id}_vs_${meta.normal_id}/${it}" }
    ]
}
```