8. Variant Calling and Filtration

# 8. Variant Calling and Filtration ###### tags: `2. Main Steps` `tutorials` `variant calling` The main purpose of these next steps is to find variants in the sequence data and filter them to only count the variants that are 'real' and not a product of sequencing errors. This will end with a filtered VCF output. And with that you can do LOTS of things:) ### Step 1. Haplotype Caller This step may take a VERY long time and it's not recommended that you run it on the entire genome at once (ours took 5+ days to run on the entire *D.melanogaster* genome!). Instead, it's recommended you specify a genomic interval. If you want the whole genome you can run a job array for all of the chromosomes as distinct runs. In our example below, we specified the left and right arms of Chromosome 2 seperately (using a loop and the -L option below). If you have multiple intervals you'd like to specify, you can make an interval list shown [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists#:~:text=GATK%20supports%20several%20types%20of,bed%20%2C%20and%20VCF%20files.). #### NOTE: we moved our working directory to be in the scratch space (`/work/noor_wgs`) We realized that writing and reading big files inside the archival folders can be slow, and haplotype calling is a long running step. So we moved the aligned, marged, duplicates-marked BAMs files to the scratch space, and edit the code to point to the files in scratch space instead of the archival space. Also, the files in this scratch space will get erased if untouched for 75 days, so don't keep important files here for too long. More information [here.](https://rc.duke.edu/dcc/cluster-storage/) ``` #!/bin/bash #SBATCH -e errs/%a_hapCall_chr2-%A.err #SBATCH -o outs/%a_hapCall_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=rp280@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do java -jar $GATK HaplotypeCaller \ -L 2${ARM} \ -R /work/noor_wgs/ref/dmel-all-chromosome-r6.41.fasta \ -I /work/noor_wgs/aligned_merged_mdup_BAMs/aligned_merged_mdup_L${LL}.bam \ -O /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}.g.vcf.gz \ -ERC GVCF done ``` ### Step 2. Genotype GVCFs ``` #!/bin/bash #SBATCH -e errs/%a_gVCF_chr2-%A.err #SBATCH -o outs/%a_gVCF_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=rp280@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do java -jar $GATK GenotypeGVCFs \ -R /work/noor_wgs/ref/dmel-all-chromosome-r6.41.fasta \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}.g.vcf.gz \ -O /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}.vcf.gz done ``` ### Step 3. Select Variants (SNPs and Indels seperate) ``` #!/bin/bash #SBATCH -e errs/%a_selectvar_chr2-%A.err #SBATCH -o outs/%a_selectvar_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=rp280@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do gunzip -k /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}.vcf.gz java -jar $GATK SelectVariants \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}.vcf \ --select-type-to-include SNP \ -O /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_SNP.vcf java -jar $GATK SelectVariants \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}.vcf \ --select-type-to-include INDEL \ -O /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_indel.vcf done ``` ### Step 4. Count Variants (before filtration) ``` #!/bin/bash #SBATCH -e errs/%a_countvar_chr2-%A.err #SBATCH -o outs/%a_countvar_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=rp280@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do java -jar $GATK CountVariants \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_SNP.vcf java -jar $GATK CountVariants \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_indel.vcf done ``` ### Step 5. Filter and Select Variants Note that for this step the cutoffs you want to use for the different filtering variables may be different from what we used. In this example we used cutoffs suggested by GATK best practices (a good resource [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants)), but another way to determine the filtering criteria is to use statistical cutoffs (e.g. you can plot the distribution the different statistics for your genome data and exclude anything above the 95/99th percentile). ``` #!/bin/bash #SBATCH -e errs/%a_filter_chr2-%A.err #SBATCH -o outs/%a_filter_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=rp280@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do java -jar $GATK VariantFiltration \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_SNP.vcf \ -R /work/noor_wgs/ref/dmel-all-chromosome-r6.41.fasta \ -O /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_SNP_filtered.vcf \ --filter-name "QD_filter" \ --filter-expression "QD < 2.0" \ --filter-name "FS_filter" \ --filter-expression "FS > 60.0" \ --filter-name "MQ_filter" \ --filter-expression "MQ < 40.0" \ --filter-name "MQRankSum_filter" \ --filter-expression "MQRankSum < -12.5" \ --filter-name "ReadPosRankSum_filter" \ --filter-expression "ReadPosRankSum < -8.0" java -jar $GATK VariantFiltration \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_indel.vcf \ -R /work/noor_wgs/ref/dmel-all-chromosome-r6.41.fasta \ -O /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_indel_filtered.vcf \ --filter-name "QD_filter" \ --filter-expression "QD < 2.0" \ --filter-name "FS_filter" \ --filter-expression "FS > 200.0" \ --filter-name "ReadPosRankSum_filter" \ --filter-expression "ReadPosRankSum < -20.0" \ --filter-name "SOR_filter" \ --filter-expression "SOR > 10.0" done ``` ### Step 6. Count Variants (that pass filtration) ``` #!/bin/bash #SBATCH -e errs/%a_countvar_filtered_chr2-%A.err #SBATCH -o outs/%a_countvar_filtered_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=rp280@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do java -jar $GATK CountVariants \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_SNP_filtered.vcf java -jar $GATK CountVariants \ -V /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_indel_filtered.vcf done ``` ### Step 7. Merge SNP and INDEL VCF MergeVCFs will be used to join indels and SNPs from the same regions of the chromosome into one VCF file. This does NOT join 2L and 2R together yet, you are merging the two VCFs for 2L into one 2L VCF and same for 2R. The next step will join VCFs for 2R and 2L into one Chromosome 2 VCF (using GatherVcfs). ``` #!/bin/bash #SBATCH -e errs/%a_merge_chr2-%A.err #SBATCH -o outs/%a_merge_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=sbm34@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 for ARM in {L,R}; do java -jar $GATK MergeVcfs \ --INPUT /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_SNP_filtered.vcf \ --INPUT /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_indel_filtered.vcf \ --OUTPUT /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2${ARM}_merged_filtered.vcf done ``` ### Step 8. Combine VCFs for 2L and 2R to make final filtered Chromosome 2 VCF! ``` #!/bin/bash #SBATCH -e errs/%a_gather_chr2-%A.err #SBATCH -o outs/%a_gather_chr2-%A.out #SBATCH --mem-per-cpu=20G #SBATCH -p scavenger #SBATCH --mail-type=ALL --mail-user=sbm34@duke.edu #SBATCH -a 30,34,45,59,73 LL=${SLURM_ARRAY_TASK_ID} module load GATK/4.1.9.0 module load Java/1.8.0_60 java -jar $GATK GatherVcfs \ --INPUT /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2L_merged_filtered.vcf \ --INPUT /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_2R_merged_filtered.vcf \ --OUTPUT /work/noor_wgs/vcfs/L${LL}/dmel_L${LL}_chr2.vcf ``` /datacommons/noor/gws_proj/data/2021_09_21_fastq/vcfs/L73