# WGS pipeline using GATK best practices
Work within the conda environment created previously.
This tutorial is strictly for learning purposes and walks through the steps of the GATK best practices workflow for WGS data. The guidelines can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery) for data pre-processing and [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-) for germline short variant discovery.

```
conda activate <env-name>
```
Pick some data from the SRA database using sra-tools.
Create a text file called `accession.txt` and paste in these accession numbers:
```
ERR10219898
ERR10219899
ERR10219900
ERR10219901
```
Download the data:
```
mkdir -p fastqs
for sample in `cat accession.txt`;
do
# download paired-end reads into fastqs/ so the later steps can find them there
fastq-dump --gzip --split-3 --outdir fastqs $sample
done
```
Now we have our FASTQ files in `fastqs/`.
Let's check the quality. Create a folder called `fastqc_htmls`:
```
fastqc fastqs/* --outdir fastqc_htmls
cd fastqc_htmls
multiqc .
```
For interpretation of these results, go [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)
How many reads are contained in each file?
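One way to answer this, sketched for a single gzipped file from the accession list (each FASTQ record spans four lines; FastQC's "Total Sequences" field reports the same number):
```
# reads = number of lines / 4
echo $(( $(zcat fastqs/ERR10219898_1.fastq.gz | wc -l) / 4 ))
```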
## Trimming
Create a directory called `trimmed_reads`
```
for sample in `cat accession.txt`;
do
R1=fastqs/"${sample}"_1.fastq.gz
R2=fastqs/"${sample}"_2.fastq.gz
trim_galore --phred33 -j 4 -q 10 --length 5 --paired $R1 $R2 -o trimmed_reads
done
```
Try running the trimming step using `trimmomatic`.
Which option do you have to specify to output gzipped trimmed reads?
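As a hint, here is a minimal paired-end Trimmomatic sketch; the thresholds roughly mirror the Trim Galore call above and the output names are just examples. Note that Trimmomatic infers gzip compression from a `.gz` output extension, while Trim Galore has a `--gzip` flag.
```
for sample in `cat accession.txt`;
do
R1=fastqs/"${sample}"_1.fastq.gz
R2=fastqs/"${sample}"_2.fastq.gz
trimmomatic PE -threads 4 -phred33 \
$R1 $R2 \
trimmed_reads/"${sample}"_1_paired.fq.gz trimmed_reads/"${sample}"_1_unpaired.fq.gz \
trimmed_reads/"${sample}"_2_paired.fq.gz trimmed_reads/"${sample}"_2_unpaired.fq.gz \
SLIDINGWINDOW:4:10 MINLEN:5
done
```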
#### Run a second quality check to confirm that the reads meet the trimming specifications
Compare reports before and after trimming
Compare file sizes before and after trimming
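A sketch of one way to do both comparisons (the `fastqc_trimmed` folder name is just an example):
```
mkdir -p fastqc_trimmed
fastqc trimmed_reads/*.fq.gz --outdir fastqc_trimmed
multiqc fastqc_trimmed -o fastqc_trimmed
# total size before and after trimming
du -sh fastqs trimmed_reads
```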
## Alignment
Download our reference genome for chromosome 13 from Ensembl
```
wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
```
Index using samtools and bwa
Reading assignment
- What do the files from `bwa` indexing contain? These are the files ending in `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`; the `.fai` file is produced by `samtools faidx`
The reference genome should be in the same directory as its index
```
gunzip Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
bwa index Homo_sapiens.GRCh38.dna.chromosome.13.fa
samtools faidx Homo_sapiens.GRCh38.dna.chromosome.13.fa
```
Alignment using `bwa mem`:
```
for sample in `cat accession.txt`;
do
R1=trimmed_reads/"${sample}"_1_val_1.fq.gz
R2=trimmed_reads/"${sample}"_2_val_2.fq.gz
bwa mem -aM -t 4 \
Homo_sapiens.GRCh38.dna.chromosome.13.fa \
-R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
$R1 $R2 > "${sample}".sam
# convert to BAM format
samtools view -Shb "${sample}".sam > "${sample}".bam
# all this can be run using one command
# bwa mem -aM -t 4 Homo_sapiens.GRCh38.dna.chromosome.13.fa -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
# $R1 $R2 | samtools view -Shb - > "${sample}.bam"
done
```
How would you (one possible approach is sketched below):
- Check the percentage of aligned reads in the BAM files
- Extract unaligned reads from the BAM files
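A minimal sketch using `samtools`, assuming the BAM file names produced by the loop above:
```
# the percentage of mapped reads is reported on the "mapped" line
samtools flagstat ERR10219898.bam
# unaligned reads carry SAM flag 4; write them to a separate BAM
samtools view -b -f 4 ERR10219898.bam > ERR10219898_unmapped.bam
```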
Sorting and indexing BAM files
```
for sample in `cat accession.txt`;
do
samtools sort "${sample}".bam -o "${sample}"_sorted.bam
samtools index "${sample}"_sorted.bam
done
```
**Note**
By now you will have noticed that several intermediate steps are required to generate a coordinate-sorted BAM file. With newer versions of SAMtools, a coordinate-sorted BAM file can be generated in one command straight from the FASTQ files, avoiding the intermediate files:
`bwa mem -aM -t 4 Homo_sapiens.GRCh38.dna.chromosome.13.fa -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" $R1 $R2 | samtools sort -o "${sample}_sorted.bam" -`
View the alignments using IGV
Files needed
- Reference genome file
- Coordinate-sorted BAM file and its index
- GFF file for annotations
Before using GATK, we shall create a sequence dictionary for the reference genome. This should also be located in the same directory as the reference genome:
```
gatk CreateSequenceDictionary --REFERENCE Homo_sapiens.GRCh38.dna.chromosome.13.fa --OUTPUT Homo_sapiens.GRCh38.dna.chromosome.13.dict
```
## Using GATK
While following the GATK best practices guidelines, the following steps are necessary. These steps are computationally intensive:
1. Marking duplicates
2. Base recalibration
3. HaplotypeCaller
The guidelines can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-)
- Why is it necessary to remove duplicates from the BAM files?
- Why is base recalibration important?
Make directories `metrics` and `gatk`
Download a reference set of known variants
```
wget https://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/homo_sapiens-chr13.vcf.gz
gunzip homo_sapiens-chr13.vcf.gz
```
Index the VCF (depending on your GATK version, the input flag may be `-I/--input` rather than `-F`):
```
gatk IndexFeatureFile \
-F homo_sapiens-chr13.vcf
```
Copy the script below and save it as `gatk.sh`. It may take a while to run, so it can be run in the background:
`nohup bash gatk.sh >& gatk.log &`
```
reference_genome=Homo_sapiens.GRCh38.dna.chromosome.13.fa
known_vcf=homo_sapiens-chr13.vcf
for sample in `cat accession.txt`;
do
input_bam="${sample}"_sorted.bam
# marking duplicates
gatk MarkDuplicates \
--INPUT $input_bam \
--OUTPUT gatk/"${sample}"_sorted_dedup.bam \
--METRICS_FILE metrics/"${sample}"_dup_metrics.txt \
--REMOVE_DUPLICATES true \
--CREATE_INDEX true
# BQSR building the model
gatk BaseRecalibrator \
--input gatk/"${sample}"_sorted_dedup.bam \
--output gatk/"${sample}"_recal_data.table \
--reference "${reference_genome}" \
--known-sites "${known_vcf}"
# Applying BQSR Recalibration
gatk ApplyBQSR \
--bqsr-recal-file gatk/"${sample}"_recal_data.table \
--input gatk/"${sample}"_sorted_dedup.bam \
--output gatk/"${sample}"_sorted_dedup_BQSR_recal.bam \
--reference "${reference_genome}"
# Checking the quality of recalibration
# Post BQSR recal table
gatk BaseRecalibrator \
--input gatk/"${sample}"_sorted_dedup_BQSR_recal.bam \
--output gatk/"${sample}"_post_recal_data.table \
--reference "${reference_genome}" \
--known-sites "${known_vcf}"
# Analyse covariates (compare before and After BQSR)
# Evaluate and compare base quality score recalibration tables
gatk AnalyzeCovariates \
-before gatk/"${sample}"_recal_data.table \
-after gatk/"${sample}"_post_recal_data.table \
-plots gatk/"${sample}"_AnalyzeCovariates.pdf
gatk HaplotypeCaller \
--reference "${reference_genome}" \
--input gatk/"${sample}"_sorted_dedup_BQSR_recal.bam \
--output gatk/"${sample}".g.vcf.gz \
--ERC GVCF
done
```
- Inspect the files saved in the metrics directory
- Compare the sizes of files before and after marking duplicates
- Compare the sizes of files before and after base recalibration
- What is the difference between a GVCF and a VCF?
Calling variants as GVCFs allows for joint genotyping across samples. This is done using `gatk GenomicsDBImport` and `gatk GenotypeGVCFs`.
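A minimal sketch of joint genotyping for the four samples, assuming the per-sample GVCFs produced above and chromosome 13 as the interval:
```
gatk GenomicsDBImport \
-V gatk/ERR10219898.g.vcf.gz \
-V gatk/ERR10219899.g.vcf.gz \
-V gatk/ERR10219900.g.vcf.gz \
-V gatk/ERR10219901.g.vcf.gz \
--genomicsdb-workspace-path genomicsdb_chr13 \
--intervals 13

gatk GenotypeGVCFs \
--reference Homo_sapiens.GRCh38.dna.chromosome.13.fa \
--variant gendb://genomicsdb_chr13 \
--output joint_genotyped.vcf.gz
```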
## Filter variants either with VQSR or by hard-filtering

The reference to these steps can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering)
For human data, variant quality scores are recalibrated using VQSR. The logic behind this step can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR-). These steps are conceptually similar to base recalibration and are done using `gatk VariantRecalibrator` and `gatk ApplyVQSR`.
Truth and training datasets include data from the 1000 Genomes Project, HapMap, dbSNP and Omni projects/databases; a sketch of the two commands is shown below.
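This sketch is for SNPs only; the resource VCF file names are placeholders that you would need to download (for example from the GATK resource bundle), and the annotations and tranche level are only illustrative:
```
gatk VariantRecalibrator \
--reference Homo_sapiens.GRCh38.dna.chromosome.13.fa \
--variant joint_genotyped.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_snps.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
--output snps.recal \
--tranches-file snps.tranches

gatk ApplyVQSR \
--reference Homo_sapiens.GRCh38.dna.chromosome.13.fa \
--variant joint_genotyped.vcf.gz \
--recal-file snps.recal \
--tranches-file snps.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode SNP \
--output joint_genotyped_vqsr_snps.vcf.gz
```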
For other organisms, such as cattle, that do not have reliable truth sets, hard filtering is recommended. Guidelines on further steps using hard filtering can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants). Variants are separated into SNPs and INDELs using `gatk SelectVariants`, followed by filtering on thresholds using `gatk VariantFiltration` to obtain analysis-ready variants for annotation and downstream analysis. Annotation can be done using `ANNOVAR`, `SnpEff` or `gatk Funcotator`.
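A minimal hard-filtering sketch for SNPs; the thresholds follow the GATK hard-filtering guidance and should be treated as starting points, not definitive values:
```
# separate SNPs from the joint-genotyped VCF
gatk SelectVariants \
--variant joint_genotyped.vcf.gz \
--select-type-to-include SNP \
--output joint_snps.vcf.gz

# flag variants failing the thresholds (they are marked, not removed)
gatk VariantFiltration \
--variant joint_snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--output joint_snps_filtered.vcf.gz
```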
## Note
Pipelines are typically written in a workflow language such as `Nextflow`, `Snakemake`, `WDL` or `bpipe`. The steps that we just covered, up to HaplotypeCaller, are what constitute a pipeline or workflow. These workflow languages are ideal for writing complex workflows and allow for reproducibility.
Containerisation platforms such as `Singularity` and `Docker` are also used for reproducibility by packaging software in a portable manner.
It is also important to note that job submissions to compute clusters require slight modifications to the scripts. Job schedulers such as the Portable Batch System (PBS) and the Simple Linux Utility for Resource Management (Slurm) are used on high-performance computing (HPC) clusters, while `HTCondor` is used in high-throughput computing (HTC) environments.
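For example, on a Slurm-managed cluster the `gatk.sh` script above could be wrapped in a submission script along these lines; the resource requests and the way conda is initialised are assumptions that depend on your cluster:
```
#!/bin/bash
#SBATCH --job-name=gatk_wgs
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=gatk_%j.log

# activate the conda environment used throughout this tutorial
# (adjust to match how conda is set up on your cluster)
source activate <env-name>

bash gatk.sh
```
Submit it with `sbatch`, e.g. `sbatch gatk_slurm.sh` (the file name here is just an example).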
# De novo assembly
Deinterleave the interleaved FASTQ file (here `ERR10370895.fastq.gz`) using the script below.
```
#!/bin/bash
# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
#
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=4 #$(nproc)
# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ $3 == "compress" ]]; then
GZIP_OUTPUT=1
fi
if [[ ${GZIP_OUTPUT} == 0 ]]; then
paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > $1) | cut -f 5-8 | tr "\t" "\n" > $2
else
paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $1) | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $2
fi
```
Save the script as `deinterleave.sh` and make it executable with `chmod +x deinterleave.sh`, then run:
```
gzip -dc ERR10370895.fastq.gz | ./deinterleave.sh ERR10370895_1.fastq.gz ERR10370895_2.fastq.gz compress
```
1. Run an assembly via the command line using any of the popular assemblers such as Velvet, SPAdes, Unicycler, MIRA, ABySS, etc., and give a reason for your tool of choice. Briefly explain the parameters used (if any) during assembly. A minimal starting point is sketched after this list.
2. Comment on your assembly output, highlighting the number of contigs generated, the L50, N50, maximum and minimum contig lengths, what proportion of the reads were assembled into contigs, etc. This can be generated using the QUAST tool.
3. Repeat the assembly with a different assembler or a different k-mer size and comment on any observed differences between the two outputs.
4. Use appropriate contig-merger and gap-filler tools to combine your contigs into a draft scaffold. (Describe your steps, the tools/pipeline used and your final output.) Use `sspace`.
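As a starting point only (thread count and output directory names are arbitrary choices), a SPAdes assembly of the deinterleaved reads followed by QUAST could look like this:
```
spades.py -1 ERR10370895_1.fastq.gz -2 ERR10370895_2.fastq.gz \
-t 4 -o spades_assembly

quast.py spades_assembly/contigs.fasta -o quast_report
```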
## Nextflow
This script runs FastQC and MultiQC:
```
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.fastq  = "data/*{R1,R2}.fastq"
params.outdir = "fastqc_results"

// Run FastQC on each read pair; each sample gets its own report directory
// so the collected outputs do not collide by name
process FASTQC {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}"

    script:
    """
    mkdir fastqc_${sample_id}
    fastqc ${reads} --outdir fastqc_${sample_id}
    """
}

// Aggregate all FastQC reports into a single MultiQC report
process MULTIQC {
    publishDir params.outdir, mode: 'copy'

    input:
    path '*'

    output:
    path "multiqc_report.html"

    script:
    """
    multiqc .
    """
}

workflow {
    Channel
        .fromFilePairs(params.fastq, checkIfExists: true)
        .set { read_pairs_ch }

    MULTIQC(FASTQC(read_pairs_ch).collect())
}
```
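To run the workflow, assuming it is saved as `fastqc.nf` (an example file name) and the FASTQ files live under `data/`:
```
nextflow run fastqc.nf --fastq "data/*{R1,R2}.fastq" --outdir fastqc_results
```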