# WGS pipeline using GATK best practices
Work within the conda environment created previously.
This tutorial is strictly for learning purposes and walks through the steps of the GATK best practices workflow for WGS data. The guidelines can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery) for data pre-processing and [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-) for germline short variant discovery.

```
conda activate <env-name>
```
Pick some data from the SRA database using sra-tools.
Create a text file called `accession.txt` and paste in these accession numbers:
```
ERR10219898
ERR10219899
ERR10219900
ERR10219901
```
Download the data:
```
mkdir -p fastqs
for sample in `cat accession.txt`;
do
# download paired-end reads into fastqs/ so the later steps can find them there
fastq-dump --gzip --split-3 --outdir fastqs $sample
done
```
Now we have our FASTQ files in `fastqs/`.
Let's check the quality. Create a folder called `fastqc_htmls`:
```
fastqc fastqs/* --outdir fastqc_htmls
cd fastqc_htmls
multiqc .
```
For interpretation of these results, go [here](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/)
How many reads are contained in each file?
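One way to answer this, sketched for a single gzipped file from the accession list (each FASTQ record spans four lines; FastQC's "Total Sequences" field reports the same number):
```
# reads = number of lines / 4
echo $(( $(zcat fastqs/ERR10219898_1.fastq.gz | wc -l) / 4 ))
```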
## Trimming
Create a directory called `trimmed_reads`
```
for sample in `cat accession.txt`;
do
R1=fastqs/"${sample}"_1.fastq.gz
R2=fastqs/"${sample}"_2.fastq.gz
trim_galore --phred33 -j 4 -q 10 --length 5 --paired $R1 $R2 -o trimmed_reads
done
```
Try running the trimming step using `trimmomatic`.
Which option do you have to specify to output gzipped trimmed reads?
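As a hint, here is a minimal paired-end Trimmomatic sketch; the thresholds roughly mirror the Trim Galore call above and the output names are just examples. Note that Trimmomatic infers gzip compression from a `.gz` output extension, while Trim Galore has a `--gzip` flag.
```
for sample in `cat accession.txt`;
do
R1=fastqs/"${sample}"_1.fastq.gz
R2=fastqs/"${sample}"_2.fastq.gz
trimmomatic PE -threads 4 -phred33 \
$R1 $R2 \
trimmed_reads/"${sample}"_1_paired.fq.gz trimmed_reads/"${sample}"_1_unpaired.fq.gz \
trimmed_reads/"${sample}"_2_paired.fq.gz trimmed_reads/"${sample}"_2_unpaired.fq.gz \
SLIDINGWINDOW:4:10 MINLEN:5
done
```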
#### Run a second quality check to confirm that the reads meet the trimming specifications
Compare reports before and after trimming
Compare file sizes before and after trimming
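A sketch of one way to do both comparisons (the `fastqc_trimmed` folder name is just an example):
```
mkdir -p fastqc_trimmed
fastqc trimmed_reads/*.fq.gz --outdir fastqc_trimmed
multiqc fastqc_trimmed -o fastqc_trimmed
# total size before and after trimming
du -sh fastqs trimmed_reads
```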
## Alignment
Download our reference genome for chromosome 13 from Ensembl
```
wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
```
Index using samtools and bwa
Reading assignment
- What do the files from `bwa` indexing contain? These are the files ending in `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`; the `.fai` file is produced by `samtools faidx`
The reference genome should be in the same directory as its index
```
gunzip Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
bwa index Homo_sapiens.GRCh38.dna.chromosome.13.fa
samtools faidx Homo_sapiens.GRCh38.dna.chromosome.13.fa
```
Alignment using `bwa mem`:
```
for sample in `cat accession.txt`;
do
R1=trimmed_reads/"${sample}"_1_val_1.fq.gz
R2=trimmed_reads/"${sample}"_2_val_2.fq.gz
bwa mem -aM -t 4 \
Homo_sapiens.GRCh38.dna.chromosome.13.fa \
-R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
$R1 $R2 > "${sample}".sam
# convert to BAM format
samtools view -Shb "${sample}".sam > "${sample}".bam
# all this can be run using one command
# bwa mem -aM -t 4 Homo_sapiens.GRCh38.dna.chromosome.13.fa -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
# $R1 $R2 | samtools view -Shb - > "${sample}.bam"
done
```
How would you (one possible approach is sketched below):
- Check the percentage of aligned reads in the BAM files
- Extract unaligned reads from the BAM files
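A minimal sketch using `samtools`, assuming the BAM file names produced by the loop above:
```
# the percentage of mapped reads is reported on the "mapped" line
samtools flagstat ERR10219898.bam
# unaligned reads carry SAM flag 4; write them to a separate BAM
samtools view -b -f 4 ERR10219898.bam > ERR10219898_unmapped.bam
```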
Sorting and indexing BAM files
```
for sample in `cat accession.txt`;
do
samtools sort "${sample}".bam -o "${sample}"_sorted.bam
samtools index "${sample}"_sorted.bam
done
```
**Note**
By now you will have noticed that several intermediate steps are required to generate a coordinate-sorted BAM file. With newer versions of SAMtools, a coordinate-sorted BAM file can be generated in one command straight from the FASTQ files, avoiding the intermediate files:
`bwa mem -aM -t 4 Homo_sapiens.GRCh38.dna.chromosome.13.fa -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" $R1 $R2 | samtools sort -o "${sample}_sorted.bam" -`
View the alignments using IGV
Files needed
- Reference genome file
- Coordinate-sorted BAM file and its index
- GFF file for annotations
Before using GATK, we shall create a sequence dictionary for the reference genome. This should also be located in the same directory as the reference genome:
```
gatk CreateSequenceDictionary --REFERENCE Homo_sapiens.GRCh38.dna.chromosome.13.fa --OUTPUT Homo_sapiens.GRCh38.dna.chromosome.13.dict
```
## Using GATK
While following the GATK best practices guidelines, the following steps are necessary. These steps are computationally intensive:
1. Marking duplicates
2. Base recalibration
3. HaplotypeCaller
The guidelines can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-)
- Why is it necessary to remove duplicates from the BAM files?
- Why is base recalibration important?
Make directories `metrics` and `gatk`
Download a reference set of known variants
```
wget https://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/homo_sapiens-chr13.vcf.gz
gunzip homo_sapiens-chr13.vcf.gz
```
Index the VCF (depending on your GATK version, the input flag may be `-I/--input` rather than `-F`):
```
gatk IndexFeatureFile \
-F homo_sapiens-chr13.vcf
```
Copy the script below and save it as `gatk.sh`. It may take a while to run, so it can be run in the background:
`nohup bash gatk.sh >& gatk.log &`
```
reference_genome=Homo_sapiens.GRCh38.dna.chromosome.13.fa
known_vcf=homo_sapiens-chr13.vcf
for sample in `cat accession.txt`;
do
input_bam="${sample}"_sorted.bam
# marking duplicates
gatk MarkDuplicates \
--INPUT $input_bam \
--OUTPUT gatk/"${sample}"_sorted_dedup.bam \
--METRICS_FILE metrics/"${sample}"_dup_metrics.txt \
--REMOVE_DUPLICATES true \
--CREATE_INDEX true
# BQSR building the model
gatk BaseRecalibrator \
--input gatk/"${sample}"_sorted_dedup.bam \
--output gatk/"${sample}"_recal_data.table \
--reference "${reference_genome}" \
--known-sites "${known_vcf}"
# Applying BQSR Recalibration
gatk ApplyBQSR \
--bqsr-recal-file gatk/"${sample}"_recal_data.table \
--input gatk/"${sample}"_sorted_dedup.bam \
--output gatk/"${sample}"_sorted_dedup_BQSR_recal.bam \
--reference "${reference_genome}"
# Checking the quality of recalibration
# Post BQSR recal table
gatk BaseRecalibrator \
--input gatk/"${sample}"_sorted_dedup_BQSR_recal.bam \
--output gatk/"${sample}"_post_recal_data.table \
--reference "${reference_genome}" \
--known-sites "${known_vcf}"
# Analyse covariates (compare before and After BQSR)
# Evaluate and compare base quality score recalibration tables
gatk AnalyzeCovariates \
-before gatk/"${sample}"_recal_data.table \
-after gatk/"${sample}"_post_recal_data.table \
-plots gatk/"${sample}"_AnalyzeCovariates.pdf
gatk HaplotypeCaller \
--reference "${reference_genome}" \
--input gatk/"${sample}"_sorted_dedup_BQSR_recal.bam \
--output gatk/"${sample}".g.vcf.gz \
--ERC GVCF
done
```
- Inspect the files saved in the metrics directory
- Compare the sizes of files before and after marking duplicates
- Compare the sizes of files before and after base recalibration
- What is the difference between a GVCF and a VCF?
Calling variants as GVCFs allows for joint genotyping across samples. This is done using `gatk GenomicsDBImport` and `gatk GenotypeGVCFs`.
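A minimal sketch of joint genotyping for the four samples, assuming the per-sample GVCFs produced above and chromosome 13 as the interval:
```
gatk GenomicsDBImport \
-V gatk/ERR10219898.g.vcf.gz \
-V gatk/ERR10219899.g.vcf.gz \
-V gatk/ERR10219900.g.vcf.gz \
-V gatk/ERR10219901.g.vcf.gz \
--genomicsdb-workspace-path genomicsdb_chr13 \
--intervals 13

gatk GenotypeGVCFs \
--reference Homo_sapiens.GRCh38.dna.chromosome.13.fa \
--variant gendb://genomicsdb_chr13 \
--output joint_genotyped.vcf.gz
```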
## Filter variants either with VQSR or by hard-filtering

The reference to these steps can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering)
For human data, variant quality scores are recalibrated using VQSR. The logic behind this step can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR-). These steps are conceptually similar to base recalibration and are done using `gatk VariantRecalibrator` and `gatk ApplyVQSR`.
Truth and training datasets include data from the 1000 Genomes Project, HapMap, dbSNP and Omni projects/databases; a sketch of the two commands is shown below.
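This sketch is for SNPs only; the resource VCF file names are placeholders that you would need to download (for example from the GATK resource bundle), and the annotations and tranche level are only illustrative:
```
gatk VariantRecalibrator \
--reference Homo_sapiens.GRCh38.dna.chromosome.13.fa \
--variant joint_genotyped.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_snps.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
--output snps.recal \
--tranches-file snps.tranches

gatk ApplyVQSR \
--reference Homo_sapiens.GRCh38.dna.chromosome.13.fa \
--variant joint_genotyped.vcf.gz \
--recal-file snps.recal \
--tranches-file snps.tranches \
--truth-sensitivity-filter-level 99.7 \
-mode SNP \
--output joint_genotyped_vqsr_snps.vcf.gz
```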
For other organisms, such as cattle, that do not have reliable truth sets, hard filtering is recommended. Guidelines on further steps using hard filtering can be found [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants). Variants are separated into SNPs and INDELs using `gatk SelectVariants`, followed by filtering on thresholds using `gatk VariantFiltration` to obtain analysis-ready variants for annotation and downstream analysis. Annotation can be done using `ANNOVAR`, `SnpEff` or `gatk Funcotator`.
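A minimal hard-filtering sketch for SNPs; the thresholds follow the GATK hard-filtering guidance and should be treated as starting points, not definitive values:
```
# separate SNPs from the joint-genotyped VCF
gatk SelectVariants \
--variant joint_genotyped.vcf.gz \
--select-type-to-include SNP \
--output joint_snps.vcf.gz

# flag variants failing the thresholds (they are marked, not removed)
gatk VariantFiltration \
--variant joint_snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--output joint_snps_filtered.vcf.gz
```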
## Note
Pipelines are typically written in a workflow language such as `Nextflow`, `Snakemake`, `WDL` or `bpipe`. The steps that we just covered, up to HaplotypeCaller, are what constitute a pipeline or workflow. These workflow languages are ideal for writing complex workflows and allow for reproducibility.
Containerisation platforms such as `Singularity` and `Docker` are also used for reproducibility by packaging software in a portable manner.
It is also important to note that job submissions to compute clusters require slight modifications to the scripts. Job schedulers such as the Portable Batch System (PBS) and the Simple Linux Utility for Resource Management (Slurm) are used on high-performance computing (HPC) clusters, while `HTCondor` is used in high-throughput computing (HTC) environments.
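For example, on a Slurm-managed cluster the `gatk.sh` script above could be wrapped in a submission script along these lines; the resource requests and the way conda is initialised are assumptions that depend on your cluster:
```
#!/bin/bash
#SBATCH --job-name=gatk_wgs
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=gatk_%j.log

# activate the conda environment used throughout this tutorial
# (adjust to match how conda is set up on your cluster)
source activate <env-name>

bash gatk.sh
```
Submit it with `sbatch`, e.g. `sbatch gatk_slurm.sh` (the file name here is just an example).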
# De novo assembly
Deinterleave the interleaved FASTQ file (here `ERR10370895.fastq.gz`) using the script below.
```
#!/bin/bash
# Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq [compress]
#
# Deinterleaves a FASTQ file of paired reads into two FASTQ
# files specified on the command line. Optionally GZip compresses the output
# FASTQ files using pigz if the 3rd command line argument is the word "compress"
#
# Can deinterleave 100 million paired reads (200 million total
# reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
#
# Latest code: https://gist.github.com/3521724
# Also see my interleaving script: https://gist.github.com/4544979
#
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
# Set up some defaults
GZIP_OUTPUT=0
PIGZ_COMPRESSION_THREADS=4 #$(nproc)
# If the third argument is the word "compress" then we'll compress the output using pigz
if [[ $3 == "compress" ]]; then
GZIP_OUTPUT=1
fi
if [[ ${GZIP_OUTPUT} == 0 ]]; then
paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > $1) | cut -f 5-8 | tr "\t" "\n" > $2
else
paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $1) | cut -f 5-8 | tr "\t" "\n" | pigz --best --processes ${PIGZ_COMPRESSION_THREADS} > $2
fi
```
Save the script as `deinterleave.sh` and make it executable with `chmod +x deinterleave.sh`, then run:
```
gzip -dc ERR10370895.fastq.gz | ./deinterleave.sh ERR10370895_1.fastq.gz ERR10370895_2.fastq.gz compress
```
1. Run an assembly via the command line using any of the popular assemblers such as Velvet, SPAdes, Unicycler, MIRA, ABySS, etc., and give a reason for your tool of choice. Briefly explain the parameters used (if any) during assembly. A minimal starting point is sketched after this list.
2. Comment on your assembly output, highlighting the number of contigs generated, the L50, N50, maximum and minimum contig lengths, what proportion of the reads were assembled into contigs, etc. This can be generated using the QUAST tool.
3. Repeat the assembly with a different assembler or a different k-mer size and comment on any observed differences between the two outputs.
4. Use appropriate contig-merger and gap-filler tools to combine your contigs into a draft scaffold. (Describe your steps, the tools/pipeline used and your final output.) Use `sspace`.
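As a starting point only (thread count and output directory names are arbitrary choices), a SPAdes assembly of the deinterleaved reads followed by QUAST could look like this:
```
spades.py -1 ERR10370895_1.fastq.gz -2 ERR10370895_2.fastq.gz \
-t 4 -o spades_assembly

quast.py spades_assembly/contigs.fasta -o quast_report
```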
## Nextflow
This script runs FastQC and MultiQC:
```
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.fastq  = "data/*{R1,R2}.fastq"
params.outdir = "fastqc_results"

// Run FastQC on each read pair; each sample gets its own report directory
// so the collected outputs do not collide by name
process FASTQC {
    publishDir params.outdir, mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "fastqc_${sample_id}"

    script:
    """
    mkdir fastqc_${sample_id}
    fastqc ${reads} --outdir fastqc_${sample_id}
    """
}

// Aggregate all FastQC reports into a single MultiQC report
process MULTIQC {
    publishDir params.outdir, mode: 'copy'

    input:
    path '*'

    output:
    path "multiqc_report.html"

    script:
    """
    multiqc .
    """
}

workflow {
    Channel
        .fromFilePairs(params.fastq, checkIfExists: true)
        .set { read_pairs_ch }

    MULTIQC(FASTQC(read_pairs_ch).collect())
}
```
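To run the workflow, assuming it is saved as `fastqc.nf` (an example file name) and the FASTQ files live under `data/`:
```
nextflow run fastqc.nf --fastq "data/*{R1,R2}.fastq" --outdir fastqc_results
```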