# COURSE INSTALLATION NOTES.md

Introduction to bioinformatics for DNA and RNA sequence analysis

# Installation

This workshop requires a large number of different bioinformatics tools. The instructions for installing these tools are collected here. Note that depending on your operating system and environment, some additional dependencies will likely be needed. If you are using the AWS instance built for this course, these dependencies have already been installed. However, if you are interested in the underlying dependencies and how they were installed, see the AWS AMI Setup page. The remainder of this section assumes that you are on the AWS instance, but these instructions should work on any Xenial Ubuntu distribution with the required dependencies.

# Prepare for installation

For this workshop we will be using the workspace folder to store results, executables, and input files. To start, we must choose a single directory for installing tools. In Linux, user-compiled tools are typically installed in /usr/local/bin; however, backups of the tools we will be using have already been installed there. In this tutorial we will install tools in ~/workspace/bin. Let's go ahead and make a bin directory in ~/workspace to get started.

```bash
# make a bin directory
mkdir -p ~/workspace/bin
```

# Install Samtools

Samtools is a software package written in C which provides utilities for manipulating alignment files (SAM/BAM/CRAM). It is open source, available on github, and is under an MIT license. Let's go ahead and download the source code from github to our bin directory and extract it with tar. Next we need to cd into our extracted samtools source code and configure the software. Running ./configure will make sure all dependencies are available and will also let the software know where it should be installed. After that we will need to run make to actually build the software. Finally, we can run make install, which will copy the built software and the underlying libraries, documentation, etc. to their final locations. We can check the installation and print out the help message by providing the full path to the executable.

```bash
# change to bin directory
cd ~/workspace/bin

# download and extract the source code
wget https://github.com/samtools/samtools/releases/download/1.14/samtools-1.14.tar.bz2
tar --bzip2 -xvf samtools-1.14.tar.bz2

# configure and compile
cd samtools-1.14/
./configure --prefix=/home/ubuntu/workspace/bin/samtools-1.14/
make
make install
ln -s /home/ubuntu/workspace/bin/samtools-1.14/bin/samtools /home/ubuntu/workspace/bin/samtools

# check installation
~/workspace/bin/samtools --help
```

# Install PICARD

PICARD is a set of java based tools developed by the Broad Institute. It is very useful for manipulating next generation sequencing data and is available under an open source MIT license. The version of Picard we will be using requires java 8, which has already been installed. All we need to do is download the jar file, which is a package file used to distribute java code. We can do this with wget. To run the software, we simply need to call java with the -jar option and provide the jar file.

```bash
# change to the bin directory and download the jar file
cd ~/workspace/bin
wget https://github.com/broadinstitute/picard/releases/download/2.26.6/picard.jar

# check the installation
java -jar ~/workspace/bin/picard.jar -h
```
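Throughout these notes we invoke tools by their full path or via the symlinks collected in ~/workspace/bin. Optionally, you can put that directory on your PATH so the tools can be called by name alone; this is not required for the course, just a convenience:

```bash
# optional: put ~/workspace/bin ahead of system locations for this shell session
export PATH=~/workspace/bin:$PATH

# to make this persistent across logins, append the same line to ~/.bashrc
echo 'export PATH=~/workspace/bin:$PATH' >> ~/.bashrc
```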
# Install BWA

BWA is a popular DNA alignment tool used for mapping sequences to a reference genome. It is available under an open source GPLv3 license. To install BWA, we first need to download and extract the source code. Unlike with samtools, there's no ./configure script, so we can just run make to build the software. We can then make a symlink with ln -s, which is just a reference to another file. In this case we will make a symlink so the executable at ~/workspace/bin/bwa-0.7.17/bwa can also be found at ~/workspace/bin/bwa.

```bash
# change to the bin folder, download, and extract the source code
cd ~/workspace/bin
wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2
tar --bzip2 -xvf bwa-0.7.17.tar.bz2

# build the software
cd bwa-0.7.17
make

# make symlink
ln -s ~/workspace/bin/bwa-0.7.17/bwa ~/workspace/bin/bwa

# check the installation
~/workspace/bin/bwa
```

# Install GATK 4

GATK is a toolkit developed by the Broad Institute focused primarily on variant discovery and genotyping. It is open source, hosted on github, and available under a BSD 3-clause license. First let's download and unzip GATK from github. The creators of GATK recommend running GATK through conda, which is a package, environment, and dependency management software; in essence, conda creates a virtual environment from which to run software. The next step then is to tell conda to create a virtual environment for GATK, using the yaml file included within GATK as the instructions for creating the virtual environment. We do this with the command conda env create, and we use the -p option to specify where this environment should be stored. We will also make a symlink so the executable downloaded is available directly from our bin folder. To run GATK we must first start up the virtual environment with the command conda activate; we can then run the program by providing the path to the executable. To exit the virtual environment, run the command conda deactivate.

```bash
# download and unzip
cd ~/workspace/bin
wget https://github.com/broadinstitute/gatk/releases/download/4.2.3.0/gatk-4.2.3.0.zip
unzip gatk-4.2.3.0.zip

# make sure ubuntu user can create their own conda environments
sudo chown -R ubuntu:ubuntu /home/ubuntu/.conda

# create conda environment for gatk
cd gatk-4.2.3.0/
conda env create -f gatkcondaenv.yml -p ~/workspace/bin/conda/gatk

# make symlink
ln -s ~/workspace/bin/gatk-4.2.3.0/gatk ~/workspace/bin/gatk

# test installation
conda activate ~/workspace/bin/conda/gatk
~/workspace/bin/gatk

# to exit the virtual environment
conda deactivate
```
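With the environment active, a quick way to verify the install is to have the gatk wrapper list all of the tools it provides:

```bash
# activate the environment, list available GATK tools, then deactivate
conda activate ~/workspace/bin/conda/gatk
~/workspace/bin/gatk --list
conda deactivate
```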
# Install VEP 104.3

VEP is a variant annotation tool developed by Ensembl and written in perl. By default VEP will perform annotations by making web-based API queries; however, it is much faster to have a local copy of cache and fasta files. The AWS AMI image we're using already has these files for hg38 in the directory /opt/vep_cache, as they can take a bit of time to download. To get an idea of what it's like to install these, we will install a vep_cache for petromyzon_marinus, a much smaller genome. To start we need to download VEP from github using wget and unzip it. From there we can use the INSTALL.pl script VEP provides to install the software, which will ask a series of questions listed below. We also make a symlink when the installer completes. Note that the following assumes the existence of a particular version of Perl. We had to install Perl 5.22.0 since this is the last version supported by VEP and the version that comes with Ubuntu 18.04 is newer than this.

When prompted by the install step below, use these answers:

1. Do you wish to exit so you can get updates (y) or continue (n): n [ENTER]
2. Do you want to continue installing the API (y/n)? y [ENTER]
3. Do you want to install any cache files (y/n)? y [ENTER] 147 [ENTER]
4. Do you want to install any FASTA files (y/n)? y [ENTER] 42 [ENTER]
5. Do you want to install any plugins (y/n)? n [ENTER]

```bash
# download and unzip vep
cd ~/workspace/bin
wget https://github.com/Ensembl/ensembl-vep/archive/refs/tags/release/104.3.zip
unzip 104.3.zip

# run the INSTALL.pl script provided by VEP
cd ensembl-vep-release-104.3/
/usr/local/bin/perl-5.22.0/perl -MCPAN -e 'install DBI'
/usr/local/bin/perl-5.22.0/perl INSTALL.pl --CACHEDIR /opt/vep_cache
#1. Do you wish to exit so you can get updates (y) or continue (n): n [ENTER]
#2. Do you want to continue installing the API (y/n)? y [ENTER] (if asked)
#3. Do you want to install any cache files (y/n)? y [ENTER] 147 [ENTER]
#4. Do you want to install any FASTA files (y/n)? y [ENTER] 42 [ENTER]
#5. Do you want to install any plugins (y/n)? n [ENTER]

# make a symlink
ln -s ~/workspace/bin/ensembl-vep-release-104.3/vep ~/workspace/bin/vep
```

#### Installing, configuring VEP annotation support files as cache

```bash
mkdir ~/workspace/vep_cache/
cd ~/workspace/vep_cache/
# the cache URL was omitted in the original notes; the standard Ensembl FTP location for this file is assumed below
curl -O http://ftp.ensembl.org/pub/release-104/variation/indexed_vep_cache/homo_sapiens_vep_104_GRCh38.tar.gz
tar xzf homo_sapiens_vep_104_GRCh38.tar.gz

# test the installation
~/workspace/bin/vep --help
```

# Install Varscan

Varscan is a java program designed to call variants in sequencing data. It was developed at the Genome Institute at Washington University and is hosted on github. To use Varscan we simply need to download the distributed jar file into our ~/workspace/bin. As with the other java programs installed in this section, we can invoke Varscan via java -jar.

```bash
# Install Varscan
cd ~/workspace/bin
curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
java -jar ~/workspace/bin/VarScan.v2.4.2.jar
```

# Install BCFtools

BCFtools is an open source program for variant calling and manipulating files in Variant Call Format (VCF) or Binary Variant Call Format (BCF). To install, we first need to download and extract the source code with curl and tar respectively. We can then call make to build the program and make install to copy the program to the desired directory.

```bash
cd ~/workspace/bin
curl -L -k -o bcftools-1.14.tar.bz2 https://github.com/samtools/bcftools/releases/download/1.14/bcftools-1.14.tar.bz2
tar --bzip2 -xvf bcftools-1.14.tar.bz2

# install the software
cd bcftools-1.14
make -j
make prefix=~/workspace/bin/bcftools-1.14 install
ln -s ~/workspace/bin/bcftools-1.14/bin/bcftools ~/workspace/bin/bcftools

# test installation
~/workspace/bin/bcftools -h
```
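Once built, bcftools can be pointed at any compressed VCF. A quick sketch with a hypothetical example.vcf.gz:

```bash
# print the header of a (hypothetical) compressed VCF
~/workspace/bin/bcftools view -h example.vcf.gz

# summary statistics for the same file
~/workspace/bin/bcftools stats example.vcf.gz
```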
# Install Strelka

Strelka is a germline and somatic variant caller developed by Illumina and available under an open source GPLv3 license. The binary distribution for strelka is already built and hosted on github, so to install all we have to do is download and extract the software. It is important to note that strelka is built on python 2 and won't work with python 3. The AMI we're using contains both python versions, so we just have to make sure we invoke strelka with python2. You can view the python versions on the AMI with python2 --version and python3 --version.

```bash
# download and extract
cd ~/workspace/bin
conda create --name strelka-env python=2.7
curl -L -k -o strelka-2.9.10.centos6_x86_64.tar.bz2 https://github.com/Illumina/strelka/releases/download/v2.9.10/strelka-2.9.10.centos6_x86_64.tar.bz2
tar --bzip2 -xvf strelka-2.9.10.centos6_x86_64.tar.bz2

# test installation
python2 ~/workspace/bin/strelka-2.9.10.centos6_x86_64/bin/configureStrelkaSomaticWorkflow.py -h
```

# Install Sambamba

Sambamba is a high performance alternative to samtools and provides a subset of samtools functionality. It is up to 6x faster for duplicate read marking and 4x faster for viewing alignment files. To install sambamba we can just download the binary distribution and extract it. From there we just make a symlink to make using it a bit more intuitive.

```bash
# download and extract
cd ~/workspace/bin
curl -L -k -o sambamba_v0.6.4_linux.tar.bz2 https://github.com/lomereiter/sambamba/releases/download/v0.6.4/sambamba_v0.6.4_linux.tar.bz2
tar --bzip2 -xvf sambamba_v0.6.4_linux.tar.bz2

# create symlink
ln -s ~/workspace/bin/sambamba_v0.6.4 ~/workspace/bin/sambamba

# test installation
~/workspace/bin/sambamba
```

# Install HISAT2

HISAT2 is a graph based alignment algorithm developed at Johns Hopkins University. It is heavily used in the bioinformatics community for RNAseq based alignments. To install, we will need to download and extract the binary executable. We then make a symlink to put it with the other executables we've installed.

```bash
# download and extract
cd ~/workspace/bin
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2-2.1.0-Linux_x86_64.zip

# create symlink
ln -s ~/workspace/bin/hisat2-2.1.0/hisat2 ~/workspace/bin/hisat2

# test installation
~/workspace/bin/hisat2 --help
```

# Install StringTie

StringTie is a software program to perform transcript assembly and quantification of RNAseq data. Binary distributions are available, so to install we can just download a distribution and extract it. Like with our other programs, we also make a symlink to make it easier to find.

```bash
# download and extract
cd ~/workspace/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.0.Linux_x86_64.tar.gz
tar -xzvf stringtie-2.2.0.Linux_x86_64.tar.gz

# make symlink
ln -s ~/workspace/bin/stringtie-2.2.0.Linux_x86_64/stringtie ~/workspace/bin/stringtie

# test installation
~/workspace/bin/stringtie -h
```

# Install Gffcompare

Gffcompare is a program that is used to perform operations on general feature format (GFF) and general transfer format (GTF) files. It has a binary distribution compatible with the linux we're using, so we will just download, extract, and make a symlink.

```bash
# download and extract
cd ~/workspace/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.9.8.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.9.8.Linux_x86_64.tar.gz

# make symlink
ln -s ~/workspace/bin/gffcompare-0.9.8.Linux_x86_64/gffcompare ~/workspace/bin/gffcompare

# check installation
~/workspace/bin/gffcompare
```
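As a preview of how gffcompare is typically used, the sketch below compares a hypothetical assembled GTF against a reference annotation; -r supplies the reference GTF and -o sets a prefix for the output files:

```bash
# compare a (hypothetical) assembled transcriptome to a reference annotation
~/workspace/bin/gffcompare -r ref_annotation.gtf -o gffcmp assembled_transcripts.gtf
```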
# Install R

R is a feature rich interpretive programming language originally released in 1995. It is heavily used in the bioinformatics community, largely due to the numerous R libraries available on bioconductor. It takes several minutes to compile, so we'll use one which has already been set up. If we were to install R, we first would need to download and extract the source code. Next we'd configure the installation with --with-x=no, which tells R to install without X11, a windowing system for displays. We'd also specify --prefix, which is where the R framework will go; this includes the additional R libraries we'll download later. From there we'd run make and make install to build the software and copy the files to their proper location, and create symlinks for the executables. Finally we'd install the devtools and BiocManager packages from the command line to make installing additional packages easier. The code below is exactly what was run to set up the R we will be using, except for the installation location.

```bash
## download and extract
cd ~/workspace/bin
wget https://cran.r-project.org/src/base/R-3/R-3.5.1.tar.gz
tar -zxvf R-3.5.1.tar.gz

## configure the installation, build the code
cd R-3.5.1
./configure --prefix=/home/ubuntu/workspace/bin --with-x=no
make
make install

## make symlinks
ln -s ~/workspace/bin/R-3.5.1/bin/Rscript ~/workspace/bin/Rscript
ln -s ~/workspace/bin/lib64/R/bin/R ~/workspace/bin/R

## test installation
cd ~/workspace/bin
~/workspace/bin/Rscript --version

## install additional packages
~/workspace/bin/R --vanilla -e 'install.packages(c("devtools", "BiocManager", "dplyr", "tidyr", "ggplot2"), repos="http://cran.us.r-project.org")'
```

# Install copyCat

copyCat is an R library for detecting copy number aberrations in sequencing data. The library is only available on github, so we will have to use the BiocManager library to install a few of the underlying package dependencies. If installing a package from cran or bioconductor, these dependencies would be installed automatically. After these dependencies are installed, we can use the devtools package to install copyCat directly from its github repository.

```bash
# Install R library dependencies
~/workspace/bin/R --vanilla -e 'BiocManager::install(c("IRanges", "DNAcopy"))'

# install copyCat
~/workspace/bin/R --vanilla -e 'devtools::install_github("chrisamiller/copycat")'
```

# Install CNVnator

CNVnator is a depth based copy number caller. It is open source and available on github under a Creative Commons Public License (CCPL). To install, we first download and extract the source code. CNVnator relies on a specific version of samtools which is distributed with CNVnator, so our first step is to run make on that samtools. To finish the installation process we can then run make in CNVnator's main source directory.

```bash
# download and decompress
cd ~/workspace/bin

# download and install the dependency package "root" from CERN (https://root.cern/install/)
curl -OL https://root.cern/download/root_v6.20.08.Linux-ubuntu20-x86_64-gcc9.3.tar.gz
tar -xvzf root_v6.20.08.Linux-ubuntu20-x86_64-gcc9.3.tar.gz
source root/bin/thisroot.sh

wget https://github.com/abyzovlab/CNVnator/releases/download/v0.3.3/CNVnator_v0.3.3.zip
unzip CNVnator_v0.3.3.zip

# make the samtools dependency distributed with CNVnator
cd CNVnator_v0.3.3/src/samtools
make

# make CNVnator
cd ../
make

# make a symlink
ln -s ~/workspace/bin/CNVnator_v0.3.3/src/cnvnator ~/workspace/bin/cnvnator

# test installation
~/workspace/bin/cnvnator
```
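CNVnator is run as a series of steps against a ROOT file. A minimal sketch, assuming a hypothetical sample.bam, a directory chr_fastas/ containing one fasta per chromosome, and a 1000 bp bin size (flag usage per the CNVnator documentation; adjust for your data):

```bash
# extract read mappings into a ROOT file (sample.bam is hypothetical)
~/workspace/bin/cnvnator -root sample.root -tree sample.bam

# build a read-depth histogram (chr_fastas/ holds one fasta per chromosome)
~/workspace/bin/cnvnator -root sample.root -his 1000 -d chr_fastas/

# compute statistics, partition the signal, and call CNVs
~/workspace/bin/cnvnator -root sample.root -stat 1000
~/workspace/bin/cnvnator -root sample.root -partition 1000
~/workspace/bin/cnvnator -root sample.root -call 1000 > sample_cnv_calls.txt
```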
# Install CNVkit

CNVkit is a python based copy number caller designed for use with hybrid capture. To install, we can download and extract the package. We then must use conda to set up the environment to run cnvkit. This process, while straightforward, takes some time, so we will use the conda environment that has already been set up; the commands below are what was run to create it.

```bash
## download and unzip
cd ~/workspace/bin
wget https://github.com/etal/cnvkit/archive/refs/tags/v0.9.9.zip
unzip v0.9.9.zip

## add conda channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

## create conda environment
conda create -n cnvkit python=3
ln -s ~/workspace/bin/cnvkit-0.9.9/cnvkit.py ~/workspace/bin/cnvkit.py

# test installation
conda activate cnvkit
# install all dependencies
~/workspace/bin/cnvkit.py --help

# to exit the virtual environment
conda deactivate
```

# Install Kallisto

Kallisto is a kmer-based alignment algorithm used for quantifying transcripts in RNAseq data. Kallisto has a binary distribution available, so to use the program we only have to download and extract the software from github.

```bash
# download and extract
cd ~/workspace/bin
wget https://github.com/pachterlab/kallisto/releases/download/v0.46.2/kallisto_linux-v0.46.2.tar.gz
tar -zxvf kallisto_linux-v0.46.2.tar.gz
mv kallisto kallisto_linux-v0.46.2

# make symlink
ln -s ~/workspace/bin/kallisto_linux-v0.46.2/kallisto ~/workspace/bin/kallisto

# test installation
~/workspace/bin/kallisto
```

# Install Pizzly

Pizzly is a fusion detection algorithm which uses output from Kallisto. Pizzly has a binary distribution, so we can download and extract that from github to get started.

```bash
# download and extract
cd ~/workspace/bin
mkdir pizzly-v0.37.3
cd pizzly-v0.37.3
wget https://github.com/pmelsted/pizzly/releases/download/v0.37.3/pizzly_linux.tar.gz
tar -zxvf pizzly_linux.tar.gz

# make symlink
ln -s ~/workspace/bin/pizzly-v0.37.3/pizzly ~/workspace/bin/pizzly

# test executable
~/workspace/bin/pizzly --help
```

# Manta

Manta is a structural variant caller developed by Illumina and available on github under the GPLv3 license. It uses paired-end sequencing reads to build a breakend association graph to identify structural variants.

```bash
# download and extract
cd ~/workspace/bin
wget https://github.com/Illumina/manta/releases/download/v1.6.0/manta-1.6.0.centos6_x86_64.tar.bz2
tar --bzip2 -xvf manta-1.6.0.centos6_x86_64.tar.bz2

# we can use strelka-env for this also
conda activate strelka-env

# test installation
python2 ~/workspace/bin/manta-1.6.0.centos6_x86_64/bin/configManta.py --help
conda deactivate
```

# mosdepth

mosdepth is a program for determining depth in sequencing data. The easiest way to install mosdepth is through bioconda, a channel for the conda package manager. The AMI already has conda set up to install to /usr/local/bin/miniconda, so we've already installed mosdepth for you. However, below are the commands used during the installation.

```bash
# add the bioconda channel
conda config --add channels bioconda

# install mosdepth with the conda package manager
conda install mosdepth
```
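mosdepth takes an output prefix and a position-sorted, indexed bam as its two positional arguments. A minimal sketch with hypothetical file names:

```bash
# 4 threads, skip per-base output, summarize depth in 500 bp windows
mosdepth -t 4 -n --by 500 sample_prefix sample.bam
```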
# bam-readcount

bam-readcount is a program for determining read support for individual variants (SNVs and indels only). We are going to point this local install of bam-readcount to the samtools installation we completed above, since samtools is a dependency of bam-readcount. This tool uses CMake to create its makefile, so compiling from source has an extra step here. Instead of using an official release from github, we are cloning the latest code from the master branch. In general this practice should be avoided and you should use an official release instead.

```bash
# install bam-readcount
cd ~/workspace/bin
git clone https://github.com/genome/bam-readcount.git
mv bam-readcount bam-readcount-latest
cd bam-readcount-latest
export SAMTOOLS_ROOT=/home/ubuntu/workspace/bin/samtools-1.14
cmake -Wno-dev /home/ubuntu/workspace/bin/bam-readcount-latest
make

# create symlink
ln -s ~/workspace/bin/bam-readcount-latest/bin/bam-readcount ~/workspace/bin/bam-readcount

# test installation
~/workspace/bin/bam-readcount
```

# vt

vt is a variant tool set that discovers short variants from next generation sequencing data. We will use this for the purpose of splitting multi-allelic variants.

```bash
# install vt
cd ~/workspace/bin
conda install -c bioconda vt

# create symlink
ln -s /home/ubuntu/miniconda3/bin/vt ~/workspace/bin/vt

# test installation
~/workspace/bin/vt
```

# vcf-annotation-tools

VCF Annotation Tools is a python package that includes several tools to annotate VCF files with data from other tools. We will be using this for the purpose of adding bam readcounts to the vcf files.

```bash
# install vcf-annotation-tools
pip install vcf-annotation-tools

# test installation
vcf-readcount-annotator -h
```

# Install seqtk

Seqtk is a lightweight tool for processing FASTQ and FASTA files. We will use seqtk to subset RNA-seq fastq files to more quickly run the fusion alignment module.

```bash
# Download
cd ~/workspace/bin
git clone https://github.com/lh3/seqtk.git seqtk.v1

# Install
cd seqtk.v1
make  # (ignore warning message)
make install

# Check install
ln -s /usr/local/bin/seqtk ~/workspace/bin/seqtk
~/workspace/bin/seqtk
```
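For the subsetting use case described above, seqtk sample draws a random subset of reads. A minimal sketch with hypothetical paired fastq files; reusing the same seed (-s100) for both mates keeps the read pairs in sync:

```bash
# sample 100,000 reads from each mate file with the same random seed
~/workspace/bin/seqtk sample -s100 reads_R1.fastq.gz 100000 > reads_subset_R1.fastq
~/workspace/bin/seqtk sample -s100 reads_R2.fastq.gz 100000 > reads_subset_R2.fastq
```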
# Download reference files for GATK

```bash
cd ~/workspace/inputs/references/
mkdir -p gatk
cd gatk
conda install gsutil

# SNP calibration call sets - dbsnp, hapmap, omni, and 1000G
# Runtime: < 2min
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf .
# Runtime: ~ 2min
bgzip --threads 8 Homo_sapiens_assembly38.dbsnp138.vcf
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz .

# Indel calibration call sets - dbsnp, Mills
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz .
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz .

# Interval lists that can be used to parallelize certain GATK tasks
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/wgs_calling_regions.hg38.interval_list .
gsutil cp -r gs://genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/ .

# list the files we just downloaded
ls -lh
```

# Download coordinates describing the Exome Reagent used to generate our exome data (SeqCapEZ_Exome_v3.0)

```bash
# change directories
mkdir -p ~/workspace/inputs/references/exome
cd ~/workspace/inputs/references/exome

# download the files
wget -c https://sequencing.roche.com/content/dam/rochesequence/worldwide/shared-designs/SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip
unzip SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip

# remove the zip
rm -f SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip
```

# Convert SeqCapEZ_Exome_v3.0 to hg38 assembly coordinates using liftover

You might have noticed that these annotation files from Roche are all from the hg19 genome assembly. Obviously this presents a problem, as the analysis we're performing uses the newer hg38 assembly. This issue is actually not an uncommon situation, and a variety of tools exist that are designed to make converting from one assembly to another easier. In the next section we will be using UCSC liftOver to perform this task. To start, we first need to download a chain file specific to the assembly conversion we want to perform (in our case hg19 -> hg38). These files provide a mapping between the two assemblies and can be downloaded from the UCSC site. Once we have our chain file we can run liftOver, which takes the following positional arguments: the input bed file, the chain file, the output bed file, and a file to which unmapped regions are written.

```bash
# download
cd ~/workspace/bin
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
chmod +x liftOver

# change to the appropriate directory
cd ~/workspace/inputs/references/exome

# download the chain file
wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

# run liftover
liftOver SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_primary_targets.bed hg19ToHg38.over.chain.gz SeqCap_EZ_Exome_v3_hg38_primary_targets.bed unMapped.bed
liftOver SeqCapEZ_Exome_v3.0_Design_Annotation_files/SeqCap_EZ_Exome_v3_hg19_capture_targets.bed hg19ToHg38.over.chain.gz SeqCap_EZ_Exome_v3_hg38_capture_targets.bed unMapped1.bed

# create a version in standard bed format (chr, start, stop)
cut -f 1-3 SeqCap_EZ_Exome_v3_hg38_primary_targets.bed > SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed
cut -f 1-3 SeqCap_EZ_Exome_v3_hg38_capture_targets.bed > SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.bed

# take a quick look at the format of these files
head SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed
head SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.bed

# a very common question in exome (or any capture based approach) is: how big is my capture space?
# use bedtools to determine the size of the capture space represented by this bed file
# first sort the bed files and store the sorted versions
bedtools sort -i SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed > SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.bed
bedtools sort -i SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.bed > SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.bed

# now merge the bed files to collapse any overlapping regions so they are not double counted
bedtools merge -i SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.bed > SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed
bedtools merge -i SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.bed > SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed

# finally use a Perl one liner to determine the size of each non-overlapping region and determine the cumulative sum
perl -ne 'chomp; @l=split("\t",$_); $size += $l[2]-$l[1]; print "$size\n"' SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed
perl -ne 'chomp; @l=split("\t",$_); $size += $l[2]-$l[1]; print "$size\n"' SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed
# note that the size of the space targeted by the exome reagent is ~63 Mbp. Does that sound reasonable?

# now create a subset of these bed files
grep -w -P "^chr6|^chr17" SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed > exome_regions.bed
grep -w -P "^chr6|^chr17" SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed > probe_regions.bed

# clean up intermediate files
#rm -fr SeqCapEZ_Exome_v3.0_Design_Annotation_files/ SeqCap_EZ_Exome_v3_hg38_primary_targets.bed SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.bed SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.bed unMapped.bed
```
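liftOver writes regions it could not convert to the unMapped files supplied as the final argument; it is worth checking how much of the reagent failed to lift:

```bash
# count failed records (each spans two lines: a reason comment and the region itself)
wc -l unMapped.bed unMapped1.bed
```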
# Obtaining an interval list for the exome bed files

```bash
# first for the complete exome and probe bed files
cd ~/workspace/inputs/references/
mkdir temp
cd temp
wget http://genomedata.org/pmbio-workshop/references/genome/all/ref_genome.dict
cd ~/workspace/inputs/references/exome
java -jar $PICARD BedToIntervalList I=SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.bed O=SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.sort.merge.interval_list SD=~/workspace/inputs/references/temp/ref_genome.dict
java -jar $PICARD BedToIntervalList I=SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.bed O=SeqCap_EZ_Exome_v3_hg38_capture_targets.v2.sort.merge.interval_list SD=~/workspace/inputs/references/temp/ref_genome.dict
rm -fr ~/workspace/inputs/references/temp/

# next for our subset exome and probe regions files
cd ~/workspace/inputs/references/exome
java -jar /usr/local/bin/picard.jar BedToIntervalList I=exome_regions.bed O=exome_regions.bed.interval_list SD=~/workspace/inputs/references/genome/ref_genome.dict
java -jar /usr/local/bin/picard.jar BedToIntervalList I=probe_regions.bed O=probe_regions.bed.interval_list SD=~/workspace/inputs/references/genome/ref_genome.dict
```

# Obtaining transcriptome reference files

```bash
# make sure the CHRS environment variable is set. If this command doesn't give a value, please return to the Environment section of the course
echo $CHRS

# create a directory for transcriptome input files
mkdir -p ~/workspace/inputs/references/transcriptome
cd ~/workspace/inputs/references/transcriptome

# download the files
wget http://genomedata.org/pmbio-workshop/references/transcriptome/$CHRS/ref_transcriptome.gtf
wget http://genomedata.org/pmbio-workshop/references/transcriptome/$CHRS/ref_transcriptome.fa

# take a look at the contents of the gtf file. Press 'q' to exit the 'less' display.
less -p start_codon -S ref_transcriptome.gtf

# How many unique gene IDs are in the .gtf file?
# We can use a perl command-line command to find out:
perl -ne 'if ($_ =~ /(gene_id\s\"ENSG\w+\")/){print "$1\n"}' ref_transcriptome.gtf | sort | uniq | wc -l

# what are all the feature types listed in the third column of the GTF?
# how does the following command (3 commands piped together) answer that question?
cut -f 3 ref_transcriptome.gtf | sort | uniq -c
```
# Create reference index for genome

```bash
cd ~/workspace/inputs/references/genome
bwa index ref_genome.fa
```

# Create a reference index for transcriptome with HISAT for spliced RNA alignments to the genome

```bash
cd ~/workspace/inputs/references/transcriptome

# create a database of observed splice sites represented in our reference transcriptome GTF
~/workspace/bin/hisat2-2.1.0/hisat2_extract_splice_sites.py ref_transcriptome.gtf > splicesites.tsv
head splicesites.tsv

# create a database of exon regions in our reference transcriptome GTF
~/workspace/bin/hisat2-2.1.0/hisat2_extract_exons.py ref_transcriptome.gtf > exons.tsv
head exons.tsv

# build the reference genome index for HISAT and supply the exon and splice site information extracted in the previous steps
# specify to use 8 threads with the `-p 8` option
# run time for this index is ~5 minutes
~/workspace/bin/hisat2-2.1.0/hisat2-build -p 8 --ss splicesites.tsv --exon exons.tsv ~/workspace/inputs/references/genome/ref_genome.fa ref_genome
```

# Create a reference transcriptome index for use with Kallisto

```bash
cd ~/workspace/inputs/references/transcriptome
mkdir kallisto
cd kallisto

# tidy up the headers to just include the ensembl transcript ids
cat ../ref_transcriptome.fa | perl -ne 'if ($_ =~ /\d+\s+(ENST\d+)/){print ">$1\n"}else{print $_}' > ref_transcriptome_clean.fa

# run time for this index is ~30 seconds
kallisto index --index=ref_transcriptome_kallisto_index ref_transcriptome_clean.fa
```

# Index VCF annotation files for use with GATK

```bash
cd ~/workspace/inputs/references/gatk/

# SNP calibration call sets - dbsnp, hapmap, omni, and 1000G
# Runtime: ~ 4min
gatk --java-options '-Xmx12g' IndexFeatureFile -I ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.dbsnp138.vcf.gz
gatk --java-options '-Xmx12g' IndexFeatureFile -I ~/workspace/inputs/references/gatk/hapmap_3.3.hg38.vcf.gz
gatk --java-options '-Xmx12g' IndexFeatureFile -I ~/workspace/inputs/references/gatk/1000G_omni2.5.hg38.vcf.gz
# Runtime: ~ 3min
gatk --java-options '-Xmx12g' IndexFeatureFile -I ~/workspace/inputs/references/gatk/1000G_phase1.snps.high_confidence.hg38.vcf.gz

# Indel calibration call sets - dbsnp, Mills
gatk --java-options '-Xmx12g' IndexFeatureFile -I ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.known_indels.vcf.gz
gatk --java-options '-Xmx12g' IndexFeatureFile -I ~/workspace/inputs/references/gatk/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
```

# get input data

```bash
mkdir -p ~/workspace/inputs/data/fastq
cd ~/workspace/inputs/data/fastq
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/Exome_Norm.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/Exome_Tumor.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/RNAseq_Norm.tar
wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/RNAseq_Tumor.tar
#wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/WGS_Norm.tar
#wget -c http://genomedata.org/pmbio-workshop/fastqs/$CHRS/WGS_Tumor.tar
```
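The downloads above are tar archives, while the FastQC commands in the next step expect per-sample directories of fastq.gz files (e.g., Exome_Norm/Exome_Norm_R1.fastq.gz). The archives therefore need to be unpacked first; this sketch assumes each tar expands into a directory named for the sample:

```bash
# unpack each archive into its own sample directory
cd ~/workspace/inputs/data/fastq
tar -xvf Exome_Norm.tar
tar -xvf Exome_Tumor.tar
tar -xvf RNAseq_Norm.tar
tar -xvf RNAseq_Tumor.tar
```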
# run fastqc

```bash
cd ~/workspace/inputs/data/fastq
fastqc Exome_Norm/Exome_Norm*.fastq.gz
fastqc Exome_Tumor/Exome_Tumor*.fastq.gz
tree
#fastqc WGS_Norm/WGS_Norm*.fastq.gz
#fastqc WGS_Tumor/WGS_Tumor*.fastq.gz
#tree
fastqc RNAseq_Norm/RNAseq_Norm*.fastq.gz
fastqc RNAseq_Tumor/RNAseq_Tumor*.fastq.gz
tree
```

# run multiqc

```bash
cd ~/workspace/inputs
mkdir qc
cd qc
multiqc ~/workspace/inputs/data/fastq/
tree
```

# run bwa

```bash
# Run bwa mem using the above information
mkdir -p ~/workspace/align
cd ~/workspace/align

# Runtime: ~ 4min
bwa mem -t 2 -Y -R "@RG\tID:Exome_Norm\tPL:ILLUMINA\tPU:C1TD1ACXX-CGATGT.7\tLB:exome_norm_lib1\tSM:HCC1395BL_DNA" -o ~/workspace/align/Exome_Norm.sam ~/workspace/inputs/references/genome/ref_genome.fa ~/workspace/inputs/data/fastq/Exome_Norm/Exome_Norm_R1.fastq.gz ~/workspace/inputs/data/fastq/Exome_Norm/Exome_Norm_R2.fastq.gz

# Runtime: ~ 4min
bwa mem -t 2 -Y -R "@RG\tID:Exome_Tumor\tPL:ILLUMINA\tPU:C1TD1ACXX-ATCACG.7\tLB:exome_tumor_lib1\tSM:HCC1395_DNA" -o ~/workspace/align/Exome_Tumor.sam ~/workspace/inputs/references/genome/ref_genome.fa ~/workspace/inputs/data/fastq/Exome_Tumor/Exome_Tumor_R1.fastq.gz ~/workspace/inputs/data/fastq/Exome_Tumor/Exome_Tumor_R2.fastq.gz
```

# Convert sam to bam format

```bash
cd ~/workspace/align

# Runtime: ~30s
samtools view -@ 2 -h -b -o Exome_Norm.bam Exome_Norm.sam

# Runtime: ~45s
samtools view -@ 2 -h -b -o Exome_Tumor.bam Exome_Tumor.sam
```

# Query name sort bam files

```bash
cd ~/workspace/align

# Runtime: ~ 4min
java -Xmx12g -jar $PICARD SortSam I=Exome_Norm.bam O=Exome_Norm_namesorted.bam SO=queryname
java -Xmx12g -jar $PICARD SortSam I=Exome_Tumor.bam O=Exome_Tumor_namesorted.bam SO=queryname
```

### Mark duplicates

Use picard MarkDuplicates to mark duplicate reads. These are typically defined as read pairs with identical start and stop alignment positions. Note, MarkDuplicates works on read name sorted alignments.

```bash
cd ~/workspace/align
java -Xmx12g -jar $PICARD MarkDuplicates -I Exome_Norm_namesorted.bam -O Exome_Norm_namesorted_mrkdup.bam -ASSUME_SORT_ORDER queryname -METRICS_FILE Exome_Norm_mrkdup_metrics.txt -QUIET true -COMPRESSION_LEVEL 0 -VALIDATION_STRINGENCY LENIENT
java -Xmx12g -jar $PICARD MarkDuplicates -I Exome_Tumor_namesorted.bam -O Exome_Tumor_namesorted_mrkdup.bam -ASSUME_SORT_ORDER queryname -METRICS_FILE Exome_Tumor_mrkdup_metrics.txt -QUIET true -COMPRESSION_LEVEL 0 -VALIDATION_STRINGENCY LENIENT
```

### Position sort bam file

For indexing and subsequent steps a position-sorted bam is required. Therefore, we will sort the bam files by coordinate.

```bash
cd ~/workspace/align
java -Xmx12g -jar $PICARD SortSam -I Exome_Norm_namesorted_mrkdup.bam -O Exome_Norm_sorted_mrkdup.bam -SO coordinate
java -Xmx12g -jar $PICARD SortSam -I Exome_Tumor_namesorted_mrkdup.bam -O Exome_Tumor_sorted_mrkdup.bam -SO coordinate
```

### Create bam index for use with GATK, IGV, etc

In order to efficiently load and search a bam file, downstream applications typically require an index.

```bash
cd ~/workspace/align
java -Xmx12g -jar $PICARD BuildBamIndex -I Exome_Norm_sorted_mrkdup.bam
java -Xmx12g -jar $PICARD BuildBamIndex -I Exome_Tumor_sorted_mrkdup.bam
```
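To confirm that the position sort succeeded, check the SO field of the @HD line in the bam header:

```bash
# the @HD line should report SO:coordinate after position sorting
samtools view -H Exome_Norm_sorted_mrkdup.bam | grep "^@HD"
```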
### Perform Indel Realignment

If desired, add this step. See docs [here](https://software.broadinstitute.org/gatk/documentation/article?id=7156). But note that, as announced in the GATK v3.6 highlights, variant calling workflows that use HaplotypeCaller or MuTect2 now omit indel realignment. HaplotypeCaller includes a local read assembly that mostly deprecates/replaces the need for a separate indel realignment step. See the following [blog](https://software.broadinstitute.org/gatk/blog?id=7847) for a detailed discussion of this issue.

See [here](https://drive.google.com/drive/folders/1U6Zm_tYn_3yeEgrD1bdxye4SXf5OseIt) for the latest versions of all GATK tutorials.

### Perform Base Quality Score Recalibration

A common question about this GATK step: why do the BQSR commands below limit the modeling step to chr1-22? This is where the majority of known variants are located, and the autosomes are expected to have more even coverage than the sex chromosomes. However, once the model is built, we apply it to all bases on all contigs.

#### Calculate BQSR Table

First, calculate the BQSR table. NOTE: For exome data, we should modify the below to use the `--intervals` (`-L`) option (see the sketch after the Apply BQSR step below). "This excludes off-target sequences and sequences that may be poorly mapped, which have a higher error rate. Including them could lead to a skewed model and bad recalibration." https://software.broadinstitute.org/gatk/documentation/article.php?id=4133

```bash
cd ~/workspace/align
gatk --java-options '-Xmx12g' BaseRecalibrator -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Norm_sorted_mrkdup.bam -O ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.table --known-sites ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.dbsnp138.vcf.gz --known-sites ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.known_indels.vcf.gz --known-sites ~/workspace/inputs/references/gatk/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --preserve-qscores-less-than 6 --disable-bam-index-caching $GATK_REGIONS
gatk --java-options '-Xmx12g' BaseRecalibrator -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Tumor_sorted_mrkdup.bam -O ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.table --known-sites ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.dbsnp138.vcf.gz --known-sites ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.known_indels.vcf.gz --known-sites ~/workspace/inputs/references/gatk/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --preserve-qscores-less-than 6 --disable-bam-index-caching $GATK_REGIONS
```

#### Apply BQSR

Now, apply the BQSR table to the bam files.

```bash
cd ~/workspace/align
gatk --java-options '-Xmx12g' ApplyBQSR -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Norm_sorted_mrkdup.bam -O ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam --bqsr-recal-file ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.table --preserve-qscores-less-than 6 --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30
gatk --java-options '-Xmx12g' ApplyBQSR -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Tumor_sorted_mrkdup.bam -O ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam --bqsr-recal-file ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.table --preserve-qscores-less-than 6 --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30
```
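As noted in the Calculate BQSR Table step above, exome data would ideally restrict the modeling step to on-target regions with `-L`. A minimal sketch of what that looks like, reusing the exome interval list created earlier and shown with a single known-sites resource for brevity (this is not the exact command used in this course):

```bash
# restrict the recalibration model to on-target regions with -L
gatk --java-options '-Xmx12g' BaseRecalibrator -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Norm_sorted_mrkdup.bam -O ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.table --known-sites ~/workspace/inputs/references/gatk/Homo_sapiens_assembly38.dbsnp138.vcf.gz -L ~/workspace/inputs/references/exome/exome_regions.bed.interval_list
```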
Create an index for these new bams.

```bash
cd ~/workspace/align
java -Xmx12g -jar $PICARD BuildBamIndex I=Exome_Norm_sorted_mrkdup_bqsr.bam
java -Xmx12g -jar $PICARD BuildBamIndex I=Exome_Tumor_sorted_mrkdup_bqsr.bam
```

### Clean up un-needed sam/bam files

Keep the final sorted, duplicate-marked, bqsr bam/bai/table files and the mrkdup.txt files. Delete everything else.

```bash
cd ~/workspace/align
mkdir final
mv *_sorted_mrkdup_bqsr.* final/
mv *.txt final/
rm *.sam
rm *.bam
rm *.bai
mv final/* .
rmdir final/
```

Calculate QC metrics for the bam files using samtools and picard. See docs here: https://github.com/genome/cancer-genomics-workflow/wiki/Alignment

### Run Samtools flagstat

```bash
cd ~/workspace/align/
samtools flagstat ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam > ~/workspace/align/Exome_Norm_flagstat.txt
samtools flagstat ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam > ~/workspace/align/Exome_Tumor_flagstat.txt

# Runtime: < 2min
#samtools flagstat ~/workspace/align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam > ~/workspace/align/WGS_Norm_merged_flagstat.txt
#samtools flagstat ~/workspace/align/WGS_Tumor_merged_sorted_mrkdup_bqsr.bam > ~/workspace/align/WGS_Tumor_merged_flagstat.txt
```
### Run various picard CollectMetrics tools

```bash
cd ~/workspace/align/

# Picard CollectInsertSizeMetrics
java -Xmx12g -jar $PICARD CollectInsertSizeMetrics -I ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam -O ~/workspace/align/Exome_Norm_insert_size_metrics.txt -H ~/workspace/align/Exome_Norm_insert_size_metrics.pdf
java -Xmx12g -jar $PICARD CollectInsertSizeMetrics -I ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam -O ~/workspace/align/Exome_Tumor_insert_size_metrics.txt -H ~/workspace/align/Exome_Tumor_insert_size_metrics.pdf
#java -Xmx12g -jar $PICARD CollectInsertSizeMetrics -I ~/workspace/align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Norm_merged_insert_size_metrics.txt -H ~/workspace/align/WGS_Norm_merged_insert_size_metrics.pdf
#java -Xmx12g -jar $PICARD CollectInsertSizeMetrics -I ~/workspace/align/WGS_Tumor_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Tumor_merged_insert_size_metrics.txt -H ~/workspace/align/WGS_Tumor_merged_insert_size_metrics.pdf

# Picard CollectAlignmentSummaryMetrics
java -Xmx12g -jar $PICARD CollectAlignmentSummaryMetrics -I ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam -O ~/workspace/align/Exome_Norm_alignment_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa
java -Xmx12g -jar $PICARD CollectAlignmentSummaryMetrics -I ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam -O ~/workspace/align/Exome_Tumor_alignment_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa
#java -Xmx12g -jar $PICARD CollectAlignmentSummaryMetrics -I ~/workspace/align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Norm_merged_alignment_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa
#java -Xmx12g -jar $PICARD CollectAlignmentSummaryMetrics -I ~/workspace/align/WGS_Tumor_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Tumor_merged_alignment_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa

# Picard CollectHsMetrics
# Exome only
java -Xmx12g -jar $PICARD CollectHsMetrics -I ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam -O ~/workspace/align/Exome_Norm_hs_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa -BI ~/workspace/inputs/references/exome/probe_regions.bed.interval_list -TI ~/workspace/inputs/references/exome/exome_regions.bed.interval_list
java -Xmx12g -jar $PICARD CollectHsMetrics -I ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam -O ~/workspace/align/Exome_Tumor_hs_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa -BI ~/workspace/inputs/references/exome/probe_regions.bed.interval_list -TI ~/workspace/inputs/references/exome/exome_regions.bed.interval_list

# Picard CollectGcBiasMetrics
# WGS only
#java -Xmx12g -jar $PICARD CollectGcBiasMetrics -I ~/workspace/align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Norm_merged_gc_bias_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa -CHART ~/workspace/align/WGS_Norm_merged_gc_bias_metrics.pdf -S ~/workspace/align/WGS_Norm_merged_gc_bias_summary.txt
#java -Xmx12g -jar $PICARD CollectGcBiasMetrics -I ~/workspace/align/WGS_Tumor_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Tumor_merged_gc_bias_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa -CHART ~/workspace/align/WGS_Tumor_merged_gc_bias_metrics.pdf -S ~/workspace/align/WGS_Tumor_merged_gc_bias_summary.txt

# Picard CollectWgsMetrics
# First we need to create the autosomal chromosome interval list
#egrep 'chr[0-9]{1,2}\s' ~/workspace/inputs/references/genome/ref_genome.fa.fai | awk '{print $1"\t1\t"$2"\t+\t"$1}' | cat ~/workspace/inputs/references/genome/ref_genome.dict - > ~/workspace/inputs/references/genome/ref_genome_autosomal.interval_list
#java -Xmx12g -jar $PICARD CollectWgsMetrics -I ~/workspace/align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Norm_merged_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa -INTERVALS ~/workspace/inputs/references/genome/ref_genome_autosomal.interval_list
#java -Xmx12g -jar $PICARD CollectWgsMetrics -I ~/workspace/align/WGS_Tumor_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/align/WGS_Tumor_merged_metrics.txt -R ~/workspace/inputs/references/genome/ref_genome.fa -INTERVALS ~/workspace/inputs/references/genome/ref_genome_autosomal.interval_list
```

### Run FastQC

```bash
cd ~/workspace/align
fastqc -t 8 Exome_Norm_sorted_mrkdup_bqsr.bam
fastqc -t 8 Exome_Tumor_sorted_mrkdup_bqsr.bam
tree
#fastqc -t 8 WGS_Norm_merged_sorted_mrkdup_bqsr.bam
#fastqc -t 8 WGS_Tumor_merged_sorted_mrkdup_bqsr.bam
#tree
#fastqc RNAseq_Norm
#fastqc RNAseq_Tumor
#tree
```

### Run MultiQC to produce a final report

```bash
cd ~/workspace/align
mkdir post_align_qc
cd post_align_qc
multiqc ~/workspace/align/
tree
```

# ##############################
# END OF PREPROCESSING
# ##############################
### Module objectives

- Perform single-sample germline variant calling with GATK HaplotypeCaller on WGS and exome data
- Perform single-sample germline variant calling with GATK GVCF workflow on WGS and exome data
- Perform single-sample germline variant calling with GATK GVCF workflow on additional exomes from 1000 Genomes Project
- Perform joint genotype calling on exome data, including additional exomes from 1000 Genomes Project

In this module we will use the GATK HaplotypeCaller to call variants from our aligned bams. Since we are only interested in germline variants in this module, we will only call variants in the normal samples (i.e., WGS_Norm and Exome_Norm). The following example commands were inspired by an excellent [GATK tutorial](https://gatkforums.broadinstitute.org/gatk/discussion/7869/howto-discover-variants-with-gatk-a-gatk-workshop-tutorial), provided by the Broad Institute. Note: We are using [GATK4 v4.0.10.0](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.10.0/) for this tutorial.

### Run GATK HaplotypeCaller

First, we will run [GATK HaplotypeCaller](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.10.0/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php) to call germline SNPs and indels. Whenever HaplotypeCaller finds signs of variation it performs a local de novo re-assembly of reads. This improves the accuracy of variant calling, especially in challenging regions, and represents a substantial improvement over the previous GATK UnifiedGenotyper caller. We will also include an option to generate bam output from HaplotypeCaller so that local reassembly/realignment around called variants can be visualized.

GATK HaplotypeCaller is run with the following options:

* --java-options '-Xmx12g' tells GATK to use 12GB of memory
* HaplotypeCaller specifies the GATK command to run
* -R specifies the path to the reference genome
* -I specifies the path to the input bam file for which to call variants
* -O specifies the path to the output vcf file to write variants to
* --bam-output specifies the path to an optional bam file which will store local realignments around variants that HaplotypeCaller used to make calls
* $GATK_REGIONS is an environment variable that we defined [earlier](/module-01-setup/0001/05/01/Environment_Setup/) to limit calling to specific regions (e.g., just chr6 and chr17)

```bash
# Make sure that $GATK_REGIONS is set correctly
echo $GATK_REGIONS

# Create working dir for germline results if not already created
mkdir -p ~/workspace/germline
cd ~/workspace/germline

# Call variants for exome data
gatk --java-options '-Xmx12g' HaplotypeCaller -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam -O ~/workspace/germline/Exome_Norm_HC_calls.vcf --bam-output ~/workspace/germline/Exome_Norm_HC_out.bam $GATK_REGIONS

# Call variants for WGS data
#gatk --java-options '-Xmx12g' HaplotypeCaller -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/WGS_Norm_merged_sorted_mrkdup_bqsr.bam -O ~/workspace/germline/WGS_Norm_HC_calls.vcf --bam-output ~/workspace/germline/WGS_Norm_HC_out.bam $GATK_REGIONS
```
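As a quick sanity check, you can count the variant records (non-header lines) in the raw callset:

```bash
# count non-header records in the raw callset
grep -v "^#" ~/workspace/germline/Exome_Norm_HC_calls.vcf | wc -l
```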
### Module objectives

- Perform GATK hard-filtering of germline SNVs and indels
- Perform GATK VQSR-filtering of germline SNVs and indels
- Perform VEP annotation of filtered variants

In this module we will learn about variant filtering and annotation. The raw output of GATK HaplotypeCaller will include many variants with varying degrees of quality. For various reasons we might wish to further filter these to a higher confidence set of variants. The recommended approach is to use GATK VQSR (see below for more details). This requires a large number of variants. Sufficient numbers may be available for even a single sample of whole genome sequencing data. However, for targeted sequencing (e.g., exome data) it is recommended to include a larger number (i.e., at least 30-50) of preferably platform-matched (i.e., similar sequencing strategy) samples with variant calls. For our purposes, we will first demonstrate a less optimal hard-filtering strategy. Then we will demonstrate VQSR filtering. It is strongly recommended to read the following documentation from GATK:

- [How to apply hard filters](https://software.broadinstitute.org/gatk/documentation/article?id=2806)
- [How to run VQSR](https://software.broadinstitute.org/gatk/documentation/article?id=2805)

### Perform hard-filtering on Exome data

#### Extract SNPs and Indels from the variant call set

First, we will separate out the SNPs and Indels from the VCF into new separate VCFs. Note that the variant type (SNP, INDEL, MIXED, etc.) is not stored explicitly in the vcf but instead inferred from the genotypes. We will use a versatile GATK tool called `SelectVariants`. This command can be used for all kinds of simple filtering or subsetting purposes. We will run it twice to select by variant type, once for SNPs and then again for Indels, to produce two new VCFs.

GATK SelectVariants is run with the following options:

* --java-options '-Xmx12g' tells GATK to use 12GB of memory
* SelectVariants specifies the GATK command to run
* -R specifies the path to the reference genome
* -V specifies the path to the input vcf file to be filtered
* -select-type [SNP, INDEL, MIXED, etc] specifies which type of variant to limit to
* -O specifies the path to the output vcf file to be produced

```bash
cd ~/workspace/germline/
gatk --java-options '-Xmx12g' SelectVariants -R ~/workspace/inputs/references/genome/ref_genome.fa -V ~/workspace/germline/Exome_Norm_HC_calls.vcf -select-type SNP -O ~/workspace/germline/Exome_Norm_HC_calls.snps.vcf
gatk --java-options '-Xmx12g' SelectVariants -R ~/workspace/inputs/references/genome/ref_genome.fa -V ~/workspace/germline/Exome_Norm_HC_calls.vcf -select-type INDEL -O ~/workspace/germline/Exome_Norm_HC_calls.indels.vcf
```

#### Apply basic filters to the SNP and Indel call sets

Next, we will perform so-called hard-filtering by applying a number of *hard* (somewhat arbitrary) cutoffs. For example, we might require each variant to have a minimum QualByDepth (QD) of 2.0. The QD value is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples. With such a filter, any variant with a QD value less than 2.0 would be marked as filtered in the FILTER field with a filter name of our choosing (e.g., QD_lt_2). Multiple filters can be combined arbitrarily. Each can be given its own name so that you can later determine which one or more filters a variant fails. Visit the [GATK documentation on hard-filtering](https://software.broadinstitute.org/gatk/documentation/article?id=2806) to learn more about the following hard filtering options. Notice that different filters and cutoffs are recommended for SNVs and Indels. This is why we first split them into separate files.

```bash
cd ~/workspace/germline/
gatk --java-options '-Xmx12g' VariantFiltration -R ~/workspace/inputs/references/genome/ref_genome.fa -V ~/workspace/germline/Exome_Norm_HC_calls.snps.vcf --filter-expression "QD < 2.0" --filter-name "QD_lt_2" --filter-expression "FS > 60.0" --filter-name "FS_gt_60" --filter-expression "MQ < 40.0" --filter-name "MQ_lt_40" --filter-expression "MQRankSum < -12.5" --filter-name "MQRS_lt_n12.5" --filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS_lt_n8" --filter-expression "SOR > 3.0" --filter-name "SOR_gt_3" -O ~/workspace/germline/Exome_Norm_HC_calls.snps.filtered.vcf
gatk --java-options '-Xmx12g' VariantFiltration -R ~/workspace/inputs/references/genome/ref_genome.fa -V ~/workspace/germline/Exome_Norm_HC_calls.indels.vcf --filter-expression "QD < 2.0" --filter-name "QD_lt_2" --filter-expression "FS > 200.0" --filter-name "FS_gt_200" --filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS_lt_n20" --filter-expression "SOR > 10.0" --filter-name "SOR_gt_10" -O ~/workspace/germline/Exome_Norm_HC_calls.indels.filtered.vcf
```
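Before looking at individual records, one way to see how many variants passed or were flagged by each filter is to tally the FILTER column (column 7 of a VCF):

```bash
# count records per FILTER value (PASS vs. individual filter names)
grep -v "^#" ~/workspace/germline/Exome_Norm_HC_calls.snps.filtered.vcf | cut -f 7 | sort | uniq -c
```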
Let's take a look at a few of the variants in one of these filtered files. Remember, we haven't actually removed any variants. We have just marked them as PASS or *filter-name*. Use `grep -v` and `head` to skip past all the VCF header lines and view the first few records. How many variants passed or failed our filters?

```bash
grep -v "##" Exome_Norm_HC_calls.snps.filtered.vcf | head -10
```

#### Merge filtered SNP and INDEL vcfs back together

It is convenient to have all our variants in a single result file. Therefore, we will merge them back together.

GATK MergeVcfs is run with the following options:

* --java-options '-Xmx12g' tells GATK to use 12GB of memory
* MergeVcfs specifies the GATK command to run
* -I specifies the path to each of the vcf files to be merged
* -O specifies the path to the output vcf file to be produced

```bash
gatk --java-options '-Xmx12g' MergeVcfs -I ~/workspace/germline/Exome_Norm_HC_calls.snps.filtered.vcf -I ~/workspace/germline/Exome_Norm_HC_calls.indels.filtered.vcf -O ~/workspace/germline/Exome_Norm_HC_calls.filtered.vcf
```

#### Extract PASS variants only

It would also be convenient to have a vcf with only passing variants. For this, we can go back to the `GATK SelectVariants` tool. This will be run much as above, except with the `--exclude-filtered` option instead of `-select-type`.

GATK SelectVariants is run with the following options:

* --java-options '-Xmx12g' tells GATK to use 12GB of memory
* SelectVariants specifies the GATK command to run
* -R specifies the path to the reference genome
* -V specifies the path to the input vcf file to be filtered
* --exclude-filtered specifies to remove all variants except those marked as PASS
* -O specifies the path to the output vcf file to be produced

```bash
gatk --java-options '-Xmx12g' SelectVariants -R ~/workspace/inputs/references/genome/ref_genome.fa -V ~/workspace/germline/Exome_Norm_HC_calls.filtered.vcf -O ~/workspace/germline/Exome_Norm_HC_calls.filtered.PASS.vcf --exclude-filtered
```

#### Running VARSCAN

__________________________

The first variant caller that we will use here is [VARSCAN](http://varscan.sourceforge.net/). VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data. It employs a robust heuristic/statistic approach to call variants that meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance.

### Exome data commands:

```bash
mkdir -p ~/workspace/somatic/varscan
cd ~/workspace/somatic/varscan

# Runtime: ~5min
java -Xmx12g -jar ~/workspace/bin/VarScan.v2.4.2.jar somatic <(samtools mpileup -l ~/workspace/inputs/references/exome/exome_regions.bed --no-BAQ -f ~/workspace/inputs/references/genome/ref_genome.fa ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam) ~/workspace/somatic/varscan/exome --mpileup 1 --output-vcf

java -Xmx12g -jar /usr/local/bin/VarScan.v2.4.2.jar processSomatic exome.snp.vcf exome.snp
java -Xmx12g -jar /usr/local/bin/VarScan.v2.4.2.jar processSomatic exome.indel.vcf exome.indel

find ~/workspace/somatic/varscan -name '*.vcf' -exec bgzip -f {} \;
find ~/workspace/somatic/varscan -name '*.vcf.gz' -exec tabix -f {} \;

gatk VariantFiltration -R ~/workspace/inputs/references/genome/ref_genome.fa -V exome.snp.Somatic.vcf.gz --mask exome.snp.Somatic.hc.vcf.gz --mask-name "processSomatic" --filter-not-in-mask -O exome.snp.Somatic.hc.filter.vcf.gz
gatk VariantFiltration -R ~/workspace/inputs/references/genome/ref_genome.fa -V exome.indel.Somatic.vcf.gz --mask exome.indel.Somatic.hc.vcf.gz --mask-name "processSomatic" --filter-not-in-mask -O exome.indel.Somatic.hc.filter.vcf.gz

bcftools concat -a -o exome.vcf.gz -O z exome.snp.Somatic.hc.filter.vcf.gz exome.indel.Somatic.hc.filter.vcf.gz
tabix -f ~/workspace/somatic/varscan/exome.vcf.gz
```
#### **Running STRELKA**

__________________________

The second variant caller that we will use is [STRELKA](https://github.com/Illumina/strelka/blob/master/docs/userGuide/README.md). Strelka calls germline and somatic small variants from mapped sequencing reads and is optimized for rapid clinical analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. Both the germline and somatic callers include a final empirical variant rescoring step that uses a random forest model to capture numerous features indicative of call reliability which may not be represented in the core variant calling probability model.

### Exome data commands:

```bash
mkdir -p ~/workspace/somatic/strelka/exome

# run from the home directory so the relative paths below resolve correctly
cd ~

conda activate strelka-env

~/workspace/bin/strelka-2.9.10.centos6_x86_64/bin/configureStrelkaSomaticWorkflow.py --normalBam=workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam --tumorBam=workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam --referenceFasta=workspace/inputs/references/genome/ref_genome.fa --exome --runDir=workspace/somatic/strelka/exome

# Set -j according to the number of cpus available or how many you would like to allocate to this job. In this case, eight were given.
# Runtime: ~3min
python2 ~/workspace/somatic/strelka/exome/runWorkflow.py -m local -j 8

conda deactivate

cd ~/workspace/somatic/strelka/exome/results/variants

# Strelka does not emit a GT FORMAT field; add a placeholder GT to each record so downstream merging tools are satisfied
zcat somatic.snvs.vcf.gz | awk '{if(/^##/) print; else if(/^#/) print "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n"$0; else print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\tGT:"$9"\t./.:"$10"\t./.:"$11;}' - > somatic.snvs.gt.vcf
zcat somatic.indels.vcf.gz | awk '{if(/^##/) print; else if(/^#/) print "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">\n"$0; else print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\tGT:"$9"\t./.:"$10"\t./.:"$11;}' - > somatic.indels.gt.vcf

find ~/workspace/somatic/strelka/exome/results/variants/ -name "*.vcf" -exec bgzip -f {} \;
find ~/workspace/somatic/strelka/exome/results/variants/ -name "*.vcf.gz" -exec tabix -f {} \;

bcftools concat -a -o exome.vcf.gz -O z somatic.snvs.gt.vcf.gz somatic.indels.gt.vcf.gz
tabix exome.vcf.gz
```

#### **Running MuTect2**

__________________________

The final variant caller that we will use is [MuTect2](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_cancer_m2_MuTect2.php). MuTect2 is a somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect (Cibulskis et al., 2013) with the assembly-based machinery of HaplotypeCaller.

### Exome data commands:

```bash
# Obtain the gnomAD germline resource from the GATK best-practices bucket
cd ~/workspace/inputs/references
gsutil cp gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz .
gsutil cp gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi .
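
# (optional sanity check, not part of the original workflow): confirm the downloaded
# germline resource is a readable VCF by printing its column header line and first records
zcat af-only-gnomad.hg38.vcf.gz | grep -v "^##" | head -3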
mkdir -p ~/workspace/somatic/mutect
cd ~/workspace/somatic/mutect

# Create a panel of normals (PON) from the normal sample
# Runtime: ~17min
gatk --java-options "-Xmx12G" Mutect2 -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam -tumor-sample HCC1395BL_DNA -O Exome_Norm_PON.vcf.gz

# Run Mutect2 using the latest version of GATK
# Runtime: ~20min
gatk --java-options "-Xmx12G" Mutect2 -R ~/workspace/inputs/references/genome/ref_genome.fa -I ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam -tumor HCC1395_DNA -I ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam -normal HCC1395BL_DNA --germline-resource ~/workspace/inputs/references/af-only-gnomad.hg38.vcf.gz --af-of-alleles-not-in-resource 0.00003125 --panel-of-normals ~/workspace/somatic/mutect/Exome_Norm_PON.vcf.gz -O ~/workspace/somatic/mutect/exome.vcf.gz -L chr6 -L chr17

# Running MuTect2 using GATK version 3.6 (legacy alternative, for reference only)
#java -Xmx12g -jar /usr/local/bin/GenomeAnalysisTK.jar -T MuTect2 --disable_auto_index_creation_and_locking_when_reading_rods -R ~/workspace/data/raw_data/references/ref_genome.fa -I:tumor ~/workspace/data/DNA_alignments/chr6+chr17/final/Exome_Tumor_sorted_mrkdup_bqsr.bam -I:normal ~/workspace/data/DNA_alignments/chr6+chr17/final/Exome_Norm_sorted_mrkdup_bqsr.bam --dbsnp ~/workspace/data/raw_data/references/Homo_sapiens_assembly38.dbsnp138.vcf.gz --cosmic ~/workspace/data/raw_data/references/Cosmic_v79.dictsorted.vcf.gz -o ~/workspace/data/results/somatic/mutect/exome.vcf.gz -L ~/workspace/data/results/inputs/SeqCap_EZ_Exome_v3_hg38_primary_targets.v2.interval_list

# use bcftools concat to remove any duplicate records from the Mutect2 output
echo ~/workspace/somatic/mutect/exome.vcf.gz > ~/workspace/somatic/mutect/exome_vcf.fof
bcftools concat --allow-overlaps --remove-duplicates --file-list ~/workspace/somatic/mutect/exome_vcf.fof --output-type z --output ~/workspace/somatic/mutect/mutect_exome.vcf.gz
mv mutect_exome.vcf.gz exome.vcf.gz
tabix ~/workspace/somatic/mutect/exome.vcf.gz
```

#### **Merge Variants**

__________________________

With outputs from all three algorithms, we can now merge the variants to generate a comprehensive list of detected variants:

```bash
# Unzip the vcf.gz files before combining variants
cd ~/workspace/somatic
gunzip ~/workspace/somatic/varscan/exome.vcf.gz
gunzip ~/workspace/somatic/strelka/exome/results/variants/exome.vcf.gz
gunzip ~/workspace/somatic/mutect/exome.vcf.gz

# Need to change the header sample names in the vcf file produced by Mutect2 in order to combine variants with those from the other algorithms
sed -i 's/HCC1395BL_DNA/NORMAL/' ~/workspace/somatic/mutect/exome.vcf
sed -i 's/HCC1395_DNA/TUMOR/' ~/workspace/somatic/mutect/exome.vcf

# Legacy GATK3 CombineVariants alternatives (UNIQUIFY and PRIORITIZE merge strategies), for reference only; note that the first uses paths from an older version of this course
#java -Xmx4g -jar /usr/local/bin/GenomeAnalysisTK.jar -T CombineVariants -R ~/workspace/data/raw_data/references/ref_genome.fa -genotypeMergeOptions UNIQUIFY --variant:varscan ~/workspace/data/results/somatic/varscan/exome.vcf --variant:strelka ~/workspace/data/results/somatic/strelka/exome/results/variants/exome.vcf --variant:mutect ~/workspace/data/results/somatic/mutect/new_gatk_files/exome.vcf -o ~/workspace/data/results/somatic/exome.unique.vcf.gz
#java -Xmx24g -jar ~/workspace/bin/GenomeAnalysisTK.jar -T CombineVariants -R ~/workspace/inputs/references/genome/ref_genome.fa -genotypeMergeOptions PRIORITIZE --rod_priority_list mutect,varscan,strelka --variant:varscan ~/workspace/somatic/varscan/exome.vcf --variant:strelka ~/workspace/somatic/strelka/exome/results/variants/exome.vcf --variant:mutect ~/workspace/somatic/mutect/exome.vcf -o ~/workspace/somatic/exome.merged.vcf.gz
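
# (optional, not part of the original workflow) count how many records each caller
# contributed before merging; grep -vc prints one count per input file
grep -vc "^#" ~/workspace/somatic/varscan/exome.vcf ~/workspace/somatic/strelka/exome/results/variants/exome.vcf ~/workspace/somatic/mutect/exome.vcf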
java -Xmx12g -jar ~/workspace/bin/picard.jar MergeVcfs -I ~/workspace/somatic/varscan/exome.vcf -I ~/workspace/somatic/strelka/exome/results/variants/exome.vcf -I ~/workspace/somatic/mutect/exome.vcf -O ~/workspace/somatic/exome.merged.vcf

bgzip -c ~/workspace/somatic/exome.merged.vcf > ~/workspace/somatic/exome.merged.vcf.gz
tabix -p vcf ~/workspace/somatic/exome.merged.vcf.gz
```

### **Left Align and Trim**

__________________________

For an explanation of left alignment and trimming, see: https://genome.sph.umich.edu/wiki/Variant_Normalization#Left_alignment

```bash
cd ~/workspace/somatic/
gatk --java-options "-Xmx12G" LeftAlignAndTrimVariants -V ~/workspace/somatic/exome.merged.vcf.gz -O exome.merged.leftalignandtrim.vcf -R ~/workspace/inputs/references/genome/ref_genome.fa
```

Note that when run on the chromosome 6 and 17 merged variants file, this step reported 0 variants left-aligned.

### **Splitting Multi-allelic Variants using vt**

`vt decompose` with the `-s` option splits multi-allelic records into separate bi-allelic records, adjusting the genotype fields accordingly:

```bash
cd ~/workspace/somatic/
vt decompose -s ~/workspace/somatic/exome.merged.leftalignandtrim.vcf -o ~/workspace/somatic/exome.merged.leftalignandtrim.decomposed.vcf
```

### Basic Filtering on Somatic Variants

First, let's do basic filtering for `PASS`-only variants on our merged and normalized vcf file:

```bash
cd ~/workspace/somatic
gatk --java-options "-Xmx12G" SelectVariants -R ~/workspace/inputs/references/genome/ref_genome.fa --exclude-filtered -V ~/workspace/somatic/exome.merged.leftalignandtrim.decomposed.vcf -O ~/workspace/somatic/exome.merged.norm.pass_only.vcf
```

#### VEP Annotation

[VEP](https://useast.ensembl.org/info/docs/tools/vep/index.html) stands for Variant Effect Predictor. We will use it to annotate our variants to determine the effect of the variants (e.g., SNPs, insertions, deletions, CNVs, or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions. (McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biology. 2016 Jun 6;17(1):122. doi:10.1186/s13059-016-0974-4)

```bash
cd ~/workspace/somatic

# Runtime: ~4min
vep -i ~/workspace/somatic/exome.merged.norm.pass_only.vcf --cache --dir ~/workspace/vep_cache/ --format vcf --vcf --plugin Downstream --plugin Wildtype --symbol --terms SO --flag_pick --transcript_version -o ~/workspace/somatic/exome.merged.norm.annotated.vcf

# Alternative invocation using the pre-built cache on this AMI, for reference only
#vep --cache --dir_cache /opt/vep_cache --dir_plugins /opt/vep_cache/Plugins --fasta /opt/vep_cache/homo_sapiens/93_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --fork 8 --assembly=GRCh38 --offline --vcf --plugin Downstream --everything --terms SO --pick --coding_only --transcript_version -i /workspace/germline/Exome_Norm_HC_calls.filtered.PASS.vcf -o /workspace/germline/Exome_Norm_HC_calls.filtered.PASS.vep.vcf --force_overwrite
```
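Before moving on, it is worth confirming that VEP actually annotated the output. A quick check, assuming the output path used above (VEP records the annotation fields it appends in a `CSQ` INFO header line):

```bash
# print the CSQ header line that describes the per-transcript annotation fields VEP added
grep "^##INFO=<ID=CSQ" ~/workspace/somatic/exome.merged.norm.annotated.vcf
```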
### Adding Bam-readcounts to VCF file

We have added a Python helper script that takes your VCF and DNA BAM files and generates two bam-readcount output files, one for SNVs and one for indels.

```bash
cd ~/workspace/bin

# First download the bam_readcount_helper script
wget https://raw.githubusercontent.com/griffithlab/pmbio.org/master/assets/scripts/bam_readcount_helper.py
chmod +x bam_readcount_helper.py

# make bam_readcount available on the system PATH (note: writing to /usr/local/bin requires the appropriate permissions)
ln -s ~/workspace/bin/bam_readcount /usr/local/bin/bam_readcount

cd ~/workspace/somatic
mkdir -p ~/workspace/somatic/bam_readcounts

# activate the bam-readcount conda environment
# conda activate bam-readcount

# Run bam-readcount on the annotated vcf file, once using the tumor bam and a second time using the normal bam
python -u ~/workspace/bin/bam_readcount_helper.py ~/workspace/somatic/exome.merged.norm.annotated.vcf 'TUMOR' ~/workspace/inputs/references/genome/ref_genome.fa ~/workspace/align/Exome_Tumor_sorted_mrkdup_bqsr.bam ~/workspace/somatic/bam_readcounts/
python -u ~/workspace/bin/bam_readcount_helper.py ~/workspace/somatic/exome.merged.norm.annotated.vcf 'NORMAL' ~/workspace/inputs/references/genome/ref_genome.fa ~/workspace/align/Exome_Norm_sorted_mrkdup_bqsr.bam ~/workspace/somatic/bam_readcounts/
```

Once we concatenate the SNV and indel readcount files, we can run the `vcf-readcount-annotator` tool to add these bam-readcounts to our vcf output.

```bash
cd ~/workspace/somatic/bam_readcounts/
cat TUMOR_bam_readcount_snv.tsv TUMOR_bam_readcount_indel.tsv > TUMOR_bam_readcount_combined.tsv
cat NORMAL_bam_readcount_snv.tsv NORMAL_bam_readcount_indel.tsv > NORMAL_bam_readcount_combined.tsv

mkdir -p ~/workspace/somatic/final/
cd ~/workspace/somatic/final/

# annotate with the tumor readcounts first, then the normal readcounts
vcf-readcount-annotator ~/workspace/somatic/exome.merged.norm.annotated.vcf ~/workspace/somatic/bam_readcounts/TUMOR_bam_readcount_combined.tsv DNA -s TUMOR -o ~/workspace/somatic/final/exome.tumordna_annotated.vcf.gz
vcf-readcount-annotator ~/workspace/somatic/final/exome.tumordna_annotated.vcf.gz ~/workspace/somatic/bam_readcounts/NORMAL_bam_readcount_combined.tsv DNA -s NORMAL -o ~/workspace/somatic/final/exome.annotated.vcf.gz

# Remove the intermediate file
rm exome.tumordna_annotated.vcf.gz

tabix -p vcf exome.annotated.vcf.gz
#conda deactivate
```

### Generating Table from VCF file

```bash
# Adjust the output fields accordingly
cd ~/workspace/somatic/final/
gunzip exome.annotated.vcf.gz

# Legacy GATK3 VariantsToTable command, for reference only
#java -Xmx4g -jar /usr/local/bin/GenomeAnalysisTK.jar -T VariantsToTable -R ~/workspace/inputs/references/genome/ref_genome.fa --variant ~/workspace/somatic/final/exome.annotated.vcf -F CHROM -F POS -F ID -F REF -F ALT -F set -F AC -F AF -F DP -F CSQ -o variants.tsv

gatk VariantsToTable -V ~/workspace/somatic/final/exome.annotated.vcf -F CHROM -F POS -F ID -F REF -F ALT -F set -F AC -F AF -GF GT -GF AD -O variants.tsv

# download a helper script to add selected VEP fields (here, Consequence and Gene) to the table
wget -c https://raw.githubusercontent.com/genome/docker-cle/master/add_annotations_to_table_helper.py

#conda activate bam-readcount
python -u add_annotations_to_table_helper.py variants.tsv exome.annotated.vcf Consequence,Gene ./
#conda deactivate
```
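If you want to confirm the table looks sensible, a quick peek at the first few rows is enough. A minimal check, assuming the paths used above (`variants.tsv` is the file produced by VariantsToTable):

```bash
# show the header row plus the first few variant rows of the table
head -5 ~/workspace/somatic/final/variants.tsv
```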