4. Setup Your Reference Genome using BWA and GATK

# 4. Setup Your Reference Genome using BWA and GATK ###### tags: `2. Main Steps` `tutorials` `reference genome` > Program required: BWA and GATK You will only need the BWA index for the alignment step (bwa mem), but the GATK index/dictionary/image will be important for the next step (MergeBamAlignment), so we have you set up both in this tutorial. ## Step 1: Get your reference genome from FlyBase onto the Cluster 1. Go to this link: [http://ftp.flybase.net/genomes/](http://ftp.flybase.net/genomes/). 2. Click on your species, then reference version number you want. 3. Go into "fasta" directory. 4. Copy the path to the file that you want to download by right clicking the link of your reference sequence (we used all-chromosome option) and choose Copy Link Address 5. Paste the address somewhere (any random word file, .txt, etc) and replace the prefix `http` with `ftp`. This will be the link that you use with the`wget` command **on the cluster.** ## Step 2: Setup softwares and directories on your cluster ### Load Softwares: Any software you are loading from DCC, you will need to load at the beginning of each session you're using it(i.e. if you load any module from DCC then log off of the cluster entirely, you'll have to load it again when you get back on.) Luckily, it's super easy to load modules from DCC :) **1. BWA**: All you need to do is type ```module avail``` and hit enter. This should give you a list of all modules available on the DCC and will tell you exactly which version of BWA to load. Then enter the following text into the terminal (once logged onto the cluster). You may need to change the specific version of BWA depending on what is offered at the time you're using this. ``` module load BWA/0.7.17 ``` **2. GATK**: Same thing here except for GATK `module load GATK/4.1.9.0` **NOTE** that you can also just put these module load commands into your shell script so that it automatically loads the programs you need before running. :) ## Step 3. Make an index for reference genome using BWA ### Notes on why we're using bwa index (rather than GATK) The GATK "Best Practices" recommend converting .fastq files to uBAM (unaligned BAM) before aligning, but this assumes you will be keeping ONLY the uBAMs as backup, and not the original FASTQs. But because we do not want to delete the original raw data, we are going to go bypass this step and align the .fastq files to the reference genome directly. The GATK 'BwaAndMarkDuplicatesPipelineSpark' does not work with .fastq input, so you will need to use bwa mem to align the .fastq with the reference genome. You will also need to create the reference genome indexes using BWA (NOT using GATK commands - they are not the same and will not work with gwa mem!). So that's a long story as to why you need this code. 1. Make a shell script that contains the following code and save it in your 'code' directory (or whatever you'd like) on the cluster. ``` #!/bin/bash #SBATCH -e slurm.err #SBATCH -p common #SBATCH --mem=1G #this script takes a fasta genome reference file and makes a copy of it in index form #Uses a tool from bwa called index bwa index ../ref/dmel-all-chromosome-r6.41.fasta dmel-all-chromosome-r6.41.fasta ``` 2. In this same directory, type: `sbatch name_of_your_file.sh` and hit enter. 3. You will see many files output in the same directory that contains your reference genome (they will all contain the prefix of the name of your fasta file and .fasta (e.g. the previx is dmel-all-chromosome-r6.41.fasta). The prefix is important later on in the bwa mem step! ## Step 4: Make an index for reference genome using GATK ### A: Construct Dictionary Run this code in `code` directory. ``` #!/bin/bash #SBATCH -e slurm-%j.err #SBATCH -o slurm-%j.out #SBATCH -p common #SBATCH --mem=1G module load GATK/4.1.9.0 #this script takes a fasta genome reference file and produces a dictionary of contig names #Uses a tool from Picard within GATK4 called CreateSequenceDictionary #-R=reference_file.fasta #-O=output_file_name.dict java -jar $GATK CreateSequenceDictionary -R ../ref/dmel-all-chromosome-r6.41.fasta -O ../ref/dmel-all-chromosome-r6.41.dict ``` **Don't panic** if you get an info error (see below) that failed to detect whether we are running on Google Compute Engine, according to this [link](https://github.com/broadinstitute/gatk/issues/6875). ``` Sep 23, 2021 2:45:39 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine INFO: Failed to detect whether we are running on Google Compute Engine. ``` ### B: Construct an index file ``` #!/bin/bash #SBATCH -e slurm-%j.err #SBATCH -o slurm-%j.out #SBATCH -p common #SBATCH --mem=1G module load GATK/4.1.9.0 module load samtools/1.9 #this script takes a fasta genome reference file and makes a copy of it in index form #Uses a tool from SAMtools called faidx samtools faidx ../ref/dmel-all-chromosome-r6.41.fasta -o ../ref/test.fai ``` ### C. Make an image ``` #!/bin/bash #SBATCH -e slurm-%j.err #SBATCH -o slurm-%j.out #SBATCH -p common #SBATCH --mem=1G module load GATK/4.1.9.0 #This script creates an index image file (?) of the reference sequence #It's required for any BWA tools within GATK, like BwaSpark #where -I-input file (reference), -O-output file (assumed same name, but with .img extension) java -jar $GATK BwaMemIndexImageCreator -I ../ref/dmel-all-chromosome-r6.41.fasta -O ../ref/dmel-all-chromosome-r6.41.fasta.img ``` Took somewhat longer to run (2-3 minutes(?))