This tutorial follows the 'Avian Influenza, Hybrid reference mapping' tutorial. For instructions on how to acquire and process the data, please refer to that tutorial.
We are going to work with:
We already did this! Yay.
We are going to assemble the reads using IRMA. Then, we are going to use BLAST to do the strain typing.
You can find and query all influenza viruses in GenBank, including the complete genome sequences from the NIAID Influenza Genome Sequencing Project here: https://www.ncbi.nlm.nih.gov/genomes/FLU
You can easily download the sequences here: https://ftp.ncbi.nih.gov/genomes/INFLUENZA/ You want to download the FASTA nucleotide file.
Read the README file to figure out which of the files on the website that is.
Download the compressed version. Do you remember what the gzip extension is?
Create the directory structure ~/workshop_data/references/ncbi.
Tip: remember the -p flag for the mkdir command? What does it do?
Move the downloaded file to the created directory. Tip: use mv.
Uncompress it using gunzip. Tip: Run the history command to see how you did it before.
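The steps above can be sketched as follows. This is only an illustration: it assumes the FASTA nucleotide file is called influenza.fna.gz (check the README to confirm), and it uses a temporary directory (WORKDIR, a hypothetical stand-in for your home directory) with a tiny placeholder file instead of the real download, so it is safe to run anywhere.

```shell
# WORKDIR is a stand-in for your home directory in this sketch.
WORKDIR="$(mktemp -d)"
REF_DIR="$WORKDIR/workshop_data/references/ncbi"

# Placeholder for the downloaded file (assumed name: influenza.fna.gz).
# In the real exercise you would download it from the NCBI FTP site instead.
printf '>demo_sequence\nACGT\n' | gzip > "$WORKDIR/influenza.fna.gz"

# -p creates any missing parent directories and does not fail if they already exist.
mkdir -p "$REF_DIR"

# Move the compressed file into the reference directory.
mv "$WORKDIR/influenza.fna.gz" "$REF_DIR/"

# gunzip replaces influenza.fna.gz with the uncompressed influenza.fna.
gunzip "$REF_DIR/influenza.fna.gz"

ls "$REF_DIR"
```

In the real exercise, replace "$WORKDIR" with ~ and skip the placeholder line.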
This will be the base for your reference database.
Now, we have to make it readable for BLAST:
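A minimal sketch of this step, assuming BLAST+ is installed and that the uncompressed file is called influenza.fna (the file name and the REF_FASTA variable are assumptions, not something this tutorial specifies). The makeblastdb tool from BLAST+ builds a nucleotide database next to the FASTA file; the guards only keep the sketch from failing when the tool or the file is missing.

```shell
# Assumed location of the uncompressed FASTA file from the previous step.
REF_FASTA="${REF_FASTA:-$HOME/workshop_data/references/ncbi/influenza.fna}"

if ! command -v makeblastdb >/dev/null 2>&1; then
    DB_STATUS="skipped: makeblastdb not installed"
elif [ ! -f "$REF_FASTA" ]; then
    DB_STATUS="skipped: $REF_FASTA not found"
# -in is the input FASTA, -dbtype nucl requests a nucleotide database.
elif makeblastdb -in "$REF_FASTA" -dbtype nucl; then
    DB_STATUS="built"
else
    DB_STATUS="failed"
fi
echo "$DB_STATUS"
```

When it succeeds, makeblastdb writes several index files alongside the FASTA file; the path of the FASTA file is then what you pass to BLAST as the database name.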
IRMA outputs a consensus FASTA file, with an assembly consensus sequence per segment. You can use blastn to find the closest genome in the NCBI database.
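A blastn search along these lines might look like the sketch below. The paths in QUERY, DB and OUT are hypothetical placeholders, and the choice of plain tabular output (-outfmt 6) with five targets per query is an assumption made for illustration; a downstream parsing script may expect a different set of output columns.

```shell
# Hypothetical paths; adjust them to your own IRMA output and database location.
QUERY="${QUERY:-$HOME/workshop_data/irma_output/consensus.fasta}"
DB="${DB:-$HOME/workshop_data/references/ncbi/influenza.fna}"
OUT="${OUT:-blast_results.tsv}"

if ! command -v blastn >/dev/null 2>&1; then
    BLAST_STATUS="skipped: blastn not installed"
elif [ ! -f "$QUERY" ]; then
    BLAST_STATUS="skipped: $QUERY not found"
# -outfmt 6 writes one tab-separated line per hit; -max_target_seqs limits hits per query.
elif blastn -query "$QUERY" -db "$DB" -outfmt 6 -max_target_seqs 5 -out "$OUT"; then
    BLAST_STATUS="done"
else
    BLAST_STATUS="failed"
fi
echo "$BLAST_STATUS"
```

With the default -outfmt 6 layout, the columns are query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, e-value and bit score.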
You can parse the results of the blast search to do the typing.
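As a rough illustration of what such parsing involves, the sketch below keeps the best hit (highest bit score) for each query sequence. It assumes the default 12-column tabular BLAST output and uses made-up toy rows (sample_HA, ref_A and so on are invented names), not real results.

```shell
TAB="$(printf '\t')"
BLAST_TSV="$(mktemp)"

# Toy rows in default -outfmt 6 layout; column 12 is the bit score.
printf 'sample_HA\tref_A\t98.5\t1700\t25\t0\t1\t1700\t1\t1700\t0.0\t3000\n' >  "$BLAST_TSV"
printf 'sample_HA\tref_B\t99.2\t1700\t13\t0\t1\t1700\t1\t1700\t0.0\t3100\n' >> "$BLAST_TSV"
printf 'sample_NA\tref_C\t97.0\t1400\t42\t0\t1\t1400\t1\t1400\t0.0\t2400\n' >> "$BLAST_TSV"

# Sort by query ID, then by bit score (descending), and keep the first row per query.
# Output columns: query ID, subject ID, percent identity.
BEST_HITS="$(sort -t "$TAB" -k1,1 -k12,12nr "$BLAST_TSV" \
    | awk -F '\t' -v OFS='\t' '!seen[$1]++ { print $1, $2, $3 }')"

printf '%s\n' "$BEST_HITS"
```

For typing, the best hits for the HA and NA segments are the most informative, since the subtype of the matched reference suggests the H and N type of the sample.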
Metadata from the NCBI Influenza DB is merged in with the BLAST results to improve subtyping results and provide context for the results obtained. You can download the metadata here: https://ftp.ncbi.nih.gov/genomes/INFLUENZA/genomeset.dat.gz
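The idea of merging can be sketched with a simple lookup by accession. Everything in this example is an assumption made for illustration: the toy metadata layout (accession, host, segment, subtype), the accessions themselves, and the premise that the BLAST subject ID is a bare accession. Check the README for the real column layout of genomeset.dat before adapting it.

```shell
META="$(mktemp)"
HITS="$(mktemp)"

# Toy metadata (hypothetical layout): accession, host, segment, subtype.
printf 'ACC001\tAvian\t4\tH5N1\n' >  "$META"
printf 'ACC002\tAvian\t6\tH5N1\n' >> "$META"

# Toy best hits: query ID, subject accession, percent identity.
printf 'sample_HA\tACC001\t99.2\n'  >  "$HITS"
printf 'sample_NA\tACC002\t97.0\n'  >> "$HITS"
printf 'sample_PB2\tACC999\t95.1\n' >> "$HITS"

# First file: remember the subtype for each accession.
# Second file: append the subtype of the matched accession, or "unknown" if absent.
MERGED="$(awk -F '\t' -v OFS='\t' '
    NR == FNR { subtype[$1] = $4; next }
    { print $1, $2, $3, ($2 in subtype ? subtype[$2] : "unknown") }
' "$META" "$HITS")"

printf '%s\n' "$MERGED"
```

A hit whose accession is missing from the metadata is kept and marked "unknown" rather than dropped, so that gaps in the metadata stay visible.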
You can find a script to do this here: https://github.com/peterk87/nf-flu/assets/CFIA-NCFAD/nf-flu/bin/parse_influenza_blast_results.py
You can run this script like this: