
This tutorial follows the 'Avian Influenza, Hybrid reference mapping' tutorial. For instructions on how to acquire and process the data, please refer to that tutorial.

Data

We are going to work with:

  • the paired-end sanger reads in ~/workshop_data/avian_influenza

Quality Control

We already did this! Yay.

Assembly

We are going to assemble the reads using IRMA. Then, we are going to use BLAST to do the strain typing.

Reference database

You can find and query all influenza viruses in GenBank, including the complete genome sequences from the NIAID Influenza Genome Sequencing Project here: https://www.ncbi.nlm.nih.gov/genomes/FLU

You can download the sequences from https://ftp.ncbi.nih.gov/genomes/INFLUENZA/ (you want the FASTA nucleotide file).

  • Read the README file to figure out which of the files on the website that is.

  • Download the compressed version. Do you remember what the gzip extension is?

  • Create the directory structure ~/workshop_data/references/ncbi. Tip: remember the -p flag for the mkdir command? What does it do?

  • Move the downloaded file to the created directory. Tip: use mv.

Uncompress it using gunzip. Tip: Run the history command to see how you did it before.

This will be the base for your reference database.
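The staging steps above can be sketched as one small shell function. This is only an illustration: `stage_reference` is a hypothetical helper name, and where your browser or download tool saved the archive is an assumption you will need to adjust.

```shell
# Sketch: stage a downloaded, gzip-compressed FASTA in a reference directory.
# stage_reference is a hypothetical helper, not part of BLAST or IRMA.
stage_reference() {
    archive="$1"   # path to the downloaded .gz file
    target="$2"    # directory that should hold the reference

    # -p creates any missing parent directories and does not
    # complain if the directory already exists.
    mkdir -p "$target"

    # Move the archive into place, then decompress it.
    # gunzip replaces file.gz with file.
    mv "$archive" "$target"/
    gunzip "$target/$(basename "$archive")"
}
```

For example, if the archive was saved as `influenza.fna.gz` in the current directory, `stage_reference influenza.fna.gz ~/workshop_data/references/ncbi` should leave `influenza.fna` in the target directory.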

Now, we have to make it readable for BLAST:

conda activate mapping_and_assembly
mamba install blast
cd ~/workshop_data/references/ncbi
ls
# This should show `influenza.fna`. Make sure it does, before moving on!
makeblastdb -in influenza.fna -dbtype nucl

mkdir blast_db
mv influenza.fna.* blast_db

Install IRMA

conda create -n irma irma
conda activate irma

Run the assembly and typing

cd ~/workshop_data/avian_influenza
IRMA FLU-avian reads_1.processed.fastq.gz reads_2.processed.fastq.gz reads_IRMA

IRMA outputs a consensus FASTA file, with an assembled consensus sequence per segment. You can use blastn to find the closest genome in the NCBI database.

blastn \
        -num_threads [N_CPUs] \
        -db ~/workshop_data/references/ncbi/blast_db/influenza.fna \
        -query [IRMA_consensus_output] \
        -out results.blastn.txt
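The `-query` argument takes a single FASTA file. If your IRMA output directory holds one `.fasta` file per segment at its top level (an assumption about the output layout, so check `reads_IRMA` first), a sketch like the following combines them into one query file. `collect_consensus` is a hypothetical helper name.

```shell
# Sketch: concatenate per-segment consensus FASTA files into one query file.
# Assumes the IRMA output directory contains *.fasta files at its top level.
collect_consensus() {
    irma_dir="$1"   # IRMA output directory, e.g. reads_IRMA
    out="$2"        # combined FASTA; keep it OUTSIDE irma_dir so a
                    # re-run does not pick up its own output

    cat "$irma_dir"/*.fasta > "$out" || return 1

    # Print the number of sequences written, as a quick sanity check.
    grep -c '^>' "$out"
}
```

Calling `collect_consensus reads_IRMA consensus_all.fasta` prints the number of sequences collected; `consensus_all.fasta` (a name chosen here for illustration) can then be passed to `-query`.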

You can parse the results of the blast search to do the typing.
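As a quick manual check, you can pull out the best hit per segment yourself. The sketch below assumes the search was run with `-outfmt 6` (tab-separated output), which the blastn command above does not set, so you would have to add that option. In that format column 1 is the query, column 2 the subject, column 3 the percent identity and column 12 the bit score. `best_hits` is a hypothetical helper name.

```shell
# Sketch: best hit per query from BLAST tabular output (-outfmt 6).
# Columns used: 1 = query ID, 2 = subject ID, 3 = % identity, 12 = bit score.
best_hits() {
    tab="$(printf '\t')"
    # Sort by query ID, then by bit score from high to low,
    # and keep only the first line seen for each query.
    sort -t "$tab" -k1,1 -k12,12nr "$1" |
        awk -F '\t' -v OFS='\t' '!seen[$1]++ { print $1, $2, $3, $12 }'
}
```

Running `best_hits results.blastn.tsv` (a file name chosen here for illustration) prints one line per consensus sequence: query, closest database sequence, percent identity and bit score.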

Metadata from the NCBI Influenza DB is merged with the BLAST results to improve the subtyping and to provide context for the results. You can download the metadata here: https://ftp.ncbi.nih.gov/genomes/INFLUENZA/genomeset.dat.gz
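To get a feel for what the metadata adds, you can look up a single BLAST hit by hand. The sketch below assumes the metadata file is tab-separated with the GenBank accession in the first column; check the README on the FTP site to confirm the layout. `lookup_accession` is a hypothetical helper name.

```shell
# Sketch: print the metadata row for one GenBank accession.
# Assumes a gzip-compressed, tab-separated file with the accession in column 1.
lookup_accession() {
    acc="$1"        # accession taken from a BLAST hit
    metadata="$2"   # path to genomeset.dat.gz

    # Decompress to stdout (the .gz file is left untouched)
    # and keep only the rows whose first column matches exactly.
    gzip -dc "$metadata" | awk -F '\t' -v acc="$acc" '$1 == acc'
}
```

Call it as `lookup_accession [accession] genomeset.dat.gz`, using an accession from your own BLAST results.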

You can find a script that parses the BLAST results and merges in the metadata here: https://github.com/peterk87/nf-flu/assets/CFIA-NCFAD/nf-flu/bin/parse_influenza_blast_results.py

You can run this script like this:

  python parse_influenza_blast_results.py \
   --threads [N_CPUs] \
   --flu-metadata [path/to/metadata/genomeset.dat.gz] \
   --top 3 \
   --excel-report iav-subtyping-report.xlsx \
   --pident-threshold 0.85 \
   results.blastn.txt