This tutorial follows on from the "Avian Influenza, Hybrid reference mapping" tutorial.
For instructions on how to acquire and process the data, please refer to that tutorial.
## Data
We are going to work with:
- the paired-end Sanger reads in `~/workshop_data/avian_influenza`
## Quality Control
We already did this! Yay.
## Assembly
We are going to assemble the reads using IRMA.
Then, we are going to use BLAST to do the strain typing.
## Reference database
You can find and query all influenza viruses in GenBank, including the complete genome sequences
from the NIAID Influenza Genome Sequencing Project here: <https://www.ncbi.nlm.nih.gov/genomes/FLU>
You can easily download the sequences here: <https://ftp.ncbi.nih.gov/genomes/INFLUENZA/>
You want to download the FASTA nucleotide file.
- Read the `README` file to figure out which of the files on the website that is.
- Download the compressed version. Do you remember what the gzip extension is?
- Create the directory structure `~/workshop_data/references/ncbi`.
  Tip: remember the `-p` flag for the `mkdir` command? What does it do?
- Move the downloaded file to the created directory. Tip: use `mv`.
- Uncompress it using `gunzip`. Tip: run the `history` command to see how you did it before.

This will be the base for your reference database.
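
If you get stuck on the steps above, here is one possible way to combine them. It is only a sketch: `stage_reference` is a hypothetical helper name, and the download location in the example call is an assumption (your browser may save files elsewhere). Try the steps yourself first.

```shell
# Hypothetical helper: move a downloaded .gz file into a target directory
# and uncompress it there.
stage_reference() {
    downloaded="$1"    # path to the downloaded, gzip-compressed file
    target_dir="$2"    # directory that should hold the reference
    mkdir -p "$target_dir" &&
        mv "$downloaded" "$target_dir"/ &&
        gunzip "$target_dir/$(basename "$downloaded")"
}

# Example call; the download location is an assumption:
# stage_reference ~/Downloads/influenza.fna.gz ~/workshop_data/references/ncbi
```

Chaining the commands with `&&` means a later step only runs if the previous one succeeded, so a failed `mv` will not be followed by a confusing `gunzip` error.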
Now, we have to make it readable for BLAST:
```
conda activate mapping_and_assembly
mamba install blast
```
```
cd ~/workshop_data/references/ncbi
ls
# This should show `influenza.fna`. Make sure it does, before moving on!
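# Build a nucleotide BLAST database from the FASTA file
makeblastdb -in influenza.fna -dbtype nucl
# Keep the generated index files together in their own directory
mkdir blast_db
mv influenza.fna.* blast_db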
```
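
Before moving on, a quick sanity check can save time later. The sketch below counts the sequences in a FASTA file by counting its header lines; `count_fasta_records` is a hypothetical helper name. You can compare the number with the sequence count that `blastdbcmd -info` reports for the database you just built.

```shell
# Hypothetical helper: count FASTA records by counting lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example, run from ~/workshop_data/references/ncbi:
# count_fasta_records influenza.fna
# blastdbcmd -db blast_db/influenza.fna -info
```

If the two numbers differ, the database was probably built from a different or incomplete file.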
## Install IRMA
<!-- conda create -n IRMA 'r-base>=3.6' perl=5.16
wget https://wonder.cdc.gov/amd/flu/irma/flu-amd-202209.zip -O flu-amd-202209.zip
cd ~/Downloads
unzip flu-amd-202209.zip
mv flu-amd-202209/* ~/miniconda3/envs/IRMA/bin -->
```
conda create -n irma irma
conda activate irma
```
## Run the assembly and typing
```
cd ~/workshop_data/avian_influenza
IRMA FLU-avian reads_1.processed.fastq.gz reads_2.processed.fastq.gz reads_IRMA
```
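
Once the run finishes, it is worth taking a quick look at what was assembled. The sketch below prints the name and length of every sequence in one or more FASTA files; `fasta_lengths` is a hypothetical helper name, and the glob in the example call is an assumption about where IRMA writes its FASTA output, so check the contents of `reads_IRMA` first.

```shell
# Hypothetical helper: print "<header><TAB><sequence length>" for each FASTA record.
fasta_lengths() {
    awk '
        /^>/ {                                   # header line: flush the previous record
            if (name != "") print name "\t" len
            name = substr($0, 2)                 # drop the leading ">"
            len = 0
            next
        }
        { len += length($0) }                    # sequence line: add to the running length
        END { if (name != "") print name "\t" len }
    ' "$@"
}

# Example call; the location of the FASTA files is an assumption:
# fasta_lengths reads_IRMA/*.fasta
```

Influenza A segments range from roughly 0.9 to 2.3 kb, so much shorter sequences may point to an incomplete assembly.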
IRMA outputs a consensus FASTA file containing one assembled consensus sequence per segment. You can use `blastn` to find the closest matching genome in the NCBI database.
```
blastn \
-num_threads [N_CPUs] \
-db ~/workshop_data/references/ncbi/blast_db/influenza.fna \
-query [IRMA_consensus_output] \
-out results.blastn.txt
```
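
By default, `blastn` writes a human-readable pairwise report. If you want to inspect hits on the command line, tabular output is easier to work with; you get it by adding `-outfmt 6` to the command. The sketch below assumes such a file, with the standard 12 columns (query in column 1, subject in column 2, bit score in column 12), and keeps the best-scoring hit per query. `top_hits` and the file name in the example call are hypothetical.

```shell
# Hypothetical helper: keep the highest bit score hit for each query
# from a BLAST tabular (-outfmt 6) file.
top_hits() {
    tab="$(printf '\t')"
    # Sort by query name, then by bit score (column 12) from high to low,
    # and let awk print only the first line it sees for each query.
    sort -t "$tab" -k1,1 -k12,12gr "$1" | awk -F '\t' '!seen[$1]++'
}

# Example call; the file name is an assumption:
# top_hits results.outfmt6.tsv
```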
You can parse the results of the BLAST search to do the typing.
Metadata from the NCBI Influenza DB is merged with the BLAST results to improve the subtyping and to provide context for the results. You can download the metadata here:
<https://ftp.ncbi.nih.gov/genomes/INFLUENZA/genomeset.dat.gz>
You can find a script that parses the BLAST results and merges in the metadata here:
<https://github.com/peterk87/nf-flu/assets/CFIA-NCFAD/nf-flu/bin/parse_influenza_blast_results.py>
You can run this script like this:
```
python parse_influenza_blast_results.py \
--threads [N_CPUs] \
--flu-metadata [path/to/genomeset.dat.gz] \
--top 3 \
--excel-report iav-subtyping-report.xlsx \
--pident-threshold 0.85 \
results.blastn.txt
```
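
To make the idea of merging metadata into BLAST hits less of a black box, here is a minimal sketch of a lookup by accession. It is not how the script above is implemented. It assumes an uncompressed, tab-separated metadata file with the accession in column 1 and the subtype in column 4, and a tab-separated hits file with the query in column 1 and a plain accession in column 2. Both layouts are assumptions you should verify against your own files; the sequence identifiers in BLAST output may need trimming before they match. `annotate_hits` and the file names in the example call are hypothetical.

```shell
# Hypothetical helper: add a subtype column to a table of BLAST hits.
annotate_hits() {
    metadata="$1"   # ASSUMED layout: column 1 = accession, column 4 = subtype
    hits="$2"       # ASSUMED layout: column 1 = query, column 2 = accession
    awk -F '\t' '
        NR == FNR { subtype[$1] = $4; next }     # first file: remember subtype per accession
        { print $1 "\t" $2 "\t" (($2 in subtype) ? subtype[$2] : "NA") }
    ' "$metadata" "$hits"
}

# Example call; both file names are assumptions:
# gunzip -c genomeset.dat.gz > genomeset.dat
# annotate_hits genomeset.dat hits.tsv
```

Hits without a matching accession get `NA`, which makes mismatched identifiers easy to spot.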