This tutorial follows the 'Avian Influenza, Hybrid reference mapping' tutorial. For instructions on how to acquire and process the data, please refer to that tutorial.

## Data

We are going to work with:

- the paired-end Sanger reads in `~/workshop_data/avian_influenza`

## Quality Control

We already did this! Yay.

## Assembly

We are going to assemble the reads using IRMA. Then, we are going to use BLAST to do the strain typing.

## Reference database

You can find and query all influenza viruses in GenBank, including the complete genome sequences from the NIAID Influenza Genome Sequencing Project, here: <https://www.ncbi.nlm.nih.gov/genomes/FLU>

You can easily download the sequences here: <https://ftp.ncbi.nih.gov/genomes/INFLUENZA/>

You want to download the FASTA nucleotide file.

- Read the `README` file to figure out which of the files on the website that is.
- Download the compressed version. Do you remember what the gzip extension is?
- Create the directory structure `~/workshop_data/references/ncbi`. Tip: remember the `-p` flag for the `mkdir` command? What does it do?
- Move the downloaded file to the created directory. Tip: use `mv`. Uncompress it using `gunzip`. Tip: run the `history` command to see how you did it before.

This will be the base for your reference database. Now, we have to make it readable for BLAST:

```
conda activate mapping_and_assembly
mamba install blast
```

```
cd ~/workshop_data/references/ncbi
ls
# This should show `influenza.fna`. Make sure it does, before moving on!
makeblastdb -in influenza.fna -dbtype nucl
mkdir blast_db
mv influenza.fna.* blast_db
```

## Install IRMA

<!--
conda create -n IRMA 'r-base>=3.6' perl=5.16
wget https://wonder.cdc.gov/amd/flu/irma/flu-amd-202209.zip -O flu-amd-202209.zip
cd ~/Downloads
unzip flu-amd-202209.zip
mv flu-amd-202209/* ~/miniconda3/envs/IRMA/bin
-->

```
conda create -n irma irma
conda activate irma
```

## Run the assembly and typing

```
cd ~/workshop_data/avian_influenza
IRMA FLU-avian reads_1.processed.fastq.gz reads_2.processed.fastq.gz reads_IRMA
```

IRMA outputs a consensus FASTA file, with an assembly consensus sequence per segment.

You can use `blastn` to find the closest genome in the NCBI database.

```
blastn \
    -num_threads [N_CPUs] \
    -db ~/workshop_data/references/ncbi/blast_db/influenza.fna \
    -query [IRMA_consensus_output] \
    -out results.blastn.txt
```

You can parse the results of the BLAST search to do the typing. Metadata from the NCBI Influenza DB is merged with the BLAST results to improve the subtyping and to provide context for the hits.

You can download the metadata here: <https://ftp.ncbi.nih.gov/genomes/INFLUENZA/genomeset.dat.gz>

You can find a script to do this here: <https://github.com/peterk87/nf-flu/assets/CFIA-NCFAD/nf-flu/bin/parse_influenza_blast_results.py>

You can run this script like this:

```
python parse_influenza_blast_results.py \
    --threads [N_CPUs] \
    --flu-metadata [path/to/genomeset.dat.gz] \
    --top 3 \
    --excel-report iav-subtyping-report.xlsx \
    --pident-threshold 0.85 \
    results.blastn.txt
```
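
If you want a quick look at the closest reference per segment before running the full parsing script, the following is a minimal sketch. It assumes you reran `blastn` with `-outfmt 6` added, which gives tab-separated output where column 1 is the query ID and column 12 is the bit score. That is not what the `blastn` command in this tutorial produces by default. The function name `top_hit_per_query` and the file name `results.blastn.tsv` are made up for illustration.

```shell
# Print the highest-scoring BLAST hit for every query sequence.
# Assumes tabular input from `blastn -outfmt 6`: tab-separated,
# column 1 = query ID, column 12 = bit score.
top_hit_per_query() {
    awk -F '\t' '
        # Keep a row when its query is new or its bit score beats the best so far.
        !($1 in best) || ($12 + 0) > best[$1] {
            best[$1] = $12 + 0
            row[$1] = $0
        }
        END { for (q in row) print row[q] }
    ' "$1" | sort
}

# Usage (hypothetical file name):
# top_hit_per_query results.blastn.tsv
```

Each output line is the complete BLAST row of the best hit, so column 2 holds the accession of the closest reference for that segment. When two hits tie on bit score, the one that appears first in the file is kept.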