Avian Influenza, Hybrid reference mapping

questions:

With reassortment of gene segments being a common event in avian influenza virus (AIV) evolution, does it make sense to use a reference-based mapping approach for constructing consensus genome sequences for AIV samples?
Is it possible to reuse existing tools and workflows developed for the analysis of sequencing data from other viruses?

objectives:

Determine how reassortment impacts reference-based mapping approaches
Use a collection of per-segment reference sequences to construct a hybrid reference genome that is sufficiently close to a sequenced sample to be useful as a reference for mapping
Construct a sample consensus genome from mapped reads

key_points:

Reassortment of gene segments makes reference-based mapping of influenza sequencing data challenging
An alternative to de-novo assembly can be mapping to a dynamically chosen reference genome
Variant calling and consensus genome construction can follow workflows used also for other viral sequence data

Introduction

The viral surface proteins HA and (to a lesser extent) NA are the main targets of the host antibody response and are, thus, under constant selection pressure to mutate into forms capable of evading an existing host immune response. As a consequence, these segments have evolved into a much richer panel of sequence variants than the other segments, to the point that sequences of the HA segment can, at present, be classified into 18 distinct subfamilies, H1-H18, while there are 11 recognized subfamilies of NA, and Influenza A strains are subtyped (as, for example, H5N1, H3N2, etc.) according to the combination of HA- and NA-encoding segments they are carrying.

Importantly, the sequence diversity of HA and (again to a lesser extent) of NA segments is big enough to prevent a naive approach of mapping sequenced reads to one specific agreed-upon Influenza A reference sequence. While this would work for the other six segments, mapping software would regularly fail to find enough plausible mappings for sequenced reads of HA and NA origin to continue analysis with. This is why, in this tutorial we are going to explore an alternative approach, which is also mapping-based but chooses a suitable reference for each segment dynamically based on the input sequencing data.

Data

We are going to work with:

INSAFlu segment reference databases
Paired end sanger reads

Please download the data from here: https://drive.google.com/drive/folders/1jQz8_rcr-uRGMrHRfy7a_nezrbjPz1t0?usp=sharing. Afterwards run the following:

mv ~/Downloads/avian_influenza*.zip ~/workshop_data/
unzip avian_influenza*

Inspect the references.

Are all subtypes of HA and NA represented in the sequences?

cd ~/workshop_data/avian_influenza
cat segment_references/ref_4_ha.fasta | grep H1
cat segment_references/ref_4_ha.fasta | grep H2
cat segment_references/ref_4_ha.fasta | grep H3
...

Quality control

As a very first step, we would like to make sure that we base our analysis only on the high-quality parts of the sequenced reads.

Do you remember the commands for quality control? Tip: use the UP arrow in your terminal to look through your history, alternatively run history.

Inspect the data using FastQC and MultiQC, what do you think?

conda activate QC
fastqc reads*
multiqc .

Trim low-quality bases from the 5' and 3' ends of the reads, use a quality cutoff of 20. Also filter out short reads < 30 bases.

Which set of reads (forward or reverse reads) did profit most quality-wise from our low-quality base trimming? What percentage of reads got discarded completely with our settings, and what percentage of bases?

cutadapt -q 20,20 -m 30 -o reads_1.processed.fastq.gz -p reads_2.processed.fastq.gz reads_1.fastq.gz reads_2.fastq.gz

Per-segment subtyping and hybrid reference construction

Our goal at this step is to find best matches between our sequencing data and the reference sequences of each genome segment. This will:

give us preliminary subtyping results with regard to the HA and NA segments for the sequenced sample
suggest the best reference to map the sequencing data to for each segment.

We are then going to combine the best reference of each segment into a hybrid reference genome to use for mapping our sequenced reads against.

To identify the best matching reference segments, we are going to run the tool VAPOR asking it to report the stats for any hits it identifies.

Take a look at the documentation at https://github.com/connor-lab/vapor. This tool unfortunately is not available in the conda channels, so we have to do this differently.

conda create -n mapping_and_assembly python=3 'numpy>=1.5.1'
conda activate mapping_and_assembly
pip install git+https://github.com/connor-lab/vapor
vapor.py --help

Run VAPOR, figure out the command using vapor --help.

touch vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_1*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_2*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_3*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_4*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_5*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_6*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_7*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt
vapor.py -t 0.1 -m 0.0 -fa segments_references/ref_8*.fasta -fq reads_1.processed.fastq.gz reads_2.processed.fastq.gz >> vapor_results.txt

Take a look at the results (vapor_results.txt). Is there a difference between the results for segment 4 (encoding HA), segment 6 (encoding NA) and those for the remaining segments?

According to the tool, what is the likely subtype with regard to HA and NA of the sample?

Now that we have established that things may make sense, we can use the output of VAPOR to extract the actual sequence of the top hit for each reference segment. We then concatenate these best matches into a hybrid reference genome for mapping.

Replace parts of text to extract just the name of the sequence from each line of VAPOR's output.

TAB='\t'
sed -E "s/^.+${TAB}>(.+)$/\1/g" vapor_results.txt > vapor_results.names.txt

seqtk_subseq to extract the reference sequence based on their names reported by VAPOR and combine the best-matching sequence of each segment into one dataset Take a look at https://github.com/lh3/seqtk to see what this tool can do!

conda install seqtk

cat segment_references/ref_*.fasta > segment_references/all.fasta
seqtk subseq segment_references/all.fasta vapor_results.names.txt > hybrid_reference.fasta

Mapping to a hybrid reference

If things went well, the hybrid reference we just obtained should be close enough across all segments to our sample to allow successful mapping of reads. Before we start the mapping we may want to truncate the segment names in our hybrid reference genome though because currently these names still reflect the full origin of the segment sequences, but from now on we are fine with just the segment ID.

Replace the sequence names with their segment name

sed -E "s/^>([^|]+).+$/>\1/g" hybrid_reference.fasta > hybrid_reference.clean.fasta

Having polished the titles of the segments in our hybrid reference genome we are finally ready for mapping, which we will carry out with BWA-MEM, clean up a bit with Samtools view and produce a quality report for with QualiMap BamQC.

Map with BWA-MEM

conda install bwa-mem2

bwa-mem2 index hybrid_reference.clean.fasta
bwa-mem2 mem -t 4 hybrid_reference.clean.fasta reads_* > out.sam

Let's reduce the size of this file, by converting it to BAM (binary SAM) format. You can use samtools for this.

To convert SAM to BAM, we use the samtools view command. We must specify that our input is in SAM format (by default it expects BAM) using the -S option. We must also say that we want the output to be BAM (by default it produces BAM) with the -b option.

conda install samtools
samtools view -S -b out.sam > out.bam

We can also use samtools view to obtain a selection of reads

"Filter by quality": 20

"Require that these flags are set": Read is paired and Read is mapped in a proper pair

To understand how this works we first need to inspect the SAM format. The SAM format includes a bitwise FLAG field described here. The -f/-F options to the samtools command allow us to query based on the presense/absence of bits in the FLAG field. So -f 4 only output alignments that are unmapped (flag 0×0004 is set) and -F 4 only output alignments that are not unmapped (i.e. flag 0×0004 is not set), hence these would only include mapped alignments.

As you can read in the linked documentation, the -f 3 switch filters for reads that are paired in sequencing (flag 0×0001) and for which each segment was properly aligned (flag 0×0002). Here we add 0x0001 + 0x0002 = 3.

samtools view -b -f 3 -q 20 out.bam > out_filtered.bam

Doing anything meaningful such as calling variants or visualizing alignments in IGV requires that the BAM is further manipulated. It must be sorted such that the alignments occur in “genome order”. That is, ordered positionally based upon their alignment coordinates on each chromosome. Samtools can do this too!

samtools sort out_filtered.bam -o out_filtered.sorted.bam

Study the report with QualiMap

What is the coverage of each segment by the sequenced reads, and is it uniform? Look for a plot showing read mapping quality across the reference. What can you conclude?

conda install qualimap
qualimap bamqc -bam out_filtered.sorted.bam

Consensus sequence

samtools mpileup -aa -A -d 0 -Q 0 out.sorted.REF_NA.bam | ivar consensus -q 20 -t .7 -c 0.8 -m 10 -n N -p out.sorted.REF_NA

Avian Influenza, Hybrid reference mapping

Introduction

Data

Quality control

Per-segment subtyping and hybrid reference construction

Mapping to a hybrid reference

Consensus sequence

Read more

Data groupA

Assembly of Avian Influenza using IRMA

Introduction to Docker

Introduction to Nextflow