owned this note
owned this note
Published
Linked with GitHub
# H3N2 phylogenetic trees (CEIRS)
### Data processing
Raw reads from the Illumina MiSeq sequencer were first filtered using Trim Galore! [http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/] using an adapter sequence of CTGTCTCTTATACACATCT and a quality score cutoff of 30. QC-trimmed reads were then individually classified using kraken [https://github.com/DerrickWood/kraken] against a custom influenza-specific kraken database containing all complete sequences in EpiFlu with a modified taxonomy, which for flu types A, B, and C, identifies the likely segment, subtype, host, and year for each read. The most likely subtype for each sample was identified from the kraken results, and 85 H3N2 samples were pulled out for further analysis. QC-trimmed reads for all 85 H3N2-identified samples were aligned to the A/Victoria/361/2011 reference sequence, and consensus FASTA sequences were generated for each sample. From these consensus FASTA sequences, 63 samples had at least 85% coverage of the HA segment and were used for subsequent phylogenetic analysis.
### Reference sequences
Reference sequences for the 2014-2015 influenza season were obtained from the GISAID EpiFlu database on October 22, 2016, downloading all human H3N2 isolate sequences with an collection date between 2014-10-01 and 2015-05-31 - a total of 5,642 unique EpiFlu IDs across 5,338 unique strain names. Sequences for the A/Victoria/361/2011 and A/Texas/50/2012 vaccine strains were obtained from NCBI BioProject PRJNA205638. EpiFlu sequences were subset to strains that contained exactly one of each of the 8 genome segments, and for strain names that contained more than one EpiFlu ID, only the lowest ID was chosen. This ultimate subset contained full genomes for 1,478 H3N2 strains.
### Phylogenetic analysis
Data from each segment was combined into a segment-specific FASTA file for the 1,478 EpiFlu reference strains, 63 sequenced samples, and the A/Texas/50/2012 vaccine strain, for a total of 1,542 sequences. Each segment was then aligned using MAFFT [http://mafft.cbrc.jp/alignment/software/], all deletions ('-') were replaced with 'N's, and all sequences were padded with 'N's to the length of the vaccine sequence. Any insertions with respect to the vaccine sequence were removed (any positions in the vaccine sequence that now contain 'N's), and all sequences were trimmed to the length of the vaccine sequence. Phylogenetic trees were created for each segment using the nucleotide option of FastTree [http://www.microbesonline.org/fasttree/], and trees were then re-rooted to the vaccine strain using the `nw_reroot` tool from the Newick utilities package [https://github.com/xflouris/newick-tools]. Trees were viewed and colored using FigTree [http://tree.bio.ed.ac.uk/software/figtree/] and HA clades were identified by adding known clade references to these trees and identifying branches for the 3C.2a, 3C.3, 3C.3a, and 3C.3b clades. From these branch identifications, a set of rules was developed so that a clade label could be added to each strain label.
### Clade identification
#### HA clade
HA clades were identified using the following rules (all positions using H3 numbering):
if(159Y){
clade = "3C.2a"
} else if(138S OR 159S) {
clade = "3C.3a"
} else {
if(157S) {
clade = "3C.3b"
} else if(142R){
clade = "3C.2"
} else if(3L AND 128A){
clade = "3C.3"
} else if(3I AND 128T){
clade = "3C.2"
} else {
clade = "uncertain"
}
Using these rules, all 1,542 strains are labeled according to the branch they fall on with the following exceptions:
- A/Brazil/9517/2015 is labeled 3C.3a but falls on the 3C.3b branch
- A/Manitoba/RV3623/2014 and A/Nevada/30/2014 are labeled 3C.3b but fall on the 3C.3 branch
- 01-13-P-0057 is labeled uncertain but falls on the 3C.2a branch
- 01-90-P-0102 is labeled 3C.3 but falls on the 3C.3b branch
It is unclear whether these exceptions are due to mistakes in the clade identification, or if they are artifacts from the FastTree tree generation.
#### NA clade
NA clades were identified by looking at residues 245-247 and labeling as NAglyPos if those residues were NAT or NAS, and NAglyNeg for all other sequences.
#### PB1-F2 clade
PB1-F2 clades were identified by looking at the identity of residue 75.
#### Clade coloring
All clade information was added to to the name each strain name before tree generation, and then nodes were colored by searching for a specific clade text and coloring all nodes in that selection.