draft
Observable page: https://observablehq.com/@aglucaci/visualizing-selection-analysis-results-for-evolution-of-t/11
WHO Variant tracking page: https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/
Outbreak.info page for C.37: https://outbreak.info/situation-reports?pango=C.37
CoVSpectrum page for C.37:
https://cov-spectrum.ethz.ch/explore/Switzerland/AllSamples/AllTimes/variants/json={"variant"%3A{"name"%3A"C.37"%2C"mutations"%3A[]}%2C"matchPercentage"%3A1}
Data retrieval from GISAID on July 7th, 2021, with 2,127 sequences. (Params: low coverage excl.)
Located at:
/home/aglucaci/SARS-CoV-2_Clades/data/C.37
Full results
/home/aglucaci/SARS-CoV-2_Clades/results/C.37
Web-accessible results (on silverback) at:
/data/shares/web/web/SARS2_TimeSeries/clades
Browser accessible:
https://data.hyphy.org/web/SARS2_TimeSeries/clades
Clade defining mutations in Spike
S G75V
S T76I
S del247/253
S L452Q
S F490S
S D614G
S T859N
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
To monitor the continued evolution of the SARS-CoV-2 genome, we developed a rapid assessment tool (RASCL) designed to investigate the nature and extent of selective forces acting on viral genes in clade sequences. We describe our method and its application to the Delta and Kappa lineages in a recent Virological post. Here, we present an early application of our method to the emerging Lambda variant, which the WHO designated a variant of interest (VOI) on June 14, 2021. We find continued evolution in Spike outside of the clade-defining mutations. We also report extensive evolution of the Nucleocapsid gene within this variant of interest.
RASCL is freely available via a dedicated GitHub repository:
https://github.com/veg/SARS-CoV-2_Clades
It is also available as a Galaxy workflow (hosted on usegalaxy.eu):
https://usegalaxy.eu/u/daveb/w/selection
We thank the global community of health-care workers and scientists who have worked tirelessly to face the pandemic head-on. We are also thankful to the GISAID (Elbe et al., 2017) submitters from across the world. Additionally, we thank members of the Datamonkey / HyPhy and Galaxy teams for their continued assistance in the development of this application.
Data retrieval was performed through GISAID on July 7th, 2021, with 2,127 sequences downloaded for the C.37 variant of interest. The search additionally used the 'low coverage excl.' filter.
We investigated the nature and extent of selective forces acting on viral genes in C.37 clade sequences. As sequence collection, sampling, and deposition into public data repositories are ongoing, we present an early analysis of the emerging Lambda variant. Using complete linkage distance clustering (method described in TN93), we selected a median of 80 C.37 (Lambda) sequences from the available sequences to represent genomic diversity. A global reference dataset assembled from publicly available sequences in the Virus Pathogen Database and Analysis Resource (ViPR) (Pickett et al., 2012) is included as background.
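The subsampling idea can be illustrated with a minimal sketch. It is not the RASCL implementation: it assumes a simple p-distance in place of the TN93 distance, a hypothetical distance threshold, and a naive agglomerative loop that is only practical for small alignments. Clusters are merged only while the largest pairwise distance between their members stays within the threshold (complete linkage), and one sequence per cluster is kept as a representative.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites among positions where both sequences have A/C/G/T.
    Stand-in for the TN93 distance (assumption made for brevity)."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    return diffs / compared if compared else 0.0

def complete_linkage_clusters(seqs: dict, threshold: float) -> list:
    """Merge clusters while the MAXIMUM pairwise distance between their
    members (complete linkage) does not exceed the threshold."""
    names = list(seqs)
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

def representatives(seqs: dict, threshold: float) -> list:
    """Keep the first member of each cluster as its representative."""
    return [cluster[0] for cluster in complete_linkage_clusters(seqs, threshold)]

# Toy aligned sequences (hypothetical, for illustration only).
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGT", "d": "TTTTACGA"}
print(representatives(toy, threshold=0.2))
```

In practice the threshold would be tuned so that the number of retained sequences stays small enough for the selection analyses to complete quickly.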
For C.37 clade sequences, we find strong statistical support for multiple modes of selection: diversifying positive (FEL/MEME), negative (FEL), and directional (FADE). There is also evidence of coevolution among sites within individual gene segments (BGM), and at a number of sites clade evolution occurs under selective forces that differ from those acting on background sequences (Contrast-FEL). These results are summarized in Figure 1 and can be fully explored in the following interactive notebook: Link to full RASCL results for C.37 clade sequences.
Figure 0. Geographic distribution of C.37 sequences from the start of 2021, on a country-by-country basis.
As in our previous Virological post on the Delta and Kappa variants, a key observation is that selection appears to be operating on many sites that are not part of the clade-defining signature mutation set, implying that there may be ongoing diversification and adaptation in this clade. Additionally, as the frequencies of amino-acid residues at sites outside of the clade-defining mutations change over time, these sites may become seeds for future variant lineages.
Figure 1. Schematic for genome-level results for the C.37/Lambda clade. The color key is defined as follows: Green = MEME (episodic diversifying selection) on All branches (Track 1, or the outermost track) and MEME results on Internal branches (Track 2). Purple = Clade defining sites from Pangolin. Orange = S1 Functional sites shown to influence antibody binding. Yellow = Contrast-FEL results on Internal branches. Blue = FADE results, a test for directional selection. Grey box = FEL results for negatively selected sites. Grey lines = BGM results for coevolving sites.
We focus our attention on positively selected sites located in the Spike gene. We identify 8 sites subject to episodic positive selection in the C.37 clade (Table 1), with only 2 of those appearing on the canonical clade signature list (452, 490). The trends in most common haplotypes at the remaining (non-clade-defining) six positions are shown in Figure 2.
# | Coordinate (SARS-CoV-2) | Gene/ORF | Codon (in gene/ORF) | # of selected branches | p-value | q-value |
---|---|---|---|---|---|---|
1 | 21595 | S | 12 | 1 | 0.0137907 | 0.130649 |
2 | 21619 | S | 20 | 2 | 0.048189 | 0.170079 |
3 | 21775 | S | 72 | 2 | 0.00503418 | 0.0647251 |
4 | 22297 | S | 246 | 1 | 0.00242549 | 0.0363824 |
5 | 22318 | S | 253 | 2 | 0.0124571 | 0.124571 |
6 | 22915 | S | **452** | 0 | 0.00300422 | 0.0415968 |
7 | 23029 | S | **490** | 2 | 0.0202347 | 0.15176 |
8 | 24037 | S | 826 | 1 | 0.0493256 | 0.170742 |
Table 1. Sites in the Spike gene from C.37 sequences which show evidence for episodic selection. Bolded sites are clade-defining mutations from Pangolin.
Figure 2. Temporal trends of the substitution combinations at all sites represented in Table 1 in the Spike gene for C.37 sequences in 2021. The symbol . denotes the reference residue at that site. Figures like this can be generated using the Observable notebook "Trends in mutational patterns across SARS-CoV-2 Spike" by Sergei Pond.
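The genomic coordinates in the table can be related to the codon numbers with a short calculation. The sketch below assumes that each coordinate is the 0-based position of the first nucleotide of the codon, and that the coding sequence start positions are those of the NCBI reference genome NC_045512.2 (Spike at 21563 and Nucleocapsid at 28274, both 1-based). These conventions are inferred from the tabulated values rather than stated by the pipeline.

```python
# 1-based CDS start positions in NC_045512.2 (assumed reference annotation).
CDS_START = {"S": 21563, "N": 28274}

def codon_to_coordinate(gene: str, codon: int) -> int:
    """Return the 0-based genomic coordinate of the first nucleotide
    of a codon, given its 1-based index within the gene."""
    return (CDS_START[gene] - 1) + 3 * (codon - 1)

print(codon_to_coordinate("S", 452))  # 22915
print(codon_to_coordinate("S", 12))   # 21595
```

The same function can be used to check coordinates reported for other genes once their CDS start positions are added to the dictionary.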
We note a high level of selection activity in the Nucleocapsid gene within C.37 sequences (Figure 1 and Table 2). Therefore, we next focus on positively selected sites located in the Nucleocapsid gene. We identify 19 sites subject to episodic positive selection in the C.37 clade (Table 2), none of which appears on the canonical clade signature list. Of note, the Nucleocapsid 366I mutation occurs at high frequency (~49% of C.37 sequences in 2021) and on a number of branches of the phylogenetic tree (Table 2). This mutation may have played a role in the early evolution of C.37 (see Figure 3) and allowed other mutations in Nucleocapsid to be tolerated. The trends in the most common haplotypes at the nineteen positions are shown in Figure 3.
# | Coordinate (SARS-CoV-2) | Gene/ORF | Codon (in gene/ORF) | # of selected branches | p-value | q-value | Physicochemical Properties |
---|---|---|---|---|---|---|---|
1 | 28303 | N | 11 | 1 | 0.0457584 | 0.175245 | |
2 | 28324 | N | 18 | 1 | 0.0378039 | 0.174479 | |
3 | 28342 | N | 24 | 1 | 0.0309127 | 0.179493 | |
4 | 28351 | N | 27 | 1 | 0.0258399 | 0.166114 | |
5 | 28417 | N | 49 | 1 | 0.0304154 | 0.182492 | |
6 | 28540 | N | 90 | 1 | 0.0391604 | 0.171924 | |
7 | 28627 | N | 119 | 1 | 0.0188724 | 0.15441 | secondary, charge |
8 | 28675 | N | 135 | 1 | 0.0309321 | 0.173993 | |
9 | 28723 | N | 151 | 1 | 0.0432127 | 0.176779 | |
10 | 28789 | N | 173 | 2 | 0.0322641 | 0.175986 | |
11 | 28876 | N | 202 | 1 | 0.000257256 | 0.0154354 | |
12 | 28906 | N | 212 | 1 | 0.035702 | 0.169115 | |
13 | 29368 | N | 366 | 7 | 0.000531167 | 0.009561 | volume |
14 | 29371 | N | 367 | 2 | 0.000329097 | 0.00846248 | bipolar, volume, composition, charge |
15 | 29413 | N | 381 | 1 | 0.0384615 | 0.173077 | |
16 | 29464 | N | 398 | 2 | 0.0332768 | 0.166384 | |
17 | 29473 | N | 401 | 0 | 0.00848191 | 0.101783 | volume |
18 | 29479 | N | 403 | 0 | 0.000282372 | 0.0127067 | |
19 | 29518 | N | 416 | 1 | 8.8975e-05 | 0.0160155 | Overall |
Table 2. Sites in the Nucleocapsid gene in C.37 sequences which show evidence for episodic selection. We also note, where appropriate, the physicochemical properties influencing selection at specific sites (per the PRIME method).
Figure 3. Temporal trends of the substitution combinations at all sites reported in Table 2 in the Nucleocapsid gene in C.37 sequences. The symbol . denotes the reference residue at that site. Figures like this can be generated using the Observable notebook "Trends in mutational patterns across SARS-CoV-2 Spike" by Sergei Pond.
Ongoing monitoring of emergent VOIs and VOCs can detect adaptive mutations before they rise to high frequency, and help establish their relationship to key clinical parameters including pathogenicity and transmissibility. Additionally, continued evolution within a particular clade may form the foundation for a subclade with further functional sites of interest. Based on current information for the Spike gene from https://covdb.stanford.edu/mut-annot-viewer/SARS2S/, we included several annotations with clinical relevance.
SARS-CoV-2 C.37 Spike gene, sites of interest from Table 1.
The RASCL application is designed to accept two inputs: a “query” whole genome sequence dataset, and a “background” whole genome sequence dataset. Users are requested to input a whole genome dataset (fasta, unaligned), which will be defined as the “query” sequence dataset and can typically be retrieved from data repositories such as GISAID. Typically, a background sequence dataset is assembled from publicly available sequences via the ViPR database, and is provided by default unless otherwise specified. However, our application also accepts another clade of interest as “background”, which allows one to directly interrogate between-clade evolutionary dynamics. Whole genome datasets are broken down into their respective coding sequences based on the NCBI SARS-CoV-2 reference annotation. The gene set includes structural (Spike, Matrix, Nucleocapsid, Envelope) and non-structural (leader, nsp2, nsp3, nsp4, 3C, nsp6, nsp7, nsp8, nsp9, nsp10, helicase, exonuclease, endoRNase, ORF3a, ORF6, ORF7a, ORF8, RdRp, methyltransferase) genes. From this point onwards, individual genes from the query and background datasets are processed separately. Our procedure includes several quality control steps, including sanitizing fasta input headers and striking ambiguous codons from an alignment. So that our analyses complete rapidly, alignments are subsampled using genetic distances provided by the TN93 (Tamura and Nei 1993) algorithm. A combined (query + background) alignment is then created, keeping only those background sequences that are divergent enough to be useful for subsequent selection analysis. Inference of a maximum likelihood phylogenetic tree (RAxML-NG, Kozlov et al., 2019) is performed on the combined dataset. Trees are labelled for “query” and “background” branches.
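The two quality control steps mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code used by RASCL: the header rule (keep letters, digits and underscores) and the choice to replace an ambiguous codon with a gap codon are both assumptions, and the example header is hypothetical.

```python
import re

def sanitize_header(header: str) -> str:
    """Replace every character outside [A-Za-z0-9_] with '_', collapse runs
    of underscores, and trim underscores from both ends."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", header)
    return re.sub(r"_+", "_", cleaned).strip("_")

def strike_ambiguous_codons(seq: str) -> str:
    """Replace each codon containing anything other than A, C, G, T with '---'.
    A trailing partial codon (length not divisible by 3) is dropped."""
    out = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3].upper()
        out.append(codon if set(codon) <= set("ACGT") else "---")
    return "".join(out)

print(sanitize_header("hCoV-19/example/1/2021|2021-01-01"))
print(strike_ambiguous_codons("ATGNNTGGA-TTAA"))
```

Sanitized headers avoid characters that tree-building and selection tools may treat as delimiters, and masking whole codons keeps the reading frame intact for codon-based models.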
The analyses SLAC (Pond and Frost, 2005), BGM (Poon et al., 2008), FEL (Pond and Frost, 2005), MEME (Murrell et al., 2012), aBSREL (Smith et al., 2015), BUSTEDS (Wisotsky et al., 2020), RELAX (Wertheim et al., 2015), and CFEL (Pond et al., 2021) are performed with state-of-the-art molecular evolution models from the HyPhy (Pond et al., 2020) suite of bioinformatics tools. We report sites of interest which together may represent regions of ongoing evolution within a specific clade. Results are combined into easily readable, web-accessible JSON files used for web processing. Results are visualized with full-featured interactive notebooks on ObservableHQ.
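As an illustration of how per-site results might be pulled out of such JSON files, the sketch below filters sites by p-value. It assumes the layout commonly seen in HyPhy per-site output, where "MLE" holds a "headers" list of [name, description] pairs and a "content" object keyed by partition; the combined RASCL files may be organized differently, and the tiny JSON document here is made up for demonstration.

```python
import json

def significant_sites(results: dict, alpha: float = 0.05, partition: str = "0") -> list:
    """Return (site, p-value) pairs with p-value <= alpha.
    Sites are numbered from 1 in the order the rows appear."""
    headers = [h[0] for h in results["MLE"]["headers"]]
    p_col = headers.index("p-value")  # locate the column by name, not position
    rows = results["MLE"]["content"][partition]
    return [(site, row[p_col])
            for site, row in enumerate(rows, start=1)
            if row[p_col] <= alpha]

# Made-up miniature result with two columns and three sites.
example = json.loads("""
{"MLE": {"headers": [["LRT", "Likelihood ratio test statistic"],
                     ["p-value", "Asymptotic p-value"]],
         "content": {"0": [[0.1, 0.67], [9.8, 0.003], [4.2, 0.049]]}}}
""")
print(significant_sites(example))
```

Looking the column up by its header name keeps the sketch independent of the exact column order, which differs between methods.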
References
SARS-CoV-2 C.37 Nucleocapsid gene, sites of interest:
… rename this, but provide literature review and references here
Evolutionary relationship to clinical parameters
Do any of the sites have a link to transmissibility or immune evasion?
A little more background on C.37: where it is, etc.
How is it spreading?
To understand the …
Annotate the tables: based on the NCBI annotation, where in the gene does each site fall?
Notable features