- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
We identify genomic sites in μ (B.1.621, Latif et al. 2021) clade sequences that may be subject to selective forces and could be prioritized for further studies, but have not yet reached high frequencies.
We present an analysis of the B.1.621 (μ) variant, performed on individual genes/protein products, using a median of 101 μ sequences downsampled from all available sequences in GISAID to represent the genomic diversity in this clade (see the Methods section below for details). A similarly downsampled global SARS-CoV-2 reference dataset from publicly available sequences via the Virus Pathogen Database and Analysis Resource (ViPR) (Pickett et al., 2012) database is used as background.
Full analysis details can be found in this ObservableHQ notebook
Our analysis identified 67 (Table 1 in the ObservableHQ notebook) individual codon sites (among 1643 sites that are polymorphic in the amino-acid space among) that showed evidence of episodic diversifying selection along internal branches of this clade using the MEME method at q ≤ 0.20 false discovery rate (FDR). A total of 5 sites (Table 3 in the ObservableHQ notebook) were found to be subject to directional selection using the FADE method.
We created a table of high-priority sites in SARS-CoV-2 μ B.1.621 sequences. We assign a “Rank” to each site based on a point system described in the “Approach” section above. Briefly, each site is given a point for each of the follow requirements that are met: found to be positively selected, the site occurs outside of the clade defining site set, the site has any unexpected mutations as described in our Sarbecoviruses evolutionary analysis notebook, the site has a mutation present above the minimum threshold in the SARS-CoV-2 Global Haplotype analysis (for this we remove any mutation present in Reference clade, and remove gaps).
# | Coordinate (SARS-CoV-2) | Gene/ORF | Codon (in gene/ORF) | p-value | q-value | Rank | Physiochemical Property |
---|---|---|---|---|---|---|---|
1 | 16075 | RDRP/ORF1b | 879/870Y | 0.0452115 | 0.140312 | 4 | |
2 | 28873 | N | 201G | 0.0460285 | 0.138086 | 4 | |
3 | 28253 | ORF8 | 121VHF | 0.0300385 | 0.180231 | 4 | |
4 | 25336 | S | 1259HV | 0.0157386 | 0.134903 | 4 | |
5 | 25333 | S | 1258D | 0.00852771 | 0.0959367 | 4 | |
6 | 13516 | RDRP/ORF1b | 26/17I | 0.0336153 | 0.168077 | 4 | |
7 | 14122 | RDRP/ORF1b | 228/219D | 0.036867 | 0.170155 | 4 | |
8 | 14530 | RDRP/ORF1b | 364/355F | 0.0416466 | 0.144161 | 4 | |
9 | 14767 | RDRP/ORF1b | 443/434V | 0.0384624 | 0.161005 | 4 | |
10 | 14785 | RDRP/ORF1b | 449/440T | 0.0383252 | 0.168257 | 4 | charge |
11 | 17976 | Helicease/ORF1b | 580/1503F | 0.0113156 | 0.107201 | 3 | |
12 | 20550 | Endornase/ORF1b | 310/2361I | 0.00645087 | 0.0893198 | 3 | |
13 | 21234 | Methyltransferase/ORF1b | 192/2589Y | 0.049474 | 0.134929 | 3 | |
14 | 21640 | S | 27LT | 0.000824192 | 0.0247258 | 3 | |
15 | 21997 | S | 146YNTPS | 0.000598969 | 0.0215629 | 3 | |
16 | 22003 | S | 148S | 0.0221244 | 0.165933 | 3 | |
17 | 22000 | S | 147HQNP | 0.0110123 | 0.1166 | 3 | Overall, secondary |
18 | 19482 | Exonuclease/ORF1b | 481/2005V | 0.0474395 | 0.133424 | 3 | |
19 | 25707 | ORF3a | 106FPIL | 0.0410847 | 0.145005 | 3 | |
20 | 27210 | ORF6 | 4PHI | 7.98568e-06 | 0.000479141 | 3 | |
21 | 29023 | N | 251S | 0.0425228 | 0.141743 | 3 | |
22 | 19548 | Exonuclease/ORF1b | 503/2027VS | 0.0470548 | 0.134442 | 3 | |
23 | 29443 | N | 391AN | 0.0325738 | 0.167522 | 3 | |
24 | 14470 | RDRP/ORF1b | 344/335I | 0.0339005 | 0.164921 | 3 | |
25 | 18327 | Exonuclease/ORF1b | 96/1620I | 0.0395957 | 0.15494 | 3 | |
26 | 2944 | NSP3/ORF1a | 633/894S | 0.0246692 | 0.158588 | 3 | |
27 | 17820 | Helicase/ORF1b | 528/1451S | 0.0392635 | 0.160624 | 3 | |
28 | 17025 | Helicase/ORF1b | 263/1186I | 0.0241484 | 0.167182 | 3 | |
29 | 15001 | RDRP/ORF1b | 521/512C | 0.0457987 | 0.139725 | 3 | |
30 | 9472 | NSP4/ORF1a | 307/3070T | 0.040304 | 0.145094 | 3 | |
31 | 9424 | NSP4/ORF1a | 291/3054T | 0.0402514 | 0.147862 | 3 | |
32 | 9139 | NSP4/ORF1a | 196/2959S | 0.0111345 | 0.111345 | 3 | |
33 | 8614 | NSP4/ORF1a | 21/2784VT | 0.0400482 | 0.153376 | 2 | |
34 | 6535 | NSP3/ORF1a | 1273/2091D | 0.0246588 | 0.164392 | 2 | |
35 | 8578 | NSP4/ORF1a | 9/2772H | 0.0198022 | 0.162018 | 2 | volume |
36 | 19479 | Exonuclease/ORF1b | 480/2004CV | 0.00358349 | 0.0586389 | 2 | volume, charge |
37 | 8659 | NSP4/ORF1a | 36/2799N | 0.0441831 | 0.142017 | 2 | charge |
38 | 18960 | Exonuclease/ORF1b | 307/1831V | 0.0462054 | 0.134145 | 2 | |
39 | 16260 | Helicase/ORF1b | 8/931Y | 0.0309811 | 0.17989 | 2 | |
40 | 2230 | NSP2/ORF1a | 476/656SA | 0.0316127 | 0.172433 | 2 |
Table 1. A table of high-priority sites in SARS-CoV-2 μ B.1.621 sequences. We assign a "Rank" to each site based on a point system described in the "Approach" section above.
For the set of SARS-CoV-2 genomic sites (taken from Table 1), sites inferred from μ B.1.621 sequences to be under the influence of natural selection, we observe when and how selection (positively or negatively) operated on them, through a series of 3-month overlapping intervals going back to the beginning of the pandemic. The earliest intervals ends in February 2020 and the latest - in September 2021.
Figure 1. Evolutionary trajectories of 40 high-priority selected sites (from Table 1). If a site was found to be positively (red) or negatively (blue) selected during a specific time period, a bubble will be drawn at a corresponding point on the plot. The area of the bubble is scaled as -log10 p, where p is the p-value of the FEL likelihood ratio test. Larger bubbles correspond to smaller p-values; p-values are not directly comparable between different time windows and different genes due to differences in sample sizes and other factors. The x-axis shows the endpoint of the time-window; e.g., Mar 30th 2021 will correspond to the analysis performed with the data from Jan 01 2021 to Mar 30th 2021. Figures like this can be generated with the Evidence of natural selection history operating on SARS-CoV-2 genomes ObservableHQ notebook.
Figure 2. Temporal trends of the substitution combinations at all sites represented in Table 1 in the Spike gene for B.1.621 (μ) sequences in 2021 (from left to right: S/27, S/146,S/147, S/1258, S/1259). The symbol "." denotes the reference residue at that site. Figures like this can be generated using Trends in mutational patterns across SARS-CoV-2 Spike enabled by data from / Sergei Pond / Observable. Additional search parameters include "B.1.621[pangolin] AND 20210101[after]"
Figure 3. Temporal trends of the substitution combinations at all sites represented in Table 1 in the RDRP (RNA-dependent RNA polymerase) gene for B.1.621 (μ) sequences in 2021 (from left to right: RDRP/26, RDRP/228, RDRP/344, RDRP/364, RDRP/443, RDRP/449, RDRP/521, RDRP/879). The symbol "." denotes the reference residue at that site. Figures like this can be generated using Trends in mutational patterns across SARS-CoV-2 Spike enabled by data from / Sergei Pond / Observable. Additional search parameters include "B.1.621[pangolin] AND 20210101[after]"
The ongoing monitoring of emergent VOIs and VOCs can detect adaptive mutations before they rise to high frequency, and help establish their relationship to key clinical parameters including pathogenicity and transmissibility. Additionally, continued evolution within a particular clade may form the foundation for a subclade with further functional sites of interest. Based on current information for the Spike gene from https://covdb.stanford.edu/mut-annot-viewer/SARS2S/ we include several annotations with clinical relevance.
SARS-CoV-2 B.1.621 (μ) Spike gene, sites of interest from Table 1 of our ObservableHQ notebook.
epitope binding to mAbs
epitope binding to mAbs
epitope binding to mAbs
epitope binding to mAbs
epitope binding to mAbs
epitope binding to mAbs
epitope binding to mAbs
Figure 4. Spike protein crystal structure annotation 6CRZ with MEME sites, a measure of episodic selection (These sites are listed in Table 1 of the ObservableHQ notebook). The color legend for these figures is as follows: the NTD region is highlighted in Blue, the RBD region is highlighted in Green, The Heptad Repeat (HR) region is highlighted in Ruby, MEME (Positively selected) sites are highlighted in Orange. Figures like this can be generated using the Categorical site highlighting with NGL ObservableHQ notebook. To interact with the figure above visit: Categorical highlighting RASCL/Mu
The RASCL application is designed to accept two inputs: a “query” whole genome sequence dataset, and a “background” whole genome sequence dataset. Users are requested to input a whole genome dataset (fasta, unaligned), which will be defined as the “query” sequence dataset, and can be typically retrieved from data repositories such as GISAID. Typically, a background sequence dataset is assembled from publicly available sequences via the ViPR database, and is provided by default unless otherwise specified. However, our application also tolerates the usage of another clade of interest to be used as “background”, which allows one to directly interrogate between-clade evolutionary dynamics. Whole genome datasets are broken down into their respective coding sequences from the NCBI SARS-CoV-2 reference annotation. The gene set includes structural (Spike, Matrix, Nucleocapsid, Envelope) and non-structural (leader, nsp2, nsp3, nsp4, 3C, nsp6, nsp7, nsp8, nsp9, nsp10, helicase, exonuclease, endornase, ORF3a, ORF6, ORF7a, ORF8, RdRp, methyltransferase) genes. Individual genes from query and background datasets are processed from here onwards. Our procedure includes several quality control steps including sanitizing fasta input headers and striking ambiguous codons from an alignment. In order for our analyses to complete rapidly, alignments are subsampled using genetic distances provided by the TN93 (Tamura and Nei 1993) algorithm. A combined (query + background) alignment is created and with sequences kept from the background dataset divergent enough to be useful for subsequent selection analysis. Inference of a maximum likelihood phylogenetic tree (RAxML-NG, Kozlov et al., 2019) is performed on the combined dataset. Trees are labelled for “query” and “background” branches. Analyses SLAC (Pond and Frost, 2005), BGM (Poon et al., 2008), FEL (Pond and Frost, 2005), MEME (Murrell et al., 2012), aBSREL (Smith et al., 2015), BUSTEDS (Wisotsky et al., 2020), RELAX (Wertheim et al., 2015), CFEL (Pond et al., 2021) are performed with state of the art molecular evolution models from the HyPhy (Pond et al., 2020) suite of bioinformatics tools. We report on sites of interest which together may represent regions of ongoing evolution within a specific clade. Results are combined into easily readable and web-accessible JSON files used for web processing. Visualization of results is done with full-feature interactive notebooks through ObservableHQ.
Data retrieval was performed through GISAID on September 1st 2021 with 3,288 sequences downloaded for the B.1.621 (μ) variant of interest. Additional parameters for the search included: 'low coverage excl.' and 'complete'.
RASCL is freely available via a dedicated Github repository: https://github.com/veg/SARS-CoV-2_Clades
We thank the global community of health-care workers and scientists who work tirelessly to face the pandemic head-on. We are also thankful for the GISAID submitters throughout the world. Additionally, we thank members of the Datamonkey / HyPhy and Galaxy teams for their continued assistance in the development and application of our software.
References