# P1 and P2 lineage selection report
We investigated the extent of selective forces acting on P1 and P2 SARS-CoV-2 lineages (see Methods). We analyzed a median of 5 unique P1 haplotypes per gene/peptide in the context of a median of 79 reference sequences, and a median of 9 unique P2 haplotypes per gene/peptide in the context of a median of 70 reference sequences. The summary of sites where selection was detected (MEME p≤0.05) is shown in the following figure. In addition to individual sites under selection, we also recorded instances of putative convergence, i.e. substitutions to the same amino acid at the same site in both lineages; there were only 2 such events (S/484K and ORF1A/318L). The evolutionary "credibility" of target residues was estimated using the [PRIME method](http://hyphy.org/w/index.php/PRIME) based on a [bat/pangolin Sarbecovirus alignment](https://www.biorxiv.org/content/10.1101/2020.05.28.122366v2).
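
The convergence criterion above (the same amino acid at the same site in both lineages) amounts to a set intersection. The following is a minimal sketch of that idea, not the code used for this report. The function name and the `(gene, site, residue)` tuple layout are assumptions made for illustration, and the `GENE_X`/`GENE_Y` entries are placeholders rather than observed substitutions.

```python
def convergent_substitutions(lineage_a, lineage_b):
    """Return the (gene, site, residue) triples found in both lineages."""
    return sorted(set(lineage_a) & set(lineage_b))


# S/484K and ORF1A/318L are the two events named in the report.
# The GENE_X and GENE_Y entries are placeholders, not real observations.
p1_substitutions = [("S", 484, "K"), ("ORF1A", 318, "L"), ("GENE_X", 1, "A")]
p2_substitutions = [("S", 484, "K"), ("ORF1A", 318, "L"), ("GENE_Y", 2, "C")]

shared = convergent_substitutions(p1_substitutions, p2_substitutions)
for gene, site, residue in shared:
    print(f"{gene}/{site}{residue}")
```

Matching on the full triple, rather than on the site alone, is what separates convergence from two lineages that merely vary at the same position.
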
Because of low sequence divergence and relatively small sample sizes, BUSTED reported gene-level selection on only one gene/peptide, `N` in P1 (p<0.001).
We next asked whether there was evidence of selection on any of those sites in the global GISAID dataset (~300,000 sequences sampled up to early January 2021).
There was evidence of positive diversifying selection operating on a number of sites identified as selected in P1 and P2 lineages prior to the emergence of N501Y lineages: `N/80, N/119, N/202, N/238, S/18, S/20, S/26, S/655, S/688, S/1027`.
Unless specified otherwise, all analyses were performed on a single gene (e.g. S) or peptide product (e.g. nsp3), since genes/peptides are the targets of selection.
We aligned all sequences from an individual lineage (i.e. P1 or P2), together with reference sequences (GISAID unique haplotypes in the corresponding gene/peptide), to the GenBank reference genome protein sequence for the corresponding segment. We used the codon-aware alignment tool `bealign`, which is part of the BioExt Python package (https://github.com/veg/BioExt) and is also used by HIV-TRACE, with the HIV-BETWEEN-F scoring matrix (similar viral sequences). This approach did not consider insertions relative to the reference genome in subsequent analyses. Deletions were treated as missing data and were not specifically tested for with codon models. We reduced alignments by removing identical or nearly identical sequences using complete-linkage clustering on pairwise genetic distances with the `tn93-cluster` tool (https://github.com/veg/tn93). Each group of sequences in which every sequence is within genetic distance D (Tamura-Nei 93) of every other sequence in the group is represented by a single (randomly chosen) sequence from the group. We used D=0.0001 for lineage-specific sequence sets and D=0.0015 for GISAID reference (or “background”) sequence sets. We restricted the reference set of sequences to those sampled before Oct 15th, 2020.
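
The reduction step can be illustrated with a small sketch. This is not the `tn93-cluster` implementation: it assumes a greedy assignment order, and it swaps in a simple proportion-of-differences distance for Tamura-Nei 93. The function names, the toy sequences, and the threshold of 0.1 are all hypothetical.

```python
import random


def p_distance(a, b):
    """Proportion of differing sites; a simple stand-in for the TN93 distance."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_within(seqs, threshold, distance=p_distance):
    """Greedily group names so that every pair in a group is within `threshold`."""
    clusters = []
    for name, seq in seqs.items():
        for members in clusters:
            # Complete linkage: the newcomer must be close to every member.
            if all(distance(seq, seqs[m]) <= threshold for m in members):
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def pick_representatives(clusters, seed=0):
    """Keep one randomly chosen sequence name per cluster."""
    rng = random.Random(seed)
    return [rng.choice(members) for members in clusters]


# Toy aligned sequences (hypothetical); real thresholds are far smaller.
toy = {
    "a": "ATGAAACCCGGG",
    "b": "ATGAAACCCGGG",
    "c": "ATGAAACCCGGT",
    "d": "TTGTTTCCCGGG",
}
clusters = cluster_within(toy, threshold=0.1)
representatives = pick_representatives(clusters)
```

The all-pairs check is what makes this complete linkage: a sequence joins a group only if it is within the threshold of every current member, so no two sequences in a group can be further apart than D.
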
We inferred a maximum likelihood tree from the combined sequence dataset with `raxml-ng` using default settings (GTR+G model, 20 starting trees). We partitioned internal branches in the resulting tree into two non-overlapping sets used for testing and annotated the Newick tree. Because of a lack of phylogenetic resolution in some of the segments/genes, not all analyses were possible for all segments/genes. In particular, this was the case when lineage P1 or P2 sequences were not monophyletic in a specific region, so that no internal branches could be labeled as belonging to the focal lineage.
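
The report does not spell out the exact rule used to partition internal branches. As a sketch of one plausible convention, the example below labels an internal node as focal only when every leaf beneath it belongs to the focal lineage. The nested-tuple tree representation, the function name, and the leaf names are assumptions made for illustration.

```python
def label_internal_branches(tree, focal_leaves):
    """Map each internal node (as its set of leaves) to 'focal' or 'background'.

    `tree` is a nested tuple whose leaves are strings. A node is 'focal' only
    if all leaves below it are in `focal_leaves` (an assumed convention).
    """
    labels = {}

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(leaves_below(child) for child in node))
        labels[leaves] = "focal" if leaves <= focal_leaves else "background"
        return leaves

    leaves_below(tree)
    return labels


# Hypothetical toy tree: a monophyletic focal clade next to two references.
toy_tree = ((("P1_a", "P1_b"), "P1_c"), ("ref_1", "ref_2"))
labels = label_internal_branches(toy_tree, focal_leaves={"P1_a", "P1_b", "P1_c"})
focal_nodes = [sorted(k) for k, v in labels.items() if v == "focal"]

# When focal sequences are not monophyletic, no internal node is focal.
mixed_tree = (("P1_a", "ref_1"), ("P1_b", "ref_2"))
mixed_labels = label_internal_branches(mixed_tree, focal_leaves={"P1_a", "P1_b"})
```

The second tree shows the failure mode described above: with focal sequences interleaved among references, the focal set is empty and branch-level tests have nothing to test.
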
For P1, the following genes/peptides were analyzed: `N, helicase, leader, nsp2, nsp3, nsp6, nsp8, nsp9, ORF3a, RdRp, S`. For P2, they were: `3C, endornase, exonuclease, leader, M, methyltransferase, N, nsp2, nsp3, nsp6, nsp7, nsp8, nsp9, ORF3a, ORF8, RdRp, S`.
We used HyPhy v2.5.27 (http://www.hyphy.org/) to perform a series of selection analyses. Because our tree is reduced to only include unique haplotypes, even leaf nodes could represent “transmission” events, if the same haplotype was sampled more than once (and the vast majority were). We performed:
1. Gene-level tests for selection on the internal branches of the P1 and P2 clades using BUSTED with synonymous rate variation enabled.
2. Codon site-level tests for episodic diversifying selection (MEME) and for pervasive positive or negative selection (FEL).
3. Site-level tests for differences in selective pressures using [Contrast-FEL](https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msaa263/5926108).