# Notes on selection analyses of the nAChR epitope > From [Drabeck et el](https://www.biorxiv.org/content/10.1101/2022.09.14.508034v1) ### "Tall" dataset. This dataset has unusual dimensions: short (30) codons and relatively many (199) sequences. This creates some statistical issues that could potentially be impactful. 1. Many branch lengths are `=0` (for example, HyPhy collapses 130 internal tree branches because they are 0). This is not biologically realistic. If possible, it would be better to estimate branch lengths from a longer gene alignment, even if not all species are present. 2. The precision with which non-0 branch lengths are estimated is degraded (likely biased). This could create downstream issues with all method. 3. Methods which draw power from sequence length (e.g. aBSREL, BUSTED, and also PAML), are going to suffer power loss. ### Basic data exploration. Using the tree with the following branches labeled as foreground, ![](https://i.imgur.com/Qs0Cirm.png) I ran SLAC to get a sense of what substitution patterns are present in the data. ``` hyphy slac --alignment /Users/sergei/Downloads/doi_10-6/Mammal_Full.phy --tree /Users/sergei/Downloads/doi_10-6/FINAL_MamTree.tre --branches TEST ``` [Visualizing](http://vision.hyphy.org/SLAC) the resulting SLAC file, there are only 7 sites where **any** non-synonymous substitutions are inferred to have occurred on the foregound branches (the `N` column in the table) ![](https://i.imgur.com/C8bnZuw.png) In other words, PAML reports selection at 4/7 sites with **ANY** non-synonymous variation (to convert to paper coordiates, one needs to add 174 to the site indices, PAML-identified sites are, per Table 1, would be 8,13,15,23 in direct alignment coordinates). This raises some questions, at least for me, about **specificity**, especially at sites where only a few non-synonymous change occur. > **Minor point**. In table 1, MEME results are reported as numbers close to 1, presumably to match PAMLs empirical Bayes posterior probabilities. This is confusing and should not be done, because MEME reports p-values (small = significant), and these are not convertible to posterior probabilities. ### MEME data analysis. Because many other analyses are not well-suited for such "short" and tall" alignments, I'll focus on running MEME here on foreground branches. ``` hyphy meme --alignment /Users/sergei/Downloads/doi_10-6/Mammal_Full.phy --tree /Users/sergei/Downloads/doi_10-6/FINAL_MamTree.tre --branches TEST ``` Visualize the resulting `JSON` file in [this notebook](https://observablehq.com/@spond/meme). ![](https://i.imgur.com/sB2D5lY.png) #### Site 15 Positive selection detected on FG branches (p = 0.005). Multiple non-synonymous substitutions, including several **multinucleotide** substitutions (e.g. `TTC→AAC`) drive this signal; makes intutive sense. ![](https://i.imgur.com/SrAvdCX.png) ![](https://i.imgur.com/JSNYvme.png) #### Site 8 No evidence of positive selection -- MEME calls this to be ~**neutral**. Note, that the signal at this site comes from ~2 non-synonymous subsitutions (CGG to CAG, CGG to TGG) and ~1 synonymous substitution (CGG to CGA). ![](https://i.imgur.com/VCGKBfB.png) #### Site 13 No evidence of positive selection -- MEME calls this to be ~**neutral** (dN/dS ~ 1). This site does not **look** neutral based on the substitution pattern, with lots of non-synonymous changes (~7). ![](https://i.imgur.com/K4mrDip.png) #### Site 23 No evidence of positive selection -- MEME calls this to be ~**neutral** (dN/dS ~ 0.5). Two non-synonymous changes (toggle between CAC and CCC). ![](https://i.imgur.com/tiIYJGJ.png) #### Overall. Looking at the distribution of rates across sites inferred by MEME one can see apparent strong site-to-site syonymous rate variation (can also check with BUSTED). ![](https://i.imgur.com/5GCt3kU.png) ### Contrast-FEL For situations like this, where you have two groups of branches with different *phenotypes*, we developed a specialized method that looks for **differences** in selection between them at individual sites: [contrast-FEL](https://academic.oup.com/mbe/article/38/3/1184/5926108). Intuitively, you get more power by testing for differences as opposed to testing for presence of positive selection (PS), because sites can be different even if both sets of branches (or neither set of branches) are under selection. ``` hyphy contrast-fel --alignment /Users/sergei/Downloads/doi_10-6/Mammal_Full.phy --tree /Users/sergei/Downloads/doi_10-6/FINAL_MamTree.tre --branch-set TEST ``` <div style = "font-size:10px"> | Codon | alpha | beta | substitutions | test |LRT p-value|Permutation p-value| |:------:|:--------------:|:----------------------------:|:----------------------------:|:--------------------------------------:|:---------:|:-----------------:| | 8 | 1.360 | 0.015 - 0.709 | 3 | overall | 0.0030 | 0.1429 | | 13 | 3.060 | 0.363 - 3.194 | 5 | overall | 0.0002 | 0.5000 | | 15 | 0.711 | 0.170 - 4.519 | 6 | overall | 0.0000 | 0.3333 | | 23 | 1.128 | 0.000 - 0.739 | 2 | overall | 0.0004 | 1.0000 | </div> ### ** Found _4_ sites with different _overall_ dN/dS at p <= 0.05** ### ### False discovery rate correction There are 4 sites where the overall p-value passes the False Discovery Rate threshold of 0.2 | Codon | q-value | |:-------------:|:--------------------:| | 15 | 0.0000003848 | | 13 | 0.0027120329 | | 23 | 0.0044756641 | | 8 | 0.0223980957 | For all four sites, there is STRONGER selection on the foreground branches (larger second beta value) **This is probably the best method here, and the one I would have recommended for the specific hypothesis you are testing** ### Why PAML and HyPhy difference in site level selection. Possible explanations (would need simulations to pin the source of differences); could be combination of those. 1. Lack of specificity by PAML: pretty much any site with sufficient non-syn variation is called selected. Here you can't really tell if you have the potentual for false positives, because there's almost no non-syn variation outside the key set of sites that would be TRUE positives. 2. Lack of power by HyPhy: because of many zero-branch lengths and the "tall" nature of the alignment, there is insufficient power. 3. Differences in methodology -- Synonymous rate variation (present here) -- For "small" datasets the heavy parametric form of PAML test distribution could actually give you more power (but also false positives)