owned this note
owned this note
Published
Linked with GitHub
# A model to detect positive diversifying selection associated with a phenotype/trait
Consider a phylogenetic tree (**P**) with branches partitioned into two sets: those with a phenotype/trait of interest (**T** or test) and those without (**B** or background); see <span style='background:yellow'>Fig XX.A</span> for an example. To study the evolution of a coding sequence along **P**, we extend the Bayesian Unrestricted Episodic Diversification (BUSTED) codon evolutionary framework <span style='background:yellow'>[ref]</span> to allow independent selective regimes on **F** and **B**. A selective regime, S<sub>T</sub> or S<sub>B</sub>, is described by a distribution of <tt>ω</tt> substitution rate ratios with <tt>K</tt> discrete bins (ω<sub>1</sub>≤ω<sub>2</sub>...≤1≤ω<sub>K</sub>) <span style='background:yellow'>Fig XX.B</span>). For a given site and a branch in **T** (**B**), ω is modeled as a random independent (of other branches) draw from S<sub>F</sub> (S<sub>B</sub>), and the site phylogenetic likelihood is computed as the expectation over all possible draws. The <tt>K</tt> ω values and <tt>K-1</tt> weight parameters </tt> parameters of each distribution are inferred from the data using maximum likelihood.
>TODO: add a sentence summarizing why BUSTED is a good model.
To determine whether or not diversifying positive selection is associated with phenotype/trait for a given gene, we
1. Estimate the ω distributions for **T** and **B** branches from the data, using maximum likelihood.
2. Test for episodic diversifying selection (ω<sub>K</sub> > 1) on branches with the phenotype (**T**), using a likelihood ratio test (LRT), as described in the BUSTED paper.
3. Test for episodic diversifying selection (ω<sub>K</sub> > 1) on branches without the phenotype (**B**).
4. Test whether or not the ω distributions are different between **T** and **B** branches. This is done by fitting the null model where the entire tree has a shared ω distribution, and performing a likelihood ratio test (2K-1 degrees of freedom) versus the model from step (1).
#### Is diversifying selection on a gene associated with phenotype/trait?
Genes where **only** the branches with the phenotype/trait show evidence of positive diversifying selection, **and** the selective regimes are significantly different between **B** and **T** branches are labeled as being associated with the evolution of a trait/phenotype. Genes where selection is found on both sets of branches, but where the selective regimes are significantly different between **B** and **T** branches are further examined to see if selection on **T** branches is intensified compared to **B** branches (higher ω<sub>K</sub> and its associated weight).
---
**Figure XX** BUSTED phenotype association model (BUSTED-PH). **A.** A tree with branches partitioned into Test and Background. **B.** Inferred distribution of ω with 3 bins (5 total parameters) **C.** Examples of different gene classifications based on the results of the three BUSTED-PH tests. **D.** Gene classification workflow.

#### Model validation and extension.
Statisitcal performance of our tests will be assessed using simulated data, designed to capture a broad range of parameter values with a fixed number of replicates (e.g. 10,000-50,000), following the Latin Hypercube Sampling scheme described previously <span style='background:yellow'>[cite]</span>. Specifically, we will investigate the power to detect selection as a function of the size of sets **T** and **B**, sequence divergence levels, effect size (differences between ω distributions), and ensure proper control of Type I error with data simulated under the null model.
We will consider two extensions to BUSTED-PH: (i) site-to-site synonymous rate variation (SRV) <span style='background:yellow'>[cite]</span> and (ii) support for multiple-nucleotide substitutions (MNS) <span style='background:yellow'>[cite]</span>. Previous work <span style='background:yellow'>[cite]</span> has shown that **not** including SRV or MNS can lead to elevated Type I errors due to model misspecification in certain situations. On the other hand, adding these model features also reduces statisitcal power to detect selection, hence they should be added when justified by the nature of the data (i.e., via model testing).
#### Incorporating the effect of continuous phenotypes on evolutionary rates.
If measured or estimated phenotype/trait measurements,<tt>p<sub>b</sub></tt>are available for all branches of the tree, we will parameterize the evolutionary rates to include a dependance on the phenotype, and directly test the effect of phenotype. The specific form of parametrization depends on the phenotype and its hypothesized effect on the rates.
> TODO: RELAX-type scaling, direct contribution to omega, as a moment-controlling hyper-parameter, general models (e.g. GAMs)
#### Model implementation and computational performance.
The models will be implemented in the HyPhy software package <span style='background:yellow'>[cite]</span>; a prototype of BUSTED-PH is available at span style='background:yellow'>[cite]</span>. Computational performance of BUSTED-PH on comparative data is comparable to standard BUSTED-type methods; an average run time (4x AMD Epyc 2 2GHz cores) for 19143 mammalian gene alignments with 30-40 sequences is ~7 miniutes per gene. Maximum likelihood fitting of mixture models by direct optimization can become trapped in local maxima; we are mititaging this by performing rapid initial searches using a draw from the Latin Hypercube Sampler for distributional parameters as starting points, followed by detailed model optimization.