# Evaluating subsampled phylogenetic data for comparative sequence analysis
## The Problem
When you set out to ask a question such as "How has the selective pressure on gene X changed among N species?", you might think it necessary to gather as many sequences as possible. However (speculation), there may be a point of diminishing returns, where less is actually more. Capturing sufficient diversity while reducing the total number of sequences in the dataset might be a useful place to start. As datasets grow larger and larger, methods will have to adapt to handle the workloads... etc
## Abstract
Here, we present an evaluation of two subsampling procedures based on (1) trimming a phylogeny or (2) reducing an alignment based on genetic distances. We run both the "full" dataset and our subsampled datasets through a series of standard techniques in the molecular evolution toolkit. These include FEL and MEME, available through the widely used HyPhy software suite. Our results on empirical and synthetic datasets indicate that an optimal subsampled alignment exists for alignments or phylogenies with short branches (clarify this). The subsampling procedure is applicable to datasets with short branches (or distances) and to situations with large datasets where direct computation is impractical or computationally burdensome (infeasible?).
## Organization of the work
### Evaluation with Treemmer
* Empirical alignment with p53
    * FEL, MEME, FEL-Internal, MEME-Internal
* Empirical alignment with p53, Keep-Aligned variant
    * After subsampling, do not redo the MSA, but infer a new tree.
    * The next iteration would not infer a new tree but rather prune branches (not implemented yet).
* Simulated data (Uniform)
    * FEL, MEME
* Simulated data (Gamma-distributed variation)
    * FEL, MEME
### Evaluation with TN93-Algorithm (Available [Here](https://colab.research.google.com/drive/13sFTPkBMccdD_Qpp8DuWhU4UN2ckVT5d?usp=sharing))
* Empirical alignment from BDNF
    * FEL, MEME, FEL-Internal, MEME-Internal
## Introduction
Molecular sequence data in many areas of science (-omics) are now routinely processed as big data, requiring techniques (software) and resources (hardware) capable of operating at scale. A fundamental inquiry is often conducted on coding sequences in order to evaluate the level of selective pressure. In the context of a protein-coding sequence alignment or phylogeny, one might be interested in extracting the most information about sites of interest (pertaining to functional domains) and the complex relationship between natural selection and mutation. In the biological sciences, alignments and phylogenies often represent direct functional units or (homologous) gene families of interest, and an active area of research is to estimate biologically relevant parameters (and patterns) from this input across species.
Many techniques exist for estimating dN/dS, a commonly used measure in molecular evolution of the extent of selection acting on coding sequences. {explain how they work} {and that they suffer from LRT, p-value based measures, where sites can fall right outside of thresholds on "full" alignments. Our recommendation for subsampling can circumvent this and results in optimal (a more optimal, increased?) sequence (alignment) evaluation. }
While commonly used methods offer substantial improvements in sensitivity and computational time over older techniques, the evaluation of subsampled datasets represents a gap in the knowledge that could be gained from applying such methods to a) very large datasets or b) datasets with short branches (clarify this). In this paper, we address this issue by introducing (alignment or phylogeny) subsampling as an upstream modification in the codon substitution model framework. Subsampling of (alignments or phylogenies) has been given little attention in the literature because of assumptions commonly made when input datasets are created, typically that more sequences will yield more information (more results). {examples in the literature, if any}. Here we motivate and exploit a link with subsampling methods to justify our use of subsamples in place of "full" datasets to estimate selective pressures.
Our empirical results indicate that for several experimental datasets, subsampling only a small fraction (less than half) of the sequences in an alignment (or branches in a phylogeny) gives performance comparable to or better than the full alignment, even with sensitive analyses. This observation holds for alignments with short branches (or small genetic distances), so we recommend the approach for a much wider range of datasets. Furthermore, the subsampling procedure does not require prior knowledge of relationships between taxa (such as taxonomic orders) or of protein functional domains. Nor does it require any analysis of the full dataset (no molecular sequence analysis needs to be conducted; the full alignment does not need to be run through). For instance, in the case of very large datasets (such as... SARS-CoV-2?), obtaining subsamples is much more feasible than obtaining parameter estimates (conducting molecular sequence analysis) on the "full" alignment. In such cases, analyzing the full dataset may be too costly computationally for results to be returned in a reasonable time (for results to be relevant today and not 6 months from now). Together, these results make subsampling suitable for challenging datasets that are not amenable to molecular sequence analysis by other methods.
## Results

### Figure 1. (A) p53 empirical dataset, RTL Decay plot
### Figure 1. (B) p53 empirical dataset, Initial Tree and 80% Tree (Treemmer)

### Figure 2. (A) p53 empirical dataset, impact of subsampling (with Treemmer) on the inference of <u>positively</u> selected sites (p<0.1)
| Method | All Branches | Internal Branches |
| -------- | -------- | -------- |
| FEL |  |  |
| MEME | |  |
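A minimal sketch of how the counts in a table like this could be tallied from per-site FEL estimates, assuming each site comes with a synonymous rate (alpha), a non-synonymous rate (beta), and an LRT p-value. The names `SiteResult` and `classify_sites` are hypothetical, and the example values are invented for illustration; they are not results from this study.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SiteResult(NamedTuple):
    site: int       # 1-based codon position
    alpha: float    # synonymous rate estimate
    beta: float     # non-synonymous rate estimate
    p_value: float  # LRT p-value for the test alpha != beta


def classify_sites(
    results: Iterable[SiteResult], threshold: float = 0.1
) -> Tuple[List[int], List[int]]:
    """Split significant sites (p < threshold) into positive and negative."""
    positive, negative = [], []
    for r in results:
        if r.p_value >= threshold:
            continue  # not significant at this threshold
        if r.beta > r.alpha:
            positive.append(r.site)  # excess of non-synonymous change
        elif r.beta < r.alpha:
            negative.append(r.site)  # deficit of non-synonymous change
    return positive, negative


# Invented example rows (assumption, not study data)
example = [
    SiteResult(1, 0.5, 2.0, 0.03),
    SiteResult(2, 1.2, 0.1, 0.001),
    SiteResult(3, 0.8, 1.0, 0.40),
]
positive_sites, negative_sites = classify_sites(example)
print(len(positive_sites), len(negative_sites))
```

Running the same tally on the full dataset and on each subsample would give one count per cell of the table.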
### Figure 2. (B) p53 empirical dataset, impact of subsampling on the inference of <u>negatively</u> selected sites (p<0.1)
| #\Method | FEL | FEL-Internal|
| -------- | -------- | -------- |
| 1 |  |  |
### Figure 3. Simulated studies (Uniform), evaluated with Treemmer
Constant parameters for this:
* Use the initial tree (100%) from above (TP53).
* Sites (number of codons): 300
* Replicates: 10, but I will only use the first replicate for the analysis below.
* Omega will vary: {0.1, 1.0, 10.0}
* Analysis will similarly stop at the 80% tree.
Simulations use SimulateMG94, which outputs a FASTA file with a tree.
| Method\Omega Value | 0.1 | 1 | 10 |
| -------- | -------- | -------- | -------- |
| FEL: Positive Sites |  |  | |
| FEL: Negative Sites |  |  |  |
| MEME |  |  | |
### Figure 4. Simulated studies (Gamma distributed Site Variation), evaluated with Treemmer
Constant parameters for this:
* Use the initial tree (100%) from above (TP53).
* Sites (number of codons): 300
* Replicates: 1
* Omega will vary: {0.1, 1.0, 10.0}
* Site variation will follow the Gamma distribution
* Analysis will similarly stop at the 80% tree.
| Method\Omega Value | 0.1 | 1 | 10 |
| -------- | -------- | -------- | -------- |
| FEL: Positive Sites |  |  | |
| FEL: Negative Sites |  |  | |
| MEME | | | |
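As a rough illustration of gamma-distributed site variation, the sketch below draws one rate multiplier per codon from a gamma distribution with mean 1. The shape value, the function name `gamma_site_rates`, and the choice of a mean-1 parameterization are assumptions made for this example; which model parameter the multipliers scale in the actual simulations is not specified here.

```python
import random
from typing import List


def gamma_site_rates(n_sites: int, shape: float, seed: int = 0) -> List[float]:
    """Draw one rate multiplier per site from Gamma(shape, scale=1/shape).

    random.gammavariate(alpha, beta) has mean alpha * beta, so using
    beta = 1 / shape keeps the expected multiplier at 1.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


# 300 codons as in the parameter list; shape = 0.5 is an assumed value
rates = gamma_site_rates(300, shape=0.5)
print(min(rates), max(rates))
```

Smaller shape values give a more skewed distribution, with many slow sites and a few fast ones.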
### Figure 5. (A) p53 empirical alignment, Treemmer subsampling but do not realign.
Sequences are kept aligned after subsampling; this allows us to follow sites across alignments.
| Method | All Branches | Internal Branches |
| -------- | -------- | -------- |
| FEL, Positive Sites |  |  |
| FEL, Negative Sites |  |  |
| MEME, Episodic Sites |  |  |
### Figure 5. (B) p53 empirical alignment, Treemmer subsampling but do not realign. We will now 'follow' sites across alignments to observe their behavior.
* For FEL, the union of sites inferred as positively selected is: {110, 119, 125, 203, 297}
* For FEL, the union of sites inferred as negatively selected is: {402 sites}
* For MEME, the union of sites inferred as positively selected is: {117 sites}
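One way such a union could be assembled, and each site then followed across subsampling levels, is sketched below. It assumes the per-level site calls are already available as sets keyed by the percentage of the tree retained. The function names and the site numbers in the example are hypothetical and are not the sites reported here.

```python
from typing import Dict, List, Set


def union_of_sites(sites_by_level: Dict[int, Set[int]]) -> List[int]:
    """Sorted union of sites called at any subsampling level."""
    union: Set[int] = set()
    for sites in sites_by_level.values():
        union |= sites
    return sorted(union)


def follow_sites(sites_by_level: Dict[int, Set[int]]) -> Dict[int, List[bool]]:
    """For each site in the union, record whether it is called at each level.

    Levels are ordered from the largest dataset to the smallest.
    """
    levels = sorted(sites_by_level, reverse=True)
    return {
        site: [site in sites_by_level[level] for level in levels]
        for site in union_of_sites(sites_by_level)
    }


# Invented calls at the 100%, 90% and 80% trees (assumption, not study data)
calls = {100: {10, 25}, 90: {10, 42}, 80: {42}}
print(union_of_sites(calls))
print(follow_sites(calls))
```

Because the alignment is not redone after subsampling, a codon position refers to the same column at every level, which is what makes this bookkeeping valid.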
<iframe src="https://data.hyphy.org/web/Subsampling/p53_KeepAligned_FEL_PositiveSites.html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/p53_KeepAligned_FEL_NegativeSites.html" width="1000" height="600" frameBorder="0"> </iframe>
#### Examining MEME and several of its estimated parameters, p53 empirical
<iframe src="https://data.hyphy.org/web/Subsampling/p53_KeepAligned_MEME_FollowSites.html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/p53_empirical(Treemmer,KeepAligned,MEME){'Parameter':p+_values}[Sitesn=117].html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/p53_empirical(Treemmer,KeepAligned,MEME){'Parameter':beta+}[Sitesn=117].html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/p53_empirical(Treemmer,KeepAligned,MEME){'Parameter':alpha}[Sitesn=117].html" width="1000" height="600" frameBorder="0"> </iframe>
### Figure 6. (A) BDNF empirical alignment. TN93-Algo subsampling, keep-aligned after subsampling, inferring new trees.
| Method | All Branches | Internal Branches |
| -------- | -------- | -------- |
| FEL Positive Sites |  |  |
| FEL Negative Sites |  |  |
| MEME |  |  |
### Figure 6. (B) BDNF empirical alignment. Following sites (as p-values, across alignments).
<iframe src="https://data.hyphy.org/web/Subsampling/BDNF_TN93Algo_FEL_AllBranches_PositiveSites.html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/BDNF_TN93Algo_FEL_InternalBranches_PositiveSites.html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/BDNF_TN93Algo_FEL_AllBranches_NegativeSites.html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/BDNF_TN93Algo_FEL_InternalBranches_NegativeSites.html" width="1000" height="600" frameBorder="0"> </iframe>
MEME
<iframe src="https://data.hyphy.org/web/Subsampling/BDNF_TN93Algo_MEME_AllBranches_PositiveSites.html" width="1000" height="600" frameBorder="0"> </iframe>
<iframe src="https://data.hyphy.org/web/Subsampling/BDNF_TN93Algo_MEME_InternalBranches_PositiveSites.html" width="1000" height="600" frameBorder="0"> </iframe>
## Methods
### Data Retrieval
### Data Preparation
#### Empirical data
#### Simulated data
### Treemmer evaluation
### TN93-Algo
### HyPhy Analysis
#### FEL
#### MEME
### Post-hoc analysis (Output of FEL/MEME)
#### Visualizations
#### Statistics
### Data availability
### Software availability
## Discussion
Maximizing the utility of datasets.
Signal may be masked by artefacts and bias. Our methods aim to limit confounding signal which may be driven by synonymous sequence changes at sites.
We combine two similar approaches (Treemmer and our TN93 subsampling algorithm). One operates on an inferred tree as input, the other on an alignment. We use these for phylogenetic subsampling and inference-based procedures in molecular evolution.
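To make the alignment-based idea concrete, here is a minimal sketch of distance-threshold subsampling. It is not the TN93 algorithm from the notebook: it swaps in the simple p-distance (proportion of differing sites) as a stand-in for TN93, and uses a greedy rule that keeps a sequence only if it is at least `min_distance` away from every sequence already kept. The function names, the threshold, and the toy sequences are all assumptions for illustration.

```python
from typing import Dict


def p_distance(a: str, b: str) -> float:
    """Proportion of differing positions, skipping gaps and ambiguous bases."""
    valid = "ACGT"
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in valid and y in valid:
            compared += 1
            diffs += x != y
    return diffs / compared if compared else 0.0


def greedy_subsample(alignment: Dict[str, str], min_distance: float) -> Dict[str, str]:
    """Keep a sequence only if it is far enough from all sequences kept so far."""
    kept: Dict[str, str] = {}
    for name, seq in alignment.items():
        if all(p_distance(seq, other) >= min_distance for other in kept.values()):
            kept[name] = seq
    return kept


# Toy aligned sequences (invented for illustration)
alignment = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTACGT"}
subsample = greedy_subsample(alignment, min_distance=0.2)
print(list(subsample))
```

The result of a greedy pass depends on input order; a tree-based approach such as Treemmer makes its pruning decisions from leaf-to-leaf distances on the phylogeny instead.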
We use these methods to get more out of a given alignment or phylogenetic tree (dataset) by discovering new sites of interest not found in the initial (100%) dataset.
How sites may vary with their inference of evolutionary rate.
High evolutionary rates can lead to mutation saturation at some sites. Modeling rate heterogeneity.
Information density of subsampled alignments.
We could realize an optimal subsample that maximizes the total number of sites inferred (or total positive sites, or total negative sites depending on biological question of interest.).
Subsampled alignments improve data quality and improve the efficiency of molecular analysis.
Site information content
## References
*Methods references*
[1] Menardo, F., Loiseau, C., Brites, D. et al. Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity. BMC Bioinformatics 19, 164 (2018). https://doi.org/10.1186/s12859-018-2164-8
[2] Sergei L. Kosakovsky Pond, Simon D. W. Frost, Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection, Molecular Biology and Evolution, Volume 22, Issue 5, May 2005, Pages 1208–1222, https://doi.org/10.1093/molbev/msi105
[3] Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Kosakovsky Pond SL (2012) Detecting Individual Sites Subject to Episodic Diversifying Selection. PLoS Genet 8(7): e1002764. https://doi.org/10.1371/journal.pgen.1002764
*Subsampling related references*
1. Phylogenetic Synecdoche Demonstrates Optimality of Subsampling and Improves Recovery of the Blaberoidea Phylogeny. https://www.biorxiv.org/content/10.1101/601237v2.abstract
2. https://www.nature.com/articles/srep28955
## Supplement
For this note there are a few obvious things.
1. Why do some sites drop in and out? Is there a specific pattern? Is that problematic?
2. What is our practical recommendation?
3. MEME, FEL, etc. easily scale up to hundreds of sequences. To demonstrate utility, we need to show at least one “very large” example where subsampling makes an otherwise impractical analysis practical.
4. How can you tell if a data set is a good candidate for data reduction without running the selection analysis?
(For 4, you basically need a rarefaction-type curve.)
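A rarefaction-type curve can be computed without running any selection analysis. The sketch below is one possible version, under the assumption that "diversity" is measured as the number of distinct sequences: it repeatedly draws random subsets of k sequences and averages the count of unique ones. The function name, the diversity measure, and the toy data are all assumptions.

```python
import random
from typing import Dict, List, Sequence


def rarefaction_curve(
    sequences: Sequence[str], sizes: List[int], n_draws: int = 100, seed: int = 0
) -> Dict[int, float]:
    """Mean number of distinct sequences found in random subsets of each size."""
    rng = random.Random(seed)
    pool = list(sequences)
    curve = {}
    for k in sizes:
        total = 0
        for _ in range(n_draws):
            # Sample k sequences without replacement and count unique ones
            total += len(set(rng.sample(pool, k)))
        curve[k] = total / n_draws
    return curve


# Toy dataset with heavy redundancy (invented for illustration)
seqs = ["AAA"] * 5 + ["AAC"] * 3 + ["ACC", "CCC"]
curve = rarefaction_curve(seqs, sizes=[1, 5, 10])
print(curve)
```

A curve that flattens well before the full dataset size would suggest that many sequences are redundant and the dataset is a candidate for reduction.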
For your p-value plots, it would be nice to discuss some archetypal sites (always selected, never selected, only selected if enough sequences are present, etc.).
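A simple way to sort sites into such archetypes is sketched below, assuming each site has a list of p-values ordered from the full dataset down to the smallest subsample. The category labels, the 0.1 threshold, and the function name `classify_trajectory` are assumptions for illustration.

```python
from typing import List


def classify_trajectory(p_values: List[float], threshold: float = 0.1) -> str:
    """Label a site by how its significance changes as sequences are removed.

    p_values must be ordered from the full dataset to the smallest subsample.
    """
    calls = [p < threshold for p in p_values]
    if all(calls):
        return "always selected"
    if not any(calls):
        return "never selected"
    first_loss = calls.index(False)
    # Significant for the largest datasets, then never again once lost
    if first_loss > 0 and not any(calls[first_loss:]):
        return "selected only with enough sequences"
    return "intermittent"


# Invented p-value trajectories (assumption, not study data)
print(classify_trajectory([0.01, 0.02, 0.05]))
print(classify_trajectory([0.01, 0.05, 0.40, 0.60]))
print(classify_trajectory([0.40, 0.05, 0.30]))
```

Sites in the "intermittent" group are the ones that drop in and out, and are probably the most informative to inspect by hand.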