# Selection Pressure analysis
*Selection pressure analysis is able to characterize mutation sites or mutant branches of species genes or functional genes based on phylogenetic trees, which are essentially based on the calculation of dN to dS ratios and fitting models to obtain statistically significant results.*
## Step 1. Prepare files to analysis
Prepare the protein file, with **faa** as the suffix, and the **cds** file corresponding to the protein, with **fna** as the suffix. To conduct selection pressure analysis, you need to build a phylogenetic tree and its corresponding codon sequence. Therefore, we first use `muscle5` to align the **faa** files to generate **afa** files, `trimal` uses the *Automated1* parameter heuristic to prune the gappy sites to obtain the **msa** files, and use `iqtree` to build the tree.
> We should definitely try to improve alignment quality. There's no gold standard here.
## Step 2. Align code sequences file
For **cds** files, use `macse` to align **cds** files with the guidance of aligned **afa** files.
> **afa** means alignment file, **msa** means trim aligment file
`macse -prog reportGapsAA2NT -align_AA test.afa -seq test.fna -out_NT test_cds.afa`
## Step 3. Trim code sequence file
Use `macse` for the aligned **cds** file, and use the aligned **msa** file to guide the trimming of the aligned **cds** file.
`macse -prog reportMaskAA2NT -align_AA test.afa -align test_cds.afa -mask_AA test.msa -min_NT_to_keep_seq 30 -min_seq_to_keep_site 4 -min_percent_NT_at_ends 0.3 -dist_isolate_AA 3 -min_homology_to_keep_seq 0.3 -min_internal_homology_to_keep_seq 0.5 -out_NT test_cds.msa`
The recommended parameter as follows:
`macse -prog reportMaskAA2NT -align_AA test.afa -align test_cds.afa -mask_AA test.msa -min_seq_to_keep_site 4 -min_p
ercent_NT_at_ends 0.3`
>These parameters will only remove low-quality gaps but not low-quality sequences, which makes it easier to harmonize the number of sequences in the tree file and the alignment file.
The sequence of the **cds** generated at this stage is aligned and trimmed. The detailed parameters are used to control the trimming gap and filter low quality sequences.
## Step 4. Select Pressure analysis by`Hyphy`
Use `hyphy` to perform selection pressure analysis, that is, input the aligned pruned **cds** file and the tree file obtained by `iqtree`. However, it should be noted that after trim the **cds**, some sequences that do not suitable for the conditions will be deleted, so the tree file needs to be extracted and processed by subtrees. One-to-one correspondence with the number of sequences in the **cds** file.
> Custom `python` script **prune_subtree.py** available:
```
import argparse
from ete3 import Tree
parser = argparse.ArgumentParser(description='Process tree file, subtree_taxa file and output file paths.')
parser.add_argument('--tree', type=str, help='Path to the tree file')
parser.add_argument('--subtree_taxa', type=str, help='Path to the subtree_taxa file')
parser.add_argument('--output', type=str, help='Path to the output file')
args = parser.parse_args()
t = Tree(args.tree)
with open(args.subtree_taxa, "r") as f:
subtree_taxa = [line.strip() for line in f if line.strip()]
t.prune(subtree_taxa, preserve_branch_length=True)
t.write(outfile=args.output)
```
After processing, `hyphy absrel` can be used to calculate the selection pressure of the **site-branch** model:
```hyphy absrel --alignment test_cds.msa --tree test.contree --output out.json```
```
aBSREL (adaptive Branch-Site Random Effects Likelihood) is an improved version of the commonly-used "branch-site" models, which are used to test if positive selection has occurred on a proportion of branches. As such, aBSREL models both site-level and branch-level ω
heterogeneity. aBSREL, however, does not test for selection at specific sites. Instead, aBSREL will test, for each branch (or branch of interest) in the phylogeny, whether a proportion of sites have evolved under positive selection.
aBSREL differs from other branch-site model implementations by inferring the optimal number of ω
classes for each branch. For example, the earlier HyPhy branch-site approach (BS-REL) assumed three ω
rate classes for each branch and assigned each site, with some probability, to one of these classes. aBSREL, by contrast, acknowledges that different branches may feature more or less complex evolutionary patterns and hence may be better modeled by more or fewer ω
classes. Specifically, aBSREL uses AICc (small sample AIC) to infer the optimal number of ω
rate classes for each branch.
After aBSREL fits the full adaptive model, the Likelihood Ratio Test is performed at each branch and compares the full model to a null model where branches are not allowed to have rate classes of ω>1
.
aBSREL can be run in two modes:
1. Test a specific hypothesis by a priori selecting a set of "foreground" branches to test for positive selection.
2. Perform an exploratory analysis where all branches are tested for positive selection. In this scenario, p-values at each branch must be corrected for multiple testing (using the Holm-Bonferroni correction). Due to multiple testing, the exploratory approach has much lower power compared to the other approach.
```
> [https://www.hyphy.org/methods/selection-methods/#:~:text=aBSREL%20]
## Step 5. Visualize results
The obtained json file can be visualized on the http://vision.hyphy.org/ website.
**P.S.** To trace the origin of proteins, use the `macse translateNT2AA` module of macse. Using `prodigal` will delete **ORFs** with too low scores, resulting in sequence deletion.
## Refenrence
1. Smith, M. D. et al. Less Is More: An Adaptive Branch-Site Random Effects Model for Efficient Detection of Episodic Diversifying Selection. Molecular Biology and Evolution 32, 1342–1353 (2015).
2. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
3. Ranwez, V., Douzery, E. J. P., Cambon, C., Chantret, N. & Delsuc, F. MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons. Molecular Biology and Evolution 35, 2582–2584 (2018).
4. Minh, B. Q. et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37, 1530–1534 (2020).
1. Edgar, R. C. Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun 13, 6968 (2022).
1. Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution 33, 1635–1638 (2016).