owned this note
owned this note
Published
Linked with GitHub
# Screening an alignment for possible sources of error
We will use a new feature in BUSTED (`--error-sink`) to see if an alignment might have regions that exhibit "odd" variation patterns. The intiution is very simple: all this does is add a evolutionary class with ω>100 and weight <1% to the estimation procedure. Anything that falls into this class is considered "suspect".
For example
```
hyphy busted
--alignment aBSREL_example_files/new_alignments/ENST00000319416.TEX35.fa
--tree aBSREL_example_files/new_alignments/ENST00000319416.TEX35.fa_newdataset_tree.nh
--error-sink Yes
...
### Performing the full (dN/dS > 1 allowed) branch-site model fit
* Log(L) = -3994.44, AIC-c = 8116.68 (63 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.042 | 0.046 | |
| Negative selection | 0.042 | 67.026 | Collapsed rate class |
| Diversifying selection | 1.875 | 32.732 | |
| Error absorption | 8402.894 | 0.196 | |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.241 | 68.714 | |
| 1.228 | 25.748 | |
| 9.358 | 5.538 |
...
----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.0089**.
```
Note how there's a rate class called `Error absorption` with a very high ω estimate (>8000) and a very small weight (~.2%). There's still enough evidence to detect positive selection, which is reflected by the rate class with ω~2 and >30% weight.
Compare it to the standard `BUSTED` run
```
hyphy busted
--alignment aBSREL_example_files/new_alignments/ENST00000319416.TEX35.fa
--tree aBSREL_example_files/new_alignments/ENST00000319416.TEX35.fa_newdataset_tree.nh
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.091 | 6.084 | |
| Negative selection | 0.092 | 71.757 | Collapsed rate class |
| Diversifying selection | 2.955 | 22.159 | |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.226 | 65.312 | |
| 1.141 | 28.512 | |
| 8.536 | 6.176 | |
...
Likelihood ratio test for episodic diversifying positive selection, **p = 0.0000**.
```
The `Diversifying selection` class has a higher rate estimate, but this hides the presence of a very small fraction of "weird" site/branch combinations with ω>100.
For datasets **without** the error component the `Error absorption` class will have no weight assigned to it; for example...
```
### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -6160.75, AIC-c = 12413.98 (46 estimated parameters)
* 1 partition. Total tree length by partition (subs/site) 0.510
* non-synonymous/synonymous rate ratio for *test* = 0.4345
### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -6155.61, AIC-c = 12403.70 (46 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* = 0.4065
### Performing the full (dN/dS > 1 allowed) branch-site model fit
* Log(L) = -6027.27, AIC-c = 12169.27 (57 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.365 | 0.771 | |
| Negative selection | 0.369 | 98.836 | Collapsed rate class |
| Diversifying selection | 27.275 | 0.393 | |
| Error absorption | 113.302 | 0.000 | Not supported by data |
* The following rate distribution for site-to-site **synonymous** rate variation was inferred
| Rate | Proportion, % | Notes |
|-----------------------------------|---------------|-----------------------------------|
| 0.389 | 66.496 | |
| 1.799 | 30.945 | |
| 7.227 | 2.559 | |
```
### How do we find `where` the errors may be like?
To do, so load the BUSTED .json result into https://observablehq.com/@spond/busted
You will be notice a warning tag suggesting that possible errors have been found.
![image.png](https://hackmd.io/_uploads/Sy08-OWXa.png)
Select the `Error sink support` plot type and look for "hot spots"
![image.png](https://hackmd.io/_uploads/r1bnb_ZmT.png)
You can drag a selection box around an area of the plot, e.g. as shown below and a mini-alignment box will appear at the bottom of the plot.
![image.png](https://hackmd.io/_uploads/H1LbfubQp.png)
Here you may notice that `HLmacCal1` and `HLdiaYou3A` each have a stretch of codons that's potentially weird: mutiple nucleotide changes, codons that don't appear in other species, etc.
You can also view a specific codon on the tree, for example, `199` here
![image.png](https://hackmd.io/_uploads/ryYo5uWmp.png)