# GeneInvestigator (GI)
An application for user-friendly, state-of-the-art molecular sequence analysis.
## Software availability
GitHub repository: [https://github.com/aglucaci/GeneInvestigator](https://github.com/aglucaci/GeneInvestigator)
## Retrieve input data from NCBI Orthologs
GI uses transcript and protein data from orthologous sequences as input.
For example, if we are interested in the TP53 gene: https://www.ncbi.nlm.nih.gov/gene/7157/ortholog/?scope=117570&term=TP53
Download all information:
* Tabular data (Contains accession numbers, species scientific name, and other metadata)
* Reference transcript sequence.
* Reference protein sequence.

The analysis is typically performed on one gene per species, but all transcripts per species can optionally be analyzed.
## Pipeline

## Introduction for TP53
Excerpt from: https://www.ncbi.nlm.nih.gov/gene/7157
>This gene encodes a tumor suppressor protein containing transcriptional activation, DNA binding, and oligomerization domains. The encoded protein responds to diverse cellular stresses to regulate expression of target genes, thereby inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. Mutations in this gene are associated with a variety of human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing of this gene and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants (PMIDs: 12032546, 20937277). [provided by RefSeq, Dec 2016]
## Example Results
### Quick summary text
{Todo}
* Number of species in alignment
* Number of sites in alignment
* GC content?
* Summarize positive and negative sites
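
The first three summary items above could be computed directly from the codon alignment. The following is a minimal sketch, not GI's actual implementation: the function names `read_fasta` and `summarize_alignment` are hypothetical, the input is assumed to be an in-frame nucleotide alignment in FASTA format, and the two toy sequences are made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def summarize_alignment(records):
    """Return sequence count, alignment length, and GC fraction.

    Gaps and ambiguity codes are ignored when computing GC content.
    """
    seqs = [s.upper() for s in records.values()]
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; input is not an alignment")
    n_columns = lengths.pop()
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    acgt = sum(s.count(base) for s in seqs for base in "ACGT")
    return {
        "n_sequences": len(seqs),
        "n_nucleotide_columns": n_columns,
        "n_codon_sites": n_columns // 3,  # assumes an in-frame codon alignment
        "gc_fraction": gc / acgt if acgt else 0.0,
    }


# Toy two-sequence alignment, for illustration only.
example = ">seq_a\nATGGCC---TAA\n>seq_b\nATGGCGAAATAA\n"
summary = summarize_alignment(read_fasta(example))
print(summary)
```

If each species contributes one sequence, `n_sequences` is also the number of species in the alignment.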
### Results of BUSTEDS analysis for a gene-level overview

**Figure 1.** BUSTEDS found evidence (LRT, p-value ≤ 0.05) of gene-wide episodic diversifying selection in the selected test branches of your phylogeny.
Describe the differences in the omega {1,2,3} values and proportions between the unconstrained and constrained models. Is there anything to say about the ER plots? Should the ER plot be thresholded?
### Results of MEME analysis for episodic positive selection
Plotting everything but highlighting (via color) the significant positive sites.

**Figure 2.** MEME analysis of your gene of interest found 126 of 1167 sites to be statistically significant (p-value ≤ 0.1).
### Results of FEL analysis for negative selection
Only plotting the negative sites here

**Figure 3.** FEL analysis of your gene of interest found 411 of 1167 sites to be statistically significant (p-value ≤ 0.1) for pervasive negative/purifying selection.
### Results of FUBAR analysis for pervasive selection
Plot everything, but highlight the negative sites.
"ProbNegative" here corresponds to "Prob[alpha>beta]" from the FUBAR output.

**Figure 4.** FUBAR analysis of your gene of interest found 418 of 1167 sites to be statistically significant (posterior probability threshold 0.9) for pervasive negative/purifying selection, and 2 of 1167 sites to be statistically significant (posterior probability threshold 0.9) for pervasive positive/diversifying selection.
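
The posterior probability thresholding behind counts like these could look roughly as follows. This is a sketch under stated assumptions: `classify_fubar_sites` is a hypothetical helper, the positive-selection posterior is assumed to come from a companion "Prob[alpha<beta]" column, the comparison is assumed to be inclusive (`>=`), and the four toy sites are made up.

```python
def classify_fubar_sites(sites, threshold=0.9):
    """Split sites into negative and positive lists by posterior probability.

    `sites` is a list of (site_number, prob_negative, prob_positive) tuples,
    where prob_negative stands for Prob[alpha>beta] and prob_positive for the
    assumed companion column Prob[alpha<beta].
    """
    negative = [site for site, p_neg, _ in sites if p_neg >= threshold]
    positive = [site for site, _, p_pos in sites if p_pos >= threshold]
    return negative, positive


# Made-up posteriors for four sites, for illustration only.
toy_sites = [(1, 0.97, 0.01), (2, 0.40, 0.35), (3, 0.02, 0.95), (4, 0.90, 0.05)]
negative, positive = classify_fubar_sites(toy_sites)
print(f"{len(negative)} negative and {len(positive)} positive of {len(toy_sites)} sites")
```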
### Results of BGM analysis for coevolving sites
Does not show "triangle" linked coevolving sites, e.g. site pairs (1,2) and (2,3).
"ProbS1andS2" is the posterior probability that sites 1 and 2 are not conditionally independent.
"Shared subs" is the number of substitutions shared between a pair of sites.

**Figure 5.** BGM analysis of your gene of interest found 395 pairs of coevolving sites, out of 1167 total sites, to be statistically significant (posterior probability threshold 0.5).
### Results of aBSREL analysis for episodic diversifying selection on branches
Displaying significant branches and nodes in red.

**Figure 6.** aBSREL analysis of your gene of interest found 38 of 513 branches to be statistically significant (p-value ≤ 0.05) for episodic diversifying selection.
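
The aBSREL supplementary table reports both an uncorrected and a corrected p-value per branch, because every branch is tested separately. The sketch below shows one standard multiple-testing correction, Holm-Bonferroni, which is assumed here to be the procedure behind the corrected column. The helper name and toy p-values are hypothetical, and the code is not taken from HyPhy.

```python
def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up uncorrected p-values for four branches.
toy_p = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_bonferroni(toy_p)
significant = [i for i, p in enumerate(adjusted) if p <= 0.05]
print(adjusted, significant)
```

In the toy data, all four uncorrected p-values are below 0.05, but only two branches remain significant after correction.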
### Which lineages are represented in my alignment?
269 species in the alignment.
Aim for 5-10 lineage descriptions.

What am I left with? 6 clade descriptions

* Mammalia (perhaps subdivide?)
* Osteoglossocephalai = bony fishes
* Sauropsida = reptiles
* Gymnophiona = amphibians
* Erpetoichthys = ropefish (~1% = 2-3 species)
* Batrachia = tailless amphibians (like frogs) (1 species?)
* Semionotiformes = (1 species?)
## Discussion
Is this gene under selection?
Do significant sites from MEME and FEL overlap? They should for positively selected sites, and should not for negatively selected (statistically significant) FEL sites.
Are coevolving sites under positive or negative selection, or linked selection?
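
The MEME/FEL overlap question can be checked with simple set operations once each method's per-site p-values are available. The following is a sketch with made-up numbers. The helper `significant_sites`, the per-site lists, and the `fel_is_negative` flags (standing in for "FEL estimates alpha > beta at this site") are all hypothetical inputs, not GI output.

```python
def significant_sites(p_values, threshold=0.1):
    """Return the set of 1-based site indices with p-value <= threshold."""
    return {i for i, p in enumerate(p_values, start=1) if p <= threshold}


# Made-up per-site p-values for a five-site alignment.
meme_p = [0.50, 0.03, 0.80, 0.09, 0.02]
fel_p = [0.01, 0.70, 0.04, 0.60, 0.05]
# True where FEL's point estimates indicate negative selection (alpha > beta).
fel_is_negative = [True, False, True, False, False]

meme_positive = significant_sites(meme_p)
fel_significant = significant_sites(fel_p)
fel_negative = {s for s in fel_significant if fel_is_negative[s - 1]}
fel_positive = fel_significant - fel_negative

shared_positive = meme_positive & fel_positive   # expected to be non-empty
conflicting = meme_positive & fel_negative       # expected to be empty
print(sorted(shared_positive), sorted(conflicting))
```
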
> BUSTEDS
>
> RELAX, how many lineages do I have? Do analysis between branches.
> aBSREL
> MEME, FEL, SLAC, FUBAR
> GARD, subsample to ~30, Discuss recombination
# References
* HyPhy
* Datamonkey
* All HyPhy methods
* Accessory methods: TN93, MAFFT, IQTREE
* Evolution of Viral Genomes: Interplay Between Selection, Recombination, and Other Forces
# Supplementary Tables
## Table 1. BUSTEDS Fits, test results
## Table 2. MEME -- Significant sites
## Table 3. FEL -- Negative Significant sites
## Table 4. aBSREL
| # | Baseline MG94xREV | Baseline MG94xREV omega ratio | Corrected P-value | Full adaptive model | Full adaptive model (non-synonymous subs/site) | Full adaptive model (synonymous subs/site) | LRT | Nucleotide GTR | Rate Distributions | Rate classes | Uncorrected P-value | original name |
|---:|--------------------:|--------------------------------:|--------------------:|----------------------:|-------------------------------------------------:|---------------------------------------------:|---------:|-----------------:|:--------------------------------------------------------------------------------------|---------------:|----------------------:|:------------------------------------------------------------------------------------------------------------|
| 1 | 0.271015 | 0.248128 | 0.00101124 | 1.24717 | 1.16192 | 0.0852501 | 23.9811 | 0.219682 | [[0.08351739391025588, 0.814034501673088], [25.78312631442658, 0.185965498326912]] | 2 | 2.05536e-06 | nan |
| 2 | 0.247485 | 0.402128 | 0.0224881 | 379.635 | 379.548 | 0.0870347 | 17.756 | 0.193016 | [[0.1794147285931847, 0.8288530517605159], [9090.000000003736, 0.1711469482394841]] | 2 | 4.68503e-05 | nan |
| 3 | 0.0148508 | 0.518222 | 0.000322385 | 0.175039 | 0.172598 | 0.00244112 | 26.2757 | 0.0177287 | [[0, 0.9744626224056592], [987.8020916848121, 0.02553737759434083]] | 2 | 6.49969e-07 | nan |
| 4 | 0.0688569 | 1e+10 | 0.0483007 | 1.04917 | 1.04569 | 0.00347803 | 16.2193 | 0.0859847 | [[0, 0.8800201943708488], [894.04943259379, 0.1199798056291512]] | 2 | 0.000101472 | nan |
| 5 | 0.147321 | 0.783096 | 0.0114947 | 0.658764 | 0.639233 | 0.0195312 | 19.1077 | 0.145414 | [[0.3838030590330714, 0.8444078195665582], [72.96547749683984, 0.1555921804334418]] | 2 | 2.37494e-05 | nan |
| 6 | 0.0494127 | 0.248901 | 0.0391371 | 2.74932 | 2.73446 | 0.0148621 | 16.6416 | 0.0494891 | [[0.08171724964798617, 0.957908917963494], [1557.699535943612, 0.04209108203650602]] | 2 | 8.20484e-05 | nan |
| 7 | 0.214756 | 0.410058 | 0.000978347 | 0.871563 | 0.806064 | 0.0654993 | 24.051 | 0.180228 | [[0.1095580176944264, 0.8408750960441533], [27.01378932551229, 0.1591249039558467]] | 2 | 1.98448e-06 | nan |
| 8 | 0.216423 | 0.404559 | 0.00334964 | 0.909491 | 0.84621 | 0.0632809 | 21.5789 | 0.192698 | [[0.08907333920270652, 0.8257657450393591], [26.96026079730138, 0.1742342549606409]] | 2 | 6.86401e-06 | nan |
| 9 | 0.0193913 | 0.467868 | 0.00343106 | 2.81498 | 2.81013 | 0.00485433 | 21.5269 | 0.0181038 | [[0.1905853889518029, 0.978009820385468], [9383.731856578368, 0.02199017961453198]] | 2 | 7.04531e-06 | nan |
| 10 | 0.0239495 | 0.406299 | 0 | 0.582777 | 0.580037 | 0.00273964 | 121.617 | 0.0215566 | [[0, 0.9502300085151673], [1517.731423200614, 0.04976999148483274]] | 2 | 0 | nan |
| 11 | 0.0562303 | 0.445824 | 0.000189855 | 0.323557 | 0.30473 | 0.0188271 | 27.3394 | 0.0539543 | [[0.2697852411103061, 0.968801038023459], [176.7157679676714, 0.03119896197654104]] | 2 | 3.81235e-07 | nan |
| 12 | 0.040963 | 0.863876 | 0 | 0.40471 | 0.40117 | 0.00353967 | 126.751 | 0.040804 | [[0.05420569097218374, 0.9502046359232122], [811.0053477596006, 0.04979536407678775]] | 2 | 0 | nan |
| 13 | 0.173598 | 0.346965 | 0.0120047 | 2.73766 | 2.72766 | 0.0100021 | 19.0172 | 0.134369 | [[1, 0.8545620545506069], [663.1141152380948, 0.1454379454493931]] | 2 | 2.48545e-05 | nan |
| 14 | 0.114996 | 0.809018 | 0.0028013 | 1.7344 | 1.73436 | 4.28557e-05 | 21.9431 | 0.100148 | [[1, 0.8556210297601691], [100000, 0.1443789702398309]] | 2 | 5.71693e-06 | nan |
| 15 | 0.122433 | 0.478445 | 0.000576403 | 1.01034 | 0.995354 | 0.0149887 | 25.1094 | 0.102575 | [[0.2539175969811507, 0.8727106483495446], [184.3916331081659, 0.1272893516504554]] | 2 | 1.16681e-06 | nan |
| 16 | 0.00817209 | 0.340903 | 0.0347516 | 0.119071 | 0.117071 | 0.00199951 | 16.8821 | 0.0080027 | [[0, 0.993626957337727], [3277.778847969584, 0.00637304266227301]] | 2 | 7.2702e-05 | XM_005749774_1_PREDICTED_Pundamilia_nyererei_cellular_tumor_antigen_p53_like_LOC102203120_mRNA_1 |
| 17 | 0.103901 | 0.293289 | 0.00233172 | 10.3832 | 10.3375 | 0.0456859 | 22.3126 | 0.0981326 | [[0.1357129025907057, 0.9482367361682671], [1557.113665935761, 0.05176326383173291]] | 2 | 4.74892e-06 | XM_007952255_1_PREDICTED_Orycteropus_afer_afer_tumor_protein_p53_TP53_mRNA_1 |
| 18 | 0.0457502 | 0.32037 | 1.27578e-11 | 0.241786 | 0.22455 | 0.0172363 | 60.3389 | 0.0455824 | [[0.02364363647801504, 0.9625574726250689], [123.5301901932855, 0.03744252737493114]] | 2 | 2.53131e-14 | XM_008010192_2_PREDICTED_Chlorocebus_sabaeus_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 19 | 0.356247 | 0.320588 | 7.02686e-07 | 3.65051 | 3.53951 | 0.110996 | 38.5186 | 0.312879 | [[0.1159697821727439, 0.7769833043336732], [50.61114657575612, 0.2230166956663268]] | 2 | 1.40537e-09 | XM_008321788_3_PREDICTED_Cynoglossus_semilaevis_tumor_protein_p53_tp53_mRNA_1 |
| 20 | 0.110145 | 0.398121 | 0.00331385 | 0.673168 | 0.630454 | 0.0427146 | 21.6043 | 0.107742 | [[0.05381489102857581, 0.9132353699648752], [60.12599823633681, 0.0867646300351248]] | 2 | 6.7768e-06 | XM_010571169_1_PREDICTED_Haliaeetus_leucocephalus_tumor_protein_p53_TP53_mRNA_1 |
| 21 | 0.0315783 | 0.366843 | 2.68195e-08 | 6.57414 | 6.56455 | 0.0095854 | 45.047 | 0.0318192 | [[0, 0.9731198669707395], [9090.000000003736, 0.02688013302926051]] | 2 | 5.34253e-11 | XM_011727613_2_PREDICTED_Macaca_nemestrina_tumor_protein_p53_TP53_mRNA_1 |
| 22 | 0.248904 | 0.221391 | 0.0160571 | 102.609 | 102.513 | 0.0962467 | 18.4343 | 0.204709 | [[0.1563804928982661, 0.9012698341888755], [3847.530319680883, 0.09873016581112448]] | 2 | 3.33135e-05 | XM_014019513_1_PREDICTED_Austrofundulus_limnaeus_tumor_protein_p53_tp53_mRNA_1 |
| 23 | 0.121 | 0.319189 | 1.87822e-12 | 4.44712 | 4.42298 | 0.024138 | 64.1631 | 0.11054 | [[0.0168368360727766, 0.885082564787816], [568.7597482656681, 0.114917435212184]] | 2 | 3.71925e-15 | XM_014261886_1_PREDICTED_Pseudopodoces_humilis_tumor_protein_p53_TP53_mRNA_1 |
| 24 | 0.176395 | 0.243335 | 9.1023e-05 | 1.9998 | 1.93635 | 0.0634534 | 28.8093 | 0.15838 | [[0.02835874316489893, 0.8674912613583818], [81.9787911678715, 0.1325087386416182]] | 2 | 1.82411e-07 | XM_014893382_1_PREDICTED_Sturnus_vulgaris_tumor_protein_p53_TP53_partial_mRNA_1 |
| 25 | 0.135205 | 0.7309 | 0 | 81.0506 | 81.0302 | 0.0203742 | 88.5054 | 0.134317 | [[0.1397099080155256, 0.8439130611686997], [9090.000000003736, 0.1560869388313003]] | 2 | 0 | XM_015541817_1_PREDICTED_Panthera_tigris_altaica_tumor_protein_p53_TP53_partial_mRNA_1 |
| 26 | 0.0422313 | 0.749838 | 6.26628e-10 | 7.36183 | 7.35356 | 0.00827325 | 52.5537 | 0.0416236 | [[0.3056022114043056, 0.9651459173578737], [9090.000000003736, 0.03485408264212631]] | 2 | 1.24578e-12 | XM_015562573_1_PREDICTED_Myotis_davidii_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 27 | 0.125564 | 0.678655 | 0 | 35.3403 | 35.3302 | 0.0101213 | 192.752 | 0.124534 | [[0.0676211734326852, 0.8629977433488712], [9090.000000003736, 0.1370022566511288]] | 2 | 0 | XM_015992323_1_PREDICTED_Peromyscus_maniculatus_bairdii_tumor_protein_p53_Tp53_transcript_variant_X1_mRNA_1 |
| 28 | 0.0714231 | 0.796249 | 0 | 13.9542 | 13.9481 | 0.00612411 | 157.307 | 0.0720728 | [[0, 0.9106059867613868], [9090.000000003736, 0.08939401323861318]] | 2 | 0 | XM_016931470_2_PREDICTED_Pan_troglodytes_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 29 | 0.0594683 | 0.427825 | 5.62883e-14 | 0.561761 | 0.551066 | 0.0106943 | 70.5412 | 0.0498929 | [[0.1259844837537699, 0.9514527920641109], [376.2254588321816, 0.04854720793588907]] | 2 | 1.11022e-16 | XM_025195179_1_PREDICTED_Alligator_sinensis_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 30 | 0.041931 | 0.258989 | 0.0219854 | 0.101613 | 0.0799033 | 0.0217093 | 17.8051 | 0.0408591 | [[0.03863496214427223, 0.9690592305995114], [41.23128517951871, 0.03094076940048862]] | 2 | 4.57078e-05 | XM_028780049_1_PREDICTED_Grammomys_surdaster_tumor_protein_p53_Tp53_transcript_variant_X1_mRNA_1 |
| 31 | 0.0610096 | 0.310105 | 0.00420115 | 5.10637 | 5.07529 | 0.0310796 | 21.1196 | 0.0621175 | [[0.1018540946120266, 0.9626603774309264], [1557.699906332703, 0.03733962256907364]] | 2 | 8.64434e-06 | XM_030474375_1_PREDICTED_Strigops_habroptila_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 32 | 0.0550798 | 0.651481 | 0.00619516 | 0.335227 | 0.323262 | 0.0119644 | 20.3421 | 0.0548318 | [[0.2411625338720781, 0.9514469608672388], [193.8137073349213, 0.04855303913276121]] | 2 | 1.27735e-05 | XM_030548110_1_PREDICTED_Gopherus_evgoodei_tumor_protein_p53_TP53_mRNA_1 |
| 33 | 0.0355974 | 1.00236 | 0 | 25.751 | 25.7489 | 0.00206024 | 87.036 | 0.0354803 | [[0.7066674874553129, 0.9554163256482412], [100000, 0.04458367435175881]] | 2 | 0 | XM_030861565_1_PREDICTED_Globicephala_melas_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 34 | 0.120016 | 0.254143 | 0.022593 | 0.893713 | 0.83993 | 0.0537829 | 17.7426 | 0.108033 | [[0.09255911594996524, 0.9144690076952635], [64.15460824206325, 0.08553099230473649]] | 2 | 4.7167e-05 | XM_030970056_1_PREDICTED_Camarhynchus_parvulus_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 35 | 0.0242619 | 0.599177 | 0.000241366 | 0.0477902 | 0.0401435 | 0.00764665 | 26.8568 | 0.0250599 | [[0.2009160516548929, 0.9798753216487776], [83.28854940137336, 0.02012467835122245]] | 2 | 4.85646e-07 | XM_032280430_1_PREDICTED_Sapajus_apella_tumor_protein_p53_TP53_transcript_variant_X1_mRNA_1 |
| 36 | 0.0515846 | 0.500838 | 0.000548347 | 15.0164 | 15.0013 | 0.0150812 | 25.2129 | 0.0499321 | [[0.1917560263656985, 0.9609786199719969], [9090.000000003736, 0.03902138002800315]] | 2 | 1.10777e-06 | XM_032794536_1_PREDICTED_Chelonoidis_abingdonii_tumor_protein_p53_TP53_mRNA_1 |
| 37 | 0.213434 | 0.37986 | 3.69073e-08 | 95.6181 | 95.564 | 0.0540627 | 44.4054 | 0.193231 | [[0.1554275170793705, 0.8360599805867974], [3846.116864262084, 0.1639400194132026]] | 2 | 7.36673e-11 | XM_035135727_1_PREDICTED_Zootoca_vivipara_tumor_protein_p53_TP53_transcript_variant_X2_mRNA_1 |
| 38 | 0.0430373 | 0.588996 | 0 | 0.65763 | 0.649333 | 0.00829716 | 87.2536 | 0.0433632 | [[0.2025151800612972, 0.9628558985353769], [746.4563181259871, 0.0371441014646231]] | 2 | 0 | XM_041664596_1_PREDICTED_Microtus_oregoni_tumor_protein_p53_Tp53_transcript_variant_X1_mRNA_1 |
## Phylogenetic tree analysis, histogram of branch lengths and what is the total tree length?
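
A branch-length histogram and the total tree length could be derived from the Newick tree roughly as shown below. This is a minimal sketch: it assumes an unquoted Newick string in which every number following a colon is a branch length, the helper names are hypothetical, and the toy tree is made up.

```python
import re
from collections import Counter


def branch_lengths(newick):
    """Extract every branch length (the number after ':') from a Newick string."""
    pattern = r":\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)"
    return [float(x) for x in re.findall(pattern, newick)]


def histogram(values, bin_width):
    """Count values per bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    return Counter(int(v // bin_width) for v in values)


# Toy four-taxon tree, for illustration only.
toy_tree = "((A:0.12,B:0.27):0.05,(C:0.31,D:0.15):0.02);"
lengths = branch_lengths(toy_tree)
total_length = sum(lengths)
bins = histogram(lengths, bin_width=0.1)

print(f"total tree length: {total_length:.2f}")
for i in sorted(bins):
    print(f"[{i * 0.1:.1f}, {(i + 1) * 0.1:.1f}): {'#' * bins[i]}")
```

For real trees with quoted labels or bracketed comments, a dedicated tree library would be a safer choice than a regular expression.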
## AlignmentProfiler (Skip for now)
v1 https://colab.research.google.com/drive/1JWvp9zTEqCslH5P2_9UhwQllmOJOGAF9?usp=sharing
v2 https://colab.research.google.com/drive/1fNjBYkIOH9DFD8hdmXJTekj8Hzg-g3Jl?usp=sharing