--- tags: Craig cyanos --- # Getting lost in the phylogenetics of Craig's long branch [toc] # Overview ## 1. What are we even trying to think about here? * Separate from looking at differences between the marine and freshwater picocyanos from a gene/presence absence point of view, we can also look at the genes still shared and how they vary. Indeed, the long branch should have little-to-nothing to do with the difference in gene presence/absence, but be due to differences in genes that are still present in both groups. * The way I'm thinking of this tree business below is that maybe it can help us find which conserved genes (as in present across all) vary the most (here in the context of incongruence of a single gene's tree with the phylogenomic tree). * Maybe this will help us spot targets that are contributing to the long branch. Conserved, so essential, but have undergone distinct evolutionary change in the marine environment (rather than just neutral divergence). That's the idea, as I'm currently thinking about it anyway ¯\\\_(ツ)\_/¯ ## 2. How are things being done for this? * Amy Willis (biostatistician at UWash) and Sarah Teichman (PhD student with Amy) are trying to develop ways of looking at how multiple trees relate to each other. * Below we are looking at some ordinations generated with the [R package](https://github.com/svteichman/TreeVizPackage) Sarah is actively developing for this. They involve the terms in the table below, which we'll mention as we look at the figures and the relevant code to help me/us connect things. ### Terms summary/reminder table |Thing|Info| |------|----| |**RF**| distance metric between trees that considers topology only ([Robinson-Foulds](https://en.wikipedia.org/wiki/Robinson%E2%80%93Foulds_metric))| |**Geodesic (or BHV)**| distance metric between trees that considers topology and branch length (i think introduced in [Billera, Holmes, and Vogtmann 2001](https://www.sciencedirect.com/science/article/pii/S0196885801907596), way over my head)| |**Log map** | a coordinate system we are using below; it does take into account topology and branch length. It is based on the phylogenomic tree, as described by Sarah "*the coordinates of the phylogenomic tree are the lengths of all of its branches*", and the positions of the rest of the trees are in relation to that phylogenomic tree: "*The log map coordinates from the base tree to all of the other trees are the coordinates of the base tree (all of its internal edges) added to the geodesic distance between the new tree and the base tree multiplied by some vector that gives the direction of the new tree from the base tree. The direction stuff is where it gets tricky because tree space is kinda tricky.*"| # RF distances and MDS > RF considers topology only. ## Code stuff The input is an object of class "multiPhylo" which is a list of all phylogenetic trees after being read-in with the `ape` package: ![](https://i.imgur.com/XySyjBM.png) The RF distances between all trees are calculated with the `phangorn` package [treedist function](https://rdrr.io/cran/phangorn/man/treedist.html): ```r tree_dists_rf <- as.matrix(phangorn::RF.dist(trees_complete)) ``` The output of that is our distance matrix (subset shown): ![](https://i.imgur.com/9Jwgwp3.png) We're then doing an NMDS of that with the `vegan` package using the [metaMDS function](https://rdrr.io/cran/vegan/man/metaMDS.html). > **Mike question for Sarah:** There is an argument for "consensus" in the `TreeVizPackage::plot_MDS` (i'm using version 0.0.0.9000), what does that do in this case? ## Ordination **So this is a non-metric multdimensional scaling ordination of the RF (topology only) distances between all trees.** ![](https://i.imgur.com/xURTrp2.png) # Log map and PCA > Log uses geodesic distances, so considers topology and branch length, but is also creating the ordination space with respect to the phylogenomic tree (the coordinates of its location are all of its branch lengths; and the coordinates of the rest of the trees are generating by adding the coordinates of the phylogenomic tree to the geodesic distance between it and the given tree, multiplied by some vector that gives the direction of the new tree from the phylogenomic tree. ## Code stuff The inputs for this are the phylogenomic tree and all of the other trees. The function call looks like this: ```R lm_with_pend <- compute_logmap(cat_path, tree_paths, add_pendant_branches = TRUE, cons_tree = cat_tree) ``` This calls a java program to do all the work that terminology-wise is described above. The output is a table like this: ![](https://i.imgur.com/jbM9HNn.png) Which I take as 192 rows being 1 for each tree, and 119 columns being there are 119 coordinates for each tree. I don't know where that specific number comes from, maybe something to do with the total number of branches or leaves? Or maybe > **Mike question for Sarah:** Why 119 columns? Is that a consequences of the number of branches and/or leaves, or is it a consequence of the data and it was found that 119 is what allows the best multidimensional ordination? Then this is ordinated with a PCA this time. I'm guessing this is because we are operating in some kind of euclidean space now as opposed to before we were using a distance. > **Mike question for Sarah:** Why PCA for when using log map? ## Ordination **So this is a PCA ordination of the of the log map (branch length and topology) made with all trees relative to phylogenomic tree.** ![](https://i.imgur.com/enNmDLt.png) ## Semi-randomly looking at CbiJ, EVE, and phylogenomic trees ### Phylogenomic tree [Interactive version here](https://itol.embl.de/tree/7184244121410631600459049) <a href="https://i.imgur.com/jIaM5vR.png"><img src="https://i.imgur.com/jIaM5vR.png"></a> ### CbiJ gene **CbiJ in tree ordination-space relative to phylogenomic tree** CbiJ is a involved in B12 biosynthesis ![](https://i.imgur.com/Uk9t8Yn.png) **CbiJ tree** [(interactive version)](https://itol.embl.de/tree/13822922232332501612563890#) <a href="https://i.imgur.com/uSDUJqs.png"><img src="https://i.imgur.com/uSDUJqs.png"></a> ### EVE gene **EVE in tree ordination-space relative to phylogenomic tree** Not as clear what EVE does functionally, but likely involved in RNA binding in some facet ![](https://i.imgur.com/pO81KoG.png) **EVE tree** [(interactive version)](https://itol.embl.de/tree/13822922232332841612563909#) ![]() <a href="https://i.imgur.com/wqCf9X5.png"><img src="https://i.imgur.com/wqCf9X5.png"></a> ### Some alignment and tree metrics |Metric|CbiJ|EVE|Phylogenomic| |:-----|----|---|------------| |Alignment length|210|147|91,999| |Alignment length (no gaps)|142|138|47,922| |Total tree length|13.9411|6.9057|6.2097| |Tree length / align. len. ¯\\\_(ツ)\_/¯ | 0.0664 | 0.0470 |0.0000674| |Tree length / ungapped-align. len. ¯\\\_(ツ)\_/¯ | 0.0982 | 0.0500 | 0.000130 | |Median Shannon uncertainty* (of align. columns) |0.3522|0.1520|0.1305| \* calculated with this function: http://scikit-bio.org/docs/0.5.3/generated/skbio.alignment.TabularMSA.conservation.html) **Density plots of Shannon uncertainties (of align. columns)** <a href="https://i.imgur.com/HQcvhZc.png"><img src="https://i.imgur.com/HQcvhZc.png"></a> ## Some thoughts > * The CbiJ tree topology-wise seems to more closely track with the phylogenomic tree than the EVE one – keeping a monophyletic clade of Prochlorococcus (red), and having Prochlorococcus diverge in a sub-clade with respect to Syn RCC307 (which is believed to be the case). > > * Since right now the tree ordination-space includes distance and topology, it would be expected that there are greater distances involved in the CbiJ tree, causing it to be relatively further away in the ordination plot. And this is indeed the case. > > * Sarah asked at some point if I thought branch length should be incorporated, and I did think that/maybe still do. But maybe this is suggesting that branch length shouldn't be included, or that it should maybe be normalized for alignment length somehow? Though normalizing for alignment length the simplistic way I did above maybe isn't sufficient. --- ## Programs/code used ```bash # for shannon uncertainty of multiple-sequence alignments mamba install -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.23 conda activate bit bit-calc-variation-in-msa -h # usage: bit-calc-variation-in-msa [-h] -i INPUT_ALIGNMENT_FASTA # [-g {nan,ignore,error,include}] # [-t {DNA,Protein}] [-o OUTPUT_TSV] # This script takes an alignment in fasta format as input and returns the # Shannon uncertainty values for each column using: http://scikit- # bio.org/docs/0.5.3/generated/skbio.alignment.TabularMSA.conservation.html. In # output "variation" column: 0 is same character in all sequences for that # position (highest conservation); 1 is equal probability of any character # (greatest variability). "Conservation" column is inverse. As written, any # ambiguous bases or residues are converted to gap characters. For version info, # run `bit-version`. bit-calc-variation-in-msa -i run_files/individual_alignments/EVE_aln.faa -o EVE-aln-variation.tsv bit-calc-variation-in-msa -i run_files/individual_alignments/CbiJ_aln.faa -o CbiJ-aln-variation.tsv bit-calc-variation-in-msa -i Aligned_SCGs_mod_names.faa -o phylogenomic-aln-variation.tsv conda deactivate mamba create -n phykit -c jlsteenwyk phykit=0.1.1 conda activate phykit phykit alng run_files/individual_alignments/EVE_aln.faa phykit alng run_files/individual_alignments/CbiJ_aln.faa phykit alng Aligned_SCGs_mod_names.faa phykit tree_len individual_trees/EVE.tre phykit tree_len individual_trees/CbiJ.tre phykit tree_len target-picos-gtotree-picocyano-HMM-1.5.45.tre ```