---
tags: Thèse
---

# Vinciane's remarks

Plasmid vs chromosome correlations (heatmaps): quantify the difference (one minus the other)? This would show that the pattern is more 'blocky' on plasmids.

Plasmid/K distribution by phylum: to show the link with the number of plasmids, use an x/y plot rather than the barplot.

Number of families per genome: also compare chromosomal vs plasmid.

Extremophiles: more IS? Take 50 extremophile species and compare them with 50 randomly picked species?

# Interesting results overview

## IS distribution across the tree of life

/!\ Species averaged data

![image.png](https://hackmd.io/_uploads/BydkKHdma.png)

Whole tree: shows clade-specific IS contents.

![image.png](https://hackmd.io/_uploads/HyrMv-_mp.png)

Enterobacteriaceae tree: example showing the importance of lifestyle. Empty genomes correspond to endosymbionts or parasitic species that rely on host functions and have therefore undergone genome reduction and loss of IS elements (presumably after a period of genome expansion). The same can be observed in Burkholderiaceae (tree not shown here).

Another way of highlighting this is to look at the average number of copies in each genus (species averaged data):

![image](https://hackmd.io/_uploads/rJsuuXrEp.png)

Genera with no copies are potential endosymbionts ('Candidatus' denotes bacteria that could not be cultured in the lab, i.e. that depend on things not provided even in a rich environment), and Shigella are obligate pathogens (genomes in transition: see Jane Hawkey's thesis).

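
The species-then-genus averaging behind the genus-level figure could look roughly like the following. This is only a minimal sketch: the `(genus, species, copies)` record layout, the function names, and the toy counts are hypothetical, and the real pipeline may aggregate differently.

```python
from collections import defaultdict
from statistics import mean


def species_averaged(genomes):
    """Average IS copy numbers over the genomes of each species.

    `genomes` is a list of (genus, species, copies) tuples
    (hypothetical layout, for illustration only).
    """
    by_species = defaultdict(list)
    for genus, species, copies in genomes:
        by_species[(genus, species)].append(copies)
    return {key: mean(values) for key, values in by_species.items()}


def genus_means(species_means):
    """Average the species-level means within each genus."""
    by_genus = defaultdict(list)
    for (genus, _species), value in species_means.items():
        by_genus[genus].append(value)
    return {genus: mean(values) for genus, values in by_genus.items()}


# Toy counts, made up for illustration (not real data).
genomes = [
    ("Escherichia", "coli", 40),
    ("Escherichia", "coli", 60),
    ("Escherichia", "albertii", 10),
    ("Buchnera", "aphidicola", 0),
]
per_species = species_averaged(genomes)
per_genus = genus_means(per_species)
```

Averaging by species first keeps heavily sequenced species (many genomes) from dominating the genus-level value.
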
Looking at all genera (averaged data):

![image](https://hackmd.io/_uploads/SyFcdP6ua.png)

The Candidatus genomes are likely to be endosymbionts since their size is reduced:

![image](https://hackmd.io/_uploads/Hkh49Xhda.png)

NB: being an endosymbiont implies fewer IS, but the converse does not hold: many genera with few or no IS are neither 'Candidatus' nor known endosymbionts. This is (I think) why we do not see a correlation between genome size and number of copies when looking at all genomes.

Small Candidatus genomes are mostly empty:

![image](https://hackmd.io/_uploads/Sy_DVY8K6.png)

Using a list of endosymbiotic genera I got from looking at NA Moran's articles:

![image](https://hackmd.io/_uploads/B1QpcDad6.png)

Reducing the list to characterized endosymbionts (obligate vs facultative):

![image](https://hackmd.io/_uploads/Bk9ZA_Itp.png)

![image](https://hackmd.io/_uploads/Bkx-5pLF6.png)

![image](https://hackmd.io/_uploads/Skdgqa8tp.png)

![image](https://hackmd.io/_uploads/SJL7wumnp.png)

Extremophiles:

![image](https://hackmd.io/_uploads/r1KMFpUt6.png)

![image](https://hackmd.io/_uploads/ryVXF6IKp.png)

![image](https://hackmd.io/_uploads/H1uhUuQha.png)

![image](https://hackmd.io/_uploads/SkUz5OQ36.png)

## IS replicon localization

/!\ Non-species averaged data

Depending on the IS family, IS elements can be found predominantly on chromosomes or on plasmids. The vast majority of IS1182, IS1595, IS1634, and IS481 elements are found on chromosomes, whereas IS6, IS91, Tn3, and ISKra4 are mostly found on plasmids. This does not seem to be correlated with conservative vs replicative transposition.

![image.png](https://hackmd.io/_uploads/BJr9-buQa.png)

- Why is there such a stark difference between IS families? Can some integrate more easily into chromosomes? Are the families predominantly found on chromosomes older, mostly inactive families?
- Does this have an impact on their evolutionary dynamics?

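
The per-family split between replicons described in this section could be computed along these lines. This is a minimal sketch, assuming each IS copy is recorded as a `(family, replicon_type)` pair. That input layout, the function name `plasmid_fraction`, and the toy records are hypothetical and not taken from the actual pipeline.

```python
from collections import Counter


def plasmid_fraction(copies):
    """Return the fraction of copies located on plasmids for each IS family.

    `copies` is an iterable of (family, replicon_type) pairs, where
    replicon_type is "chromosome" or "plasmid" (hypothetical layout).
    """
    total = Counter()
    on_plasmid = Counter()
    for family, replicon_type in copies:
        total[family] += 1
        if replicon_type == "plasmid":
            on_plasmid[family] += 1
    return {family: on_plasmid[family] / total[family] for family in total}


# Toy records, made up for illustration (not real data).
copies = [
    ("IS6", "plasmid"),
    ("IS6", "plasmid"),
    ("IS6", "plasmid"),
    ("IS6", "chromosome"),
    ("IS481", "chromosome"),
    ("IS481", "chromosome"),
]
fractions = plasmid_fraction(copies)
```

Since this uses non-species averaged data, heavily sequenced species weigh more in each fraction. Averaging by species first would be a possible variant.
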
![image.png](https://hackmd.io/_uploads/ry7zmuOQT.png)

Across phyla, most IS elements are found on chromosomes (most likely because chromosomes represent most of the genome). But in the case of Euryarchaeota and Deinococcus-Thermus, there is a much bigger fraction of plasmidic elements.

- Why? Is it because genomes in these phyla tend to carry more plasmids? Or because their chromosomes carry fewer IS elements? Does lifestyle play a role here?

![image.png](https://hackmd.io/_uploads/SJtazO_7a.png)

There seems to be a pretty good correlation between carrying more plasmids on average and carrying more plasmidic IS elements.

- Ideally it would be better to have this on species averaged data.

## IS prevalence across clades

/!\ Species averaged data

![image.png](https://hackmd.io/_uploads/S1n_X-_Q6.png)

IS3 is the most prevalent IS family, found in almost 70% of species. The least prevalent family is ISH6, which is only found in a fraction of Euryarchaeota and Planctomycetes.

- Check the Planctomycetes: probably only one genome that contains ISH6.

## IS correlations

/!\ Species averaged data

IS correlations on the whole dataset are very weak. However, stronger correlations can be observed in individual clades (example here with 4 different bacterial families).

![image.png](https://hackmd.io/_uploads/B1uBrZdXT.png)

Correlations in Burkholderiaceae, Enterobacteriaceae, Flavobacteriaceae, Rhizobiaceae (data averaged by species) show that correlations differ from clade to clade.

- Where do these correlations come from? Are they just a product of evolutionary history? (Inherited from a common ancestor?)

/!\ Non species averaged data

![image.png](https://hackmd.io/_uploads/HJPxSZdQa.png)

Looking at correlations on chromosomes versus plasmids in Enterobacteriaceae shows that correlations also differ based on replicon type. A group of IS elements (IS5, IS6, IS607, Tn3, IS110, ISNCY, IS1, IS3) tends to be found on the same plasmids.
This group is responsible for the 'block' aspect of plasmidic IS distribution across Enterobacteriaceae (bottom figure: chromosomal IS vs plasmidic IS, with empty genomes and identifiable endosymbionts removed).

/!\ Species averaged data

![image.png](https://hackmd.io/_uploads/r1fCv-OXp.png)

## IS copy number distribution

/!\ Species averaged data

![image.png](https://hackmd.io/_uploads/Hkw05ZdmT.png)

L-shaped distributions for both chromosomal and plasmidic IS. Distributions are also similar for replicative and conservative transposition-associated elements.

### Threshold at 1 Mb: no IS elements below

Marginal side effect: Pearson statistic is 0.19.

/!\ Non species averaged data

![image](https://hackmd.io/_uploads/Sk_QxZ6Kp.png)

![image](https://hackmd.io/_uploads/rk4l-WTYp.png)

![image](https://hackmd.io/_uploads/BywvmbaYp.png)

## IS family diversity

/!\ Species averaged data

![image.png](https://hackmd.io/_uploads/SJx8oWOm6.png)

The more IS diversity (more IS families), the more copies per family. Contrary to what could be expected, IS families do not seem to compete for an ecological niche in the genomes (otherwise, you would see fewer copies of each family when genomes contain more IS families). Rather, it seems that some genomes are more tolerant of high IS loads than others.

- Why would some genomes be more tolerant than others? Different lifestyle? Smaller effective population sizes (-> reduced selection)?
- Ecological niches: different because of different target sites? Automate a script to look for motifs in sequences around insertions from each family/element

## High-copy number genomes

/!\ Species averaged data

![image.png](https://hackmd.io/_uploads/rklPyFb_7p.png)

Mosaic genomes that contain high numbers of copies from various families (they are not just the product of one IS family expanding). They are found across all clades. In terms of lifestyle, most of them seem to be symbionts, extremophiles, or host-adapted pathogens.
These are scenarios where selection pressure changes (another argument against the neutral model), plus a potential increase in transposition rates in response to stress.

- Confirm this!

## Intraspecies chromosome vs plasmid similarity in IS content

![image.png](https://hackmd.io/_uploads/H1h7BB_X6.png)

According to a Mann-Whitney test, average intra-species Pearson correlations are slightly higher for chromosomal IS content than for plasmid IS content. This is expected given that plasmids move around faster than chromosome parts.

## Genome distance versus IS content similarity

Planning to use FastAAI (https://github.com/cruizperez/FastAAI), or the distance in a tree computed from a mafft alignment? https://www.biostars.org/p/6661/

Problem with FastAAI: not published, and does not seem to have been used much. How reliable is it?

I had previously used pairsnp but I don't think it's a very good method. It is not made to compare highly divergent genomes.

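
Whichever distance ends up being used, the comparison itself could be sketched as below. It is only an illustration under stated assumptions: the `profiles` and `distances` layouts, the function names, and the toy values are hypothetical, and the distances stand in for whatever FastAAI or a tree-based method would return. IS content similarity is taken here as the Pearson correlation between per-family copy-number profiles.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def distance_vs_similarity(profiles, distances):
    """Correlate pairwise genome distance with pairwise IS content similarity.

    profiles:  {genome: [copies per IS family]}        (hypothetical layout)
    distances: {frozenset({genome_a, genome_b}): dist} (hypothetical layout)
    """
    dists, sims = [], []
    for a, b in combinations(sorted(profiles), 2):
        dists.append(distances[frozenset((a, b))])
        sims.append(pearson(profiles[a], profiles[b]))
    return pearson(dists, sims)


# Toy values, made up for illustration (not real data).
profiles = {"g1": [5, 0, 2], "g2": [4, 0, 3], "g3": [0, 6, 1]}
distances = {
    frozenset(("g1", "g2")): 0.1,
    frozenset(("g1", "g3")): 0.8,
    frozenset(("g2", "g3")): 0.9,
}
r = distance_vs_similarity(profiles, distances)
```

A negative `r` would mean that more distant genomes have less similar IS content. Pairwise values are not independent of each other, so assessing significance would need a permutation approach such as a Mantel test rather than the usual p-value for a correlation.
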