# Analysis of IS-mediated mutations in the LTEE metagenomics data
## Number of IS-mediated insertions

The lower part of the stacked plot represents fixed IS mutations, and the upper part non-fixed mutations (those that went extinct or remained polymorphic). The first bar represents the ancestral lineage, and the last two bars represent the major and minor lineages when applicable.
Populations with the most IS mutations are Ara+1, Ara-3, Ara-5.
Clearly hypermutability plays a role: few IS mutations are fixed/detected in hypermutator populations. But it is not the only factor at play: in non-hypermutator populations, there are both populations that fix/sample a lot of IS mutations (Ara+1, Ara-5) and populations that fix/sample very few (Ara+5).
Comparison with point mutations (separated between mutators and non-mutators):



Differences in IS mutation numbers are due to IS150, which has been more active in Ara+1, Ara-3, Ara-5 and Ara-6. The activity of other elements is rather consistent across populations.

IS mutations represent a minority of mutations overall.

Significant difference (chi2 test, p-value = 0.0058).
## Cumulative number of IS mutations over time



## Number of fixed IS mutations over time
Appearance of new beneficial mutations seems to slow down in populations with high IS mutation numbers (Ara+1, Ara-5, Ara+3).

Mutations tend to fix as cohorts (several mutations fix at a time, presumably because the same successful genotype is carrying them together).
- How many of these fixed mutations are hitchhiking vs driver mutations?
Ara+1, Ara-3, Ara-6: IS mutation fixation seems to slow down in the last 10k generations, unlike in Ara-5 (at least in the population with the most IS fixed, which continues to fix IS mutations). In populations with low numbers of fixed IS, fixation seems to continue.
Comparison with point mutation accumulation rates (mutator vs non mutator):


## IS mutation sampling over time
Another way of looking at the fixation rate of IS mutations, but over time. I would distinguish 3 groups of populations:
- Ara+1 and Ara-3: sampling a lot of beneficial IS mutations, fixing less than half
- Ara+5, Ara-2, Ara-5, Ara-6: fixing most of beneficial IS mutations
- Ara+2, Ara+3, Ara+4, Ara+6, Ara-1, Ara-4: sampling few beneficial mutations, most of which are not fixed



## IS mutations vs point mutations over time
Dynamics are pretty similar (expansion/reduction of diversity at the same times). I can't detect anything that would look like a transposition burst (more polymorphic IS mutations compared to point mutations).




- fitness data: hard to correlate with specific events because it is very noisy
## Fates of mutations


- fixation
- disappearance (sometimes after reaching quite high frequencies)
- remain at a lower frequency
- oscillations -> oscillations of subpopulations
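
A minimal sketch of how these fates could be assigned automatically from a frequency trajectory. The thresholds (`fixed`, `lost`, `swing`) and the function names are my own assumptions for illustration, not values used in the analysis above.

```python
def count_reversals(freqs, swing):
    """Count direction changes of at least `swing` in a frequency trajectory."""
    reversals, direction, extreme = 0, 0, freqs[0]
    for f in freqs[1:]:
        move = f - extreme
        if direction == 0:
            # No trend yet: wait for a first move of at least `swing`.
            if abs(move) >= swing:
                direction = 1 if move > 0 else -1
                extreme = f
        elif move * direction > 0:
            extreme = f  # trend continues, move the running extreme
        elif abs(move) >= swing:
            reversals += 1  # trend reversed by at least `swing`
            direction = -direction
            extreme = f
    return reversals


def classify_fate(freqs, fixed=0.95, lost=0.05, swing=0.2):
    """Assign one of the fates listed above (thresholds are hypothetical)."""
    final, peak = freqs[-1], max(freqs)
    if final >= fixed:
        return "fixed"
    if final <= lost:
        return "lost after high frequency" if peak >= 0.5 else "lost"
    if count_reversals(freqs, swing) >= 2:
        return "oscillating"
    return "polymorphic"
```

The oscillation rule is deliberately crude (two or more large reversals); with noisy metagenomic frequencies the `swing` value would need tuning.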
## Coincidence between sublineage appearance and IS mutations
Could these mutations play a role in the emergence of the subpopulations? (Not really convinced, there are probably a bunch of other point mutations that emerge at the same time and that could explain the stable survival of two subpopulations. I don't really see why IS mutations would play a specific role here. See [this article on the emergence of the sublineages](https://www.nature.com/articles/s41467-023-39471-9), which is linked to acetate excretion and consumption, which has nothing to do with the genes affected by the IS mutations I see here.)



- yqeB: Partial phylogenetic profiling suggested that yqeB belongs to a system connected to selenium-dependent molybdenum hydroxylases.
- pykF: This gene encodes pyruvate kinase, an enzyme involved in glycolysis, specifically catalyzing the transfer of a phosphate group from phosphoenolpyruvate (PEP) to ADP, resulting in the formation of pyruvate and ATP. Pyruvate kinase is a key enzyme in the glycolytic pathway, contributing to the production of ATP, which is essential for cellular energy.
- hokB: This gene encodes a toxin component of a toxin-antitoxin system. Toxin-antitoxin systems are involved in regulating bacterial growth and survival under stress conditions. The hokB gene typically produces a toxin that can inhibit cell growth or induce cell death, while an antitoxin, often encoded by a neighboring gene, counteracts the toxin's effects to maintain cellular stability.
- hyfA: This gene is part of the hyf operon, which is involved in the assembly and function of the hydrogenase-4 complex in E. coli. The hydrogenase-4 complex is responsible for hydrogen production and is involved in hydrogen metabolism under certain conditions.
## Insertion sites
### Distribution of insertion sites across the genome
Expecting to find the same results as in the clone data.


Insertions are found across almost the entire genome.
### Fixed insertions


### Insertion hotspots

5 or more fixed IS mutations:
| Gene | Approx. position | Populations | Populations with point mutations | IS/total mutations |
| -------- | -------- | -------- | ----------- |----------- |
| mokC | 16,900 | A+1, A+3, A+5, A-1, A-2, A-3, A-5, A-6 | A+3, A-2 | 9/12 |
| menC (IS186 insertion site) | 2,322,300 | A+2, A+3, A-2, A-3, A-5, A-6 | A-3 | 7/8 |
| hokB | 1,462,200 | A+1, A+2, A+6, A-2, A-3, A-5, A-6 | - | 7/7 |
| yqeB | 2,899,600 | A+1, A+2, A-2, A-3 | A+3, A+6, A-1, A-2 | 5/10 |
| kduD | 2,877,300 | A+1, A-1, A-3, A-5 | - | 5/5 |
| ECB_02816 | 3,015,600 | A+4, A-1, A-2, A-3, A-6 | A+3, A+6, A-1, A-4 | 5/12 |
- look at insertion timing + potential associated fitness gain
- look at hotspots not listed here: intergenic? Close genes?
(Descriptions generated using ChatGPT, check them.)
- mokC: The mokC gene in E. coli is involved in the modulation of the Mok (Maintenance of Killer Plasmid) system. This system is associated with the stability and maintenance of certain plasmids in bacteria. The Mok proteins, including MokC, help in preventing the loss of these plasmids during cell division by selectively killing cells that have lost the plasmid.
- menC: The menC gene is part of the menaquinone biosynthesis pathway. Menaquinone, also known as vitamin K2, is a compound essential for electron transport in the bacterial respiratory chain. MenC is involved in the synthesis of a precursor that leads to the production of menaquinone. Menaquinone plays a crucial role in bacterial respiration by transferring electrons, which is important for the generation of energy.
- hokB: The hokB gene in E. coli encodes a toxin component of a toxin-antitoxin system. Toxin-antitoxin systems are involved in various cellular functions such as stress response, programmed cell death, and plasmid stabilization. The HokB toxin, when expressed at high levels, can lead to cell death by disrupting the integrity of the cell membrane.
- kduD: The kduD gene is involved in the metabolism of 2-keto-3-deoxygluconate (KDG). KDG is a derivative of glucose and is part of the catabolic pathway utilized by bacteria to break down complex carbohydrates. KduD is an enzyme that plays a role in converting KDG into other metabolites, contributing to the utilization of specific carbon sources by E. coli.
### Insertion dynamics over time


- mutations at t0: misassignment of read barcodes to those samples during the library prep and sequencing.
### Insertion sites vs essential genes location

Only zone with no IS insertions: peak of essential genes.
### Insertions into essential genes
Some essential genes are targeted by IS insertions:
- motA: Part of the motor complex in flagella; involved in motility.
- yegP: Associated with the cell membrane; precise function may vary.
- hcaC: Involved in the degradation of aromatic compounds like hydroxycinnamic acids.
- yqjA: Its specific function is not very well characterized in E. coli.
- sgcA: Associated with sugar transport or metabolism.
- tdk: Encodes a thymidine kinase, involved in DNA precursor synthesis.
- galF: Catalyzes the interconversion of UDP-galactose and UDP-glucose.
- ydiR: Acts as a transcriptional regulator, involved in various cellular processes.
- yibG: Role not extensively characterized; may have a function related to transport or metabolism.
- acpD: Involved in the acylation of various compounds.
- araG: Involved in the metabolism of arabinose.
- yeeI: Exact function not fully elucidated in E. coli.
- mglB: Involved in methyl-galactoside transport.
- tdcG: Part of the lysine decarboxylase system, involved in pH homeostasis.
- yhhL: Role not extensively characterized; may have a function related to transport or metabolism.
- rsd: Acts as a global regulator of the RNA polymerase sigma factor.
- astB: Involved in arginine succinyltransferase activity.
- hybD: Part of a hydrogenase complex, involved in hydrogen metabolism.
- slp: Associated with surface-layer protein; involved in cell surface structure.
- yieL: Role not extensively characterized; may have a function related to transport or metabolism.
Overall, insertion positions are not located at the end of the genes (except for yibG: insertion 22 bp from the end, 432 bp from the beginning).
Most of these genes are targeted by other kinds of mutations, so these insertions are probably beneficial.

### Proportion of intergenic mutations

Difference is not significant (ztest).
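
For reference, a minimal sketch of a pooled two-proportion z-test, assuming the comparison is between the intergenic fractions of two groups of mutations. The function name and arguments are hypothetical; the actual counts are not reproduced here.

```python
import math


def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test for k1/n1 vs k2/n2 using the pooled proportion."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here `k` would be the number of intergenic mutations and `n` the total number of mutations in each group.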
### GC content at insertion site
Duplicates were removed.
Error bars: 95% confidence interval.

IS150, IS186: no bias. IS1, IS3, IS4: preference for AT-rich regions.
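
A minimal sketch of the GC computation behind this kind of plot, assuming GC content is measured on a sequence window around each unique insertion site and that the 95% confidence interval uses a normal approximation of the mean. Both the window choice and the CI method are assumptions; the function names are hypothetical.

```python
import math


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence window."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def mean_ci95(values):
    """Mean and normal-approximation 95% confidence interval (needs n >= 2)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, mean - half_width, mean + half_width
```

Per IS family, one would call `mean_ci95` on the list of `gc_fraction` values of its insertion windows and compare the interval to the genome-wide GC content.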
### Insertion site motif
Duplicates were removed.
MEME logo: motif found in 118/299 sequences (never the exact motif). The exact motif appears 147 times in the REL606 genome (147 unique sites), 28% of which are intergenic. But I don't think looking at the exact motifs is the right approach here, given that it's not an exact insertion motif like it is for IS186 (see below).

MEME logo: motif found in 42/42 sequences. Motif GGGG(N6)CCCC appears 328 times in the REL606 genome. Motif GGGG(N7)CCCC appears 692 times in the genome. Total of 462 unique motifs. So the possible sites are far from being saturated.

16% of the detected motifs are intergenic, vs 37% of motifs where an insertion occurred.

MEME: no logo found.

MEME: no motif found.

MEME logo: 16/18 sequences

IS4: rho-independent transcriptional terminators?




ARNold: predicts a terminator for 8/17 unique sites.
RNA fold for other sites: one or two possible hairpins.
#### Genomic orientation of IS4 terminator insertions
Closest gene is marked with *.
| IS position | Gene | Upstream gene | Downstream gene | Upstream gene orientation | Downstream gene orientation |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 433,339 | intergenic | lon* | hupB | +1 | +1 |
| 1,128,109 | intergenic | mdoH | yceK* | +1 | +1 |
| 1,294,198 | intergenic | tdk | adhE* | +1 | -1 |
| 1,420,711 | intergenic | ynaK | ECB_01341* | +1 | +1 |
| 1,734,961 | intergenic | lpp | ynhG* | +1 | -1 |
| 2,237,496 | intergenic | proL* | yejO | +1 | -1 |
| 3,147,776 | intergenic | rpoD | ygjF* | +1 | -1 |
| 3,265,657 | intergenic | dacB* | yhbZ | +1 | -1 |
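
A minimal sketch that labels the configuration of the flanking genes for each row of the table above. It assumes +1 means the forward strand and that "upstream" is the gene at the lower coordinate; the function name is hypothetical.

```python
from collections import Counter


def neighbor_configuration(upstream_strand, downstream_strand):
    """Label the relative orientation of the two genes flanking an insertion."""
    if upstream_strand == downstream_strand:
        return "parallel"
    if upstream_strand == 1:
        return "convergent"  # + then -: the genes point toward each other
    return "divergent"       # - then +: the genes point away from each other


# (IS position, upstream gene strand, downstream gene strand), from the table.
rows = [
    (433339, 1, 1), (1128109, 1, 1), (1294198, 1, -1), (1420711, 1, 1),
    (1734961, 1, -1), (2237496, 1, -1), (3147776, 1, -1), (3265657, 1, -1),
]
counts = Counter(neighbor_configuration(up, down) for _, up, down in rows)
```

Under that strand convention, the eight sites split into five convergent and three parallel gene pairs, with no divergent pair.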
### Orientation of IS elements
Duplicates were removed.

No significant differences (Chi2 test).
- Check if there is a difference based on the arm of the chromosome? (Relative to replication)
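
A minimal sketch of a chi-square test for orientation bias, assuming the null hypothesis is an equal number of + and - insertions for a given IS family (one degree of freedom). The test actually run above may have been set up differently (for example as a contingency table across IS families); the function name is hypothetical.

```python
import math


def chi2_equal_orientation(n_plus, n_minus):
    """Goodness-of-fit test of + vs - insertion counts against a 50/50 split."""
    expected = (n_plus + n_minus) / 2
    stat = ((n_plus - expected) ** 2 + (n_minus - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

The same function could be applied separately to each replichore to check the chromosome-arm question above.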
### Orientation of intergenic IS relative to closest gene
Duplicates were removed. Genes found less than 100 bp away from the insertion site were considered.


No parallel configuration where read-through transcription of the downstream gene is possible.
Read-through of the IS is possible. Does the appearance of these mutations correlate with increases in transposition frequencies?
Under-representation of insertions upstream of the closest gene: fewer insertions in the transcriptional regulatory regions? (An insertion upstream of the closest gene is more likely to affect gene expression than one downstream of it.)
### Orientation of intergenic IS relative to neighbor genes (irrespective of distance)
Duplicate insertion sites were removed. Also removed a few insertions for which the positions didn't check out - combination of mutations? Faulty position annotation? These were insertions between parallel genes, so once you remove them it seems like there is a depletion of these insertions. If that is the case: might be because it interrupts operons?
- mention the annotation issue to Jeff

Counter-selection of the divergent configuration because the IS insertion might affect regulators upstream of both genes (vs just one gene potentially affected in the case where genes are parallel)? (Why is only one of the divergent orientations present?)
Divergent orientation: why is there a bias towards + insertions? The sites are different and span the entire chromosome (IS*150* and IS*1* insertions).
## Number of fixed mutations by type in each population and effect of hypermutability

Ideally, update the following plot to take into account the precise state of the population at the moment the mutations happen (the hypermutator populations are not constantly in a hypermutator state).
This is already discussed in the IS paper, but I wanted to see if there were differences based on the type of point mutations.

The ratio of synonymous mutations in hypermutators vs non-mutators is higher than for other types of point mutations. Here is a version with a linear axis to see it better:

(Hyp: more precise tinkering is available because more mutations are sampled simultaneously (mutations that work well together but do not appear together in populations with lower point mutation rates)? Under this hypothesis, what keeps non-mutator populations from a better fitness trajectory is not just the burden accumulated through IS copies but also the inability to reach certain 'subtle' mutation combinations that offer a way of fine-tuning gene expression and effect.)
## Additional point on the notion of IS deletion bias
Some papers defend the idea that there is an IS deletion bias in bacterial genomes such that the deletion rate is slightly higher than the transposition rate (which would allow the bacteria to maintain IS copy numbers somewhat under control). Ira's data show that this is not the case for most IS elements in the LTEE:
