---
tags: Thèse
---

# Model of IS evolutionary dynamics

## Summary

- Iranzo neutral model: fits the data well (including the total), but summing the individual family fits does not reproduce the total distribution
- Iranzo selection model: also fits rather well, but the selection process acts at the family level rather than on the total IS load (which is not what happens in real life)
- wrote a stochastic simulation to test hypotheses at the individual family level
- with the same parameters for each family (so no family-level adjustment), the stochastic model reproduces the total distribution well
- the stochastic simulation confirmed that the sum of neutral-model fits cannot reproduce the total distribution
- used the stochastic simulation to hand-fit the HGT rate and get a reasonably good fit on individual families, but the resulting total distribution is skewed
- tried 3 different methods to approximate $k_{tot}$; none of them works well
- Misc:
    - realistic parameter evaluation: surveyed the literature for existing estimates of the various parameters

## Fitting the data to Iranzo et al's model

### Fitting the data to Iranzo's probability distribution

I started fitting the histogram data (number of copies per genome for each family) to Iranzo et al's model. It's in the script called 'Iranzo_model.py'. For now, the fit isn't great (the tail is terrible: the model does not account for the 'outlier' genomes that contain many copies of a given family).

EDIT: Ivan spotted the mistake. Since I'm not taking $p(k=0)$ into account for now, I need to normalize and look at $p(k)/(1-p_0)$ so that $\sum_k p(k)$ equals 1 and not less.
Once the correction is done, it looks much better:

![](https://i.imgur.com/k6GNa4s.png)

The fit is quite nice for some IS families:

![](https://i.imgur.com/uZvMl3s.png)

![](https://i.imgur.com/W5IueFr.png)

For others, curve_fit does not seem to be able to reach the optimal parameters (I get a much better fit by arbitrarily setting alpha and beta to the values suggested in Iranzo et al's paper than by plotting the curve with curve_fit's suggested values). There is also a subset of IS families for which Python seems unable to compute the best fit at all. This is probably linked to the very high values taken by the gamma function, which overflows to inf when k reaches values around 150-200. Ivan suggested working around the problem with Stirling's approximation of the gamma function for large arguments. By itself it did not work, but I managed to make it work for most IS families by computing the logarithm of the probability and then taking the exponential of that result. This lets me compute loggamma(k) instead of computing gamma(k) directly. The problem remains for IS3 and the total number of IS copies, though. I also added $p(k=0)$ to the formula, and after that curve_fit's trouble with finding the right fit seems to have extended to other IS families.

UPDATE:

1. I reformulated the equation using logs, to avoid computing gamma values so large that curve_fit could not find the best fit.
2. I corrected a mistake (in the choice of bins) that made some of curve_fit's results seem off.

The clean version is in the 'Iranzo_fit.py' file. This seems to have done the trick for almost all families, for which I now have a very nice and convincing fit. (The fit is also quite nice on the total, which is not trivial - see the 'Exploring the total distribution' section.) The only family for which the fit is odd is IS21, and I think it's because the values for that family are themselves off.
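A minimal sketch of this log-space reformulation, assuming the neutral model's stationary law is the negative binomial obtained from detailed balance (with $\alpha = \phi/r$ and $\beta = \gamma/r$; the exact parameterization in Iranzo_fit.py may differ):

```python
import numpy as np
from scipy.special import gammaln  # log-gamma: avoids the overflow of gamma(k) around k = 150-200


def log_pk(k, alpha, beta):
    """Log of the neutral-model stationary distribution (negative binomial):
    p(k) = Gamma(alpha + k) / (Gamma(alpha) k!) * beta^-k * (1 - 1/beta)^alpha,
    valid for beta > 1."""
    k = np.asarray(k, dtype=float)
    return (gammaln(alpha + k) - gammaln(alpha) - gammaln(k + 1)
            - k * np.log(beta) + alpha * np.log1p(-1 / beta))


def pk_no_zero(k, alpha, beta):
    """p(k) renormalized by 1 - p(0), so that the k >= 1 classes sum to 1."""
    p0 = np.exp(log_pk(0, alpha, beta))
    return np.exp(log_pk(k, alpha, beta)) / (1 - p0)
```

`pk_no_zero` can then be handed to `scipy.optimize.curve_fit` against the normalized histogram; everything inside stays finite even at k = 200 because only log-gamma values are ever computed.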
As you can see on the ISEScan vs digIS plot above, IS21 seems to be over-detected by ISEScan, which means that some of the copies in my data are likely false positives. Hence, this 'false' data probably can't be fitted properly with Iranzo's model.

![](https://i.imgur.com/InXgKMu.png)

![](https://i.imgur.com/EZBkpCr.png)

![](https://i.imgur.com/9OhHyJZ.png)

![](https://i.imgur.com/2NPNQfJ.png)

### Enhancing the model?

Ivan is a bit sceptical of the model because it does not explain the LTEE observations (on a short time scale at least, there seems to be a burden associated with a high IS load). He suggested we try formulating a slightly different version of the model, which would take into account the fact that a genome is limited in how many IS it can reasonably carry.

In Iranzo's model, you have:

$\frac{dN_0}{dt} = \gamma N_1 - \phi N_0$

$\frac{dN_{k>0}}{dt} = N_{k-1}(\phi + (k-1)r) - N_{k}(k\gamma + \phi + kr) + N_{k+1}(k+1)\gamma$

with:

- $N_k$: number of genomes with $k$ copies
- $\phi$: horizontal gene transfer rate
- $r$: duplication rate (per copy)
- $\gamma$: deletion rate (per copy)

The idea is to change it to the following model:

$\frac{dN_0}{dt} = \gamma N_1 - \phi N_0$

$\frac{dN_{k>0}}{dt} = N_{k-1}(\phi + (k-1)r) - N_{k}(k^2\gamma + \phi + kr) + N_{k+1}(k+1)^2\gamma$

The only change is that we make the per-copy deletion rate dependent on the number of IS copies, i.e. $\gamma$ becomes $\gamma k$.

NB: there is a slight issue, though: here we make the deletion rate depend on the number of copies *of this family*, but in its current state the model does not account for the other IS families present in the genome... (To give an example: if a genome contains 100 copies of IS3 and 1 copy of IS1, the burden will still be there, but in the model the deletion rate for IS1 copies will remain very low.)
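Both master equations above can be integrated numerically; a sketch, truncating at a maximum copy number kmax, with a flag switching to the $k^2\gamma$ deletion variant:

```python
import numpy as np
from scipy.integrate import solve_ivp


def master_rhs(t, N, phi, r, gamma, quadratic_deletion=False):
    """Right-hand side of the master equation, truncated at kmax = len(N) - 1.
    quadratic_deletion=False gives Iranzo's neutral model (per-copy deletion
    rate gamma); True gives the variant where gamma becomes gamma * k."""
    kmax = len(N) - 1
    k = np.arange(kmax + 1, dtype=float)
    deletion = k**2 * gamma if quadratic_deletion else k * gamma
    dN = -N * (deletion + phi + k * r)       # outflow from each class k
    dN[1:] += N[:-1] * (phi + k[:-1] * r)    # inflow from k-1 (HGT or duplication)
    dN[:-1] += N[1:] * deletion[1:]          # inflow from k+1 (deletion)
    return dN  # NB: probability leaks slightly at kmax; choose kmax large enough


# integrate towards the stationary distribution (illustrative parameters)
N0 = np.zeros(51)
N0[0] = 1.0
sol = solve_ivp(master_rhs, (0, 50), N0, args=(1.0, 0.5, 2.0), method="LSODA")
p_stat = sol.y[:, -1] / sol.y[:, -1].sum()
```

With the flag off, the long-time solution should match the negative-binomial stationary distribution of the neutral model, which gives a direct check of the integration.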
### Simulating the model

See simulation_of_models.py

To see what this change results in, I started by simulating the equations of Iranzo's model and compared the result with the probability distribution described in the paper. It more or less matches, but there's a slight difference which perplexes me a bit (I'm using the same parameters in both).

![](https://i.imgur.com/tHOS37n.png)

### Simulating the model suggested by Ivan

![](https://i.imgur.com/8ynUwdW.png)

![](https://i.imgur.com/7X4HxF5.png)

I also tried fitting this new model to the data:

![](https://i.imgur.com/ZgAD4cH.png)

![](https://i.imgur.com/hnGOOBd.png)

### Simulating Iranzo's second model (with selection)

#### Simulation

See simulation_of_models.py

After sleeping on it, I'm starting to think that we should go with Iranzo's second model, for several reasons:

1. Iranzo's first model does not account for what we see in the LTEE (this has already been discussed).
2. The problem with the formulation of Ivan's model is that the cost of a high number of IS leads to a decrease in copy number (we changed the rate of transfer from $N_k$ to $N_{k-1}$) rather than to the extinction of the genome. This is not very convincing to me: a genome with 'too many' IS is more likely to disappear through competition with other bacteria than to somehow shed a few IS copies. And that extinction process is exactly the one formulated in Iranzo et al's second model (with selection).

However, the issue I raised with our model regarding the separation of IS families is still present in Iranzo et al's second model. Reminder:

> NB: there is a slight issue, though: here we make the deletion rate depend on the number of copies *of this family*, but in its current state the model does not account for the other IS families present in the genome... (To give an example: if a genome contains 100 copies of IS3 and 1 copy of IS1, the burden will still be there, but in the model the deletion rate for IS1 copies will remain very low.)
Therefore, I think we should only apply Iranzo et al's second model to the *total* number of IS, rather than to IS families individually. The formulation of the probability distribution is a bit complex, and I don't think I have the knowledge to implement it alone. But I did simulate the model and test a few parameter combinations, and some yield convincing results on the total number of IS:

![](https://i.imgur.com/fwPkNm7.png)

However, the total number of genomes I was starting with kept diminishing because some genomes went extinct (due to the selection factor). Ivan suggested adding a Lagrange factor to compensate for the loss: you add a multiplicative factor simulating the reproduction of genomes, defined such that the total number of genomes is constant (i.e. the reproduction of genomes exactly compensates for the loss of counter-selected genomes) and all genomes reproduce at the same rate. From this condition, which can be formulated as:

$\lambda \sum_{k \geq 0} N_k = \sum_{k \geq 1} N_{k}ks$

we deduce:

$\lambda = \frac{\sum_{k \geq 1} N_{k}ks}{\sum_{k \geq 0} N_k}$

Once the Lagrange factor is added, the number of genomes stays stable in the simulation.

#### Using the probability distribution

See Iranzo_selection_model.py

I've started working on the implementation of the probability distribution:

![](https://i.imgur.com/u4azb31.png)

But I'm stuck on a problem: the values of the p* eigenvector are not necessarily positive (and also a bit chaotic). I can't find what I did wrong... Ivan said he'd take a look.

Update: Ivan managed to solve it. It is (again) a problem of approximation of very small values, which disappears when I compute the eigenvector with higher parameter values (it works with values close to what I approximated by hand with my simulation).
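A sketch of one simulation step with this Lagrange correction (a simple explicit-Euler discretization, not necessarily the scheme used in simulation_of_models.py; truncated at kmax):

```python
import numpy as np


def selection_step(N, phi, r, gamma, s, dt):
    """One explicit-Euler step of the selection model. Each genome with k
    copies dies at rate k*s; the Lagrange factor
    lambda = sum_k(N_k * k * s) / sum_k(N_k)
    makes all genomes reproduce uniformly so the population size is constant."""
    kmax = len(N) - 1
    k = np.arange(kmax + 1, dtype=float)
    dN = -N * (k * gamma + phi + k * r + k * s)  # outflow: deletion, HGT, duplication, selective death
    dN[1:] += N[:-1] * (phi + k[:-1] * r)        # inflow from k-1
    dN[:-1] += N[1:] * (k[1:] * gamma)           # inflow from k+1
    lam = (N * k * s).sum() / N.sum()            # Lagrange factor
    dN += lam * N                                # uniform reproduction compensating the deaths
    return N + dt * dN
```

With this choice of lambda, the total number of genomes is conserved (up to the small leak at kmax), which can be checked by iterating the step.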
![](https://i.imgur.com/2x70QEd.png)

![](https://i.imgur.com/nxqUNWK.png)

![](https://i.imgur.com/FtSY3zS.png)

#### Fitting the probability distribution

I then implemented a curve fit and obtained the following parameters:

- $\alpha \approx 9.14$
- $\beta \approx 8.2$
- $\sigma \approx 0.21$

### Realistic parameter value estimation

The question is: are the parameters we infer from the model consistent with what is observed in real life? We can approach this question either by looking for estimates in the literature, or by computing them from the empirical data and the LTEE data. From my hand fit of the selection model, I have:

- $r \approx 10d$
- $h \approx 10d$
- $s \approx 0.1d$

#### In the literature

##### Bichsel et al 2010: The early phase of a bacterial insertion sequence infection

'Conservative transposition occurs with a rate of around $10^{-7}$ to $10^{-4}$ events per cell and host cell generation (Chandler and Mahillon, 2002; Kleckner, 1989). We assume the replicative transposition rate to be a few orders of magnitude smaller (Tavakoli and Derbyshire, 2001). IS excision rates are lower than replicative transposition rates (Egner and Berg, 1981). For example, IS10 is excised from the genome at a rate of around $10^{-10}$ per cell and host cell generation, whereas its conservative transposition rate is $10^{-4}$ per cell and host cell generation (Kleckner, 1989). Similarly, transposon Tn5, a mobile DNA sequence flanked by two copies of IS50, has an excision rate of $10^{-6}$ to $10^{-5}$ and a conservative transposition rate of $10^{-3}$ to $10^{-2}$ (Berg, 1977). HGT rates vary widely and depend on many environmental factors. For viral transduction in marine bacteria, rates of between $1.6 \times 10^{-8}$ and $3.7 \times 10^{-8}$ transductants per colony-forming unit have been reported (Jiang and Paul, 1998).
For the conjugational transfer of plasmids in diverse seawater bacteria, $2.3 \times 10^{-6}$ to $5.6 \times 10^{-5}$ transconjugants per recipient cell have been found after 3 days of incubation (Dahlberg et al., 1998). For transformation involving epilithic bacteria from a river, in situ rates of $2.2 \times 10^{-6}$ to $1.0 \times 10^{-3}$ events per recipient cell have been reported per 24 h incubation time (Williams et al., 1996). Note that in this case, the transformation occurred in cells that were fixed on a surface, i.e. not in a well-mixed environment as we assume in our models. No information is available about the fitness cost caused by ISs.' (The early phase of a bacterial insertion sequence infection, Bichsel et al 2010)

![](https://i.imgur.com/OTl7UgY.png)

These estimates are mostly compatible with my hand fit, except for the HGT rate, which is lower in my hand fit. But they apparently lumped together *all* HGT, and not all HGT events involve an IS, so that might not be such a huge problem.

NB: that article models something very similar to Iranzo et al; I should cite it too.

##### Bichsel et al 2013: Estimating the fitness cost of an insertion sequence

![](https://i.imgur.com/NggAISu.png)

Same rates as in their previous paper, but the sources are indicated. Conclusion of the paper: $s \approx 10^{-9}$ to $10^{-7}$. Higher than what I have in my hand fit.

NB: to read too.

##### Lee et al 2015

$r \approx 3.5 \times 10^{-4}$

### Defining a stochastic model to test the selection model on individual families

The problem with the selection model is that it does not give you access to the individual family distributions. To get around this and test whether the selection model yields realistic distributions at the IS family level, I implemented a stochastic model simulating the evolution of IS copy numbers at the genome level. The script is called stochastic_model_Ivan.py. The idea is to simulate the evolution of a given number of genomes.
To avoid waiting too long (given the very small parameter values, nothing happens during most generations), I added a Gillespie algorithm: instead of simulating each generation, you draw the time it would take for each possible event (a duplication, a deletion...) to happen, and the event with the smallest waiting time is the one that actually occurs. That way, you jump forward by thousands of generations while looping only once through the calculations. I also added a Lagrange factor to equilibrate the population (given that we lose some genomes to selection): at each timestep, I look at how many genomes were lost and draw as many genomes from the surviving ones; these genomes duplicate, so I end up with the same number of genomes I started with.

I simulated this model with a 10,000-generation timestep for 200 million generations, with the parameters fitted using Iranzo et al's selection model (see above: s = 1e-9, r = 4.35e-8, h = 3.9e-8/27 per family (total h = 3.9e-8), d = 4.76e-9), and it seems to be working:

![](https://i.imgur.com/o2V7Boc.png)

The system seems to stabilize around 200M generations (here is the evolution of the average number of IS copies per genome, with a 100,000-generation timestep):

![](https://i.imgur.com/OMET3i9.png)

NB: I also simulated the model with a null selection pressure, which is essentially the neutral model. While the individual IS families end up with a distribution that has the same shape as the empirical ones (basically a decreasing exponential), the total does not, which is a nice way of showing that without selection, you can't entirely recapitulate the distribution of IS (you can fit the individual families but not the total). Below is an example for the IS110 family as well as the total distribution:

![](https://i.imgur.com/OjXRSmU.png)

![](https://i.imgur.com/Uc2Q1bX.png)

The next step would be to fit the model to all the individual IS families.
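The event-driven idea can be illustrated on a single lineage (a sketch using Gillespie's direct method, which is statistically equivalent to drawing one waiting time per event and keeping the smallest; no selection, hypothetical rates):

```python
import numpy as np

rng = np.random.default_rng(0)


def gillespie_lineage(k0, r, d, h, t_max):
    """Follow one genome's copy number for a single IS family (no selection).
    Instead of looping over generations, jump straight to the next event:
    draw the waiting time from the total rate, then pick which event
    happened in proportion to its rate."""
    k, t = k0, 0.0
    while True:
        gain = h + k * r   # HGT or duplication adds a copy
        loss = k * d       # deletion removes a copy
        total = gain + loss
        t += rng.exponential(1 / total)
        if t > t_max:
            return k
        k += 1 if rng.random() < gain / total else -1
```

As a sanity check: with r = 0 this is a plain birth-death process whose stationary distribution is Poisson with mean h/d, so the long-run average copy number over many replicate lineages should approach h/d.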
The problem is that there are too many parameters (with an individual duplication, deletion, HGT, and counter-selection rate for each of the 27 IS families, you have 27x4 = 108 parameters!): fitting them numerically would take ages. Ivan suggested I check whether we can fit the data with fixed duplication, deletion, and counter-selection rates, letting only the HGT rate vary at the IS family level.

NB: the HGT rate inferred this way does not really reflect the biological HGT rate of an IS family. Indeed, the model accounts for neither phylogeny nor ecology. As an example, imagine an IS that is only present in a specific environment. This IS might have a huge HGT rate and be transferred to all other bacteria in that environment, but it will never reach bacteria elsewhere. In the end, it will only be found in very specific genomes. With the current approach, I would infer a very small HGT rate from the data (because the IS is absent from the overwhelming majority of genomes), when in reality the HGT rate is huge. So here, the inferred HGT rate is more an indication of an IS's ubiquity than an estimate of the actual biological HGT rate.

I fitted the data by hand (starting from the fitted parameters of Iranzo's selection model), and while I manage to get pretty good fits (see below), I have to change the total HGT rate (around 4.4e-8 instead of 3.9e-8) and the duplication rate (2.2e-8 instead of 4.35e-8). (The script is called stochastic_model_adjust_hgt.py.) This results in a skewed IS total, with more genomes containing a low number of IS:

![](https://i.imgur.com/ydgMLPP.png)

![](https://i.imgur.com/kJyO8Cd.png)

Since the fits at the individual family level aren't that bad, it might be worth trying to do the fit numerically (while taking the total into account) to see if there is a solution that reconciles the individual family level and the total.
## Selection model at the IS family level

To try to find the right parameters at the family level without an endless computation time, I tried formulating an approximate selection model at the IS family level. The main problem at the IS family level is that the selection rate depends not only on the number of copies belonging to that family, but on the total number of copies. To get around this, we can try to approximate the total number of copies from the number of copies of a given IS family.

The idea is to work from the total distribution. Say we have $k_f$ copies of family $f$ in a genome. We want to approximate the total selection burden on that genome, which depends on $k_{tot}$, the total number of IS copies in the genome. To do that, we use the total probability distribution $P(k_{tot} = k)$: since $k_{tot} \geq k_f$, we know that $P(k_{tot} = k) = 0$ for all $k < k_f$. We can therefore re-normalize the $P(k_{tot} = k)$ curve to obtain $P(k_{tot} = k | k_f)$ (i.e. the probability of having $k$ copies in total, knowing we already have at least $k_f$).

![](https://i.imgur.com/PkvQ1Se.png)

In turn, we can use the distribution $P(k_{tot} = k | k_f)$ to compute the total selection burden $S$. If the selection burden $s$ associated with an individual IS copy is identical for all IS families, we have:

$S_{k_f} = \sum_{k \geq k_f} s \times k \times P(k_{tot} = k | k_f) = s \sum_{k \geq k_f} k \times P(k_{tot} = k | k_f)$

We can then express the matrix A defined in Iranzo et al's paper at the family level: the coefficients are the same, except that $k_f \times \sigma$ becomes $k_{tot} \times \sigma$, with $k_{tot} = \sum_{k \geq k_f} k \times P(k_{tot} = k | k_f)$: we want to multiply $\sigma$ by the total number of IS in the genome, not just by $k_f$.

The script is called family_per_family_selection_model_fit.py. For now it does not seem to be working too well.
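The truncate-and-renormalize step of this first method can be sketched as:

```python
import numpy as np


def expected_ktot(p_tot, k_f):
    """Method 1: approximate E[k_tot | k_f] by zeroing out the k < k_f part of
    the total copy-number distribution and renormalizing the remainder.
    p_tot[k] is the total distribution P(k_tot = k)."""
    k = np.arange(len(p_tot))
    p = np.where(k >= k_f, p_tot, 0.0)  # P(k_tot = k) = 0 for k < k_f
    p = p / p.sum()                     # renormalize to get P(k_tot = k | k_f)
    return float((k * p).sum())
```

The per-genome selection burden is then s times this conditional expectation, which is what replaces $k_f \times \sigma$ in the matrix A.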
The approximation of $k_{tot}$ is systematically too low (except for very low values of $k_f$):

![](https://i.imgur.com/kHJJus9.png)

As could be expected, the approximate probability distribution for the total (obtained with the parameters from the selection fit to the total) does not match the selection model fitted to the total:

![](https://i.imgur.com/NGkRLz9.png)

What puzzles me a bit is that, judging from the approximate probability distribution, selection seems too strong (the distribution is shifted towards low values compared to the initial fit). But the graph above shows that $k_{tot}$ is *under*estimated, so if the problem came from the poor approximation of $k_{tot}$, the shift should be towards *higher* values... I think the design of my test might be wrong: I can't infer the right total distribution from equations written for ISs at the family level. Indeed, say you have a certain value of $k_f$: it will be corrected to a higher $k_{tot}$ value. But if you apply that correction to a $k_f$ that in reality *already* corresponds to the total number of IS in a genome, what you are effectively doing is overestimating the total number of IS and, hence, the selection level.

I looked at the results when fitting various IS families; overall, the fits aren't great (likely because of the bad estimation of $k_{tot}$).

I tried another method (see family_per_family_selection_model_second_method.py). The idea is to approximate $k_{tot}$ using the slope of the linear regression shown below, instead of the probability distribution:

![](https://i.imgur.com/oLiNoRC.png)

This time, you can sanity-check the model on the total, which I did: it recovers the right parameters. At the IS level, the fits seem pretty okay.
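The regression step of this second method can be sketched as (assuming a hypothetical genomes-by-families matrix of copy numbers):

```python
import numpy as np


def ktot_regression(copy_matrix, fam_idx):
    """Method 2: linear regression k_tot ~ a * k_f + b for one family, using
    only the genomes that actually contain that family.
    copy_matrix: genomes x families matrix of copy numbers."""
    k_f = copy_matrix[:, fam_idx]
    k_tot = copy_matrix.sum(axis=1)
    mask = k_f > 0
    a, b = np.polyfit(k_f[mask], k_tot[mask], 1)
    return a, b
```

The fitted slope a is then used to map $k_f$ to an approximate $k_{tot}$ inside the family-level selection term.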
![](https://i.imgur.com/10YLXvh.png)

| IS | $\alpha$ | $\beta$ | $\sigma$ | s | d | h | r |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| IS1 | 320.3 | 7.7 | 32.9 | 1e-9 | 3.0e-11 | 2.31e-10 | 9.6e-9 |
| IS110 | 26.7 | 8.2 | 0.8 | 1e-9 | 1.25e-9 | 1e-8 | 3.3e-8 |
| IS1182 | 53.6 | 8.2 | 2.5 | 1e-9 | 4e-10 | 3.28e-9 | 2.1e-8 |
| IS1380 | 125.0 | 8.1 | 6.5 | 1e-9 | 1.5e-10 | 1.2e-9 | 1.9e-8 |
| IS1595 | 57.7 | 8.2 | 7.8 | 1e-9 | 1.3e-10 | 1.1e-9 | 7.5e-9 |
| IS1634 | 188.7 | 7.6 | 5.8 | 1e-9 | 1.7e-10 | 1.3e-9 | 3.2e-8 |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |
| | | | | 1e-9 | | | |

I started calculating the duplication, deletion and HGT rates from the fitted parameters to test them in the simulation (with $s = 1 \times 10^{-9}$ for the selection burden, as estimated by Bichsel et al). On average, the duplication rate is $\langle r \rangle = 2.7 \times 10^{-8}$, the deletion rate is $\langle d \rangle = 6.5 \times 10^{-10}$ and the total HGT rate is $h_{total} = 8.8 \times 10^{-8}$.

When I run the simulation, I run into a problem: I don't reach a stable state; instead, the number of IS copies keeps increasing (probably because of the high average duplication rate).

![](https://i.imgur.com/K5NLAbN.png)

Ivan mentioned this was expected: when I use the information contained in the $k_{tot} = f(k_f)$ linear regression, I add information about the co-occurrence of IS families (i.e. about the phylogeny), information that is absent from the stochastic model simulation. To give a theoretical example: say IS5, IS21 and IS3 tend to co-occur in genomes.
This means that genomes containing IS5 copies will tend to contain many IS copies in total (because they contain all three families), as opposed to genomes containing a family uncorrelated with the others (these genomes may or may not contain copies from other families). Hence, for IS5, the slope of the linear regression will tend to be higher than for an uncorrelated family. This in turn impacts the inferred parameters (the estimated slope affects the inferred $\sigma = s/d$: the higher the slope, the smaller the $\sigma$ needed to fit a given dataset). But when I apply the $\sigma$ inferred for IS5 in the stochastic model *without* forcing a correlation between IS5 and the other families, that inferred $\sigma$ is too small to reproduce the empirical data.

(This reasoning is also consistent with the explosion of IS I observe when implementing the inferred parameters in the stochastic simulation: for IS families that are correlated with others, $\sigma$ is too low, so the copy number explodes. I should be able to check this reasoning by looking at which IS families are over-represented in the simulation compared to the empirical data: it should be the families that tend to be correlated with others, and the effect should be smaller for families with weak correlations to others.)

When looking at the average correlation between an IS family and the other families, this mostly checks out (IS_family_total_vs_dependency.py):

![](https://i.imgur.com/QoEaPv2.png)

Here we can see that the families with the highest correlation to others are IS110, IS4, IS630, IS3 and IS66. And when we look at the result of the stochastic simulation with the fitted parameters, we get this (all the fits are on morrowind in my folder: stochastic_model_simulation/Figures/fitting_IS_families_approximated_model_method2):

![](https://i.imgur.com/rv9rGal.png)

So, on average, the families with high correlations to others generate distributions that overshoot the data.
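For reference, the back-calculation of per-copy rates from a fitted $(\alpha, \beta, \sigma)$ with $s$ fixed can be sketched as follows, assuming the parameterization consistent with the table above and with $\sigma = s/d$, namely $\alpha = r/d$ and $\beta = h/d$:

```python
def rates_from_fit(alpha, beta, sigma, s=1e-9):
    """Recover per-copy rates from a fitted (alpha, beta, sigma), with the
    counter-selection cost s fixed (Bichsel et al's estimate by default),
    assuming alpha = r/d, beta = h/d and sigma = s/d."""
    d = s / sigma   # deletion rate
    r = alpha * d   # duplication rate
    h = beta * d    # HGT rate
    return r, d, h
```

Plugging in a row of the table (e.g. IS110: alpha = 26.7, beta = 8.2, sigma = 0.8) reproduces the tabulated d, h and r values.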
### Third approximation method for $k_{tot}$

method3_ktot_estimation.py
family_per_family_selection_model_third_method.ipynb

The problem with method 1 is that it does not actually recapitulate what we have in the data (underestimation of $k_{tot}$). The problem with method 2 is that it introduces phylogenetic structure into a model that is supposed to be blind to that information: what we want is a 'generic' relationship between $k_f$ and $k_{tot}$, rather than the specific relationship between, say, $k_{IS110}$ and $k_{tot}$ (which partly reflects the phylogenetic structure in the data).

To achieve this, the idea is to do the same thing as in method 2, but instead of looking at each IS family separately, we aggregate the information from all IS families. For each value $i$, we look at each IS family and compute the average $k_{tot}$ in genomes that contain $k_f = i$ copies of that family. We then average that value across all families. This is shown below (the size of the points indicates how many genomes were used to obtain each point). By fitting this curve, I should obtain a $k_{tot} = f(k_f)$ relationship for 'the average IS family'.

![](https://i.imgur.com/tBnvde1.png)

My idea is to retry fitting the data with Iranzo's adapted selection model, using this method to approximate $k_{tot}$ from $k_f$. I'll test the fits in the stochastic model: can I reproduce each family's distribution with the parameters fitted this way? If so, even if the sum of the fits does not recapitulate the total distribution, I can use the model to explore the parameter space more thoroughly than I could with the stochastic model alone (which takes too long to run).

I fitted the data, keeping only data points obtained from at least 20 genomes.

![](https://i.imgur.com/x1UuGhr.png)

I then used that fit to correct the Iranzo model (by replacing $k_f$ with $k_{tot}$) and fitted parameters to each IS family distribution separately.
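The aggregation underlying this third method can be sketched as (same hypothetical genomes-by-families matrix as before):

```python
import numpy as np


def average_ktot_curve(copy_matrix, min_genomes=20):
    """Method 3: for each copy number i, collect <k_tot | k_f = i> for every
    family separately, keeping only (family, i) points backed by at least
    min_genomes genomes, then average across families."""
    k_tot = copy_matrix.sum(axis=1)
    per_i = {}  # i -> list of per-family averages of k_tot
    for f in range(copy_matrix.shape[1]):
        k_f = copy_matrix[:, f]
        for i in np.unique(k_f[k_f > 0]):
            sel = k_f == i
            if sel.sum() >= min_genomes:
                per_i.setdefault(int(i), []).append(k_tot[sel].mean())
    return {i: float(np.mean(v)) for i, v in sorted(per_i.items())}
```

Fitting a curve through the returned points gives the 'average IS family' relationship $k_{tot} = f(k_f)$.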
I then deduced a deletion, duplication and HGT rate for each family from $\alpha_f$, $\beta_f$ and $\sigma_f$, by setting the counter-selection rate to $s = 10^{-9}$.

![](https://i.imgur.com/FSp1sis.png)

Those rates don't make any sense: you can't have thousands of deletions or duplications in a single generation... So I guess this method fails as well.

## Integrating the phylogeny

### Theoretical idea

Ivan's suggestion is the following: we start by adding the phylogeny to Iranzo's (wrong) selection model, to see whether we still get a discrepancy in the distribution of the total number of IS. To do that:

- We estimate the number of genomes that cannot contain the IS (for phylogenetic and ecological reasons) using the theoretical $P(k=0)$ obtained by fitting the data for $k > 0$. We thus have a theoretical number $N_{f,0}$ of genomes that cannot host copies of IS family $f$. NB: is this still valid if one considers HGT and the duplication/deletion/counter-selection processes as independent? The HGT rate determines $P(k=0)$ and contributes (along with the other factors) to the shape of the distribution for $k > 0$, but the duplication/deletion/counter-selection processes play no role in $P(k=0)$. Since, for $P(k>0)$, the HGT and duplication processes are somewhat confounded (they have the same effect), can we really claim to fit $P(k=0)$ correctly by extrapolating the $P(k>0)$ distribution?
- Once we have the total number of genomes $N_{f,th}$ that could potentially carry the IS (obtained as $N_{tot}(1 - P(k=0))$), we define the IS family associations that can occur. We use the real data: the number of genomes that actually contain an IS family should be lower than $N_{f,th}$.
We complete the list of genomes that can contain the family by selecting the genomes closest (in terms of phylogenetic distance, using the distance matrix obtained from the 16S DNA sequence alignment) to the genomes that already contain the IS. We do that for each IS family, and for each genome we get a combination of families that can potentially occur in that genome.

- We then use these genomic combinations in the stochastic simulation.

![](https://i.imgur.com/lnjJJvX.png)

### Implementation

19/12/2022

I modified the Iranzo_selection_model.py script to fit starting from p(k=1) instead of p(k=0). The new script is called fitting_Iranzo_selection_model_from_1.py. The fits (see Bioinfo/Figures/Iranzo_selection_model_fit_from_1/) are almost identical to the fits for the selection model starting from 0. In some cases, the fitted $p(k=0)$ is actually higher than the real $N_{k=0}$...

03/01/2023

Given the previous results (the approximated $p(k=0)$ is almost identical to $N_{k=0}$ in all individual families), I'm going to implement the phylogenetic constraint in the simulation without adding potential receivers. Now I need to extract IS family combinations from the real data. I did so in the simulation_with_phylogeny.py script (edit: now in the 23_01_04_stochastic_simulation_with_phylogeny notebook). When getting to the simulation part, I realized there is a problem with the way the model is built: even if I define one combination of IS families per initial genome (matching the real data), the distribution of combinations is going to be scrambled by the extinction of some genomes and the survival and expansion of others...

10/01/23

I chose to try it anyway, in the following way: I keep the original list of possible combinations (from the real data).
When a genome is lost and replaced by another one, the replacing genome keeps its contents, but its list of possible additions is replaced by the combination corresponding to the lost genome (e.g. if the replacing genome contained an IS1 copy but IS1 is not in the lost genome's list of possible additions, the replacing genome keeps its IS1 copy but cannot acquire another one by HGT).

I tried running this modified model using the fits from Iranzo's (unmodified) selection model as parameters (or rather: the parameters computed from the alpha, beta, sigma fits, with s = 1e-9). As you can see below, it doesn't solve the problem of the total:

![](https://i.imgur.com/QTMiici.png)

The individual family fits are of variable quality.

## Integrating the phylogeny - second approach

(Written down on 03/03/2023)

Scripts: see neutral_model_phylogeny_clades_barcodes.ipynb

The idea is to integrate the phylogeny at different taxonomic levels. At the phylum level, for instance, you look at all the genomes belonging to a clade, and for each IS family, if at least one genome contains that family, you consider all genomes in the phylum to be potential receivers of the IS. You do that for all phyla, and at the end you adjust the total number of genomes that can receive the IS accordingly (for example: if a phylum of 400 genomes does not contain the IS, you consider that these genomes cannot receive it, so the total number of potential receivers is 4467 - 400 = 4067). You then fit the neutral model to that 'corrected' data: the number of genomes at k = 0 decreases by the number of genomes considered 'non-receivers', while the rest of the data being fitted doesn't change. This gives you a probability distribution corrected for potential receivers for that IS.
After doing this, for each phylum, you look at the IS families the phylum can contain, and you convolve the 'corrected' probability distributions of those families, which gives you the probability distribution of the total number of copies in that phylum. You multiply by the size of the phylum to get the expected distribution. You do that for all phyla, then sum the expected distributions to get the distribution of the total number of copies over the entire dataset. You can then do the same at lower, more precise taxonomic levels: orders, genera, genomes, etc. The idea is to see whether integrating the taxonomy in this way can 'correct' the shift between the neutral model and the data for the total distribution.

I applied this with the ISEScan data:

![](https://i.imgur.com/6m1l0is.png)

![](https://i.imgur.com/qdg4wRE.png)

At the genome level, what I'm doing should be equivalent to shuffling the number of copies of each IS across genomes while retaining the initial presence/absence pattern (see 'Exploring the total distribution'). I checked whether the shuffling and my genome-level approximation were similar, and it seems to be the case:

![](https://i.imgur.com/1jA8zu3.png)

## Exploring the space of parameter values

To explore the best parameter combinations for each IS, I would like to look at heatmaps of the sum of squared differences between the real data and the corrected Iranzo model (using method 3?) for various combinations of $\alpha$ and $\beta$ ($\sigma$ is fixed to a given value, and I'll generate several heatmaps with different values of $\sigma$ to explore the possibilities).

![](https://i.imgur.com/5qByEGI.jpg)

## Exploring the total distribution

![](https://i.imgur.com/LIZ56f1.png)

The distribution of total IS copies can be fitted quite nicely with Iranzo's neutral model. This is not trivial: a sum of exponential probability distributions is not itself an exponential probability distribution.
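The per-phylum convolution step can be sketched as (family_dists is a hypothetical list of the phylum's corrected per-family distributions):

```python
import numpy as np
from functools import reduce


def phylum_total_counts(family_dists, phylum_size):
    """Convolve the corrected copy-number distributions of the IS families a
    phylum can host (the distribution of a sum of independent counts is the
    convolution of their distributions), then scale by the phylum size to get
    the expected number of genomes at each total copy number."""
    total_dist = reduce(np.convolve, family_dists)
    return phylum_size * total_dist
```

Summing the arrays returned for every phylum (padded to a common length) then gives the expected total-copy-number distribution for the whole dataset.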
Ivan suggested studying the impact of phylogeny by shuffling the data in two ways:

1. Shuffle the contents of each IS column randomly (i.e. randomly shuffle, across genomes, the number of copies belonging to a given IS family).
2. Shuffle the contents of each IS column, but only between genomes that do contain the IS (i.e. a genome that contains no IS1 copy will still contain no IS1 copy after the shuffling).

The script is called testing_total_hyps.py. Here is the distribution of the total number of IS copies after these two shufflings:

![](https://i.imgur.com/b0faYqs.png)

![](https://i.imgur.com/ll0LOnP.png)
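The two shufflings can be sketched as (copy_matrix is the hypothetical genomes-by-families table of copy numbers):

```python
import numpy as np

rng = np.random.default_rng(1)


def shuffle_columns(copy_matrix, keep_presence=False):
    """Shuffling 1 (keep_presence=False): permute each IS-family column across
    all genomes. Shuffling 2 (keep_presence=True): permute each column only
    among the genomes that contain the family, so the presence/absence
    pattern is preserved."""
    out = copy_matrix.copy()
    for j in range(copy_matrix.shape[1]):
        if keep_presence:
            idx = np.flatnonzero(copy_matrix[:, j] > 0)
            out[idx, j] = rng.permutation(copy_matrix[idx, j])
        else:
            out[:, j] = rng.permutation(copy_matrix[:, j])
    return out
```

Both shufflings preserve each family's marginal distribution; only the second also preserves which genomes carry the family, which is what isolates the phylogenetic presence/absence signal.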