---
tags: Thèse
---
# Data analysis
### Overview of IS distribution in prokaryotic genomes




The outliers are ISH3 and IS481. Total average: 35.0 IS/genome for the 4467 genomes dataset, 55.66 copies/genome for the 21958 genomes dataset (likely artificially increased because of the Proteobacteria sampling bias).
Oasis results (for comparison):

When analyzing the 4,000 genomes dataset:
- 106/4467 genomes (2.4%) are devoid of IS.
- Among phyla with at least 10 genomes, the phylum with the highest mean number of IS per genome is Deinococcus-Thermus (51.5 IS/genome), followed by Proteobacteria (45.0 IS/genome).
- The phylum with the least mean number of IS per genomes is Verrucomicrobia (10.8 IS/genome), followed by Tenericutes (16.1 IS/genome). These phyla respectively contain 0/10 and 14/125 genomes (11.2%) devoid of IS.
When analyzing the 21,000 genomes dataset:
- 660/21958 genomes (3.0%) are devoid of IS.
- Among phyla with at least 10 genomes, the phylum with the highest mean number of IS per genome is Proteobacteria (70.0 IS/genome), followed by Deinococcus-Thermus (56.3 IS/genome).
- The phylum with the least mean number of IS per genomes is Chlamydiae (2.7 IS/genome), followed by Verrucomicrobia (4.4 IS/genome).
- For Chlamydiae, this is mostly due to the fact that 135 out of 175 (77.1%) genomes are devoid of IS - vs 324/12671 (2.6%) of Proteobacteria genomes. However, this is not the case for Verrucomicrobia (only 1/61 genome devoid of IS).
Many of the genomes devoid of IS are obligate endosymbionts or endoparasites (such as Chlamydia, Rickettsia, Candidatus Carsonella, Candidatus Portiera, Blattabacterium, Wigglesworthia or Serratia symbiotica). Notable genera with genomes devoid or almost devoid of IS but that are not endosymbionts include Bordetella, Bartonella, Methanococcus and Mycoplasma.
In most cases, most of the genomes belonging to a genus in which IS-free genomes are found contain little or no IS copies: Candidatus Carsonella, Blattabacterium, Candidatus Portiera, Wigglesworthia genomes are all entirely devoid of IS; in Chlamydia, 135 genomes contain no IS, and the 34 other genomes contain only one IS copy.


Bordetella is a notable exception, with numerous genomes containing more than 100 IS copies:

### Analyzing the relationship between genome size and IS content




### Analyzing the relationship between phylogeny and IS content
To look at the relationship between phylogeny (approximated using the taxonomy) and IS content, I performed the following analysis at each taxonomic level and for each IS family (using a python script that I ran on the server):
* I take all the clades that contain at least one genome carrying an IS element belonging to the IS family
* for each of these clades, I look at the proportion of subclades that contain the IS family (i.e. that contain at least one genome containing an IS element classified into that family). To avoid an over-representation of 1 (due to clades that contain only one sub-clade), I only take into account clades that contain at least 5 subclades, and if they contain more than 5, I sample 5 subclades randomly.
* I compare the distribution of these proportions to the distribution obtained after shuffling subclades among clades
Here is an example of results for the IS5 family:

NB: Check that I re-did the analysis several times for the 'real' data too (since I sample 5 subclades randomly, the result is non-deterministic). I don't think I did so I should re-run it with that modification.
Update: I re-ran it with the modifications (see the script called analysis_corrected_n_5.py and the results in Figures/taxonomic_analysis_n_equal_5_corrected; the figure above is from that new analysis). The remaining problem is that by running the analysis several times on the real data, I artificially increase the statistical power by adding more data points.
Here is a visual summary of the statistical significance of the difference between real and shuffled data (four significance levels, by increasing darkness level:
- ns: 5.00e-02 < p <= 1.00e+00
- *: 1.00e-02 < p <= 5.00e-02
- **: 1.00e-03 < p <= 1.00e-02
- ***: 1.00e-04 < p <= 1.00e-03
- ****: p <= 1.00e-04)

Overall, it seems that IS contents are more similar than expected by chance at the species and genus level, but rarely above.
### Similarity score
Overall I think it might be worth thinking about a similarity score again - might not be perfect, but less convoluted and easier to understand by the reader than what I'm currently doing? I've started testing things in the script similarity_score.py. Basically the idea would be to compare genomes two by two:
- you add +1 to the score for each shared IS family
- you remove -1 to the score for each IS family present in one genome but not the other
- you divide the resulting score by the number of families found at least in one genome
Of course this would have to be refined as I test it.
I tested it on the Acinetobacter baumannii species (273 genomes available): I get an average similarity score of 0.0628, vs -0.4445 for 273 genomes sampled at random in my data. I get an average similarity score of -0.1172 for 273 genomes sampled randomly from the Acinetobacter genus.
- Average similarity score for 273 genomes sampled randomly: -0.4445 (minimum: -1, maximum: 1)
- Average similarity score for 273 Acinetobacter baumannii genomes: 0.0628
- Average similarity score for 273 genomes sampled in the Acinetobacter genus: -0.1172
- Average similarity score for 273 genomes sampled in the Moraxellaceae family: -0.1624
- Average similarity score for 273 genomes sampled in the Pseudomonadales order: -0.2217
- Average similarity score for 273 genomes sampled in the Gammaproteobacteria class: -0.2289
- Average similarity score for 273 genomes sampled in the Proteobacteria phylum: -0.3439
Seems pretty good. I wrote and launched a script on the server to perform this on the entire data (similarity_score_general_run.py).




## Checking genomes with the most IS
The genome with the highest percentage of IS in the genome (nearly 30%) is an Orientia tsutsugamushi genome (GCF_900327245.1_Gilliam.fna). The .out file seems okay, with reported IS of the right length. A lot of IS with no data on the ORF reported, but that's often the case compared with other files; maybe a bit more here? (I checked in the tsv file, most of the IS are reported as complete?)
The high percentage of IS checks out when looking at its characteristics of this species: it's an obligate intracellular parasite, reported to have "the most repeated DNA sequences among bacterial genomes sequenced so far" ([Wikipedia](https://en.wikipedia.org/wiki/Orientia_tsutsugamushi)).
The genome with the most IS (1284) is a Piscirickettsia salmonis genome (GCF_009709535.1). 708 complete copies, 576 partial copies, 882 IS with no ORF info reported. Compared to another genome randomly picked (GCF_004801575.1_ASM480157v1.fna) with 91 IS copies: 70 complete copies, 21 partial copies, 48 copies with no ORF info reported.
## Exploring the total distribution

The distribution of total IS copies can be fitted quite nicely with Iranzo's model. Which is not trivial: a sum of exponentials probability distributions is not another exponential probability distribution.
Ivan suggested studying the impact of phylogeny by shuffling the data in two ways:
1. shuffle the contents of each IS column randomly (ie randomly shuffle the total number of copies belonging to a given IS family in a genome).
2. shuffle the contents of each IS column, but only between genomes that do contain the IS (ie, a genome that contains no IS1 copy will still not contain any IS1 copy after the shuffling).
The script is called testing_total_hyps.py. Here is the distribution of the total number of IS copies after these two shufflings:


## Generating a tree and phylogenetical distances
### Phylogenetic tree
see 16s_tree file on morrowind
I extracted the 16S RNA sequences from the 4467 genomes representative of a species (extract_16S_seq.py). Now I need to launch an alignment and generate a tree from it. The goal is to then create a figure in Microreact that will easily allow me (and hopefully other interested people) to navigate the ISEScan output data. I'm not sure which alignment tool is better for 16S data; I'll look into it but for now I launched a mafft alignment over the weekend (with automatic detection of the best parameters).
```
mafft --auto 16S_sequences.fasta > mafft_output.fasta
```
iTOL is a nightmare but Microreact doesn't seem to be working with my tree, so... I generated annotations using [table2itol](https://github.com/mgoeker/table2itol) from a tsv file (I converted the data to log2 values first for a better visualization). Here is the example code in R for the digIS data (launched from the folder containing the table2itol.R script):
```
source('table2itol.R')
create_itol_files(infiles = c('/home/gaudillf/digIS_4467_genomes/digIS_data_log2.tsv'), identifier = 'Full_genome_ID', label = 'Full_genome_ID')
```
ISEScan data:

digIS data (NB: I had to rerun the data extraction because of an issue with the .sum files. See my issue opened on the digIS github):

Sophie suggested adding a third map recapitulating the difference between the two (basically ISEScan - digIS).
### Exploring specific clades
Some subsets of genomes seem to have specific IS contents (either similar contents across related genomes, or related genomes with no IS). I added the taxonomy of the genomes to the digIS tree as popup info (thanks to a script sent by Sophie: see taxonomic_annotations_iTOL.py, which is my adaptation of it) and I started exploring. Work in progress!
### Measure of the phylogenetic distance between genomes
At first I thought I would use fastANI, but from the GitHub issues it seems that below a 80% similarity between genomes, fastANI does not report a score. So it seems it is not adapted to what I want to do.
Maybe I could use distmat (https://www.bioinformatics.nl/cgi-bin/emboss/output/537915) or [pairsnp](https://github.com/gtonkinhill/pairsnp) (I launched a command which gave a distance matrix as output).
Now that I have the pairsnp matrix, I need to check how accurate it is. I'm going to pick out closely related genomes from the tree and look at their pairsnp score.
Seems ok. The further away genomes are on the tree, the higher the pairsnp score.
Now that I have the matrix, I need to compute the similarity score for each genome pair and compare that with the phylogenetic distance between genomes. I checked: python and pyplot should be able to handle the 20 million-ish points. I wrote a script to compute the similarity score for each genome pair (1 genome/species). See in Clean_scripts the script called similarity_score_no_strains.py.
Once it is done, I can combine it with the distance matrix (morrowind home: pairsnp/pairsnp_output.csv).
The result baffles me a bit: there does not seem to be a clear correlation between the distance between a pair of genomes and their IS content. I expected a clear decrease of the similarity score as the distance increases. Maybe it's due to saturation of the figure? I tried running Seaborn's density plot function, but there is too much data and it can't handle it.

## PCA of the IS data (ISEScan)
Just to see if I could see a structuration in the data, I plotted a PCA of the ISEScan data (pca_analysis.py).


It seems there is no clear structuration. The PC1 axis seems to be consistent with the total number of IS (color in the plot below):


Just to check, I redid the analysis but only looked at genomes with less than 100 IS copies in total. No structuration either.


## Exploring IS combinations
See number_of_IS_families_in_genomes_ISEScan.ipynb and See number_of_IS_families_in_genomes_digIS.ipynb.
### ISEScan
Clustering of genomes based on IS combinations:

Clustering of genomes by similarity score:

I tried to see whether clades tend to group together, but based on the few clades I explored (3-4 per taxonomic level), it does not seem to be the case.
Clustering of genomes based on pairwise cosine distances:

I also looked at the Manhattan distances:

### digIS




Euclidian distances:
