Detection and identification of TE using long Read RNAseq in germline tissues

# Detection and identification of TE using long Read RNAseq in germline tissues :warning: **Teloprime** protocol is best suited for **TE detection**, and not for quantification. ## Data and results locations : Data folder needed for IGV : pbil-gates.univ-lyon1.fr/beegfs/data/ecumunel/data_TE_LR_RNAseq [Link to the last counting spreadsheet](https://docs.google.com/spreadsheets/d/1egxQrxByPId_JQt1fxJ79d5BZa_y6XqeVJQbbwvThO4/edit?usp=sharing) [Link to the git repository](https://github.com/EricAmren/TE_LR_RNAseq.git) [Link to the slides](https://docs.google.com/presentation/d/1zVL1_sLJUk7L-axE-PnOMbSWqYwWqodFVCet5MtRTtc/edit?usp=sharing) [Link to the static figures in svg format](https://github.com/EricAmren/TE_LR_RNAseq/tree/main/misc/pictures) [Link to the dynamic figures in html format](https://github.com/EricAmren/TE_LR_RNAseq/tree/main/misc/interactive_figures) ## Description of raw data : 2 datasets of **long reads** were generated using the Nanopore + Teloprime protocol from Drosophila *melanogaster* (Strain : dmGoth101) Ovaries :female_sign: and Testis :male_sign:. These datasets consist of 2 fastq files : * **FC30.fastq** for the female dataset * **FC29.fastq** for the male dataset | | Nb of reads | | ----------- | ----------- | | Female | 1236000 | | Male | 2925554 | ## Alignment : These reads were mapped onto the **strain-specific genome assembly** generated by François Sabot using the LR aligner [Minimap2](https://github.com/lh3/minimap2) (version 2.17-r974-dirty). The **splice** preset as we are working with long-read spliced alignments. (Reducing the cost of opening gap for intronic region to 0) ```typescript minimap2 -ax splice FC30.fastq dmgoth101_assembl_chrom.fasta -o FC30.bam ``` ## From alignment file to counting table of different TE insertions ### Requirements : - [X] Alignments (BAM) - [X] TE annotation (gtf) - [X] Gene annotation (gtf) - [X] TE classification file (tsv) ### Generating a TE specific annotation file In order to count the number of alignment mapped onto a TE feature, we first need to get a gtf file describing the position of each TE. In that instance, we used [RepeatMasker](https://www.repeatmasker.org/) with DFAM databank. Then, we used [OneCodeToFindThemAll](http://doua.prabi.fr/software/one-code-to-find-them-all) to group LTR TE in one feature. (They were splitted in LTR and intern parts). ### Filtering and alignment counting strategy > The objective here is to detect reads that have been expressed from transposable elements. Knowing the location of the transposable elements in the genome and the location where each read was aligned, we can estimate the number of read that was expressed by each TE. > The difficulty here is that this task is particularly dependant on the quality of the annotations. We wrote a python script with the following filters in order to reduce the number of false positives (i.e reads covering a TE locus but not originating from this TE). :warning: We also filtered really small TE for the next part : only TE with at least a length of 150bp will be counted. #### 1. Alignments with maximum Alignment Score (AS): A read can be aligned at several location in the genome (multimapping). In the bam file, several alignments will be coming from this read. As we want to only count each read once, we'll only keep the alignment with the highest Alignment Score (computed by Minimap2) as it corresponds with the most probable alignment. #### 2. Using TE location : We only keep alignments which are covering a TE locus and which have at least one base aligned in the TE locus. (A read can cover a TE locus without actually be aligned on it if the TE is in the intronic region of a gene for example) #### 3. Using minimum subject coverage : Sometime, a read can be mainly aligned on a exon but still overlap a TE for a few bp. In order to filter out those false positives, we only count reads that are at least covering 10% of the TE length. #### 4. Resolving multi-overlapping reads : Due to annotation errors, it may happen that some TE are overlapping each other or gene exons. It is then difficult to assign a read to the good feature. To solve these cases, we enumerate for each read all feature that are covered, and for each feature covered, we measure what we call a number of non-overlapping bases. We'll only increment the counting of the feature with the less number of non-overlapping bases. ### Schema of the algorithm ![](https://i.imgur.com/oHINufL.png) ### Overview of what is being filtered #### Funnel chart <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/5.embed"></iframe> <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/36.embed"></iframe> #### Reads context Non-ambiguous reads : reads that passed all filters and were only aligned on one TE insertion feature and nothing else. **Reads mapped on multiple TE insertions :** reads that passed all filters but were aligned on several TE insertions (according to the annotations) and no genes. The smallest number of non-overlapping bases determined the chosen insertion. **Reads mapped in exonic region :** reads that passed all filters but were also aligned in exonic region. **Reads mapped in intronic region :** reads that passed all filters but were also aligned in intronic region. **Reads not covering the minimal subject coverage** of any TE were also removed. Reads that presented a **smaller number of non-overlapping bases** in a gene instead of a TE insertions were filtered out. <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/38.embed"></iframe> ## Exploring TE expression ### Sunburst plots Male and Female Dataset <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/68.embed"></iframe> **How to read it :** Each area of the outside circle corresponds to a TE insertion. The size of each area is linked to the number of reads counted on each insertions : a bigger area represents a greater number of read counted on the insertion. Each circle corresponds to a level in the TE classification. From outside to inside : insertion, family, superfamily,subclass. Finally, colors describe the mean subject coverage of each TE insertion/family/superfamily/subclass, contained between 0 and 1, with a scale going from dark blue(meaning it is almost not covered at all) to bright yellow (meaning the reads fully covered the feature). ### Violin plots Each violin plot correspond to the expression of a family of TE. A family is composed of several TE insertion, with their number of counted reads. #### DNA : <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/57.embed"></iframe> #### LTR : <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/59.embed"></iframe> #### LINE : <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/55.embed"></iframe> ### Insertion counting comparison : scatterplot ![](https://i.imgur.com/0mqC2zQ.png) ### With/without chimeric reads : ![](https://i.imgur.com/Rg4uJCX.png) ![](https://github.com/EricAmren/TE_LR_RNAseq/blob/main/misc/pictures/expression_diff_chimeric_reads.png) ## Detection of chimeras [Link to the chimeras table](https://github.com/EricAmren/TE_LR_RNAseq/tree/main/chimeric_reads) Chimeras are reads that are expressed from a gene and a TE. In order to detect potential chimeras in our reads, we first isolated reads that were at least aligned on **10bp on a TE**. From this collection of reads, we kept those who were also aligned for **10bp on a gene exon**. Finally, we segregated cases in which the incriminated gene and TE (on which a read was aligned) were overlapping each other and cases in which they weren't. <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/73.embed"></iframe> <iframe width="900" height="800" frameborder="0" scrolling="no" src="//plotly.com/~EricCumunel/70.embed"></iframe>