
We are trying to reduce the number of reference sequences caused by duplicated genomic regions in the *splici* reference in two places.

The first is `joinOverlappingIntrons`, as you mentioned, which works at the isoform level. Imagine a very short exon (shorter than 2 * flanking length). When the flanking length is added to the introns surrounding it, the two flanked introns may overlap across the short exon. Joining these two overlapping introns reduces the number of introns associated with that short exon.

The second is joining the overlapping introns at the gene level. In `getFeatureRanges()`, the intronic genomic regions of each isoform of a gene are defined *independently*, so they are regarded as distinct genomic regions and then as distinct reference sequences in *splici*. However, all of those "distinct" intronic genomic regions include the introns of the same gene. If we do not join the overlapping introns of each gene, the *splici* reference becomes very repetitive, because the introns of each gene appear in *splici* as many times as the gene has isoforms.

For example, say we have a gene with 3 exons and 2 introns. The unspliced transcript of the gene has the structure

$$E_1-I_1-E_2-I_2-E_3$$

where $E$ represents exons and $I$ represents introns. If an isoform $ISO_1$ of the gene consists of $E_1-E_2$, the two intronic genomic regions of $ISO_1$ will be $I^{ISO_1}_1 = I_1$ and $I^{ISO_1}_2 = I_2-E_3$. If another isoform $ISO_2$ of the gene consists of $E_2-E_3$, the two intronic genomic regions of $ISO_2$ will be $I^{ISO_2}_1 = E_1-I_1$ and $I^{ISO_2}_2 = I_2$. The *splici* reference will then have 6 sequences:

- $ISO_1: E_1-E_2$
- $I^{ISO_1}_1: I_1$
- $I^{ISO_1}_2: I_2-E_3$
- $ISO_2: E_2-E_3$
- $I^{ISO_2}_1: E_1-I_1$
- $I^{ISO_2}_2: I_2$

As you can see, $I_1$ and $I_2$ each appear multiple times.
If we join the overlapping introns, the *splici* reference will instead have 4 sequences:

- $ISO_1: E_1-E_2$
- $ISO_2: E_2-E_3$
- $I^{joined}_1: E_1-I_1$
- $I^{joined}_2: I_2-E_3$

This reduces the number of reference sequences each sequencing read can map to, which in turn reduces the size of the reference index and the running time of our pipeline.
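
To make the two joining steps concrete, here is a minimal Python sketch of the idea. It is not the actual implementation behind `joinOverlappingIntrons` or `getFeatureRanges()`. The helper names `flank` and `join_overlapping`, the half-open coordinate convention, and all coordinates are assumptions invented for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) coordinates on one strand.
Interval = Tuple[int, int]


def flank(introns: List[Interval], flank_len: int) -> List[Interval]:
    """Extend every intron by flank_len bases on both sides (clamped at 0)."""
    return [(max(0, start - flank_len), end + flank_len) for start, end in introns]


def join_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals that overlap or touch into single intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Isoform level: a 40-base exon at [300, 340) sits between two introns.
# With a flanking length of 50, the flanked introns overlap and are joined.
short_exon_introns = [(100, 300), (340, 600)]
isoform_level = join_overlapping(flank(short_exon_introns, 50))

# Gene level: toy layout for the example gene (made-up coordinates):
# E_1 = [0, 100), I_1 = [100, 300), E_2 = [300, 400),
# I_2 = [400, 600), E_3 = [600, 700)
iso1_intronic = [(100, 300), (400, 700)]  # I_1 and I_2-E_3
iso2_intronic = [(0, 300), (400, 600)]    # E_1-I_1 and I_2
gene_level = join_overlapping(iso1_intronic + iso2_intronic)

print(isoform_level)  # [(50, 650)]
print(gene_level)     # [(0, 300), (400, 700)]
```

In this toy layout the four per-isoform intronic regions collapse into two joined regions, which correspond to $E_1-I_1$ and $I_2-E_3$ in the example above.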