# Mini-Assembly
In this project, we would like to detect cases where unannotated exon-UTR connections exist in the sample, fix those by adding new isoforms, and improve the quality of quantification estimates.
The link to the JCC paper:
https://www.life-science-alliance.org/content/lsa/2/1/e201800175.full.pdf
There are two main cases that we want to recover:
1) Easier cases:
**Unannotated exon-UTR junctions described by the reads in the sample**
2) Challenging cases:
**The exon-UTR junction exists in one of the isoforms of the annotation, but the reads in the sample describe a different isoform, which should contain the same exon-UTR junction but is missing from the annotation**
### TODO:
The first steps:
1) Grab a real dataset
- Let's choose one from the JCC paper: HAP1, which contains 55,234,720 paired-end 151-nt Illumina reads from the HAP1 cell line
3) Align to the genome with STAR
4) Isolate the reads mapping to unannotated exon-UTR junctions
5) Count and draw the distribution of such reads, based on the UTR coverage, etc.
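
As a rough sketch of steps 4 and 5, the snippet below reads junctions in the format of STAR's `SJ.out.tab` (tab-separated, column 6 is the annotation flag, column 7 the unique read count, column 8 the multimapping read count) and histograms the unannotated ones by unique read count. The function names and the toy input are hypothetical, and the restriction to junctions that touch a UTR is not shown here.

```python
import csv
import io
from collections import Counter

# STAR SJ.out.tab columns (no header):
# chrom, intron_start, intron_end, strand, motif, annotated, n_unique, n_multi, max_overhang


def read_unannotated_junctions(handle):
    """Yield (chrom, start, end, n_unique, n_multi) for junctions flagged as unannotated."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        annotated = int(row[5])  # 0 = unannotated, 1 = annotated
        if annotated == 0:
            yield row[0], int(row[1]), int(row[2]), int(row[6]), int(row[7])


def unique_count_histogram(junctions):
    """Map unique read count -> number of junctions with that count."""
    return Counter(j[3] for j in junctions)


# Toy input standing in for a real SJ.out.tab file (hypothetical values).
example = (
    "chr1\t100\t200\t1\t1\t0\t3\t0\t25\n"
    "chr1\t300\t400\t1\t1\t1\t10\t2\t30\n"
    "chr2\t50\t90\t2\t2\t0\t3\t1\t12\n"
)
juncs = list(read_unannotated_junctions(io.StringIO(example)))
hist = unique_count_histogram(juncs)
print(dict(hist))
```

With a real run, `io.StringIO(example)` would be replaced by an open handle on the `SJ.out.tab` file.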
##### In the plots below, the x axis is the number of reads mapping to an exon-UTR junction, and the y axis is the number of exon-UTR junctions covered by that many reads
Unique reads, all junctions: 189811

Junctions with unique read count < 30: 188748 (99.44%)

Junctions with unique read count < 15: 185676 (97.82%)

```
0: 49476,
1: 59071,
2: 29740,
3: 16626,
4: 10916,
5: 5599,
6: 4650,
7: 3299,
8: 1824,
9: 1512,
10: 1003,
11: 702,
12: 594,
13: 346,
14: 318,
15: 301,
16: 479,
17: 429,
18: 322,
19: 161,
20: 162,
21: 156,
22: 100,
23: 119,
24: 561,
25: 39,
26: 67,
27: 97,
28: 24,
29: 55,
30: 27
```
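
The summary lines above the listing can be reproduced from the listing itself, assuming each entry maps a unique read count to the number of junctions with that count. A minimal sketch for the `< 15` threshold (the helper name `count_below` is hypothetical):

```python
# First 15 entries of the unique-read histogram listed above.
unique_hist = {
    0: 49476, 1: 59071, 2: 29740, 3: 16626, 4: 10916,
    5: 5599, 6: 4650, 7: 3299, 8: 1824, 9: 1512,
    10: 1003, 11: 702, 12: 594, 13: 346, 14: 318,
}
TOTAL_JUNCTIONS = 189811  # total reported in the notes


def count_below(hist, threshold):
    """Number of junctions whose read count is strictly below the threshold."""
    return sum(n for reads, n in hist.items() if reads < threshold)


below_15 = count_below(unique_hist, 15)
pct_below_15 = 100.0 * below_15 / TOTAL_JUNCTIONS
print(below_15, round(pct_below_15, 2))  # 185676 97.82
```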
Multimapping reads, all junctions: 189811

Junctions with multimapping read count < 100: 189761 (99.97%)

Junctions with multimapping read count < 20: 186232 (98.11%)

```
0: 134363,
1: 6331,
2: 3682,
3: 4511,
4: 11676,
5: 9323,
6: 3292,
7: 4584,
8: 1920,
9: 668,
10: 636,
11: 601,
12: 2016,
13: 684,
14: 520,
15: 896,
16: 183,
17: 106,
18: 45,
19: 195
```
Both (unique and multimapping), all junctions: 189811

Junctions with combined read count < 100: 189566 (99.87%)

Junctions with combined read count < 40: 187744 (98.91%)

```
0: 0,
1: 63007,
2: 31607,
3: 20265,
4: 22302,
5: 14692,
6: 7827,
7: 7966,
8: 3841,
9: 2035,
10: 1429,
11: 1221,
12: 2758,
13: 1029,
14: 776,
15: 1150,
16: 264,
17: 434,
18: 353,
19: 359
```
UTR annotated: 170059

Both, count < 100: 152300 (89.56%)

Multi, count < 100: 168687 (99.19%)

Unique, count < 100: 153615 (90.33%)

6) Compare the normalized coverage of the UTR (count/length) to the coverage of the isoforms of the genes
So far, one observation is that, for a large number of annotated UTR junctions, the coverage of the UTR junction (based on the number of unique reads mapping to it) is very far from the overall coverage of the transcript that includes the UTR, as measured by the absolute relative difference (ARD). The histograms of the ARDs are as follows:
Including all the ARDs (135226):

Excluding ARD = 1 values (50704 pairs remain):

Out of 50704 pairs, 13896 (27.4%) have an ARD (of coverage) greater than 0.9, which seems very significant.
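
The notes do not spell out the ARD formula. The sketch below assumes the common definition `|x - y| / (x + y)`, which gives exactly 1 whenever one of the two coverages is zero (consistent with the ARD = 1 pairs being filtered out above). The function names and example numbers are hypothetical.

```python
def normalized_coverage(read_count, length):
    """Coverage normalized by length (count / length)."""
    return read_count / length


def ard(x, y):
    """Absolute relative difference, ASSUMED to be |x - y| / (x + y); 0 if both are 0."""
    total = x + y
    if total == 0:
        return 0.0
    return abs(x - y) / total


# Hypothetical example: a weakly covered junction on a well covered transcript.
junction_cov = normalized_coverage(12, 400.0)
transcript_cov = normalized_coverage(900, 1500.0)
value = ard(junction_cov, transcript_cov)
print(value > 0.9)
```

Under this definition, a pair where the junction has no unique reads but the transcript is expressed always scores 1, which is why excluding those pairs changes the histogram so much.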
The length of the UTR junction is computed as follows:

`junction_len = min(fragment_length_min, txpLength - utrlen) + min(fragment_length_min, utrlen)`

where `fragment_length_min = 267.81268659152704`.
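
The formula above translates directly into a small function. The transcript and UTR lengths in the example calls are made up for illustration:

```python
FRAGMENT_LENGTH_MIN = 267.81268659152704  # value given in the notes


def junction_len(txp_length, utr_len, fragment_length_min=FRAGMENT_LENGTH_MIN):
    """Effective junction length: each side of the junction contributes
    at most fragment_length_min bases."""
    non_utr_side = min(fragment_length_min, txp_length - utr_len)
    utr_side = min(fragment_length_min, utr_len)
    return non_utr_side + utr_side


# Hypothetical lengths: a UTR shorter than the cap, and one longer than it.
short_utr = junction_len(2000, 120)
long_utr = junction_len(2000, 500)
print(short_utr, long_utr)
```

When both sides are longer than `fragment_length_min`, the result is simply twice that value; a short UTR (or a short remainder of the transcript) shrinks the window on its side.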
There are 331089 junctions covered by reads according to the SJ.out.tab file.
41398 junctions overlap 5' UTRs, giving 222191 junction-5'UTR pairs in total.
135226 pairs are described only by annotated transcripts.
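
One way such junction-5'UTR pairs could be enumerated is sketched below. It assumes "overlap" means that the junction's intron interval and the UTR interval share at least one base on the same chromosome; the notes do not state the exact criterion, and the tuple layouts, identifiers, and function name are hypothetical.

```python
from collections import defaultdict


def overlapping_pairs(junctions, utrs):
    """Return (junction, utr_id) pairs whose closed intervals intersect.

    junctions: iterable of (chrom, start, end)
    utrs:      iterable of (chrom, start, end, utr_id)
    """
    utrs_by_chrom = defaultdict(list)
    for chrom, start, end, utr_id in utrs:
        utrs_by_chrom[chrom].append((start, end, utr_id))

    pairs = []
    for chrom, j_start, j_end in junctions:
        for u_start, u_end, utr_id in utrs_by_chrom.get(chrom, []):
            # Closed intervals intersect when each starts before the other ends.
            if j_start <= u_end and u_start <= j_end:
                pairs.append(((chrom, j_start, j_end), utr_id))
    return pairs


# Toy data (hypothetical coordinates and identifiers).
junctions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 20)]
utrs = [
    ("chr1", 150, 250, "tx1_5utr"),
    ("chr1", 180, 190, "tx2_5utr"),
    ("chr1", 700, 800, "tx3_5utr"),
]
pairs = overlapping_pairs(junctions, utrs)
n_junctions_with_overlap = len({junction for junction, _ in pairs})
print(n_junctions_with_overlap, len(pairs))
```

A single junction can pair with several UTRs (one per transcript), which is why the number of pairs is larger than the number of overlapping junctions. The nested loop is quadratic per chromosome; an interval tree or sorted sweep would be a better fit at the scale reported here.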