
# Mini-Assembly

In this project, we would like to detect cases where unannotated exon-UTR connections exist in the sample, fix those by adding new isoforms, and improve the quality of quantification estimates.

The link to the JCC paper: https://www.life-science-alliance.org/content/lsa/2/1/e201800175.full.pdf

There are two main cases that we want to recover:

1) Easier cases: **Unannotated exon-UTR junctions described by the reads in the sample**
2) Challenging cases: **The exon-UTR junction exists in one of the isoforms of the annotation, but the reads in the sample describe a different isoform which should contain the same exon-UTR junction but is missing from the annotation**

### TODO:

The first steps:

1) Grab a real dataset
    - Let's choose one from the JCC paper: HAP1: contains 55,234,720 paired-end 151-nt Illumina reads from the HAP1 cell line
3) Align to the genome with STAR
4) Isolate the reads mapping to unannotated exon-UTR junctions
5) Count and draw the distribution of such reads, based on the UTR coverage, etc.
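Steps 4 and 5 can be sketched roughly as follows. This is a minimal sketch, not the pipeline actually used for the numbers in this note. It assumes the junctions come from STAR's `SJ.out.tab` (tab-separated; column 6 is the annotated flag, columns 7 and 8 are the unique and multi-mapping read counts) and that UTR intervals are available as a per-chromosome dictionary. The names `Junction`, `parse_sj_out_tab`, `unannotated_utr_junctions`, `read_count_histogram` and the `utrs` layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Junction(NamedTuple):
    chrom: str
    start: int         # first base of the intron (1-based)
    end: int           # last base of the intron (1-based)
    strand: int        # 0 = undefined, 1 = +, 2 = -
    annotated: bool    # True if the junction is in the annotation given to STAR
    unique_reads: int  # uniquely mapping reads crossing the junction
    multi_reads: int   # multi-mapping reads crossing the junction


def parse_sj_out_tab(lines: Iterable[str]) -> Iterator[Junction]:
    """Parse SJ.out.tab lines into Junction records, skipping malformed lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        yield Junction(
            chrom=fields[0],
            start=int(fields[1]),
            end=int(fields[2]),
            strand=int(fields[3]),
            annotated=fields[5] == "1",
            unique_reads=int(fields[6]),
            multi_reads=int(fields[7]),
        )


def unannotated_utr_junctions(
    junctions: Iterable[Junction],
    utrs: Dict[str, List[Tuple[int, int]]],  # hypothetical: chrom -> [(start, end)]
) -> List[Junction]:
    """Keep unannotated junctions whose intron span overlaps any UTR interval."""
    hits = []
    for j in junctions:
        if j.annotated:
            continue
        for utr_start, utr_end in utrs.get(j.chrom, []):
            if j.start <= utr_end and utr_start <= j.end:
                hits.append(j)
                break
    return hits


def read_count_histogram(junctions: Iterable[Junction], field: str) -> Counter:
    """Map read count x -> number of junctions covered by exactly x reads."""
    return Counter(getattr(j, field) for j in junctions)
```

With a real run, `parse_sj_out_tab(open("SJ.out.tab"))` would feed both functions; a linear scan over UTR intervals is used here only for brevity, and an interval index would be a better fit for a full annotation.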
##### In the plots below, the x axis is the number of reads mapping to an exon-UTR junction, and the y axis is the number of exon-UTR junctions covered by x reads

all unique reads: 189811
![](https://i.imgur.com/TW7DGF0.png)

unique read count < 30: 188748 (99.4399692%)
![](https://i.imgur.com/ctm0wMG.png)

unique read count < 15: 185676 (97.8215172%)
![](https://i.imgur.com/cS2Y7V0.png)

```
0: 49476, 1: 59071, 2: 29740, 3: 16626, 4: 10916, 5: 5599, 6: 4650, 7: 3299, 8: 1824, 9: 1512, 10: 1003, 11: 702, 12: 594, 13: 346, 14: 318, 15: 301, 16: 479, 17: 429, 18: 322, 19: 161, 20: 162, 21: 156, 22: 100, 23: 119, 24: 561, 25: 39, 26: 67, 27: 97, 28: 24, 29: 55, 30: 27
```

all multi reads count: 189811

multimapping, count < 100: 189761 (99.973658%)
![](https://i.imgur.com/uhpSAT7.png)

multimapping, count < 20: 186232 (98.1144402%)
![](https://i.imgur.com/BNlzO6W.png)

```
0: 134363, 1: 6331, 2: 3682, 3: 4511, 4: 11676, 5: 9323, 6: 3292, 7: 4584, 8: 1920, 9: 668, 10: 636, 11: 601, 12: 2016, 13: 684, 14: 520, 15: 896, 16: 183, 17: 106, 18: 45, 19: 195
```

both, count: 189811

both, count < 100: 189566 (99.8709242%)
![](https://i.imgur.com/gDQoWfa.png)

both, count < 40: 187744 (98.911022%)
![](https://i.imgur.com/9DM3joR.png)

```
0: 0, 1: 63007, 2: 31607, 3: 20265, 4: 22302, 5: 14692, 6: 7827, 7: 7966, 8: 3841, 9: 2035, 10: 1429, 11: 1221, 12: 2758, 13: 1029, 14: 776, 15: 1150, 16: 264, 17: 434, 18: 353, 19: 359
```

UTR annotated: 170059

both, count < 100: 152300 (89.5571537%)
![](https://i.imgur.com/VyqfLTZ.png)

multi, count < 100: 168687 (99.1932212%)
![](https://i.imgur.com/TPFL6dR.png)

unique, count < 100: 153615 (90.3304147%)
![](https://i.imgur.com/6jh6b9L.png)

6) Compare the normalized coverage of the UTR (count/length) to the coverage of the isoforms of the genes

So far, one observation is that, for a large number of annotated UTR junctions, the coverage of the UTR junction (based on the number of unique reads mapping to it) is very far from the overall coverage of the transcript that includes the UTR, as measured by the absolute relative difference (ARD). The histograms of the ARDs are as follows:

including all the ARDs (135226):
![](https://i.imgur.com/wIEDfWv.png)

excluding ARD = 1 values (50704 pairs remain):
![](https://i.imgur.com/7GzMuLW.png)

Out of 50704 pairs, 13896 have an ARD (of coverage) greater than 0.9, which seems very significant.

The length of the UTR junction is computed as follows:

`junction_len = min(fragment_length_min, txpLength - utrlen) + min(fragment_length_min, utrlen)`

`fragment_length_min = 267.81268659152704`

There are 331089 junctions covered with reads based on the SJ.out.tab file. 41398 junctions overlap 5' UTRs, which overall constitute 222191 junction-5'UTR pairs. 135226 pairs are only described by annotated transcripts.
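The junction length formula and the coverage comparison can be written out as a short sketch. The `junction_len` function follows the formula given in this note. The ARD definition is an assumption, since the note does not spell it out: here it is taken as `|x - y| / (x + y)`, with 0 when both values are 0, which is consistent with ARD = 1 appearing as an extreme value (one of the two coverages is zero). The function names are hypothetical.

```python
# Value quoted in this note.
FRAGMENT_LENGTH_MIN = 267.81268659152704


def junction_len(txp_length: float, utr_len: float,
                 fragment_length_min: float = FRAGMENT_LENGTH_MIN) -> float:
    """Effective junction length: each side of the exon-UTR boundary
    contributes at most fragment_length_min bases."""
    return (min(fragment_length_min, txp_length - utr_len)
            + min(fragment_length_min, utr_len))


def normalized_coverage(count: float, length: float) -> float:
    """Coverage as count / length; returns 0.0 for a non-positive length."""
    return count / length if length > 0 else 0.0


def ard(x: float, y: float) -> float:
    """Absolute relative difference, ASSUMED to be |x - y| / (x + y).
    Returns 0.0 when both inputs are 0; equals 1.0 when exactly one is 0."""
    total = x + y
    if total == 0:
        return 0.0
    return abs(x - y) / total


# Toy numbers (not from the dataset): a 1000-nt transcript with a 100-nt UTR,
# 12 unique junction reads, compared with a transcript coverage of 0.2.
jlen = junction_len(1000, 100)
junction_cov = normalized_coverage(12, jlen)
print(ard(junction_cov, 0.2))
```

Under this definition the ARD is bounded in [0, 1], so a threshold such as 0.9 means the two coverages differ by roughly a factor of 19 or more.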
