# A. thaliana: development test

## Assembly approach and steps

### Graph creation

- Graph is created with k=63, coverage cutoff=2.
- A **light** tip clipping is done for 3 rounds, clipping nodes of up to 2k-1 length (i.e. `GraphContigger::clip_tips(125,3)`).
- Simple structures are resolved by using read paths, repathing all reads in between:
    - Tip clipping up to read size if support is low.
    - OPTIONAL: error bubbles are popped (for bubbles up to 2k+1 on each side) / low-information nodes removed.
- Canonical repeats are solved by using single reads first, and pairs if that fails (pairs part TODO).
- Tangle solving by local strider and long reads (TODO):
    - Graph is divided into higher-level tangles (i.e. frontiers of size 2k).
    - If necessary, a large complex tangle can be split into smaller sub-components.
    - Each tangle is classified and solved separately, using long reads and a strider from the frontiers and/or internal nodes.
    - TODO: tangles that are tips should be reconnected; incomplete tangles can be completed by this.
- Global strider with KCI-restricted anchors (pe).
- Global strider with KCI-restricted anchors (long reads; produces graph and scaffolds).

# Open problems

## The BIG tangle

When assembling any genome with any degree of complexity, a massive tangle usually emerges, especially when using a low k (k=63) and a low coverage cutoff (c=2). This tangle is not easy to simplify, and its tight connectivity makes it very hard to analyse at any level. In A. thaliana it can contain 90% of the nodes of the graph while still holding a very small amount of content.

```python
import SDGpython as SDG
import graphcleaning

ws=SDG.WorkSpace()
ws.sdg.load_from_gfa("/Users/clavijob/sdg_ath/ath_round2_after_canonicals.gfa")

def worst_tangles(g,s=64,c=10):
    # return the c tangles with the most internal nodes at frontier size s
    tangles=g.get_all_tangleviews(s)
    tangles=list(tangles)
    tangles.sort(key=lambda x:len(x.internals))
    return tangles[-c:]

for s in [64,127,200,500,1000,5000,10000]:
    print('\n tangles at s=',s)
    for x in worst_tangles(ws.sdg,s):
        print(x)
```

```txt
tangles at s= 64
<TangleView: 311 internals and 217 frontiers in graph SDG>
<TangleView: 327 internals and 197 frontiers in graph SDG>
<TangleView: 334 internals and 244 frontiers in graph SDG>
<TangleView: 436 internals and 206 frontiers in graph SDG>
<TangleView: 489 internals and 534 frontiers in graph SDG>
<TangleView: 771 internals and 289 frontiers in graph SDG>
<TangleView: 1099 internals and 1103 frontiers in graph SDG>
<TangleView: 3850 internals and 6221 frontiers in graph SDG>
<TangleView: 28672 internals and 23138 frontiers in graph SDG>
<TangleView: 63358 internals and 77922 frontiers in graph SDG>

tangles at s= 127
<TangleView: 1309 internals and 27 frontiers in graph SDG>
<TangleView: 1519 internals and 98 frontiers in graph SDG>
<TangleView: 1533 internals and 11 frontiers in graph SDG>
<TangleView: 1570 internals and 471 frontiers in graph SDG>
<TangleView: 1590 internals and 79 frontiers in graph SDG>
<TangleView: 3094 internals and 58 frontiers in graph SDG>
<TangleView: 3116 internals and 46 frontiers in graph SDG>
<TangleView: 3675 internals and 96 frontiers in graph SDG>
<TangleView: 5464 internals and 529 frontiers in graph SDG>
<TangleView: 768474 internals and 2322 frontiers in graph SDG>

tangles at s= 200
<TangleView: 632 internals and 18 frontiers in graph SDG>
<TangleView: 656 internals and 2 frontiers in graph SDG>
<TangleView: 661 internals and 126 frontiers in graph SDG>
<TangleView: 742 internals and 4 frontiers in graph SDG>
<TangleView: 815 internals and 30 frontiers in graph SDG>
<TangleView: 1338 internals and 14 frontiers in graph SDG>
<TangleView: 1540 internals and 6 frontiers in graph SDG>
<TangleView: 3120 internals and 42 frontiers in graph SDG>
<TangleView: 6028 internals and 187 frontiers in graph SDG>
<TangleView: 795461 internals and 3676 frontiers in graph SDG>

tangles at s= 500
<TangleView: 253 internals and 17 frontiers in graph SDG>
<TangleView: 283 internals and 19 frontiers in graph SDG>
<TangleView: 303 internals and 10 frontiers in graph SDG>
<TangleView: 303 internals and 7 frontiers in graph SDG>
<TangleView: 320 internals and 0 frontiers in graph SDG>
<TangleView: 324 internals and 0 frontiers in graph SDG>
<TangleView: 363 internals and 2 frontiers in graph SDG>
<TangleView: 644 internals and 18 frontiers in graph SDG>
<TangleView: 656 internals and 2 frontiers in graph SDG>
<TangleView: 830608 internals and 4070 frontiers in graph SDG>

tangles at s= 1000
<TangleView: 174 internals and 6 frontiers in graph SDG>
<TangleView: 220 internals and 0 frontiers in graph SDG>
<TangleView: 239 internals and 12 frontiers in graph SDG>
<TangleView: 292 internals and 25 frontiers in graph SDG>
<TangleView: 320 internals and 0 frontiers in graph SDG>
<TangleView: 324 internals and 0 frontiers in graph SDG>
<TangleView: 363 internals and 2 frontiers in graph SDG>
<TangleView: 645 internals and 17 frontiers in graph SDG>
<TangleView: 656 internals and 2 frontiers in graph SDG>
<TangleView: 841536 internals and 3802 frontiers in graph SDG>

tangles at s= 5000
<TangleView: 66 internals and 2 frontiers in graph SDG>
<TangleView: 126 internals and 3 frontiers in graph SDG>
<TangleView: 139 internals and 20 frontiers in graph SDG>
<TangleView: 171 internals and 0 frontiers in graph SDG>
<TangleView: 177 internals and 13 frontiers in graph SDG>
<TangleView: 220 internals and 0 frontiers in graph SDG>
<TangleView: 320 internals and 0 frontiers in graph SDG>
<TangleView: 324 internals and 0 frontiers in graph SDG>
<TangleView: 401 internals and 2 frontiers in graph SDG>
<TangleView: 854094 internals and 2637 frontiers in graph SDG>

tangles at s= 10000
<TangleView: 41 internals and 0 frontiers in graph SDG>
<TangleView: 51 internals and 2 frontiers in graph SDG>
<TangleView: 51 internals and 1 frontiers in graph SDG>
<TangleView: 53 internals and 4 frontiers in graph SDG>
<TangleView: 126 internals and 3 frontiers in graph SDG>
<TangleView: 171 internals and 0 frontiers in graph SDG>
<TangleView: 220 internals and 0 frontiers in graph SDG>
<TangleView: 320 internals and 0 frontiers in graph SDG>
<TangleView: 324 internals and 0 frontiers in graph SDG>
<TangleView: 859339 internals and 1654 frontiers in graph SDG>
```
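To put a number on the "90% of the nodes but very little content" claim, a quick sketch that compares the biggest tangle's internal node count and total bp against the whole graph. It assumes the graph exposes `get_all_nodeviews()` for whole-graph iteration, which is not shown in the snippets here; the rest only uses the TangleView/NodeView calls used above.

```python
# Compare the biggest tangle's internal node count and total bp against the
# whole graph. get_all_nodeviews() is an assumption about the graph API.
tangles = sorted(ws.sdg.get_all_tangleviews(200), key=lambda x: len(x.internals))
big = tangles[-1]

tangle_nodes = len(big.internals)
tangle_bp = sum(nv.size() for nv in big.internals)

all_nvs = list(ws.sdg.get_all_nodeviews())
graph_nodes = len(all_nvs)
graph_bp = sum(nv.size() for nv in all_nvs)

print("big tangle: %d / %d nodes (%.1f%%), %d / %d bp (%.1f%%)" % (
    tangle_nodes, graph_nodes, 100.0 * tangle_nodes / graph_nodes,
    tangle_bp, graph_bp, 100.0 * tangle_bp / graph_bp))
```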
The tangle is "stable" no matter the frontier size, but all of its content comes from the very low end of the coverage spectrum (note we skip any nodes larger than the frontier size, to avoid "pulled-in internals"):

```python
tangles=ws.sdg.get_all_tangleviews(200)
tangles=list(tangles)
tangles.sort(key=lambda x:len(x.internals))
# write the biggest tangle's internal nodes to FASTA, skipping nodes at or
# above the frontier size (those are "pulled-in internals")
with open("bigtangle.fasta","w") as f:
    for nv in tangles[-1].internals:
        if nv.size()>=200: continue
        f.write(">seq%d\n%s\n"%(nv.node_id(),nv.sequence()))
```

A `kat comp` versus the reads confirms this:

![](https://i.imgur.com/TLJPjYT.png)

![](https://i.imgur.com/KeMIKjm.png)

The small peak of "unique" content corresponds to the real unique content of the repeats being "collapsed" into the tangle. Counterintuitively, the longer and more "unique" a smallish node of the tangle is, the more likely it seems to have arisen from error.

A good way to clean this up is to "selectively" raise the coverage or the k (or both) required for a node to exist in this tangle.
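A minimal sketch of that selective cleanup, flagging internal nodes of the big tangle whose pe k-mer coverage stays below a cutoff stricter than the graph-wide c=2. It assumes `kc` is the 'main' counter with its "pe" count at hand (as set up in the guide script at the end), and that the counter can project per-k-mer counts onto a sequence via `project_count(name, seq)`; both are assumptions to adapt to the actual counter API.

```python
# Flag big-tangle internals whose median pe k-mer count is below a stricter
# cutoff. kc and project_count(name, seq) are assumptions (see lead-in above).
STRICT_CVG = 5  # hypothetical stricter cutoff, to be tuned against the kat spectra

big = tangles[-1]
low_cvg = []
for nv in big.internals:
    counts = sorted(kc.project_count('pe', nv.sequence()))
    median = counts[len(counts) // 2] if counts else 0
    if median < STRICT_CVG:
        low_cvg.append(nv.node_id())

print("%d of %d internal nodes fall below the stricter cutoff" % (
    len(low_cvg), len(big.internals)))
```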
### Trying with yeast

Yeast is probably an easy way to try a big tangle without so much complication. PRJEB7245, from "De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms" (the paper has the most convoluted data access ever), points to MiSeq (2x150bp), ONT and PB datasets. For S288C the datasets are:

- MiSeq: ERR1938683

The idea with the big tangles in Illumina-only data is to classify their read pairs (a sketch of the bucketing logic follows the current scripts at the end):

- Pairs with both reads fully inside the tangle.
- Pairs with one read touching a frontier.
- Pairs with both reads touching a frontier.
- Pairs with one or both reads reaching past the tangle without touching a frontier.

**The problem with frequency cutoff on indexes:** if a k-mer appears in the overlap of multiple contigs, it can make them all unmappable (i.e. a contig that only contains k-mers appearing more than the cutoff times can never map a read). In very repetitive regions this is aggravated by the error contigs adding k-mer copies, making those k-mers even harder to map.

Solutions? Maybe index only the extendable k-mers, so at least one intermediate node is mapped in each read, preventing the error mappings from being the only ones around. Also, if the index is modified to use 63-mers, at least the 63-mers can be mapped (for short reads).

## big tangle nibbling

1 - Find frontiers connected by pairs.
2 - Recreate the path using the pairs.
3 - Remove from the tangle all nodes supported only by those pairs.

# current scripts (guide)

```python
import SDGpython as SDG

K=63
MIN_CVG=2
NUM_BATCHES=1
PREFIX="something"

ws=SDG.WorkSpace()
peds=ws.add_paired_reads_datastore('%s_pe.prseq'%PREFIX)
SDG.GraphMaker(ws.sdg).new_graph_from_paired_datastore(peds,K,MIN_CVG,NUM_BATCHES)
SDG.GraphContigger(ws).clip_tips(2*K-1,3)

#this part both dumps the graph and makes node ids consecutive
ws.sdg.write_to_gfa1('%s_initial.gfa'%PREFIX)
ws.sdg.load_from_gfa('%s_initial.gfa'%PREFIX)
ws.sdg.write_to_gfa1('%s_initial.gfa'%PREFIX)

kc=ws.add_kmer_counter('main')
kc.add_count("pe",peds)

ws.dump('%s_initial.sdgws'%PREFIX)

peds.mapper.path_reads()
peds.mapper.dump_readpaths('%s_initial_pepaths.dump'%PREFIX)
```
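Coming back to the pair classification above: a minimal sketch of the bucketing logic as a plain helper. It only needs the sets of internal and frontier node ids of a tangle plus the node ids each read of a pair maps to; how those per-read node lists are pulled from the paired reads mapper (e.g. from the paths produced by `path_reads()`) is left out here, and `classify_pair` is just an illustrative name, not an existing call.

```python
# Bucket a read pair by how it touches a tangle's internal/frontier nodes.
# nodes1/nodes2 are the (signed) node ids each read maps to.
def classify_pair(nodes1, nodes2, internals, frontiers):
    n1 = {abs(n) for n in nodes1}
    n2 = {abs(n) for n in nodes2}
    both = n1 | n2
    touches_frontier = [bool(n1 & frontiers), bool(n2 & frontiers)]
    if both and both <= internals:
        return "both reads fully inside"
    if all(touches_frontier):
        return "both reads touching a frontier"
    if any(touches_frontier):
        return "one read touching a frontier"
    if both - internals:
        return "past the tangle, no frontier touched"
    return "unclassified"

# toy usage with hypothetical node id sets
internals = {101, 102, 103}
frontiers = {200, 201}
print(classify_pair([101, 102], [103], internals, frontiers))   # both reads fully inside
print(classify_pair([101], [103, 200], internals, frontiers))   # one read touching a frontier
print(classify_pair([102, 999], [101], internals, frontiers))   # past the tangle, no frontier touched
```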