# HappySorter: A. thaliana example
## Introduction
This is an example of how to produce a HappySorter assembly, using an *Arabidopsis thaliana* dataset with Illumina paired-end (PE) and Oxford Nanopore (ONT) reads.
A whole run of this can be found at `/ei/workarea/Research-Groups/RG-Bernardo-Clavijo/bj/ath_rtg/paper/happysorter`.
## Preparing datastores
- [x] Illumina reads
- [x] Nanopore reads
Reads are in bj's notebook under `~/sdg_paper/`, and on the cluster under `/ei/workarea/Research-Groups/RG-Bernardo-Clavijo/bj/ath_rtg/data_note_bj/`.
```shell=
sdg-datastore make -t paired -o ath_pe -1 ERR2173372_1.fastq.gz -2 ERR2173372_2.fastq.gz
sdg-datastore make -t long -o ath_nanopore -l 2000 -L ERR2173373.fastq.gz
```
Because SDG only accepts FASTQ for its datastores, we need to turn the TAIR chromosomes into a single-line pseudo-FASTQ, then create a datastore from it as if it were long reads.
```shell=
# tair10_sl.fa must be single-line FASTA; the sequence is reused as a
# placeholder quality line (only its length matters)
cat tair10_sl.fa | awk '{if (NR%2==0) print("@"NR/2"\n"$0"\n+\n"$0)}' > tair10.fq
sdg-datastore make -t long -o ath_tair -l 2000 -L tair10.fq
```
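The awk above assumes `tair10_sl.fa` is already single-line FASTA (one header line, one sequence line per record). If starting from the standard multi-line TAIR10 FASTA, a small helper can flatten it first; this is a minimal sketch, and the input name `tair10.fa` is hypothetical:
```python=
# flatten a multi-line FASTA into one sequence line per record
with open("tair10.fa") as fin, open("tair10_sl.fa", "w") as fout:
    seq = []
    for line in fin:
        if line.startswith(">"):
            if seq:
                fout.write("".join(seq) + "\n")
                seq = []
            fout.write(line)
        else:
            seq.append(line.strip())
    if seq:
        fout.write("".join(seq) + "\n")
```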
## Creating unique anchors
- [x] Create k=31 graph
- [x] Create kmer counter from PE
- [x] Create k=63 graph, light tip clipping
- [x] Choose MIN_UKMER_FREQ and MAX_UKMER_FREQ
- [x] Extract 31-mer runs from k=63 unitigs with "unique peak" frequencies as anchors
#### Counting 31-mers, and creating a lightly tip-clipped 63-mer DBG:
```python=
import SDGpython as SDG
import os
K=31
GK=63
MIN_CVG=4
NUM_BATCHES=1
PREFIX='ath'
## SDG-DBG assembly k=63
ws=SDG.WorkSpace()
peds=ws.add_paired_reads_datastore('%s_pe.prseq'%(PREFIX), 'pe')
gm=SDG.GraphMaker(ws.sdg)
gm.new_graph_from_paired_datastore(peds,K,MIN_CVG,NUM_BATCHES) #this is so the initial kmer count includes all 31-mers
print("First graph done")
kc = ws.add_kmer_counter('pek%d'%K, K)
kc.add_count('pe', peds)
print("kmers counted")
gm.new_graph_from_paired_datastore(peds,GK,MIN_CVG,NUM_BATCHES) #this is the real DBG k
print("second graph done")
ws.sdg.write_to_gfa1("00-%s_k%d_c%d_raw.gfa" % (PREFIX,GK,MIN_CVG))
gc=SDG.GraphContigger(ws)
gc.clip_tips(2*GK-1,3)
ws.sdg.write_to_gfa1("00-%s_k%d_c%d_clipped.gfa" % (PREFIX,GK,MIN_CVG))
print("Remaining nodes: %s" %(len(ws.sdg.get_all_nodeviews())))
kc.update_graph_counts()
ws.dump("00-%s_k%d_c%d.sdgws" % (PREFIX,GK,MIN_CVG))
```
#### Choosing frequencies for unique 31-mers
The following script uses pylab to output a figure; the values in the `freqs` list can also be analysed in any other way to choose where the main peak lies (see the sketch after this block).
```python=
import SDGpython as SDG
from pylab import plt
K=31
GK=63
MIN_CVG=4
PREFIX='ath'
ws=SDG.WorkSpace("00-%s_k%d_c%d.sdgws" % (PREFIX,GK,MIN_CVG))
kc=ws.get_kmer_counter('pek%d'%K)
freqs=kc.count_spectra('pe',max_freq=100,unique_in_graph=True)
plt.plot(freqs)
plt.savefig("00-%s_k%d_c%d_u31spectra.png" % (PREFIX,GK,MIN_CVG))
```
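As an alternative to eyeballing the figure, the `freqs` list can be scanned directly. This is a rough heuristic sketch, not part of SDG (the ~20% peak-height cutoff is an arbitrary assumption): skip past the low-frequency error peak, take the maximum beyond it as the main peak, and widen the bounds while counts stay above the cutoff.
```python=
# freqs[f] ~ number of graph-unique 31-mers counted f times in the PE reads
# walk right past the error peak until counts stop decreasing
trough = 1
while trough + 1 < len(freqs) and freqs[trough + 1] < freqs[trough]:
    trough += 1
# the main (homozygous) peak is the maximum beyond the trough
peak = max(range(trough, len(freqs)), key=lambda f: freqs[f])
# take bounds where the spectrum drops below ~20% of the peak height
cutoff = .2 * freqs[peak]
min_f, max_f = peak, peak
while min_f > trough and freqs[min_f - 1] >= cutoff:
    min_f -= 1
while max_f + 1 < len(freqs) and freqs[max_f + 1] >= cutoff:
    max_f += 1
print("peak at %d, suggested MIN_UKMER_FREQ=%d MAX_UKMER_FREQ=%d" % (peak, min_f, max_f))
```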

From this figure, we decide to use MIN_UKMER_FREQ=20 and MAX_UKMER_FREQ=60.
#### Creating 31-mer anchors:
```python=
import SDGpython as SDG
K=31
GK=63
MIN_CVG=4
PREFIX='ath'
MIN_UKMER_FREQ=20
MAX_UKMER_FREQ=60
ws=SDG.WorkSpace("00-%s_k%d_c%d.sdgws" % (PREFIX,GK,MIN_CVG))
gc=SDG.GraphContigger(ws)
seqs = gc.contig_reduction_to_unique_kmers(MIN_UKMER_FREQ,MAX_UKMER_FREQ)
print(len(seqs),"sequences for anchoring")
with open("./%s_k%d_anchors.fasta" % (PREFIX,K), "w") as f:
    for i, s in enumerate(seqs):
        f.write(">seq_%s\n%s\n" % (i,s))
```
The output file has these stats:
| n | min | N80 | N50 | N20 | max | sum | name |
|---|-----|-----|-----|-----|-----|-----|------|
| 1485000 | 31 | 128 | 131 | 131 | 131 | 152.9e6 | ath_k31_anchors.fasta |
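These numbers likely come from an external stats tool; for reference, a few lines of pure Python compute the same kind of values (the Nx definition used here, the smallest length among the longest contigs covering x% of the total, is the usual convention):
```python=
# basic length stats for the anchors FASTA: n, min, Nx values, max, sum
lens, cur = [], 0
for line in open("ath_k31_anchors.fasta"):
    if line.startswith(">"):
        if cur: lens.append(cur)
        cur = 0
    else:
        cur += len(line.strip())
if cur: lens.append(cur)
lens.sort(reverse=True)
total = sum(lens)
def nx(p):
    acc = 0
    for l in lens:
        acc += l
        if acc >= total * p / 100:
            return l
print(len(lens), lens[-1], nx(80), nx(50), nx(20), lens[0], total)
```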
## Mapping ONT reads, creating and cleaning an RTG
- [x] Create a new workspace with the anchors
- [x] Read mapping
- [x] RTG creation
- [ ] Choose repeat threshold, thread cluster size, and second percentage
- [ ] RTG cleaning
- [ ] HappySorter!
- [ ] TAIR mapping (for validation)
```python=
import SDGpython as SDG
PREFIX='ath'
ws=SDG.WorkSpace()
ws.sdg.load_from_fasta("ath_k31_anchors.fasta")
lords=ws.add_long_reads_datastore("ath_nanopore.loseq")
lrr=SDG.LongReadsRecruiter(ws.sdg,lords,31,2)
lrr.map(31)
lrr.simple_thread_reads()
rtg=lrr.rtg_from_threads(True)
rtg.dump('%s_rtg_raw.dump'%PREFIX)
```
```python=
# pop repeat nodes: repeat threshold = 25
rpl=rtg.clean_repeat_nodes_popping_list(25)
print(len(rpl),len(set([x[1] for x in rpl])))
rtg.apply_popping_list(rpl)
# pop nodes by thread clustering: cluster size = 5, second percentage = .1
cpl=rtg.clean_all_nodes_by_thread_clustering_popping_list(5,.1)
print(len(cpl),len(set([x[1] for x in cpl])))
rtg.apply_popping_list(cpl)
rtg.dump('%s_rtg_cleaned.dump'%PREFIX)
```
```python=
tair_lords=ws.add_long_reads_datastore("ath_tair.loseq")
tair_lrr=SDG.LongReadsRecruiter(ws.sdg,tair_lords,31,2)
tair_lrr.map(31)
tair_lrr.simple_thread_reads()
#THIS doesn't work: tair_rtg=tair_lrr.rtg_from_threads(True)
```
```python=
node_tair_pos={}
# record each anchor's chromosome and rank along its TAIR thread, both signs
for c in range(1,6):
    for i,x in enumerate(tair_lrr.read_threads[c],start=1):
        node_tair_pos[x.node]=(c,i)
        node_tair_pos[-x.node]=(c,-i)
# compare the HappySorter order (hs, from the ordering run below) to TAIR
for x in hs.order.as_signed_nodes():
    if x in node_tair_pos:
        print(x,node_tair_pos[x])
    else:
        print(x)
```
The two blocks above load and map/dump TAIR10 for validation; skip them if not needed.
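To summarise the comparison instead of eyeballing the per-node printout, a small pure-Python follow-up over `node_tair_pos` and the order can report coverage and per-chromosome consistency. This is an illustrative sketch, not an SDG feature, and it reuses `hs` from the ordering run below:
```python=
from collections import defaultdict
order = hs.order.as_signed_nodes()
placed = [node_tair_pos[x] for x in order if x in node_tair_pos]
print("%d/%d ordered nodes have a TAIR position" % (len(placed), len(order)))
by_chr = defaultdict(list)
for c, i in placed:
    by_chr[c].append(i)
for c in sorted(by_chr):
    pos = by_chr[c]
    # adjacent pairs whose TAIR rank increases: forward consistency
    fw = sum(b > a for a, b in zip(pos, pos[1:]))
    print("chr %d: %d nodes, %d/%d adjacent pairs increasing" % (c, len(pos), fw, max(len(pos) - 1, 1)))
```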
## Limitations of current ordering approaches
- Floating nodes in the middle of the order are never included, so order density suffers.
- Requiring 2 links to confirm adjacency discards a lot of true connections, so nodes with low coverage may come to depend on very few adjacencies. Regions with low coverage then become impossible to go through: whatever node does get included won't necessarily connect to a set that can be ordered.

Possible solutions:
- Track the reads used in the order; any node that belongs in many of these reads is a true candidate. (Deciding whether a read belongs in the order may be tricky if there is a lot of heterozygous content, but a large proportion of the read's nodes being in the order may be a good hint.) See the sketch after this list.
- Use a longer radius for adjacencies.
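A sketch of the first idea, purely illustrative: score unplaced nodes by how many order-supporting reads contain them. It assumes `lrr.read_threads` supports `len()` and integer indexing by read id, as used in the TAIR block above, and the 50% membership threshold is an arbitrary choice:
```python=
from collections import Counter
in_order = set(abs(x) for x in hs.order.as_signed_nodes())
votes = Counter()
for rid in range(1, len(lrr.read_threads)):
    nodes = [abs(p.node) for p in lrr.read_threads[rid]]
    if not nodes:
        continue
    # a read "belongs" to the order if most of its nodes are already placed
    if sum(n in in_order for n in nodes) / len(nodes) >= .5:
        for n in nodes:
            if n not in in_order:
                votes[n] += 1
# nodes seen in many order-belonging reads are strong inclusion candidates
print(votes.most_common(20))
```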
# CRASHES (CURRENT ONLY)
## Dies with: no link out in thread
```python=
hs=SDG.HappySorter(rtg,.1,.3)
hs.start_from_node(100)
print(len(hs.order.as_signed_nodes()))
old_length=len(hs.order.as_signed_nodes())
for x in range(500):
    #print("\nStep",x)
    #print(len(hs.order.as_signed_nodes()))
    hs.add_placed_nodes(hs.place_nodes(list(hs.find_bw_candidates())))
    hs.recruit_all_happy_threads()
    hs.close_internal_threads()
    hs.add_placed_nodes(hs.place_nodes(list(hs.find_fw_candidates())))
    #print(len(hs.order.as_signed_nodes()))
    hs.recruit_all_happy_threads()
    hs.close_internal_threads()
    hs.add_placed_nodes(hs.place_nodes(list(hs.find_internal_candidates())))
    #print(len(hs.order.as_signed_nodes()))
    hs.recruit_all_happy_threads()
    hs.close_internal_threads()
    o=hs.order.as_signed_nodes()
    print(len(o),hs.node_coordinates[o[-1]]-hs.node_coordinates[o[0]])
    if len(o)==old_length:
        break
    else:
        old_length=len(o)
```
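If the "no link out in thread" error surfaces as a Python exception (rather than aborting the process outright, which is an assumption), wrapping the loop localises where it dies; a debugging sketch, not a fix:
```python=
hs=SDG.HappySorter(rtg,.1,.3)
hs.start_from_node(100)
old_length=len(hs.order.as_signed_nodes())
x=0
try:
    for x in range(500):
        for find in (hs.find_bw_candidates, hs.find_fw_candidates, hs.find_internal_candidates):
            hs.add_placed_nodes(hs.place_nodes(list(find())))
            hs.recruit_all_happy_threads()
            hs.close_internal_threads()
        o=hs.order.as_signed_nodes()
        if len(o)==old_length:
            break
        old_length=len(o)
except Exception as e:
    print("died at step %d with %d nodes in the order: %s" % (x, old_length, e))
```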