# Complex Genomes

### Basic Assembly Flow

Raw reads > evaluate quality > trimming / cleanup > assembly with parameter set > is your assembly okay? > draft assembly

***Main differences of bacterial vs complex genome:*** choice of k-mer, depth of coverage, typically going to use a lot more data

Many ways to do this:

#### 1st: Assemble

- SPAdes (short reads) and ABySS (short reads)
- Flye (vs Canu) (long reads)
- Merging

#### 2nd: Assembly of assemblies

EXAMPLES

### SPAdes assembly (of Drosophila data)

SPAdes picks the right k-mer for us!! :)

- Input paired-end reads
- You can pull PacBio and Nanopore data directly into SPAdes
- `--trusted-contigs`: if there is a discrepancy between my data and the supplied contigs, the contigs are right (`--untrusted-contigs` means my data are right)
- `--cov-cutoff` (see code in slides)

### Flye assembly

## How much coverage do you need?

### Illumina Data

If you have a great reference genome and only need a bit of genetic variation data: only 6-8x

You need more, around 16x, to call two different haplotypes (8x per haplotype)

Draft assembly: 30x

Quality generally keeps improving on Illumina up to 50-60x and then plateaus; you don't get much more out of higher-coverage Illumina data

### Oxford Nanopore, long read

If you're going to combine it with Illumina data you can squeeze by with 5-8x, but when you start getting above ***10x*** and add Illumina, it starts to add up very nicely

30x-50x: great data, IF you have high-quality, high-molecular-weight DNA for long reads (>20 kb is long, >200 kb is good and long, over a megabase is really long)
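
The coverage targets above are fold coverage: total sequenced bases divided by genome size. A minimal Python sketch of that arithmetic follows. The function names are made up for illustration, and the example assumes a 500 Mb genome and 50 Gb of sequence. The thresholds are just the Illumina rules of thumb from these notes, not hard limits.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Approximate depth of coverage: sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Illumina rules of thumb from these notes: (minimum coverage, what it is enough for)
ILLUMINA_TARGETS = [
    (6, "variation data against a great reference"),
    (16, "calling two haplotypes (about 8x each)"),
    (30, "draft assembly"),
    (50, "near the plateau, more data adds little"),
]


def illumina_uses(coverage: float) -> list[str]:
    """Return every target from the notes that this coverage satisfies."""
    return [label for minimum, label in ILLUMINA_TARGETS if coverage >= minimum]


if __name__ == "__main__":
    # Illustrative numbers: 50 Gb of sequence over a 500 Mb genome
    cov = fold_coverage(50e9, 500e6)
    print(f"approximate coverage: {cov:.0f}x")
    for use in illumina_uses(cov):
        print(" -", use)
```

This only estimates coverage from raw yield. Reads lost to trimming or that fail to map will lower the real depth.
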
Long reads provide that spatial data over a really long area - good for haplotypes

How to find coverage:

- Genome size = 500,000,000 bp
- Sequencing = 50 Gb (gigabases) of data
- Sequencing / genome size = approximate coverage (here 50,000,000,000 / 500,000,000 = 100x)
- Use: SAMtools

### Challenges of complex genomes

SIZE, repeat regions, gene families, heterochromatin, polyploidy

Polyploids pose a complication for de Bruijn graphs (Tool: WhatsHap); you need VERY high depth of sequencing data for polyploids

### Heterozygosity

- related to polyploidy: how do you distinguish alleles from gene families?
- Assembly errors in some programs, but PLATANUS handles heterozygosity really well
- http://platanus.bio.titech.ac.jp/

### How do we get a better assembly?

Scaffolding: use linkage data

- genetic maps (recommends Chromonomer)
- long reads [bionano]
  - bionano gives you a "pattern of distances"
  - "nicks" the DNA at specific places; you know where the nicks SHOULD occur, so you can compare that against your data
  - use runBNG
- Hi-C
  - Dovetail Genomics
  - Hi-C assembly tools: https://github.com/marbl/SALSA

## Nanopore

- can usually run 3x through 1 flow cell (1000 pores, 300 pores, 100 pores)
- 10 micrograms of DNA for a flowcell
-