# Complex Genomes
### Basic Assembly Flow
Raw reads > evaluate quality > trimming / cleanup > assembly with parameter set > is your assembly okay? > draft assembly
***Main differences of bacterial vs complex genome:***
choice of k-mer, depth of coverage, typically going to use a lot more data
Many ways to do this
#### 1st: Assemble
- SPAdes (short reads) and ABySS (short reads)
- Flye (vs Canu) (long reads)
- Merging
#### 2nd: Assembly of assemblies
EXAMPLES
### SPAdes assembly (of Drosophila data)
SPAdes picks the right k-mer for us!! :)
- Input paired-end reads
- You can pull PacBio and Nanopore data directly into SPAdes
- `--trusted-contigs`: if there is a discrepancy between my data and the supplied contigs, the contigs are right (`--untrusted-contigs` means my data is right)
- `--cov-cutoff`: coverage cutoff (a number, `auto`, or `off`)
(see code in slides)
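
The slide code is not reproduced in these notes, so the following is only a minimal sketch of how the options above fit together on a SPAdes command line. The file names, output directory, and thread count are hypothetical placeholders; the flags themselves (`-1`, `-2`, `--nanopore`, `--pacbio`, `--trusted-contigs`, `--cov-cutoff`, `-t`, `-o`) are standard `spades.py` options. The function only builds the command; it does not run SPAdes.

```python
import shlex


def spades_command(reads_1, reads_2, out_dir, nanopore=None, pacbio=None,
                   trusted_contigs=None, cov_cutoff="auto", threads=8):
    """Build (but do not run) a spades.py command as a list of arguments."""
    cmd = [
        "spades.py",
        "-1", reads_1,          # forward paired-end reads
        "-2", reads_2,          # reverse paired-end reads
        "-o", out_dir,          # output directory
        "-t", str(threads),
        "--cov-cutoff", str(cov_cutoff),  # a number, "auto", or "off"
    ]
    if nanopore:
        cmd += ["--nanopore", nanopore]   # optional long reads
    if pacbio:
        cmd += ["--pacbio", pacbio]
    if trusted_contigs:
        cmd += ["--trusted-contigs", trusted_contigs]
    return cmd


# Hypothetical file names, for illustration only.
example = spades_command(
    "dmel_R1.fastq.gz", "dmel_R2.fastq.gz", "spades_out",
    nanopore="dmel_ont.fastq.gz",
)
print(shlex.join(example))
```

To actually launch the assembly, the list could be passed to `subprocess.run(example, check=True)` on a machine where SPAdes is installed.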
### Flye assembly
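
The notes leave this section empty, so what follows is only a hedged sketch of a basic Flye invocation, not the command used in the course. Input and output names are hypothetical; `--nano-raw`, `--nano-hq`, `--pacbio-raw`, `--pacbio-hifi`, `--out-dir`, `--threads`, and `--genome-size` are Flye options (`--nano-hq` assumes a recent Flye release). The function only assembles the argument list.

```python
import shlex

# Read-type flags Flye accepts for uncorrected and high-quality long reads.
FLYE_READ_TYPES = {"nano-raw", "nano-hq", "pacbio-raw", "pacbio-hifi"}


def flye_command(reads, out_dir, read_type="nano-raw", threads=8,
                 genome_size=None):
    """Build (but do not run) a flye command as a list of arguments."""
    if read_type not in FLYE_READ_TYPES:
        raise ValueError(f"unsupported read type: {read_type}")
    cmd = ["flye", f"--{read_type}", reads,
           "--out-dir", out_dir,
           "--threads", str(threads)]
    if genome_size:
        # Optional size estimate, e.g. "500m" for 500 Mb.
        cmd += ["--genome-size", genome_size]
    return cmd


# Hypothetical file names, for illustration only.
print(shlex.join(flye_command("ont_reads.fastq.gz", "flye_out",
                              genome_size="500m")))
```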
## How much coverage do you need?
### Illumina Data
If you have a great reference genome and only need a bit of genetic variation data: only 6-8x
You need more, around 16x, to call two different haplotypes (8x per haplotype)
Draft assembly: 30x
Quality generally keeps improving on Illumina up to 50-60x, then it plateaus; you don't get much more out of higher-coverage Illumina data
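
The rules of thumb above can be written down as a small lookup, which may help when planning a run. This is just a sketch: the thresholds are the numbers from these notes (taking the lower end of each range), not hard limits, and the function and label names are made up for illustration.

```python
# (minimum coverage, what that depth should support), from the notes above.
ILLUMINA_THRESHOLDS = [
    (6, "variant calling against a great reference"),
    (16, "calling two haplotypes (about 8x each)"),
    (30, "draft assembly"),
]

PLATEAU_COVERAGE = 60  # upper end of the 50-60x plateau in the notes


def illumina_supports(coverage):
    """List the uses that a given Illumina depth should roughly support."""
    return [label for minimum, label in ILLUMINA_THRESHOLDS
            if coverage >= minimum]


def past_plateau(coverage):
    """True when extra Illumina depth is unlikely to add much."""
    return coverage > PLATEAU_COVERAGE


for depth in (8, 20, 40, 100):
    print(depth, illumina_supports(depth), past_plateau(depth))
```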
### Oxford Nanopore, long read
If you're going to combine it with Illumina data you can squeeze by with 5-8x, but when you start getting above ***10x*** and add Illumina, it starts to add up very nicely
30x - 50x - great data IF you have high-quality, high molecular weight DNA (>20kb is long, >200kb is good and long, over a megabase is really long).
Long reads provide that spatial data over a really long area - good for haplotypes
How to find coverage:
Genome size = 500,000,000 bp
Sequencing = 50 Gb (gigabases) of data
Sequencing / genome size = approximate coverage (here 50,000,000,000 / 500,000,000 = 100x)
Use: SAMTOOLS
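
As a quick worked version of the arithmetic, approximate coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes the 50 GB in the example means 50 gigabases of sequence (not file size on disk); the function names and the read count and read length in the second example are illustrative.

```python
def estimate_coverage(total_bases, genome_size):
    """Approximate coverage = sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def coverage_from_reads(read_count, read_length, genome_size):
    """Same estimate, starting from a read count and a mean read length."""
    return estimate_coverage(read_count * read_length, genome_size)


genome_size = 500_000_000        # 500 Mb genome
total_bases = 50_000_000_000     # 50 Gb of sequence (assumed to mean bases)

print(f"{estimate_coverage(total_bases, genome_size):.0f}x")

# Hypothetical run: 100 million reads of 150 bp on the same genome.
print(f"{coverage_from_reads(100_000_000, 150, genome_size):.0f}x")
```

This is only the back-of-the-envelope number; measuring realized depth from aligned reads is what samtools is for.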
### Challenges of complex genomes
SIZE, repeat regions, gene families, heterochromatin, polyploidy
Polyploids pose a complication for de Bruijn graphs (Tool: WhatsHap), you need VERY high depth of sequencing data for polyploids
### Heterozygosity
- related to polyploidy: how do you distinguish alleles from gene families?
- Some programs make assembly errors here, but Platanus handles heterozygosity really well
- http://platanus.bio.titech.ac.jp/
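
To see why heterozygosity (and polyploidy) complicates a de Bruijn graph, here is a toy sketch: two haplotypes that differ by a single SNP are broken into k-mers, and the graph gains a branch exactly where they diverge. The sequences and `k` are invented for illustration, and this is not how any particular assembler is implemented.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Nodes with more than one successor, i.e. where the path splits."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)


hap_a = "ACGTTGCAAGT"
hap_b = "ACGTTCCAAGT"  # same as hap_a except one SNP (G -> C)

print(branch_nodes(de_bruijn_edges([hap_a], k=4)))         # one haplotype
print(branch_nodes(de_bruijn_edges([hap_a, hap_b], k=4)))  # both haplotypes
```

With one haplotype the graph is a single path; with both, the node just before the SNP has two ways forward, and the assembler has to decide whether that "bubble" is two alleles or two members of a gene family.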
### How do we get a better assembly?
Scaffolding: use linkage data
- genetic maps (recommends Chromonomer)
- long reads [bionano]
  - Bionano gives you a "pattern of distances"
  - it "nicks" the DNA at specific places; you know where the nicks SHOULD occur in your assembly and compare the two
  - use runBNG
- Hi-C
  - Dovetail Genomics
  - Hi-C assembly tools: https://github.com/marbl/SALSA
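
The "pattern of distances" idea behind the Bionano step can be sketched as an in-silico digest: find where a nicking motif occurs in a contig and list the distances between consecutive sites, which is the pattern you would then compare against the observed optical map. This is a simplified illustration only. The motif `GCTCTTC` is used as an example recognition site, only the forward strand is searched, and the toy contig is invented; real pipelines handle both strands, resolution limits, and the comparison itself.

```python
def nick_sites(sequence, motif):
    """Return 0-based start positions of every occurrence of motif."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)
    return sites


def site_distances(sites):
    """Distances between consecutive sites: the expected label pattern."""
    return [b - a for a, b in zip(sites, sites[1:])]


MOTIF = "GCTCTTC"  # example nicking motif, assumed for illustration

# Toy contig: three motif copies separated by filler sequence.
contig = "AA" + MOTIF + "C" * 10 + MOTIF + "G" * 5 + MOTIF

sites = nick_sites(contig, MOTIF)
print(sites)
print(site_distances(sites))
```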
## Nanopore
- can usually run 3x through 1 flow cell (1000 pores, 300 pores, 100 pores)
- 10 micrograms of DNA for a flowcell