Bacterial Genome Assembly (any small genome)

# Bacterial Genome Assembly (any small genome)

Lecture from Corbin

###### tags: `code`

```
conda activate bioinfo
```

Raw read sequences > QC > trimming > CHOOSE PARAMETER SET > ASSEMBLY WITH PARAMETER SET > check (keep looping until you get the right parameter set) > generate draft assembly

## Short Read Illumina Assembly

### 1. SPAdes - A fast (but fussy) Assembler

https://cab.spbu.ru/software/spades/
https://github.com/ablab/spades

*This is already installed in the conda environment, but I can use the links above if I want to set it up myself later.*

```
mkdir results
```

Run SPAdes on the cleaned-up ".fq.gz" file:

```
spades.py -s /home/labuser/baags-materials/Bacterial/1-140.clean.1.fq.gz --isolate --only-assembler --checkpoints last -t 14 --tmp-dir temp/ -m 100 -o results/spades
```

- `-s` means single-end (unpaired) reads
- `--isolate` means the data come from a single genome (one isolate)

SPAdes uses the order and overlap of "k-mers" to work out how all of these sequences fit together:

1. Tears your sequences apart into k-mers and finds where they overlap
2. Builds contigs using an Eulerian or Hamiltonian path through a de Bruijn graph
    - The more reads that support one of these overlap relationships, the more confidence it has in that pattern. It tries to find the path with the greatest "weight".

### Caveats with SPAdes

- Fast, but a memory hog
- Searches for the best k-mer size itself (more on that)
- May not work well for large genomes
- Gets bogged down in repeat regions: reads from every copy of a repetitive element pile up on one sequence, so the element may appear only once in the assembly, with excess read depth and HIGH confidence

End results of SPAdes (`contigs.fasta` is written to the output directory, here `results/spades`):

```
assembly-stats contigs.fasta
```

Output looks like:

sum = 6655797, n = 357, ave = 18643.69, largest = 227545

***The assembly is ~6.66 Mb in 357 contigs; `ave` and `largest` are the average and largest contig lengths***

N50 = 73834, n = 27

***50% of the assembly is contained in the 27 longest contigs, each at least 73,834 bp long***

***The fewer and longer the contigs, the better***

N60 = 65067, n = 37

N70 = 54499, n = 48

N80 = 36900, n = 63

N90 = 19852, n = 87

N100 = 78, n = 357

N_count = 0

Gaps = 0

### 2. ABySS

https://github.com/bcgsc/abyss/releases/

- Less of a memory hog
- Can be slower
- Not quite as good for bacterial genomes, but sometimes better for larger genomes

`abyss-pe k=63 se=~/baags-materials/Bacterial/1-140.clean.1.fq.gz name=prefix`

The output is "unitigs".

## Long Read Assembly

### 3. Flye Assembler

```
conda activate bioinfo
```

`flye --nano-raw ~/data/P_aeruginosa_35.nanopore.fastq.gz --genome-size 6m --out-dir ./flye_pseudo --threads 7 --min-overlap 1500 &`

https://bio.tools/Flye : "Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PB / ONT reads as input and outputs polished contigs."

Why so much better?

- Long reads
- Massive coverage (40x vs 400x)
- Small bacterial genome
- Short reads are cheap, but they can't span repeat regions, so a full assembly is hard to get from them alone

## Clean up long read data with short read data

### 4. FMLRC(2)

What if you want to clean up long read sequence data with less error-prone short read data?

FMLRC(2) - see details from Dr. Jeremy Wang's lecture

## So what's the next step if you've got your assembled and polished genome? How do you annotate?

**No one correct answer UNLESS you annotate bacterial genomes (you can submit it to NCBI!!!)**

NCBI Prokaryotic Genome Annotation Pipeline - works great, is consistent
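Side note on the `assembly-stats` output from the SPAdes section: to make the `N50 = ..., n = ...` lines less of a black box, here is a minimal sketch of how that pair of numbers can be computed by hand. It assumes the usual definition (sort contigs longest first, add up lengths until at least half of the total is reached; N50 is the length of the contig that crosses that point and n is how many contigs it took). `n50_from_fasta` is a made-up helper name, and the toy FASTA is invented for illustration. I have not checked it against `assembly-stats` itself, so treat it as an approximation of what that tool reports.

```shell
#!/usr/bin/env bash
# Sketch only: n50_from_fasta is a hypothetical helper, not part of assembly-stats.
n50_from_fasta() {
  # Step 1: print one length per contig (sequence lines may be wrapped).
  awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
       { len += length($0) }
       END { if (seen) print len }' "$1" |
    # Step 2: sort lengths, longest first.
    sort -rn |
    # Step 3: accumulate until at least half of the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
           for (i = 1; i <= NR; i++) {
             running += lens[i]
             if (running * 2 >= total) {
               printf "N50 = %d, n = %d\n", lens[i], i
               exit
             }
           }
         }'
}

# Toy example (invented data): contigs of 8, 6, 4 and 2 bp, total 20 bp.
toy=$(mktemp)
printf '>c1\nACGT\nACGT\n>c2\nACGTAC\n>c3\nACGT\n>c4\nAC\n' > "$toy"
n50_from_fasta "$toy"   # prints: N50 = 6, n = 2  (8 + 6 = 14 >= 10)
rm -f "$toy"
```

On a real assembly the call would be something like `n50_from_fasta results/spades/contigs.fasta`. Read this way, `N50 = 73834, n = 27` says the 27 longest contigs, the shortest of which is 73,834 bp, already hold half of the assembled bases.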