Hybrid Error-correction

# Hybrid Error-correction by Dr. Jeremy Wang

### **Goal**: *Using higher-accuracy short reads (ex. Illumina) to "fix errors" in long error-prone reads or genome assemblies*

- Older Oxford Nanopore and PacBio reads used to have a 5-15% base error rate dominated by indels in homopolymers (the nanopore just sees a long stretch of the same base in a row and doesn't know how many there are :( )
- These have improved, but consensus accuracy still isn't as high as NGS

What are the two different approaches to error correction?

1. Correct errors in raw reads directly using Illumina data, then assemble (the only situation to use this is if coverage is low relative to genome size)
   a. Raw read accuracy is improved and the reads should assemble better
   b. Bias in error correction can cause issues, i.e. misassembly
      - Bias example: you might accidentally correct errors in a long read toward the most abundant sequence you have (i.e. repeat areas) and cause misassembly, because you have a lot of areas that look very similar
2. Assemble raw reads first, then polish the assembly (generally you want to do this! :+1: - will work for Drosophila!)
   a. Polished accuracy is very high
   b. Error correction doesn't improve assembly contiguity

### General Approaches

1. Align short reads to reads/contigs, then replace the sequence (looks similar to variant calling)
   a. Can correct a small amount of errors but not much, so pretty useless with old nanopore data
2. Local assemblies: find well-supported anchors, then assemble short reads across errors (FMLRC(2))

## FMLRC(2)

#### FM-index Long Read Correction

#### Uses approach 2 - basically anchored reassembly

Find high-accuracy "anchor" sequences, then look at the stuff in between to figure out the best way to bridge the gaps between them.

#### FM-index acts as a de Bruijn graph with arbitrary k-mer size

#### Multiple anchored paths are ranked by alignment score to the long read/contig sequence

- Support for a k-mer depends on how long it is (longer is better) and how many times it appears in the Illumina data (higher is better).
- It finds all the possible paths from anchor to anchor and then sees which path is best supported by the long read sequence.
- Then a second pass goes through the long read anchors and checks how often each path shows up in the short read data.
- Not all k-mers will have high support. For those that don't appear in the Illumina data, it finds a well-supported path in the de Bruijn graph to fill the gap across the poorly supported area.

#### Multiple passes with increasing values of k successively improve bridging

Assemble the long reads using the Flye assembler.

EXAMPLE using `pa35_flye_assembly.fasta` and `.fastq.gz` (Illumina data):

```
# Build a BWT of the Illumina reads with ropebwt2 and convert it for FMLRC
gunzip -c FILE_R1.fastq.gz FILE_R2.fastq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ./ropebwt2/ropebwt2 -LR | tr NT TN | ./fmlrc/fmlrc-convert comp_msbwt.npy
```
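
To see what the text-processing stages of the pipeline above do before the reads reach `ropebwt2`, here is a small sketch on a toy FASTQ. The two reads and the variable names (`toy_fastq`, `seqs`, `swapped`, `restored`) are made up for illustration and are not part of the original example. It only shows the effect of `awk 'NR % 4 == 2'`, `sort`, and `tr NT TN`. The likely reason for the swap is that the two tools order N and T differently in their alphabets, but that is an assumption, not something stated in these notes.

```shell
# Toy FASTQ with two 4-line records (hypothetical reads, illustration only)
toy_fastq=$(printf '@r1\nGATTN\n+\nIIIII\n@r2\nACGTA\n+\nIIIII\n')

# Keep line 2 of every 4-line record (the bases), then sort the sequences
seqs=$(printf '%s\n' "$toy_fastq" | awk 'NR % 4 == 2' | sort)
echo "$seqs"

# Swap N and T, as the pipeline does just before ropebwt2
swapped=$(printf '%s\n' "$seqs" | tr NT TN)
echo "$swapped"

# The same swap applied a second time restores the original letters,
# which is what the second `tr NT TN` in the pipeline relies on
restored=$(printf '%s\n' "$swapped" | tr NT TN)
echo "$restored"
```

For these toy reads the sorted sequences are `ACGTA` and `GATTN`, which become `ACGNA` and `GANNT` after the swap and return to the originals after the second swap. In the real pipeline the second swap is applied to the BWT output of `ropebwt2`, not to the reads themselves.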