Pipeline for draft genome assembly

--- title: 'Pipeline for draft genome assembly' disqus: hackmd --- Draft genome assembly === ## Table of Contents [TOC] The flow of *de novo* genome assembly --- ```flow st=>start: Raw reads e=>end: Draft genome e2=>end: Complete mitochondrial genome op1=>operation: Virw reads quality (FastQC) op2=>operation: Trim adapter and low quality bases (Trimmonic) cond=>condition: Clean? (FastQC) cond2=>condition: Genome? (or mitogenome) op3=>operation: de novo assembly (IDBA-UD) (SPAdes) (Minia) op4=>operation: Compare assembly quality (Quast) (BUSCO) op5=>operation: Mitogenome assembly (NOVOPlasty) op6=>operation: Mitogenome annotation (MITOS) (OrfFinder) (ARWEN) cond3=>condition: single circular genome? op7=>operation: Check assembly quality (Mapping reads via Bowtie2 or BWA) op8=>operation: PCR or Clone, sequencing with Sanger sequencing op9=>operation: Polishing genome assembly (SSPACE) op10=>operation: Checking assemble quality via mapping reads to the reference genome (Bowtie2) (BWA) st->op1->op2->cond cond(yes)->cond2 cond(no)->op2 cond2(yes)->op3(left)->op9->op10->op4->e cond2(no)->op5->cond3 cond3(yes)->op7->op6->e2 cond3(no)->op8->op7 ``` ## Raw reads quality control First of all, view qualities of raw reads via FastQC ### Trim adapter and low quality bases with Trimmomatic #### Trimmomatic command line ```gherkin= java -jar trimmomatic-0.36.jar PE -threads 3 R1.fastq.gz R2.fastq.gz R1_paired.fq R1_unpaired.fq R2_paired.fq R2_unpaired.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:28 TRAILING:28 SLIDINGWINDOW:4:28 MINLEN:36 HEADCROP:10 ``` **PE**: Pair end mode **threads**: processing CPU number **R1.fastq.gz R2.fastq.gz**: raw reads file name **R1_paired.fq R2_paired.fq**: paied reads after trimming **R1_unpaired.fq R2_unpaired.fq**: unpaired reads after trimming **ILLUMINACLIP**: Illumina mode **TruSeq3-PE.fa**: adapter sequences **LEADING**: cut off quality from head **TRAILING**: cut off quality from tail **MINLEN**: minmum length after trimming (small then this number would be throw) **HEADCROP**: cut assigned number of bases from head #### After trimming report It will report such as >Input Read Pairs: 74078367 Both Surviving: 51988333 (70.18%) Forward Only Surviving: 15387175 (20.77%) Reverse Only Surviving: 1559156 (2.10%) Dropped: 5143703 (6.94%) TrimmomaticPE: Completed successfully #### View reads statistic with FastQC again to confirm --- ## *De novo* assembly There are a lot of *de novo* assembly methods, we can assemble with multiple assembler and compare reaults. ### IDBA-UD As the IDBA assemblers use FASTA format. We first convert FASTQ format to FASTA format by fq2fa program in the package. ```gherkin= bin/fq2fa --merge --filter R1_paired.fq R2_paired.fq PEmerge.fa #merge file #R1 reads #R2 reads #converted fasta name #delete "N" base ``` Then we use IDBA-UD for addembly as follow: ```gherkin= bin/idba_ud -r PEmerge.fa -o idbaUD_assembly --min_support 10 --num_threads 5 --min_contig 300 ``` **-r**: Reads file in FASTA formate **-o**: Output direction **--min_support**: At least how many reads support a contige **--num_threads**: processing CPU number **--Min_contig**: Minmum contig length (bp) --- ### Polish assembly with SSPACE This software would reads mapping quality to improve the scaffolding. First we need to prepare a lib.txt file to indicate reads information. For example, the lib.txt file should contain information like this: ```xml Lib1 bwa YCA_R1_paired.fq YCA_R2_paired.fq 250 0.5 FR ``` Here the number **250** means the library insert size The number **0.5** represent the percent of insert range we allow. The example here indicate insert size range from 125-375 bp are allowed. **FR** means the forward and reverse reads. After giving the information above, we use follow command line to polish assembly. ```gherkin= SSPACE_Standard_v3.0.pl -l YCA_lib.txt -s idbaud_20180710.fa -b Default_out -T 3 ``` **-l**: reads information **-s**: original assembly **-b**: output name **-T**: processing CPU number ——— ## Quality assessment To evalute the quality of the assembly I would use QUAST and BUSCO to conduct statistics for reads and gene orthologs QUAST (Gurevich et al, 2013) BUSCO (busco.ezlab.org) --- ## Mitochondiral genome assembly ### NOVOPlasty For mitochondrial genome assembly, I used NOVOPlasty assembler. This assembler is easy to use, and accurate. As the instruction in NOVOPlasty, we only need 1. Find a suitable seed 2. Creat a configuration file 3. Run NOVOPlasty #### Find a suitable seed Download the closest species mitocondrial genome from NCBI as a seed. (Sometimes the closest one may be not the best, we should try several mitogenomes) #### Creat a configuration file Example: ```xml Project: ----------------------- Project name = Mito_genome Type = mito Genome Range = 12000-18000 K-mer = 39 Max memory = 20 Extended log = 0 Save assembled reads = 1 Seed Input = Seed.fasta Reference sequence = Variance detection = no Heteroplasmy = Chloroplast sequence = Dataset 1: ----------------------- Read Length = 125 Insert size = 250 Platform = illumina Single/Paired = PE Combined reads = Forward reads = R1_paired.fq.gz Reverse reads = R2_paired.fq.gz Optional: ----------------------- Insert size auto = yes Insert Range = 1.6 Insert Range strict = 1.2 ``` #### Run NOVOPlasty Simply type: ```gherkin= perl NOVOPlasty.pl -c config.txt ```