---
title: 'Pipeline for draft genome assembly'
disqus: hackmd
---
Draft genome assembly
===
## Table of Contents
[TOC]
The flow of *de novo* genome assembly
---
```flow
st=>start: Raw reads
e=>end: Draft genome
e2=>end: Complete mitochondrial genome
op1=>operation: Virw reads quality
(FastQC)
op2=>operation: Trim adapter and
low quality bases
(Trimmonic)
cond=>condition: Clean?
(FastQC)
cond2=>condition: Genome?
(or mitogenome)
op3=>operation: de novo assembly
(IDBA-UD)
(SPAdes)
(Minia)
op4=>operation: Compare assembly quality
(Quast)
(BUSCO)
op5=>operation: Mitogenome assembly
(NOVOPlasty)
op6=>operation: Mitogenome annotation
(MITOS)
(OrfFinder)
(ARWEN)
cond3=>condition: single
circular
genome?
op7=>operation: Check assembly quality
(Mapping reads via Bowtie2 or BWA)
op8=>operation: PCR or Clone,
sequencing with Sanger sequencing
op9=>operation: Polishing genome assembly
(SSPACE)
op10=>operation: Checking assemble quality
via mapping reads
to the reference genome
(Bowtie2)
(BWA)
st->op1->op2->cond
cond(yes)->cond2
cond(no)->op2
cond2(yes)->op3(left)->op9->op10->op4->e
cond2(no)->op5->cond3
cond3(yes)->op7->op6->e2
cond3(no)->op8->op7
```
## Raw reads quality control
First of all, view qualities of raw reads via FastQC
### Trim adapter and low quality bases with Trimmomatic
#### Trimmomatic command line
```gherkin=
java -jar trimmomatic-0.36.jar PE -threads 3 R1.fastq.gz R2.fastq.gz R1_paired.fq R1_unpaired.fq R2_paired.fq R2_unpaired.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:28 TRAILING:28 SLIDINGWINDOW:4:28 MINLEN:36 HEADCROP:10
```
**PE**: Pair end mode
**threads**: processing CPU number
**R1.fastq.gz R2.fastq.gz**: raw reads file name
**R1_paired.fq R2_paired.fq**: paied reads after trimming
**R1_unpaired.fq R2_unpaired.fq**: unpaired reads after trimming
**ILLUMINACLIP**: Illumina mode
**TruSeq3-PE.fa**: adapter sequences
**LEADING**: cut off quality from head
**TRAILING**: cut off quality from tail
**MINLEN**: minmum length after trimming (small then this number would be throw)
**HEADCROP**: cut assigned number of bases from head
#### After trimming report
It will report such as
>Input Read Pairs: 74078367 Both Surviving: 51988333 (70.18%) Forward Only Surviving: 15387175 (20.77%) Reverse Only Surviving: 1559156 (2.10%) Dropped: 5143703 (6.94%)
TrimmomaticPE: Completed successfully
#### View reads statistic with FastQC again to confirm
---
## *De novo* assembly
There are a lot of *de novo* assembly methods, we can assemble with multiple assembler and compare reaults.
### IDBA-UD
As the IDBA assemblers use FASTA format. We first convert FASTQ format to FASTA format by fq2fa program in the package.
```gherkin=
bin/fq2fa --merge --filter R1_paired.fq R2_paired.fq PEmerge.fa
#merge file #R1 reads #R2 reads #converted fasta name
#delete "N" base
```
Then we use IDBA-UD for addembly as follow:
```gherkin=
bin/idba_ud -r PEmerge.fa -o idbaUD_assembly --min_support 10 --num_threads 5 --min_contig 300
```
**-r**: Reads file in FASTA formate
**-o**: Output direction
**--min_support**: At least how many reads support a contige
**--num_threads**: processing CPU number
**--Min_contig**: Minmum contig length (bp)
---
### Polish assembly with SSPACE
This software would reads mapping quality to improve the scaffolding.
First we need to prepare a lib.txt file to indicate reads information.
For example, the lib.txt file should contain information like this:
```xml
Lib1 bwa YCA_R1_paired.fq YCA_R2_paired.fq 250 0.5 FR
```
Here the number **250** means the library insert size
The number **0.5** represent the percent of insert range we allow.
The example here indicate insert size range from 125-375 bp are allowed.
**FR** means the forward and reverse reads.
After giving the information above, we use follow command line to polish assembly.
```gherkin=
SSPACE_Standard_v3.0.pl -l YCA_lib.txt -s idbaud_20180710.fa -b Default_out -T 3
```
**-l**: reads information
**-s**: original assembly
**-b**: output name
**-T**: processing CPU number
———
## Quality assessment
To evalute the quality of the assembly
I would use QUAST and BUSCO to conduct statistics for reads and gene orthologs
QUAST (Gurevich et al, 2013)
BUSCO (busco.ezlab.org)
---
## Mitochondiral genome assembly
### NOVOPlasty
For mitochondrial genome assembly, I used NOVOPlasty assembler.
This assembler is easy to use, and accurate.
As the instruction in NOVOPlasty, we only need
1. Find a suitable seed
2. Creat a configuration file
3. Run NOVOPlasty
#### Find a suitable seed
Download the closest species mitocondrial genome from NCBI as a seed.
(Sometimes the closest one may be not the best, we should try several mitogenomes)
#### Creat a configuration file
Example:
```xml
Project:
-----------------------
Project name = Mito_genome
Type = mito
Genome Range = 12000-18000
K-mer = 39
Max memory = 20
Extended log = 0
Save assembled reads = 1
Seed Input = Seed.fasta
Reference sequence =
Variance detection = no
Heteroplasmy =
Chloroplast sequence =
Dataset 1:
-----------------------
Read Length = 125
Insert size = 250
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = R1_paired.fq.gz
Reverse reads = R2_paired.fq.gz
Optional:
-----------------------
Insert size auto = yes
Insert Range = 1.6
Insert Range strict = 1.2
```
#### Run NOVOPlasty
Simply type:
```gherkin=
perl NOVOPlasty.pl -c config.txt
```