Try   HackMD

Soapberry bug genome annotation

Dave Angelini
Started 21 September 2021

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

This page will document the workflow used to annotate the genome of the red-shouldered soapberry bug, Jadera haematoloma.

Background on the bugs

The red-shouldered soapberry bug Jadera haematoloma (Hemiptera: Heteroptera: Rhopalidae) is a scentless plant bug native to the US Gulf Coast. It feeds on several native plants of the soapberry family (Sapindaceae) and, since the mid-twentieth century, has adapted to live on the introduced Chinese goldenrain tree (Koelreuteria ssp.). This host shift, along with the abundance of red-shouldered soapberry bugs in urban environments, has made J. haematoloma an excellent model for the study of rapid adaptive evolution (Tsai 2013; Panfilio & Angelini 2018). Indeed, different researchers have examined rapid evolution in beak length (e.g. Carrol & Loye 1987; Yu & Andrés 2014; Cenzer 2016; 2017), the wing/reproductive polyphenism (Carroll et al. 2003; Fawcett et al. 2018), and several other life history traits (Carroll 1991; Carroll et al. 1998) of this animal.

My lab has been studying appendage development and wing polyphenism in the bugs. And we were interested in a draft genome sequence as a resource for developmental genetics, to contextualize genotyping and population studies, and as a point of comparison to the genomes of other insects, especially Oncopeltus fasciatus (Panfilio et al. 2019).

Genome project history

The soapberry bug karyotype was described by Lelia Porter in 1917. Males have 13 chromosomes, and the species appears to use an X0 sex determination system. So there are six pairs of autosomes, and an X that is slightly larger than the smallest autosome (Porter 1917). Based on meiotic behavior, the smallest chromosome has been described as an "m-chromosome". It does not appear to have chiasmata during meiotic prophase and migrates to the poles early in anaphase (Ueshima 1979).

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Camera lucida drawing of a spermatocyte in prophase I. From Porter (1917), Figure 5.

In 2015, Spencer Johnston at Texas A&M used flow cytometry to estimate the genome size of Jadera haematoloma at about 1.95 Gbp.

In 2018, an anonymous gift was made to Colby College to support research in genomics and bioinformatics, and we were given the green light to use these funds for a genome sequencing project. Additional funding was provided by Maine INBRE and the Colby Department of Biology. We contracted Dovetail Genomics for sequencing and assembly. Dovetail offers a combination of library preparation methods, including the use of HiC proximity end-pairing, which allows for assembly to chromosome length.

To reduce heterozygosity, we chose to sequence a lab population of bugs, originally from Plantation Key in Tavernier, FL. Devin O'Brien, then a postdoc in my lab, crossed full siblings for 5 generations. Dovetail made the DNA isolation and prepared 10X and HiC libraries from one of these in-bred male bugs.

Assembly

In August 2019, Dovetail returned the draft genome assembly. The total sequence length from all scaffolds was 2.08 Gbp, very close to the previous estimate. Seven large scaffolds contain 89.9% of the sequence, and likely represent the seven chromosomes seen in the bug's karyotype.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

The chromosomes of J. haematoloma are represented by seven scaffolds over 1 Mbp in length. Here, the length of these scaffolds is plotted on a log-scale against their sequencing depth, reflected by the number of reads mapping per million assembled base pairs (CPM). Chromosome names are given to the scaffolds based on the size and read depth, following Porter (1917).

Metrics, such as the distribution of ambiguous nucleotides and repetitive sequences, all indicate that the genome assembly is high quality.

We are now in the annotation phase of this project. A preliminary survey using BLAST found orthologs for 80 of 81 candidate genes.

chromosome length (Mbp) number of candidate genes gene density (per Mbp)
Chr1 559.6 24 0.0429
Chr2 375.1 12 0.0320
Chr3 293.5 13 0.0443
Chr4 240.4 9 0.0374
Chr5 193.1 19 0.0984
X 179.5 12 0.0669
m 28.9 0 0
other scaffolds each <0.56 (211.5 overall) 1 0.0047 (overall)

We are currently using de novo gene prediction methods to further characterize the genome. Gene expression studies are also underway to characterize genes involved in nutritionally dependent plasticity in wing growth and patterning. In the future, the genome sequenbce will also enable population-level differences among bugs in the wild to be mapped and placed in the context of genes and other genomic features.

RNAseq libraries

As a resource for annaotation, we have sequence from several Illumina 2x125bp RNAseq libraries.

population tissue sex morph bio reps raw reads
Taverneir, FL whole body f LW 3 174,076,008
Taverneir, FL whole body f SW 3 143,188,710
Taverneir, FL whole body m LW 3 160,704,622
Taverneir, FL whole body m SW 3 166,984,020
Aurora, CO dorsal thorax f LW 3 167,414,310
Aurora, CO dorsal thorax f SW 3 111,135,632
Aurora, CO dorsal thorax m LW 3 185,784,180
Aurora, CO dorsal thorax m SW 3 158,732,678
Taverneir, FL dorsal thorax f LW 3 177,552,950
Taverneir, FL dorsal thorax f SW 3 149,760,062
Taverneir, FL dorsal thorax m LW 3 158,470,008
Taverneir, FL dorsal thorax m SW 3 161,301,718
Aurora, CO ovaries f LW 3 151,962,414
Aurora, CO ovaries f SW 3 113,260,652
Taverneir, FL ovaries f LW 3 172,248,994
Taverneir, FL ovaries f SW 3 146,809,538
Aurora, CO testes m LW 3 166,917,032
Aurora, CO testes m SW 3 153,339,676
Taverneir, FL testes m LW 3 150,885,796
Taverneir, FL testes m SW 3 157,930,172

At the same time as those libraries above, we also sequenced mRNA from a few whole individuals of two related species.

species population sex morph bio reps raw reads
J. sanguinolenta Islamorada, FL f LW 2 73,426,516
Islamorada, FL m LW 2 79,259,328
Boisea trivittata Waterville, ME f LW 2 61,980,192
Waterville, ME m LW 2 66,033,812

We also have close to 192 samples sequenced using 3'-end tag-seq to measure gene expression: 3 biological repliates x 2 sexes x 2 morphs (or extreme food regimes) x 4 populations x 2 tissues x 2 stages (nascent adult and fifth instars).

Annotation Workflow Plan

We are proceeding now with in-house annotation of the genome, based on resources suggested by CBC-UConn.

Genome Annotation

RNA Data

Repeat Predictions

RepeatModeler

BUSCO

check number & type

RepeatMasker

RepeatMasker

FastQC, MultiQC

quality trimming

FastQC, MultiQC

hisat2

check alignments

BRAKER2

BRAKER2

FAA: BUSCO

gFACs

Manual curation

EnTAP

OrthoFinder

FAA: BUSCO

FNA: Genome Assembly

FNA: Repeat Predictions

FNA: Softmasked Genome

FQ: RNAseq raw reads

FQ: filtered RNAseq reads

SAM: aligned RNA evidence

GFF: Braker gene predictions

GFF: Gene subsets for BUSCO

GFF: Annotated Genome

Repeat Prediction

Notes

  • It's helpful to do repeat masking first, before predicting genes. Some long repetive elements can look like genes to predictive algorithms.
  • Jill recommends RepeatModeler2. May take 5 days to run on human genome on a typical HPC core.
    • How much repetition is enough? For the repeat scout algorithm, the default is a minimum of 10 repeats.
  • censensi.fa is one of the key output files of RepeatModeler, which lists the LTRs.
  • censensi.fa.classified makes a provisional classification of the LTR types e.g. Copia, Gypsy, etc. This is the file used as input by the next step, RepeatMasker.
  • RepeatMasker literally "masks" parts of the genome sequence by making the characters lower case. This is called "soft" masking.
  • Dfam also uses public repository for repetitive element classification.
  • The key output of RepeatMasker is e.g. Athaliana_167_TAIR9.fa.masked. The GFF Athaliana_167_TAIR9.fa.out.gff details what the masks are. A summary file Athaliana_167_TAIR9.fa.tbl gives nice genome-wide stats on repetitive elements.

BOOKMARK: Working from here

RNA mapping

Notes

  • Guidelines for RNA data: at least 15 million reads per library for use in gene prediction
  • Dfam (or other public repeat libraries) can be added to the de novo repeat library created by RepeatModeler as input to RepeatMasker.
  • samtools stat is a useful tool to summarize the contents of a SAM file.
  • The UCSC genome browser can render a BED file.
  • hitsat2 (and other aligners) require first making an index e.g. hisat2-build
    • For multiple libraries, modify a script like the CBC's to loop through each library file name.
  • After aligning, hitsat2 will produce sorted BAM files, and the original SAM files can be removed.
  • samtools flagstat provides summary stats on each file.
  • More than 85% mapping is considered good.
  • sbatch 04b_busco.sh

BOOKMARK: Working from here

Genome Annotation

Get through Repeat Modeling and RNA mapping before starting annotation proper.

Notes

  • Best gene prediction results for a genome are produced by a mixture of ab initio and homology-based methods.
  • The BRAKER2 pipeline is best if you have RNAseq data. MAKER is pretty much obsolete, unless you're annotating a genome without RNA data.
  • FINDER is the new tool on the horizon: combines evidence-based and ab initio gene prediction.
  • BRAKER2 run times vary. About 5 days for a human genome; 3 hours for Arabidopsis (a very small genome for plants!).
  • After a BRAKER run, use the augustus.hints.aa file as input for BUSCO
    • BUSCO output includes a list of missing and fragmented genes, by OrthoDB IDs that can be queried at https://v100.orthodb.org/
  • Monoexonic genes are predicted more often than they should actually occur. Because they are actually pseudogenes, LTRs, etc. Looking separately at mono- and multiexonic genes (using e.g. gFACs) can help identified potential problems.
  • gFACS is a useful tools for filtering e.g. multiexonic genes, genes with particular length.
  • It's still hard to completely escape manual annotation of e.g. intron/exon boundaries for genes of interest!
  • BUSCO output genes can be taken as training input for ab initio gene prediction.
  • PASA is chooser/combiner pipeline that combines multiple forms of evidence for gene prediction.
  • Biopython is useful for dealing with GenBank
  • BioMart can help with functional annotation of predicated genes.
  • Use protein domain predictions as a cross-validation on suspect predictions e.g. monoexonic predictions. If there's no domains, that's less confidence it's real.
  • IGV is a local viewer for genome feature files, like BED or GFF
  • EnTAP combines annotations from protein domain assignment, orthologous gene family assessment, Gene Ontology term assignment, and KEGG pathway annotation. It weighs evident from different sources.
    • EnTAP can use lots of cores and memory! (e.g. 3 hours for the Arabidopsis test case)
    • Include e.g. RefSeq proteins, UniProt protein databases. While you can restrict the database scope for faster runs, it's useful to include everything to find potential contaminants.
    • You can list the target organism in EnTAP and it will leverage NCBI taxonomy.
    • EnTAP looks awesome! Great for applying GO etc. annotations!
    • The EnTAP log file has useful info on e.g. species from which annotations were drawn, % sequences without alignment or annotations
  • In the absence of RNAseq reads, MAKER is a good option. Follow-up with PASA to remove false positives.
  • MAKER can be used cyclically. Run 3-4 times to refine predictions from ab initio training.
  • Apollo local editor allows manual edits to e.g. MAKER output.
  • For incorporating long RNA reads, use an aligner designed for long, noisy reads, e.g. Minimap2
  • Long reads may lead to annotation of more "duplicate" genes, due to the higher error rate.
  • gffcompare allows comp of different annotation strategies.
  • OrthoFinder is useful for comparative genomics
    • For species comparisons, the fasta folder would contain amino acid fasta files for all species you'd like to compare.
    • OrthoFinder results: If one species has way more of a particular orthogroup, it may be a repeat, not a gene.