Dave Angelini
Started 21 September 2021
This page will document the workflow used to annotate the genome of the red-shouldered soapberry bug, Jadera haematoloma.
The red-shouldered soapberry bug Jadera haematoloma (Hemiptera: Heteroptera: Rhopalidae) is a scentless plant bug native to the US Gulf Coast. It feeds on several native plants of the soapberry family (Sapindaceae) and, since the mid-twentieth century, has adapted to live on the introduced Chinese goldenrain tree (Koelreuteria spp.). This host shift, along with the abundance of red-shouldered soapberry bugs in urban environments, has made J. haematoloma an excellent model for the study of rapid adaptive evolution (Tsai 2013; Panfilio & Angelini 2018). Indeed, different researchers have examined rapid evolution in beak length (e.g. Carroll & Loye 1987; Yu & Andrés 2014; Cenzer 2016; 2017), the wing/reproductive polyphenism (Carroll et al. 2003; Fawcett et al. 2018), and several other life history traits (Carroll 1991; Carroll et al. 1998) of this animal.
My lab has been studying appendage development and wing polyphenism in the bugs, and we were interested in a draft genome sequence as a resource for developmental genetics, to contextualize genotyping and population studies, and as a point of comparison to the genomes of other insects, especially Oncopeltus fasciatus (Panfilio et al. 2019).
The soapberry bug karyotype was described by Lelia Porter in 1917. Males have 13 chromosomes, and the species appears to use an X0 sex determination system. So there are six pairs of autosomes, and an X that is slightly larger than the smallest autosome (Porter 1917). Based on meiotic behavior, the smallest chromosome has been described as an "m-chromosome". It does not appear to have chiasmata during meiotic prophase and migrates to the poles early in anaphase (Ueshima 1979).
Camera lucida drawing of a spermatocyte in prophase I. From Porter (1917), Figure 5.
In 2015, Spencer Johnston at Texas A&M used flow cytometry to estimate the genome size of Jadera haematoloma at about 1.95 Gbp.
In 2018, an anonymous gift was made to Colby College to support research in genomics and bioinformatics, and we were given the green light to use these funds for a genome sequencing project. Additional funding was provided by Maine INBRE and the Colby Department of Biology. We contracted Dovetail Genomics for sequencing and assembly. Dovetail offers a combination of library preparation methods, including the use of HiC proximity end-pairing, which allows for assembly to chromosome length.
To reduce heterozygosity, we chose to sequence a lab population of bugs, originally from Plantation Key in Tavernier, FL. Devin O'Brien, then a postdoc in my lab, crossed full siblings for 5 generations. Dovetail performed the DNA isolation and prepared 10X and HiC libraries from one of these inbred male bugs.
In August 2019, Dovetail returned the draft genome assembly. The total sequence length from all scaffolds was 2.08 Gbp, very close to the previous estimate. Seven large scaffolds contain 89.9% of the sequence, and likely represent the seven chromosomes seen in the bug's karyotype.
The chromosomes of J. haematoloma are represented by seven scaffolds over 1 Mbp in length. Here, the length of these scaffolds is plotted on a log-scale against their sequencing depth, reflected by the number of reads mapping per million assembled base pairs (CPM). Chromosome names are given to the scaffolds based on the size and read depth, following Porter (1917).
Metrics, such as the distribution of ambiguous nucleotides and repetitive sequences, all indicate that the genome assembly is high quality.
We are now in the annotation phase of this project. A preliminary survey using BLAST found orthologs for 80 of 81 candidate genes.
chromosome | length (Mbp) | number of candidate genes | gene density (per Mbp) |
---|---|---|---|
Chr1 | 559.6 | 24 | 0.0429 |
Chr2 | 375.1 | 12 | 0.0320 |
Chr3 | 293.5 | 13 | 0.0443 |
Chr4 | 240.4 | 9 | 0.0374 |
Chr5 | 193.1 | 19 | 0.0984 |
X | 179.5 | 12 | 0.0669 |
m | 28.9 | 0 | 0 |
other scaffolds | each <0.56 (211.5 overall) | 1 | 0.0047 (overall) |
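The gene density column is simply the candidate gene count divided by the scaffold length. As a minimal sketch, the Python below recomputes that column from the lengths and counts in the table above; the variable names are illustrative, not part of any project code.

```python
# Scaffold length (Mbp) and candidate gene count, copied from the table above.
SCAFFOLDS = {
    "Chr1": (559.6, 24),
    "Chr2": (375.1, 12),
    "Chr3": (293.5, 13),
    "Chr4": (240.4, 9),
    "Chr5": (193.1, 19),
    "X": (179.5, 12),
    "m": (28.9, 0),
    "other scaffolds": (211.5, 1),
}

# Candidate genes per Mbp for each scaffold (or scaffold group).
density = {name: genes / length for name, (length, genes) in SCAFFOLDS.items()}

# Total assembled length, and the share held by the seven named scaffolds.
total_mbp = sum(length for length, _ in SCAFFOLDS.values())
named_mbp = total_mbp - SCAFFOLDS["other scaffolds"][0]
named_fraction = named_mbp / total_mbp

for name, value in density.items():
    print(f"{name}\t{value:.4f}")
print(f"total length: {total_mbp:.1f} Mbp")
print(f"fraction in named scaffolds: {named_fraction:.3f}")
```

Because the lengths are rounded to 0.1 Mbp, fractions computed this way can differ slightly from figures derived from the exact scaffold lengths.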
We are currently using de novo gene prediction methods to further characterize the genome. Gene expression studies are also underway to characterize genes involved in nutritionally dependent plasticity in wing growth and patterning. In the future, the genome sequence will also enable population-level differences among bugs in the wild to be mapped and placed in the context of genes and other genomic features.
As a resource for annotation, we have sequence from several Illumina 2x125bp RNAseq libraries.
population | tissue | sex | morph | bio reps | raw reads |
---|---|---|---|---|---|
Tavernier, FL | whole body | f | LW | 3 | 174,076,008 |
Tavernier, FL | whole body | f | SW | 3 | 143,188,710 |
Tavernier, FL | whole body | m | LW | 3 | 160,704,622 |
Tavernier, FL | whole body | m | SW | 3 | 166,984,020 |
Aurora, CO | dorsal thorax | f | LW | 3 | 167,414,310 |
Aurora, CO | dorsal thorax | f | SW | 3 | 111,135,632 |
Aurora, CO | dorsal thorax | m | LW | 3 | 185,784,180 |
Aurora, CO | dorsal thorax | m | SW | 3 | 158,732,678 |
Tavernier, FL | dorsal thorax | f | LW | 3 | 177,552,950 |
Tavernier, FL | dorsal thorax | f | SW | 3 | 149,760,062 |
Tavernier, FL | dorsal thorax | m | LW | 3 | 158,470,008 |
Tavernier, FL | dorsal thorax | m | SW | 3 | 161,301,718 |
Aurora, CO | ovaries | f | LW | 3 | 151,962,414 |
Aurora, CO | ovaries | f | SW | 3 | 113,260,652 |
Tavernier, FL | ovaries | f | LW | 3 | 172,248,994 |
Tavernier, FL | ovaries | f | SW | 3 | 146,809,538 |
Aurora, CO | testes | m | LW | 3 | 166,917,032 |
Aurora, CO | testes | m | SW | 3 | 153,339,676 |
Tavernier, FL | testes | m | LW | 3 | 150,885,796 |
Tavernier, FL | testes | m | SW | 3 | 157,930,172 |
Alongside the libraries above, we also sequenced mRNA from a few whole individuals of two related species.
species | population | sex | morph | bio reps | raw reads |
---|---|---|---|---|---|
J. sanguinolenta | Islamorada, FL | f | LW | 2 | 73,426,516 |
J. sanguinolenta | Islamorada, FL | m | LW | 2 | 79,259,328 |
Boisea trivittata | Waterville, ME | f | LW | 2 | 61,980,192 |
Boisea trivittata | Waterville, ME | m | LW | 2 | 66,033,812 |
We also have close to 192 samples sequenced using 3'-end tag-seq to measure gene expression: 3 biological replicates x 2 sexes x 2 morphs (or extreme food regimes) x 4 populations x 2 tissues x 2 stages (nascent adult and fifth instars).
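The sample count follows from a fully crossed design. The sketch below enumerates it in Python as a check on the arithmetic. The population and tissue labels are hypothetical placeholders, since the tag-seq populations and tissues are not named here; only the number of levels per factor comes from the text.

```python
from itertools import product

# Factor levels. Only the counts are taken from the text; the population
# and tissue names below are hypothetical placeholders.
replicates = [1, 2, 3]
sexes = ["f", "m"]
morphs = ["LW", "SW"]  # or the two extreme food regimes
populations = ["pop1", "pop2", "pop3", "pop4"]
tissues = ["tissue1", "tissue2"]
stages = ["fifth instar", "nascent adult"]

# Every combination of factor levels is one expected sample.
samples = list(product(populations, tissues, stages, sexes, morphs, replicates))
print(len(samples))
```

A list like this can also serve as a checklist against the files actually received, which is one way to see why the real total is "close to" the full design.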
We are proceeding now with in-house annotation of the genome, based on resources suggested by CBC-UConn.
For the RepeatScout algorithm, the default is a minimum of 10 repeats.
consensi.fa is one of the key output files of RepeatModeler, which lists the LTRs.
consensi.fa.classified makes a provisional classification of the LTR types, e.g. Copia, Gypsy, etc. This is the file used as input by the next step, RepeatMasker.
RepeatMasker writes the masked sequence to Athaliana_167_TAIR9.fa.masked. The GFF Athaliana_167_TAIR9.fa.out.gff details what the masks are. A summary file Athaliana_167_TAIR9.fa.tbl gives nice genome-wide stats on repetitive elements.
BOOKMARK: Working from here
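Per-scaffold masking totals can be tallied directly from the RepeatMasker GFF. This is a rough sketch, assuming the usual nine tab-separated GFF columns with start and end coordinates in columns 4 and 5. The example records are invented for illustration, and overlapping features would be counted twice.

```python
from collections import defaultdict

def masked_bp_per_sequence(gff_text):
    """Sum feature lengths (end - start + 1) for each sequence in GFF text."""
    totals = defaultdict(int)
    for line in gff_text.splitlines():
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        totals[seqid] += end - start + 1
    return dict(totals)

# Invented records for illustration only; the attribute column is not parsed.
EXAMPLE_GFF = "\n".join([
    "##gff-version 2",
    "\t".join(["Chr1", "RepeatMasker", "similarity", "101", "200", "12.5", "+", ".", "example"]),
    "\t".join(["Chr1", "RepeatMasker", "similarity", "501", "650", "8.1", "-", ".", "example"]),
    "\t".join(["Chr2", "RepeatMasker", "similarity", "1", "50", "3.0", "+", ".", "example"]),
])

masked = masked_bp_per_sequence(EXAMPLE_GFF)
print(masked)
```

For genome-wide figures the .tbl summary remains the simpler source. A tally like this is mainly useful for comparing scaffolds with each other.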
samtools stats is a useful tool to summarize the contents of a SAM file.
hisat2 (and other aligners) requires first making an index, e.g. with hisat2-build.
samtools flagstat provides summary stats on each file.
sbatch 04b_busco.sh
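To compare mapping rates across many libraries, the flagstat summaries can be parsed rather than read by eye. The sketch below assumes the default plain-text layout of samtools flagstat, where each line starts with "N + M" followed by a label. The example text and its numbers are invented, and the function names are illustrative.

```python
import re

def parse_flagstat(text):
    """Return {label: QC-passed count} from samtools flagstat text output."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.*)", line.strip())
        if not match:
            continue
        # Drop any trailing parenthetical, e.g. "mapped (95.00% : N/A)".
        label = match.group(3).split(" (")[0]
        # Keep the first occurrence if two lines reduce to the same label.
        counts.setdefault(label, int(match.group(1)))
    return counts

def mapping_rate(counts):
    """Fraction of QC-passed reads that mapped, or 0.0 if there are no reads."""
    total = counts.get("in total", 0)
    return counts.get("mapped", 0) / total if total else 0.0

# Invented example, abbreviated from the real report for illustration.
EXAMPLE = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00% : N/A)
"""

flagstat_counts = parse_flagstat(EXAMPLE)
print(mapping_rate(flagstat_counts))
```

Applied to one flagstat file per library, this gives a single number per sample that can be tabulated next to the raw read counts.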
BOOKMARK: Working from here
Get through Repeat Modeling and RNA mapping before starting annotation proper.
Use the augustus.hints.aa file as input for BUSCO.
gffcompare allows comparison of different annotation strategies.
The fasta folder would contain amino acid fasta files for all species you'd like to compare.