# Soapberry bug genome summary
Dave Angelini, Colby College
19 October 2019

## Background on the bugs
The red-shouldered soapberry bug *Jadera haematoloma* (Hemiptera: Heteroptera: Rhopalidae) is a scentless plant bug native to the US Gulf Coast. It feeds on several native plants of the soapberry family (Sapindaceae) and, since the mid-twentieth century, has adapted to live on the introduced Chinese goldenrain tree (*Koelreuteria* spp.). This host shift, along with the abundance of red-shouldered soapberry bugs in urban environments, has made *J. haematoloma* an excellent model for the study of rapid adaptive evolution ([Tsai 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689129/); [Panfilio & Angelini 2018](https://www.sciencedirect.com/science/article/pii/S2214574517301153)). Indeed, several researchers have examined rapid evolution in beak length (e.g. [Carroll & Loye 1987](https://academic.oup.com/aesa/article/80/3/373/10793); [Yu & Andrés 2014](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3931560/); [Cenzer 2016](https://academic.oup.com/aesa/article/80/3/373/10793); [2017](https://www.journals.uchicago.edu/doi/10.1086/693456)), the wing/reproductive polyphenism ([Carroll et al. 2003](http://soapberrybug.org/_dbase_upl/Carroll_et_al._Ann_Ent_2003.pdf); [Fawcett et al. 2018](https://www.nature.com/articles/s41467-018-04102-1)), and several other life history traits ([Carroll 1991](http://soapberrybug.org/_dbase_upl/C1991.pdf); [Carroll et al. 1998](http://soapberrybug.org/_dbase_upl/Carroll_et_al._Ev_Eco_1998.pdf)) of this animal.
## Genome project history
The soapberry bug karyotype was described by Lelia Porter in 1917. Males have 13 chromosomes, and the species appears to use an X0 sex determination system. So there are six pairs of autosomes, and an X that is slightly larger than the smallest autosome ([Porter 1917](https://www.journals.uchicago.edu/doi/10.2307/1536296)). Based on meiotic behavior, the smallest chromosome has been described as an "m-chromosome". It does not appear to have chiasmata during meiotic prophase and migrates to the poles early in anaphase ([Ueshima 1979](https://www.amazon.com/Hemiptera-II-Heteroptera-Animal-cytogenetics/dp/344326008X)).

> Camera lucida drawing of a spermatocyte in prophase I. From [Porter (1917)](https://www.journals.uchicago.edu/doi/10.2307/1536296), Figure 5.

In 2015, Spencer Johnston provided an estimate of the genome sizes for *Jadera haematoloma* and *J. sanguinolenta*.

| species | sex | genome estimate (Mbp) | std err |
|:------- |:---:| ------------:| -------:|
| *J. haematoloma* | female | 1972.5 | 14.3 |
| | male | 1909.6 | 31.9 |
| *J. sanguinolenta* | female | 2167.5 | 29.9 |
| | male | 2094.0 | 9.2 |

So we knew the genome was very large!

My lab has been studying appendage development and wing polyphenism in the bugs, and we were interested in a draft genome sequence as a resource for developmental genetics, to contextualize genotyping and population studies, and as a point of comparison to the genomes of other insects, especially *Oncopeltus fasciatus* ([Panfilio et al. 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1660-0)).

In 2018, an anonymous gift was made to [Colby College](https://www.colby.edu/) to support research in genomics and bioinformatics, and we were given the green light to use these funds for a genome sequencing project. Additional funding was provided by [Maine INBRE](https://inbre.maineidea.net/) and the Colby [Department of Biology](http://www.colby.edu/bio/). We contracted [Dovetail Genomics](https://dovetailgenomics.com/) for sequencing and assembly, as I had heard of others using this company for de novo insect genome projects. Dovetail offers a combination of library preparation methods, including the use of HiC proximity end-pairing, which, according to the company, allow for assembly to chromosome length.

To reduce heterozygosity, we chose to sequence a lab population of bugs. They were originally collected from [Plantation Key](https://www.google.com/maps/place/Coral+Rd,+Islamorada,+FL+33070/@24.9778928,-80.6211314,10z/) in Tavernier, FL, from balloon vine, and have been bred in the lab since 2015 at a population size averaging about 50 adults. In preparation for the genome project, Devin O'Brien crossed full siblings for 5 generations. Devin is working as a postdoc in my lab on the evolution and genomics of growth regulation and scaling relationships.
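As a rough guide to what five generations of full-sib mating might achieve, the sketch below applies the textbook recurrence for the inbreeding coefficient under repeated full-sib mating, F(t) = [1 + 2 F(t-1) + F(t-2)] / 4. It assumes an unrelated, non-inbred founding pair, which probably understates the inbreeding already present in a lab colony of about 50 adults, and the function name is only illustrative.

```python
def full_sib_inbreeding(generations):
    """Expected inbreeding coefficient F after each generation of full-sib mating.

    Assumes the founding pair is unrelated and non-inbred (F = 0), which is an
    idealization for a lab colony.
    """
    f_prev2, f_prev1 = 0.0, 0.0  # F two generations back and one generation back
    history = []
    for _ in range(generations):
        f_now = 0.25 * (1.0 + 2.0 * f_prev1 + f_prev2)
        history.append(f_now)
        f_prev2, f_prev1 = f_prev1, f_now
    return history


if __name__ == "__main__":
    for gen, f in enumerate(full_sib_inbreeding(5), start=1):
        # Expected heterozygosity relative to the founders is 1 - F.
        print(f"generation {gen}: F = {f:.3f}, relative heterozygosity = {1 - f:.3f}")
```

Under these assumptions F reaches about 0.67 after five generations, so roughly a third of the founders' heterozygosity would be expected to remain.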
In March 2019 Devin flash-froze and overnighted 3 males to Dovetail. We chose males in the hopes of identifying the X chromosome from differences in read depth. Dovetail isolated the DNA and prepared libraries using their 2x150bp "Chicago" method, which was used to assemble an initial genome. They also prepared 2x150bp HiC libraries, which allow the determination of physically linked chromatin. This information was used to improve the assembly to the level of presumptive chromosomes, which Dovetail calls a [HiRise](https://dovetailgenomics.com/chicago-promo/) assembly.

| | Chicago | HiC | HiRise |
|:------ |:------:|:------:|:------:|
| reads | 905,838,818 | 760,109,284 | |
| coverage | 65.4X | 54.9X | |
| scaffold number | | | 50,609 |
| assembly length | | | 2,076.49 Mb |
| scaffold N50 | | | 293.513 Mb |
| L90 | | | 7 scaffolds |
| percent gaps | | | 8.05% |
| BUSCO single-copy genes | | | 90% (odb9) |
| GC content | | | 33.98% |

The assembly length is close to the estimated genome size, and an L90 of 7 scaffolds matches the observed chromosome number (six autosomes plus the X).
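The coverage, N50 and L90 values in the table can be checked with a few lines of Python. This is a sketch, not Dovetail's pipeline: it assumes that "reads" counts individual 150 bp reads and that coverage is computed against the assembly length, and it uses the seven chromosome-scale scaffold lengths from the annotation table further down (rounded to 0.1 Mbp). The function names are illustrative.

```python
def coverage(n_reads, read_len_bp, genome_bp):
    """Nominal sequencing depth: total sequenced bases / genome length."""
    return n_reads * read_len_bp / genome_bp


def nx_and_lx(lengths, fraction, total=None):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Nx is the length of the scaffold at which the running sum of lengths,
    taken from longest to shortest, first reaches `fraction` of the total.
    Lx is how many scaffolds that takes. `total` may be given explicitly
    when only the largest scaffolds are listed.
    """
    if total is None:
        total = sum(lengths)
    target = fraction * total
    running = 0.0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("listed scaffolds do not reach the requested fraction")


ASSEMBLY_MBP = 2076.49
# Chromosome-scale scaffolds in Mbp (Chr1-Chr5, X, m), from the annotation table.
CHROMOSOMES_MBP = [559.6, 375.1, 293.5, 240.4, 193.1, 179.5, 28.9]

if __name__ == "__main__":
    print(round(coverage(905_838_818, 150, ASSEMBLY_MBP * 1e6), 1))  # Chicago
    print(round(coverage(760_109_284, 150, ASSEMBLY_MBP * 1e6), 1))  # HiC
    print(nx_and_lx(CHROMOSOMES_MBP, 0.5, total=ASSEMBLY_MBP))  # (N50, L50)
    print(nx_and_lx(CHROMOSOMES_MBP, 0.9, total=ASSEMBLY_MBP))  # (N90, L90)
```

With those inputs the sketch gives 65.4X and 54.9X coverage, an N50 of 293.5 Mbp, and an L90 of 7 scaffolds, in agreement with the table.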
## Characterization
The HiRise assembly had 50,609 scaffolds, of which only 7 were over a million base pairs. This matched the observed karyotype nicely. We mapped the raw reads to the assembly and found that while most of the presumptive chromosome-scaffolds had a read depth of about 28,000 CPM, one chromosome, the second smallest, had roughly half that number. Again, this matches the cytological observation that the X is the second smallest chromosome. So we are referring to this chromosome as the X.
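The depth comparison can be sketched as follows. The depth values here are made up to mirror the pattern described above (about 28,000 CPM on most chromosomes, roughly half on one), and the 0.5 ratio and 0.15 tolerance are assumed thresholds, not values from our analysis.

```python
from statistics import median


def flag_half_depth(depth_by_scaffold, expected_ratio=0.5, tolerance=0.15):
    """Return scaffolds whose depth is close to `expected_ratio` x the median depth.

    In males of an X0 species the X is present in one copy, so its read depth
    should be about half that of the autosomes.
    """
    typical = median(depth_by_scaffold.values())
    flagged = []
    for name, depth in depth_by_scaffold.items():
        ratio = depth / typical
        if abs(ratio - expected_ratio) <= tolerance:
            flagged.append(name)
    return flagged


# Hypothetical depths in CPM, for illustration only.
example_depths = {
    "scaffold_A": 28100,
    "scaffold_B": 27900,
    "scaffold_C": 28300,
    "scaffold_D": 27800,
    "scaffold_E": 28050,
    "scaffold_F": 14200,
    "scaffold_G": 27600,
}

if __name__ == "__main__":
    print(flag_half_depth(example_depths))
```

Using the median as the reference keeps a single half-depth scaffold from dragging down the baseline it is compared against.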

From here I looked across each scaffold at the distribution of N's, GC content, CG dinucleotides, and canonical Kozak and polyadenylation sequences.
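A scan of this kind can be sketched in plain Python. The window size, the function name, and the use of AATAAA as the polyadenylation signal are assumptions for illustration; the Kozak consensus is left out because it varies among taxa.

```python
def window_stats(seq, window=10_000):
    """Yield per-window composition statistics for one scaffold sequence.

    For each non-overlapping window this reports the fraction of N's, the GC
    content of the non-N bases, the number of CG dinucleotides, and the number
    of AATAAA motifs on the forward strand. Motifs that span a window boundary
    are not counted, and str.count() ignores overlapping matches.
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        n_count = chunk.count("N")
        called = len(chunk) - n_count
        gc = chunk.count("G") + chunk.count("C")
        yield {
            "start": start,
            "frac_n": n_count / len(chunk),
            "gc": gc / called if called else float("nan"),
            "cg_dinucleotides": chunk.count("CG"),
            "polya_signals": chunk.count("AATAAA"),
        }


if __name__ == "__main__":
    toy = "ACGTNNNNCGCGAATAAAGG"  # toy sequence, not real data
    for row in window_stats(toy, window=10):
        print(row)
```

For a real scaffold the sequence would be read from the assembly FASTA, and the per-window rows could be plotted along the chromosome.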

## Annotation
We are working with the bioinformatician Shwen Ho Yung, who is now running gene prediction using MAKER, RepeatMasker, etc. and will perform manual annotation of a list of about 300 named genes. We would also like to make the *J. haematoloma* genome available for community annotation through [i5k](https://i5k.nal.usda.gov/) if possible.
In the meantime, I have done some cursory local BLAST searches of the genome. Of 81 candidate genes, 80 appear to have clear orthologs. Here's a summary table by chromosome.

| chromosome | length (Mbp) | number of candidate genes | gene density (per Mbp) |
| :-------------- | :------------------------: | :-----------------------: | :--------------------: |
| Chr1 | 559.6 | 24 | 0.0429 |
| Chr2 | 375.1 | 12 | 0.0320 |
| Chr3 | 293.5 | 13 | 0.0443 |
| Chr4 | 240.4 | 9 | 0.0374 |
| Chr5 | 193.1 | 19 | 0.0984 |
| X | 179.5 | 12 | 0.0669 |
| m | 28.9 | 0 | 0 |
| other scaffolds | each <0.56 (211.5 overall) | 1 | 0.0047 (overall) |

Only one gene, *knirps*, did not have a strong hit to the assembled chromosomes, although it was present on two smaller scaffolds. Interestingly, of the 81 genes searched, none appears on the m-chromosome. The Hox cluster appears to be on chromosome 5, but may be in two or more pieces. I'm not confident in the orthology of the Hox genes suggested here without a closer (phylogenetic) analysis. For those concerned with specific genes, here are the particulars:

| gene | chromosome | strand | position |
|:---- |:----------:|:------:| --------:|
| *Notch* | Chr1 | plus | 35717430 |
| *Dicer2* | Chr1 | minus | 80778571 |
| *E-cadherin* | Chr1 | plus | 121991532 |
| *dpp* | Chr1 | plus | 182980184 |
| *ci* | Chr1 | minus | 185843121 |
| *Argos* | Chr1 | minus | 210583657 |
| *TOR* | Chr1 | minus | 215742584 |
| *Distal-less* | Chr1 | plus | 220687442 |
| *dachshund* | Chr1 | minus | 231333627 |
| *EGF1/Keren* | Chr1 | plus | 304228841 |
| *Argonaute3* | Chr1 | plus | 322521645 |
| *cinnabar* | Chr1 | plus | 353266342 |
| *MEK* | Chr1 | plus | 374761848 |
| *spook* | Chr1 | minus | 378516252 |
| *brinker* | Chr1 | plus | 380112381 |
| *FoxO* | Chr1 | minus | 401391260 |
| *engrailed* | Chr1 | minus | 412933498 |
| *extradenticle* | Chr1 | minus | 430033492 |
| *Delta* | Chr1 | plus | 436243066 |
| *vasa* | Chr1 | minus | 466517788 |
| *eve* | Chr1 | plus | 472501908 |
| *shade* | Chr1 | minus | 505196610 |
| *calmodulin* | Chr1 | plus | 511762136 |
| *intersex* | Chr1 | minus | 533628837 |
| *wingless* | Chr2 | plus | 49453874 |
| *abrupt* | Chr2 | minus | 97108727 |
| *DDC* | Chr2 | minus | 166565587 |
| *vitellogenin* | Chr2 | minus | 202587327 |
| *Akt* | Chr2 | minus | 249581369 |
| *broad* | Chr2 | plus | 267821464 |
| *MEK* | Chr2 | plus | 280612654 |
| *InR2* | Chr2 | plus | 289038003 |
| *EGF2/vein* | Chr2 | plus | 291729667 |
| *vitellogenin receptor* | Chr2 | minus | 306807572 |
| *Hippo* | Chr2 | minus | 360040629 |
| *Met/JHR* | Chr2 | minus | 362634728 |
| *spalt* | Chr3 | minus | 22080882 |
| *vitellogenin receptor* | Chr3 | minus | 31406055 |
| *dsx-b* | Chr3 | minus | 72256843 |
| *orthodenticle* | Chr3 | minus | 77616848 |
| *white* | Chr3 | minus | 83056959 |
| *chickadee* | Chr3 | plus | 95983800 |
| *broad* | Chr3 | plus | 104543763 |
| *fat/ds* | Chr3 | minus | 171484381 |
| *EGFR* | Chr3 | minus | 179718658 |
| *flightin* | Chr3 | plus | 184136678 |
| *fat/ds* | Chr3 | minus | 190945545 |
| *Cyclin-E* | Chr3 | minus | 229374526 |
| *bcl2* | Chr3 | minus | 269519283 |
| *virilizer* | Chr4 | minus | 34752342 |
| *Rheb* | Chr4 | minus | 84806444 |
| *Laccase-2* | Chr4 | plus | 122759751 |
| *ERK* | Chr4 | plus | 153271268 |
| *PTEN* | Chr4 | plus | 173468108 |
| *E(spl)m-beta* | Chr4 | plus | 183886072 |
| *upd* | Chr4 | plus | 188008776 |
| *yorkie* | Chr4 | plus | 201911980 |
| *hedgehog* | Chr4 | plus | 206714139 |
| *Dicer1* | Chr5 | minus | 2667134 |
| *Labial* | Chr5 | minus | 38247854 |
| *fruitless* | Chr5 | minus | 40045178 |
| *abdominal-A* | Chr5 | plus | 58525657 |
| *Ubx* | Chr5 | plus | 59151848 |
| *Abdominal-B* | Chr5 | plus | 59517732 |
| *Antennapedia* | Chr5 | plus | 60294311 |
| *zen* | Chr5 | plus | 60319800 |
| *Scr* | Chr5 | plus | 60902763 |
| *transformer-2* | Chr5 | minus | 76488163 |
| *vasa* | Chr5 | plus | 78793843 |
| *GSK3-binding protein* | Chr5 | minus | 78952521 |
| *Chico* | Chr5 | minus | 103044792 |
| *transformer-2* | Chr5 | minus | 115705628 |
| *proboscipedia* | Chr5 | minus | 115812137 |
| *zen* | Chr5 | minus | 116208403 |
| *Deformed* | Chr5 | minus | 116534587 |
| *Ras* | Chr5 | plus | 131791626 |
| *Hippo* | Chr5 | plus | 149291963 |
| *Serrate* | X | minus | 5461245 |
| *jagged* | X | minus | 5461245 |
| *homothorax* | X | plus | 36924263 |
| *E-cadherin* | X | minus | 54668882 |
| *InR1* | X | plus | 89628234 |
| *dsx-a* | X | plus | 90144987 |
| *pointed* | X | plus | 93608897 |
| *dsx-c* | X | minus | 106106684 |
| *yan* | X | plus | 125061846 |
| *fat/ds* | X | plus | 129962989 |
| *eyeless* | X | minus | 134763199 |
| *caudal* | X | plus | 146377152 |
| *knirps* | Scaffold_43571 | minus | 2588 |
| *knirps* | Scaffold_43573 | minus | 2584 |
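The per-chromosome counts and densities in the summary table can be recomputed from a hit list like the one above. This is a minimal sketch with illustrative names; it counts table rows, so a gene with hits on two scaffolds (as with *knirps*) would need to be collapsed first to match the summary.

```python
from collections import Counter

# Chromosome lengths in Mbp, from the summary table.
LENGTH_MBP = {
    "Chr1": 559.6, "Chr2": 375.1, "Chr3": 293.5, "Chr4": 240.4,
    "Chr5": 193.1, "X": 179.5, "m": 28.9,
}


def hits_per_mbp(hits, length_mbp=LENGTH_MBP):
    """Count hits per chromosome and divide by chromosome length in Mbp.

    `hits` is an iterable of (gene, chromosome) pairs. Chromosomes with no
    hits are reported with a count and density of 0; hits on scaffolds that
    are not in `length_mbp` are ignored.
    """
    counts = Counter(chrom for _gene, chrom in hits if chrom in length_mbp)
    return {
        chrom: (counts[chrom], round(counts[chrom] / length, 4))
        for chrom, length in length_mbp.items()
    }


if __name__ == "__main__":
    # A few rows from the table above, just to show the input format.
    sample = [("Notch", "Chr1"), ("wingless", "Chr2"), ("Serrate", "X")]
    for chrom, (n, density) in hits_per_mbp(sample).items():
        print(chrom, n, density)
```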
## RNAseq libraries
The genome's annotation is being aided by data from several RNAseq libraries. These were generated during our studies of wing morph differences. All libraries were Illumina 2x125bp.

| population | tissue | sex | morph | bio reps | raw reads |
|:---------- |:------ |:---:|:-----:|:--------:| ---------:|
| Tavernier, FL | whole body | f | LW | 3 | 174,076,008 |
| Tavernier, FL | whole body | f | SW | 3 | 143,188,710 |
| Tavernier, FL | whole body | m | LW | 3 | 160,704,622 |
| Tavernier, FL | whole body | m | SW | 3 | 166,984,020 |
| Aurora, CO | dorsal thorax | f | LW | 3 | 167,414,310 |
| Aurora, CO | dorsal thorax | f | SW | 3 | 111,135,632 |
| Aurora, CO | dorsal thorax | m | LW | 3 | 185,784,180 |
| Aurora, CO | dorsal thorax | m | SW | 3 | 158,732,678 |
| Tavernier, FL | dorsal thorax | f | LW | 3 | 177,552,950 |
| Tavernier, FL | dorsal thorax | f | SW | 3 | 149,760,062 |
| Tavernier, FL | dorsal thorax | m | LW | 3 | 158,470,008 |
| Tavernier, FL | dorsal thorax | m | SW | 3 | 161,301,718 |
| Aurora, CO | ovaries | f | LW | 3 | 151,962,414 |
| Aurora, CO | ovaries | f | SW | 3 | 113,260,652 |
| Tavernier, FL | ovaries | f | LW | 3 | 172,248,994 |
| Tavernier, FL | ovaries | f | SW | 3 | 146,809,538 |
| Aurora, CO | testes | m | LW | 3 | 166,917,032 |
| Aurora, CO | testes | m | SW | 3 | 153,339,676 |
| Tavernier, FL | testes | m | LW | 3 | 150,885,796 |
| Tavernier, FL | testes | m | SW | 3 | 157,930,172 |

Alongside the libraries above, we also sequenced mRNA from a few whole individuals of two related species.

| species | population | sex | morph | bio reps | raw reads |
|:------- |:---------- |:---:|:-----:|:--------:| ---------:|
| *J. sanguinolenta* | Islamorada, FL | f | LW | 2 | 73,426,516 |
| | Islamorada, FL | m | LW | 2 | 79,259,328 |
| *Boisea trivittata* | Waterville, ME | f | LW | 2 | 61,980,192 |
| | Waterville, ME | m | LW | 2 | 66,033,812 |

Right now, we also have samples in the queue for 3'-end sequencing to measure gene expression. Our plan is for the eventual experiment to include 3 biological replicates x 2 sexes x 2 morphs (or extreme food regimes) x 4 populations x 2 tissues x 2 stages (nascent adult and fifth instar).
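For planning purposes, the size of that design can be enumerated directly. The population and tissue labels below are placeholders, since the final choices are not fixed here; only the number of levels per factor comes from the plan above.

```python
from itertools import product

factors = {
    "replicate": [1, 2, 3],
    "sex": ["f", "m"],
    "morph": ["LW", "SW"],  # or two extreme food regimes
    "population": ["pop1", "pop2", "pop3", "pop4"],  # placeholder names
    "tissue": ["tissue1", "tissue2"],  # placeholder names
    "stage": ["fifth instar", "nascent adult"],
}

# One dictionary per planned library, covering every combination of levels.
libraries = [dict(zip(factors, combo)) for combo in product(*factors.values())]

if __name__ == "__main__":
    print(len(libraries))  # 3 x 2 x 2 x 4 x 2 x 2 = 192
```

A full factorial design with these levels comes to 192 libraries, and the same list could serve as the skeleton of a sample sheet.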
## Samples for genotyping
In addition to the individuals in the lab, this year we've collected about 200 individuals from south Florida, from *K. elegans*, *K. bipinnata* and *Cardiospermum* hosts. Devin and I are planning to use these samples and the genome for mapping.
## Opportunity for collaboration
*Jadera* is an excellent model for the study of adaptive evolution, and I would like the publication of this genome to showcase this system's potential and the varied interests of our small but growing *Jadera* community. Specifically, I am envisioning something modeled after the *Oncopeltus* genome paper ([Panfilio et al. 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1660-0)), in which contributors each wrote different sections that highlighted aspects of the genome relevant to their research programs.

Devin will act as the coordinator for this project, and we are happy to share sequence with anyone who is willing to contribute to the project.

Please get in touch!

Best,

Dave Angelini - <dave.angelini@colby.edu>

Devin O'Brien - <dmobrien@colby.edu>

-------