# Soapberry bug genome annotation
Dave Angelini
Started 21 September 2021

This page will document the workflow used to annotate the genome of the red-shouldered soapberry bug, *Jadera haematoloma*.
## Background on the bugs
The red-shouldered soapberry bug *Jadera haematoloma* (Hemiptera: Heteroptera: Rhopalidae) is a scentless plant bug native to the US Gulf Coast. It feeds on several native plants of the soapberry family (Sapindaceae) and, since the mid-twentieth century, has adapted to live on the introduced Chinese goldenrain tree (*Koelreuteria* ssp.). This host shift, along with the abundance of red-shouldered soapberry bugs in urban environments, has made *J. haematoloma* an excellent model for the study of rapid adaptive evolution ([Tsai 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689129/); [Panfilio & Angelini 2018](https://www.sciencedirect.com/science/article/pii/S2214574517301153)). Indeed, different researchers have examined rapid evolution in beak length (e.g. [Carrol & Loye 1987](https://academic.oup.com/aesa/article/80/3/373/10793); [Yu & Andrés 2014](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3931560/); [Cenzer 2016](https://academic.oup.com/aesa/article/80/3/373/10793); [2017](https://www.journals.uchicago.edu/doi/10.1086/693456)), the wing/reproductive polyphenism ([Carroll et al. 2003](http://soapberrybug.org/_dbase_upl/Carroll_et_al._Ann_Ent_2003.pdf); [Fawcett et al. 2018](https://www.nature.com/articles/s41467-018-04102-1)), and several other life history traits ([Carroll 1991](http://soapberrybug.org/_dbase_upl/C1991.pdf); [Carroll et al. 1998](http://soapberrybug.org/_dbase_upl/Carroll_et_al._Ev_Eco_1998.pdf)) of this animal.
My lab has been studying appendage development and wing polyphenism in the bugs. And we were interested in a draft genome sequence as a resource for developmental genetics, to contextualize genotyping and population studies, and as a point of comparison to the genomes of other insects, especially *Oncopeltus fasciatus* ([Panfilio et al. 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1660-0)).
## Genome project history
The soapberry bug karyotype was described by Lelia Porter in 1917. Males have 13 chromosomes, and the species appears to use an X0 sex determination system. So there are six pairs of autosomes, and an X that is slightly larger than the smallest autosome ([Porter 1917](https://www.journals.uchicago.edu/doi/10.2307/1536296)). Based on meiotic behavior, the smallest chromosome has been described as an "m-chromosome". It does not appear to have chiasmata during meiotic prophase and migrates to the poles early in anaphase ([Ueshima 1979](https://www.amazon.com/Hemiptera-II-Heteroptera-Animal-cytogenetics/dp/344326008X)).

> Camera lucida drawing of a spermatocyte in prophase I. From [Porter (1917)](https://www.journals.uchicago.edu/doi/10.2307/1536296), Figure 5.
In 2015, Spencer Johnston at Texas A&M used flow cytometry to estimate the genome size of *Jadera haematoloma* at about 1.95 Gbp.
In 2018, an anonymous gift was made to [Colby College](https://www.colby.edu/) to support research in genomics and bioinformatics, and we were given the green light to use these funds for a genome sequencing project. Additional funding was provided by [Maine INBRE](https://inbre.maineidea.net/) and the Colby [Department of Biology](http://www.colby.edu/bio/). We contracted [Dovetail Genomics](https://dovetailgenomics.com/) for sequencing and assembly. Dovetail offers a combination of library preparation methods, including the use of HiC proximity end-pairing, which allows for assembly to chromosome length.
To reduce heterozygosity, we chose to sequence a lab population of bugs, originally from [Plantation Key](https://www.google.com/maps/place/Coral+Rd,+Islamorada,+FL+33070/@24.9778928,-80.6211314,10z/) in Tavernier, FL. [Devin O'Brien](https://www.devinmobrien.com/), then a postdoc in my lab, crossed full siblings for 5 generations. Dovetail made the DNA isolation and prepared 10X and HiC libraries from one of these in-bred male bugs.
## Assembly
In August 2019, Dovetail returned the draft genome assembly. The total sequence length from all scaffolds was 2.08 Gbp, very close to the previous estimate. Seven large scaffolds contain 89.9% of the sequence, and likely represent the seven chromosomes seen in the bug's karyotype.

> The chromosomes of *J. haematoloma* are represented by seven scaffolds over 1 Mbp in length. Here, the length of these scaffolds is plotted on a log-scale against their sequencing depth, reflected by the number of reads mapping per million assembled base pairs (CPM). Chromosome names are given to the scaffolds based on the size and read depth, following [Porter (1917)](https://www.journals.uchicago.edu/doi/10.2307/1536296).
Metrics, such as the distribution of ambiguous nucleotides and repetitive sequences, all indicate that the genome assembly is high quality.
We are now in the annotation phase of this project. A preliminary survey using BLAST found orthologs for 80 of 81 candidate genes.
| chromosome | length (Mbp) | number of candidate genes | gene density (per Mbp) |
| :-------------- | :------------------------: | :-----------------------: | :--------------------: |
| Chr1 | 559.6 | 24 | 0.0429 |
| Chr2 | 375.1 | 12 | 0.0320 |
| Chr3 | 293.5 | 13 | 0.0443 |
| Chr4 | 240.4 | 9 | 0.0374 |
| Chr5 | 193.1 | 19 | 0.0984 |
| X | 179.5 | 12 | 0.0669 |
| m | 28.9 | 0 | 0 |
| other scaffolds | each <0.56 (211.5 overall) | 1 | 0.0047 (overall) |
We are currently using de novo gene prediction methods to further characterize the genome. Gene expression studies are also underway to characterize genes involved in nutritionally dependent plasticity in wing growth and patterning. In the future, the genome sequenbce will also enable population-level differences among bugs in the wild to be mapped and placed in the context of genes and other genomic features.
## RNAseq libraries
As a resource for annaotation, we have sequence from several Illumina 2x125bp RNAseq libraries.
| population | tissue | sex | morph | bio reps | raw reads |
|:---------- |:------ |:---:|:-----:|:--------:| ---------:|
| Taverneir, FL | whole body | f | LW | 3 | 174,076,008 |
| Taverneir, FL | whole body | f | SW | 3 | 143,188,710 |
| Taverneir, FL | whole body | m | LW | 3 | 160,704,622 |
| Taverneir, FL | whole body | m | SW | 3 | 166,984,020 |
| Aurora, CO | dorsal thorax | f | LW | 3 | 167,414,310 |
| Aurora, CO | dorsal thorax | f | SW | 3 | 111,135,632 |
| Aurora, CO | dorsal thorax | m | LW | 3 | 185,784,180 |
| Aurora, CO | dorsal thorax | m | SW | 3 | 158,732,678 |
| Taverneir, FL | dorsal thorax | f | LW | 3 | 177,552,950 |
| Taverneir, FL | dorsal thorax | f | SW | 3 | 149,760,062 |
| Taverneir, FL | dorsal thorax | m | LW | 3 | 158,470,008 |
| Taverneir, FL | dorsal thorax | m | SW | 3 | 161,301,718 |
| Aurora, CO | ovaries | f | LW | 3 | 151,962,414 |
| Aurora, CO | ovaries | f | SW | 3 | 113,260,652 |
| Taverneir, FL | ovaries | f | LW | 3 | 172,248,994 |
| Taverneir, FL | ovaries | f | SW | 3 | 146,809,538 |
| Aurora, CO | testes | m | LW | 3 | 166,917,032 |
| Aurora, CO | testes | m | SW | 3 | 153,339,676 |
| Taverneir, FL | testes | m | LW | 3 | 150,885,796 |
| Taverneir, FL | testes | m | SW | 3 | 157,930,172 |
At the same time as those libraries above, we also sequenced mRNA from a few whole individuals of two related species.
| species | population | sex | morph | bio reps | raw reads |
|:------- |:---------- |:---:|:-----:|:--------:| ---------:|
| *J. sanguinolenta* | Islamorada, FL | f | LW | 2 | 73,426,516 |
| | Islamorada, FL | m | LW | 2 | 79,259,328 |
| *Boisea trivittata* | Waterville, ME | f | LW | 2 | 61,980,192 |
| | Waterville, ME | m | LW | 2 | 66,033,812 |
We also have close to 192 samples sequenced using 3'-end tag-seq to measure gene expression: 3 biological repliates x 2 sexes x 2 morphs (or extreme food regimes) x 4 populations x 2 tissues x 2 stages (nascent adult and fifth instars).
## Annotation Workflow Plan
We are proceeding now with in-house annotation of the genome, based on resources suggested by [CBC-UConn](https://github.com/CBC-UCONN/Structural-Annotation).
```mermaid
graph TD
subgraph Repeat Predictions
A(FNA: Genome Assembly) ==>|RepeatModeler| reps(FNA: Repeat Predictions)
A ==>|BUSCO| A
reps ==>|check number & type| reps
A ==>|RepeatMasker| smA(FNA: Softmasked Genome)
reps ==>|RepeatMasker| smA
end
subgraph RNA Data
rna(FQ: RNAseq raw reads) ==>|FastQC, MultiQC| rna
rna ==>|quality trimming| rna2(FQ: filtered RNAseq reads)
rna2 ==>|FastQC, MultiQC| rna2
rna2 ==>|hisat2| samrna(SAM: aligned RNA evidence)
samrna ==>|check alignments| samrna
end
subgraph Genome Annotation
smA ==>|BRAKER2| brk(GFF: gene predictions)
samrna ==>|BRAKER2| brk(GFF: Braker gene predictions)
brk ==>|FAA: BUSCO| brk
brk ==>|gFACs| subset(GFF: Gene subsets for BUSCO)
brk ==>|Manual curation| final(GFF: Annotated Genome)
brk ==>|EnTAP| final
brk ==>|OrthoFinder| final
final ==>|FAA: BUSCO| final
end
linkStyle 1,2,5,7,9,12,13,17 stroke:darkred,color:darkred
style subset fill:salmon,color:#eeeeee
style final fill:#333333,color:#eeeeee
```
## Repeat Prediction
### Notes
- It's helpful to do repeat masking first, before predicting genes. Some long repetive elements can look like genes to predictive algorithms.
- Jill recommends [RepeatModeler2](https://doi.org/10.1073/pnas.1921046117). May take 5 days to run on human genome on a typical HPC core.
- How much repetition is enough? For the `repeat scout` algorithm, the default is a minimum of 10 repeats.
- `censensi.fa` is one of the key output files of RepeatModeler, which lists the LTRs.
- `censensi.fa.classified` makes a provisional classification of the LTR types e.g. Copia, Gypsy, etc. This is the file used as input by the next step, RepeatMasker.
- [RepeatMasker](https://www.repeatmasker.org/) literally "masks" parts of the genome sequence by making the characters lower case. This is called "soft" masking.
- [Dfam](https://www.dfam.org/) also uses public repository for repetitive element classification.
- The key output of RepeatMasker is e.g. `Athaliana_167_TAIR9.fa.masked`. The GFF `Athaliana_167_TAIR9.fa.out.gff` details what the masks are. A summary file ` Athaliana_167_TAIR9.fa.tbl` gives nice genome-wide stats on repetitive elements.
:::danger
BOOKMARK: Working from here
:::
## RNA mapping
### Notes
- Guidelines for RNA data: at least 15 million reads per library for use in gene prediction
- Dfam (or other public repeat libraries) can be added to the de novo repeat library created by RepeatModeler as input to RepeatMasker.
- `samtools stat` is a useful tool to summarize the contents of a SAM file.
- The UCSC genome browser can render a BED file.
- `hitsat2` (and other aligners) require first making an index e.g. `hisat2-build`
- For multiple libraries, modify a script like [the CBC's](https://github.com/CBC-UCONN/Structural-Annotation/blob/main/06_short_read_align/06_align.sh) to loop through each library file name.
- After aligning, hitsat2 will produce sorted BAM files, and the original SAM files can be removed.
- `samtools flagstat` provides summary stats on each file.
- More than 85% mapping is considered good.
- `sbatch 04b_busco.sh`
:::danger
BOOKMARK: Working from here
:::
## Genome Annotation
:::warning
Get through Repeat Modeling and RNA mapping before starting annotation proper.
:::
### Notes
- Best gene prediction results for a genome are produced by a mixture of *ab initio* and homology-based methods.
- The [BRAKER2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6635606/) pipeline is best if you have RNAseq data. MAKER is pretty much obsolete, unless you're annotating a genome without RNA data.
- [FINDER](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04120-9) is the new tool on the horizon: combines evidence-based and *ab initio* gene prediction.
- BRAKER2 run times vary. About 5 days for a human genome; 3 hours for *Arabidopsis* (a very small genome for plants!).
- After a BRAKER run, use the `augustus.hints.aa` file as input for BUSCO
- BUSCO output includes a list of missing and fragmented genes, by OrthoDB IDs that can be queried at https://v100.orthodb.org/
- Monoexonic genes are predicted more often than they should actually occur. Because they are actually pseudogenes, LTRs, etc. Looking separately at mono- and multiexonic genes (using e.g. gFACs) can help identified potential problems.
- [gFACS](https://gfacs.readthedocs.io/en/latest/) is a useful tools for filtering e.g. multiexonic genes, genes with particular length.
- It's still hard to completely escape manual annotation of e.g. intron/exon boundaries for genes of interest!
- BUSCO output genes can be taken as training input for ab initio gene prediction.
- [PASA](https://github.com/PASApipeline/PASApipeline/blob/master/docs/index.asciidoc) is chooser/combiner pipeline that combines multiple forms of evidence for gene prediction.
- [Biopython](https://biopython.org/) is useful for dealing with GenBank
- [BioMart](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) can help with functional annotation of predicated genes.
- Use protein domain predictions as a cross-validation on suspect predictions e.g. monoexonic predictions. If there's no domains, that's less confidence it's real.
- [IGV](https://software.broadinstitute.org/software/igv/) is a local viewer for genome feature files, like BED or GFF
- [EnTAP](https://entap.readthedocs.io/en/v0.10.8-beta/Getting_Started/introduction.html) combines annotations from protein domain assignment, orthologous gene family assessment, Gene Ontology term assignment, and KEGG pathway annotation. It weighs evident from different sources.
- EnTAP can use lots of cores and memory! (e.g. 3 hours for the *Arabidopsis* test case)
- Include e.g. RefSeq proteins, UniProt protein databases. -- While you can restrict the database scope for faster runs, it's useful to include everything to find potential contaminants.
- You can list the target organism in EnTAP and it will leverage NCBI taxonomy.
- **EnTAP looks awesome!** Great for applying GO etc. annotations!
- The EnTAP log file has useful info on e.g. species from which annotations were drawn, % sequences without alignment or annotations
- In the absence of RNAseq reads, MAKER is a good option. Follow-up with PASA to remove false positives.
- MAKER can be used cyclically. Run 3-4 times to refine predictions from ab initio training.
- Apollo local editor allows manual edits to e.g. MAKER output.
- For incorporating long RNA reads, use an aligner designed for long, noisy reads, e.g. Minimap2
- Long reads may lead to annotation of more "duplicate" genes, due to the higher error rate.
- [`gffcompare`](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml) allows comp of different annotation strategies.
- [OrthoFinder](https://github.com/davidemms/OrthoFinder) is useful for comparative genomics
- For species comparisons, the `fasta` folder would contain amino acid fasta files for all species you'd like to compare.
- OrthoFinder results: If one species has way more of a particular orthogroup, it may be a repeat, not a gene.
---