---
tags: [scverse, design-doc]
---
# Genomic annotation design doc
## Abstract
The goal of this project will be to replicate features from biocoductors genomic annotation packages in python. To do this, we will access their precomputed sqlite tables.
## Related projects/ links
### Bioconductor
* [ensembldb](https://bioconductor.org/packages/release/bioc/html/ensembldb.html)
* [GenomicFeatures](https://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html)
### Python packages we can use for this
* [ibis](https://ibis-project.org) - a nice interface to SQL in python
* [pooch](https://www.fatiando.org/pooch/latest/) – Handles downloads and caching of datasets
## Goals/ features
* Access to bioconductor annotation resources
* Ensembl
* UCSC
* Mapping IDs (e.g. ensembl -> hgnc, ensembl -> uniprot)
* Genomic range features
* Get exons for a gene/ trascript
* Get transcription start sites
### Eventual goals
* Metadata handling
* Building custom references
* Access to seqence from annotation
### Requested features
* Gene id -> position (something that can be used with bioframe)
* Discoverability of features (e.g. make sure users know what they can retrieve/ filter by)
* Basic gene id mapping
* https://github.com/vanheeringen-lab/genomepy
* Integration with resources like this, e.g. ability to grab sequences
* Bringing in data from other sources?
* Create a custom database, bring in links to other databases? E.g. bring in uniprot -> ensembl id
* TSS start site - see `promoter` function in GenomicAnnotations, to specify region of a given size upstream or downstream of TSS
* Support for gene models/annotations for different genome builds? Useful to work with legacy datasets mapped to hg19
## Example code (comparison with `ensembldb`)
Here are a few code examples showing how features from `bioc` pacakges can be replicated in python. Under the section **this package** is the proposed API for this tool.
### Access to precomputed annotation tables
#### R code
```r
library(EnsDb.Hsapiens.v86)
ensdb = EnsDb.Hsapiens.v86
```
#### Python code
```python
import pooch, ibis
url_template = "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v{version}/EnsDb.{species}.v{version}.sqlite"
local_path = pooch.retrieve(url_template.format(version="86", species="Hsapiens"))
ensdb = ibis.sqlite.connect(local_path)
```
#### This package
```python
import genomic_annotations as ga
ensdb = ga.ensembl.EnsemblDB(species="hsapiens", version="86")
```
### Retrieving annotations
#### R code
```r
genes(ensdb, filter=GeneBiotypeFilter("protein_coding"))
```
#### Python code
```python
from ibis import _
ensdb.table("genes").filter(_.gene_biotype == "protein_coding").execute()
```
### This package
First draft could be:
```python
from ibis import _
ensdb.genes(filter=_.gene_biotype == "protein_coding")
```
And we could eventually include pre-existing filters.
### ID mapping
```r
genes(ensdb, columns=c("gene_name", "uniprot_id"), ...)
```
```
seqnames ranges strand | gene_name uniprot_id
<Rle> <IRanges> <Rle> | <character> <character>
ENSG00000186092 1 69091-70008 + | OR4F5 Q8NH21
... ... ... ... . ... ...
ENSG00000185894 Y 25030901-25062548 - | BPY2C R4GMN0
```
#### Retrieving exons by gene
```r
exonsBy(ensdb, by="gene", filters=...)
```
```python
import awkward as ak
genes = ensdb.table("gene")
tx = ensdb.table("tx")
tx2exon = ensdb.table("tx2exon")
exons_by_gene_query = (
gene
.inner_join(tx2exon, predicates=gene["canonical_transcript"] == tx2exon["tx_id"])
.inner_join(exon, predicates=tx2exon["exon_id"] == exon["exon_id"])
.select("gene_id", "gene_name", "seq_name", "exon_seq_start", "exon_seq_end")
)
awk = ak.from_arrow(
exons_by_gene_query
.to_pyarrow()
.group_by(["gene_id", "gene_name", "seq_name"])
.aggregate([
("exon_seq_start", "list"),
("exon_seq_end", "list")
])
)
```
### Emma feedback
* Download times are fine
* Experience with pyensembl
* One function for each thing, bad not flexible
* Skip the "by" operations for now
* UCSC
* Mid level
* Conversion via https://github.com/Bioconductor/GenomeInfoDb/tree/devel/R
### Functions to define for now
* genes
* transcripts
* exons
* promoters
* chromosomes
* what ever the genomeinfo db does
* annotate_anndata function
* Get sequence from range