--- tags: [scverse, design-doc] --- # Genomic annotation design doc ## Abstract The goal of this project will be to replicate features from biocoductors genomic annotation packages in python. To do this, we will access their precomputed sqlite tables. ## Related projects/ links ### Bioconductor * [ensembldb](https://bioconductor.org/packages/release/bioc/html/ensembldb.html) * [GenomicFeatures](https://www.bioconductor.org/packages/release/bioc/html/GenomicFeatures.html) ### Python packages we can use for this * [ibis](https://ibis-project.org) - a nice interface to SQL in python * [pooch](https://www.fatiando.org/pooch/latest/) – Handles downloads and caching of datasets ## Goals/ features * Access to bioconductor annotation resources * Ensembl * UCSC * Mapping IDs (e.g. ensembl -> hgnc, ensembl -> uniprot) * Genomic range features * Get exons for a gene/ trascript * Get transcription start sites ### Eventual goals * Metadata handling * Building custom references * Access to seqence from annotation ### Requested features * Gene id -> position (something that can be used with bioframe) * Discoverability of features (e.g. make sure users know what they can retrieve/ filter by) * Basic gene id mapping * https://github.com/vanheeringen-lab/genomepy * Integration with resources like this, e.g. ability to grab sequences * Bringing in data from other sources? * Create a custom database, bring in links to other databases? E.g. bring in uniprot -> ensembl id * TSS start site - see `promoter` function in GenomicAnnotations, to specify region of a given size upstream or downstream of TSS * Support for gene models/annotations for different genome builds? Useful to work with legacy datasets mapped to hg19 ## Example code (comparison with `ensembldb`) Here are a few code examples showing how features from `bioc` pacakges can be replicated in python. Under the section **this package** is the proposed API for this tool. ### Access to precomputed annotation tables #### R code ```r library(EnsDb.Hsapiens.v86) ensdb = EnsDb.Hsapiens.v86 ``` #### Python code ```python import pooch, ibis url_template = "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v{version}/EnsDb.{species}.v{version}.sqlite" local_path = pooch.retrieve(url_template.format(version="86", species="Hsapiens")) ensdb = ibis.sqlite.connect(local_path) ``` #### This package ```python import genomic_annotations as ga ensdb = ga.ensembl.EnsemblDB(species="hsapiens", version="86") ``` ### Retrieving annotations #### R code ```r genes(ensdb, filter=GeneBiotypeFilter("protein_coding")) ``` #### Python code ```python from ibis import _ ensdb.table("genes").filter(_.gene_biotype == "protein_coding").execute() ``` ### This package First draft could be: ```python from ibis import _ ensdb.genes(filter=_.gene_biotype == "protein_coding") ``` And we could eventually include pre-existing filters. ### ID mapping ```r genes(ensdb, columns=c("gene_name", "uniprot_id"), ...) ``` ``` seqnames ranges strand | gene_name uniprot_id <Rle> <IRanges> <Rle> | <character> <character> ENSG00000186092 1 69091-70008 + | OR4F5 Q8NH21 ... ... ... ... . ... ... ENSG00000185894 Y 25030901-25062548 - | BPY2C R4GMN0 ``` #### Retrieving exons by gene ```r exonsBy(ensdb, by="gene", filters=...) ``` ```python import awkward as ak genes = ensdb.table("gene") tx = ensdb.table("tx") tx2exon = ensdb.table("tx2exon") exons_by_gene_query = ( gene .inner_join(tx2exon, predicates=gene["canonical_transcript"] == tx2exon["tx_id"]) .inner_join(exon, predicates=tx2exon["exon_id"] == exon["exon_id"]) .select("gene_id", "gene_name", "seq_name", "exon_seq_start", "exon_seq_end") ) awk = ak.from_arrow( exons_by_gene_query .to_pyarrow() .group_by(["gene_id", "gene_name", "seq_name"]) .aggregate([ ("exon_seq_start", "list"), ("exon_seq_end", "list") ]) ) ``` ### Emma feedback * Download times are fine * Experience with pyensembl * One function for each thing, bad not flexible * Skip the "by" operations for now * UCSC * Mid level * Conversion via https://github.com/Bioconductor/GenomeInfoDb/tree/devel/R ### Functions to define for now * genes * transcripts * exons * promoters * chromosomes * what ever the genomeinfo db does * annotate_anndata function * Get sequence from range