# Participants
- Nezar **nezar.abdennur@umassmed.edu**
- Garrett **garrett.ng@umassmed.edu**
- Thomas **thomas.reimonn@umassmed.edu**
- David david.adeleke@ndsu.edu
- Lei lei.ma@usda.gov
- Mehmet mkuscuog@jcvi.org
- Clark cc8dm@virginia.edu
- Se-Ran sjun@uams.edu
# Team
Who are we?
* Abdennur Lab: Our lab studies 3D genome organization, the epigenome, and their roles in cell fate determination. We also work on foundational software for genomic data science.
* Nezar - PI, Computational Biologist
* Garrett - Software engineer in the Abdennur Lab with a background in clinical medicine, cloud, and data ops
* Tom - MD/PhD student in the Abdennur Lab, Computational Biologist
* Clark - Software Engineer at the UVA Biocomplexity Institute with a background in bioinformatics and statistics
* Mehmet - Software Engineer at J. Craig Venter Institute
* Lei - Postdoctoral researcher at USDA ARS
* David - PhD student at North Dakota State University
# Project Scope
### Making variant call data more accessible and scalable with Oxbow
We recently launched an open-source project called [Oxbow](https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos), which aims to provide a unified interface to high-volume genomic file formats as tabular data using Apache Arrow. This makes genomic data efficiently accessible to dataframe-based analytics libraries such as Pandas, Polars, and Dask, and to data analysis environments such as Jupyter.
We plan to support teams that would like to use Oxbow for their projects, and to brainstorm and implement convenient APIs and/or dataframe schemas for VCF/BCF data.
The goals of our project are largely technical but complementary to the scientific goals of the codeathon. This project is a good fit for the codeathon setting because oxbow seeks to address both the accessibility and scalability problems inherent in working with large NGS file formats. It will increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of existing VCF and BCF data and help users perform direct data manipulation tasks that no pre-existing specialized tool currently supports.
_**Note:** We are not heavy VCF users ourselves: we come from the fields of epigenomics and 3D genomics and are very heavy users of other NGS formats. This codeathon will help us better understand the needs of computational biologists in population genomics and, we hope, attract more engagement with the oxbow project._
# Team Resources
The core of oxbow is written in **Rust** and uses the Rust-based **noodles** project from St. Jude to interface directly with the file formats (instead of the classic `htslib` library which is written in C). Our oxbow "mono-repo" consists of a core Rust "crate" and binding packages for **Python** and **R**.
We aim to focus largely on integrations with the Python data science ecosystem, using Rust as a systems-level language for high-performance, memory-safe low-level data access and conversion. If you are interested in diving into the Rust internals, see the Rust resources below.
Team [Miro Board invitation](https://miro.com/welcomeonboard/MktzTmI0Y0pGcHpmSUxiSkRNdDBwYlhhc2F5RmFVN1VXRVR3NUZpcDZKeWswWEY3MG9xakcyOFFYZXpvcnV1SnwzNDU4NzY0NTQyNzQ4MTc0NzA0fDI=?share_link_id=818711514161)
NCBI
- [Codeathon Slack](https://vcffilesforpo-oep6172.slack.com)
- [Our Team Github](https://github.com/NCBI-Codeathons/vcf-4-population-genomics-team-abdennur)
Oxbow
- 🌊 Introductory [Oxbow blog post](https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos)
- 💬 Oxbow [Zulip](https://oxbow.zulipchat.com/)
- ⚡ Lightning talk slides on [Oxbow and Dask-NGS](https://docs.google.com/presentation/d/14F4uUvLcvHYqkik16GsCoST3atsmlwUeQ93giOIRE44/edit?usp=sharing)
- 🎥 [Video](https://drive.google.com/file/d/1NtI7OShT4BK5oyoXQIIzQTineIUkS7hd/view?usp=sharing) of Dask-NGS in action on BAM files
- Oxbow on [GitHub](https://github.com/abdenlab/oxbow) [🦀/🐍/R]
- Dask-NGS on [GitHub](https://github.com/abdenlab/dask-ngs) [🐍]
- Noodles on [GitHub](https://github.com/zaeleus/noodles) [🦀]
Rust
- The [Rust Book](https://doc.rust-lang.org/book/)
- The [PyO3 Maturin](https://pyo3.rs/v0.19.1/) project for seamless Python bindings to Rust crates
Scientific Python ecosystem
- [sgkit](https://pystatgen.github.io/sgkit/latest/index.htm) - statistical genetics toolkit in Python based on Zarr, Dask and Xarray:
- [GitHub](https://github.com/pystatgen/sgkit)
- [VCF conversion](https://pystatgen.github.io/sgkit/latest/vcf.html)
- The [Zarr format/protocol](https://zarr.readthedocs.io/en/stable/) for multidimensional arrays
- The [kerchunk](https://github.com/fsspec/kerchunk) library to map archival data to Zarr without converting formats
- The [Dask](https://www.dask.org/) library for distributed computing on dataframes and arrays
- The [Xarray](https://docs.xarray.dev/en/stable/) library for working with labeled arrays (popular in climate and geosciences fields)
- The [Pandas](http://pandas.pydata.org/pandas-docs/stable/) dataframe library for Python
- The [Polars](https://www.pola.rs/) dataframe library (Rust, Python, and more)
- [pandas-genomics](https://pandas-genomics.readthedocs.io/) provides some extension types for variants and genotypes
Python style
- The [NumPy Style guide](https://numpydoc.readthedocs.io/en/latest/format.html) is commonly used for scientific python packages.
The [hts-specs](http://samtools.github.io/hts-specs/) file format specifications
* VCF
* http://samtools.github.io/hts-specs/VCFv4.4.pdf
* BCF
* http://samtools.github.io/hts-specs/BCFv1_qref.pdf
* http://samtools.github.io/hts-specs/BCFv2_qref.pdf
* BGZF (bgzip)
* All compressed and indexed htslib formats are built on top of BGZF (blocked GZIP format), a clever specialization of the GZIP compression scheme that breaks a gzip stream into independently compressed blocks (each at most 64 KiB) whose boundaries can be recorded in an index. It is documented within the [SAM spec](http://samtools.github.io/hts-specs/SAMv1.pdf).
* http://www.htslib.org/doc/bgzip.html (BGZF command line tool)
* Index files
* The original `.bai` index for BAM files is detailed within the [SAM spec](http://samtools.github.io/hts-specs/SAMv1.pdf) and is worth knowing for historical reasons
* VCF files are usually indexed with [Tabix](http://samtools.github.io/hts-specs/tabix.pdf), producing `.tbi` companion files
* More recently, `.bai` and `.tbi` were generalized to [CSI](http://samtools.github.io/hts-specs/CSIv1.pdf), which supports larger reference sequences
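As a minimal illustration of the BGZF block layout, the sketch below parses the `BSIZE` field from a block header using only the Python standard library. The 28-byte input is the standardized empty EOF block from the SAM spec; the helper function name is our own.

```python
import struct

# The standard 28-byte BGZF EOF marker defined in the SAM spec: a complete,
# empty BGZF block. Every valid BGZF block starts with the same header layout.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def bgzf_block_size(block: bytes) -> int:
    """Return the total size in bytes of the BGZF block starting at block[0]."""
    # Standard gzip magic bytes, and the FEXTRA flag must be set.
    assert block[0:2] == b"\x1f\x8b" and block[3] & 0x04
    xlen = struct.unpack_from("<H", block, 10)[0]
    # Scan the gzip extra subfields for the 'BC' subfield holding BSIZE.
    pos, end = 12, 12 + xlen
    while pos < end:
        si1, si2 = block[pos], block[pos + 1]
        slen = struct.unpack_from("<H", block, pos + 2)[0]
        if si1 == ord("B") and si2 == ord("C"):
            bsize = struct.unpack_from("<H", block, pos + 4)[0]
            return bsize + 1  # the spec stores (total block size - 1)
        pos += 4 + slen
    raise ValueError("not a BGZF block: no BC subfield")

print(bgzf_block_size(BGZF_EOF))  # 28
```

Virtual file offsets in `.bai`/`.tbi`/`.csi` indexes are built on exactly these block boundaries: the upper bits address a block start, the lower bits an offset within the uncompressed block.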
Traditional tools for working with VCF
- pysam and samtools
- cyvcf2 (used by sgkit to convert VCF to Zarr)
- plink
- bgen
Re: Parquet and Avro (not Arrow)

* https://adam.readthedocs.io/en/latest/architecture/schemas/
* Avro Schemas: https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl
## Getting started
Any environment that works for you is fine. We recommend working in VSCode, which provides support for Rust, Python, R and Jupyter development.
1. Install [Rust](https://www.rust-lang.org/tools/install)
2. Install [pipx](https://github.com/pypa/pipx)
```bash
$ brew install pipx # on a mac
$ pipx ensurepath
```
3. Install [PyO3/Maturin](https://www.maturin.rs/installation.html)
```bash
$ pipx install maturin
```
4. Install [hatch](https://hatch.pypa.io/latest/)
```bash
$ pipx install hatch
```
If you like to keep your virtual environments inside your project directory:
```bash
hatch config set dirs.env.virtual .venv
```
5. Clone the repos and set up virtualenvs
```bash
$ mkdir oxbow_project
$ cd oxbow_project
$ git clone https://github.com/abdenlab/oxbow.git
$ git clone https://github.com/abdenlab/dask-ngs.git
```
For oxbow
```bash
$ cd oxbow/py-oxbow
# The following will create a virtual environment the
# first time you run it, then activate it.
# Alternatively to create: python -m venv .venv; pip install -e '.[dev]'
# Alternatively to activate: source .venv/bin/activate
$ hatch shell
# Download fixture files
$ cd ../fixtures
$ wget -i list.txt
```
For dask-ngs
```bash
$ cd dask-ngs
$ hatch shell
```
6. To (re)build oxbow and py-oxbow bindings (from `oxbow/py-oxbow` directory):
```bash
$ maturin develop --release
```
# Initial Ideas
Need: we are looking for burning analysis wishes and challenges, from our team or others, to drive these projects!
* Documentation (oxbow, pyoxbow, dask-ngs)
* Sphinx for py-oxbow and dask-ngs (dask-ngs has API docs now)
* Rustdoc for oxbow
* New features
* Implement nested and semi-structured fields (e.g. BAM tags) in oxbow; Arrow supports nested types.
* Explore the most efficient data types (e.g., pyarrow strings instead of numpy-backed strings)
* Improve virtual offset mapping (advanced)
* use either linear or bin indexes
* Remote query (advanced)
* Wrap pyo3 file-like into `noodles` Readers
* Test with regular Python file handles, smart_open files, and fsspec
* Credentialed access: test querying protected AWS or GCP files
* API design
* How to deal with headers
* Conveniences:
* Bit-flag expansion/interpretation
* Read ends
* Preferred field names
* Which fields/columns are most useful: support reading only a subset of fields.
* Operations and performance
* Which types of operations are most critical?
* Do they better fit the dataframe or the array paradigm?
* VCF vs BCF performance
* Try applying distributed computation graphs with Dask
* Compare oxbow dask dataframes with sgkit Zarr stores and conventional tools
* Ecosystem integrations
* Oxbow and sgkit
* Can we get VCF/BCF to work with the Ibis SQL engine?
* Converters and tools
* Re-implement some basic `vcftools` utilities in a few lines of pandas/polars/dask
* Re-implement some bloated popgen workflow in a few lines of code
* Implement a brand new QC or summarization tool
* VCF converter: parquet, Zarr, cloud store, ML pipeline
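As a taste of the "few lines of pandas" idea above, here is a hedged sketch of a `vcftools --freq`-style allele-frequency calculation on a toy, already-flattened genotype frame. The column naming convention is our own assumption for illustration, not an oxbow schema.

```python
import pandas as pd

# Toy flattened VCF-like frame: one row per site, one GT string per sample,
# as it might come out of a dataframe reader (column names are illustrative).
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "pos": [100, 200, 300],
    "s1_GT": ["0/0", "0/1", "1/1"],
    "s2_GT": ["0/1", "1/1", "0/0"],
})

gt_cols = [c for c in df.columns if c.endswith("_GT")]

def alt_allele_freq(row):
    """Fraction of ALT alleles across all called genotypes at a site."""
    alleles = [a for c in gt_cols
               for a in row[c].replace("|", "/").split("/")]
    called = [a for a in alleles if a != "."]  # drop missing calls
    return sum(a != "0" for a in called) / len(called)

df["alt_af"] = df.apply(alt_allele_freq, axis=1)
print(df[["chrom", "pos", "alt_af"]])
```

The same expression scales to a chunked Dask dataframe essentially unchanged, which is the point of the dataframe paradigm.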
# Daily Log
## Day 1
Assigning roles:
* Team lead = Nezar
* Tech lead = Garrett
* Writer = David
* Flex
Notes
* Querying across VCFs is a big deal!
* Gathering information about VCF use cases:
* Popularity of sgkit among genomics statisticians.
* How are VCF files generally visualized?
Visualization
* Dynamic Genome Browser
* UCSC
* IGV
* JBrowse
* HiGlass
* NCBI CGV browser
* Static Plot
* Circos
Common pipelines
* BCFtools, freeBayes
* [GATK](https://gatk.broadinstitute.org/hc/en-us)
* Sarek (nextflow pipeline)
### Project focus
* Examine typical scalable analysis approaches, visualization of VCF
* Can we accomplish what is desired with Oxbow
* https://glow.readthedocs.io/en/latest/etl/vcf2delta.html
* Data representation
* Examine existing schemas
* Can we merge VCFs in pandas/dask/polars?
* Flatten VCF data into columnar data representation
* Data subsetting
* Separate out INFO column from the rest of the columns
* Data harmonization
* How to subset your VCF project by metadata outside the file and in the file (harmonize)
* Do we need to merge everything into one database?
* Or can we federate heterogeneous VCF files
* Remote queries with file-like objects
* HPC vs Cloud
* BigQuery not usable on HPC
* Existing resources
* Dataframe-based (Apache Spark)
* [glow](https://glow.readthedocs.io/en/latest/etl/vcf2delta.html)
* [hail](https://hail.is/docs/0.2/overview/index.html)
* Array-based
* [tiledb](https://docs.tiledb.com/main/integrations-and-extensions/genomics/population-genomics/data-model)
* [sgkit](https://pystatgen.github.io/sgkit/latest/)
* Other (novel, sparser formats)
* [tachyon](https://github.com/mklarqvist/tachyon)
* SVCR, Savvy, etc.
* Make your own schema
* Access to sample VCF files
**Goal: _Facilitate data harmonization and analysis by dynamically mapping VCF data to a flat schema, where selected information nested within the desired columns is parsed and transformed into new columns at query time._**
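A minimal pandas sketch of the flattening step in this goal, assuming the raw semicolon-delimited INFO string is already available as a column (toy data and names, not a final schema):

```python
import pandas as pd

# Toy VCF-like frame with the raw INFO column as a dataframe reader
# might surface it (column names are illustrative).
df = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "pos": [100, 200],
    "info": ["DP=10;AF=0.5;DB", "DP=7;AF=0.25"],
})

def parse_info(s):
    """Split 'KEY=VAL;...;FLAG' into a dict; bare flags become True."""
    out = {}
    for field in s.split(";"):
        key, _, val = field.partition("=")
        out[key] = val if val else True
    return out

# Expand each INFO dict into its own columns alongside the fixed fields.
flat = pd.concat(
    [df.drop(columns="info"),
     pd.DataFrame([parse_info(s) for s in df["info"]])],
    axis=1,
)
print(flat)
```

Keys absent at a site simply become NaN, so heterogeneous VCFs can be flattened to a common schema without pre-declaring columns. Typing the resulting columns (per the header's `##INFO` declarations) is the harder follow-on step.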
### Action items
- everyone: get set up with oxbow (rust and python); contact @Garrett for help
- acquire test VCF data
- create diagram explaining project
- daily jupyter ipynb to explain work for the day
- Garrett to update daily log for repo
- project name for repo
- everyone can update the repo with their affiliation (good first commit)
### Plan for tomorrow
- help people get set up
- walk through oxbow codebase
- walk through noodles codebase
- check in with Team Park
## Day 2
### Design doc for rust (room 1)
[Design Doc](https://hackmd.io/YX6XotGKT0evC0Tj5eXLgg)
### Researching schemas (room 2)
[User Stories](https://hackmd.io/YX6XotGKT0evC0Tj5eXLgg?both#User-Stories)
### What we did
- set up local environments for development
- created [Design Doc](https://hackmd.io/YX6XotGKT0evC0Tj5eXLgg)
- software design
- gathered user stories
### Plan for tomorrow
- implementing dataframe solution
- check in with other teams on vcf usage
- investigate common vcf schemas
### Action items
- upload Day 2 log on github repo
## Day 3
### What we did
- Jupyter notebook for schema flattening
- flattened the INFO field to the top level
- field parsing for fields with multiple values
- cast VCF data into NumPy types
- allow exploding arbitrary columns vertically
- example usage
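The "explode arbitrary columns vertically" step can be sketched with pandas' own `explode` (multi-column explode needs pandas >= 1.3); the toy frame below stands in for the notebook's flattened output:

```python
import pandas as pd

# Toy flattened frame where multi-valued fields (e.g. one allele frequency
# per alternate allele) were parsed into Python lists per row.
df = pd.DataFrame({
    "pos": [100, 200],
    "alt": [["A", "T"], ["G"]],
    "af": [[0.1, 0.2], [0.3]],
})

# Explode the list columns together, turning each multi-allelic site
# into one row per alternate allele.
long = df.explode(["alt", "af"], ignore_index=True)
print(long)
```

This mirrors the biallelic "long" representation that many downstream popgen tools expect.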
### Plan for tomorrow
- handle genotype field
- demo with multiple VCFs
- prep presentation
### Action items
- upload Day 3 log to github repo
- upload jupyter notebook to github repo
- upload test vcf and index files to github repo
## Day 4
### What we did
- Jupyter notebook for flattening sample field
- Integrated flattening info and sample field into one function
- Created demo use-cases
### Plan for tomorrow
- present