Nezar Abdennur
# Participants

- Nezar **nezar.abdennur@umassmed.edu**
- Garrett **garrett.ng@umassmed.edu**
- Thomas **thomas.reimonn@umassmed.edu**
- David david.adeleke@ndsu.edu
- Lei lei.ma@usda.gov
- Mehmet mkuscuog@jcvi.org
- Clark cc8dm@virginia.edu
- Se-Ran sjun@uams.edu

# Team

Who are we?

* Abdennur Lab: Our lab studies 3D genome organization, the epigenome, and their roles in cell fate determination. We also work on foundational software for genomic data science.
* Nezar - PI, Computational Biologist
* Garrett - Software engineer in the Abdennur Lab with a background in clinical medicine, cloud, and data ops
* Tom - MD/PhD student in the Abdennur Lab, Computational Biologist
* Clark - Software Engineer at the UVA Biocomplexity Institute with a background in bioinformatics and statistics
* Mehmet - Software Engineer at J. Craig Venter Institute
* Lei - Postdoctoral researcher at USDA ARS
* David - PhD student at North Dakota State University

# Project Scope

### Making variant call data more accessible and scalable with Oxbow

We recently launched an open-source project called [Oxbow](https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos), which aims to provide a unified interface to high-volume genomic file formats as tabular data using Apache Arrow. This makes genomic data efficiently accessible to dataframe-based analytics libraries such as Pandas, Polars, and Dask, and to data analysis environments such as Jupyter.

Our goal is to support teams who would like to use Oxbow for their projects, and to brainstorm and implement convenient APIs and/or dataframe schemas for VCF/BCF data.

The goals of our project are largely technical but complementary to the scientific goals of the codeathon.

This project is a good fit for the codeathon setting because Oxbow seeks to address both the accessibility and the scalability problems inherent in working with large NGS file formats.
This project will increase the FAIRness (Findability, Accessibility, Interoperability, and Reuse) of existing VCF and BCF data and help users perform more direct data manipulation tasks that no pre-existing specialized tool currently provides.

_**Note:** We are not heavy VCF users ourselves: we come from the fields of epigenomics and 3D genomics and are very heavy users of other NGS formats. This codeathon will help us better understand the needs of computational biologists in population genomics and hopefully solicit more engagement in the oxbow project._

# Team Resources

The core of oxbow is written in **Rust** and uses the Rust-based **noodles** project from St. Jude to interface directly with the file formats (instead of the classic `htslib` library, which is written in C).

Our oxbow "mono-repo" consists of a core Rust "crate" and binding packages for **Python** and **R**.

We aim to focus largely on integrations with the Python data science ecosystem, using Rust as a systems-level language for high-performance, memory-safe, low-level data access and conversion. If you are interested in diving into the Rust-based internals, see the Rust resources.

Team

[Miro Board invitation](https://miro.com/welcomeonboard/MktzTmI0Y0pGcHpmSUxiSkRNdDBwYlhhc2F5RmFVN1VXRVR3NUZpcDZKeWswWEY3MG9xakcyOFFYZXpvcnV1SnwzNDU4NzY0NTQyNzQ4MTc0NzA0fDI=?share_link_id=818711514161)

NCBI

- [Codeathon Slack](vcffilesforpo-oep6172.slack.com)
- [Our Team GitHub](https://github.com/NCBI-Codeathons/vcf-4-population-genomics-team-abdennur)

Oxbow

- 🌊 Introductory [Oxbow blog post](https://lifeinbytes.substack.com/p/breaking-out-of-bioinformatic-data-silos)
- 💬 Oxbow [Zulip](https://oxbow.zulipchat.com/)
- ⚡ Lightning talk slides on [Oxbow and Dask-NGS](https://docs.google.com/presentation/d/14F4uUvLcvHYqkik16GsCoST3atsmlwUeQ93giOIRE44/edit?usp=sharing)
- 🎥 [Video](https://drive.google.com/file/d/1NtI7OShT4BK5oyoXQIIzQTineIUkS7hd/view?usp=sharing) of Dask-NGS in action on BAM files
- Oxbow on [GitHub](https://github.com/abdenlab/oxbow) [🦀/🐍/R]
- Dask-NGS on [GitHub](https://github.com/abdenlab/dask-ngs) [🐍]
- Noodles on [GitHub](https://github.com/zaeleus/noodles) [🦀]

Rust

- The [Rust Book](https://doc.rust-lang.org/book/)
- The [PyO3 Maturin](https://pyo3.rs/v0.19.1/) project for seamless Python bindings to Rust crates

Scientific Python ecosystem

- [sgkit](https://pystatgen.github.io/sgkit/latest/index.htm) - statistical genetics toolkit in Python based on Zarr, Dask and Xarray:
    - [GitHub](https://github.com/pystatgen/sgkit)
    - [VCF conversion](https://pystatgen.github.io/sgkit/latest/vcf.html)
- The [Zarr format/protocol](https://zarr.readthedocs.io/en/stable/) for multidimensional arrays
- The [kerchunk](https://github.com/fsspec/kerchunk) library to map archival data to Zarr without converting formats
- The [Dask](https://www.dask.org/) library for distributed computing on dataframes and arrays
- The [Xarray](https://docs.xarray.dev/en/stable/) library for working with labeled arrays (popular in climate and geosciences fields)
- The [Pandas](http://pandas.pydata.org/pandas-docs/stable/) dataframe library for Python
- The [Polars](https://www.pola.rs/) dataframe library (Rust, Python, and more)
- [pandas-genomics](https://pandas-genomics.readthedocs.io/) provides some extension types for variants and genotypes

Python style

- The [NumPy Style guide](https://numpydoc.readthedocs.io/en/latest/format.html) is commonly used for scientific Python packages.

The [HTS-lib Specifications](http://samtools.github.io/hts-specs/)

* VCF
    * http://samtools.github.io/hts-specs/VCFv4.4.pdf
* BCF
    * http://samtools.github.io/hts-specs/BCFv1_qref.pdf
    * http://samtools.github.io/hts-specs/BCFv2_qref.pdf
* BGZF (bgzip)
    * All compressed and indexed HTS lib formats are built on top of BGZF (blocked gzip format), a clever specialization of the GZIP compression scheme which breaks a gzip stream into independently compressed blocks of at most 64 KiB that can be indexed. It is documented within the [SAM spec](http://samtools.github.io/hts-specs/SAMv1.pdf).
    * http://www.htslib.org/doc/bgzip.html (BGZF command line tool)
* Index files
    * The original `.bai` index for BAM files is detailed within the [SAM spec](http://samtools.github.io/hts-specs/SAMv1.pdf) and is useful for historical reasons
    * VCF files are usually indexed with [Tabix](http://samtools.github.io/hts-specs/tabix.pdf), producing `.tbi` companion files
    * More recently, `.bai` and `.tbi` were generalized to [CSI](http://samtools.github.io/hts-specs/CSIv1.pdf), allowing for larger reference genomes

Traditional tools for working with VCF

- pysam and samtools
- cyvcf2 (used by sgkit to convert VCF to Zarr)
- plink
- bgen

Re: Parquet and Avro (not Arrow)

![](https://hackmd.io/_uploads/rkHT7PBi2.png)

* https://adam.readthedocs.io/en/latest/architecture/schemas/
* Avro Schemas: https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl

## Getting started

Any environment that works for you is fine. We recommend working in VSCode, which provides support for Rust, Python, R and Jupyter development.

1. Install [Rust](https://www.rust-lang.org/tools/install)

2. Install [pipx](https://github.com/pypa/pipx)

```bash
$ brew install pipx  # on a mac
$ pipx ensurepath
```

3. Install [PyO3/Maturin](https://www.maturin.rs/installation.html)

```bash
$ pipx install maturin
```

4. Install [hatch](https://hatch.pypa.io/latest/)

```bash
$ pipx install hatch
```

If you like to keep your virtual environments inside your project directory:

```bash
hatch config set dirs.env.virtual .venv
```

5. Clone the repos and set up virtualenvs

```bash
$ mkdir oxbow_project
$ cd oxbow_project
$ git clone https://github.com/abdenlab/oxbow.git
$ git clone https://github.com/abdenlab/dask-ngs.git
```

For oxbow

```bash
$ cd oxbow/py-oxbow
# The following will create a virtual environment the
# first time you run it, then activate it.
# Alternatively to create: python -m venv .venv; pip install -e '.[dev]'
# Alternatively to activate: source .venv/bin/activate
$ hatch shell

# Download fixture files
$ cd ../fixtures
$ wget -i list.txt
```

For dask-ngs

```bash
$ cd dask-ngs
$ hatch shell
```

6. To (re)build oxbow and py-oxbow bindings (from `oxbow/py-oxbow` directory):

```bash
$ maturin develop --release
```

# Initial Ideas

Need: we are looking for burning analysis wishes and challenges from our team or others to drive these projects!

* Documentation (oxbow, py-oxbow, dask-ngs)
    * Sphinx for py-oxbow and dask-ngs (dask-ngs has API docs now)
    * Rustdoc for oxbow
* New features
    * Implement nested and semi-structured fields (e.g. BAM tags) in oxbow, which are supported by Arrow.
    * Explore most efficient data types (e.g., pyarrow strings instead of numpy)
    * Improve virtual offset mapping (advanced)
        * use either linear or bin indexes
    * Remote query (advanced)
        * Wrap pyo3 file-like into `noodles` Readers
        * Test with regular Python file handles, smart_open files, fsspec
        * Credentialed access: test querying protected AWS or GCP files
* API design
    * How to deal with headers
    * Conveniences:
        * Bit-flag expansion/interpretation
        * Read ends
        * Preferred field names
    * Which fields/columns are most useful: support reading only a subset of fields.
* Operations and performance
    * Which types of operations are most critical?
    * Do they better fit the dataframe or the array paradigm?
    * VCF vs BCF performance
    * Try applying distributed computation graphs with Dask
    * Compare oxbow dask dataframes with sgkit Zarr stores and conventional tools
* Ecosystem integrations
    * Oxbow and sgkit
    * Can we get VCF/BCF to work with the Ibis SQL engine?
* Converters and tools
    * Re-implement some basic `vcftools` utilities in a few lines of pandas/polars/dask
    * Re-implement some bloated popgen workflow in a few lines of code
    * Implement a brand new QC or summarization tool
    * VCF converter: parquet, Zarr, cloud store, ML pipeline

# Daily Log

## Day 1

Assigning roles:

* Team lead = Nezar
* Tech lead = Garrett
* Writer = David
* Flex

Notes

* Querying across VCF files is a big deal!
* Gathering information about VCF use cases:
    * Popularity of sgkit among genomics statisticians.
    * How are VCF files generally visualized?

Visualization

* Dynamic Genome Browser
    * UCSC
    * IGV
    * JBrowse
    * HiGlass
    * NCBI CGV browser
* Static Plot
    * Circos

Common pipelines

* BCFtools, freeBayes
* [GATK](https://gatk.broadinstitute.org/hc/en-us)
* Sarek (Nextflow pipeline)

### Project focus

* Examine typical scalable analysis approaches, visualization of VCF
* Can we accomplish what is desired with Oxbow?
    * https://glow.readthedocs.io/en/latest/etl/vcf2delta.html
* Data representation
    * Examine existing schemas
    * Can we merge VCFs in pandas/dask/polars?
    * Flatten VCF data into columnar data representation
* Data subsetting
    * Separate out INFO column from the rest of the columns
* Data harmonization
    * How to subset your VCF project by metadata outside the file and in the file (harmonize)
    * Do we need to merge everything into one database?
    * Or can we federate heterogeneous VCF files?
* Remote queries with file-like objects
* HPC vs Cloud
    * BigQuery not usable on HPC
* Existing resources
    * Dataframe-based (Apache Spark)
        * [glow](https://glow.readthedocs.io/en/latest/etl/vcf2delta.html)
        * [hail](https://hail.is/docs/0.2/overview/index.html)
    * Array-based
        * [tiledb](https://docs.tiledb.com/main/integrations-and-extensions/genomics/population-genomics/data-model)
        * [sgkit](https://pystatgen.github.io/sgkit/latest/)
    * Other (novel, sparser formats)
        * [tachyon](https://github.com/mklarqvist/tachyon)
        * SVCR, Savvy, etc.
    * Make your own schema
* Access to sample VCF files

**Goal: _Facilitate data harmonization and analysis by dynamically mapping VCF data to a flat schema where select information nested within desired columns is parsed and transformed into new columns upon querying._**

### Action items

- everyone: get set up with oxbow (Rust and Python); contact @Garrett for help
- acquire test VCF data
- create diagram explaining project
- daily Jupyter ipynb to explain work for the day
- Garrett to update daily log for repo
- project name for repo
- everyone can update the repo with their affiliation (good first commit)

### Plan for tomorrow

- help people get set up
- walk through oxbow codebase
- walk through noodles codebase
- check in with Team Park

## Day 2

### Design doc for rust (room 1)

[Design Doc](https://hackmd.io/YX6XotGKT0evC0Tj5eXLgg)

### Researching schemas (room 2)

[User Stories](https://hackmd.io/YX6XotGKT0evC0Tj5eXLgg?both#User-Stories)

### What we did

- set up local environments for development
- created [Design Doc](https://hackmd.io/YX6XotGKT0evC0Tj5eXLgg)
- software design
- gathered user stories

### Plan for tomorrow

- implementing dataframe solution
- check in with other teams on VCF usage
- investigate common VCF schemas

### Action items

- upload Day 2 log to GitHub repo

## Day 3

### What we did

- Jupyter notebook for schema flattening
    - flattened the INFO field to the top level
    - field parsing for fields with multiple values
    - cast VCF data into numpy types
    - allow exploding arbitrary columns vertically
    - example usage

### Plan for tomorrow

- handle genotype field
- demo with multiple VCFs
- prep presentation

### Action items

- upload Day 3 log to GitHub repo
- upload Jupyter notebook to GitHub repo
- upload test VCF and index files to GitHub repo

## Day 4

### What we did

- Jupyter notebook for flattening sample field
- Integrated flattening INFO and sample field into one function
- Created demo use-cases

### Plan for tomorrow

- present
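
As a recap of the flattening work logged on Day 3, the steps (lifting INFO keys to top-level columns, parsing multi-valued fields, casting values, and exploding a column vertically) can be sketched in plain Python. This is a minimal illustration using only the standard library; it is not the notebook code and does not use the oxbow API. The function names `parse_value`, `flatten_info`, and `explode`, the `INFO_` column prefix, and the sample records are all hypothetical.

```python
from typing import Any


def parse_value(raw: str) -> Any:
    """Cast one INFO value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def flatten_info(record: dict[str, Any], prefix: str = "INFO_") -> dict[str, Any]:
    """Lift each key of the semicolon-delimited INFO string to a top-level column.

    The `INFO_` prefix is an illustrative naming choice, not an oxbow convention.
    """
    flat = {k: v for k, v in record.items() if k != "INFO"}
    for entry in record.get("INFO", "").split(";"):
        if not entry or entry == ".":
            continue  # empty or missing INFO
        key, sep, raw = entry.partition("=")
        if not sep:
            flat[prefix + key] = True  # flag field such as DB
        elif "," in raw:
            # multi-valued field, e.g. one AF value per ALT allele
            flat[prefix + key] = [parse_value(v) for v in raw.split(",")]
        else:
            flat[prefix + key] = parse_value(raw)
    return flat


def explode(rows: list[dict[str, Any]], column: str) -> list[dict[str, Any]]:
    """Emit one row per element of a list-valued column; other rows pass through."""
    out = []
    for row in rows:
        value = row.get(column)
        if isinstance(value, list):
            out.extend({**row, column: v} for v in value)
        else:
            out.append(dict(row))
    return out


# Hypothetical records, shaped like rows of a VCF body.
records = [
    {"CHROM": "chr1", "POS": 100, "REF": "A", "ALT": "G,T", "INFO": "DP=14;AF=0.5,0.25;DB"},
    {"CHROM": "chr1", "POS": 250, "REF": "C", "ALT": "T", "INFO": "DP=9;AF=0.1"},
]
flat_rows = [flatten_info(r) for r in records]
long_rows = explode(flat_rows, "INFO_AF")
for row in long_rows:
    print(row)
```

A dataframe version would do the same thing column-wise; the row-wise form here is only meant to make the schema transformation explicit.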
