
Upstreaming Kerchunk

Summary

We aim to upstream much of the functionality of kerchunk into Zarr and Zarr-python, through a series of individually useful features.

Context / motivation:

  • All NASA archival array data (including netCDF, HDF5, GRIB, TIFF?, FITS?) either is or could be accessed as Zarr
    • We have already seen the impact of this work in the AWS + PODAAC creation of the MUR SST Zarr product
  • Already able to use kerchunk to do some of this
  • But need a more sustainable, powerful, and maintainable solution for the longer-term

Problems with kerchunk as-is:

  • Monolithic project with few active maintainers
  • Relies on fsspec, meaning that kerchunk's reference stores can only be read from Python
  • Uses a store-level abstraction which is less modular than an array-based abstraction
    • An array-based abstraction would support the various needs to merge and concatenate references. Combining references currently relies on kerchunk's MultiZarrToZarr, which handles such a wide variety of use cases that the responsibility of this one function has become overloaded (see https://github.com/fsspec/kerchunk/issues/377 for more details, and the sketch below).
  • The current schema cannot handle data arrays with varying shapes or chunk schemas (otherwise known as variable-length chunks; see the relevant Zarr Enhancement Proposal to learn more)
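
For concreteness, combining two per-file reference sets with kerchunk today looks roughly like the following (a sketch; see the kerchunk docs for the full set of options):

```python
from kerchunk.combine import MultiZarrToZarr

# One function is responsible for concatenation, merging, and coordinate
# handling, all configured through a growing set of keyword arguments.
mzz = MultiZarrToZarr(
    ["refs1.json", "refs2.json"],  # per-file kerchunk reference sets
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
)
combined_refs = mzz.translate()
```

An array-level abstraction would instead express the same operation as ordinary concatenation of array objects (see Feature 3 below).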

Proposal:

  • Multi-stakeholder effort to upstream functionality in Zarr specification / Zarr-Python / possibly a new dedicated VirtualiZarr package
  • Formalizing extension features in the Zarr specification itself allows for language-agnostic data access
  • The Zarr specification is mature, clearly defined, and multi-stakeholder, and therefore more reliable in the long term
  • New array-based abstraction through a dedicated VirtualZarrArray allows for wrapping with xarray, greatly streamlining the user experience for data providers tasked with giving access to data via Zarr.
  • Direct integration with the Zarr model allows for taking advantage of other zarr enhancements, including the Variable Chunks ZEP and performance optimizations (e.g. sharded data access).

Roadmap

We are really talking about a whole roadmap of features here. They can be broken up, and each has an MVP. Each feature heading below names the feature, and the numbered list under it gives the steps that should be tried to create its MVP.

Feature 0: Storage transformers in zarr-python v3

Idea: Make sure the Zarr-Python 3.0 implementation has actually developed far enough to allow adding features 1 and 2 below.

Steps:

  1. Complete the store refactor (e.g. zarr-python#1686)
  2. Develop a prototype manifest storage transformer as an experimental wrapper around a Zarr store (a sketch follows below),
  3. Design, implement, and test a generic array storage transformer API,
  4. (After formalizing the manifest and array metadata schema in Feature 1) implement the manifest storage transformer in zarr-python.
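
A minimal sketch of what the step-2 prototype could look like, assuming nothing about the final zarr-python API (the class name and manifest layout here are purely illustrative):

```python
# Purely illustrative: satisfy chunk-key reads from byte ranges inside
# legacy files, instead of from objects in a real Zarr store.
class ManifestStorageTransformer:
    def __init__(self, manifest: dict):
        # manifest maps chunk keys like "0.0" to (path, offset, length)
        self.manifest = manifest

    def __getitem__(self, key: str) -> bytes:
        path, offset, length = self.manifest[key]
        # Local files only in this sketch; remote protocols would go
        # through whatever IO layer the underlying store provides.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```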

Feature 1: "Chunk Manifest" indexing into legacy formats

Idea: Formalize kerchunk’s format for storing byte ranges via a new zarr extension, the so-called “chunk manifest”.
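
As a strawman (the actual schema would be settled through the ZEP process), such a manifest might map each chunk key to a byte range within an existing archival file:

```python
import json

# Hypothetical manifest layout: one entry per chunk key, each pointing
# at a byte range inside a legacy file. Paths and offsets are made up.
manifest = {
    "0.0": {"path": "s3://bucket/file1.nc", "offset": 100, "length": 6144},
    "0.1": {"path": "s3://bucket/file1.nc", "offset": 6244, "length": 6144},
    "1.0": {"path": "s3://bucket/file2.nc", "offset": 100, "length": 6144},
    "1.1": {"path": "s3://bucket/file2.nc", "offset": 6244, "length": 6144},
}

# Stored alongside the array metadata, e.g. as a manifest.json document
print(json.dumps(manifest, indent=2))
```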

Steps:

  1. Think through the format of the chunk manifest explicitly enough to actually describe such a metadata file in its entirety,
  2. Create the necessary byte ranges from a netCDF4 file (ideally by calling kerchunk.hdf.SingleHdf5ToZarr and manipulating the result),
  3. Then write a v3 Zarr array (i.e. serialize this metadata to disk) that conforms to this new chunk manifest ZEP,
  4. Try to read this array in Python (requiring a modification to zarr-python to teach it how to read the manifest),
  5. Try to read this array in another language (e.g. using zarr-js, requiring a modification to zarr-js).
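
Step 2 might look roughly like this (hedged: kerchunk's exact reference layout varies by version, the manifest schema is the strawman above, and a single data variable is assumed):

```python
from kerchunk.hdf import SingleHdf5ToZarr

# Let kerchunk locate the chunk byte ranges inside an HDF5/netCDF4 file,
# then reshape its references into the manifest layout sketched above.
refs = SingleHdf5ToZarr("file1.nc", "file1.nc").translate()["refs"]
manifest = {
    key.split("/", 1)[1]: {"path": path, "offset": offset, "length": length}
    for key, value in refs.items()
    if isinstance(value, list)  # chunk refs look like [path, offset, length]
    for path, offset, length in [value]
}
```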

MVP: Read this test array from multiple languages

Milestone: Get the chunk manifest ZEP accepted into the Zarr Spec, and implemented in zarr-python

Feature 2: Virtual Concatenation inside Zarr stores

Idea: Formalize virtual concatenation at the Zarr level via another new zarr extension.
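
As a strawman for step 1 below (the extension name and keys are invented here; the real schema would come out of the ZEP), the metadata for a virtually-concatenated array might look like:

```python
# Hypothetical v3-style metadata for an array defined as the concatenation
# of two existing 10 x 10 arrays along axis 0. No chunk data is stored;
# reads are delegated to the constituent arrays.
concatenated_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [20, 10],
    "data_type": "float64",
    "virtual_concatenation": {               # invented extension name
        "axis": 0,
        "arrays": ["arrays/a", "arrays/b"],  # paths within the same store
    },
}
```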

Steps:

  1. Describe how to record the concatenation of multiple zarr arrays in zarr metadata,
  2. Create such a concatenated zarr array on disk manually,
  3. PR to zarr-python to read this concatenated array,
  4. This should automatically work with the chunk manifest arrays above, but test that too.

MVP: Read a Zarr array that was defined through concatenation

Milestone: Get the virtual concatenation ZEP accepted into the Zarr Spec, and implemented in zarr-python

Feature 3: VirtualZarrArray python object

Idea: Replace the overloaded kerchunk.combine.MultiZarrToZarr function with a virtual array type so that all combining of legacy file data can be expressed as array concatenations.
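
A toy sketch of the core idea, assuming a uniform chunk grid that evenly divides the array shape (all names here are hypothetical, not an existing API):

```python
import numpy as np

class VirtualZarrArray:
    """Holds only zarr metadata plus a chunk manifest; never loads chunk data."""

    def __init__(self, shape, dtype, chunks, manifest):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)
        self.chunks = tuple(chunks)
        self.manifest = manifest  # chunk key "i.j" -> (path, offset, length)

    def concat(self, other, axis=0):
        # Pure metadata manipulation: shift the other array's chunk indices
        # along `axis`, then merge the two manifests.
        assert self.dtype == other.dtype and self.chunks == other.chunks
        offset = self.shape[axis] // self.chunks[axis]  # chunk count on axis
        merged = dict(self.manifest)
        for key, ref in other.manifest.items():
            idx = [int(i) for i in key.split(".")]
            idx[axis] += offset
            merged[".".join(map(str, idx))] = ref
        new_shape = list(self.shape)
        new_shape[axis] += other.shape[axis]
        return VirtualZarrArray(new_shape, self.dtype, self.chunks, merged)
```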

Steps:

  1. Create a VirtualZarrArray object which contains only the zarr metadata, but can nevertheless be concatenated like a numpy array (similar to the KerchunkArray prototype),
  2. Add a serialization method to the array object that can write out valid Zarr on-disk.
  3. Create an instance of VirtualZarrArray which contains only NaNs, of any desired shape,
  4. Probably want to make a creation function like np.empty_like (see the sketch below),
  5. Make this concatenable with normal VirtualZarrArray objects, and serializable too.
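
Continuing the toy sketch above (names still hypothetical), steps 3 and 4 might reduce to a creation function that returns a virtual array with an empty manifest, so that every chunk is "missing" and reads back as NaN:

```python
# Hypothetical helper, modeled on np.empty_like: no chunk references at
# all, so a reader would materialize every chunk as the fill value (NaN).
def virtual_nan_array(shape, chunks, dtype="float64"):
    return VirtualZarrArray(shape, dtype, chunks, manifest={})
```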

MVP: Prototype VirtualZarrArray class that supports concatenation and serialization to Zarr on-disk

Milestone: Fully-developed VirtualZarrArray class that supports concatenation, indexing, NaNs, and serialization, which lives either in zarr-python or in a separate new package ("VirtualiZarr")

Feature 4: Xarray wrapping VirtualZarrArray objects

Idea: Make it easy to use xarray semantics (e.g. xr.concat or xr.open_mfdataset) to combine many legacy files into one Zarr store.
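
The target workflow might look like this (the engine name and accessor method are placeholders for things that do not exist yet; the steps below would create them):

```python
import xarray as xr

# Each file opens as metadata-only VirtualZarrArrays; xarray's existing
# combine machinery then concatenates them without reading any chunk data.
ds = xr.open_mfdataset(
    "archive/*.nc",
    engine="virtualizarr",  # hypothetical backend (step 1 below)
    combine="nested",
    concat_dim="time",
)
ds.virtualize.to_zarr("combined.zarr")  # hypothetical accessor (step 3 below)
```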

Steps:

  1. Create a small custom xarray backend for opening netCDF data in metadata-only form (i.e. as a VirtualZarrArray instead of as a numpy/dask array) - see the KerchunkArray notebook linked above.
  2. Open the on-disk data with this xarray backend, and ensure concatenation etc. works correctly (see possible issues),
  3. Write a special xarray accessor to serialize the resultant concatenated VirtualZarrArray to disk as a new valid zarr array.

MVP: Gist showing how to open legacy files as xarray-wrapped VirtualZarrArrays and concatenate them

Milestone: Provide the xarray backend and accessor along with documentation, living either in zarr-python or in a separate new package ("VirtualiZarr").

Impact

The end result of this would allow us to:

  1. Use xarray’s high-level API for concatenation / opening, so “kerchunking” a whole set of files becomes just one or two lines of familiar xarray code,
  2. "Kerchunk" tricky datasets more easily, such as those with variable-length chunks, staggered grids, or uneven dimension sizes that require padding,
  3. Serialize the new combined reference files as a valid Zarr store, without copying the legacy data files,
  4. Open the legacy data via this zarr store without using fsspec, and therefore without requiring Python (instead we could imagine opening data from a web browser via javascript, for example),
  5. Get free usage of other recent Zarr features, such as sharding and variable-length chunks.

Example datasets

  • PODAAC datasets: SWOT, MURSST
  • [C]Worthy's datasets: ROMS, CESM