# Upstreaming Kerchunk
## Summary
We aim to upstream much of the functionality of kerchunk into Zarr and Zarr-python, through a series of individually-useful features.
## Context / motivation:
- All NASA archival array data (including netCDF, HDF5, GRIB, TIFF?, FITS?) either is or could be accessed as Zarr
- We have seen the impact of this approach in the AWS + PODAAC creation of the [MUR SST zarr](https://registry.opendata.aws/mur/) product
- Already able to use kerchunk to do some of this
- But need a more sustainable, powerful, and maintainable solution for the longer-term
Problems with kerchunk as-is:
- Monolithic project with few active maintainers
- Relies on fsspec, meaning that kerchunk's reference stores can only be read from python
- Uses a store-level abstraction which is less modular than an array-based abstraction
- An array-based abstraction would support the various needs to merge and concatenate references. Combining references currently relies on kerchunk's `MultiZarrToZarr`, which handles such a wide variety of use cases that it is overloaded with responsibility. See https://github.com/fsspec/kerchunk/issues/377 for more details.
- Current schema cannot handle data arrays with varying shapes or chunk schemas (otherwise known as variable-length chunks; see this [Zarr Enhancement Proposal](https://zarr.dev/zeps/draft/ZEP0003.html) to learn more).
Proposal:
- Multi-stakeholder effort to upstream functionality in Zarr specification / Zarr-Python / possibly a new dedicated `VirtualiZarr` package
- Formalization of extension features in Zarr Specification itself allows for language-agnostic data access
- The Zarr Specification is mature, clearly defined, and multi-stakeholder, and therefore more reliable in the long term
- New array-based abstraction through a dedicated `VirtualZarrArray` allows for wrapping with xarray, greatly streamlining the user experience for data providers tasked with giving access to data via Zarr.
- Direct integration with the Zarr model allows for taking advantage of other zarr enhancements, including the Variable Chunks ZEP and performance optimizations (e.g. sharded data access).
## Roadmap
We are really talking about a whole roadmap of features here. They can be broken up, and each has an MVP. Each feature below is listed together with the steps that should be tried to create its MVP.
### Feature 0: Storage transformers in zarr-python v3
Idea: Make sure the Zarr-Python 3.0 implementation actually has developed enough to allow adding features 1 and 2 below.
Steps:
1. Complete the store refactor (e.g. [zarr-python#1686](https://github.com/zarr-developers/zarr-python/discussions/1686))
1. Develop prototype manifest storage transformer as an experimental wrapper around a Zarr Store
1. Design, implement, and test generic array storage transformer API
1. (After formalizing the manifest and array metadata schema for the manifest) implement the manifest storage transformer in zarr-python
### Feature 1: "Chunk Manifest" indexing into legacy formats
Idea: Formalize kerchunk’s format for storing byte ranges via a [new zarr extension](https://github.com/zarr-developers/zarr-specs/issues/287), the so-called “chunk manifest”.
Steps:
1. Think through the format of the chunk manifest explicitly enough to actually describe such a metadata file in its entirety,
2. Create the necessary byte ranges from a netCDF4 file (ideally by calling `kerchunk.hdf.SingleHDF5ToZarr` and manipulating the result),
3. Then write a v3 Zarr array (i.e. serialize this metadata to disk) that conforms to this new chunk manifest ZEP,
4. Try to read this array in python (requiring a modification to zarr-python to teach it how to read the manifest)
5. Try to read this array in another language (e.g. using zarr-js, requiring a modification to zarr-js).
MVP: Read this test array from multiple languages
Milestone: Get the chunk manifest ZEP accepted into the Zarr Spec, and implemented in zarr-python
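As a rough illustration of steps 1 and 2, here is a minimal sketch of turning kerchunk-style references for one array into a chunk manifest. The manifest layout (chunk key mapped to `path`, `offset`, `length`) and the helper name `refs_to_manifest` are assumptions for illustration, not the finalized schema; the example references are made up rather than produced by kerchunk.

```python
import json


def refs_to_manifest(refs: dict, array_name: str) -> dict:
    """Convert kerchunk-style references for one array into a chunk manifest.

    Hypothetical helper: the output layout is an assumed manifest format.
    """
    prefix = array_name + "/"
    manifest = {}
    for key, value in refs.items():
        if not key.startswith(prefix):
            continue  # reference belongs to a different array
        chunk_key = key[len(prefix):]
        if chunk_key.startswith("."):
            continue  # skip metadata keys such as .zarray / .zattrs
        if not isinstance(value, list) or len(value) != 3:
            continue  # inlined or whole-file references are ignored in this sketch
        path, offset, length = value
        manifest[chunk_key] = {"path": path, "offset": offset, "length": length}
    return manifest


# Made-up references in the shape kerchunk emits: key -> [url, offset, length]
example_refs = {
    "sst/.zarray": '{"shape": [2, 4], "chunks": [2, 2]}',
    "sst/0.0": ["s3://bucket/file1.nc", 4096, 1024],
    "sst/0.1": ["s3://bucket/file1.nc", 5120, 1024],
    "time/0": ["s3://bucket/file1.nc", 100, 80],
}

manifest = refs_to_manifest(example_refs, "sst")
print(json.dumps(manifest, indent=2))
```

Because the manifest is plain JSON, a reader in any language only needs to look up a chunk key and issue a byte-range request against the recorded path.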
### Feature 2: Virtual Concatenation inside Zarr stores
Idea: Formalize the idea of virtual concatenation at the Zarr level via another [new zarr extension](https://github.com/zarr-developers/zarr-specs/issues/288)
Steps:
1. Describe how to record the concatenation of multiple zarr arrays in zarr metadata,
2. Create such a concatenated zarr array on disk manually,
3. PR to zarr-python to read this concatenated array,
4. This should automatically work with the chunk manifest arrays above, but test that too.
MVP: Read a Zarr array that was defined through concatenation
Milestone: Get the virtual concatenation ZEP accepted into the Zarr Spec, and implemented in zarr-python
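To make step 1 more concrete, the sketch below shows one way concatenation could be recorded in metadata and resolved at read time. The metadata keys (`axis`, `arrays`, `chunks_along_axis`), the array paths, and the function `resolve_chunk` are all hypothetical; the actual extension may look quite different.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical metadata: two arrays concatenated along axis 0.
concat_metadata = {
    "axis": 0,
    "arrays": [
        {"path": "year_2000", "chunks_along_axis": 3},
        {"path": "year_2001", "chunks_along_axis": 2},
    ],
}


def resolve_chunk(metadata: dict, chunk_index: tuple) -> tuple:
    """Map a chunk index of the concatenated array to (source path, local chunk index)."""
    axis = metadata["axis"]
    counts = [a["chunks_along_axis"] for a in metadata["arrays"]]
    ends = list(accumulate(counts))  # cumulative chunk counts along the axis
    position = chunk_index[axis]
    if not 0 <= position < ends[-1]:
        raise IndexError(f"chunk index {chunk_index} is out of bounds")
    source = bisect_right(ends, position)  # which source array holds this chunk
    start = ends[source] - counts[source]  # first global chunk of that array
    local = list(chunk_index)
    local[axis] = position - start
    return metadata["arrays"][source]["path"], tuple(local)


print(resolve_chunk(concat_metadata, (3, 1)))
```

A reader would then fetch the resolved chunk from the source array as usual, which is why this should compose with chunk-manifest arrays without extra work.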
### Feature 3: `VirtualZarrArray` python object
Idea: Replace the overloaded `kerchunk.combine.MultiZarrToZarr` class with a virtual array type so that all combining of legacy file data can be expressed as array concatenations.
Steps:
1. Create a VirtualZarrArray object which contains only the zarr metadata, but can nevertheless be concatenated like a numpy array (similar to the [`KerchunkArray` prototype](https://github.com/pydata/xarray/issues/8699#issuecomment-1925916420)),
2. Add a serialization method to the array object that can write out valid Zarr on-disk.
3. Create an instance of VirtualZarrArray object which contains only NaNs, of any desired shape.
4. Probably want to make a creation function like [`np.empty_like`](https://numpy.org/doc/stable/reference/generated/numpy.empty_like.html)
5. Make this concatenatable with the normal `VirtualZarrArray` objects and serializable too.
MVP: Prototype `VirtualZarrArray` class that supports concatenation and serialization to Zarr on-disk
Milestone: Fully-developed `VirtualZarrArray` class that supports concatenation, indexing, NaNs, and serialization, which lives either in zarr-python or in a separate new package ("`VirtualiZarr`")
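The MVP could look roughly like the following sketch: a metadata-only array that holds a shape, a chunk shape, and a chunk manifest, and that can be concatenated and serialized. The class name `VirtualArraySketch`, its fields, and the `concat` helper are hypothetical stand-ins, not the proposed `VirtualZarrArray` API, and the sketch assumes regular chunks that evenly divide the concatenation axis.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualArraySketch:
    """Hypothetical metadata-only array: no chunk data is ever loaded."""
    shape: tuple
    chunks: tuple
    manifest: dict  # chunk key such as "0.1" -> {"path", "offset", "length"}

    def to_json(self) -> str:
        # Serialize the metadata only; the legacy files are never copied.
        doc = {"shape": list(self.shape), "chunks": list(self.chunks),
               "manifest": self.manifest}
        return json.dumps(doc, sort_keys=True)


def concat(arrays: list, axis: int = 0) -> VirtualArraySketch:
    """Concatenate virtual arrays by shifting chunk keys along `axis`."""
    first = arrays[0]
    new_manifest = {}
    chunk_offset = 0  # number of chunks already placed along the axis
    total = 0  # accumulated length along the axis
    for arr in arrays:
        if arr.chunks != first.chunks:
            raise ValueError("all arrays must share the same chunk shape")
        if arr.shape[:axis] + arr.shape[axis + 1:] != first.shape[:axis] + first.shape[axis + 1:]:
            raise ValueError("shapes must match on all other axes")
        if arr.shape[axis] % arr.chunks[axis] != 0:
            raise ValueError("this sketch requires chunks to divide the axis evenly")
        for key, entry in arr.manifest.items():
            index = [int(i) for i in key.split(".")]
            index[axis] += chunk_offset
            new_manifest[".".join(str(i) for i in index)] = entry
        chunk_offset += arr.shape[axis] // arr.chunks[axis]
        total += arr.shape[axis]
    new_shape = first.shape[:axis] + (total,) + first.shape[axis + 1:]
    return VirtualArraySketch(new_shape, first.chunks, new_manifest)


a = VirtualArraySketch((2, 4), (2, 2), {
    "0.0": {"path": "file1.nc", "offset": 0, "length": 16},
    "0.1": {"path": "file1.nc", "offset": 16, "length": 16},
})
b = VirtualArraySketch((2, 4), (2, 2), {
    "0.0": {"path": "file2.nc", "offset": 0, "length": 16},
    "0.1": {"path": "file2.nc", "offset": 16, "length": 16},
})
combined = concat([a, b], axis=0)
print(combined.to_json())
```

Supporting uneven chunks or NaN-filled padding arrays would need the variable-chunk and fill-value handling described in the steps above, which this sketch deliberately leaves out.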
### Feature 4: Xarray wrapping `VirtualZarrArray` objects
Idea: Make it easy to use xarray semantics (e.g. `xr.concat` or `xr.open_mfdataset`) to combine many legacy files into one Zarr store.
Steps:
1. Create a small [custom xarray backend](https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html) for opening netCDF data in metadata-only form (i.e. as a `VirtualZarrArray` instead of as a numpy/dask array) - see the `KerchunkArray` notebook linked above.
2. Open the on-disk data with this xarray backend, and ensure concatenation etc. works correctly (see [possible issues](https://github.com/pydata/xarray/issues/8699)),
3. Write a special [xarray accessor](https://docs.xarray.dev/en/stable/internals/extending-xarray.html) to serialize the resultant concatenated `VirtualZarrArray` to disk as a new valid zarr array.
MVP: Gist showing how to open legacy files as xarray-wrapped `VirtualZarrArray`s and concatenate them
Milestone: Provide the xarray backend and accessor along with documentation, living either in zarr-python or in a separate new package ("`VirtualiZarr`").
## Impact
The end result of this would allow us to:
1. Use xarray’s high-level API for concatenation / opening, so “kerchunking” a whole set of files becomes just one or two lines of familiar xarray code,
2. "Kerchunk" tricky datasets more easily, such as those with variable-length chunks, staggered grids, or uneven dimension sizes that require padding,
3. Serialize the new combined reference files as a valid Zarr store, without copying the legacy data files,
4. Open the legacy data via this zarr store without using fsspec, and therefore not requiring python (instead we could imagine opening data from a web browser via javascript for example),
5. Get free usage of other recent Zarr features, such as sharding and variable-length chunks.
## Example datasets
- PODAAC datasets: SWOT, MURSST
- [C]Worthy's datasets: ROMS, CESM