# Upstreaming Kerchunk (Updated May 2024)

## Summary

We aim to upstream much of the functionality of [kerchunk](https://fsspec.github.io/kerchunk) into the [Zarr](https://zarr.dev/) specification and the [zarr-python](https://github.com/zarr-developers/zarr-python) library, through a series of individually-useful features.

## Context & Motivation

Kerchunk has demonstrated that you don't need to store a copy of archival data files to access data in a cloud-optimized manner. Zarr has become a popular format for working with large volumes of multidimensional array data on the cloud; one example is the [MUR SST Zarr](https://registry.opendata.aws/mur/) product available on AWS public datasets. Kerchunk has demonstrated that you can access data stored as HDF5/NetCDF, GeoTIFF, GRIB, and FITS as Zarr stores. Kerchunk generates Zarr metadata, stored alongside key/value pairs that map Zarr chunk indices (e.g. `0.0.0`) to file URLs and byte ranges. Separating the metadata from the actual bytes in the chunks allows opening lazy representations of large datasets without loading any chunk data, e.g. via xarray's lazy-loading machinery.

The success of Kerchunk, however, has also shown the need for a more sustainable, powerful, and maintainable solution for long-term adoption by data archivers. Problems with Kerchunk as-is:

- Monolithic project with few active maintainers.
- Relies on fsspec, meaning that kerchunk's reference stores can only be read from Python.
- Uses a store-level abstraction, which is less modular than an array-based abstraction.
  - An array-based abstraction would better support the various needs to merge and concatenate references. Combining references currently relies on Kerchunk's `MultiZarrToZarr`, which handles such a wide variety of use cases that the responsibility of this one function has become overloaded. See https://github.com/fsspec/kerchunk/issues/377 for more details.
- The current schema cannot handle data arrays with varying shapes or chunk shapes (otherwise known as variable-length chunks; see this [Zarr Enhancement Proposal](https://zarr.dev/zeps/draft/ZEP0003.html) to learn more).
- High memory usage during generation of reference files for large datasets.

## Proposal

A multi-stakeholder effort to upstream functionality, including:

- Formalization of extension features in the Zarr specification itself, allowing for language-agnostic data access. The Zarr specification is mature, clearly defined, and multi-stakeholder, and therefore more reliable in the long term.
  - The Zarr specs will include a chunk manifest specification, which will also define how to declare the concatenation of manifest arrays.
- A manifest storage transformer in the zarr-python library.
- A `VirtualiZarr` package which allows for concatenation.
  - A new array-based abstraction through a dedicated `ManifestArray` (which may also be called a virtual Zarr array) allows for wrapping with xarray, greatly streamlining the user experience for data providers tasked with giving access to data via Zarr.
- New modules for creating chunk manifests from archival formats.

Making chunk manifests part of the formal Zarr specification will allow them to take advantage of other Zarr enhancements, including the variable chunks ZEP and performance optimizations (e.g. sharded data access).

## Roadmap

A number of features must be developed to achieve the goal of upstreaming kerchunk: the chunk manifest specification must become part of the Zarr spec, and then libraries (zarr-python, VirtualiZarr) need to be able to read, write, and modify chunk manifests via an API. For each feature below, the numbered list gives the steps that should be tried to create the MVP.

### Feature 0: Formalize the chunk manifest specification

Idea: Formalize kerchunk's format for storing byte ranges via a [new zarr extension](https://github.com/zarr-developers/zarr-specs/issues/287), the so-called "chunk manifest".

Steps:

1. Think through the format of the chunk manifest explicitly enough to actually describe such a metadata file in its entirety (a hypothetical sketch follows this section).
2. Create the necessary byte ranges from a netCDF4 file.
3. Write a Zarr v3 array (i.e. serialize this metadata to disk) that conforms to the new chunk manifest ZEP.
4. After the chunk manifest storage transformer has been implemented, try to read this array in Python.
5. Try to read this array in another language (e.g. using zarr-js, requiring a modification to zarr-js).

MVP: Read this test array in Python.

Milestone: Get the chunk manifest ZEP accepted into the Zarr spec.
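For concreteness, here is a minimal sketch of what a chunk manifest could look like, assuming the key/entry layout discussed in [zarr-specs#287](https://github.com/zarr-developers/zarr-specs/issues/287); the exact schema is still under discussion, and the paths, offsets, and lengths below are hypothetical:

```python
import json

# Hypothetical chunk manifest for an array stored as two chunks inside an
# existing netCDF4 file: each Zarr chunk key maps to a byte range in that
# file. Field names follow the layout discussed in zarr-specs#287 and may
# change before the ZEP is accepted.
manifest = {
    "0.0": {"path": "s3://bucket/test.nc", "offset": 8000, "length": 48000},
    "0.1": {"path": "s3://bucket/test.nc", "offset": 56000, "length": 48000},
}

# Serialized alongside the array's Zarr metadata, this would give any Zarr
# reader everything it needs to fetch chunks directly from the archival file.
with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```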
### Feature 1: Chunk manifest storage transformers in zarr-python v3

Idea: Make sure the Zarr-Python 3.0 implementation has actually developed enough to allow reading data from files using chunk manifests.

Steps:

1. Complete the store refactor (e.g. [zarr-python#1686](https://github.com/zarr-developers/zarr-python/discussions/1686)).
2. Develop a prototype manifest storage transformer as an experimental wrapper around a Zarr store.
3. Design, implement, and test a generic array storage transformer API.
4. After formalizing the manifest and array metadata schema, implement the manifest storage transformer in zarr-python.

### Feature 2: `ManifestArray` Python object

Idea: Create a virtual array object class and API for creating, storing, and combining references to chunks of legacy file data.

**Update May 2024: This feature has been largely completed with the release of the [`VirtualiZarr`](https://github.com/TomNicholas/VirtualiZarr) library. However, `VirtualiZarr` will need to be tested for compliance with the Zarr chunk manifest specification and the manifest storage transformer once those are released.**

Steps:

1. Create a virtual Zarr array object class (`ManifestArray`) which contains the Zarr metadata and chunk manifests, but can nevertheless be concatenated like a numpy array (similar to the [`KerchunkArray` prototype](https://github.com/pydata/xarray/issues/8699#issuecomment-1925916420)).
2. Add a serialization method to the array object that can write out valid Zarr metadata and chunk manifests on disk.
3. Support creating an instance of `ManifestArray` which contains only NaNs, of any desired shape, via a creation function like [`np.empty_like`](https://numpy.org/doc/stable/reference/generated/numpy.empty_like.html).
   a. Make this concatenable with normal `ManifestArray` objects and serializable too.
4. Modify chunk manifest serialization and Zarr metadata as needed to adapt to the release of Zarr v3 and the chunk manifest specification.

MVP: Prototype `ManifestArray` class that supports concatenation and serialization to Zarr on disk.

Milestone: Fully-developed `ManifestArray` class that supports concatenation, indexing, NaNs, and serialization, which lives either in zarr-python or in a separate new package (`VirtualiZarr`).
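To make the concatenation semantics concrete, here is a toy sketch of the bookkeeping involved (a hypothetical helper, not VirtualiZarr's actual implementation): combining two manifests only rewrites chunk keys and never touches chunk data, which is what lets a `ManifestArray` behave like a numpy array while staying metadata-only.

```python
def concat_manifests(a: dict, b: dict, axis: int, chunks_along_axis: int) -> dict:
    """Concatenate manifest `b` after manifest `a` along `axis`, where `a`
    spans `chunks_along_axis` chunks on that axis."""
    combined = dict(a)
    for key, entry in b.items():
        idx = [int(i) for i in key.split(".")]
        idx[axis] += chunks_along_axis  # shift b's chunks into the combined grid
        combined[".".join(map(str, idx))] = entry
    return combined

# Two single-chunk manifests pointing at different (hypothetical) files:
day1 = {"0.0": {"path": "s3://bucket/2024-01-01.nc", "offset": 8000, "length": 48000}}
day2 = {"0.0": {"path": "s3://bucket/2024-01-02.nc", "offset": 8000, "length": 48000}}

print(concat_manifests(day1, day2, axis=0, chunks_along_axis=1))
# {'0.0': {...'2024-01-01.nc'...}, '1.0': {...'2024-01-02.nc'...}}
```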
### Feature 3: Xarray wrapping `ManifestArray` objects

Idea: Make it easy to use xarray semantics (e.g. `xr.concat` or `xr.open_mfdataset`) to combine many legacy files into one Zarr store.

**Update May 2024: This feature has been largely completed with the release of the `VirtualiZarr` library.**

Steps:

1. Create a small [custom xarray backend](https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html) for opening netCDF data in metadata-only form (i.e. as a `ManifestArray` instead of as a numpy/dask array).
2. Open the on-disk data with this xarray backend, and ensure concatenation etc. works correctly (see [possible issues](https://github.com/pydata/xarray/issues/8699)).
3. Write a special [xarray accessor](https://docs.xarray.dev/en/stable/internals/extending-xarray.html) to serialize the resultant concatenated `ManifestArray` to disk as a new valid Zarr array.

MVP: Gist showing how to open legacy files as xarray-wrapped `ManifestArray`s and concatenate them.

Milestone: Provide the xarray backend and accessor along with documentation, living either in zarr-python or in a separate new package (`VirtualiZarr`).
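With `VirtualiZarr` installed, the intended workflow looks roughly like this (the filenames are hypothetical, and keyword arguments may differ between `VirtualiZarr` versions, so treat this as a sketch rather than exact API documentation):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open each legacy file in metadata-only form: the variables wrap
# ManifestArrays holding byte ranges, so no chunk data is loaded.
vds1 = open_virtual_dataset("2024-01-01.nc")
vds2 = open_virtual_dataset("2024-01-02.nc")

# Combine with ordinary xarray semantics; only the manifests are merged.
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")

# Serialize the combined references to disk, here in kerchunk's JSON format.
combined.virtualize.to_kerchunk("combined.json", format="json")
```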
### Feature 4: Create new modules for reading metadata and chunks from archival formats

Idea: New modules for reading metadata and chunks from archival formats, starting with HDF5/NetCDF, will be more stable and easier to maintain.

**Update May 2024: This feature is in progress in the `VirtualiZarr` library.**

Steps:

1. Create new modules, using typed Python, for reading metadata and chunks from HDF5 files and returning valid Zarr metadata and chunk manifests.
2. Speak with data providers to understand which datasets to prioritize for testing chunk manifest functionality. Use this information to also inform step 3.
3. Create modules for other legacy formats according to the community's assessment of need.

### Feature 5: Virtual concatenation inside Zarr stores

Idea: Virtual concatenation at the Zarr level will support concatenating arrays with different codecs, including different compression options and different encodings. An example is weather data that gets written out daily but whose `scale_factor` and `offset` encoding arguments differ between files. At present there is only array-level encoding, so chunks with different encodings cannot be written as a single Zarr array.

Steps:

1. Describe how to record the concatenation of multiple Zarr arrays in Zarr metadata (Issue: https://github.com/zarr-developers/zarr-specs/issues/288; a hypothetical sketch appears at the end of this document).
2. Create such a concatenated Zarr array on disk manually.
3. PR to zarr-python to read this concatenated array.
4. This should automatically work with the chunk manifest arrays above, but test that too.
5. Nice to have: Migrate xarray encoding up into Zarr codecs. See https://github.com/TomNicholas/VirtualiZarr/issues/68 for details.

MVP: Read a Zarr array that was defined through concatenation.

Milestone: Get the virtual concatenation ZEP accepted into the Zarr specification, and implemented in zarr-python.

### Additional Features

Additional features for VirtualiZarr can be reviewed here: https://github.com/TomNicholas/VirtualiZarr/issues.

## Impact

The end result of this would allow us to:

1. Use xarray's high-level API for concatenation and opening, so creating an in-memory chunk manifest for a whole set of files becomes just one or two lines of familiar xarray code.
2. Create chunk manifests for tricky datasets more easily, such as those with variable-length chunks, staggered grids, or uneven dimension sizes that require padding.
3. Serialize the in-memory chunk manifests as a valid Zarr store, without copying the legacy data files.
4. Open the legacy data as a Zarr store without using fsspec, and therefore without requiring Python (we could imagine, for example, opening data from a web browser via JavaScript).
5. Get free usage of other recent Zarr features, such as sharding and variable-length chunks.

## Example datasets

- PODAAC datasets: SWOT, MURSST
- [C]Worthy's datasets: ROMS, CESM
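Returning to Feature 5, step 1: a minimal, hypothetical sketch of how the concatenation of multiple Zarr arrays might be recorded in metadata. Every field name below is an assumption; the real schema is being designed in [zarr-specs#288](https://github.com/zarr-developers/zarr-specs/issues/288).

```python
# Hypothetical metadata for a virtual Zarr array assembled by concatenating
# two existing arrays along axis 0. Because each source array keeps its own
# codecs and encoding, differently-encoded data could be combined this way.
# All field names are illustrative only.
concatenated_array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [730, 1440],  # combined shape of the two source arrays
    "data_type": "float32",
    "concatenate": {       # assumed extension point, not part of the spec
        "axis": 0,
        "arrays": ["s3://bucket/year1.zarr", "s3://bucket/year2.zarr"],
    },
}
```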