
Upstreaming Kerchunk

Summary

We aim to upstream much of the functionality of kerchunk into Zarr and Zarr-python, through a series of individually useful features.

Context / motivation:

  • All NASA archival array data (including netCDF, HDF5, GRIB, TIFF?, FITS?) either is or could be accessed as Zarr
    • We have already seen the impact of this work in the AWS + PODAAC creation of the MUR SST Zarr product
  • Already able to use kerchunk to do some of this
  • But need a more sustainable, powerful, and maintainable solution for the longer-term

Problems with kerchunk as-is:

  • Monolithic project with few active maintainers
  • Relies on fsspec, meaning that kerchunk's reference stores can only be read from Python
  • Uses a store-level abstraction which is less modular than an array-based abstraction
    • An array-based abstraction would support the various needs to merge and concatenate references. Combining references currently relies on kerchunk's MultiZarrToZarr, which handles such a wide variety of use cases that the responsibility of this one function has become overloaded (see https://github.com/fsspec/kerchunk/issues/377 for more details, and the sketch below).
  • The current schema cannot handle data arrays with varying shapes or chunk schemas (otherwise known as variable-length chunks; see the relevant Zarr Enhancement Proposal to learn more)
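
For concreteness, combining two per-file reference sets with kerchunk today looks roughly like the following (a sketch; see the kerchunk docs for the full set of options):

```python
from kerchunk.combine import MultiZarrToZarr

# One function is responsible for concatenation, merging, and coordinate
# handling, all configured through a growing set of keyword arguments.
mzz = MultiZarrToZarr(
    ["refs1.json", "refs2.json"],  # per-file kerchunk reference sets
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
)
combined_refs = mzz.translate()
```

An array-level abstraction would instead express the same operation as ordinary concatenation of array objects (see Feature 3 below).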

Proposal:

  • Multi-stakeholder effort to upstream functionality in Zarr specification / Zarr-Python / possibly a new dedicated VirtualiZarr package
  • Formalizing extension features in the Zarr specification itself allows for language-agnostic data access
  • The Zarr specification is mature, clearly defined, and multi-stakeholder, and therefore more reliable in the long term
  • New array-based abstraction through a dedicated VirtualZarrArray allows for wrapping with xarray, greatly streamlining the user experience for data providers tasked with giving access to data via Zarr.
  • Direct integration with the Zarr model allows for taking advantage of other zarr enhancements, including the Variable Chunks ZEP and performance optimizations (e.g. sharded data access).

Roadmap

We are really talking about a whole roadmap of features here. They can be broken up, and each has an MVP. Each feature heading below names the feature, and the numbered list under it gives the steps that should be tried to create its MVP.

Feature 0: Storage transformers in zarr-python v3

Idea: Make sure the Zarr-Python 3.0 implementation has actually developed far enough to allow adding features 1 and 2 below.

Steps:

  1. Complete the store refactor (e.g. zarr-python#1686)
  2. Develop a prototype manifest storage transformer as an experimental wrapper around a Zarr store (a sketch follows below),
  3. Design, implement, and test a generic array storage transformer API,
  4. (After formalizing the manifest and array metadata schema in Feature 1) implement the manifest storage transformer in zarr-python.
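
A minimal sketch of what the step-2 prototype could look like, assuming nothing about the final zarr-python API (the class name and manifest layout here are purely illustrative):

```python
# Purely illustrative: satisfy chunk-key reads from byte ranges inside
# legacy files, instead of from objects in a real Zarr store.
class ManifestStorageTransformer:
    def __init__(self, manifest: dict):
        # manifest maps chunk keys like "0.0" to (path, offset, length)
        self.manifest = manifest

    def __getitem__(self, key: str) -> bytes:
        path, offset, length = self.manifest[key]
        # Local files only in this sketch; remote protocols would go
        # through whatever IO layer the underlying store provides.
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```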

Feature 1: "Chunk Manifest" indexing into legacy formats

Idea: Formalize kerchunk’s format for storing byte ranges via a new zarr extension, the so-called “chunk manifest”.
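
As a strawman (the actual schema would be settled through the ZEP process), such a manifest might map each chunk key to a byte range within an existing archival file:

```python
import json

# Hypothetical manifest layout: one entry per chunk key, each pointing
# at a byte range inside a legacy file. Paths and offsets are made up.
manifest = {
    "0.0": {"path": "s3://bucket/file1.nc", "offset": 100, "length": 6144},
    "0.1": {"path": "s3://bucket/file1.nc", "offset": 6244, "length": 6144},
    "1.0": {"path": "s3://bucket/file2.nc", "offset": 100, "length": 6144},
    "1.1": {"path": "s3://bucket/file2.nc", "offset": 6244, "length": 6144},
}

# Stored alongside the array metadata, e.g. as a manifest.json document
print(json.dumps(manifest, indent=2))
```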

Steps:

  1. Think through the format of the chunk manifest explicitly enough to actually describe such a metadata file in its entirety,
  2. Create the necessary byte ranges from a netCDF4 file (ideally by calling kerchunk.hdf.SingleHdf5ToZarr and manipulating the result),
  3. Then write a v3 Zarr array (i.e. serialize this metadata to disk) that conforms to this new chunk manifest ZEP,
  4. Try to read this array in Python (requiring a modification to zarr-python to teach it how to read the manifest),
  5. Try to read this array in another language (e.g. using zarr-js, requiring a modification to zarr-js).
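
Step 2 might look roughly like this (hedged: kerchunk's exact reference layout varies by version, the manifest schema is the strawman above, and a single data variable is assumed):

```python
from kerchunk.hdf import SingleHdf5ToZarr

# Let kerchunk locate the chunk byte ranges inside an HDF5/netCDF4 file,
# then reshape its references into the manifest layout sketched above.
refs = SingleHdf5ToZarr("file1.nc", "file1.nc").translate()["refs"]
manifest = {
    key.split("/", 1)[1]: {"path": path, "offset": offset, "length": length}
    for key, value in refs.items()
    if isinstance(value, list)  # chunk refs look like [path, offset, length]
    for path, offset, length in [value]
}
```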

MVP: Read this test array from multiple languages

Milestone: Get the chunk manifest ZEP accepted into the Zarr Spec, and implemented in zarr-python

Feature 2: Virtual Concatenation inside Zarr stores

Idea: Formalize virtual concatenation at the Zarr level via another new zarr extension.
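
As a strawman for step 1 below (the extension name and keys are invented here; the real schema would come out of the ZEP), the metadata for a virtually-concatenated array might look like:

```python
# Hypothetical v3-style metadata for an array defined as the concatenation
# of two existing 10 x 10 arrays along axis 0. No chunk data is stored;
# reads are delegated to the constituent arrays.
concatenated_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [20, 10],
    "data_type": "float64",
    "virtual_concatenation": {               # invented extension name
        "axis": 0,
        "arrays": ["arrays/a", "arrays/b"],  # paths within the same store
    },
}
```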

Steps:

  1. Describe how to record the concatenation of multiple zarr arrays in zarr metadata,
  2. Create such a concatenated zarr array on disk manually,
  3. PR to zarr-python to read this concatenated array,
  4. This should automatically work with the chunk manifest arrays above, but test that too.

MVP: Read a Zarr array that was defined through concatenation

Milestone: Get the virtual concatenation ZEP accepted into the Zarr Spec, and implemented in zarr-python

Feature 3: VirtualZarrArray python object

Idea: Replace the overloaded kerchunk.combine.MultiZarrToZarr function with a virtual array type so that all combining of legacy file data can be expressed as array concatenations.
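
A toy sketch of the core idea, assuming a uniform chunk grid that evenly divides the array shape (all names here are hypothetical, not an existing API):

```python
import numpy as np

class VirtualZarrArray:
    """Holds only zarr metadata plus a chunk manifest; never loads chunk data."""

    def __init__(self, shape, dtype, chunks, manifest):
        self.shape = tuple(shape)
        self.dtype = np.dtype(dtype)
        self.chunks = tuple(chunks)
        self.manifest = manifest  # chunk key "i.j" -> (path, offset, length)

    def concat(self, other, axis=0):
        # Pure metadata manipulation: shift the other array's chunk indices
        # along `axis`, then merge the two manifests.
        assert self.dtype == other.dtype and self.chunks == other.chunks
        offset = self.shape[axis] // self.chunks[axis]  # chunk count on axis
        merged = dict(self.manifest)
        for key, ref in other.manifest.items():
            idx = [int(i) for i in key.split(".")]
            idx[axis] += offset
            merged[".".join(map(str, idx))] = ref
        new_shape = list(self.shape)
        new_shape[axis] += other.shape[axis]
        return VirtualZarrArray(new_shape, self.dtype, self.chunks, merged)
```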

Steps:

  1. Create a VirtualZarrArray object which contains only the zarr metadata, but can nevertheless be concatenated like a numpy array (similar to the KerchunkArray prototype),
  2. Add a serialization method to the array object that can write out valid Zarr on-disk.
  3. Create an instance of VirtualZarrArray which contains only NaNs, of any desired shape,
  4. Probably want to make a creation function like np.empty_like (see the sketch below),
  5. Make this concatenable with normal VirtualZarrArray objects, and serializable too.
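
Continuing the toy sketch above (names still hypothetical), steps 3 and 4 might reduce to a creation function that returns a virtual array with an empty manifest, so that every chunk is "missing" and reads back as NaN:

```python
# Hypothetical helper, modeled on np.empty_like: no chunk references at
# all, so a reader would materialize every chunk as the fill value (NaN).
def virtual_nan_array(shape, chunks, dtype="float64"):
    return VirtualZarrArray(shape, dtype, chunks, manifest={})
```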

MVP: Prototype VirtualZarrArray class that supports concatenation and serialization to Zarr on-disk

Milestone: Fully-developed VirtualZarrArray class that supports concatenation, indexing, NaNs, and serialization, which lives either in zarr-python or in a separate new package ("VirtualiZarr")

Feature 4: Xarray wrapping VirtualZarrArray objects

Idea: Make it easy to use xarray semantics (e.g. xr.concat or xr.open_mfdataset) to combine many legacy files into one Zarr store.
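
The target workflow might look like this (the engine name and accessor method are placeholders for things that do not exist yet; the steps below would create them):

```python
import xarray as xr

# Each file opens as metadata-only VirtualZarrArrays; xarray's existing
# combine machinery then concatenates them without reading any chunk data.
ds = xr.open_mfdataset(
    "archive/*.nc",
    engine="virtualizarr",  # hypothetical backend (step 1 below)
    combine="nested",
    concat_dim="time",
)
ds.virtualize.to_zarr("combined.zarr")  # hypothetical accessor (step 3 below)
```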

Steps:

  1. Create a small custom xarray backend for opening netCDF data in metadata-only form (i.e. as a VirtualZarrArray instead of as a numpy/dask array) - see the KerchunkArray notebook linked above.
  2. Open the on-disk data with this xarray backend, and ensure concatenation etc. works correctly (see possible issues),
  3. Write a special xarray accessor to serialize the resultant concatenated VirtualZarrArray to disk as a new valid zarr array.

MVP: Gist showing how to open legacy files as xarray-wrapped VirtualZarrArrays and concatenate them

Milestone: Provide the xarray backend and accessor along with documentation, living either in zarr-python or in a separate new package ("VirtualiZarr").

Impact

The end result of this would allow us to:

  1. Use xarray’s high-level API for concatenation / opening, so “kerchunking” a whole set of files becomes just one or two lines of familiar xarray code,
  2. "Kerchunk" tricky datasets more easily, such as those with variable-length chunks, staggered grids, or uneven dimension sizes that require padding,
  3. Serialize the new combined reference files as a valid Zarr store, without copying the legacy data files,
  4. Open the legacy data via this zarr store without using fsspec, and therefore without requiring Python (instead we could imagine opening data from a web browser via javascript, for example),
  5. Get free usage of other recent Zarr features, such as sharding and variable-length chunks.

Example datasets

  • PODAAC datasets: SWOT, MURSST
  • [C]Worthy's datasets: ROMS, CESM