## "Virtual" Zarr breakout A bridge to connecting archival formats to Zarr. ### Agenda: Zarr spec changes: - Defining a zarr-native interchange format for virtual chunk "manifests"? - Zarr extensions that would help for virtualizing things - Variable-length chunks - Strings? - What would "Virtual" Zarr need from a future variable length chunk extension? - Virtual concatenation? - Could virtual zarr stores just take the place of sharding so "regular" zarr implementations don't have to worry about it? Interaction with specific formats: - Zarr stores that are also TIFF/HDF files? - What archival formats are being used by this group today? - NetCDF, HDF, TIFF, GRIB - What have we learned from other similar efforts (COG, Kerchunk, etc.) that would also be worth adopting upstream? VirtualiZarr project: - Improving the VirtualiZarr python library - Rust-ifying virtualization tools? - Relationship to Kerchunk - Virtualizarr store in Icechunk in Cloud object store --- - Access control and security - Zarr v3 to Zarr v2 conversion - Zarr v3 shards to non-sharded Zarr - FUSE (user-space file system drivers) versus HTTP/S3 interfaces --- - Demo and benchmarking on using Icechunk refernce from virtulizarr - Efficient (in terms of space) location storage - What has been learned from the Icechunk experience that we would like to upstream? --- - Pencil and pancake chunking method and benchmarking using virtualizarr - Migrating CF xarray decoding logic to a dedicated codec library. 
- Mixed dtype array indirection
- Per-chunk metadata
- Partial decompression of chunks
- Views of subarrays in other Zarr arrays
- Views of part of a Zarr hierarchy
- Presenting Zarr as other file formats (HDF5, TIFF)

### Notes:

#### Background:

- Defining virtual references
  - Example of [JSON alias that Joe developed](https://github.com/zarr-developers/zarr-specs/issues/287) to reference a specific shard/chunk:

    ```json
    { "0.0.0": {"path": "s3://bucket/prefix/file.nc", "offset": 1242, "length": 100} }
    ```

  - Path, offset, and size are not enough:
    - S3: endpoint, credentials / requester_pays / anon, region, ...
    - Per-chunk transformer / codec
    - How do you communicate credential requirements in this case?
- Zarr extensions
  - How can archival formats be valid Zarr shards?
    - Zarr needs to be able to handle variable-length chunks to support any kind of archival data.
    - Some formats (like NASA has) already have multiple files. So it would be useful if each file mapped to a shard, but that doesn't quite work today.
    - Need to be able to define a codec that reads this data, and a way of publishing codecs so that people can access them when reading.
  - Virtual concatenation
    - Inconsistent chunks because of size and dtype differences
      - For example, we may have temperatures that are `float32` and `float64`. Want a way to promote them to a common type (with a codec?) and forward that to the reader.
      - Another option could be a native Zarr concept of concatenation, so you could have Zarr arrays with different dtypes in separate arrays that are then combined via the concatenation interface. (Ryan's original issue: https://github.com/zarr-developers/zarr-python/issues/2536)
    - J programming language - https://en.wikipedia.org/wiki/J_(programming_language)
      - How would we describe transformations in a way that doesn't involve a lot of repetition?
- Could virtual Zarr stores become shards in their own right?
  - Some details about chunks may be lost in this abstraction.
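The JSON chunk alias in the notes above boils down to a plain byte-range read. The sketch below shows the idea, assuming hypothetical names (`ChunkRef`, `resolve`) and using a local file as a stand-in for the S3 object; as noted, a real reader for `s3://` paths would also need endpoint, region, and credential information that the bare triple omits:

```python
import json
from dataclasses import dataclass

@dataclass
class ChunkRef:
    """One virtual chunk reference: where the (still-encoded) bytes live."""
    path: str
    offset: int
    length: int

def resolve(ref: ChunkRef) -> bytes:
    """Fetch the raw chunk bytes with a byte-range read."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)

# Stand-in for an archival file: 4 header bytes, then one 3-byte chunk.
with open("archive.bin", "wb") as f:
    f.write(b"HDR!" + b"abc" + b"trailing")

# A manifest in the same shape as the JSON alias example.
manifest = json.loads(
    '{"0.0.0": {"path": "archive.bin", "offset": 4, "length": 3}}'
)
refs = {key: ChunkRef(**entry) for key, entry in manifest.items()}
chunk = resolve(refs["0.0.0"])  # b"abc"
```

The decoded bytes would then still need to pass through the per-chunk codec pipeline, which is exactly the part the plain alias format has no way to express.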
- Unclear whether those details are needed, though
- External index
  - Would allow Zarr shards to be valid HDF5
  - Could be useful in reading for virtualized data
- Common use cases
  - How do I take file formats that were not optimized for cloud object storage and index them so they can be efficiently used in that environment?
  - Or how do I access files that _are_ cloud optimized (like COG) as if they are Zarr? In that case the parsing is very fast, so we can do it on the fly.
  - Virtual Zarr stores do two different things: one is the parsing, the other is grouping a bunch of files together into one big stack.
  - Virtual Zarr could also be useful as a way to take a v2 dataset and represent it as v3.
  - A common user request is to move from pancake to pencil sharding. However, virtual Zarr cannot change the underlying shards.
- Vortex format for tabular data - https://vortex.dev
- Temporarily persisting or caching intermediate results of a transform
- How would one leverage VirtualiZarr in web rendering of the underlying data?
  - VirtualiZarr takes the approach that the data is read-only and is described by a manifest
- How could we implement VirtualiZarr or Icechunk today with a StorageTransformer? What issues do we discover when we try, that we would like to fix?
  - Content-addressable storage transformer (v3 protocol extension) - https://github.com/zarr-developers/zarr-specs/issues/82
- How could we have a set of common codecs usable from other languages to read CF convention arrays?
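As a concrete example of the CF decoding logic such a codec library would carry: CF conventions pack physical float values into small integers via `scale_factor`, `add_offset`, and a fill value. A minimal sketch, assuming a hypothetical class name and interface (not an existing library API):

```python
import numpy as np

class CFScaleOffsetCodec:
    """Illustrative CF scale/offset codec: packed integers on disk,
    physical floats in memory (decoded = packed * scale_factor + add_offset)."""

    def __init__(self, scale_factor: float, add_offset: float, fill_value: int):
        self.scale_factor = scale_factor
        self.add_offset = add_offset
        self.fill_value = fill_value

    def decode(self, packed: np.ndarray) -> np.ndarray:
        out = packed.astype("float64") * self.scale_factor + self.add_offset
        # Fill values become NaN in the decoded, physical-unit array.
        return np.where(packed == self.fill_value, np.nan, out)

    def encode(self, values: np.ndarray) -> np.ndarray:
        packed = np.round((values - self.add_offset) / self.scale_factor)
        packed = np.where(np.isnan(values), self.fill_value, packed)
        return packed.astype("int16")

# Pack temperatures in kelvin into int16 with 0.01 K resolution.
codec = CFScaleOffsetCodec(scale_factor=0.01, add_offset=273.15, fill_value=-32768)
packed = codec.encode(np.array([273.15, 274.15, np.nan]))
decoded = codec.decode(packed)
```

Specifying this as a named Zarr codec (rather than xarray-internal decoding) is what would make the same arrays readable from other languages.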
- https://github.com/xarray-contrib/cf-codecs

## Spec-ification

### Use cases

- Lightweight persistent virtual stores
- Cross-language compatibility
- On-demand "synthetic" virtual stores
  - A view of the original data
- Upstreaming parts of Icechunk
  - Virtual chunk indirection
  - Last modified date
  - Version control
  - Transactions
  - Consolidated metadata store

### Candidate formats

- JSON
- Parquet
- Kerchunk
  - (JSON)
  - (Parquet)
- Icechunk (flatbuffers)
  - https://github.com/earth-mover/icechunk/blob/main/icechunk/flatbuffers/manifest.fbs
- Zarr arrays

### Things to discuss

- Chunk-level
  - Path
  - Offset and length (a.k.a. byte range)
  - Validation check
    - Last modified date or integrity(?) hash
      - Last modified date - a fast path on read to see if the chunk has changed since the last read
      - Integrity hash - if the chunk has changed, compute a hash to make sure the chunk matches, or error if not
    - Confirm data size match
  - Convention for empty chunks
- Array- or group-level
  - API endpoint to view prefixes (virtual chunk containers)
  - Authentication injection

### Relevant Broader Q's that we won't solve today

- Per-chunk metadata?
- Chunk-level summary statistics
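The chunk-level validation fields above (size match plus integrity hash) amount to a small check on every fetched chunk. A minimal sketch, with hypothetical names and SHA-256 chosen only for illustration:

```python
import hashlib

def verify_chunk(data: bytes, expected_length: int, expected_sha256: str) -> None:
    """Validate fetched chunk bytes against manifest metadata.

    Cheap size check first, then the integrity hash; either mismatch means
    the archival file changed underneath the virtual reference."""
    if len(data) != expected_length:
        raise ValueError(
            f"size mismatch: got {len(data)}, expected {expected_length}"
        )
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError("integrity hash mismatch: chunk changed since virtualization")

chunk = b"\x01\x02\x03"
verify_chunk(chunk, 3, hashlib.sha256(chunk).hexdigest())  # passes silently
```

The last-modified-date field would sit in front of this as the fast path: only if the date differs from the recorded one does the reader pay for hashing.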