## "Virtual" Zarr breakout
A bridge for connecting archival formats to Zarr.
### Agenda:
Zarr spec changes:
- Defining a zarr-native interchange format for virtual chunk "manifests"?
- Zarr extensions that would help for virtualizing things
- Variable-length chunks
- Strings?
- What would "Virtual" Zarr need from a future variable length chunk extension?
- Virtual concatenation?
- Could virtual zarr stores just take the place of sharding so "regular" zarr implementations don't have to worry about it?
Interaction with specific formats:
- Zarr stores that are also TIFF/HDF files?
- What archival formats are being used by this group today?
- NetCDF, HDF, TIFF, GRIB
- What have we learned from other similar efforts (COG, Kerchunk, etc.) that would also be worth adopting upstream?
VirtualiZarr project:
- Improving the VirtualiZarr python library
- Rust-ifying virtualization tools?
- Relationship to Kerchunk
- VirtualiZarr store in Icechunk in a cloud object store
---
- Access control and security
- Zarr v3 to Zarr v2 conversion
- Zarr v3 shards to non-sharded Zarr
- FUSE (user-space file system drivers) versus HTTP/S3 interfaces
---
- Demo and benchmarking of using Icechunk references from VirtualiZarr
- Space-efficient storage of chunk locations
- What has been learned from the Icechunk experience that we would like to upstream?
---
- Pencil and pancake chunking methods and benchmarking using VirtualiZarr
- Migrating CF xarray decoding logic to a dedicated codec library.
- Mixed dtype array indirection
- Per-chunk metadata
- Partial decompression of chunks
- Views of subarrays in other Zarr arrays
- Views of part of a Zarr hierarchy
- Presenting Zarr as other file formats (HDF5, TIFF)
### Notes:
#### Background:
- Defining virtual references
- Example of [JSON alias that Joe developed](https://github.com/zarr-developers/zarr-specs/issues/287) to reference a specific shard/chunk:
```json
{
"0.0.0": {"path": "s3://bucket/prefix/file.nc", "offset": 1242, "length": 100}
}
```
- Path, offset, and length are not enough:
- s3: endpoint, credentials / requester_pays / anon, region, ...
- per-chunk transformer / codec
- How do you communicate credential requirements in this case?
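As a minimal sketch of what resolving one such manifest entry amounts to (assuming a local-file path rather than `s3://`, and using the field names from the JSON example above), fetching a virtual chunk is just a ranged read:

```python
import os
import tempfile

# Create a small demo file so the ranged read below has real bytes to hit.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"....ABCDEFG....")

# One manifest entry in the style of the JSON alias above
# (chunk key -> {path, offset, length}); field names follow that example.
manifest = {"0.0.0": {"path": path, "offset": 4, "length": 7}}

def read_virtual_chunk(entry):
    """Fetch the raw bytes of one virtual chunk via a ranged read."""
    with open(entry["path"], "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])

chunk = read_virtual_chunk(manifest["0.0.0"])  # b"ABCDEFG"
```

A real resolver would also need the store-level configuration discussed above (endpoint, credentials, region, per-chunk codec), which is exactly what path/offset/length alone cannot express.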
- Zarr Extensions
- How can archival formats be valid Zarr shards?
- Zarr needs to be able to handle variable-length chunks to support any kind of archival data.
- Some formats (like NASA's) already span multiple files, so it would be useful if each file mapped to a shard, but that doesn't quite work today.
- Need to be able to define a Codec that reads this data and a way of publishing codecs so that people can access them when reading.
- Virtual concatenation
- Inconsistent chunks because of size and dtype differences
- For example, one file may have temperatures as `float32` and another as `float64`. We want a way to promote them to a common type (with a codec?) and forward that to the reader.
- Another option could be a native zarr concept of concatenation so you could have zarr arrays with different dtypes in separate arrays that are then combined via the concatenation interface. (Ryan's original issue https://github.com/zarr-developers/zarr-python/issues/2536)
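The promotion idea above can be sketched with NumPy (illustrative only, not VirtualiZarr's API): compute the common dtype across the arrays being concatenated and cast each one on read.

```python
import numpy as np

# Two "files" with mismatched temperature dtypes, as in the example above.
a = np.array([1.5, 2.5], dtype=np.float32)
b = np.array([3.5, 4.5], dtype=np.float64)

# Promote to a common dtype, then concatenate; a codec could apply the
# same astype() per chunk at read time instead of materializing everything.
common = np.promote_types(a.dtype, b.dtype)  # float64
combined = np.concatenate([a.astype(common), b.astype(common)])
```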
- The J programming language
- https://en.wikipedia.org/wiki/J_(programming_language)
- How would we describe transformations in a way that doesn't involve a lot of repetition?
- Could Virtual Zarr stores become shards in their own right?
- Some details about chunks may be lost in this abstraction. Unclear whether those details are needed though
- External index
- Would allow Zarr shards to be valid HDF5 files
- Could be useful in reading for virtualized data
- Common use cases
- How do I take file formats that were not optimized for cloud object storage and index them so they can be efficiently used in that environment
- Or how do I access files that _are_ cloud optimized (like COG) as if they are Zarr? In that case you can do the parsing very quickly so we can do it on the fly.
- Virtual Zarr stores do two different things: one is parsing individual files, the other is grouping a bunch of files together into one big stack.
- Virtual Zarr could also be useful as a way to take a Zarr v2 dataset and represent it as v3
- A common user request is to move from pancake to pencil sharding. However, Virtual Zarr cannot change the underlying shards.
- The Vortex format for tabular data
- https://vortex.dev
- Temporarily persisting or caching intermediate results of a transform
- How would one leverage VirtualiZarr in web rendering of underlying data?
- VirtualiZarr takes the approach that the data is read-only and is described by a manifest
- How could we implement VirtualiZarr or Icechunk today with a StorageTransformer? What issues would we discover when we try that we would like to fix?
- Content-addressable storage transformer (v3 protocol extension)
- https://github.com/zarr-developers/zarr-specs/issues/82
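A toy sketch of the content-addressable idea (an in-memory dict standing in for a store; hypothetical helper names, not the proposed v3 protocol): chunks are stored under the hash of their bytes, and the key-to-hash mapping becomes the indirection layer, so identical chunks are deduplicated for free.

```python
import hashlib

store = {}  # content hash -> chunk bytes
index = {}  # chunk key -> content hash

def put(key, data):
    """Store a chunk under its own sha256; record the key -> hash mapping."""
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data  # identical chunks collapse to one entry
    index[key] = digest

def get(key):
    """Resolve a chunk key through the indirection layer."""
    return store[index[key]]

put("0.0", b"chunk-bytes")
put("0.1", b"chunk-bytes")  # same content, stored only once
```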
- How could we have a set of common codecs usable from other languages to read CF convention arrays?
- https://github.com/xarray-contrib/cf-codecs
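A minimal sketch of the kind of CF decoding such a shared codec library would standardize (`scale_factor`/`add_offset` unpacking per the CF conventions; the function and its parameters here are illustrative, not any specific library's API):

```python
import numpy as np

def cf_decode(packed, scale_factor, add_offset, fill_value=None):
    """Unpack CF-packed data: decoded = packed * scale_factor + add_offset,
    with fill values mapped to NaN."""
    out = packed.astype(np.float64) * scale_factor + add_offset
    if fill_value is not None:
        out = np.where(packed == fill_value, np.nan, out)
    return out

# int16 temperatures packed with CF attributes; -32768 marks missing data.
packed = np.array([0, 100, -32768], dtype=np.int16)
decoded = cf_decode(packed, scale_factor=0.01, add_offset=273.15,
                    fill_value=-32768)
```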
## Spec-ification
### Use cases
- Lightweight persistent virtual stores
- Cross-language compatibility
- On-demand "synthetic" virtual stores
- view of the original data
- Upstreaming parts of Icechunk
- Virtual chunk indirection
- Last modified date
- Version control
- Transactions
- Consolidated metadata store
### Candidate formats
- JSON
- Parquet
- Kerchunk
- (JSON)
- (Parquet)
- Icechunk (flatbuffers)
- https://github.com/earth-mover/icechunk/blob/main/icechunk/flatbuffers/manifest.fbs
- Zarr arrays
### Things to discuss
- Chunk-level
- Path
- Offset and length (a.k.a. byte range)
- Validation check
- Last modified date or integrity(?) hash
- Last modified date - a fast path on read to see if the chunk has changed since last read
- Integrity hash - if the chunk has changed, compute hash to make sure the chunk matches or error if not
- Confirm the data size matches
- Convention for empty chunk
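The two-step validation above can be sketched as follows (hypothetical helper; last-modified time as the fast path, a sha256 integrity hash as the slow path):

```python
import hashlib
import os
import tempfile

def validate_chunk(path, expected_mtime, expected_sha256):
    """Fast path: unchanged mtime -> assume valid. Slow path: verify hash,
    erroring if the chunk no longer matches the manifest."""
    if os.path.getmtime(path) == expected_mtime:
        return True
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected_sha256:
        raise ValueError("chunk changed since the manifest was written")
    return True

# Record mtime and hash at "virtualization" time.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"chunk")
mtime = os.path.getmtime(path)
digest = hashlib.sha256(b"chunk").hexdigest()

ok = validate_chunk(path, mtime, digest)  # fast path hits
```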
- Array or group level
- API endpoint to view prefixes (virtual chunk containers)
- Authentication injection
### Relevant Broader Q's that we won't solve today
- Per-chunk metadata?
- Chunk-level summary statistics