# Making new Zarr stuff

* who: Davis (lead), Justus, Josh (notes), Vincent, Lachlan, The Toms, Alistair, Hugo, Emmanuel, Max

## Status quo

* Collecting feedback. What's missing? What's unknown?
  - Davis: concerned with maintaining compatibility of codecs between v2 and v3. Some of this is zarr-python specific. Numcodecs specifics as well, for example, leading to breakage between the two formats. See also imagecodecs, which uses the numcodecs API. What would a v3 version look like?
  - Tom W.: [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec/blob/main/vcf_zarr_spec.md) spec feedback. Several years old, targets v2. How does one represent it in v3? How do you migrate, while being aware of the codecs, etc.? Davis: there is a metadata-only transition, but it requires that the chunks are compatible. Super valuable. However, some of the v3 definitions of the codecs have changed. Tom W.: also need **strings**; the v3 spec is not finished.
  - Tom N.: Would be nice to link discussions to Arrow. Also unclear what a Zarr extension is. Is it just anything that's not in the spec? Davis: think of a new "X", where "X" is a data type defined in the spec. Tom N.: there are things that don't fit that definition, e.g. putting STAC files in the hierarchy. Davis: overloaded term. Adding is ok. Restricting the space of stuff (specializing/removing) is also valid.
  - Max: is the method for doing that set? Davis: zarr-extensions is the place to coordinate to make sure we don't have a lot of duplicated material. Claiming a name.
  - Hugo: VCF Zarr as an extension? Basically attributes.
  - The Toms: support for the term "convention"
  - Alistair: see Dublin Core application profiles
  - Davis: `consolidated_metadata` as the only hard-baked extension
  - Recommend using a top-level key
    - Max: question: does everything need to be nested under that key?
      - E.g., `{cf-conventions: 1.2, standard_name: longitude, ...}` vs `cf-conventions: {version: 1.2, standard_name: "longitude"}`
    - Josh: register one mandatory key; doesn't matter if it's flat
    - Lachy: if linking out to specs, should recommend top-level keys
    - Davis: could still have conflicts with hierarchy structures
  - Davis:
    - Is there value in registering keys?
    - Should we instead just have a catalog of conventions? (Josh: or additionally)
      - for example, a specification of how xarray writes things down
      - can't guarantee the composability of conventions
    - useful to have a specialization document? Wrote pydantic & docs while at Janelia
  - Max: doesn't have to be a document. A tool for iterating on a UML document? Davis: some rules that people want to express are difficult to express there, but :+1: for making it as easy as possible.
  - Vincent: telling users how to make the data or how tools should read the data
  - Recommendations: :+1:
    - Add a top-level key (a hedged attribute example follows the string spec notes below)

* String spec
  - Lachy: zarr-python has adopted v2 strings - can't do partial decoding.
  - Jeremy has a nice specification for using the codec infrastructure in v3 for encoding an index. Good to standardize.
    - Should be recommended for all v3 strings
  - Davis: what needs to be done? **New codec**. Is it written up? "VLen" is implemented in zarrs.
  - Lachy: want to write up a spec for that
  - Tom W.: inspired by Arrow? Somewhat.
  - Lachy: the serialized format is very similar to Arrow's. Omits a few things.
  - Tom W.: :+1: on a default way of encoding strings. Interleaved data might confuse people. Avoid? At least not the default.
  - Max: minimum number of implementations before we **recommend** something
  - Lachy: 2? Several things are implemented that can't be visualized.
  - Josh: could document things (e.g., datetime, etc.) in zarr-extensions that are **not** recommended
  - Lachy: concerned that once people have been creating things for a long time, it has to be supported - dangerous to even implement before it is discussed.
  - Davis: goal was parity with all data, leaving a lot of specs to write (then can remove warnings)
  - Lachy: codecs are more important than data types. More of a hassle to support. Good to discuss across implementations.
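
To make the string write-up concrete, here is a minimal sketch of what v3 array metadata for a variable-length string array could look like. The `"string"` data type and `"vlen-utf8"` codec name mirror what zarr-python currently emits; neither is standardized yet, so the names and layout below are assumptions rather than the proposed spec.

```python
# Minimal sketch of v3 array metadata for a variable-length string array.
# The "string" data type and "vlen-utf8" codec name follow what zarr-python
# currently writes; neither is finalized in the v3 spec, so treat them as
# placeholders for whatever the VLen/string spec ends up standardizing.
string_array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [1_000_000],
    "data_type": "string",  # extension data type, not yet in the core spec
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100_000]}},
    "chunk_key_encoding": {"name": "default"},
    "fill_value": "",
    "codecs": [
        {"name": "vlen-utf8"},  # assumed array -> bytes codec: offsets index + UTF-8 data
        {"name": "zstd", "configuration": {"level": 3, "checksum": False}},
    ],
    "attributes": {},
}
```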
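
Circling back to the top-level-key recommendation under Status quo: a hedged sketch of the two attribute layouts that were compared, using the CF-style fields from the inline example. The `cf-conventions` key name and its contents are illustrative only; the real key would be whatever gets registered in zarr-extensions.

```python
# The two attribute layouts discussed for conventions, using the CF-style
# fields from the notes. "cf-conventions" is illustrative only; the real key
# would be whatever name gets registered in zarr-extensions.

# Flat: one marker key, with the convention's fields sitting alongside any
# other attributes.
flat_attributes = {
    "cf-conventions": "1.2",
    "standard_name": "longitude",
}

# Nested: everything the convention defines lives under one registered
# top-level key.
nested_attributes = {
    "cf-conventions": {
        "version": "1.2",
        "standard_name": "longitude",
    }
}
```

Registering a single mandatory key is what both layouts share; whether the remaining fields are flat or nested was left open (Josh's point above).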
* Missing codecs?
  - Lachy: the sharding codec requires evenly dividing chunks, so it is not compatible with variable chunking. Need to define how we encode the trailing inner chunks (a hedged config sketch appears at the end of these notes). Resolve that in a day? Davis: but with what kind of compatibility? Existing implementations should error. Version bump on the codec, but not a major breaking change.
  - How many implementations of sharding? 5+
  - Max: renaming? Lachy: very painful.
  - Lachy: just lifting a constraint.
  - Davis: we should test them all
  - Josh: see the **zarr_implementations** Docker setup for running tests across multiple implementations
  - Davis: CLI, adding more things. More uses. Combine with zarrs.

* Array storage transformers
  - Lachy: technically full support; have some test ones.
  - What is it?
    - an idea that preceded the codec space
    - array transformers were meant to be an aliasing mechanism for keys
    - e.g., content-addressable storage (e.g., IPFS)
    - overloading the Zarr store API is a smell, ideally addressed through array transforms
    - something is missing from the metadata space for indirections
    - how much can you bootstrap externally?
    - ZEP8 is about getting to a zarr.json document; array storage transformers are about getting from the JSON document to chunks
    - should apply to all implementations
  - What should we do about the current state of array storage transformers?
    - technically it would be breaking
    - most people would not know it exists; zarr-python raises an exception
    - ideal would be to make it useful; second option would be to deprecate it to make it less complicated for implementers
  - Would sub-arrays help?
    - sub-arrays break the spirit of the law
  - Should we formalize Icechunk and/or create a simpler serialization solution?
    - Zarr should be simple
    - simple is beneficial for delineating Zarr from TileDB
    - simpler would help more people adopt indirection
  - Josh: exchange format like `git bundle`? (Tom N.: serialization format for Manifest)
  - Davis: O(n) on chunk-specific metadata. Many things that people want to track per chunk.
  - Tom N.: 3 ways (ref: https://github.com/zarr-developers/VirtualiZarr/issues/238#issuecomment-2585549691)
    - Analytic function
    - Semi-structured lookup table (a hedged sketch appears at the end of these notes)
    - Unstructured lookup table
  - Josh: JSON chunks
  - Lachy: Jeremy had something for file-to-chunk mapping
  - Davis: do some of the redirection at the sharding level?
  - Josh: have a flag in the storage transformer for disallowing writing?
  - Lachy: generic extension of an "immutable flag"

* What is implemented but not written up:
  - tensorstore: vlen
  - xarray:
    - "_FillValue"
    - "Coordinates"
    - Schema restrictions (e.g. dimensions in the same group must have the same sizes)
    - Recently updated the [Zarr Encoding Specification](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html). That might help to align.
  - numpy datetime data types?

----

* Topic list
  - [x] attributes and conventions
  - [ ] codec compatibility and stability
  - [ ] metadata-only migration
  - [x] strings specification
  - [ ] arrow (Tom/Ryan)
  - [ ] missing values
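
----

Appendix sketch for the sharding discussion: the `sharding_indexed` configuration as it stands, with the evenly-dividing constraint called out. Shapes and codec choices below are arbitrary illustrations, not a proposal.

```python
# The sharding_indexed codec configuration as defined today. The inner
# chunk_shape currently has to evenly divide the shard (outer chunk) shape;
# lifting that constraint means specifying how trailing, partially filled
# inner chunks are encoded. Shapes and codec choices are arbitrary examples.
shard_codec = {
    "name": "sharding_indexed",
    "configuration": {
        "chunk_shape": [32, 32],  # inner chunks within each shard
        "codecs": [  # codec chain applied to each inner chunk
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "zstd", "configuration": {"level": 3, "checksum": False}},
        ],
        "index_codecs": [  # codec chain for the shard's offset/length index
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "crc32c"},
        ],
        "index_location": "end",
    },
}

# A shard shape of [100, 100] is not a multiple of [32, 32], so the trailing
# inner chunks along each edge are currently undefined; the idea above is to
# handle this with a codec version bump that existing implementations reject,
# rather than a major breaking change.
```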
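
Appendix sketch for the "semi-structured lookup table" option in the storage-transformer discussion: chunk keys resolving to (uri, offset, length) triples, in the spirit of kerchunk/VirtualiZarr-style references. Key names and layout are assumptions for illustration, not a proposed serialization format; it also makes the O(n) per-chunk-metadata concern visible.

```python
# Hypothetical semi-structured lookup table: each chunk key resolves to a
# (uri, offset, length) triple in an external file, in the spirit of
# kerchunk/VirtualiZarr-style chunk references. The layout and key names are
# illustrative only, not a proposed serialization format. Note that the table
# grows O(n) with the number of chunks, which is the per-chunk-metadata
# concern raised above.
chunk_manifest = {
    "temperature/c/0/0": {"uri": "s3://bucket/file1.nc", "offset": 1024, "length": 4096},
    "temperature/c/0/1": {"uri": "s3://bucket/file1.nc", "offset": 5120, "length": 4096},
    "temperature/c/1/0": {"uri": "s3://bucket/file2.nc", "offset": 1024, "length": 4096},
}


def byte_range(chunk_key: str) -> tuple[str, int, int]:
    """Return (uri, start, stop) for a chunk key; a reader would issue a ranged GET."""
    entry = chunk_manifest[chunk_key]
    return entry["uri"], entry["offset"], entry["offset"] + entry["length"]
```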