# making new Zarr stuff
* who: Davis (lead), Justus, Josh (notes), Vincent, Lachlan, The Toms, Alistair, Hugo, Emmanuel, Max
## Status quo
* Collecting feedback. What's missing? What's unknown?
- Davis: concerned with maintaining codec compatibility between v2 and v3. Some of this is zarr-python specific; there are numcodecs specifics as well, leading to breakage between the two formats. See also imagecodecs, which uses the numcodecs API. What would a v3 version look like?
- Tom W.: [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec/blob/main/vcf_zarr_spec.md) spec feedback. The spec is several years old and targets v2. How does one represent it in v3? How do you migrate, while staying aware of the codecs, etc.? Davis: there is a metadata-only transition, but it requires that the chunks are compatible (rough sketch after this list). Super valuable. However, some of the v3 codec definitions have changed. Tom W.: also need **strings**; the v3 spec is not finished
- Tom N.: Would be nice to link discussions to Arrow. Also unclear what a Zarr extension is. Is it just anything that's not in the spec? Davis: think of a new "X", where "X" is a data type defined in the spec. Tom N.: there are things that don't fit that definition, e.g. putting STAC files in the hierarchy. Davis: overloaded term. Adding is ok. Restricting the space of stuff (specializing/removing) is also valid.
- Max: is the process for doing that settled? Davis: zarr-extensions is the place to coordinate, to make sure we don't end up with a lot of duplicated material. It's about claiming a name.
- Hugo: VCF Zarr as an extension? Basically attributes.
- The Toms: support for the term "convention"
- Alistair: see Dublin Core application profiles
- Davis: consolidated_metadata as the only hard-baked extension
- Recommend using a top-level key
- Max: question: does everything need to be nested under that key?
- E.g. `{cf-conventions: 1.2, standard_name: longitude, ...}` vs `cf-conventions: {version: 1.2, standard_name: "longitude"}` (see the sketch after this list)
- Josh: register one mandatory key; doesn't matter if it's flat
- Lachy: if linking out to specs, should recommend top-level keys
- Davis: could still have conflicts with hierarchy structures
- Davis:
- Is there value in registering keys?
- Should we instead just have a catalog of conventions? (Josh: or additionally)
- for example, a specification of how xarray writes things down
- can't guarantee the composability of conventions
- useful to have a specialization document? (wrote pydantic & docs while at Janelia)
- Max: doesn't have to be a document. A tool for iterating on a UML document? Davis: some rules that people want to express are difficult to express there, but :+1: for making it as easy as possible.
- Vincent: telling users how to make the data or how tools should read the data
- Recommendations: :+1:
- Add a top-level key
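
To make the recommendation concrete, a minimal sketch of the two attribute layouts from the example above; the `cf-conventions` key and attribute names are illustrative, not a registered convention.

```python
# Flat layout: the convention key only claims a version; related attributes
# sit next to it and can collide with other conventions.
flat_attrs = {
    "cf-conventions": "1.2",
    "standard_name": "longitude",
}

# Nested layout (the recommendation): everything the convention defines lives
# under the single registered top-level key, so only that name must be unique.
nested_attrs = {
    "cf-conventions": {
        "version": "1.2",
        "standard_name": "longitude",
    },
}

# A reader then only has to look for the registered key.
convention = nested_attrs.get("cf-conventions")
```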
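Rough sketch of the metadata-only v2-to-v3 transition mentioned above, for a single array. Field names follow the v2 `.zarray` and v3 `zarr.json` documents as best understood here; the dtype map is deliberately tiny, and the codec translation (the part whose definitions changed) is left as a placeholder.

```python
import json


def v2_to_v3_array_metadata(zarray, attrs=None):
    """Sketch of rewriting one v2 .zarray document as a v3 zarr.json document.

    Chunks are untouched on disk, so the chunk shape and key layout must stay
    identical. Translating compressor/filters into the v3 "codecs" list is the
    hard part flagged in the discussion and is not attempted here.
    """
    dtype_map = {"<f8": "float64", "<f4": "float32", "<i4": "int32", "|u1": "uint8"}
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        "data_type": dtype_map[zarray["dtype"]],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        # Keep v2-style chunk keys (e.g. "0.0") so existing chunk objects still resolve.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        "fill_value": zarray["fill_value"],
        "codecs": [],  # TODO: translate compressor/filters -> v3 codec chain
        "attributes": attrs or {},
    }


print(json.dumps(v2_to_v3_array_metadata({
    "zarr_format": 2, "shape": [10000, 10000], "chunks": [1000, 1000],
    "dtype": "<f8", "fill_value": 0.0, "order": "C",
    "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
    "filters": None,
}), indent=2))
```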
* String spec
- Lachy: zarr-python has adopted v2 strings
- can't do partial decoding.
- Jeremy has a nice specification for using the codec infrastructure in v3 to encode an index. Good to standardize (rough illustration after this list)
- Should be recommended for all v3 strings
- Davis: what needs to be done? A **new codec**. Is it written up? A "vlen" codec is implemented in zarrs
- Lachy: want to write up a spec for that
- Tom W.: inspired by Arrow? Somewhat
- Lachy: the serialized format is very similar to Arrow's; it omits a few things.
- Tom W.: :+1: on a default way of encoding strings. Interleaved data might confuse people. Avoid it? At least not as the default.
- Max: minimum number of implementations before we **recommend** something
- Lachy: 2? Several things have been implemented that can't be visualized.
- Josh: could document things (e.g., datetime, etc.) in zarr-extensions that are **not** recommended
- Lachy: concerned that people will have been creating data with these things for a long time, so they end up having to be supported
- dangerous to even implement something before it has been discussed.
- Davis: the goal was parity with all data, which leaves a lot of specs to write (then the warnings can be removed)
- Lachy: codecs are more important than data types. More of a hassle to support. Good to discuss across implementations.
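
The write-up is still to come, so the following is only a rough illustration of the Arrow-style shape of the idea (an offsets index plus concatenated UTF-8 bytes), not the proposed codec layout.

```python
import numpy as np


def encode_vlen_utf8(strings):
    """Toy encoding: monotonically increasing byte offsets + concatenated UTF-8 data.
    An index like this is what makes partial decoding possible."""
    offsets = np.zeros(len(strings) + 1, dtype=np.uint64)
    for i, s in enumerate(strings):
        offsets[i + 1] = offsets[i] + len(s.encode("utf-8"))
    data = b"".join(s.encode("utf-8") for s in strings)
    return offsets, data


def decode_vlen_utf8(offsets, data):
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


offsets, data = encode_vlen_utf8(["foo", "", "zarr", "ß"])
assert decode_vlen_utf8(offsets, data) == ["foo", "", "zarr", "ß"]
```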
* Missing codecs?
- Lachy: the sharding codec requires inner chunks to evenly divide the shard, so it's not compatible with variable chunking. Need to define how the trailing inner chunks are encoded (toy illustration after this list). Could we resolve that in a day? Davis: but with what kind of compatibility? Existing implementations should error. A version bump on the codec, but not a major breaking change.
- How many implementations of sharding? 5+
- Max: renaming? Lachy: Very painful.
- Lachy: just lifting a constraint.
- Davis: we should test them all
- Josh: see the **zarr_implementations** Docker setup for running against multiple implementations
- Davis: CLI, adding more things. More uses. Combine with zarrs.
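
Toy illustration (made-up numbers) of the trailing-inner-chunk case that lifting the constraint would have to define:

```python
import math

shard_shape = (100,)        # hypothetical shard produced by a variable chunk
inner_chunk_shape = (32,)   # does not evenly divide the shard

inner_chunks_per_shard = tuple(
    math.ceil(s / c) for s, c in zip(shard_shape, inner_chunk_shape)
)
trailing_inner_chunk = tuple(
    s - (n - 1) * c
    for s, c, n in zip(shard_shape, inner_chunk_shape, inner_chunks_per_shard)
)

print(inner_chunks_per_shard)  # (4,)
print(trailing_inner_chunk)    # (4,) elements: the partial inner chunk that the
                               # current sharding codec spec does not allow
```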
* Array storage transformers
- Lachy: technically full support for it. Have some test ones.
- what is it?
- an idea that preceded the codec space
- array transformers were meant to be an aliasing mechanism for keys
- e.g., content addressable storage (e.g., IPFS)
- overloading the Zarr store API is a smell that would ideally be addressed through array transformers
- something is missing from the metadata space for indirections
- how much can you bootstrap externally?
- ZEP8 is about getting to a `zarr.json` document; array storage transformers are about getting from the JSON document to the chunks
- should apply to all implementations
- What should we do about the current state of array storage transformers?
- technically would be breaking
- most people would not know it exists; zarr-python raises an exception
- ideal would be to make it useful, second option would be to deprecate it to make it less complicated for implementers
- would sub-arrays help?
- sub-arrays break the spirit of the law
- Should we formalize icechunk and/or create a simpler serialization solution?
- Zarr should be simple
- Simple is beneficial for delineating Zarr from TileDB
- Simpler would help more people adopt indirection
- Josh: Exchange format like `git bundle`? (Tom N.: a serialization format for the Manifest)
- Davis: O(n) on chunk-specific metadata. Many things that people want to track per chunk
- Tom N.: 3 ways (toy sketch after this list; ref: https://github.com/zarr-developers/VirtualiZarr/issues/238#issuecomment-2585549691)
- Analytic function
- Semi-structured lookup table
- Unstructured lookup table
- Josh: JSON chunks
- Lachy: Jeremy had something for file-to-chunk mapping
- Davis: do some of the redirection at the sharding level?
- Josh: have a flag in the storage transformer for disallowing writing?
- Lachy: generic extension of an "immutable flag"
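
A toy sketch of the indirection under discussion, using the "unstructured lookup table" flavour from the list above. This is not the icechunk or VirtualiZarr format; `fetch_range` stands in for whatever byte-range reader an implementation already has, and the URIs/offsets are made up.

```python
# Per-chunk manifest: chunk key -> byte range in some other object.
# Note it is O(number of chunks), which is the per-chunk-metadata concern above.
manifest = {
    "c/0/0": {"uri": "s3://bucket/legacy_file.nc", "offset": 4096, "length": 131072},
    "c/0/1": {"uri": "s3://bucket/legacy_file.nc", "offset": 135168, "length": 131072},
}


def resolve(chunk_key, fetch_range):
    """Resolve a chunk through the manifest instead of the usual key -> object
    mapping; this is the kind of aliasing an array storage transformer would
    formalize in metadata rather than by overloading the store API."""
    entry = manifest[chunk_key]
    return fetch_range(entry["uri"], entry["offset"], entry["length"])


# In-memory stand-in for the byte-range reader:
blob = bytes(262144)
chunk_bytes = resolve("c/0/0", lambda uri, offset, length: blob[offset:offset + length])
assert len(chunk_bytes) == 131072
```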
* What is implemented but not written up:
- tensorstore: vlen
- xarray:
- "_FillValue"
- "Coordinates"
- Schema restrictions (e.g. dimensions in the same group must have the same sizes)
- Recently updated the [Zarr Encoding Specification](https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html); that might help implementations align (minimal example below)
- numpy datetime data types?
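
For context, a minimal example of the xarray behaviour referenced above (requires xarray and zarr; the exact on-disk attributes depend on the Zarr format version, so treat this as illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temp": (("time", "x"), np.full((3, 4), np.nan))},
    coords={"x": ("x", np.arange(4.0))},
)
ds.to_zarr("example.zarr", mode="w")

# With the v2 format, each stored array carries an "_ARRAY_DIMENSIONS"
# attribute naming its dimensions (this is what the linked Zarr Encoding
# Specification documents), and a "_FillValue" attribute, when present, is
# expected to be masked on read.
```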
----
* Topic list
- [x] attributes and conventions
- [ ] codec compatibility and stability
- [ ] metadata-only migration
- [x] strings specification
- [ ] arrow (Tom/Ryan)
- [ ] missing values