See [Second email to ZSC](#Second-email-to-ZSC)
# First email to ZSC
tl;dr -- perhaps versioning the namespace separately from the format will help.
If we add a few limitations to JSON-LD, it would offer one possible (and well-specified) solution for name resolution. I was hoping to come up with some working code that I could share with you, but ran out of time. Since I don't know when I'll be back at work in earnest, here's a quick email with some examples.
First, a quick reminder of how JSON-LD works:
- Each JSON-LD file *can* have a "@context" key which defines how name resolution works. The context can either load a remote context file or define namespaces and terms inline.
- All keys within the JSON-LD document are then resolved against this context. *EVERYTHING* is a URI, but the same term can alternatively be written as "https://example.com/field", "example:field", or, for the default namespace, simply "field".
So this is a little like the opposite of Ryan's goal of not having *any* URIs, but it gives us the carrot of letting people register terms in the context to get shortened URIs.
The limitations I think we would need to apply, at least initially, are:
- we control the context and don't let people define it in their zarr.json files.
- once a term is put under a given URI (e.g., "http://earthmover.com/iceChunk") it will need to stay there.
- we don't break the context (unless we have a ZEP that allows people to set the @context)
Ok. Some examples:
The situation we're in now is essentially that there is an implicit context of `zarr.dev/v3` (used for brevity in this email). All the terms in the v3 spec are within that namespace.
## 1
(implicit; not in file)
```
"@context": {
    "@vocab": "zarr.dev/v3/",  # default
    "codecs": "zarr.dev/v3/codecs",
    "bool": "zarr.dev/v3/bool",
    ...etc...
}
```
and the document itself:
```
"codecs": [{"configuration": {"endian": "little"}, "name": "bytes"}]
```
## 2
If someone wants to add a field to their zarr.json that is not in the context, then they need to use the full URI:
"codecs": [...],
"https://example.com/consolidateMetadata": ...
## 3
If someone wants a prefix it has to be "added" to the @context:
(implicit; not in file)
```
"@context": {
    "@vocab": "zarr.dev/v3/",  # default
    "example": "example.com/",
    ...
}
```
and then the document looks like this:
```
"codecs": [...],
"example:consolidatedMetadata": ...
```
## 4
and to drop the prefix completely, the term needs to be registered:
(implicit; not in file)
```
"@context": {
    "@vocab": "zarr.dev/v3/",  # default
    "example": "example.com/",
    "consolidatedMetadata": "example.com/consolidatedMetadata",
    ...
}
```
with the document simplified to:
```
"codecs": [...],
"consolidatedMetadata": ...
```
Depending on how much of the parsing algorithm is implemented, code that previously looked for `example.com/consolidatedMetadata` *should* keep working.
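To make the resolution rules in examples 1-4 concrete, here is a minimal sketch of the expansion algorithm. The context dict and the scheme-stripping normalization are illustrative assumptions for this email, not part of any spec or the real JSON-LD processing algorithm:

```python
# Hypothetical context combining examples 1-4 (an assumption, not the real
# zarr.dev context). Values are written scheme-less, as in the examples above.
CONTEXT = {
    "@vocab": "zarr.dev/v3/",                                    # default namespace
    "example": "example.com/",                                   # registered prefix
    "consolidatedMetadata": "example.com/consolidatedMetadata",  # registered term
}

def expand(key: str) -> str:
    """Expand a zarr.json key to a full (scheme-less) URI, JSON-LD style."""
    for scheme in ("https://", "http://"):
        if key.startswith(scheme):
            return key[len(scheme):]        # already a URI; normalize the scheme away
    if key in CONTEXT and not key.startswith("@"):
        return CONTEXT[key]                 # term registered in the context
    prefix, sep, local = key.partition(":")
    if sep and prefix in CONTEXT:
        return CONTEXT[prefix] + local      # prefixed name
    return CONTEXT["@vocab"] + key          # fall back to the default vocabulary

# All three spellings from examples 2-4 resolve to the same URI:
assert expand("https://example.com/consolidatedMetadata") == "example.com/consolidatedMetadata"
assert expand("example:consolidatedMetadata") == "example.com/consolidatedMetadata"
assert expand("consolidatedMetadata") == "example.com/consolidatedMetadata"
# Unregistered keys land in the default namespace, as in example 1:
assert expand("codecs") == "zarr.dev/v3/codecs"
```

This is why code keyed on the expanded URI keeps working as a term moves through the registration steps: only the spelling in the file changes, not the URI it resolves to.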
## 5
If we wanted to add support for @context, this could be a mechanism for versioning, separate from bumping the v3 version as Norman was discussing.
(explicit; in file)
```
"@context": {
    "@vocab": "zarr.dev/v3/UPDATE"
}
```
Since "@context" would be a new field without "must_understand", clients should fail if they haven't been updated to understand it.
This would of course require a ZEP, etc. etc. but if we can convince ourselves that the parsing of 1-4 could be retrofitted into the existing v3, it might give us a path forward.
# Discussing with Norman (10th Jan)
- TODO: Code examples for codecs and data types
* stage 0: {"name": "numcodecs.pcodec", "configuration": {...}}
- in the wild
- dot as the privileged character?
- implications:
- "numcodecs." as a defined prefix
- disallow them in core names (others start with protocol)
- colon as the priv. char?
- implications
- "numcodecs.pcodec" is a well-known string
- possibly change in numcodecs
```javascript
{
  "zarr_format": 3,
  "data_type": "scalableminds.string", // prefixed data type
  "chunk_key_encoding": {
    "name": "default",
    "configuration": { "separator": "." }
  },
  "codecs": [
    {
      "name": "numcodecs.vlen-utf8", // prefixed codec
      "version": "2" // new version attribute, must understand
    },
    { "name": "zstd", "configuration": { ... } } // promoted to "core" codec
  ],
  "chunk_grid": {
    "name": "regular",
    "configuration": { "chunk_shape": [ 32 ] }
  },
  "shape": [ 128 ],
  "extensions": {
    "must_understand": false,
    "value": [ // new extension object, must understand
      {
        "name": "https://scalableminds.com/consolidated-metadata", // uri-based extension
        "configuration": { ... },
        "must_understand": false
      }
    ]
  }
}
```
requirements:
- extensions go into existing extension points (codec, data_type, chunk_grid, ...) or a new general-purpose `extensions` key
- extensions := name | {"name":name, "configuration"?: object, "version"?: string}
- name := uri | prefixed string | raw name
- raw names are reserved for "core" extensions, ie. go through some ZEP process, managed through zarr-developers git repo
- prefixes can be claimed by the community, managed through zarr-developers git repo
- uris are free for all
Questions:
- do we support top-level extension points?
- tradeoffs of "." versus ":"
- can we constrain the "extension" objects?
- is every extension an object with name, configuration (and possibly version)?
---
# Second email to ZSC
Last week, Josh and I found ourselves discussing the current spec problems and his email to the steering council from just before the holidays. We've written up what we think could be a solution below which prioritizes:
- clarifying, not breaking, the spec
- enabling new names to be assigned
- minimizing initial impact on the implementations (later phases may require bigger changes)
## Issues to address
The extension mechanism for v3 is not well-defined. While there are some extension points defined in the core spec, there is no advice about selecting names for these extensions that would avoid conflicts. Additionally, there is no mechanism for defining extensions that don't fit into the existing extension points. Implementations have started to use the v3 spec and are using extensions (e.g. `numcodecs.*` codecs and `zstd` in zarr-python), which means that any changes we want to make to the v3 spec need to be compatible with the current reality in the community. Also, there is some confusion in the community, and contradictions in the spec, as to whether there are codecs in the v3 spec.
We highly recommend making the necessary changes to the spec within the v3 version. Bumping the major version would lead to catastrophic churn in the community. We would suggest writing up the below as a comprehensive, dedicated spec (non-process) ZEP. Since the ZEP process is still in limbo, though, we might want to identify clarifications about codec names & URIs that can be fast-tracked as a zarr-specs PR.
## Proposal
### Extension categories and names
There are 3 categories of names: URI-based, prefix-based and bare names.
1. **URI-based names** can be used by anyone without further coordination, though the assumption is that users reasonably "own" the URI. There are no guarantees in terms of versioning or compatibility. The URI doesn't need to resolve to anything. Example: `https://scalableminds.com/zarr3/consolidated-metadata`
2. With **prefix-based names**, the prefixes are shortcuts for a URI prefix and belong to a maintainer (either a team or a single user). The maintainer can create any extension they want under that prefixed namespace. The special character `.` (or `:`, TBD) is chosen to delimit prefixes. It is expected that the maintainer create a versioning and compatibility policy, but there are no guarantees. The prefix assignment is managed through a `zarr-developers`-owned git repo. Prefixes are never unassigned or reassigned. Example: `imagecodecs.jpeg2000` or `imagecodecs:jpeg2000`
3. **Bare names** are assigned through a ZEP process. Compatibility and versioning is guaranteed as per [v3 stability policy](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stability-policy). The name assignment is managed through a `zarr-developers`-owned git repo. Names are never unassigned or reassigned. Example: `zstd`
Extensions are expected to go through these evolutionary naming steps. The further they are in the process, the more mature they are expected to be. Some extensions will never go through the whole process.
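A sketch of how an implementation might classify the three categories. The separator (`.` vs `:` is still TBD above) and the registry contents are assumptions for illustration; in practice the registries would be derived from the `zarr-developers`-owned repo:

```python
# Illustrative registries; the real ones would live in a zarr-developers repo.
REGISTERED_PREFIXES = {"numcodecs", "imagecodecs", "scalableminds"}
REGISTERED_BARE_NAMES = {"bytes", "gzip", "blosc", "zstd", "transpose"}

def classify(name: str) -> str:
    """Return which of the three name categories a given name falls into."""
    if name.startswith(("http://", "https://")):
        return "uri"                       # free for all, assumed owned by the author
    prefix, sep, _ = name.partition(".")   # assuming "." as the separator (TBD)
    if sep:
        if prefix not in REGISTERED_PREFIXES:
            raise ValueError(f"unregistered prefix: {prefix!r}")
        return "prefix-based"
    if name not in REGISTERED_BARE_NAMES:
        raise ValueError(f"unregistered bare name: {name!r}")
    return "bare"

assert classify("https://scalableminds.com/zarr3/consolidated-metadata") == "uri"
assert classify("numcodecs.vlen-utf8") == "prefix-based"
assert classify("zstd") == "bare"
```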
## Extension definitions
Extensions are defined in the metadata either as objects or as short-hand names.
Objects have the following keys:
```javascript
{
  "name": "<name>",
  "configuration": { ... }, // optional
  "version": "<version>" // optional, TBD
}
```
If present, all of these keys are `must_understand=True`. `version` is a possible addition to the definition of existing extension objects, especially important if name resolution does not suffice for versioning needs. We could explicitly RECOMMEND that `version` be added to future extension objects without any breaking change.
Instead of extension objects, short-hand names may be used. They would be equivalent to extension objects with just a `name` key. This is in-line with the current wording of the spec.
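The equivalence between short-hand names and extension objects can be captured by a trivial normalization step, sketched here as a hypothetical helper:

```python
def normalize_extension(ext):
    """Normalize a short-hand name into the equivalent extension object."""
    if isinstance(ext, str):
        return {"name": ext}   # short-hand name == object with just a "name" key
    if isinstance(ext, dict) and "name" in ext:
        return ext             # already a full extension object
    raise TypeError("extension must be a name or an object with a 'name' key")

assert normalize_extension("zstd") == {"name": "zstd"}
assert normalize_extension({"name": "zstd", "configuration": {"level": 3}})["name"] == "zstd"
```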
## Extension points
The v3 core spec defines a number of extension points for arrays and groups:
- `data_type` (array only)
- `codecs` (array only)
- `chunk_grid` (array only)
- `chunk_key_encoding` (array only)
- `storage_transformers` (array only) // extendable to groups
While `storage_transformers` is already a powerful extension point, we propose adding another, general-purpose extension point: `extensions`. The metadata holds an array of extension definitions under it. The key itself is either implicitly `must_understand=True`, or explicitly `must_understand=False` if it **only** contains optional extensions.
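One way a client might process the proposed `extensions` key, as a hedged sketch (the supported-extension set is an assumption for illustration):

```python
# Extensions this hypothetical client knows how to handle.
SUPPORTED = {"https://scalableminds.com/zarr/consolidated-metadata"}

def process_extensions(extensions: dict) -> list:
    """Collect usable extension definitions; fail on unknown required ones."""
    usable = []
    for ext in extensions.get("value", []):
        if ext["name"] in SUPPORTED:
            usable.append(ext)
        elif ext.get("must_understand", True):
            # Unknown and required: the client must fail rather than proceed.
            raise ValueError(f"unsupported required extension: {ext['name']}")
        # Unknown but explicitly optional: silently ignored.
    return usable

ok = process_extensions({
    "must_understand": False,
    "value": [{"name": "https://scalableminds.com/zarr/consolidated-metadata",
               "must_understand": False}],
})
assert len(ok) == 1
```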
## "Core" extensions
We acknowledge that the spec has a few core extensions that are expected to be supported by all implementations.
This proposal doesn't address that expectation (since it would require enforcing something on implementations).
Extensions with bare names don't automatically become "core". Extensions become "core" by being listed in the core spec document (e.g., with a clear `MUST`) through a to-be-defined process, e.g. a ZEP.
### Data types
- `(u)int{8,16,32,64}`
- `float{32,64}`
- `complex{64,128}` // debatable
### Codecs
- `bytes`
- `gzip`
- `blosc`
- `zstd` // not merged yet, but widely implemented
- `transpose`
### Chunk grids
- `regular`
### Chunk key encoding
- `default`
- `v2`
### Storage transformers
(none)
## Versioning and spec evolution
The versioning of the core spec remains unchanged. That means the value stays `zarr_format=3`, and new metadata keys must be understood by implementations, i.e. they need to fail if they find a key they don't know. All changes to the core spec need to go through the ZEP process.
Extensions **can** be versioned; this is up to the maintainer. To facilitate versioning, we either add a new optional `version` key to extension objects, or use the context versioning mechanism described in Josh's email. Maintainers can define their own spec evolution processes.
## Example
Array
```javascript
{
"zarr_format": 3,
"data_type": "scalableminds.string", // prefix-based name, short-hand name
"chunk_key_encoding": {
"name": "default", // core
"configuration": { "separator": "." }
},
"codecs": [
{
"name": "numcodecs.vlen-utf8", // prefix-based name
"version": "2", // new version attribute, must understand
},
{
"name": "zstd", // bare name, promoted to "core"
"configuration": { ... }
}
],
"chunk_grid": {
"name": "regular", // core
"configuration": { "chunk_shape": [ 32 ] }
},
"shape": [ 128 ],
"extensions": { // new general-purpose extension point
"must_understand": false, // contains only optional extension
"value": [
{
"name": "https://scalableminds.com/zarr/consolidated-metadata", // uri-based name
"configuration": { ... }
"must_understand": false // optional extension
}
]
},
"dimension_names": [ "x" ],
"attributes": { ... },
"storage_transformers": []
}
```
## Discussion
### Relationship with JSON-LD
As Josh mentioned in his email, for some choice of the "TBD" items above, the name resolution algorithm outlined conforms to the JSON-LD spec. (TBD choices for JSON-LD compatibility would be: prefix separator of ":" and no "version" field.) This may or may not be a design goal, but it provides us with a solid basis from which to start our conversations. One implication of choosing JSON-LD as the basis is that at least conceptually *all* names would be URIs, but prefixes and bare names would provide shortened versions that would be used by default. There's obviously some cost to this beyond the above proposal; the added value is that URIs which are introduced into the wild (e.g., `https://scalableminds.com...`) continue to function even after receiving a prefix **AND** becoming a bare name.
### Adding new metadata keys
The v3 core spec allows for the addition of new metadata keys as part of spec evolution. The idea is to force implementations to parse the entire metadata and fail if they find a key they cannot parse (unless marked by `must_understand=False`). We should use this mechanism to evolve the spec and add the keys that are necessary to achieve a well-defined extension mechanism.
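That rule can be sketched as a small strict-parsing check. The set of known keys below is illustrative, not the full v3 list:

```python
# Illustrative subset of the v3 metadata keys this client understands.
KNOWN_KEYS = {"zarr_format", "node_type", "shape", "data_type", "chunk_grid",
              "chunk_key_encoding", "codecs", "fill_value", "attributes",
              "dimension_names", "storage_transformers"}

def check_keys(metadata: dict) -> None:
    """Fail on any unknown key that is not explicitly marked as optional."""
    for key, value in metadata.items():
        if key in KNOWN_KEYS:
            continue
        if isinstance(value, dict) and value.get("must_understand") is False:
            continue  # unknown but explicitly optional: safe to ignore
        raise ValueError(f"unknown metadata key: {key!r}")

check_keys({"zarr_format": 3, "shape": [128]})                           # ok
check_keys({"zarr_format": 3, "new_key": {"must_understand": False}})    # ok, optional
# check_keys({"zarr_format": 3, "new_key": {}})                          # raises ValueError
```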
### Naming conflicts
Naming conflicts are avoided through
- owned URIs
- prefixes that are uniquely assigned through a well-known repo
- names that are uniquely assigned through a well-known repo
### Extension objects vs. arbitrary metadata key
The rationale for using extension objects instead of arbitrary keys in the metadata is:
- composability with existing extension points (e.g. `codecs` array)
- the use of URIs as keys in JSON metadata is controversial
- less pollution of the root namespace of a `zarr.json`
### Reassigning or unassigning prefixes and names
We consider it a design goal to not allow datasets written with one set of naming expectations to be unintentionally interpreted with *other* names by a future version of an implementation. Without further mechanics, this means that prefixes and core names, once assigned, cannot be changed. With this limitation, the above proposal can be applied to current implementations quite simply. However, to support evolving the names in the future, some mechanism like the `@context` logic in Josh's email, would be needed so that what is written as "foo" or "foo:bar" today can be migrated in the future.
## Rollout
These changes will be proposed to the community and implemented through a ZEP.
Revisions to ZEP 0 will be paused until we are clear about the extension mechanism and the implications for the role of the ZEP process in the future. As mentioned above, it might be worth considering whether we can identify immediate clarifications which are compatible/non-breaking with the current state of the Zarr v3 spec *and* such a proposal.
## Future work
- Stores