# Primers

This document contains a series of technical primers for discussion topics at the Zarr Summit in Rome.

## Rectilinear (variable-length) chunking

### Summary

Today, Zarr requires that the chunks of an array all have the same shape. But some applications would benefit from arrays whose chunks have different shapes. A new "chunk grid" specification is the first step towards enabling this.

### Definition

"Regular chunking" is the practice of partitioning an array into sub-arrays that all have the same shape (modulo boundary effects). The term "regular" is used because the origin of each chunk lies on a [regular grid](https://en.wikipedia.org/wiki/Regular_grid). If instead the origin of each chunk lies on a *rectilinear grid*, then we have rectilinear, or variable-length, chunking, which allows arrays to be divided into sub-arrays of varying shapes.

### Chunking in Zarr today

There is no widely used scheme for rectilinear chunking in Zarr. A Zarr array declares the shape of its chunks via the `chunk_grid` field in its metadata, and only the [regular grid](https://github.com/zarr-developers/zarr-specs/blob/main/docs/v3/chunk-grids/regular-grid/index.rst) is widely used. It defines a regular chunk grid, so every chunk has the same shape.

See [this discussion](https://github.com/orgs/zarr-developers/discussions/52) about a [Zarr enhancement proposal](https://github.com/zarr-developers/zeps/pull/18) (see a rendered version [here](https://zarr.dev/zeps/draft/ZEP0003.html)) that defined a scheme for rectilinear chunking. A relevant section from that proposal:

> Each element of chunk_shape, corresponding to each dimension of the array, may be a single integer, as before, or a list of integers which add up to the size of the array in that dimension. In this example, the single value of 10 for the chunks on the second dimension would be identical to [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]. The number of values in the list is equal to the number of chunks along that dimension. Thus, a “rectangular” grid may be fully compatible as a “regular” grid.

That proposal targets an older version of the Zarr V3 spec, but the key ideas remain useful.

### Use cases

Here are two important use cases for rectilinear chunking of Zarr arrays. This is not an exhaustive list.

#### Optimizing IO for append-heavy workflows

Arrays used for both reading and writing can benefit from two different chunk shapes. For example, an array that models periodically acquired data can use one chunk shape when appending new data and another when reading old data. If desired, recently appended chunks can be periodically aggregated into the "read" chunk shape.

#### Array virtualization

Tools like [kerchunk](https://fsspec.github.io/kerchunk/) and [virtualizarr](https://virtualizarr.readthedocs.io/en/stable/) generate Zarr arrays by virtually concatenating sub-arrays, each of which might be stored in a totally separate storage backend. This provides a Zarr API layer on top of file formats like TIFF or HDF5. Concatenating arrays with different chunk shapes is impossible with regular chunks, but with rectilinear chunks it becomes much more tractable.

### Rectilinear chunking in Zarr today

To get rectilinear chunking in Zarr today, we would need a new `chunk_grid` spec, ideally one that all Zarr implementers are willing to support. The pre-existing [proposal](https://zarr.dev/zeps/draft/ZEP0003.html) from 2023 is probably a very solid starting point; it would not be much work to adapt it to the Zarr V3 metadata style.
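To make the idea concrete, here is a minimal sketch of what a V3-style rectilinear `chunk_grid` declaration might look like, following ZEP0003's per-dimension convention. The grid name `"rectilinear"` and the `"chunk_shapes"` key are illustrative assumptions, not part of any finalized spec:

```python
# Hypothetical V3-style metadata for a rectilinear chunk grid on an array of
# shape (100, 100). The name "rectilinear" and the key "chunk_shapes" are
# placeholders; a real spec would fix these through the extension process.
chunk_grid = {
    "name": "rectilinear",
    "configuration": {
        # Per ZEP0003: each dimension is either a single integer (uniform
        # chunk size, as with the regular grid) or a list of integers that
        # must sum to the array's size along that dimension.
        "chunk_shapes": [
            [64, 32, 4],  # three chunks of varying size along axis 0
            50,           # uniform chunks of size 50 along axis 1
        ],
    },
}
```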
Once this spec is finalized, it should be submitted via PR to the [zarr-extensions](https://github.com/zarr-developers/zarr-extensions/) repo as an additional chunk grid.

## Sharding

### Summary

Zarr V2 users often complained about creating arrays with too many small files. Sharding was introduced in Zarr V3 to solve this problem by storing multiple sub-chunks inside a single larger chunk (a "shard"). Sharding was implemented as a *codec*, which has advantages and disadvantages that are worth discussing.

### Definition

In the context of Zarr, "sharding" is the practice of storing *multiple chunks inside the same file*. People with a database background might find this confusing -- for databases, "sharding" denotes horizontal partitioning, i.e. *splitting* rows of a table across separate, independent database instances. But in Zarr, sharding means *combining* sub-chunks inside a larger chunk.

### Background

In Zarr V2, array chunks are stored in separate objects. This puts the performance advantages of small, granular chunks in tension with the cost of storing a large number of individual objects. Zarr V3 relaxes this tension somewhat by introducing a *sharding codec*, which supports reading and writing individually compressed sub-chunks within a single larger chunk object (often termed a "shard").

### The sharding codec

Although Zarr V3 supports sharding, it is not a core feature of the Zarr V3 model. Instead, sharding is implemented by a *codec*, i.e. an encode / decode routine applied to a chunk of data when writing it to storage. This means some Zarr V3 libraries might not support accessing sharded data, or some Zarr V3 libraries could support multiple distinct sharding codecs. But so far, neither of these outcomes looks likely.

The Zarr V3 spec includes a [sharding codec spec](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/index.html) that defines procedures for writing full shards and reading sub-chunks from within a shard. The full spec is worth a read, but in very simple terms, the sharding codec uses an index, stored either at the beginning or the end of the shard, to record the offset and length of each sub-chunk contained in the shard. Sub-chunks can be written to a shard in any order. Sub-chunks are encoded independently by a list of codecs defined on the sharding codec itself (and so a sharding codec can contain another sharding codec). The sub-chunk shape must evenly divide the shard's shape in array coordinates, which is set by the `chunk_grid` field in the array metadata. So if the `chunk_grid` field declares a shape like `(128,)`, only factors of 128 are permitted for the sub-chunk shape.

### Performance considerations

On storage backends that support byte-range reads, sub-chunks inside a shard can be read concurrently. Concurrent *writing* of sub-chunks is not possible, except in the very special case where the sub-chunks are not compressed. Entire shards can be written and read concurrently.

Compared to simple chunking, sharding brings some new performance considerations:

- Reading a single sub-chunk from a shard requires two requests: one to fetch the index, and another to fetch the byte range of the sub-chunk (see the sketch after this list). If the shard file is guaranteed to be immutable, then an application might *cache* the index, which would reduce the number of requests needed for later sub-chunk reads.
- There are multiple ways to read multiple sub-chunks from a shard:
  - The byte range for each sub-chunk can be fetched separately.
  - Sub-chunks that are "close enough" together can be fetched with a single byte-range request that spans them all.
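The two-request pattern in the first bullet can be made concrete. Below is a minimal sketch, assuming a one-dimensional shard, an index of little-endian `(offset, nbytes)` `uint64` pairs stored at the end of the shard with no index checksum (the spec's default index codecs also add a `crc32c` checksum), and a caller-supplied `fetch(key, start, stop)` byte-range primitive; `fetch` and its signature are assumptions for illustration, not part of any Zarr API:

```python
import struct

# Sketch: read one sub-chunk from a shard with a trailing index.
# Assumptions (simplified relative to the spec): a 1-D shard holding `n`
# sub-chunks; the index is n little-endian (offset, nbytes) uint64 pairs at
# the end of the shard, with no checksum; fetch(key, start, stop) returns
# the bytes in [start, stop) of the object named `key`.
MISSING = 2**64 - 1  # per the spec, an all-ones entry marks an absent sub-chunk

def read_subchunk(fetch, key, shard_nbytes, n, i):
    index_nbytes = n * 16  # two uint64 values per sub-chunk
    index = fetch(key, shard_nbytes - index_nbytes, shard_nbytes)  # request 1
    offset, nbytes = struct.unpack_from("<QQ", index, i * 16)
    if offset == MISSING:
        return None  # never written; the array's fill value applies
    return fetch(key, offset, offset + nbytes)  # request 2: encoded sub-chunk
```

Caching `index` across calls turns every subsequent sub-chunk read into a single request, as noted above.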
The sharding codec only requires that sub-chunks be located *somewhere* within a shard. There are no requirements on the order of the sub-chunks, and there is no requirement that the sub-chunks be densely packed. This means a shard can carry an extra header or footer containing data that is not managed by Zarr, which opens up a lot of opportunities.

### Terminology

Because Zarr V3 implemented sharding as a codec, we have some awkward terminology. Per the Zarr V3 spec, codecs encode and decode chunks, so the input to the sharding codec is a chunk. But the sharding codec itself defines another layer of chunking, specified by the `"chunk_shape"` field in the sharding codec metadata. So the word "chunk" can mean two different things depending on context. This is unfortunate and causes a lot of confusion. In this document we have used the word "shard" to denote "chunk, but when the sharding codec is in use", and "sub-chunk" to mean "chunk, from the perspective of the sharding codec".

## Codecs

### Summary

The idea that users can define custom data compression routines for their data is central to Zarr. In Zarr V3, data compression is declared in the `"codecs"` field, which is a list of codec metadata documents.

### Chunk encoding / decoding

During [chunk encoding](https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#chunk-encoding), the codecs are applied in a chain, where the output of the Nth codec is provided to the (N+1)th codec. During chunk decoding, the same procedure happens in reverse. When encoding a chunk, the input to the first codec is an in-memory array. (The Zarr V3 spec doesn't explicitly define its assumptions about the interface of in-memory arrays, and it makes no assumptions about their memory layout.) The output of the last codec is a byte string. To make this invariant more explicit, the Zarr V3 spec defines 3 distinct types of codecs:

- array -> array codecs, which map arrays to arrays. Examples include any transformation that returns an array with a new axis order, data type, or values.
- array -> bytes codecs, which map arrays to byte strings. Examples include the sharding codec, and the "bytes" codec, which serializes an array to bytes.
- bytes -> bytes codecs, which map byte strings to byte strings. Examples include compression algorithms and checksums.

The `"codecs"` field must define a complete transformation from arrays to bytes. That means the codecs must be any number of array -> array codecs, followed by exactly one array -> bytes codec, followed by any number of bytes -> bytes codecs, as in the example below.
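For example, here is a valid `"codecs"` chain built from codecs defined alongside the Zarr V3 spec, shown as a Python literal for readability:

```python
# A complete "codecs" chain: array -> array codecs first, then exactly one
# array -> bytes codec, then any number of bytes -> bytes codecs.
codecs = [
    # array -> array: reorder the axes of a 2-D chunk
    {"name": "transpose", "configuration": {"order": [1, 0]}},
    # array -> bytes: serialize the array to a byte string
    {"name": "bytes", "configuration": {"endian": "little"}},
    # bytes -> bytes: compress, then append a checksum
    {"name": "gzip", "configuration": {"level": 5}},
    {"name": "crc32c"},
]
```

During decoding, the chain runs in reverse: verify and strip the checksum, decompress, deserialize the bytes into an array, and undo the transpose.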
### Defining new codecs

The core Zarr V3 spec defines codecs structurally. Although the Zarr V3 spec defines a [few core codecs](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/index.html), users can define their own compression routines. To ensure feature parity across Zarr implementations, we encourage people who write custom codecs to publish them at the [zarr-extensions](https://github.com/zarr-developers/zarr-extensions/) repo. This repo is meant as a lightweight coordination mechanism.

## Zarr pain points

This primer is a bit less structured than the others. We will simply enumerate some aspects of Zarr that are common sources of friction for users.

### Indexing chunks

The Zarr spec does not provide any mechanism for explicitly modeling the state of all the stored chunks of an array. For example, applications might benefit from easy access to information about which chunks have already been created, or equivalently, which chunks are missing. But with a large number of chunks, this information would not be ergonomic in JSON, and it could very easily become stale without careful coordination and synchronization procedures.

### Interfacing with tabular data

Zarr array data is often used alongside other forms of N-dimensional data, such as points, geometries, or meshes, which are typically not sampled on a regular grid and are often not a natural fit for an N-dimensional array representation. Instead, these values are often stored in tabular data formats or storage backends like databases. Two questions naturally emerge from this situation:

- How can we store tabular data in Zarr arrays?
- What are strategies for composing Zarr data with data stored in tabular formats / backends?
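One common (if unstandardized) answer to the first question is to store a table columnarly: one 1-D Zarr array per column, all of equal length, under a single Zarr group. Here is a minimal sketch using the zarr-python 3 API (method names differ slightly in older releases); the layout and the `"columns"` attribute are illustrative conventions, not a standard:

```python
import numpy as np
import zarr

# Columnar "table" layout: one 1-D array per column under one group.
# The group attribute acts as ad-hoc schema metadata.
n_rows = 1_000
group = zarr.open_group("example_table.zarr", mode="w")
group.attrs["columns"] = ["lat", "lon", "value"]
for name, dtype in [("lat", "float64"), ("lon", "float64"), ("value", "float32")]:
    group.create_array(name, shape=(n_rows,), dtype=dtype)
group["lat"][:] = np.random.uniform(-90, 90, n_rows)
group["lon"][:] = np.random.uniform(-180, 180, n_rows)
group["value"][:] = np.random.standard_normal(n_rows).astype("float32")
```

Row selection then becomes slicing each column array; anything like joins or predicate pushdown is left to tooling above Zarr, which is part of why the second question remains open.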