# ZEP Draft: Variable Length Chunks

```
Type: Specification
Created: <date created on, in dd-mmm-yyyy format>
Resolution: <url> (required for Accepted | Rejected | Withdrawn)
```

## TODO:

- [ ] What is stored for chunk offsets? Cumulative?
  - Upside of cumulative: random access
  - But if any chunk size changes, all following offsets do too
- [ ] Handling of intermediates
- [ ] Discussion of previous work in dask
- [ ] Finish TODO list

## Resources

* Previous discussion:
  * [Zarr Dask Table dask/dask#1599](https://github.com/dask/dask/issues/1599)
  * [Protocol extensions for awkward arrays zarr-developers/zarr-specs#62](https://github.com/zarr-developers/zarr-specs/issues/62)
  * [Handling arrays with non-uniform chunking zarr-developers/zarr-specs#40](https://github.com/zarr-developers/zarr-specs/issues/40)
  * [Chunk spec zarr-developers/zarr-specs#7](https://github.com/zarr-developers/zarr-specs/issues/7#issuecomment-468127219)

## Abstract

This ZEP proposes variable length chunking for zarr arrays. Instead of having a single fixed chunk size along each dimension, the size of every chunk along a dimension could be specified explicitly. This has a number of applications, including more efficient creation, mapping across datasets (as in `kerchunk`), and representation of variable length data.

## Motivation and Scope

```quote
This section describes the need for the proposed change. It should describe the existing problem, who it affects, what it is trying to solve, and why. This section should explicitly address the scope of and key requirements for the proposed change.
```

### Problems

### Creating zarr arrays from irregular sources

Streaming data, like sensor data, may not come in fixed increments.

> Variably chunked storage would be great for parallel writing. With variable chunk sizes I don't need to know ahead of time how my data's logical chunks map onto its physical chunking. I just need to make sure my offsets are correct once I'm done. To me, this seems much more tractable than storing variable length data in fixed size chunks (where write locations for chunks are dependent on previous chunks).

> It is true that changing the chunk size of a zarr array would require rewriting all of the data. However, increasing a dimension of the array just requires rewriting the .zarray metadata file --- none of the chunks need to be touched. Decreasing a dimension of the array may require deleting chunks (or you could skip that and just let them remain). And when overwriting a portion of an existing array, only the affected chunks are touched, and this can be done with arbitrarily high parallelism (as long as each machine touches disjoint chunks).

### Combining data created with different chunking strategies

Zarr-python's current strategy for automatically guessing chunk sizes does not produce chunk grids that are compatible between arrays.

### Variable length data

This is particularly relevant for ragged data access patterns. In the case of both ragged and compressed sparse formats, each index along one dimension has a variable amount of data associated with it. This means there is no mapping between logical chunks along these dimensions and the physical chunks of the data.

Let's say we have a sparse CSR array of observations × features. Alongside this we have an array of class labels for each observation. We want to use some distributed method to train a model on these inputs.

We can encode a sparse array as a group containing three arrays:

* `data`, which carries the non-zero values of the array
* `indices`, which carries the column indices of the non-zero values
* `indptr`, which stores offsets into `data` and `indices` that group them by row

As `data` and `indices` are the same length, their chunks can align. However, as each row of the matrix can have a variable number of non-zero values, those chunks won't necessarily align with the row offsets. Even more unlikely is any alignment of the rows of these arrays with the class labels. Chunk sizes can be tuned for the worst case depending on whether latency or throughput is more expensive, but the best case (aligned access) will occur very rarely.
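To make the misalignment concrete, here is a minimal sketch of this layout assuming `scipy.sparse` and the current zarr-python v2 group API; the shapes, density, and chunk size of 256 are arbitrary illustration values, not part of this proposal:

```python
import scipy.sparse
import zarr

# A small CSR matrix of observations x features.
X = scipy.sparse.random(1_000, 20, density=0.1, format="csr")

g = zarr.group()
# `data` and `indices` have the same length, so their fixed-size chunks can align...
g.create_dataset("data", data=X.data, chunks=(256,))
g.create_dataset("indices", data=X.indices, chunks=(256,))
g.create_dataset("indptr", data=X.indptr, chunks=(256,))

# ...but the values for row i live at data[indptr[i]:indptr[i + 1]], and those
# boundaries generally begin and end in the middle of a chunk, so a block of
# rows cannot be read or written as a set of whole chunks.
```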
### Distributed query

In a case like tabular data, I may want to filter my entries based on some criteria in a distributed manner. Even if I distribute my data in regular chunks, the output may not be regularly sized. Zarr currently cannot represent the result of this operation without first rechunking it.

## Usage and Impact

```quote
This section describes how users of Zarr will use the new features, spec changes or a new process described in this ZEP. It should be comprised mainly of code examples that wouldn’t be possible without acceptance and implementation of this ZEP, as well as the impact the proposed changes would have on the ecosystem. This section should be written from the perspective of the users of Zarr, and the benefit it will provide them; as such, it should include implementation details only if necessary to explain the functionality.
```

### Combining data without rechunking

```python
a1 = zarr.ones(100, chunks=100)
a2 = zarr.zeros(100, chunks=50)

combined = kerchunk.concatenate([a1, a2])
combined.chunks
# ((100, 50, 50),)
```

#### Point data (octrees)

## Backward Compatibility

```quote
This section describes how the ZEP breaks backward compatibility. Its purpose is to provide a high-level summary to users who are not interested in detailed technical discussion, but may have opinions around, e.g., usage and impact.
```

Implementations of zarr which cannot understand variable length chunks would not be able to read these arrays. However, this would be obvious from the metadata, so it should not create any huge problems.

## Detailed description

```quote
This section should provide a detailed description of the proposed change. It should include examples of how the new functionality would be used, intended use-cases and pseudo-code illustrating its use.
```

### Creation

```python
zarr.create(1000, chunks=((100, 300, 500, 100),))
```
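With the chunk sizes spelled out per dimension, an implementation still needs to resolve which chunks a slice touches (see the open TODO on whether offsets are stored cumulatively). A minimal sketch of a cumulative-offset lookup, written in plain numpy and not part of any existing zarr API, could look like:

```python
import numpy as np

chunks = (100, 300, 500, 100)        # explicit chunk sizes along one dimension
offsets = np.cumsum((0,) + chunks)   # cumulative boundaries: [0, 100, 400, 900, 1000]

def chunks_for_slice(start, stop):
    """Return the indices of the chunks overlapping the half-open slice [start, stop)."""
    first = np.searchsorted(offsets, start, side="right") - 1
    last = np.searchsorted(offsets, stop, side="left")
    return range(first, last)

print(list(chunks_for_slice(350, 600)))  # -> [1, 2]
```

Cumulative offsets keep random access to a binary search per dimension, at the cost that resizing any one chunk shifts every following offset, which is the trade-off noted in the TODO list above.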
## Related Work

```quote
This section should list relevant and/or similar technologies, possibly in other libraries. It does not need to be comprehensive, just list the major examples of prior and relevant art.
```

### Dask

### Parquet / Arrow

Arrow describes tables as a collection of record batches, with no restriction on the size of these batches. This is not only very flexible, but can be used as an indexing strategy for low cardinality columns within parquet:

```
dataset_name/
    year=2007/
        month=01/
            0.parq
            1.parq
            ...
        month=02/
            0.parq
            1.parq
            ...
        month=03/
            ...
    year=2008/
        month=01/
            ...
    ...
```

This feature was cited as one of the reasons parquet was chosen over zarr: https://github.com/dask/dask/issues/1599

### awkward array

## Implementation

```quote
This section lists the major steps required to implement the ZEP. Where possible, it should be noted where one step is dependent on another, and which steps may be optionally omitted. Where it makes sense, each step should include a link to related pull requests as the implementation progresses.

Any pull requests or development branches containing work on this ZEP should be linked to from here. (A ZEP does not need to be implemented in a single pull request if it makes sense to implement it in discrete phases.)
```

## Alternatives

```quote
If there were any alternative solutions to solving the same problem, they should be discussed here, along with a justification for the chosen approach.
```

### Don't, just tune chunk sizes

https://github.com/zarr-developers/zarr-specs/issues/62#issuecomment-1100806513

## Discussion

```quote
This section should have links related to any discussion regarding the ZEP. It could be GitHub issues and/or discussions. (Links to past discussions, if any, go in this section.)
```

## References and Footnotes

```quote
Each ZEP must either be explicitly labelled as placed in the public domain (see this ZEP as an example) or licensed under the [Open Publication License](https://www.opencontent.org/openpub/).
```

## Copyright

This document has been placed in the public domain.