# Arrow in Zarr: Design Doc

## Motivation

Arrow has data types that Zarr does not. While Arrow's primitive, fixed-length data types mostly map 1:1 to Zarr dtypes, its variable-length and nested dtypes do not exist in Zarr. Teaching Zarr to use these Arrow dtypes would unlock immediate benefits for Zarr users, such as:

- Ragged array storage
- Efficient structs (e.g. for machine learning training datasets)
- Arrow strings (maybe better than NumPy strings)
- Extension types such as GEOMETRY, useful for vector data

## Requirements

### General

- The individual elements of a Zarr Array may be any valid [Arrow data type](https://arrow.apache.org/docs/format/Columnar.html#data-types), including
  - Fixed-length primitive types: numbers, booleans, dates and times, fixed-size binary, decimals, and other values that fit into a fixed number of bits
  - Variable-length primitive types: binary, string
  - Nested types: list, map, struct, and union
  - Dictionary type: an encoded categorical type
- There are no restrictions on the shape or chunking of these Arrays; any valid shape and chunk grid is supported

### Specific to Zarr Python

- When writing data to an Arrow dtype array, the user can provide data as one of:
  - A numpy array (suitable for fixed-length primitive types only)
  - A pyarrow Array (limitation: pyarrow arrays are 1D only)
  - Maybe a new thing ("ND Arrow Array wrapper"; see below)
- When reading data from an Arrow dtype array, the user can receive data in memory in any of the above formats

### Non-Goals

- This is only about dtypes; it's not about mapping other parts of the Arrow ecosystem to Zarr (e.g. schemas, record batches, etc.)
- This is not about teaching Arrow to understand nd-array data; we are not proposing any changes or extensions to Arrow
- This is not about Arrow `Tensor`s, which are separate and don't support the dtypes we need anyway

## FAQ

- How is this different from [Arrow Zarr](https://github.com/datafusion-contrib/arrow-zarr)?
  - Arrow Zarr is specifically about _querying existing Zarr data_ from Apache DataFusion. It doesn't involve any extensions to the Zarr spec. It maps existing Zarr dtypes to Arrow dtypes as part of query execution.
  - This proposal goes in the opposite direction: it's about serializing Arrow-dtyped data into Zarr storage. This requires a Zarr dtype extension.
  - They could be integrated, of course!

## Design

### Zarr Data Type Extension

We will introduce a new Zarr data type extension called `arrow`. The dtype configuration contains two elements:

- `version` - Because JSON serialization of Arrow dtypes is not considered canonical in the Arrow ecosystem, we explicitly version this extension for future-proofing.
- `field` - JSON metadata for an Arrow field, serialized via the [Arrow JSON test data format](https://github.com/apache/arrow/blob/main/docs/source/format/Integration.rst#json-test-data-format). This is the closest thing that exists to a JSON spec for serialization of Arrow schemas.
  - The `name` field is required here but is currently _ignored_ by Zarr! For consistency, implementations should prefer to use the name of the Zarr array here.
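As far as we know, pyarrow does not expose a public API for emitting this JSON format, which is why the TODO list below proposes binding the Rust `arrow-integration-test` crate. For illustration only, a hand-rolled sketch of the mapping for a few simple types might look like this (`field_to_json` is a hypothetical helper, not an existing API):

```python
import pyarrow as pa
import pyarrow.types as pat

def field_to_json(field: pa.Field) -> dict:
    # Hypothetical helper covering only a few simple types; the real
    # mapping would come from bindings to the Rust arrow-integration-test
    # crate (see the TODO list) rather than being hand-rolled like this.
    t = field.type
    if pat.is_integer(t):
        type_json = {"name": "int", "isSigned": pat.is_signed_integer(t), "bitWidth": t.bit_width}
    elif pat.is_floating(t):
        precision = {16: "HALF", 32: "SINGLE", 64: "DOUBLE"}[t.bit_width]
        type_json = {"name": "floatingpoint", "precision": precision}
    elif pat.is_string(t):
        type_json = {"name": "utf8"}
    else:
        raise NotImplementedError(f"no JSON mapping for {t}")
    return {
        "name": field.name,
        "type": type_json,
        "nullable": field.nullable,
        "children": [],  # simple (non-nested) types only
    }

print(field_to_json(pa.field("foo", pa.int32())))
# {'name': 'foo', 'type': {'name': 'int', 'isSigned': True, 'bitWidth': 32}, 'nullable': True, 'children': []}
```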
Example Array metadata (simple primitive type), via <https://github.com/apache/arrow/blob/main/docs/source/format/integration_json_examples/simple.json>:

```json
{
  "data_type": {
    "name": "arrow",
    "configuration": {
      "version": "0.1.0",
      "field": {
        "name": "foo",
        "type": {"name": "int", "isSigned": true, "bitWidth": 32},
        "nullable": true,
        "children": []
      }
    }
  }
}
```

Struct, via <https://github.com/apache/arrow/blob/main/docs/source/format/integration_json_examples/struct.json>:

```json
{
  "data_type": {
    "name": "arrow",
    "configuration": {
      "version": "0.1.0",
      "field": {
        "name": "struct_nullable",
        "type": {"name": "struct"},
        "nullable": true,
        "children": [
          {
            "name": "f1",
            "type": {"name": "int", "isSigned": true, "bitWidth": 32},
            "nullable": true,
            "children": []
          },
          {
            "name": "f2",
            "type": {"name": "utf8"},
            "nullable": true,
            "children": []
          }
        ]
      }
    }
  }
}
```

### Arrow Serializer Codec

We will create a new Zarr serializer codec called `arrow`.

- Input: an array of items with the specified Arrow dtype
- Output: an [Arrow IPC stream](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) of 1 or more record batches. At this point, the stream can be treated as `bytes` by the rest of the Zarr codec pipeline.
- In order to create an IPC stream, the input array will be promoted to an Arrow record batch with a single field. (A round-trip sketch of this encoding appears after the TODO list below.)

Questions:

- Is there any benefit to using more than one record batch per chunk?
- How does the Zarr Python buffer interface have to evolve to support this proposal?
  - Should we create a new set of buffers?
  - Can we leverage the zero-copy properties of Arrow effectively?
- Fill values
  - How do we handle the fill value for arbitrary Arrow dtypes?
  - Disallow a default fill value?
  - Require that the field is nullable?

## TODO List

- [ ] Expose the [Arrow JSON test data format](https://github.com/apache/arrow/blob/main/docs/source/format/Integration.rst#json-test-data-format) via Python bindings to the [rust crate](https://docs.rs/arrow-integration-test/latest/arrow_integration_test/index.html) ([GitHub code](https://github.com/apache/arrow-rs/blob/main/arrow-integration-test/src/datatype.rs))
- [ ] Implement a new Zarr Arrow dtype in Zarr Python (see [example](https://github.com/zarr-developers/zarr-python/blob/main/examples/custom_dtype/custom_dtype.py))
- [ ] Implement the Arrow serializer ([prior attempt](https://github.com/zarr-developers/zarr-python/pull/2031/files#diff-ea60a3b75ede8b21aa1c9d2e5ab9be2a5321d43cdd750bc499aebb0e05400a26))
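For concreteness, here is a minimal sketch of the serializer's encode/decode round trip using public pyarrow IPC APIs. The function names (`encode_chunk`, `decode_chunk`) and the one-batch-per-chunk assumption are illustrative, not a committed design; a real codec would also need to handle flattening N-dimensional chunks to 1D and reshaping on read.

```python
import pyarrow as pa

def encode_chunk(arr: pa.Array, name: str = "chunk") -> bytes:
    # Promote the 1D Arrow array to a record batch with a single field,
    # then serialize that batch as an Arrow IPC stream.
    batch = pa.record_batch([arr], names=[name])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue().to_pybytes()

def decode_chunk(data: bytes) -> pa.Array:
    # Read the IPC stream back and recover the single column.
    batches = list(pa.ipc.open_stream(data))
    # One batch per chunk assumed here; a general implementation
    # would concatenate if the stream holds multiple batches.
    return batches[0].column(0)

# Round trip a variable-length string array containing a null --
# something no current Zarr dtype can represent natively.
strings = pa.array(["a", None, "ragged", "strings"])
assert decode_chunk(encode_chunk(strings)).equals(strings)
```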