# Arrow in Zarr: Design Doc
## Motivation
Arrow has data types that Zarr does not. While Arrow's primitive, fixed-length data types mostly map 1:1 to Zarr dtypes, its variable-length and nested dtypes do not exist in Zarr. Teaching Zarr to use these Arrow dtypes would unlock immediate benefits for Zarr users, such as:
- Ragged array storage
- Efficient structs (e.g. for machine learning training datasets)
- Arrow strings (maybe better than NumPy strings)
- Extension types such as GEOMETRY, useful for vector data
## Requirements
### General
- The individual elements of a Zarr Array may be any valid [Arrow data type](https://arrow.apache.org/docs/format/Columnar.html#data-types), including:
  - Fixed-length primitive types: numbers, booleans, dates and times, fixed-size binary, decimals, and other values that fit into a fixed number of bits
  - Variable-length primitive types: binary, string
  - Nested types: list, map, struct, and union
  - Dictionary type: an encoded categorical type
- There are no restrictions on the shape or chunking of Arrays with these data types; any valid shape and chunk grid is supported
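For concreteness, here is one pyarrow value for each of the type categories above (these are plain pyarrow calls; nothing in this snippet is new API):

```python
import pyarrow as pa

# One concrete instance of each Arrow type category listed above
fixed  = pa.array([1, 2, 3], type=pa.int32())                    # fixed-length primitive
varlen = pa.array(["a", "bc", None], type=pa.string())           # variable-length string
nested = pa.array([[1, 2], [], [3]], type=pa.list_(pa.int32()))  # nested (ragged) list
cat    = pa.array(["x", "y", "x"]).dictionary_encode()           # dictionary / categorical
```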
### Specific to Zarr Python
- When writing data to an Arrow dtype array, the user can provide data as one of:
  - A NumPy array (suitable for fixed-length primitive types only)
  - A pyarrow Array (limitation: pyarrow arrays are 1D only)
  - Maybe a new thing ("ND Arrow Array wrapper"; see below)
- When reading data from an Arrow dtype array, the user can receive data in memory in any of the above formats; a usage sketch follows this list
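To make the read/write requirement concrete, here is a hypothetical usage sketch. Nothing in it exists yet: the `arrow<string>` dtype spelling and direct assignment from a pyarrow Array are assumptions about the API this proposal would enable.

```python
import pyarrow as pa
import zarr

# Hypothetical: create an array with the proposed "arrow" dtype.
# The dtype spelling is illustrative, not an implemented API.
z = zarr.create_array(
    store="data/example.zarr",
    shape=(4,),
    chunks=(2,),
    dtype="arrow<string>",  # assumed spelling for an Arrow utf8 dtype
)

# Writing: provide data as a pyarrow Array (1D only, per the limitation above)
z[:] = pa.array(["alpha", "beta", None, "gamma"], type=pa.string())

# Reading: receive data back as a NumPy array, a pyarrow Array, or the
# proposed "ND Arrow Array wrapper"
result = z[:]
```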
### Non-Goals
- This is only about dtypes; it's not about mapping other parts of the Arrow ecosystem to Zarr (e.g. schemas, record batches, etc.)
- This is not about teaching Arrow to understand nd-array data; we are not proposing any changes or extensions to Arrow
- This is not about Arrow `Tensor`s, which are a separate concept and don't support the dtypes we need anyway
## FAQ
- How is this different from [Arrow Zarr](https://github.com/datafusion-contrib/arrow-zarr)?
  - Arrow Zarr is specifically about _querying existing Zarr data_ from Apache DataFusion. It doesn't involve any extensions to the Zarr Spec. It maps existing Zarr dtypes to Arrow dtypes in the query (read) direction.
  - This proposal goes in the opposite direction: it's about serializing Arrow-dtyped data into Zarr storage. This requires a Zarr dtype extension.
  - They could be integrated, of course!
## Design
### Zarr Data Type Extension
We will introduce a new Zarr data type extension called `arrow`. The dtype configuration contains two elements:
- `version` - Because JSON serialization of Arrow dtypes is not considered canonical in the Arrow ecosystem, we explicitly version this extension for future-proofing.
- `field` - JSON metadata for an Arrow field, serialized via the [Arrow JSON test data format](https://github.com/apache/arrow/blob/main/docs/source/format/Integration.rst#json-test-data-format). This is the closest thing that exists to a JSON spec for serialization of Arrow schemas.
  - The `name` field is required here but is currently _ignored_ by Zarr! For consistency, implementations should prefer to use the name of the Zarr array here.
Example Array metadata (simple primitive type), via <https://github.com/apache/arrow/blob/main/docs/source/format/integration_json_examples/simple.json>
```json
{
  "data_type": {
    "name": "arrow",
    "configuration": {
      "version": "0.1.0",
      "field": {
        "name": "foo",
        "type": {"name": "int", "isSigned": true, "bitWidth": 32},
        "nullable": true,
        "children": []
      }
    }
  }
}
```
Example Array metadata (struct type), via <https://github.com/apache/arrow/blob/main/docs/source/format/integration_json_examples/struct.json>
```json
{
  "data_type": {
    "name": "arrow",
    "configuration": {
      "version": "0.1.0",
      "field": {
        "name": "struct_nullable",
        "type": {
          "name": "struct"
        },
        "nullable": true,
        "children": [
          {
            "name": "f1",
            "type": {
              "name": "int",
              "isSigned": true,
              "bitWidth": 32
            },
            "nullable": true,
            "children": []
          },
          {
            "name": "f2",
            "type": {
              "name": "utf8"
            },
            "nullable": true,
            "children": []
          }
        ]
      }
    }
  }
}
```
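As a rough illustration of how the `field` metadata above could be generated programmatically, here is a minimal sketch that serializes a pyarrow `Field` to the integration JSON format. The helper name is made up for this example, and only int, utf8, and struct are handled; a complete mapping (e.g. via Python bindings to the Rust `arrow-integration-test` crate) is part of the TODO list below.

```python
import pyarrow as pa

def field_to_integration_json(field: pa.Field) -> dict:
    """Sketch: map a pyarrow Field to the Arrow integration JSON format.
    Only int, utf8, and struct are handled; a real implementation must
    cover the full Arrow type lattice."""
    t = field.type
    if pa.types.is_integer(t):
        type_json = {
            "name": "int",
            "isSigned": pa.types.is_signed_integer(t),
            "bitWidth": t.bit_width,
        }
    elif pa.types.is_string(t):
        type_json = {"name": "utf8"}
    elif pa.types.is_struct(t):
        type_json = {"name": "struct"}
    else:
        raise NotImplementedError(f"type {t} is not handled in this sketch")
    children = (
        [field_to_integration_json(t.field(i)) for i in range(t.num_fields)]
        if pa.types.is_struct(t)
        else []
    )
    return {
        "name": field.name,
        "type": type_json,
        "nullable": field.nullable,
        "children": children,
    }

# Reproduces the "field" object in the struct example above
struct_field = pa.field(
    "struct_nullable",
    pa.struct([pa.field("f1", pa.int32()), pa.field("f2", pa.utf8())]),
)
print(field_to_integration_json(struct_field))
```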
### Arrow Serializer Codec
We will create a new Zarr Serializer codec called `arrow`.
- Input: an array of items with the specified Arrow dtype
- Output: an [Arrow IPC stream](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) of 1 or more record batches. At this point, the stream can be treated as `bytes` by the rest of the Zarr codec pipeline.
- In order to create an IPC stream, the input array will be promoted to an Arrow Record Batch with a single field (sketched below)
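A minimal sketch of the encode/decode pair using pyarrow's IPC APIs (the function names are illustrative; a real codec must also plug into Zarr Python's codec and buffer interfaces, per the questions below):

```python
import pyarrow as pa

def arrow_encode(chunk: pa.Array, field_name: str = "item") -> bytes:
    """Promote the chunk to a single-field RecordBatch and write it as an
    Arrow IPC stream."""
    batch = pa.record_batch([chunk], names=[field_name])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue().to_pybytes()

def arrow_decode(data: bytes) -> pa.Array:
    """Read the IPC stream back and unwrap the single column."""
    reader = pa.ipc.open_stream(data)
    column = reader.read_all().column(0)  # ChunkedArray with >= 1 chunks
    return pa.concat_arrays(column.chunks)

# Round-trip a variable-length string chunk
chunk = pa.array(["a", None, "bc"], type=pa.string())
assert arrow_decode(arrow_encode(chunk)).equals(chunk)
```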
Questions:
- Is there any benefit to using more than one record batch per chunk?
- How does the Zarr Python buffer interface have to evolve to support this proposal?
  - Should we create a new set of buffers?
  - Can we leverage the zero-copy properties of Arrow effectively?
- Fill values
  - How do we handle the fill value for arbitrary Arrow dtypes?
  - Disallow a default fill value?
  - Require that the field is nullable?
## TODO List
- [ ] Expose [Arrow JSON test data format](https://github.com/apache/arrow/blob/main/docs/source/format/Integration.rst#json-test-data-format) via Python bindings to [rust crate](https://docs.rs/arrow-integration-test/latest/arrow_integration_test/index.html) ([GitHub code](https://github.com/apache/arrow-rs/blob/main/arrow-integration-test/src/datatype.rs)).
- [ ] Implement a new Zarr Arrow dtype in Zarr Python (see [example](https://github.com/zarr-developers/zarr-python/blob/main/examples/custom_dtype/custom_dtype.py))
- [ ] Implement Arrow serializer ([prior attempt](https://github.com/zarr-developers/zarr-python/pull/2031/files#diff-ea60a3b75ede8b21aa1c9d2e5ab9be2a5321d43cdd750bc499aebb0e05400a26))