# Sharding Breakout
```
### Improving ease-of-use and performance of sharding
*7 adopter mentions*
- **Lead It**: Mark Kittisopikul (@markkitti)
- **Work On It**: @perlman, @jakirkham
- **Join If Space**: @LDeakin, @keewis, @malmans2, @rouault, @markkitti, @maxrjones
```
## Sub-topics for discussion
1. Who is using sharding and what is the target optimization?
- What are we trying to optimize? Performance?
- Choices?
- Performance
- writing
- reading
- concurrency
- Discussion
- Address the large number of small files
- as an alternative to having large chunks
- same number of read requests
- overhead (extra inodes with / separator)
- problems on (networked) file systems (on clusters)
- metadata iops overhead
- Precedence: Neuroglancer pre-computed
- Why not default?
- Precedence, no sharding
- zarr-python
- No coalescing
- tensorstore
- cache headers, transactions
- Dask + Zarr
- byte range coalescing
- Z-order discussion / example
- Zarr of Zarrs
- Could this be a VirtualiZarr approach where the chunk is a file format?
- Currently chunk indices are used. Could we use hashes for load balancing benefits?
- How can we make writing better?
- Could we pad chunks to all be the same size to allow jumping to an offset and writing it out?
- There is interest in direct acquisition cases to write results directly out in Zarr. How can we write directly in sharded/chunked format?
- Icechunk has compaction, which could allow writing normal chunks that are optimized in the backend. IOW should Zarr be a database?
- Concurrency
- Zarr Python v3 has thread and async, which can create contention between other libraries also doing threading (like Dask, Cubed/executor, etc.).
-
2. APIs (application programming interface)
- Experience with existing implementations
- zarr-python
- tensorstore
- Discussion
- Zarr Python
- https://zarr.readthedocs.io/en/stable/user-guide/arrays.html#sharding
- https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.ShardingCodec
- encode and decode are async
- Index codecs could be a useful extension point
- https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.ShardingCodecIndexLocation
- How does transpose and shard interact?
- There might be ways of combining that would break (transposing shard index)
- Eric: Convenience APIs would be useful
- Might already have, but would be good to confirm these are working as expected.
- Try the example in the zarr-python docs
- TensorStore
- https://google.github.io/tensorstore/python/tutorial.html
- Everything works by passing a large `dict`
- Zarr v3 Sharding
- https://google.github.io/tensorstore/driver/zarr3/index.html#json-driver/zarr3/Codec/sharding_indexed
- Need to call `.result()` or `await` to launch compute and get a result
- Transactions
- https://google.github.io/tensorstore/python/api/tensorstore.Transaction
- Can be used to combine multiple changes and batching them
-
3. Implementation of sharding
- It is challenging. How to make this easier?
- GDAL
- Basic versus advanced implementations
- Support Zarr v3 without sharding (SHOULD VS MUST)
- 1-level of sharding
- n-levels
-
- Discussion
- Several compliance levels of implementing Zarr v3
- Is there anything that needs to change with shards?
- Is recursion useful? How would this be useful?
- What spec changes would be worthwhile?
- Maybe a standalone index?
- Could help with concurrency (lock index)
- Appending
- Would padding be useful?
- VirtualiZarr / TIFF (w/tiled JPEGs or pages)
- Learning from COG
- https://element84.com/software-engineering/is-zarr-the-new-cog/
- Could Zarr be written in a way that is compatible with other file formats
- Maintaining headers needed for compatibility with other file formats
- https://github.com/mkitti/simple_image_formats/blob/main/header_formats.ipynb
4. Recursive (nested) sharding
5. As a variable length codec?
6. Where to put the variable length information?
- External chunk index?
7. Specification