# Conformance tests and performance benchmarks
## Conformance tests
- [Zarr implementations page](https://zarr.dev/implementations/)
- Manually maintained list based on subjective interpretation
- [Zarr implementations repo](https://github.com/zarr-developers/zarr_implementations)
- Matrix of implementations & codecs
- PR to add docker support
- Lots of work
- Would benefit from common CLI across implementations
- [v3 / v2 array examples](https://github.com/d-v-b/zarr-workbench/tree/main/v3-sharding-compat/data)
- [Zarr python regression tests](https://github.com/zarr-developers/zarr-python/blob/main/tests/test_regression/test_v2_dtype_regression.py) (comparing conformance between different Zarr Python versions)
- Uses a CLI to roundtrip data types between Zarr-Python V2 and Zarr-Python V3
- Joe: built Zarr-Python environments for every month and had each one read and write the same data
- Josh: Airspeed Velocity (asv) has been useful for NumPy
- Joe: Codspeed seems to be gaining traction as a managed benchmark tool
- Davis: could you define conformance as a subset of benchmarking?
- Chunkmark or chunkbench
- Josh: https://github.com/ome/bioimage-latency-benchmark
- Hugo: should the benchmarks/conformance tests be centralized or decentralized?
- Josh: reading is straightforward, writing is hard; how do we determine that serialized data is equivalent (for non-ordered structs)?
- Chunks should be bit-by-bit identical
- Metadata can be the same with different serialization due to codec aliases (need to canonicalize)
- Enough homogeneity across primary, except sharding
- Equality should ignore chunk order; extra information doesn't matter (see the sketch after this list)
- Eric: a chaos monkey implementation that introduces noise. Davis: any degrees of freedom?
- Harder to integrate centralized with your CI?
- Probably some subset of both
- Pick a canonical reader, or have no canonical reader and define all input/output data in the tests?
- Seems important to have the full matrix
- At least eventually
- Defining data analytically is helpful for keeping the test data small
- What are the stakes of the test?
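For illustration, an equivalence check along these lines could compare decoded data plus a canonicalized subset of the metadata, using Zarr-Python as the reader. This is a sketch only: the metadata fields compared are illustrative, and a stricter variant would also compare the raw chunk bytes as an unordered set.

```python
# Illustrative equivalence check between two stores written by different
# implementations: compare decoded data and a canonicalized metadata subset
# rather than the raw bytes on disk, so codec aliases, key order, and chunk
# ordering do not cause spurious failures.
import numpy as np
import zarr


def arrays_equivalent(store_a: str, store_b: str) -> bool:
    a = zarr.open_array(store_a, mode="r")
    b = zarr.open_array(store_b, mode="r")

    # Canonicalized metadata subset (illustrative, not an agreed-upon set).
    if (a.shape, str(a.dtype), a.chunks) != (b.shape, str(b.dtype), b.chunks):
        return False

    # Decoded data must match exactly; extra files and chunk order are ignored.
    return bool(np.array_equal(a[...], b[...]))
```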
### What do the tests actually look like?
- Cannot simply compare the bytes on disk
- Need to test that partial reads work
- Declaration of what the source data should be
- The CLI would need to be really simple (a sketch follows this list)
- Cf. [h5dump](https://support.hdfgroup.org/documentation/hdf5/latest/_view_tools_view.html)?
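As a rough illustration, a dump-style CLI in the spirit of h5dump might look something like the following when built on Zarr-Python; the tool name and flags are hypothetical, and each implementation would ship its own equivalent.

```python
#!/usr/bin/env python
"""Hypothetical "zarr-dump" sketch (name and flags are illustrative).

Prints metadata plus a digest of the decoded data, so outputs from two
implementations can be compared without byte-comparing files on disk.
"""
import argparse
import hashlib

import zarr


def main() -> None:
    parser = argparse.ArgumentParser(description="Dump a Zarr array for conformance checks")
    parser.add_argument("store", help="path or URL of the Zarr store")
    parser.add_argument("--array", default=None, help="path of the array within the store")
    args = parser.parse_args()

    arr = zarr.open_array(args.store, path=args.array, mode="r")
    data = arr[...]  # decode every chunk through the codec pipeline

    print(f"shape={arr.shape} dtype={arr.dtype} chunks={arr.chunks}")
    # Digest of the decoded (in-memory) data, not the stored chunk bytes,
    # so differences in codec aliases or chunk layout do not matter here.
    print("sha256:", hashlib.sha256(data.tobytes()).hexdigest())


if __name__ == "__main__":
    main()
```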
### Which implementations does Zarr-Python currently ping?
- Lachy -- Zarrs
- Jeremy -- tensorstore
#### What implementations should be pinged?
- N5?
- Zarr.js
### What are the goals?
- Maintaining a high bar of compatibility
- Implementations need to know what their goals should be
- Provide a list of what needs to be done and the order of priorities
### What is the matrix for conformance?
Prioritization (a parameterized test sketch follows at the end of this section):
- P0 - required for addition
- P4 - focus on last
- Zarr implementation
- V2 vs. V3
- V3 (required)
- V2 (not recommended)
- Codecs
- Data types
- Dimensionality
- Sharded vs. unsharded
- Storage backends (**local vs. S3** vs. GCS vs. Azure)
Minimal:
- non-listing S3
Network testing could be a separate set of tests
- where should we put these?
- GitHub Actions with MinIO
Inspiration:
- Array API
- Browser compatibility
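A minimal sketch of how part of this matrix could be parameterized with pytest, showing only the Zarr-Python leg; the codec and dtype lists are illustrative (not the agreed P0 set), and other implementations would plug in through the common CLI discussed above.

```python
"""Sketch of parameterizing part of the conformance matrix with pytest."""
import numpy as np
import pytest
import zarr
from zarr.codecs import BloscCodec, GzipCodec, ZstdCodec

CODECS = [BloscCodec(), GzipCodec(), ZstdCodec()]
DTYPES = ["uint8", "int32", "float64"]


@pytest.mark.parametrize("codec", CODECS, ids=lambda c: type(c).__name__)
@pytest.mark.parametrize("dtype", DTYPES)
def test_roundtrip(tmp_path, codec, dtype):
    # Analytically defined reference data keeps the test suite small.
    expected = np.arange(64 * 64).reshape(64, 64).astype(dtype)
    store = str(tmp_path / "test.zarr")
    z = zarr.create_array(
        store, shape=expected.shape, chunks=(16, 16), dtype=dtype, compressors=[codec]
    )
    z[...] = expected
    # A real conformance test would re-read the store with a *different*
    # implementation (zarrs, tensorstore, ...) instead of Zarr-Python.
    actual = zarr.open_array(store, mode="r")[...]
    np.testing.assert_array_equal(actual, expected)
```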
## Other
- Maximum dimensionality (limited by max array)
- We should make ObjectStore the default and include it in the required dependencies
- how will this impact future S3 specific stores?
## What is the right number of Zarr implementations?
- Depends on the use-case and motivations
- e.g., some people just want Zarr in a very specific language
- Native implementations vs bindings
## Performance benchmarks
* Memory https://github.com/tomwhite/memray-array
* Obstore / Fsspec comparison - https://github.com/maxrjones/zarr-obstore-performance
* Microbenchmarks for comparing how Zarr represents byte ranges - https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks
* https://github.com/zarrs/zarr_benchmarks/
* https://github.com/HEFTIEProject/zarr-benchmarks/ (V2 only?)
* https://github.com/zarr-developers/zarr-benchmark (dead)
* https://github.com/pangeo-data/benchmarking
* https://github.com/ome/bioimage-latency-benchmark (with TIFF and HDF5, python & javascript)
* https://www.nature.com/articles/s41592-021-01326-w/figures/1
* Earthmover Benchmarks
* [Writeup](https://www.notion.so/earthmover/Icechunk-Performance-241492ee309f8067b3fdc4edae8c6229?source=copy_link)
* [Gist](https://gist.github.com/rabernat/0f0b71f1764fb8345f2db2a1143d24e1)
* Zarr-Xarray performance tracking from Joe


## What's next?
* Josh - mind-meld with folks interested in zarr_implementations
* 600 tests
* the number should be reduced
* Eric - join the mind-meld, but probably not working on it afterwards
* Davis - example data for every data type and codec, for spec extensions
* Ryan - try to solve issues (likely around memory copies) and publish blog posts