# Conformance tests and performance benchmarks

## Conformance tests

- [Zarr implementations page](https://zarr.dev/implementations/)
  - Manual, subjective interpretation and list
- [Zarr implementations repo](https://github.com/zarr-developers/zarr_implementations)
  - Matrix of implementations & codecs
  - PR to add docker support
    - Lots of work
  - Would benefit from a common CLI across implementations
- [v3 / v2 array examples](https://github.com/d-v-b/zarr-workbench/tree/main/v3-sharding-compat/data)
- [Zarr-Python regression tests](https://github.com/zarr-developers/zarr-python/blob/main/tests/test_regression/test_v2_dtype_regression.py) (comparing conformance between different Zarr-Python versions)
  - Uses a CLI to roundtrip data types between Zarr-Python V2 and Zarr-Python V3
- Joe: build environments from Zarr-Python for every month that read and write the same data
- Josh: Airspeed Velocity (asv) has been useful for NumPy
- Joe: Codspeed seems to be gaining traction as a managed benchmark tool
- Davis: could you define conformance as a subset of benchmarking?
- Possible names: Chunkmark or chunkbench
- Josh: https://github.com/ome/bioimage-latency-benchmark
- Hugo: should the benchmarks/conformance tests be centralized or decentralized?
  - Josh: reading is straightforward, writing is hard; how do we determine that serialized data is equivalent (for non-ordered structs)?
    - Chunks should be bit-by-bit identical
    - Metadata can be the same with different serialization due to codec aliases (need to canonicalize); see the equality sketch below
    - Enough homogeneity across primary implementations, except sharding
    - Equality should ignore chunk order; extra information doesn't matter
  - Eric: a chaos-monkey implementation that introduces noise. Davis: any degrees of freedom?
  - A centralized suite is harder to integrate with your own CI?
  - Probably some subset of both
  - Pick a canonical reader? Or have no canonical reader and define all input/output data in the tests
  - Seems important to have the full matrix
    - At least eventually
  - Defining data analytically is helpful to keep size low and reduce …
  - What are the stakes of the test?

### What do the tests actually look like?

- Cannot compare on disk
- Need to test that partial reads work
- Declaration of what the source data should be
- The CLI would need to be really simple
  - Cf. [h5dump](https://support.hdfgroup.org/documentation/hdf5/latest/_view_tools_view.html)?

### Which implementations does Zarr-Python ping about implications?

- Lachy -- Zarrs
- Jeremy -- tensorstore

#### What implementations should be pinged?

- N5?
- Zarr.js

### What are the goals?

- Maintaining a high bar of compatibility
- Implementations need to know what their goals should be
- Provide a list of what needs to be done and the order of priorities

### What is the matrix for conformance?

Prioritization:

- P0 - required for addition
- P4 - focus on last

Matrix axes (a sketch enumerating this matrix follows below):

- Zarr implementation
- V2 vs. V3
  - V3 (required)
  - V2 (not recommended)
- Codecs
- Data types
- Dimensionality
- Sharded vs. unsharded
- Storage backends (**local vs. S3** vs. GCS vs. Azure)
  - Minimal: non-listing S3

Network testing could be a separate set of tests:

- Where should we put these?
  - GitHub with MinIO

Inspiration:

- Array API
- Browser compatibility
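A minimal sketch of the equality check discussed above, assuming local directory stores and the Zarr v3 on-disk layout; the `CODEC_ALIASES` table and the exact canonicalization rules are placeholders, not anything the spec defines today:

```python
import json
from pathlib import Path

import numpy as np
import zarr

# Hypothetical alias table: map every spelling of a codec name to one
# canonical name before comparing metadata documents.
CODEC_ALIASES = {"numcodecs.blosc": "blosc"}


def canonical_metadata(store_path: Path) -> str:
    """Load zarr.json and normalize the parts that may legitimately differ."""
    meta = json.loads((store_path / "zarr.json").read_text())
    for codec in meta.get("codecs", []):
        codec["name"] = CODEC_ALIASES.get(codec["name"], codec["name"])
    # sort_keys=True removes key-order differences between serializers.
    return json.dumps(meta, sort_keys=True)


def arrays_equivalent(path_a: Path, path_b: Path) -> bool:
    """Decoded values must match exactly (pass equal_nan=True to
    np.array_equal for float data containing NaNs); metadata must match
    after canonicalization. Chunk order and extra files are ignored."""
    a = zarr.open_array(str(path_a), mode="r")
    b = zarr.open_array(str(path_b), mode="r")
    return np.array_equal(a[...], b[...]) and (
        canonical_metadata(path_a) == canonical_metadata(path_b)
    )
```

Comparing decoded values rather than chunk files sidesteps the "cannot compare on disk" problem, at the cost of trusting the reader used for the comparison.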
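A companion sketch enumerates the conformance matrix itself; the axis values and priority tags below are illustrative assumptions, not an agreed list:

```python
from itertools import product

# Each axis value carries a priority tag so implementations can start
# with the small P0 slice and grow toward the full matrix.
AXES = {
    "zarr_format": [("v3", "P0"), ("v2", "P4")],
    "codec": [("blosc", "P0"), ("gzip", "P0"), ("zstd", "P1")],
    "dtype": [("uint8", "P0"), ("float64", "P0"), ("complex128", "P2")],
    "dimensionality": [("1d", "P0"), ("3d", "P1")],
    "sharding": [("unsharded", "P0"), ("sharded", "P1")],
    "store": [("local", "P0"), ("s3", "P1"), ("gcs", "P2"), ("azure", "P2")],
}


def cases(cutoff: str = "P0"):
    """Yield test-case IDs whose axis values are all within the cutoff.
    Plain string comparison works because the tags run P0..P4."""
    for combo in product(*AXES.values()):
        if all(priority <= cutoff for _, priority in combo):
            yield "-".join(value for value, _ in combo)


print(list(cases("P0")))  # the slice required for addition
```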
## Other

- Maximum dimensionality (limited by max array)
- We should make ObjectStore the default and include it in the required dependencies
  - How will this impact future S3-specific stores?

## What is the right number of Zarr implementations?

- Depends on the use case and motivations
  - e.g., some people just want Zarr in a very specific language
- Native implementations vs. bindings

## Performance benchmarks

* Memory: https://github.com/tomwhite/memray-array
* Obstore / Fsspec comparison: https://github.com/maxrjones/zarr-obstore-performance
* Microbenchmarks for comparing how Zarr represents byte ranges: https://github.com/maxrjones/zarr-byterangerequest-microbenchmarks (a toy read-timing sketch appears at the end of these notes)
* https://github.com/zarrs/zarr_benchmarks/
* https://github.com/HEFTIEProject/zarr-benchmarks/ (V2 only?)
* https://github.com/zarr-developers/zarr-benchmark (dead)
* https://github.com/pangeo-data/benchmarking
* https://github.com/ome/bioimage-latency-benchmark (with TIFF and HDF5, Python & JavaScript)
* https://www.nature.com/articles/s41592-021-01326-w/figures/1
* Earthmover benchmarks
  * [Writeup](https://www.notion.so/earthmover/Icechunk-Performance-241492ee309f8067b3fdc4edae8c6229?source=copy_link)
  * [Gist](https://gist.github.com/rabernat/0f0b71f1764fb8345f2db2a1143d24e1)
* Zarr-Xarray performance tracking from Joe

![image](https://hackmd.io/_uploads/SyW14_5pgg.png)
![image](https://hackmd.io/_uploads/HkSx4O9axx.png)

## What's next?

* Josh - mind-meld with folks interested in zarr-implementations
  * 600 tests
    * Should be reduced
* Eric - mind-meld, probably not working afterwards
* Davis - example data for every data type and codec for spec extensions
* Ryan - try to solve issues (likely around memory copies) and publish blog posts
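As referenced from the performance benchmarks list, a toy read-timing sketch in the spirit of those repositories; the array shape, chunking, and repetition count are illustrative assumptions, not a proposed standard benchmark:

```python
import time

import numpy as np
import zarr

# Write a small chunked array to a local store.
arr = zarr.open_array("bench.zarr", mode="w", shape=(4096, 4096),
                      chunks=(256, 256), dtype=np.float32)
arr[:] = np.random.default_rng(0).random((4096, 4096), dtype=np.float32)

# Time a whole-array read against a single-chunk partial read.
for label, selection in [("full read", np.s_[:, :]),
                         ("one chunk", np.s_[:256, :256])]:
    start = time.perf_counter()
    for _ in range(5):
        _ = arr[selection]
    elapsed = (time.perf_counter() - start) / 5
    print(f"{label}: {elapsed * 1e3:.1f} ms")
```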