---
tags: draft
---

# CZI EOSS4 Reports

<details><summary>Josh editing</summary>

From https://chanzuckerberg.force.com/

* Zarr: a common backbone for the scalable storage of annotated tensor data (EOSS4) Due: 8/1/2022 RR-5765
* Zarr: a common backbone for the scalable storage of annotated tensor data (EOSS4) Due: 10/31/2023 RR-5766

# EOSS4 Interim Report #1 (RR-5765)

## I. Grant Overview

### Grantee Name

NumFOCUS Inc.

### Grantee Contact

Ms. Leah Silen
Executive Director
NumFOCUS Inc
P.O. Box 90596
Austin, TX 78709
Email: leah@numfocus.org

### Key Personnel

| Name | Email | Affiliation | GitHub Handle |
| ---- | ----- | ----------- | ------------- |
| Josh Moore | j.a.moore@dundee.ac.uk | University of Dundee, Scotland | joshmoore |
| Sanket Verma | | | |

## II. Financial Overview

### Proposed Budget (FYI)

Our proposed budget of $397,625.00, which was funded in full, consisted of:

* $150,000 for contract work
* $22,500 for contract overheads at 15%
* $165,000 for a Community Manager salary
* $28,875 for C.M. overheads at 15%
* $7,500 for C.M. travel
* $3,750 for C.M. hardware
* $20,000 for trademarking and other standardization efforts

### Budget narrative

* changes to original budget: Expect a surplus for both community manager and contractors
* challenges to spending:
  - hiring process throughout 2021, community manager began Jan 2022
  - Contractors on hold awaiting ZEP0001.
  - Unfortunately open-source developers keep contributing projects
    - xarray datatree
    - netcdf support for xarray (Deliverable whatever)
* plans for use:
  - assume extensions
  - advanced developers: sharding, IPFS
  - netcdf junior developer
  - with ZIC implemented, focus on community building (GSoC, blogs, etc.)

## III. Progress Overview

### Progress towards the deliverables

Below is a review of the most recently revised deliverables; more discussion follows:

* A1. API unification
  * Seamlessly interchangeable: data-apis, dask, numpy, napari, indexing
| SOME | |
* A2. FSSpec
  * fsspec zips, etc.
| TBD | |
* A3. netcdf
| SOME | |
* A4. xarray/multiscale
| Lots? | |
* A5. sparse arrays, GSoC?
| Some | |
* B1. upgrade tools
| None | |
* B2. test_suite verifications
| None (but GSoC) | |
* B3. OME/pangeo v3 specifications and extensions
| Some | |
* B4. project maturity (DNS, website, logo, etc.)
| Lots | |
  - Logo, trademarking
  - OGC standard
* No progress on: A2

- Community manager
- Hiring (NumFOCUS, etc.)
- Blogs
- ZEP...
- ZSC/ZIC/ZEP
- Xarray/B-Open

### Major changes in scope or project plan

Q: Where do we include sharding? As an extension in V3, since there is lots of community support?

--> sharding here

sharding
scalableminds potential new member, seeking funding

### Key outputs and project recognition

Community manager
ZEP
ZIC
OGC
xarray
OSSci???

- Blog and website revamping
- Assisting with the releases
- Social media coverage
- Speaking at conferences and meet-ups
- ZEP and ZIC formation
- Zarr participation in GSoC for the first time
- V3 progress
- Community calls engagement
- Surfacing important stuff for ZSC meetings
- OGC Standard
- Download numbers (PyPI, Conda, mamba)!?

-----

## TODO

Tweet:
- respond to https://twitter.com/notjustmoore/status/1432795729890877441

## Original deliverables

A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (Cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelization of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors.

### A. Strengthening bridges

EOSS 1 funded Zarr to: develop a v3 of the format specification, extend support to other programming languages, and solidify the project’s governance and operation. As a result, Zarr built bridges to several open-source projects.
We now seek to establish Zarr as a standard storage mechanism across these communities. Concretely, we propose the following list of independent milestones. Each has been discussed with the referenced projects and is suitable for contracting via NumFOCUS:

A1. We will work with array-providing projects NumPy and Dask on API unification. This will free developers to transparently choose between implementations, making algorithms more generalizable and scalable.

A2. We will work with the fsspec community to foster the fsspec-reference-maker specification. This effort will allow accessing non-Zarr files (HDF5, TIFF, Zip, etc.) as if they were Zarr.

A3. Similarly, we will work with the NetCDF community, a long-time provider of stable file format solutions, to have transparent access to both Zarr and NetCDF4 (HDF5-based) files.

A4. We will work with the Xarray community to formalize the multiscale array representation that resulted from Zarr’s EOSS 1 funding. The result will be a clear, public home for such cross-cutting community conventions.

A5. We will work with the community to identify and implement extensions, as defined in Zarr’s EOSS 1 work. Sparse arrays, e.g. from the Awkward Array project, are a first candidate.

### B. Building community and trust

Beyond these bridges, we seek support to continue fostering users and contributors into our own community.

B1. In addition to API stability, which is vital to growing OSS ecosystems, data formats have the additional long-term burden of preventing data loss. With new Zarr versions, data in older formats may become less accessible. To ensure data integrity, we will provide data producers with upgrade tools.

B2. To encourage the creation of new data in Zarr v3, we will engage with domain-specific organizations like pangeo and the Open Microscopy Environment to define specifications which meet their annotation needs, increasing the FAIRness of their tensor data.

B3.
A primary difficulty of our EOSS 1 grant, exacerbated by the pandemic, was providing timely feedback to paid and open-source developers on their work. We propose funding a dedicated community manager to that end.

B4. Finally, as part of community management, a number of project maturity tasks could be addressed, including: trademarking the Zarr logo, expanding the web presence, and shepherding the Zarr format through certification bodies, e.g., OGC, ISO. We hope with these activities to convince scientific communities to trust their long-term data to this young yet powerful format.

</details>