# SciPy 2023 Proposal

# Talk 🎙️

### Proposal submitted! Check here: https://cfp.scipy.org/2023/talk/review/JRFX9N987YHCUKJLD8ZESCECQJVPC8FB

## Title:

Preference:

- Zarr: community specification of large, cloud-optimized, N-dimensional, typed array storage ([name=Sanket]: Looks good!)
- "tensors" rather than "n-dimensional typed arrays"?

Other possible titles:

- Evolution of the Zarr Specification
- Zarr: A community-owned data format specification
- Maintenance and evolution of Zarr throughout the years

## Session Type:

Talk (30 minutes)

## Abstract:

A key feature of the Python data ecosystem is its reliance on simple but efficient primitives that follow well-defined interfaces, so that tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. [Zarr](https://zarr.dev/) provides a missing feature, namely scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the [Zarr specification](https://zarr.readthedocs.io/en/stable/spec/v2.html) enables the storage of large out-of-memory datasets both locally and in the cloud. [Implementations](https://github.com/zarr-developers/zarr_implementations) exist in [C++](https://github.com/google/tensorstore/), [C](https://github.com/Unidata/netcdf-c), [Java](https://github.com/bcdev/jzarr), [JavaScript](https://github.com/gzuidhof/zarr.js), [Julia](https://github.com/meggart/Zarr.jl), and [Python](https://github.com/zarr-developers/zarr-python).

In this presentation, we will discuss the evolution of Zarr, first introduced at [SciPy 2019](https://youtu.be/qyJXBlrdzBs); the development of the [Zarr Enhancement Process (ZEP)](https://zarr.dev/zeps/) and its use to define the next major version of the [Zarr Specification (V3)](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html); as well as uptake of the format across the research landscape.

*QUESTION: mention relationship to arrow, etc. in the tabular space?*

## Description:

Zarr is a data format for storing chunked, compressed N-dimensional arrays. Zarr is based on an open-source technical specification and has implementations in several languages. Zarr is a [NumFOCUS sponsored project](https://numfocus.org/project/zarr).

### Outline:

First, we’ll be talking about:

### Introduction and Working of Zarr (10 mins.)

- What is Zarr, and how does it work?
- The inner workings of Zarr using illustrated graphics
- When and why should you use Zarr?
- Extensive pluggable compressors (via [numcodecs](https://github.com/zarr-developers/numcodecs/)) and file-storage systems
- What is the [Zarr Specification](https://zarr.readthedocs.io/en/stable/spec/v2.html)?
- A summary of the technical specification of Zarr
- Adoption of the Zarr specification in various programming languages such as Python, C, C++, Java, and JavaScript, and how all of us form a wonderful community together
- Development of Zarr since it was first presented at SciPy 2019 by Alistair Miles
- Highlighting some important technical and community milestones since 2019
- Securing grants from [CZI](https://chanzuckerberg.com/eoss/proposals/zarr-a-common-backbone-for-the-scalable-storage-of-annotated-tensor-data/) and getting sponsored by NumFOCUS

After this:

### Usage of Zarr across several domains (5 mins.)

- Interoperability with Dask, Xarray and NumPy
- Adoption of Zarr by various communities like Geospatial, Bio-imaging, Genomics, Data Science/Engineering etc.
- Development of convention processes like [GeoZarr](https://github.com/zarr-developers/geozarr-spec) and [OME-Zarr](https://github.com/ome/ome-zarr-py)

Then we’ll discuss the:

### [ZEP Process](https://zarr.dev/zeps/) (10 mins.)

- Need for and origin of a community feedback process for the evolution of the Zarr specification
- How does it work?
- Transformation from a steering-council-governed to a community-owned specification
- Learnings from migrating from [Spec V2](https://zarr.readthedocs.io/en/stable/spec/v2.html) → [Spec V3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html)

And finally, closing with:

### Conclusion (5 mins.)

- Key takeaways
- How can you get involved?
- Q&A

This talk is aimed at an audience that works with large amounts of data and is looking for a format that is transparent, open-source, reliable, cloud-optimised, and friendly to the environment. We’d also like to invite anyone interested in the lessons we learnt from maintaining the project over the years. The tone of the talk will be informative, story-telling and fun.

### After this talk, you’d:

- understand the basics of Zarr and its specification,
- know why you should have a process for your project,
- have essential takeaways about an OSS project's transition from a young to a mature stage,
- as well as the pros and cons of a steering-council-governed vs. a community-owned open-source project

### Notes (Optional):

### Session Image (Optional):

### Additional Speaker:

- Josh Moore
- John A. Kirkham
- *Ryan Abernathey?*

---

# Sprint 🏃🏻‍♂️

### Registration for sprints will open later in May, 2023