
# Zarr-Python Refactor Meeting Notes

## May 8, 2024

#### Attendees

- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Norman Rzepka / @normanrz
- Sanket Verma
- Alden Keefe Sampson (AKS)
- Akshay Subramaniam
- John Kirkham

Notes:

- Progress update ([project board](https://github.com/orgs/zarr-developers/projects/5/views/2))
    - numcodecs codecs: [numcodecs#524](https://github.com/zarr-developers/numcodecs/pull/524)
    - zstd in numcodecs needs a review: [numcodecs#519](https://github.com/zarr-developers/numcodecs/pull/519)
    - `HybridCodecPipeline` (interleaved with configurable batch size) needs a review: [#1670](https://github.com/zarr-developers/zarr-python/pull/1670)
    - Runtime configuration? [#1772](https://github.com/zarr-developers/zarr-python/pull/1772)
    - Batched store discussion
    - Store metadata methods: [zarr-python#1851](https://github.com/zarr-developers/zarr-python/pull/1851)
    - Initial `NDBuffer` implementation: [zarr-python#1826](https://github.com/zarr-developers/zarr-python/pull/1826)
- Proposed new meeting times
    - week 1: Friday 7a PT
    - week 2: Thursday 3p PT

Notes:

Major updates

- JH:
    - implicit groups are gone :)
- NR: codecs are getting into a good place
    - new rev on batched pipeline
- DB:
    - out last week, getting back into it
    - group tests
    - need a decision about removing v2 code paths
        - should go now
- AKS:
    - open PR generalizing array types

TODOs:

- add tests for ***v2*** and v3 arrays

## April 24, 2024

#### Attendees

- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Jack Kelly / @jackkelly
- Ryan Abernathey / @rabernat
- Max Jones / @maxrjones
- Sanket Verma
- Akshay Subramaniam
- Norman Rzepka / @normanrz

Notes:

- Progress update ([project board](https://github.com/orgs/zarr-developers/projects/5/views/2))
    - Codecs
        - Norman needs a review on [#1670](https://github.com/zarr-developers/zarr-python/pull/1670)
    - Store API [#1806](https://github.com/zarr-developers/zarr-python/issues/1806)
        - discussion around batch vs interleaved API
        - someone could look
at https://github.com/zarr-developers/zarr-python/pull/1661
    - Group API

## April 22, 2024

#### Attendees

- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
- Josh Moore / @joshmoore
- Sanket Verma
- John Kirkham
- Martin Durant
- Ryan Abernathey
- Davis Bennett
- Akshay Subramaniam

Excused: Juan Nuñez-Iglesias

#### Goals

1. Make sure we're all on the same page with what has been going on with the project
2. Organize around v3 efforts

#### Agenda

- Recent efforts
    - [Updates to core team](https://github.com/zarr-developers/zarr-python/blob/main/TEAM.md) (JH)
        - Moved some to emeritus status, etc.
        - We should work to get more core devs. (Lots to do)
        - RA: candidates? JH: let's get people making commits for a while.
    - meeting (JM)
        - propose to make it the regular meeting but find a time where everyone can join.
        - all aboard :tada:
- Zarr-Python 3 update (JH)
    - [Design doc](https://github.com/zarr-developers/zarr-python/blob/main/v3-roadmap-and-design.md)
    - Progress update ([project board](https://github.com/orgs/zarr-developers/projects/5/views/2) + notes below 👇)
    - Ambitious release schedule ([#1777](https://github.com/zarr-developers/zarr-python/issues/1777))
        - loose plan: May/alpha, June/release
        - roughly following the Pydantic 2 model (breaking API changes)
        - JM: all for getting pre-releases out
        - NR: need to get rid of the v3 folder (messes things up)
        - `support/v2` branch may be useful (JM)
        - JM: we probably should be pushier
            - JH: sure, just our 100% confidence may lag
            - NR: define a window, e.g., by the end of the year everyone should move.
        - JH: pinned issue with the release plan? Yes.
            - https://github.com/zarr-developers/zarr-python/issues/1777
- v3 update
    - DB:
        - v3 **metadata** is done, i.e., can create spec compliant v3 arrays
        - working on **groups** that would work as expected, e.g., listing children (one of 2 big PRs). nearly done. required getting into the async implementation which is one of the biggest changes for the storage layer.
Also means that we're not able to just paste in old code.
        - high-level **convenience** APIs are not there
        - only the nucleus of a testing strategy. using a different strategy from v2. bringing in what we can.
    - NR:
        - codecs are pretty advanced. async ...
        - MD: that means thread pools? NR: Yes. They can choose how they do that.
        - JH: core part of that is in the v3 release that will spend time on async/threading/scheduling. Lots of new behavior that we're going to learn about. But now we have an API that can be tuned.
        - arrays are missing **indexing**
        - **documentation** is largely open. (pushed to post alpha for the launch)
    - JH:
        - on our way towards having 100% **type hints**
        - abstract **base classes** for the Store and Codecs that allow people to implement their own (outside of zarr-python) with an entrypoint system. perhaps something for **chunk grids** as well
        - store is no longer a mutable mapping but a custom class. list methods are async generators, etc. all synchronization happens upstream. Synchronize wrapper of Arrays and Groups, but wait until you're at the top-level API for sync.
        - build is cleaner. using **hatch**.
        - need to discuss **numcodecs**. currently isn't taking part in the protocol system. what does it mean to Zarr going forward?
            - https://github.com/zarr-developers/numcodecs/issues/502
- Discussion
    - JK: documented path for upgrading? JH: no, there's an open ticket. Need docs on upgrading code but also **migrating data**, e.g., metadata only changes. (Alistair did this for v1->v2).
    - MD: need to discuss what **kerchunk** is going to do. it will take some pretty deep working. the style of where the metadata is has changed (along with the codecs). JH: yeah, no filters, all one pipeline. i.e., just the metadata. MD: more involved (i.e. goes deeper) into Zarr than other things. RA: zarr data model that is independent of the spec could be super useful. DB: don't think there is an overlap of v2 and v3 arrays. i.e., it would be a UNION.
you need to map between the names and the types. don't get that for free just with the hierarchies. RA: don't do it once off, but build something re-usable and then serialize those. JH: clearly separated metadata from the classes. can turn one dataclass into another dataclass. (work to be done)
    - DB: spec allows **v2/v3 things to be mixed**, so a coroutine of some form may need to be opinionated about what it prioritizes. JH: good point, since you might have to look for 4 things, or prioritize one or the other. we should just be clear and then let people suffer the consequences. RA: have some shim functions similar to the current `open()` which keeps things working. JH: zarr.open has a version flag. None could mean do both.
    - **implicit groups** (DB) basically anything is a group even if it has garbage in it. NR: haven't seen anyone who is against removing it.
        - DB: if so, also make mixing versions disallowed? JM: can we allow a complete mixing? JH: don't want to be polymorphic about children. DB: can't **forbid** having .zattrs in a v3 group/array. Agreed.
        - JM: if need be, can try to organize a ZIC meeting with SV.
- Numcodecs (NR)
    - Opened PR today if someone wants to review that, but more generally where are we with numcodecs
    - have specialized codec classes in v3 branch. arrays-to-bytes, etc. etc. Different classes from those in numcodecs. Do use it under the hood.
    - for v2 support in the v3 branch we use the code unchanged. we ask numcodecs to do it for us. we could pull that into the v3 arrays which would give us support for batching, async, etc. we will likely need some glue code. (that's with minimal effort). Do we move numcodecs in a direction such that it uses v3 abstract classes?
    - DB: I like the idea of there being a repo on github for people to go to. numcodecs should exist where we have these compression routines at a low level. It should be there to support zarr.
    - NR: **closed list**. How do we handle that?
        - DB: spec says that it just needs a URL.
        - NR: what if two implement blosc2 differently.
        - DB: people are going to do what they want.
        - NR: make it more difficult? or use the github URL to prefix?
        - JH: raise a warning that users can turn off if they aren't using an approved list of codecs. for experimentation, we definitely want to make it possible (and **easy**). That's what zarr-python is known for.
        - DB: what's the advantage of enumerating a list of codecs?
        - NR: when creating an implementation, you can just follow the list.
        - JM: allows us to say, "this implementation is not complete". plus also :+1: for a schema where possible.
        - NR: ok to have optional codecs
    - JH: open issue with `tifffile` of a missing default flag (size parameter?)
        - https://github.com/cgohlke/tifffile/issues/211
    - RA: a way out of this is to outsource as much as we can with blosc, has an ambition of being a meta compressor
    - AS: the issue with blosc as the main library is that it also has sharding etc. under the hood. better IMO to just expose the compression stream formats. blosc is less flexible than numcodecs currently. (more difficult to add new compressors or options)
    - AS: gzip links to RFC not an implementation, i.e., a specific stream format. this is also an issue with numcodecs lz4. would be good to have these written down and link to a **spec**.
    - Continue conversation on https://github.com/zarr-developers/numcodecs/issues/502
- [NASA Funding Opportunity](https://nspires.nasaprs.com/external/solicitations/summary.do?solId=%7B910CC61E-4616-9958-C26F-F8D9BC5AB8D9%7D&path=&method=init) (JH)
    - Planning to submit a LOI next week
    - targeting Zarr-Python, v3 feature development, and support at least 3 years
- TODOs
    - [ ] find a new time for the bi-weekly meeting. becomes zarr-python dev meeting but open invitation to anyone who would want to join.

#### Notes

## April 10, 2024

- SV: Zarr-Python B&P meeting discontinued
    - rename refactor meeting to 'Zarr-Python meeting'?
- JH: updates since last meeting
    - [Project board](https://github.com/orgs/zarr-developers/projects/5/views/2)
        - Many new issues!
    - Discuss timeline [#1777](https://github.com/zarr-developers/zarr-python/issues/1777)

Active work

- DB: removing old v3
- JH: move v3 dir to root, remove v2 stuff
- JH: `list_*` (`AsyncGenerator[str]`)
    - stream through `members`: https://github.com/zarr-developers/zarr-python/pull/1782/files#r1558820360
    - https://stackoverflow.com/questions/78301926/asyncio-creating-a-producer-consumer-flow-with-async-generator-output
- AS: generalized NDarray support
    - two options for the design in https://github.com/zarr-developers/zarr-python/issues/1751
    - can we use c++ for this? will make zero-copy memory sharing easier
    - Qs: what does it mean for development process and pyodide support?
- MJ: merged CI updates
    - looking for the next thing

## April 5, 2024

- Davis Bennett / @d-v-b (DB)
- Norman Rzepka / @normanrz (NR)
- Joe Hamman / @jhamman
- Deepak Cherian / @dcherian

### Todo list :point_down:

Note: the topics listed below have been converted to issues and placed on the v3 project board: https://github.com/orgs/zarr-developers/projects/5

- p0 - must happen now
- p1 - must happen before alpha release (target first week of May)
- p2 - must happen before 3.0 release (target early June)
- p3 - nice to have, can happen after 3.0 release

* Arrays
    * [p1] [Finalize codec API, including codec pipeline](https://github.com/zarr-developers/zarr-python/issues/1659)
    * [p1] [Try out codec entry points](https://github.com/zarr-developers/zarr-python/issues/1748)
    * [p1] [Array indexing feature parity with zarr-python 2](https://github.com/zarr-developers/zarr-python/issues/1749)
    * [p2] [Resolve numcodecs question](https://github.com/zarr-developers/zarr-python/issues/1750)
    * [p2] [Generalized array support (numpy, cupy, jax, torch etc)](https://github.com/zarr-developers/zarr-python/issues/1751)
    * [p3] [ChunkGrid API -- do we need it now (or
ever)?](https://github.com/zarr-developers/zarr-python/issues/1752)
* Groups
    * [p1] [implement members / children](https://github.com/zarr-developers/zarr-python/issues/1753)
    * [p3] [(reach) declarative hierarchy API](https://github.com/zarr-developers/zarr-python/issues/1754)
* Store
    * [p0] [Finalize store API](https://github.com/zarr-developers/zarr-python/issues/1755)
    * [p1] [Deprecate stores we don't want in Zarr-Python core anymore](https://github.com/zarr-developers/zarr-python/issues/1756)
    * [p1] [remote store support (s3, gcs, azure, http)](https://github.com/zarr-developers/zarr-python/issues/1757)
    * [p3] [request coalescing: where to implement?](https://github.com/zarr-developers/zarr-python/issues/1758)
    * [p3] [Storage transformer API -- do we need it now (or ever)?](https://github.com/zarr-developers/zarr-python/issues/1718)
    * [p3] (reach) http proxy - may not need to be in zarr-python
    * [p3] [(reach) caching bytes-chunks in the store, caching array chunks in the array](https://github.com/zarr-developers/zarr-python/issues/1500)
* Tests
    * [p1] [Bring in as much of the existing test suite as possible](https://github.com/zarr-developers/zarr-python/issues/1759)
    * [p1] [Test serialization of Arrays](https://github.com/zarr-developers/zarr-python/issues/1760)
    * [p1] [Add test that instruments traffic to the store -- we should be very careful to only read what is needed](https://github.com/zarr-developers/zarr-python/issues/1761)
    * [p2] [Develop integration test suite in Zarr-Python -- needed to validate new async tooling (could be xarray and Dask)](https://github.com/zarr-developers/zarr-python/issues/1762)
    * [p2] [Coordinate downstream testing (e.g.
Dask + Xarray)]
    * [p3] [Develop performance test suite in Zarr-Python](https://github.com/zarr-developers/zarr-python/issues/1763)
    * [p3] [Add hypothesis test hooks](https://github.com/zarr-developers/zarr-python/issues/1764)
* Docs
    * [p2] [Update developer docs](https://github.com/zarr-developers/zarr-python/issues/1765)
    * [p2] [Update API docs](https://github.com/zarr-developers/zarr-python/issues/1766)
    * [p2] [Update tutorial docs](https://github.com/zarr-developers/zarr-python/issues/1767)
    * [p2] [Write a doc about how Zarr-Python thinks about consistency and how it operates when concurrent writers are acting on a store/group/array/chunk](https://github.com/zarr-developers/zarr-python/issues/1768)
    * [p2] [Start a release doc summarizing the major changes in V3](https://github.com/zarr-developers/zarr-python/issues/1769)
    * [p3] [docs for extending zarr](https://github.com/zarr-developers/zarr-python/issues/1770)
        * how to write a custom store
        * how to subclass array / group (e.g., to have typed attributes, typed members)
    * How to get good performance
* Misc
    * [p1] [Top level zarr API (open, create)](https://github.com/zarr-developers/zarr-python/issues/1598)
    * [p1] [make mypy and pylint/ruff happy](https://github.com/zarr-developers/zarr-python/issues/1593)
    * [p1] [Dial in runtime config API -- e.g.
IO loop cannot be attached to](https://github.com/zarr-developers/zarr-python/issues/1772)
    * [p2] [`TypedDict` for all typed dictionaries](https://github.com/zarr-developers/zarr-python/issues/1773)
    * [p2] [develop synchronization API or declare it dropped](https://github.com/zarr-developers/zarr-python/issues/1596)
    * [p2] Add logging throughout
* Migration
    * [p3] 2 -> 3 conversion cli, (maybe in its own repo)
    * [p1] [remove v2 code](https://github.com/zarr-developers/zarr-python/issues/1771)

## March 27, 2024

- Davis Bennett (DB)
- Alden Keefe Sampson (AKS)
- Norman Rzepka / @normanrz (NR)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
- Max Jones / @maxrjones (MJ)
- Raphael Hagen / @norlandrhagen (RH)

### Meeting notes:

- Sanket: Bi-weekly meeting ends on May 1st, 2024 - shall we continue after that?
    - Yes! Schedule it until June end! - DONE
- DB: Fleshing out the group API in v3 https://github.com/zarr-developers/zarr-python/pull/1726
- NR: We need to find a common understanding of what we still need to work on for beta release. NR will create a tracking issue.
- Akshay: Generalized array support
    - Where to create issue to track this? zarr-python or zarr-specs? Any direction for structuring the issue and proposal?
    - Create a native zarr NDArray class for typing and to interface with existing protocols.
This includes:
        - Buffer protocol
        - `__array_interface__`
        - `__cuda_array_interface__`
        - DLPack
        - Raw pointers

    ```cpp
    #include <cstdint>
    #include <pybind11/pybind11.h>

    namespace zarr {

    namespace py = pybind11;
    using namespace py::literals;

    // Interface sketch only: zarrArrayInfo_t, zarrArrayBufferKind_t and the
    // array_info_ member are not defined in this snippet.
    class Array {
    public:
        Array(zarrArrayInfo_t* array_info, int device_id);
        Array(py::object o, intptr_t cuda_stream = 0);

        py::dict array_interface() const;
        py::dict cuda_interface() const;
        py::tuple shape() const;
        py::tuple strides() const;  // Strides of axes in bytes
        py::object dtype() const;
        zarrArrayBufferKind_t getBufferKind() const;  // Device or Host buffer
        py::capsule dlpack(py::object stream) const;  // Export to DLPack
        py::object cpu();  // Move array to CPU
        py::object cuda(bool synchronize, int device_id) const;  // Move array to GPU
        const zarrArrayInfo_t& getArrayInfo() const { return array_info_; }
        static void exportToPython(py::module& m);
    };

    }  // namespace zarr
    ```

    - Interoperability with Numpy

    ```python
    import numpy as np
    import zarr  # zarr.ndarray is the proposed module from this discussion

    ascending = np.arange(0, 4096, dtype=np.int32)
    zarray_h = zarr.ndarray.as_array(ascending)
    print(ascending.__array_interface__)
    print(zarray_h.__array_interface__)
    print(zarray_h.__cuda_array_interface__)
    print(zarray_h.buffer_size)
    print(zarray_h.buffer_kind)
    print(zarray_h.ndim)
    print(zarray_h.dtype)
    ```

    - Interoperability with Cupy

    ```python
    import cupy as cp

    # Continues from the Numpy example (reuses `ascending`).
    data_gpu = cp.array(ascending)
    zarray_d = zarr.ndarray.as_array(data_gpu)
    print(data_gpu.__cuda_array_interface__)
    print(zarray_d.__cuda_array_interface__)
    print(zarray_d.buffer_kind)
    print(zarray_d.ndim)
    print(zarray_d.dtype)
    ```

    - Convert CPU to GPU

    ```python
    zarray_d_cnv = zarray_h.cuda()
    print(zarray_d_cnv.__cuda_array_interface__)
    ```

    - Convert GPU to CPU

    ```python
    zarray_h_cnv = zarray_d.cpu()
    print(zarray_h_cnv.__array_interface__)
    ```

    - Anything that supports the buffer protocol

    ```python
    with open("file.txt", "rb") as f:
        text = f.read()

    zarray_txt_h = zarr.ndarray.as_array(text)
    print(zarray_txt_h.__array_interface__)
    zarray_txt_d = zarray_txt_h.cuda()
    print(zarray_txt_d.__cuda_array_interface__)
    ```

## March 13, 2024

- Joe Hamman / @jhamman (JH)
- Alden Keefe Sampson (AKS)
- Norman Rzepka / @normanrz (NR)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
- Max Jones / @maxrjones (MJ)
- Agriya Khetarpal / @agriyakhetarpal (AK)

### Meeting notes:

- Alden / top level API
    - https://github.com/zarr-developers/zarr-python/issues/1598#issuecomment-1994729420
    - In the v3 library, are the top level methods
        - the primary way users interact with the library? or
        - the smoothest on ramp for v2 library users into v3? (and Array.xxx and Group.xxx become primary)
        - something else?
        - Notes:
            - Joe's thought: We want to provide a pretty similar interface, help massage or raise errors when args not compatible. We can start deprecating and changing behavior
            - Norman: like Array. and Group entrypoints, but we need to have these top level entry points. promote the Array and Group classmethods in the docs
            - Joe: don't love polymorphism of .open, but it exists
    - The kwargs to any method that can create an array are currently v2 specific (filters vs codecs, etc), plus there are a number of performance/behavior modifying args (`cache_[metadata|attrs]`, `partial_decompress`, `write_empty_chunks`, dim separator). Do we
        - try not to change the api at all and try to translate into spec V3 land
        - make the kwargs actually match those of the Array.xxx v3 library methods, but also take in `**v2_kwargs` and translate where possible, checking for conflicts with v3 kwargs if provided
        - make the top level methods kwargs align with those of Array.create, etc
        - Norman: for open: make it compatible, you get back your method, for create could make
        - Runtime parameters: many of these we didn't know existed, can debate case by case
        - Joe: if a runtime param is provided that v3 doesn't use, raise an error
    - Currently zarr.open and similar will return an array if it exists, even if the existing array's dtype, codecs, etc don't match those provided.
        - Keep this?
        - Joe: think we should raise an error, wide agreement on this
- Norman / Batched and interleaved codec pipelines
    - https://github.com/zarr-developers/zarr-python/pull/1670
    - Hybrid interleaved batched codec pipeline
    - How to set runtime configuration?
    - Batched store API
    - BatchedCodecPipeline as abstract class that can be overridden by user code
    - Move thread dispatch from codec to pipeline to allow for coalescing and locality?
- Akshay / Generalized array support
    - Open issue and tag with v3
- Sanket / Summary for the core-devs → potential blog post in the future
    - JH: I can take this on -- target April?
    - SV: Sounds good!
- Agriya / Zarr Pyodide support, out-of-tree
    - Requires patching numcodecs, zarr here and there
        - Zarr is pure Python, so fewer patches there. Numcodecs needs more patches because it is Cython-based.
    - Already done by Pyodide devs per Pyodide/Emscripten release
        - The Emscripten and Pyodide versions are not decoupled yet
        - Leads to missing versions
    - Establish CI job that runs on PRs and nightly, or just nightly
        - Issue with this is maintainability and how to keep support?
    - Interactive documentation (end goal).
    - **Action item**: I will be opening an issue for this on the Zarr repository and link both previous discussions (the ones that I have found). Discussion may proceed there further

## February 28, 2024

- Joe Hamman / @jhamman (JH)
- Tom Nicholas / @TomNicholas
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
- Charles Stern
- Alden Keefe Sampson (AKS)

### Meeting notes:

- Numcodecs discussion
    - https://github.com/zarr-developers/numcodecs/issues/502
- How ready is the v3 branch for kerchunk-related experiments?
    - i.e.
chunk manifest ZEP, virtual concatenation ZEP
    - https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA?view
- v3 store discussion
    - https://github.com/zarr-developers/zarr-python/discussions/1686

## February 14, 2024

- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)

### Meeting notes:

- AS: Planning to send a couple of PRs around numcodecs and wanted to join the refactor meeting to get a sense of the current state of things
    - NR: https://github.com/zarr-developers/zarr-python/pull/1660
- AS: Plan to add the encode and decode batch in numcodecs and move the logic from Zarr-Python to numcodecs
    - NR: In the current codec mechanism there will be a place to add the encode/decode class
- SV: Also a question of how you'd add a new codec in V3 - https://zarr-specs.readthedocs.io/en/latest/v3/codecs.html
    - NR: Could use entrypoints for the new codec registrations
    - AS: New codecs are added via KvikIO
- AS: https://github.com/zarr-developers/zarr-python/issues/1398

## January 31, 2024

### Attendees

- Sanket Verma / @MSanKeys963 (SV)
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Max Jones / @maxrjones (MJ)
- Alden Keefe Sampson / @aldenks (AS)
- Raphael Hagen / @norlandrhagen (RH)
- Charles Stern / @cisaacstern (CS)
- Jeremy Maitin-Shepard / @jbms (JMS)

### Meeting notes:

- NR: Codec pipeline
    - Open question: merging partial and full versions?
    - Next
        - reading/writing partial chunks for uncompressed data
- DB: Saransh helped with hatch and source layout updates
    - providing a review on packaging PRs: https://github.com/zarr-developers/zarr-python/pull/1592
    - new branch for V3 work
        - removing attrs
        - using frozen dataclasses
        - relies on handling to/from dict in each class with validation functions
- MJ: No updates, participating in Joe's virtual sprint on Zarr refactor
    - can test out setting up test env with Hatch, provide feedback
- AS: Set up dev environment, still intending to work on [high-level methods](https://github.com/zarr-developers/zarr-python/issues/1598).
    - Also adding setup/dev environment doc improvements to https://github.com/zarr-developers/zarr-python/pull/1643
- RH: No updates
- CS: Interested in participating in Zarr sprint remotely
- JMS: No updates, analogous decisions in tensorstore

## January 17, 2024

### Attendees

- Sanket Verma / @MSanKeys963 (SV)
- Joe Hamman / @jhamman (JH)
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Max Jones / @maxrjones (MJ)
- Alden Keefe Sampson / @aldenks (AS)
- Raphael Hagen / @norlandrhagen (RH)

### Meeting notes:

- SV: plans for numcodecs going forward
    - TODO: connect with JK about this
- JH: made some good progress on the Store API
    - Thinking about what to do when keys are missing, `raise KeyError` or `return None`
    - Needs work: `getsize`, `move`, `tree`, `rmdir`, `open`, `close`
        - NR: move should only exist on a store if it's cheap
    - Open questions:
        - `Store.list_*` could change to return async generators
- NR: working on codec api, removing array metadata
    - not 100% happy with the API yet
    - new methods: `evolve` and `validate`
        - check if the codec matches
    - looking for input here https://github.com/zarr-developers/zarr-python/pull/1632
- DB: working on a messy / unmergeable PR for the Array API
    - end goal: unify array/group apis for v2 and v3
    - added a new directory with v2 and v3 metadata
    - stuck on dataclass inheritance
    - JH:
where will the normalization of metadata keys happen?
- MJ: not much, going to pick up the hatch PR
    - JH: yay!
- DB: flag issue around imported modules from tests https://github.com/zarr-developers/zarr-python/pull/1601
- AKS: playing with zarr in rust - very fast!
    - going to pick up the top level api this week
- RH: No updates at this time

### Discussion

What goes in the beta release

1. core array, group, and store api
2. thesis: almost feature complete but the api should be set - we want people to start using it so we can get some feedback

## January 3, 2024

### Attendees

- Sanket Verma / @MSanKeys963 (SV)
- Alden Keefe Sampson / @aldenks (AS)
- Joe Hamman / @jhamman (JH)
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Max Jones / @maxrjones (MJ)

### Meeting notes:

- JH: Still working on Group and Store APIs
- NR: Left off with codec api, sharding api, and sharding layouts
- DB: Still working on array api
    - considering a major change to indexing/slicing api (slicing a Zarr array gives a NumPy array, which is weird; it should give a Zarr array)
    - thinking about serialization of nested objects
    - partial writes
- MJ: Looking for a place to jump in
    - Refactor metadata objects (e.g.
ChunkGrid, ChunkKeyEncoding)
    - Remove attrs and refactor (de)serialization
    - Hatch PR https://github.com/zarr-developers/zarr-python/pull/1592
- AKS - may want to jump in on the top level `zarr.foo*` api

## December 20, 2023

### Attendees

- Charles Stern / @cisaacstern (CS)
- Jack Kelly / @JackKelly (JK)
- Sanket Verma / @MSanKeys963 (SV)
- Alden Keefe Sampson / @aldenks (AS)
- Joe Hamman / @jhamman (JH)

### Meeting notes:

- CS: nothing directly on zarr, keeping an eye on the zarr issues with help wanted tags
- SV: looking at hatch pr from davis, zep 0 revisions, and zarr paper
- JK: working on a [light-speed-io](https://github.com/jackKelly/light-speed-io/) project (rust), playing around with ideas for fast data loading
- AS: seems too early to jump in, don't want to get in the way
- JH: Many things in progress: Store, Codecs, Arrays, Groups

## December 6, 2023

### Attendees

- Ryan Abernathey / @rabernat
- Joe Hamman / @jhamman (JH)
- Charles Stern / @cisaacstern (CS)
- Jack Kelly / @JackKelly (JK)
- Sanket Verma / @MSanKeys963 (SV)
- Davis Bennett (DB) / @d-v-b
- Raphael Hagen / @norlandrhagen
- Alden Keefe Sampson / @aldenks
- Norman Rzepka / @normanrz

### Meeting notes:

- Design doc for v3.0 has moved to GitHub. If you want to comment on the design then comment on [the GitHub pull request](https://github.com/zarr-developers/zarr-python/pull/1583).
- Zarrita: Alistair started it as a place to experiment with the Zarr v3 spec (back in July 2020). Norman picked Zarrita up in Summer.
- We've taken Zarrita, copied it, renamed & refactored things. There's a new Store interface (we're leaving behind the mutable mapping interface of Zarr stores.) Aiming for 100% coverage of static type checking.
- Norman, Davis & Joe have been together for the last 3 days (in Berlin). JH has been working on the Group API (compatibility with Zarr-Python's group API). See [this PR](https://github.com/zarr-developers/zarr-python/pull/1590).
For v2 there are two metadata docs which describe a group. Reads now happen async, which cuts the loading time for groups in half. Now working on listing contents of a group.
- DB: In Zarrita we have representations of arrays for v2 and v3. DB has been working on a uniform interface to v2 and v3. Breaking lots of things :). Looking at how codecs decompress & compress. Overall strategy is to use the v3 way of doing things. See [DB's PR](https://github.com/zarr-developers/zarr-python/pull/1589).
- RA: It's great that there are both async and sync APIs. But downstream data science libraries will always want the sync API.
- NR: The Zarrita code is based on fsspec, with some small changes.
- JH: We're just using `fsspec` (for now). Very convinced that having an async API is the right call for Zarr-Python. Less convinced that the `fsspec` way of doing things will be the long-term solution.
- NR: Adding sharding strategies. Customise how chunks are laid out in the file, e.g. if you want to do partial writes (where you can write to specific places) instead of having to write entire shards at a time. Writing tests right now. First PR has been merged into the v3 branch (codecs).
- JH: The codecs are now an entry point into the Zarr-Python code. Zarr-Python v2 basically supports any codec in numcodecs. Do we need to register _all_ of those compressors and filters as codecs? Or should we limit them?
- NR: Let's make a generic codec. bytes-to-bytes, and array-to-bytes.
- JH: For now, we've decided not to work on variable chunk sizes. We could release a version of Zarr-Python without variable chunk sizes.

Questions?

- RA: Everyone's very supportive! This is what we need to get out of the rut that zarr-python has been stuck in. A lot of folks would like to help, but don't know _where_. Are there concrete tasks that we can give to people? (The answer may be "no"! Some software projects are just not parallelizable like that.)
- JH: Some of these first blocks of work have required us to already resolve conflicts. I'll jot down a couple of tasks which could be useful for folks to work on.
    - the top-level API has not been ported over yet (e.g. `zarr.open()`). Most people use that top-level API.
    - documentation! A lot of copy-and-pasting from Zarr-Python v2.0. But some function signatures will change.
    - type checking needs work! MyPy isn't happy right now.
- DB: v3 introduces the concept of a codec pipeline. We build an object which is a sequence of transformations of chunk data, which leads to it being stored, or - in reverse - leads to it being opened. The documentation for this doesn't exist yet. If anyone has an idea for a transformation, then work through the process of doing this with the v3 machinery, and write up some docs for how to do this. v3 is a lot more explicit about how chunks are encoded.
- NR: +1 to adding codecs, or wrapping numcodecs. Also:
    - try wiring the new Zarr-Python to downstream libraries. That would tell us what's missing in terms of the public API.
    - Also, it'd be useful to have a champion for variable chunking. The codec pipeline should know about variable chunking.
- RA: The problem with the ZEP process is that it's hard for ZEP authors to just implement the ZEP.
- JH: We should be able to find byte-sized chunks which folks can work on.
- NR: It'd be great to keep up the momentum after this week. And it'd be great to have a beta by Jan 2024!
- RA: Where is the discussion of the ZEP3 proposal ([here's the PoC implementation PR](https://github.com/zarr-developers/zarr-python/pull/1483))? And the discussion is [here](https://github.com/orgs/zarr-developers/discussions/52).
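
The codec pipeline idea DB describes in these notes (an ordered sequence of transformations applied on write and undone in reverse on read) can be sketched roughly as below. This is only an illustration under stated assumptions: the names `Codec`, `Int32ToBytes`, `Zlib` and `Pipeline` are hypothetical and are not the zarr-python v3 API, and the sketch uses only the Python standard library rather than numcodecs.

```python
from __future__ import annotations

import struct
import zlib
from dataclasses import dataclass
from typing import Any, Protocol, Sequence


class Codec(Protocol):
    """Hypothetical codec interface: one reversible transformation."""

    def encode(self, data: Any) -> Any: ...

    def decode(self, data: Any) -> Any: ...


@dataclass(frozen=True)
class Int32ToBytes:
    """Array-to-bytes step: pack a sequence of ints as little-endian int32."""

    def encode(self, data: Sequence[int]) -> bytes:
        return struct.pack(f"<{len(data)}i", *data)

    def decode(self, data: bytes) -> list[int]:
        return list(struct.unpack(f"<{len(data) // 4}i", data))


@dataclass(frozen=True)
class Zlib:
    """Bytes-to-bytes step: zlib compression (stand-in for a real compressor)."""

    level: int = 6

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data, self.level)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


@dataclass(frozen=True)
class Pipeline:
    """Apply codecs in order when storing, in reverse order when reading."""

    codecs: tuple[Codec, ...]

    def encode(self, chunk: Any) -> Any:
        for codec in self.codecs:
            chunk = codec.encode(chunk)
        return chunk

    def decode(self, stored: Any) -> Any:
        for codec in reversed(self.codecs):
            stored = codec.decode(stored)
        return stored


pipeline = Pipeline((Int32ToBytes(), Zlib(level=1)))
chunk = list(range(1024))
stored = pipeline.encode(chunk)      # what would be written to the store
restored = pipeline.decode(stored)   # what a reader would get back
```

Keeping each step as a small frozen dataclass means a pipeline is just data that could be serialized into array metadata, which matches the point that v3 is more explicit about how chunks are encoded.
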

Looking for a champion on:

- variable chunking
- synchronizers
- h5py compat

## Agenda

- Update from Joe + Davis on refactor progress

## November 22, 2023

### Attendees

- Joe Hamman / @jhamman (JH)
- Charles Stern / @cisaacstern (CS)
- Jack Kelly / @JackKelly (JK)
- Sanket Verma / @MSanKeys963 (SV)
- Davis Bennett (DB)

## Agenda

- Zarr-Python 3.0 design doc: https://hackmd.io/0DVKP6d9QI-VaHc0zvOuxw

### Meeting notes:

- JH: We can start working from store interface
  - kind of leaky abstraction
- JK: Looking to read a million chunks
  - sharding helps with that
  - discussions around batching requests in Zarr-Python
  - requesting a million chunks in a single request
  - is Zarr V3 a good place to pull in these performance bumps? (don't want to delay the existing work)
- DB: To make `get` more efficient, you need to wrap it in something
  - mostly users are getting multiple chunks at a time
- JH: In Dask/Xarray world you map a single chunk of Zarr at a time
  - At Earthmover there are 1-to-1 reads
  - to handle big chunks we have rechunker
  - sharding codec sits above the store interface
- JH: https://github.com/scalableminds/zarrita/blob/async/zarrita/sharding.py#L309
  - indexing for sharding
  - the sharding codec will need access to the store API whereas the other codecs don't need it
- DB: Like the idea
  - add a new abstraction
  - we have a leaky abstraction and we can use it
- JH: Norman is willing to help but only if Zarr-Python is a first-class citizen
- JH: https://docs.xarray.dev/en/stable/roadmap.html
  - publish the roadmap on [Zarr-Python docs](https://zarr.readthedocs.io/) for the community
- JH: Jack can help us with the fast concurrent loading problem
- JH: Meeting with Davis and Norman in 1.5 weeks to work on Zarr-Python 3.0

## November 8, 2023

### Attendees

- Joe Hamman / @jhamman
- Charles Stern /
- Raphael Hagen / @norlandrhagen
- Sanket Verma / @MSanKeys963
- Martin Durant /

## Agenda

- Zarr-Python 3.0 design doc: https://hackmd.io/0DVKP6d9QI-VaHc0zvOuxw
- how would batching
work across arrays
- use pydantic zarr
- other dependencies
  - python 3.9 - drop in dec.?
- which sharding impl

## November 1, 2023

### Attendees

- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Sanket Verma / @msankeys963
- Raphael Hagen / @norlandrhagen
- Charles Stern
- Brian Davis
- Thomas Nicholas

## Agenda

- Request - can someone try to take some notes today?
- Request - can we move this meeting time to 8:30a PT (currently at 9a PT).
- V3 API migration
  - Now that we are starting to work on implementing v3, we're faced with the question of what to do with the existing API
  - Observation: the current v2/v3 polymorphism is unsustainable (and incorrectly prioritizes v2 internally)
  - Proposal - we create a v3 namespace within zarr-python where we can develop in an isolated space toward a complete v3-spec implementation
    - Included in this namespace:
      - classes: `zarr.v3.{Array,Group,Store}`
        - These classes implement an internal api that closely aligns with the v3 spec
      - high level functions: `zarr.v3.{create, open, ...}`
        - As much as possible, these functions should look and feel like the v2 equivalents but should not be tied to the exact implementation
        - e.g. `zarr.create(shape=..., dtype=..., compressor=...) -> zarr.create(shape=..., data_type=..., codecs=..., attributes=...)`
        - We may also want to deprecate and/or rename some of the existing top level functions
      - backward compatibility:
        - high-level functions in the v3 namespace should be able to `create` or `open` a v2 dataset
        - The `Group` or `Array` does not need to be backward compatible though.
    - All development toward v3 happens on the `main` branch in zarr-python
  - Alternative proposal - We avoid the `v3` namespace and instead take over the primary namespace in a development branch (e.g.
`v3`)
    - When we feel that the `v3` branch is complete, we merge to main and make a `3.0` release
    - Folks have less time to test out the v3 implementation but we have a cleaner development process
- Ideas
  - Idea of zarr array is to look like a numpy array
    - could move all the zarr array details to a polymorphic metadata object
    - trim things down to just the minimal array api interface
  - declarative hierarchy specification
  - type hints

#### Sanket Notes

- DB: Definitions of Zarr and Dask chunks are different and that's not good
- JH: Benefits of generative chunk indexing
  - Impacts with sharding, variable chunking and other shiny features
  - Large array with billions of chunks
- JH: Maintaining both V2 and V3 at the same time is not ideal
- DB: V2 has a lot of stuff that people don't use
  - stores
- TN: The current public-facing APIs (V2 and V3) conform to the existing spec, but we're thinking of working on a new public-facing API which is a wrapper of V2 and V3 and does not conform exactly to V3. Isn't that a bad thing?
- DB: The public-facing Zarr array object API is not covered by the spec anyway
  - Also can't be, because syntax might be language-dependent
  - Therefore we have full freedom in the public python API of the python zarr array type
- TN: Okay, in that case it makes sense to follow the python array API standard as much as possible
- TN: Array API has granular functionality which is super useful (e.g.
you can say "we don't support the statistical functions")
- TN: Note that chunking is not part of the array API standard

## October 18, 2023

### Attendees

- Joe Hamman / @jhamman
- Max Jones / @maxrjones
- Davis Bennett / @d-v-b
- Tom Nicholas
- Charles Stern
- Sanket Verma
- Ryan Abernathey

## Agenda

- Proposal: just use Zarrita :)
  - 0.1% done: https://github.com/jhamman/zarr-python/pull/1
- Ryan added memorystore to Zarrita: https://github.com/scalableminds/zarrita/pull/12

## September 20, 2023

### Attendees

- Joe Hamman / @jhamman
- Charles Stern / @cisaacstern
- Sanket Verma / @MSanKeys963
- Raphael Hagen / @norlandrhagen

## Agenda

- Review ZEP 6 proposal and proposed implementation
  - https://github.com/zarr-developers/zeps/pull/46
  - https://github.com/zarr-developers/zarr-python/pull/1526
- Goal with ZEP6 in Zarr-Python
  - Clean up interface for Group/Array constructors from V2/V3 metadata
  - Use ZOMs internally as part of the migration to V3 spec
  - Use ZOMs in array/group constructors to consolidate initialization reads/writes
    - https://github.com/zarr-developers/zarr-python/issues/538 (repeated writes to set attrs)
    - https://github.com/pangeo-data/pangeo-eosc/issues/39 (many contains/iter calls)
  - Expose ZOMs to third parties

## September 6, 2023

### Attendees

- Joe Hamman / @jhamman
- Ryan Abernathey / @rabernat
- Sanket Verma / @msankeys963
- Raphael Hagen / @norlandrhagen
- Ryan Williams
- Charles Stern / @cisaacstern
- Davis Bennett / @d-v-b

## Agenda

- review scoping section (below)
- performance
- zarr + pydantic (https://github.com/janelia-cellmap/pydantic-zarr)
  - observation: Zarr-python is missing specific data models for Groups / Arrays
  - price of depending on pydantic is probably not worth it

## Scoping V3 update (by @jhamman)

_Written by @jhamman on September 5, 2023_

In the Winter and Spring of 2022, while the V3 spec was still under development, an experimental V3 implementation was added to the Zarr-Python codebase
([#898](https://github.com/zarr-developers/zarr-python/pull/898)). This implementation followed the spec, as it was written at the time. However, in the months following these developments, major changes to the spec were made. This has left Zarr-Python out of sync with the V3 specification.

### Summary of current status

1. V3 support is behind an experimental API (accessed by setting `zarr_version=3` and `ZARR_V3_EXPERIMENTAL_API=1`).
2. A separate code path for V3 stores was implemented in `zarr._storage.v3`.

Major changes to the spec since the experimental implementation include:

- Entrypoint metadata document (`zarr.json`) is no longer required
- Metadata keys were renamed (e.g. `meta/foo/bar.group.json -> /foo/bar/zarr.json`)
- Group and array metadata documents are no longer distinguished by their key name (everything is `zarr.json` and a `node_type` field is included in all documents)
- Various updates to metadata fields:
  - `format_version` → `int`
  - added `dimension_names`
  - removed `chunk_memory_layout` (in favor of transpose codec)
  - `codecs` now includes a list of codecs that was previously split between the `filters` and `compressor` fields
  - etc.

Open questions:

- fallback data types

### Actions

https://github.com/orgs/zarr-developers/projects/5/views/1

## Zarr refactor meeting

_Aug 16, 2023_

### Attendees

- Joe Hamman (Earthmover) - Xarray and Zarr dev
- Sanket Verma (Zarr)
- Tom White (independent dev) - SGKit and Cubed
- Max Jones (CarbonPlan) - Data scientist
- Raphael Hagen (CarbonPlan) - Data eng.
- Charles Stern (Columbia) - Pangeo-forge

### Discussion

- Max: how do we view V3 extensions already in Zarr-python
- Charles: how does Zarr python register plugins
- Zarrita (https://github.com/scalableminds/zarrita/)
  - reference implementation
  - no baggage / tech debt of Zarr-python
  - not production ready
  - also has sharding
- Tom: Interop tests between implementations

### Timeline

Goal: by the end of the year, have a fully-functional implementation of V3 in Zarr Python

- Starting now: survey users to get an understanding of how a breaking change to the V3 implementation will impact users
- Next two weeks: Break [#1290](https://github.com/zarr-developers/zarr-python/issues/1290) into smaller chunks and set up project board
- September: start refactor efforts
- Oct-Dec: Integration and interop testing

### TODOs:

- add regular call to community calendar
- break out V3 implementation tasks into issues / project board (try to identify issues that can be picked up by others)
- publish read out of this call
