# Zarr-Python Developer Meeting Notes
_formerly Zarr-Python Refactor Meeting Notes_
## November 22, 2024
-
Notes:
- Should `obstore`-based Store be in zarr-python or its own package? https://github.com/zarr-developers/zarr-python/pull/1661
- Top-level sharding configuration in `zarr.open` etc.
## November 15, 2024
- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Tom Augspurger / @TomAugspurger
- Sanket Verma / @MSanKeys963
- Theodore Visvikis /
- Josh Moore / @joshmoore
- Akshay Subramaniam / @akshaysubr
- Norman Rzepka / @normanrz
Notes:
- Discussions on https://github.com/zarr-developers/zarr-python/blob/f74e53aca5311ec077da71585dd962c4af7b8a11/tests/test_api.py#L68-L78
-
## November 8, 2024
- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Tom Augspurger / @TomAugspurger
- Sanket Verma / @MSanKeys963
Notes:
- Joe is working on store stuff and needs help with review - Tom would help with the review
- Davis would like this PR to be reviewed: https://github.com/zarr-developers/zarr-python/pull/2447
Discussion points
- https://github.com/zarr-developers/zarr-python/issues/2412
-
## November 1, 2024
- Joe Hamman / @jhamman
## October 25, 2024
- Joe Hamman / @jhamman
- Tom Augspurger / @TomAugspurger
- Sanket Verma / @MSanKeys963
Notes:
- Updates from Tom — working on `info`, `size` and `tree` properties
- Joe - hopefully wrapping store mode refactor up today
## October 18, 2024
- Tom Augspurger / @TomAugspurger
- Davis Bennett / @d-v-b
- Ryan Abernathey / @rabernat
- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
- Matt Iannucci / @mpiannucci
- Akshay Subramaniam
Notes:
- Updates
- Joe: getting icechunk out, interested in speaking about release blockers
- Davis: moving v3 tests, working on store api
- Norman: working on the numcodecs, sharding bug, filter/codecs for v2 arrays
- Ryan: worked on strings (out of spec), committed to dealing with spec problems (specifically on extensions)
- Tom: Xarray compat (probably ready to merge)
- Akshay: focusing on gpu compression codecs (nvcomp)
- Matt: working on getting v3 working with kerchunk and virtualizarr
Topics:
- store api
- DB: two phases of IO: reading/writing chunks or initializing an array or a group
- `mode` was added to the store
- pathological situations where `clear()` is happening on reopen
- NR: Be able to use StorePath in `zarr.open`, e.g. `zarr.open(LocalStore("...", mode="a") / "testdata.zarr")`
- JH: https://github.com/zarr-developers/zarr-python/issues/2359
-
- v2 filters/codecs
- https://github.com/zarr-developers/zarr-python/issues/2325
- NR: hasn't done much here yet
- MI: same, just looked at the issue
- codec naming
- NR: numcodecs codec namespace will be just for v3 arrays
- we may also want to split the kerchunk filters into compressor/filter categories
- RA: what would give us more developer velocity here?
-
- release blockers
- extensions
-
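The `zarr.open(LocalStore(...) / "testdata.zarr")` idea from the store API topic relies on overloading `/` to join a store with a path. A minimal sketch of that pattern follows; `ToyStore` and `ToyStorePath` are hypothetical names for illustration and are not the zarr-python classes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStore:
    """Hypothetical stand-in for a store rooted at some location."""

    root: str
    mode: str = "r"

    def __truediv__(self, other: str) -> ToyStorePath:
        # `store / "a"` yields a path object that remembers the store.
        return ToyStorePath(self, other.strip("/"))


@dataclass(frozen=True)
class ToyStorePath:
    """Hypothetical (store, path) pair, loosely modeled on the StorePath idea."""

    store: ToyStore
    path: str = ""

    def __truediv__(self, other: str) -> ToyStorePath:
        # Join path segments with "/" and avoid doubled separators.
        joined = "/".join(p for p in (self.path, other.strip("/")) if p)
        return ToyStorePath(self.store, joined)


location = ToyStore("/tmp/data", mode="a") / "testdata.zarr" / "group" / "array"
print(location.path)  # testdata.zarr/group/array
```

Because the path object keeps a reference to its store, a top-level `open` function could accept it directly and still know which mode the store was created with.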
## October 11, 2024
- Tom Augspurger / @TomAugspurger
- Davis Bennett / @d-v-b
- Sanket Verma / @MSanKeys963
- Josh Moore / @joshmoore
- Ryan Abernathey / @rabernat
- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
Notes:
- Summary
- JH: xarray & dask test suites passing with v3.
- JH: milestone for a beta release
- string PR should be included
- NR: doc sprint?
- https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/Docs.20sprint.20in.20September.3F/near/474312033
- Big items
- Strings (RA) - braindump
- confusing how it (ever) used to work
- `np<2` had no notion of varlen str
- fixlen (utf-8)
- or object array
- zarr allowed both as valid dtypes, e.g. `U4` or `object`
- now np has a varlen
- question of dtype+codec
- PRs merged
- 236: new dtype "string" and "bytes". also zarrs implements them.
- NR: v2 compatibility? pickling mode and another mode.
- RA: two questions
- API: strings in zarr and pull them out. works as expected? (with this PR: https://github.com/zarr-developers/zarr-python/pull/2036)
- data on disk stored the same between v2 & v3 (without rewriting). believe so if using vlenutf8 codec.
- RA: assumption is always vlen, and impl can use the appropriate in memory data structure (i.e., different in c)
- in python, decoding to varlen if `np>=2` (breaking from zarr v2); decoding to object if `np<2`
- NR: zarr array v2 a year ago will that work? believe so. tom working on the v2 side of it.
- TA: https://github.com/zarr-developers/zarr-python/pull/2323
- supported once that is supported
- NR: for v3, bit of concern since there's no spec. a ZEP? (see stuck https://github.com/zarr-developers/zeps/pull/47)
- RA: two components need to be addressed in the spec (dtype+codec; leaky abstraction?)
- ...integer array interpreted as bytes?! unsure.
- RA: far from being able to make changes to the spec.
- JM: good to confer with John Kirkham
- JH: pickled dtypes are not interoperable with other languages
- NR: how do we want to deal with experiments?
- can't write specs without implementation and python is a great place to do that
- but maintain a few other implementations. not useful if the main implementation blazes ahead and everyone must follow
- suggest we be cautious about that.
- various ways to handle that
- previously environment variables
- issue a warning on non-standard dtypes
- some discussion then it'll be fine
- RA: see https://github.com/zarr-developers/zarr-specs/pull/312#issuecomment-2407444223
- SV: Isaac stepped down from https://github.com/zarr-developers/zeps/pull/47 (no one to steer)
- RA: different opinion -- just everything through extensions.
- trying to get to parity. can we do the same things.
- still struggling to evolve it. we decided that it's unversioned, so it's unchangeable.
- would need to move to v4.
- immutability was to be balanced by extensions.
- haven't managed to develop a robust ecosystem/process for extensions. ZEPs have failed. nothing adopted. we can't agree...
- extensions and then let's go make them
- see TA's https://github.com/zarr-developers/zarr-specs/issues/316
- let them be free
- practical way forward
- DB: tried to address that in https://github.com/zarr-developers/zarr-specs/pull/312
- are we willing to change the parts of the spec that are blatantly contradictory
- if it's immutable, so be it.
- RA: clarifications are ok
- NR: state of zarr-specs is terrible. ZEPs are a symptom. people are fatigued. process broken.
- spec core team is a good path.
- will have the same issue treating everything as an exception. names need to be coordinated. two string dtypes?!
- who controls the namespace. need a process. even a repository, pypi style.
- JM: feedback on zarr versioning from other implementations
- RA: namespacing
- extensions need to be namespaced. URI ok. absolute. resolves to the document.
- need to figure out what the extensions are and what their scope is
- 2 different extensions (different URIs) that define the same codec. that's ok.
- make whatever changes are needed to have that process, socialize it, etc. shutdown ZEPs.
- NR: that's not how they were meant to work in v3. extension *points*. lets you create, e.g., codecs. (nothing wraps it)
- RA's is a new concept. might work. there might be issues with composability.
- comfortable in zarr-python if you need to actively opt in.
- RA: ask everyone if the original intention of zarr-spec work in practice.
- haven't been able to make forward progress. incorporate learning
- face reality of how things work in the real world and adapt.
- look at others where it's working
- JH: laser focused on getting zarr-python out
- can set config (don't need environment variable). go off-road
- consolidated metadata, few codecs, etc. to add (not much more coming)
- SV: lack of implementations was definitely an issue. Tried to work that into https://github.com/zarr-developers/zeps/pull/59
- JH: on spec process, v3 is in accepted not final. missed that date by a year.
- reasonable to say that changes are going to have to be made
- change the status of the spec for a while?
- SV: previous conversation about when to set it to final
-
- Technical things
- Beta release (JH): ok when smashing merge on string PR?
- NR: v2 filters in beta or after? JH: weird kerchunk
- RA: to make xarray work we had to special case everything ("working" isn't accurate)
- convenience for our users
- NR: backwards compatibility. (lack of) v2 spec are out in the wild that we have to define indefinitely. (see extensions above)
- 2 camps: people that think it through for a long time and the others that want to forge ahead. a tension that we have to work out.
- JH: filters just land in the array metadata. never seen that in the wild.
- Endianness (DB)
- https://github.com/zarr-developers/zarr-python/issues/2324
- no longer part of the dtype. if someone creates an array with whatever endianness, then the zarr array doesn't report it (have to check the codec)
- when creating a new array, the endianness gets dropped
- don't care? in memory representation is decoupled from how it is stored.
- NR: yes, that's what we did in zarrita. you can control how it lands out, but not how it is read into memory.
- RA: prevent memmapping data? NR: that's what we have the metadata for. RA: can imagine an impl without codecs that wants memmap to access the data (though zarr-python doesn't work that way) NR: zarr-spec requires a bytes codec which defines the endianness.
- JH: if you get a big endian array (e.g., zarr.save(np.array)) .... round-trip so you get a big-endian back out the other side
- NR: zarr.save() would need to handle.
- JH: yes, interpret at the top-level and do something smart about the bytes codec
- DB: if we keep it, what was the point of parameterizing it in the codec.
- NR: compatibility, you need to store it somewhere
- DB: what comes out is undefined.
- RA: use platform's preferred endianness
- DB: then users won't round-trip. won't come out in the chosen endianness
- NR: struggled with this, but it is an implementation detail (incl. exposing it to the user). matters only for some performance issues.
- DB: what is the dtype of a zarr array relative to what the user puts in.
- NR: zarr only cares about how it looks on disk (not in-memory)
- JM: zarr_array.dtype calculated from zarr data_type and checking the codec ("dynamic dtype")
- NR: need to check the read path and what it is doing
- RA: similar to the strings that it's coupling dtype and codec
- DB: in v4 would like to see codec & dtype together
- also want to put shape and chunk together (JM: plus shard)
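The endianness thread above treats byte order as a property of the stored bytes (set by the bytes codec) rather than of the in-memory array. A small standard-library sketch of that decoupling is below; the `encode` and `decode` helpers are hypothetical stand-ins for a bytes codec, not zarr-python functions.

```python
import struct
import sys


def encode(values: list[int], endian: str) -> bytes:
    """Serialize 32-bit unsigned ints using the byte order given by `endian`."""
    prefix = {"little": "<", "big": ">"}[endian]
    return struct.pack(f"{prefix}{len(values)}I", *values)


def decode(data: bytes, endian: str) -> list[int]:
    """Read the stored bytes back; the result carries no byte order of its own."""
    prefix = {"little": "<", "big": ">"}[endian]
    return list(struct.unpack(f"{prefix}{len(data) // 4}I", data))


values = [1, 256, 65536]
stored_big = encode(values, "big")
stored_little = encode(values, "little")

# The stored bytes differ, but the decoded values are identical.
assert stored_big != stored_little
assert decode(stored_big, "big") == decode(stored_little, "little") == values
print(sys.byteorder)  # the platform's preferred byte order
```

In this sketch the byte order has to be recorded next to the data (NR's "you need to store it somewhere"), while what the reader gets back in memory is left to the implementation.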
## October 4, 2024
- Tom Augspurger / @TomAugspurger
- Davis Bennett / @d-v-b
- Sanket Verma / @MSanKeys963
- Josh Moore / @joshmoore
- Ryan Abernathey / @rabernat
Notes:
- Array metadata refactor needs some mypy fixes (or we accept overriding)
- Discussions on https://github.com/zarr-developers/zarr-python/pull/2272 (Davis)
- https://yarl.aio-libs.org/en/latest/
- DB: not making stores mutable, but they do IO!...
- RA: inconsistency in the definition of where a store begins
- can some stores disallow starting from inside?
- Similarity between store and URL (abstractly speaking)
- Ryan: would suggest (also for sharding) to use stores?
- Josh: but how to bootstrap?
- Strings (Ryan)
- https://github.com/zarr-developers/zarr-python/pull/2278
- define a string dtype on a zarr DataType rather than np
- https://github.com/zarr-developers/zarr-python/pull/2036
- just implemented and then do the spec later
- JM: try this out as a community codec (i.e. extension)?
- then can discuss a ZEP to make a core codec.
- SV: know users who are interested. (feel for core or not)
- e.g. geopandas
- DataClasses (Tom)
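For the strings topic above, the core problem is storing items of different byte lengths in one chunk. The sketch below shows a simplified length-prefixed layout in the spirit of a variable-length UTF-8 codec. It is an illustration only: the function names are made up, and the layout is an assumption rather than a statement of what any existing codec writes to disk.

```python
import struct


def encode_vlen_utf8(items: list[str]) -> bytes:
    """Encode strings as: item count, then (byte length, UTF-8 bytes) per item."""
    parts = [struct.pack("<I", len(items))]
    for item in items:
        raw = item.encode("utf-8")
        parts.append(struct.pack("<I", len(raw)))
        parts.append(raw)
    return b"".join(parts)


def decode_vlen_utf8(data: bytes) -> list[str]:
    """Inverse of `encode_vlen_utf8`."""
    (count,) = struct.unpack_from("<I", data, 0)
    offset = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", data, offset)
        offset += 4
        items.append(data[offset : offset + length].decode("utf-8"))
        offset += length
    return items


words = ["zarr", "", "héllo", "日本語"]
assert decode_vlen_utf8(encode_vlen_utf8(words)) == words
```

The decoded result is a plain list of Python strings; which in-memory array type holds them is a separate choice, which is the dtype versus codec split raised in the discussion.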
## September 20, 2024
- Joe Hamman / @jhamman
- Tom Augspurger / @TomAugspurger
- Davis Bennett / @d-v-b
- Sanket Verma / @MSanKeys963
- Akshay Subramaniam
Notes:
- Storage transformers - decision: error when creating an array - https://github.com/zarr-developers/zarr-python/pull/2180
- Dtype validation for v3 - https://github.com/zarr-developers/zarr-python/pull/2209
- Fill value validation for v3 - https://github.com/zarr-developers/zarr-python/pull/2216
- Consolidated metadata discussion
- Xarray integration
- https://github.com/pydata/xarray/issues/9515
- Dask integration
- https://github.com/dask/dask/pull/11388
- Doc sprint: https://github.com/zarr-developers/zarr-python/issues/2215
- What modules are we targeting for the sprint?
- Create issues for different modules so that folks self-assign?
- Some functions have a docstring but are missing a code sample - should every docstring have a code sample?
- highest priority:
- zarr.Group
- zarr.Array
- zarr.api.synchronous
- next tier:
- zarr.AsyncGroup
- zarr.AsyncArray
- zarr.api.asynchronous
- zarr.storage
- zarr.metadata
- Should we also plan for tutorials?
- consider removing `chunk_shape` kwarg
## September 13, 2024
#### Attendees
- Joe Hamman / @jhamman
- Tom Augspurger / @TomAugspurger
- Davis Bennett / @d-v-b
- Sanket Verma / @MSanKeys963
#### Notes
- Updates
- Davis is happy about recent improvements to
- on people's minds:
- Davis is going to look at the synchronizer api
- Sanket: Doc sprint in September? Dates? How many days? Async?
- Let's try for Sept. 30-Oct 1
- Yes, Async with a kickoff on Sept. 30
- Tom: consolidated metadata is getting pretty close
- reworking metadata layout
- first iteration will support reading/writing v3 consolidated metadata and reading v2
- should be possible to write new v2 metadata as well
- will need to do more thinking on future proofing for metadata schemas
- also thinking about the maximum depth of consolidation
- storage transformers issue: https://github.com/zarr-developers/zarr-python/issues/2178
- may need to update the spec language around optional metadata fields
```python
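# Sketch of an idea from the meeting: `make_sharding_pipeline` is a proposed
# helper, not an existing zarr-python function, and `Gzip` is not imported here.
import zarr

kwargs = zarr.codecs.make_sharding_pipeline(
    read_chunks={...},
    write_chunks={},
    compressor=Gzip(),
)
zarr.create_array(shape=(...), **kwargs)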
```
## September 6, 2024
#### Attendees
- Joe Hamman / @jhamman
- Tom Augspurger / @TomAugspurger
- Norman Rzepka / @normanrz
- Josh Moore / @joshmoore
- Davis Bennett / @d-v-b
- Akshay Subramaniam
#### Notes
- https://github.com/orgs/zarr-developers/projects/5/views/2 big lifts?
- d-v-b: shape for sharding? does it have to change for 3.0.0?
- shape is currently dependent on which codec
- should encourage thinking about it as a new interpretation of chunking
- JH: define some preset pipelines? NR: similarly. doesn't have to change the array. i.e., top-level API.
- DB: people want easy access to the configuration for looping
- JM: .writing_chunks to go with .reading_chunks. (would dask also adopt?)
- DB: agreed, might be the right level of detail for users
- would also help to guard against other implementations (transformers, etc.)
- NR: also produce an ergonomic way of *creating* them
- DB: you'd also want to pass as an argument
- NR: ok, and doesn't have to be set forever.
- JH: xarray/dask zarr-readers didn't need the attribute. (just from_array needs as an argument)
- JM: default? which one wouldn't fail.
- JH: default today is write chunk? NR: yes. but can be too big.
- consolidated metadata
- JH: reading/writing v2 metadata as a blocker
- TA: writing, too? kinda yeah.
- TA: status update
- pretty straight-forward
- issues with del item: do we **synchronize** out to the consolidated? (i.e. doing more IO)
- relationship between group & store objects is just "call save metadata"?
- JM: writing down the v2 schema in the v3 (since no v2 process)
- JH: just do it in the v2 schema. people are using it.
- docs (NR)
- sprint? still happening.
- issue raised about the formatting. not using the left pane. (sad & empty)
- synchronizer API (JH)
- issues ("it doesn't work")
- hot potato: v2 has one but without distributed version
- DB: how does it plug in in V2?
- JH: mucked up the v2 code `_set_item_nosync`
- DB: property of an array (i.e. high-up in the API)
- JH: could go further down. store level?
- DB: every store has a locking class
- JH: zip store requires thread/asyncio locks (not-merged)
- NR: not using synchronizers
- JH: frequent bug reports in xarray
- DB: does it have to live at arrays and groups because stores didn't know the key names
- does it tie in to having the names knowledge in the store?
- JH: possibly a high-level and a low-level store API
- NR: higher-order store so that you can compose them
- zip store always has it
- mutable mapping
- JH: use memory store to adapt anything (no async stuff though)
- GPU (AS)
- merged :tada:
- testing with the codec interface. things look to be working.
- JH: v3 branch *works*. other big lifts for 3.0.0?
- batched store API. few minor issues. (rust under the hood?)
- JH: definitely lots of small chunk calls. add to store API (bunch of keys)
- DB: allow fetches to run out of order then it changes the API
- gather runs sequentially
- NR: async iterator? or as_completed
- JH: streaming approach is the most powerful but also the most complicated
- JM: add in delete and then it's approaching transactions
- DB: no lazy execution model right now. leverage futures?
- AS: gpu batch in kvikio does that, collects all the futures and then waits
- DB: path is open for that. (if we leave the mutable mapping API) not too painful.
- AS: also async events needs separate codec pipeline. affects more.
- DB: (dreaming) if txn as a context manager, then it could take a region
- NR: :heart:
- DB: docs as the most important
- NR: agreed
- DB: pay attention to what sucks.
- NR: migration guide.
- CLI tools (convert v2 to v3) - DB
- NR: not difficult for small arrays
- TA: zarr v3 metadata refer to v2 data?
- NR: most of the time. only if the codec is compatible. zarrita had a function for that. could do that.
- time-permitting (Josh)
- impl tests, netcdf-c, bluesky/tiled
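The batched store API discussion above contrasts fetching many keys with ordered results against streaming results as they complete. A minimal `asyncio` sketch of the two shapes is below; `fetch`, `get_many_ordered`, and `get_many_streaming` are hypothetical names, and the sleep only stands in for store latency.

```python
import asyncio
import random


async def fetch(key: str) -> tuple[str, bytes]:
    """Hypothetical store read; the sleep stands in for network latency."""
    await asyncio.sleep(random.uniform(0.0, 0.01))
    return key, key.encode()


async def get_many_ordered(keys: list[str]) -> list[tuple[str, bytes]]:
    # gather schedules all reads at once and returns results in request order.
    return await asyncio.gather(*(fetch(k) for k in keys))


async def get_many_streaming(keys: list[str]) -> list[tuple[str, bytes]]:
    # as_completed hands back each result as soon as it is ready, in any order.
    results = []
    for future in asyncio.as_completed([fetch(k) for k in keys]):
        results.append(await future)
    return results


keys = [f"c/{i}" for i in range(5)]
ordered = asyncio.run(get_many_ordered(keys))
streamed = asyncio.run(get_many_streaming(keys))
assert [k for k, _ in ordered] == keys
assert sorted(streamed) == sorted(ordered)
```

The streaming variant returns the key alongside the value because the caller can no longer rely on position, which is one way the API changes once fetches are allowed to finish out of order.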
## August 30, 2024
#### Attendees
- Joe Hamman / @jhamman
- Sanket Verma / @MSanKeys963
- Akshay Subramaniam /
- Davis Bennett / @d-v-b
- Josh Moore /
- Tom Augspurger / @TomAugspurger
#### Agenda
- alpha release last week
- 2.18.3 is close (maybe today)
- GPU PR is in
- lots of stale PRs
- async / sync boundary in store
- look at how tensorstore does this
- probably pass the store name and a config dict?
- dvb: having users instantiate a store is kind of an anti pattern
- want more of a declarative pattern
- as: could be useful to decouple protocol from store api
- like what we have w/ codecs
- consolidated metadata
- https://github.com/zarr-developers/zarr-specs/pull/309
- https://github.com/zarr-developers/zarr-python/pull/2113
- discussion about store api
- any changes to the on disk format are a spec change
- discussion about cache consistency and invalidation
-
- back to attrs? or something else?
- serialization of metadata is really hard
- Tom is looking at something here -> https://github.com/TomAugspurger/zarr-python/blob/feature/serde/src/zarr/_serialization.py
- probably don't need to go to attrs
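The "store name and a config dict" idea from the store discussion above could look roughly like the registry below. This is a sketch under assumptions: `STORE_REGISTRY`, `register_store`, `make_store`, and `ToyMemoryStore` are all hypothetical names, and the spec layout is invented for illustration.

```python
from typing import Any, Callable

# Hypothetical registry mapping a store name to a factory.
STORE_REGISTRY: dict[str, Callable[..., Any]] = {}


def register_store(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Class decorator that records a store factory under `name`."""

    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        STORE_REGISTRY[name] = factory
        return factory

    return decorator


@register_store("memory")
class ToyMemoryStore:
    def __init__(self, read_only: bool = False) -> None:
        self.read_only = read_only
        self.data: dict[str, bytes] = {}


def make_store(spec: dict[str, Any]) -> Any:
    """Build a store from a declarative spec: {"name": ..., "config": {...}}."""
    try:
        factory = STORE_REGISTRY[spec["name"]]
    except KeyError:
        raise ValueError(f"unknown store {spec.get('name')!r}") from None
    return factory(**spec.get("config", {}))


store = make_store({"name": "memory", "config": {"read_only": True}})
assert isinstance(store, ToyMemoryStore) and store.read_only
```

With this shape users describe the store as data instead of instantiating it, which also keeps the choice of protocol separate from the store class, similar to how codecs are looked up by name.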
## August 23, 2024
#### Attendees
- Joe Hamman / @jhamman
- Sanket Verma / @MSanKeys963
- Josh Moore / @joshmoore
- Norman Rzepka
- Akshay Subramaniam
- Gustavo Hidalgo
#### Agenda
- https://github.com/zarr-developers/zarr-python/pull/2102
- NR: important to have a written document
- OME is also interested in support for reading v2 data
- may be good to remove the `v2` module asap
- JM: crux is supporting v2 and v3 data
- does it make sense to create a zarr3 library
- NR: not a fan of the zarrv3
- discoverability and aesthetics are not great
- pitch weekly alpha releases
- need to do the work and get the release out
- JM: when do we go from alpha to beta to full release
- SV: also address these questions: https://github.com/zarr-developers/zarr-python/discussions/2093#discussioncomment-10429985
- alpha release frequency
- proposal: weekly release on Monday
- consolidated metadata
- https://github.com/zarr-developers/zarr-specs/issues/136
- JM: no problem supporting this for v2 ala 2.*
- add something that supports v2 data
- add zep for v3
- RemoteStore
- PR1956
- blocking: writing is completely broken because of the `exists` method
- doing naive synchronous user thing
- `open_array(s3://...)`
- using `sync` in user code
- accessing fsspec directly
- `store = await MyStore.open('s3://foo')`
- `store = sync(MyStore.open('s3://foo'))`
- `store = MyStore.open_sync('s3://foo', loop=...)`
- `sync_store = SyncWrapper(MyStore, 's3://foo')`
- `sync_store.set(filename, bytes)`
- GPU array progress
- squashing bugs around merge conflicts
- GPU CI is working now, need to sort out limiting the size of the matrix and installing cupy
-
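The `sync(MyStore.open(...))` and `SyncWrapper` options listed in the RemoteStore item above can be sketched with a background event loop. This is one possible arrangement, not a description of zarr-python internals; `ToyAsyncStore` is a hypothetical in-memory store and the `s3://foo` URL is never contacted.

```python
from __future__ import annotations

import asyncio
import threading
from typing import Any, Coroutine

# One background event loop shared by all synchronous callers
# (an assumption of this sketch).
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


def sync(coro: Coroutine[Any, Any, Any], timeout: float | None = None) -> Any:
    """Run `coro` on the background loop and block until it finishes."""
    return asyncio.run_coroutine_threadsafe(coro, _loop).result(timeout)


class ToyAsyncStore:
    """Hypothetical async store backed by a dict."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    @classmethod
    async def open(cls, url: str) -> ToyAsyncStore:
        await asyncio.sleep(0)  # placeholder for real connection setup
        return cls()

    async def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    async def get(self, key: str) -> bytes | None:
        return self._data.get(key)


class SyncWrapper:
    """Expose the async store's methods as blocking calls."""

    def __init__(self, store_cls: type, url: str) -> None:
        self._store = sync(store_cls.open(url))

    def set(self, key: str, value: bytes) -> None:
        sync(self._store.set(key, value))

    def get(self, key: str) -> bytes | None:
        return sync(self._store.get(key))


sync_store = SyncWrapper(ToyAsyncStore, "s3://foo")
sync_store.set("filename", b"bytes")
assert sync_store.get("filename") == b"bytes"
```

Running the loop in its own thread means `sync` can be called from ordinary user code, but it must not be called from a coroutine that is itself running on that loop, or it will deadlock.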
## August 9, 2024
#### Attendees
- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Sanket Verma / @MSanKeys963
- Eschal Najmi
-
#### Agenda
https://github.com/zarr-developers/zarr-python/issues/2008
- PR updates
- v2/v3 metadata: https://github.com/zarr-developers/zarr-python/pull/2059 *would love to see this merged --Davis*
- picklable classes: https://github.com/zarr-developers/zarr-python/pull/2006
- GPU support: https://github.com/zarr-developers/zarr-python/pull/1967
- blocked by GPU runner on GitHub
- also 2064, 2065 are close
- set a docs sprint date
- target late september
## July 26, 2024
#### Attendees
- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
- Davis Bennett / @d-v-b
- Hannes Spitz / @brokkoli71
- Gustavo Hidalgo / @ghidalgo3
- Sanket Verma /
#### Agenda
- API surface
- Array API: https://github.com/zarr-developers/zarr-python/discussions/2052
- API survey: https://docs.google.com/spreadsheets/d/1ev4Hj_YU-QCiZJuxRYMrBqdrYYqP3tIdnYGmp9saJS8/edit?usp=sharing
- https://github.com/zarr-developers/zarr-python/issues/2037
- Second alpha release: https://github.com/zarr-developers/zarr-python/issues/2008
#### Notes:
- sharding is complicating the `.chunks` attribute on Arrays
- ideas
```python
Array.chunks -> tuple[int, ...]  # raise if variable chunks or sharding
Array.chunk_grid.read_chunks()   # or inner_chunks or chunks
Array.chunk_grid.write_chunks()  # or outer_chunks or shards
```
- sharding configuration is pretty complicated today
- template module
- zarr.open_array remains the top level API where we can do user-friendly things
- the Array.open method remains a low level entrypoint
- async codec api
- sharding is the only codec that needs to be async / do IO
- NR: today we get scheduling in a threadpool
- assumption that codecs release the GIL
- need to do performance testing
Deprecate in 2.18.3
- h5py compat methods
TODOs from this meeting
- performance test suite
- dask + threaded scheduler
- GPU runner billing
-
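The `.chunks` idea from the notes above (raise when sharding makes the answer ambiguous, and expose separate read and write chunk shapes) could behave like this toy model. `ToyChunkGrid` and `ToyArray` are hypothetical classes written for illustration, not zarr-python types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToyChunkGrid:
    """Hypothetical regular chunk grid with optional sharding."""

    inner: tuple[int, ...]  # smallest readable unit
    outer: tuple[int, ...] | None = None  # shard shape; None means no sharding

    def read_chunks(self) -> tuple[int, ...]:
        return self.inner

    def write_chunks(self) -> tuple[int, ...]:
        # Without sharding, the unit of writing equals the unit of reading.
        return self.outer if self.outer is not None else self.inner


@dataclass(frozen=True)
class ToyArray:
    shape: tuple[int, ...]
    chunk_grid: ToyChunkGrid

    @property
    def chunks(self) -> tuple[int, ...]:
        # Refuse to answer when "chunks" is ambiguous.
        if self.chunk_grid.outer is not None:
            raise ValueError("sharded array: use read_chunks() or write_chunks()")
        return self.chunk_grid.inner


plain = ToyArray((100, 100), ToyChunkGrid((10, 10)))
sharded = ToyArray((100, 100), ToyChunkGrid((10, 10), outer=(50, 50)))
assert plain.chunks == (10, 10)
assert sharded.chunk_grid.write_chunks() == (50, 50)
```

Raising instead of guessing keeps existing code that reads `.chunks` from silently picking the wrong granularity on sharded arrays.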
## July 12, 2024
#### Attendees
- Ryan Abernathey / @rabernat
- Norman Rzepka / @normanrz
- Davis Bennett / @d-v-b
- Akshay Subramaniam / @akshaysubr
#### Agenda
- What to do with numcodecs?
- Make a release, needs docs for
- https://github.com/zarr-developers/numcodecs/pull/535
- https://github.com/zarr-developers/numcodecs/pull/531
- https://github.com/zarr-developers/numcodecs/pull/515
- Move more codecs specs into Zarr
-
## June 6, 2024
#### Attendees
- Joe Hamman / @jhamman
- Juan Nunez-Iglesias / @jni
## May 30, 2024
#### Attendees
- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
- Davis Bennett / @d-v-b
- Akshay Subramaniam
- Max Jones / @maxrjones
#### Agenda
* Upcoming alpha release
Quick topics:
- Norman, do we have an accessible api for extracting a shard index?
- chunkstore API
- joe: ask Martin
- MemoryStore has Buffer objects in it :(
- `out` kwarg
#### Notes
* Joe
* Store open mode is in, but incomplete
* Top level API is functional but needs a bunch of work
* Working on sharding codec, using fsspec branch + top level API branch
* slow for now
* Norman
* working on indexing, tests are working
* stuck on typing
* ready early next week
* Davis
* Store tests for Martin to get fsspec
* Hierarchy api
* codec pipeline API
* typed dicts for metadata objects
* Akshay
* On vacation, keeping track of Buffer/Indexing PRs
* Max
* No updates, can contribute to the v3.0.0 docs task (starting with dev docs)
## May 23, 2024
#### Attendees
- Joe Hamman / @jhamman
- John Kirkham / @jakirkham
- Juan Nunez-Iglesias / @jni
- Sanket Verma / @MSanKeys963
#### Agenda
* Upcoming alpha release
* Joe's demo of new features: https://gist.github.com/jhamman/8381dd971d928bf220405057107562b1
## May 17, 2024
#### Attendees
- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
- Davis Bennett / @d-v-b
- Max Jones / @maxrjones
#### Agenda
- Outstanding design topics for 3.0.0.alpha - https://github.com/zarr-developers/zarr-python/issues?q=is%3Aopen+is%3Aissue+label%3A%22design+discussion%22
- Additional topics to consider before 3.0.0 (more deprecations may be desired)
- synchronizers? or move to design topic
- move sync.py to new module
- object arrays? need a plan here
- open issue
- meta_array
- assign to nvidia folks
- maybe move to config
- consolidated metadata (v2 and v3)
- joe to take on
- no support for v3
- write_empty_chunks
- runtime array configuration
-
- Test sprint soon?
#### Notes
- release alpha next week, need top level api
- chunks attribute
- for now, regular chunk grid
- indexing
- oindex, vindex, integer, ...
- https://zarr.readthedocs.io/en/stable/tutorial.html#indexing-with-coordinate-arrays
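The difference between the orthogonal (`oindex`) and vectorized (`vindex`) indexing modes mentioned above can be shown on plain nested lists. This is a pure-Python illustration of the two selection rules, not zarr's API.

```python
# A 4x4 grid where each value encodes its position as 10 * row + column.
grid = [[10 * r + c for c in range(4)] for r in range(4)]
rows, cols = [0, 2], [1, 3]

# Orthogonal (outer) indexing: every selected row crossed with every selected column.
orthogonal = [[grid[r][c] for c in cols] for r in rows]

# Vectorized (coordinate) indexing: pair the index lists element by element.
vectorized = [grid[r][c] for r, c in zip(rows, cols)]

print(orthogonal)  # [[1, 3], [21, 23]]
print(vectorized)  # [1, 23]
```

The same two index lists give a 2x2 block in the orthogonal case and two individual points in the vectorized case.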
## May 8, 2024
#### Attendees
- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Norman Rzepka / @normanrz
- Sanket Verma
- Alden Keefe Sampson (AKS)
- Akshay Subramaniam
- John Kirkham
Notes:
- Progress update ([project board](https://github.com/orgs/zarr-developers/projects/5/views/2))
- numcodecs codecs: [numcodecs#524](https://github.com/zarr-developers/numcodecs/pull/524)
- zstd in numcodecs needs a review: [numcodecs#519](https://github.com/zarr-developers/numcodecs/pull/519)
- `HybridCodecPipeline` (interleaved with configurable batch size) needs a review: [#1670](https://github.com/zarr-developers/zarr-python/pull/1670)
- Runtime configuration? [#1772](https://github.com/zarr-developers/zarr-python/pull/1772)
- Batched store discussion
- Store metadata methods: [zarr-python#1851](https://github.com/zarr-developers/zarr-python/pull/1851)
- Initial `NDBuffer` implementation: [zarr-python#1826](https://github.com/zarr-developers/zarr-python/pull/1826)
- Proposed new meeting times
- week 1: Friday 7a PT
- week 2: Thursday 3p PT
Notes:
Major updates
- JH:
- implicit groups are gone :)
-
- NR: codecs are getting into a good place
- new rev on batched pipeline
- DB:
- out last week, getting back into it
- group tests
- need a decision about removing v2 code paths
- should go now
- AKS:
- open PR generalizing array types
TODOs:
- add tests for ***v2*** and v3 arrays
-
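The "configurable batch size" mentioned for the codec pipeline above can be pictured as encoding chunks a few at a time, with each batch running concurrently. The sketch below is an assumption about the general shape only and does not describe `HybridCodecPipeline`; `batches` and `encode_chunks` are hypothetical helpers, and `zlib` stands in for a real codec.

```python
import asyncio
import zlib
from typing import Iterator


def batches(items: list[bytes], size: int) -> Iterator[list[bytes]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


async def encode_chunks(chunks: list[bytes], batch_size: int) -> list[bytes]:
    encoded: list[bytes] = []
    for batch in batches(chunks, batch_size):
        # Within a batch, run the compressor in worker threads concurrently.
        jobs = (asyncio.to_thread(zlib.compress, chunk) for chunk in batch)
        encoded.extend(await asyncio.gather(*jobs))
    return encoded


chunks = [bytes([i]) * 1000 for i in range(10)]
encoded = asyncio.run(encode_chunks(chunks, batch_size=4))
assert [zlib.decompress(e) for e in encoded] == chunks
```

A small batch size bounds memory use and the number of in-flight tasks, while a large one allows more overlap; the right value depends on the workload.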
## April 24, 2024
#### Attendees
- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Jack Kelly / @jackkelly
- Ryan Abernathey / @rabernat
- Max Jones / @maxrjones
- Sanket Verma
- Akshay Subramaniam
- Norman Rzepka / @normanrz
Notes:
- Progress update ([project board](https://github.com/orgs/zarr-developers/projects/5/views/2))
- Codecs
- Norman needs a review on [#1670](https://github.com/zarr-developers/zarr-python/pull/1670)
- Store API [#1806](https://github.com/zarr-developers/zarr-python/issues/1806)
- discussion around batch vs interleave API
- someone could look at https://github.com/zarr-developers/zarr-python/pull/1661
-
- Group API
## April 22, 2024
#### Attendees
- Joe Hamman / @jhamman
- Norman Rzepka / @normanrz
- Josh Moore / @joshmoore
- Sanket Verma
- John Kirkham
- Martin Durant
- Ryan Abernathey
- Davis Bennett
- Akshay Subramaniam
Excused: Juan Nuñez-Iglesias
#### Goals
1. Make sure we're all on the same page with what has been going on with the project
2. Organize around v3 efforts
#### Agenda
- Recent efforts
- [Updates to core team](https://github.com/zarr-developers/zarr-python/blob/main/TEAM.md) (JH)
- Moved some to emeritus status, etc.
- We should work to get more core devs. (Lots to do)
- RA: candidates? JH: let's get people making commits for a while.
- meeting (JM)
- propose to make it the regular meeting but find a time where everyone can join.
- all aboard :tada:
- Zarr-Python 3 update (JH)
- [Design doc](https://github.com/zarr-developers/zarr-python/blob/main/v3-roadmap-and-design.md)
- Progress update ([project board](https://github.com/orgs/zarr-developers/projects/5/views/2) + notes below 👇)
- Ambitious release schedule ([#1777](https://github.com/zarr-developers/zarr-python/issues/1777))
- loose plan: May/alpha, June/release
- roughly following the Pydantic 2 model (breaking API changes)
- JM: all for getting pre-releases out
- NR: need to get rid of the v3 folder (messes things up)
- `support/v2` branch may be useful (JM)
- JM: we probably should be pushier
- JH: sure, just our 100% confidence may lag
- NR: define a window, e.g., by the end of the year everyone should move.
- JH: pinned issue with the release plan? Yes.
- https://github.com/zarr-developers/zarr-python/issues/1777
- v3 update
- DB:
- v3 **metadata** is done, i.e., can create spec compliant v3 arrays
- working on **groups** that would work as expected, e.g., listing children (one of 2 big PRs). nearly done. required getting into the async implementation which is one of the biggest changes for the storage layer. Also means that we're not able to just paste in old code.
- high-level **convenience** APIs are not there
- only the nucleus of a testing strategy. using a different strategy from v2. bringing in what we can.
- NR:
- codecs are pretty advanced. async ...
- MD: that means thread pools? NR: Yes. They can choose how they do that.
- JH: core part of that is in the v3 release that will spend time on async/threading/scheduling. Lots of new behavior that we're going to learn about. But now we have an API that can be tuned.
- arrays are missing **indexing**
- **documentation** is largely open. (pushed to post alpha for the launch)
- JH:
- on our way towards having 100% **type hints**
- abstract **base classes** for the Store and Codecs that allow people to implement their own (outside of zarr-python) with an entrypoint system. perhaps something for **chunk grids** as well
- store is no longer a mutable mapping but a custom class. list methods are async generators, etc. all synchronization happens upstream. Synchronize wrapper of Arrays and Groups, but wait until you're at the top-level API for sync.
- build is cleaner. using **hatch**.
- need to discuss **numcodecs**. currently isn't taking part in the protocol system. what does it mean to Zarr going forward?
- https://github.com/zarr-developers/numcodecs/issues/502
- Discussion
- JK: documented path for upgrading? JH: no, there's an open ticket. Need docs on upgrading code but also **migrating data**, e.g., metadata only changes. (Alistair did this for v1->v2).
- MD: need to discuss what **kerchunk** is going to do. it will take some pretty deep working. the style of where the metadata is has changed (along with the codecs). JH: yeah, no filters, all one pipeline. i.e., just the metadata. MD: more involved (i.e. goes deeper) into Zarr than other things. RA: zarr data model that is independent of the spec could be super useful. DB: don't think there is an overlap of v2 and v3 arrays. i.e., it would be a UNION. you need to map between the names and the types. don't get that for free just with the hierarchies. RA: don't do it once off, but build something re-usable and then serialize those. JH: clearly separated metadata from the classes. can turn one dataclass into another dataclass. (work to be done)
- DB: spec allows **v2/v3 things to be mixed**, so a coroutine of some form may need to be opinionated about what it prioritizes. JH: good point, since you might have to look for 4 things, or prioritize one or the other. we should just be clear and then let people suffer the consequences. RA: have some shim functions similar to the current `open()` which keeps things working. JH: zarr.open has a version flag. None could mean do both.
- **implicit groups** (DB) basically anything is a group even if it has garbage in it. NR: haven't seen anyone who is against removing it.
- DB: if so, also make mixing versions disallowed? JM: can we allow a complete mixing? JH: don't want to be polymorphic about children. DB: can't **forbid** having .zattrs in a v3 group/array. Agreed.
- JM: if need be, can try to organize a ZIC meeting with SV.
- Numcodecs (NR)
- Opened PR today if someone wants to review that, but more generally where are we with numcodecs
- have specialized codec classes in v3 branch. arrays-to-bytes, etc. etc. Different classes from in numcodecs. Do use it under the hood.
- for v2 support in the v3 branch we use the code unchanged. we ask numcodecs to do it for us. we could pull that into the v3 arrays which would give us support for batching, async, etc. we will likely need some glue code. (that's with minimal effort). Do we move numcodecs in a direction such that it uses v3 abstract classes.
- DB: I like the idea of there being a repo on github for people to go to. numcodecs should exist where we have these compression routines at a low level. It should be there to support zarr.
- NR: **closed list**. How do we handle that?
- DB: spec says that it just needs a URL.
- NR: what if two implement blosc2 differently.
- DB: people are going to do what they want.
- NR: make it more difficult? or use the github URL to prefix?
- JH: raise a warning that users can turn off if they aren't using an approved list of codecs. for experimentation, we definitely want to make it possible (and **easy**). That's what zarr-python is known for.
- DB: what's the advantage of enumerating a list of codecs?
- NR: when creating an implementation, you can just follow the list.
- JM: allows us to say, "this implementation is not complete". plus also :+1: for a schema where possible.
- NR: ok to have optional codecs
- JH: open issue with `tifffile` of a missing default flag (size parameter?)
- https://github.com/cgohlke/tifffile/issues/211
- RA: a way out of this is to outsource as much as we can with blosc, has an ambition of being a meta compressor
- AS: the issue with blosc as the main library is that it also has sharding etc. under the hood. better IMO to just expose the compression stream formats. blosc is less flexible than numcodecs currently. (more difficult to add new compressors or options)
- AS: gzip links to RFC not an implementation, i.e., a specific stream format. this is also an issue with numcodecs lz4. would be good to have these written down and link to a **spec**.
- Continue conversation on https://github.com/zarr-developers/numcodecs/issues/502
- [NASA Funding Opportunity](https://nspires.nasaprs.com/external/solicitations/summary.do?solId=%7B910CC61E-4616-9958-C26F-F8D9BC5AB8D9%7D&path=&method=init) (JH)
- Planning to submit a LOI next week
- targeting Zarr-Python, v3 feature development, and support at least 3 years
- TODOs
- [ ] find a new time for the bi-weekly meeting. becomes zarr-python dev meeting but open invitation to anyone who would want to join.
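The point above about mixed v2/v3 hierarchies and a version flag where `None` means "do both" can be sketched as follows. This is a minimal illustration, not the zarr-python API: `detect_format` is a hypothetical helper, the store is a plain dict, and preferring v3 when both kinds of metadata are present is an arbitrary choice made only for the example. The metadata document names (`zarr.json` for v3, `.zarray` / `.zgroup` for v2) come from the specs.

```python
from typing import Mapping, Optional

V3_KEY = "zarr.json"              # v3: one metadata document per node
V2_KEYS = (".zarray", ".zgroup")  # v2: array or group metadata document


def detect_format(
    store: Mapping[str, bytes], path: str = "", zarr_format: Optional[int] = None
) -> int:
    """Return 2 or 3 for the node at `path` (hypothetical helper)."""
    prefix = f"{path}/" if path else ""
    found = {
        3: prefix + V3_KEY in store,
        2: any(prefix + key in store for key in V2_KEYS),
    }
    # None means "look for both"; this sketch is opinionated and tries v3 first.
    candidates = (3, 2) if zarr_format is None else (zarr_format,)
    for fmt in candidates:
        if found.get(fmt, False):
            return fmt
    raise FileNotFoundError(f"no metadata for formats {candidates} at {path!r}")


store = {"a/zarr.json": b"{}", "a/.zattrs": b"{}", "b/.zarray": b"{}"}
print(detect_format(store, "a"), detect_format(store, "b"))
```

Passing an explicit `zarr_format` skips the other lookup entirely, which is the "be clear and let people suffer the consequences" behaviour discussed above.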
#### Notes
## April 10, 2024
- SV: Zarr-Python B&P meeting discontinued - rename refactor meeting to 'Zarr-Python meeting'?
- JH: updates since last meeting
- [Project board](https://github.com/orgs/zarr-developers/projects/5/views/2)
- Many new issues!
- Discuss timeline [#1777](https://github.com/zarr-developers/zarr-python/issues/1777)
Active work
- DB: removing old v3
- JH: move v3 dir to root, remove v2 stuff
- JH: list_* (AsyncGenerator[str])
- stream through `members`: https://github.com/zarr-developers/zarr-python/pull/1782/files#r1558820360
- https://stackoverflow.com/questions/78301926/asyncio-creating-a-producer-consumer-flow-with-async-generator-output
- AS: generalized NDarray support
- two options for the design in https://github.com/zarr-developers/zarr-python/issues/1751
- can we use c++ for this? will make zero-copy memory sharing easier
- Qs: what does it mean for development process and pyodide support?
- MJ: merged CI updates
- looking for the next thing
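For the `list_*` item above (async generators of `str`, streamed through `members`), a producer/consumer flow could look roughly like this. It is a sketch under assumptions: `list_keys`, `produce`, `consume` and `stream_members` are hypothetical names, and upper-casing a key stands in for opening the member.

```python
import asyncio
from typing import AsyncGenerator


async def list_keys(keys: list[str]) -> AsyncGenerator[str, None]:
    """Stand-in for a store `list_*` method: yields keys one at a time."""
    for key in keys:
        await asyncio.sleep(0)  # pretend each key costs an I/O round trip
        yield key


async def produce(queue: asyncio.Queue, keys: list[str]) -> None:
    async for key in list_keys(keys):
        await queue.put(key)
    await queue.put(None)  # sentinel: nothing more to list


async def consume(queue: asyncio.Queue) -> list[str]:
    members = []
    while (key := await queue.get()) is not None:
        members.append(key.upper())  # placeholder for "open the member"
    return members


async def stream_members(keys: list[str]) -> list[str]:
    # A bounded queue gives backpressure: listing cannot run far ahead of opening.
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    _, members = await asyncio.gather(produce(queue, keys), consume(queue))
    return members


print(asyncio.run(stream_members(["a", "b", "c"])))
```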
## Apr 5, 2024
- Davis Bennett / @d-v-b (DB)
- Norman Rzepka / @normanrz (NR)
- Joe Hamman / @jhamman
- Deepak Cherian / @dcherian
### Todo list
:point_down: Note: the topics listed below have been converted to issues and placed on the v3 project board: https://github.com/orgs/zarr-developers/projects/5
p0 - must happen now
p1 - must happen before alpha release (target first week of May)
p2 - must happen before 3.0 release (target early June)
p3 - nice to have, can happen after 3.0 release
* Arrays
* [p1] [Finalize codec API, including codec pipeline](https://github.com/zarr-developers/zarr-python/issues/1659)
* [p1] [Try out codec entry points](https://github.com/zarr-developers/zarr-python/issues/1748)
* [p1] [Array indexing feature parity with zarr-python 2](https://github.com/zarr-developers/zarr-python/issues/1749)
* [p2] [Resolve numcodecs question](https://github.com/zarr-developers/zarr-python/issues/1750)
* [p2] [Generalized array support (numpy, cupy, jax, torch etc)](https://github.com/zarr-developers/zarr-python/issues/1751)
* [p3] [ChunkGrid API -- do we need it now (or ever)?](https://github.com/zarr-developers/zarr-python/issues/1752)
* Groups
* [p1] [implement members / children](https://github.com/zarr-developers/zarr-python/issues/1753)
* [p3] [(reach) declarative hierarchy API](https://github.com/zarr-developers/zarr-python/issues/1754)
* Store
* [p0] [Finalize store API](https://github.com/zarr-developers/zarr-python/issues/1755)
* [p1] [Deprecate stores we don't want in Zarr-Python core anymore](https://github.com/zarr-developers/zarr-python/issues/1756)
* [p1] [remote store support (s3, gcs, azure, http)](https://github.com/zarr-developers/zarr-python/issues/1757)
* [p3] [request coalescing: where to implement?](https://github.com/zarr-developers/zarr-python/issues/1758)
* [p3] [Storage transformer API -- do we need it now (or ever)?](https://github.com/zarr-developers/zarr-python/issues/1718)
* [p3] (reach) http proxy - may not need to be in zarr-python
* [p3] [(reach) caching bytes-chunks in the store, caching array chunks in the array](https://github.com/zarr-developers/zarr-python/issues/1500)
* Tests
* [p1] [Bring in as much of the existing test suite in as possible](https://github.com/zarr-developers/zarr-python/issues/1759)
* [p1] [Test serialization of Arrays](https://github.com/zarr-developers/zarr-python/issues/1760)
* [p1] [Add test that instruments traffic to the store -- we should be very careful to only read what is needed](https://github.com/zarr-developers/zarr-python/issues/1761)
* [p2] [Develop integration test suite in Zarr-Python -- needed to validate new async tooling (could be xarray and Dask)](https://github.com/zarr-developers/zarr-python/issues/1762)
* [p2] [Coordinate downstream testing (e.g. Dask + Xarray)]
* [p3] [Develop performance test suite in Zarr-Python](https://github.com/zarr-developers/zarr-python/issues/1763)
* [p3] [Add hypothesis test hooks](https://github.com/zarr-developers/zarr-python/issues/1764)
* Docs
* [p2] [Update developer docs](https://github.com/zarr-developers/zarr-python/issues/1765)
* [p2] [Update API docs](https://github.com/zarr-developers/zarr-python/issues/1766)
* [p2] [Update tutorial docs](https://github.com/zarr-developers/zarr-python/issues/1767)
* [p2] [Write a doc about how Zarr-Python thinks about consistency and how it operates when concurrent writers are acting on a store/group/array/chunk](https://github.com/zarr-developers/zarr-python/issues/1768)
* [p2] [Start a release doc summarizing the major changes in V3](https://github.com/zarr-developers/zarr-python/issues/1769)
* [p3] [docs for extending zarr](https://github.com/zarr-developers/zarr-python/issues/1770)
* how to write a custom store
* how to subclass array / group (e.g., to have typed attributes, typed members)
* How to get good performance
* Misc
* [p1] [Top level zarr API (open, create)](https://github.com/zarr-developers/zarr-python/issues/1598)
* [p1] [make mypy and pylint/ruff happy](https://github.com/zarr-developers/zarr-python/issues/1593)
* [p1] [Dial in runtime config API -- e.g. IO loop cannot be attached to](https://github.com/zarr-developers/zarr-python/issues/1772)
* [p2] [`TypedDict` for all typed dictionaries](https://github.com/zarr-developers/zarr-python/issues/1773)
* [p2] [develop synchronization API or declare it dropped](https://github.com/zarr-developers/zarr-python/issues/1596)
* [p2] Add logging throughout
* Migration
* [p3] 2 -> 3 conversion cli, (maybe in its own repo)
* [p1] [remove v2 code](https://github.com/zarr-developers/zarr-python/issues/1771)
## March 27, 2024
- Davis Bennett (DB)
- Alden Keefe Sampson (AKS)
- Norman Rzepka / @normanrz (NR)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
- Max Jones / @maxrjones (MJ)
- Raphael Hagen / @norlandrhagen (RH)
### Meeting notes:
- Sanket: Bi-weekly meeting ends on May 1st, 2024 - shall we continue after that?
- Yes! Schedule it until June end! - DONE
- DB: Fleshing out the group API in v3 https://github.com/zarr-developers/zarr-python/pull/1726
- NR: We need to find a common understanding of what we still need to work on for beta release. NR will create a tracking issue.
- Akshay: Generalized array support
- Where to create issue to track this? zarr-python or zarr-specs? Any direction for structuring the issue and proposal?
- Create a native zarr NDArray class for typing and to interface with existing protocols. This includes:
- Buffer protocol
- `__array_interface__`
- `__cuda_array_interface__`
- DLPack
- Raw pointers
```cpp
namespace zarr
{
namespace py = pybind11;
using namespace py::literals;

class Array
{
public:
    Array(zarrArrayInfo_t* array_info, int device_id);
    Array(py::object o, intptr_t cuda_stream = 0);
    py::dict array_interface() const;
    py::dict cuda_interface() const;
    py::tuple shape() const;
    py::tuple strides() const; // Strides of axes in bytes
    py::object dtype() const;
    zarrArrayBufferKind_t getBufferKind() const; // Device or Host buffer
    py::capsule dlpack(py::object stream) const; // Export to DLPack
    py::object cpu(); // Move array to CPU
    py::object cuda(bool synchronize, int device_id) const; // Move array to GPU
    const zarrArrayInfo_t& getArrayInfo() const
    {
        return array_info_;
    };
    static void exportToPython(py::module& m);
};
} // namespace zarr
```
- Interoperability with Numpy
```python
import numpy as np
import zarr  # `zarr.ndarray` is the proposed module, not a released API

ascending = np.arange(0, 4096, dtype=np.int32)
zarray_h = zarr.ndarray.as_array(ascending)
print(ascending.__array_interface__)
print(zarray_h.__array_interface__)
print(zarray_h.__cuda_array_interface__)
print(zarray_h.buffer_size)
print(zarray_h.buffer_kind)
print(zarray_h.ndim)
print(zarray_h.dtype)
```
- Interoperability with Cupy
```python
import cupy as cp

data_gpu = cp.array(ascending)
zarray_d = zarr.ndarray.as_array(data_gpu)
print(data_gpu.__cuda_array_interface__)
print(zarray_d.__cuda_array_interface__)
print(zarray_d.buffer_kind)
print(zarray_d.ndim)
print(zarray_d.dtype)
```
- Convert CPU to GPU
```python
zarray_d_cnv = zarray_h.cuda()
print(zarray_d_cnv.__cuda_array_interface__)
```
- Convert GPU to CPU
```python
zarray_h_cnv = zarray_d.cpu()
print(zarray_h_cnv.__array_interface__)
```
- Anything that supports the buffer protocol
```python
with open('file.txt', "rb") as f:
    text = f.read()
zarray_txt_h = zarr.ndarray.as_array(text)
print(zarray_txt_h.__array_interface__)
zarray_txt_d = zarray_txt_h.cuda()
print(zarray_txt_d.__cuda_array_interface__)
```
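As a rough, standard-library-only illustration of the "anything that supports the buffer protocol" idea above: the wrapper below takes a zero-copy `memoryview` of its input and exposes shape and format information. `HostBuffer` is a hypothetical name and is not the proposed `zarr.ndarray` class; device (GPU) buffers and DLPack are out of scope for this sketch.

```python
from array import array


class HostBuffer:
    """Hypothetical zero-copy view over any object exposing the buffer protocol."""

    def __init__(self, obj) -> None:
        # memoryview() raises TypeError for objects without buffer support.
        self._view = memoryview(obj)

    @property
    def shape(self) -> tuple:
        return self._view.shape

    @property
    def format(self) -> str:
        return self._view.format  # struct-style type code, e.g. "i" or "B"

    @property
    def nbytes(self) -> int:
        return self._view.nbytes

    def __getitem__(self, index):
        return self._view[index]


values = array("i", range(8))
buf = HostBuffer(values)
values[0] = 42  # mutate the source; the wrapper sees it because nothing was copied
print(buf.shape, buf.format, buf[0])

text = HostBuffer(b"zarr")  # bytes objects support the buffer protocol too
print(text.shape, text.format, text.nbytes)
```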
## March 13, 2024
- Joe Hamman / @jhamman (JH)
- Alden Keefe Sampson (AKS)
- Norman Rzepka / @normanrz (NR)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
- Max Jones / @maxrjones (MJ)
- Agriya Khetarpal / @agriyakhetarpal (AK)
### Meeting notes:
- Alden / top level API
- https://github.com/zarr-developers/zarr-python/issues/1598#issuecomment-1994729420
- In the v3 library, are the top level methods
- the primary way users interact with the library? or
- the smoothest on ramp for v2 library users into v3? (and Array.xxx and Group.xxx become primary)
- something else?
- Notes:
- Joe's thought: We want to provide a pretty similar interface, help massage or raise errors when args not compatible. We can start deprecating and changing behavior
- Norman: like Array. and Group entrypoints, but we need to have these top level entry points. promote the Array and Group classmethods in the docs
- Joe: don't love polymorphism of .open, but it exists
- The kwargs to any method that can create an array are currently v2 specific (filters vs codecs, etc), plus there are a number of performance/behavior modifying args (cache_[metadata|attrs], partial_decompress, write_empty_chunks, dim separator). Do we
- try not to change the api at all and try to translate into spec V3 land
- make the kwargs actually match those of the Array.xxx v3 library methods, but also take in `**v2_kwargs` and translate where possible, checking for conflicts with v3 kwargs if provided
- make the top level methods kwargs align with those of Array.create, etc
- Norman: for open: make it compatible, you get back your method, for create could make
- Runtime parameters: many of these didn't know exist, can debate case by case
- Joe: if a runtime param is provided that v3 doesn't use, raise an error
- Currently zarr.open and similar will return an array if it exists, even if the existing array's dtype, codecs, etc don't match those provided.
- Keep this?
- Joe: think we should raise an error, wide agreement on this
- Norman / Batched and interleaved codec pipelines
- https://github.com/zarr-developers/zarr-python/pull/1670
- Hybrid interleaved batched codec pipeline
- How to set runtime configuration?
- Batched store API
- BatchedCodecPipeline as abstract class that can be overridden by user code
- Move thread dispatch from codec to pipeline to allow for coalescing and locality?
- Akshay / Generalized array support
- Open issue and tag with v3
- Sanket / Summary for the core-devs → potential blog post in the future
- JH: I can take this on -- target April?
- SV: Sounds good!
- Agriya / Zarr Pyodide support, out-of-tree
- Requires patching numcodecs, zarr here and there
- Zarr is pure Python, so lesser patches there. Numcodecs needs more patches because it is Cython-based.
- Already done by Pyodide devs per Pyodide/Emscripten release
- The Emscripten and Pyodide versions are not decoupled yet
- Leads to missing versions
- Establish CI job that runs on PRs and nightly — or just nightly
- Issue with this is maintainability and how to keep support?
- Interactive documentation (end goal).
- **Action item**: I will be opening an issue for this on the Zarr repository and link both previous discussions (the ones that I have found). Discussion may proceed there further
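One of the options discussed above for the top-level API (accept v3 kwargs, also accept v2 kwargs, translate where possible, and raise on conflicts or unused runtime parameters) might be sketched like this. `translate_create_kwargs` is a hypothetical shim, the codec values are plain strings used as placeholders, and the sketch ignores details of a real v3 pipeline such as the required array-to-bytes codec.

```python
def translate_create_kwargs(*, codecs=None, compressor=None, filters=None, **runtime):
    """Hypothetical shim: fold v2-style arguments into one v3-style `codecs` list."""
    v2_given = compressor is not None or filters is not None
    if codecs is not None and v2_given:
        raise ValueError("pass either `codecs` or v2 `filters`/`compressor`, not both")
    if runtime:
        # Runtime parameters this sketch does not understand are rejected, not ignored.
        raise TypeError(f"unsupported runtime parameters: {sorted(runtime)}")
    if codecs is None:
        # v2 applied filters first and the compressor last; keep that order.
        codecs = list(filters or [])
        if compressor is not None:
            codecs.append(compressor)
    return {"codecs": list(codecs)}


print(translate_create_kwargs(filters=["delta"], compressor="blosc"))
```

Raising instead of silently dropping arguments matches the "raise an error" preference noted above; a deprecation warning would be a softer alternative.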
## February 28, 2024
- Joe Hamman / @jhamman (JH)
- Tom Nicholas / @TomNicholas
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
- Charles Stern
- Alden Keefe Sampson (AKS)
### Meeting notes:
- Numcodecs discussion
- https://github.com/zarr-developers/numcodecs/issues/502
- How ready is the v3 branch for kerchunk-related experiments?
- i.e. chunk manifest ZEP, virtual concatenation ZEP
- https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA?view
- v3 store discussion
- https://github.com/zarr-developers/zarr-python/discussions/1686
## February 14, 2024
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Sanket Verma / @MSanKeys963 (SV)
- Akshay Subramaniam / @akshaysubr (AS)
### Meeting notes:
- AS: Planning to send a couple PRs around numcodecs and wanted to join the refactor meeting to get the sense of current state of things
- NR: https://github.com/zarr-developers/zarr-python/pull/1660
- AS: Plan to add the encode and decode batch in numcodecs and move the logic from the Zarr-Python to numcodecs
- NR: In the current codec mechanism there will be place to add the encode/decode class
- SV: Also a question of how you'd add a new codec in V3 - https://zarr-specs.readthedocs.io/en/latest/v3/codecs.html
- NR: Could use entrypoints for the new codec registrations
- AS: New codecs are added via KwickIO
- AS: https://github.com/zarr-developers/zarr-python/issues/1398
## January 31, 2024
### Attendees
- Sanket Verma / @MSanKeys963 (SV)
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Max Jones / @maxrjones (MJ)
- Alden Keefe Sampson / @aldenks (AS)
- Raphael Hagen / @norlandrhagen (RH)
- Charles Stern / @cisaacstern (CS)
- Jeremy Maitin-Shepard / @jbms (JMS)
### Meeting notes:
- NR: Codec pipeline
- Open question: merging partial and full versions?
- Next - reading/writing partial chunks for uncompressed data
- DB: Saransh helped with hatch and source layout updates
- providing a review on packaging PRs: https://github.com/zarr-developers/zarr-python/pull/1592
- new branch for V3 work
- removing attrs
- using frozen dataclasses
- relies on handling to/from dict in each class with validation functions
- MJ: No updates, participating in Joe's virtual sprint on Zarr refactor
- can test out setting up test env with Hatch, provide feedback
- AS: Setup on dev environment, still intending to work on [high-level methods](https://github.com/zarr-developers/zarr-python/issues/1598).
- Also adding setup/dev environment doc improvements to https://github.com/zarr-developers/zarr-python/pull/1643
- RH: No updates
- CS: Interested in participating in Zarr sprint remotely
- JMS: No updates, analogous decisions in tensorstore
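The "frozen dataclasses with to/from dict and validation functions" approach mentioned above might look like the following. The class and its three fields are an illustrative subset chosen for this sketch, not the metadata classes on the v3 branch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArrayMetadata:
    """Illustrative subset of array metadata; field names are assumptions."""

    shape: tuple
    chunk_shape: tuple
    data_type: str

    def __post_init__(self) -> None:
        # Validation runs on every construction, including via replace().
        if len(self.shape) != len(self.chunk_shape):
            raise ValueError("shape and chunk_shape must have the same length")
        if any(c <= 0 for c in self.chunk_shape):
            raise ValueError("chunk sizes must be positive")

    @classmethod
    def from_dict(cls, data: dict) -> "ArrayMetadata":
        return cls(
            shape=tuple(data["shape"]),
            chunk_shape=tuple(data["chunk_shape"]),
            data_type=str(data["data_type"]),
        )

    def to_dict(self) -> dict:
        # Lists rather than tuples so the result is JSON-friendly.
        return {
            "shape": list(self.shape),
            "chunk_shape": list(self.chunk_shape),
            "data_type": self.data_type,
        }


doc = {"shape": [10, 10], "chunk_shape": [5, 5], "data_type": "int32"}
meta = ArrayMetadata.from_dict(doc)
resized = replace(meta, shape=(20, 10))  # frozen: changes produce a new object
print(meta.to_dict() == doc, resized.shape)
```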
## January 17, 2024
### Attendees
- Sanket Verma / @MSanKeys963 (SV)
- Joe Hamman / @jhamman (JH)
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Max Jones / @maxrjones (MJ)
- Alden Keefe Sampson / @aldenks (AS)
- Raphael Hagen / @norlandrhagen (RH)
### Meeting notes:
- SV: plans for numcodecs going forward
- TODO: connect with JK about this
- JH: made some good progress on the Store API
- Thinking about what to do when keys are missing, `raise KeyError` or `return None`
- Needs work: `getsize`, `move`, `tree`, `rmdir`, `open`, `close`
- NR: move should only exist on a store if it's cheap
- Open questions:
- `Store.list_*` could change to return async generators
- NR: working on codec api, removing array metadata
- not 100% happy with the API yet
- new methods: `evolve` and `validate` - check if the codec matches
- looking for input here https://github.com/zarr-developers/zarr-python/pull/1632
- DB: working on a messy / unmergeable PR for the Array API
- end goal: unify array/group apis for v2 and v3
- added a new directory with v2 and v3 metadata
- stuck on dataclass inheritance
- JH: where will the normalization of metadata keys
- MJ: not much, going to pick up the hatch PR
- JH: yay!
- DB: flag issue around imported modules from tests https://github.com/zarr-developers/zarr-python/pull/1601
- AKS: playing with zarr in rust
- very fast!
- going to pick up the top level api this week
- RH: No updates at this time
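To make the Store API questions above concrete, here is a toy async store that takes the `return None` option for missing keys and exposes listing as an async generator. `MemoryStore` is a hypothetical stand-in written for this note, not the zarr-python `Store` class; synchronization happens once, at the outermost call.

```python
import asyncio
from typing import AsyncGenerator, Optional


class MemoryStore:
    """Toy async store; a hypothetical stand-in, not the zarr-python Store class."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    async def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    async def get(self, key: str) -> Optional[bytes]:
        # The `return None` option: a missing key is not an error.
        return self._data.get(key)

    async def list_prefix(self, prefix: str) -> AsyncGenerator[str, None]:
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key


async def demo() -> tuple:
    store = MemoryStore()
    await store.set("foo/zarr.json", b"{}")
    await store.set("foo/c/0", b"\x00")
    missing = await store.get("nope")
    listed = [key async for key in store.list_prefix("foo/")]
    return missing, listed


# Only the outermost caller leaves async land.
print(asyncio.run(demo()))
```

With `raise KeyError` instead, every caller that probes for optional metadata would need a `try`/`except`; returning `None` keeps those call sites short at the cost of an explicit `is None` check.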
### Discussion
What goes in the beta release
1. core array, group, and store api
2. thesis: almost feature complete but the api should be set
- we want people to start using it so we can get some feedback
3.
## January 3, 2024
### Attendees
- Sanket Verma / @MSanKeys963 (SV)
- Alden Keefe Sampson / @aldenks (AS)
- Joe Hamman / @jhamman (JH)
- Norman Rzepka / @normanrz (NR)
- Davis Bennett / @d-v-b (DB)
- Max Jones / @maxrjones (MJ)
### Meeting notes:
- JH: Still working on Group and Store APIs
- NR: Left off with codec api, sharding api, and sharding layouts
- DB: Still working on array api
- considering a major change to indexing/slicing api (slicing a Zarr array gives NumPy array, which is weird and should give out a Zarr array)
- thinking about serialization of nested objects
- partial writes
- MJ: Looking for a place to jump in
- Refactor metadata objects (e.g. ChunkGrid, ChunkKeyEncoding)
- Remove attrs and refactor (de)serialization
- Hatch PR https://github.com/zarr-developers/zarr-python/pull/1592
- AKS - may want jump in on the top level `zarr.foo*` api
## December 20, 2023
### Attendees
- Charles Stern / @cisaacstern (CS)
- Jack Kelly / @JackKelly (JK)
- Sanket Verma / @MSanKeys963 (SV)
- Alden Keefe Sampson / @aldenks (AS)
- Joe Hamman / @jhamman (JH)
### Meeting notes:
- CS: nothing directly on zarr, keeping an eye on the zarr issues with help wanted tags
- SV: looking at hatch pr from davis, zep 0 revisions, and zarr paper
- JK: working on a [light-speed-io](https://github.com/jackKelly/light-speed-io/) project (rust), playing around with ideas for fast data loading
- AS: seems too early to jump in, don't want to get in the way
- JH: Many things in progress: Store, Codecs, Arrays, Groups
## December 6, 2023
### Attendees
- Ryan Abernathey / @rabernat
- Joe Hamman / @jhamman (JH)
- Charles Stern / @cisaacstern (CS)
- Jack Kelly / @JackKelly (JK)
- Sanket Verma / @MSanKeys963 (SV)
- Davis Bennett (DB) / @d-v-b
- Raphael Hagen / @norlandrhagen
- Alden Keefe Sampson / @aldenks
- Norman Rzepka / @normanrz
### Meeting notes:
- Design doc for v3.0 has moved to GitHub. If you want to comment on the design then comment on [the GitHub pull request]( https://github.com/zarr-developers/zarr-python/pull/1583).
- Zarrita: Alistair started it as a place to experiment with the Zarr v3 spec (back in July 2020). Norman picked Zarrita up in Summer.
- We've taken Zarrita, copied it, renamed & refactored things. There's a new Store interface (we're leaving behind the mutable mapping interface of Zarr stores.) Aiming for 100% coverage of static type checking.
- Norman, Davis & Joe have been together for the last 3 days (in Berlin). JH has been working on the Group API (compatibility with Zarr-Python's group API). See [this PR](https://github.com/zarr-developers/zarr-python/pull/1590). For v2 there are two metadata docs which describe a group. Reads now happen async: which cuts the loading time for groups in half. Now working on listing contents of a group.
- DB: In Zarrita we have representations of arrays for v2 and v3. DB has been working on a uniform interface to v2 and v3. Breaking lots of things :). Looking at how codecs decompress & compress. Overall strategy is to use the v3 way of doing things. See [DB's PR](https://github.com/zarr-developers/zarr-python/pull/1589).
- RA: It's great that there are both async and sync APIs. But downstream data science libraries will always want the sync API.
- NR: The Zarrita code is based on fsspec, with some small changes.
- JH: We're just using `fsspec` (for now). Very convinced that having an async API is the right call for Zarr-Python. Less convinced that the `fsspec` way of doing things will be the long-term solution.
- NR: Adding sharding strategies. Customise how chunks are laid out in the file. e.g. if you want to do partial writes (where you can write to specific places). Instead of having to write entire shards at a time. Writing tests right now. First PR has been merged into the v3 branch (codecs).
- JH: The codecs are now an entry point into the Zarr-Python code. Zarr-Python v2 basically supports any codec in numcodecs. Do we need to register _all_ of those compressors and filters as codecs? Or should we limit them?
- NR: Let's make a generic codec. bytes-to-bytes, and array-to-bytes.
- JH: For now, we've decided not to work on variable chunk sizes. We could release a version of Zarr-Python without variable chunk sizes. Questions?
- RA: Everyone's very supportive! This is what we need to get over the rut that zarr-python has been stuck in. A lot of folks would like to help, but don't know _where_. Are there concrete tasks that we can give to people? (The answer may be "no"! Some software projects are just not parallelizable like that.)
- JH: Some of these first blocks of work have required us to already resolve conflicts. I'll jot down a couple of tasks which could be useful for folks to work on.
- the top-level API has not been ported over yet (e.g. `zarr.open()`). Most people use that top-level API.
- documentation! A lot of copy-and-pasting from Zarr-Python v2.0. But some function signatures will change.
- type checking needs work! MyPy isn't happy right now.
- DB: v3 introduces the concept of a codec pipeline. We build an object which is a sequence of transformations of chunk data, which leads to it being stored, or - in reverse - leads to it being opened. The documentation for this doesn't exist yet. If anyone has an idea for a transformation, then work through the process of doing this with the v3 machinery, and write up some docs for how to do this. v3 is a lot more explicit about how chunks are encoded.
- NR: +1 to adding codecs, or wrapping numcodecs. Also:
- try wiring the new Zarr-Python to downstream libraries. That would tell us what's missing in terms of the public API.
- Also, it'd be useful to have a champion for variable chunking. The codec pipeline should know about variable chunking.
- RA: The problem with the ZEP process is that it's hard for ZEP authors to just implement the ZEP.
- JH: We should be able to find byte-sized chunks which folks can work on.
- NR: It'd be great to keep up the momentum after this week. And it'd be great to have a beta by Jan 2024!
- RA: Where is the discussion of the ZEP3 proposal ([here's the PoC implementation PR](https://github.com/zarr-developers/zarr-python/pull/1483))? And the discussion is [here](https://github.com/orgs/zarr-developers/discussions/52).
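The codec pipeline idea DB describes above (a sequence of transformations applied in order when storing, and in reverse when opening) can be sketched with toy bytes-to-bytes codecs. The class names are hypothetical and the example uses only `zlib` from the standard library; it does not model array-to-array or array-to-bytes codecs.

```python
import zlib


class ZlibCodec:
    """Toy bytes-to-bytes codec backed by the standard library."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class ReverseCodec:
    """Toy transformation: reverse the byte order of the whole chunk."""

    def encode(self, data: bytes) -> bytes:
        return data[::-1]

    def decode(self, data: bytes) -> bytes:
        return data[::-1]


class CodecPipeline:
    """Hypothetical pipeline: encode in order, decode in reverse order."""

    def __init__(self, codecs) -> None:
        self.codecs = list(codecs)

    def encode(self, chunk: bytes) -> bytes:
        for codec in self.codecs:
            chunk = codec.encode(chunk)
        return chunk

    def decode(self, stored: bytes) -> bytes:
        for codec in reversed(self.codecs):
            stored = codec.decode(stored)
        return stored


pipeline = CodecPipeline([ReverseCodec(), ZlibCodec()])
chunk = bytes(range(16)) * 4
stored = pipeline.encode(chunk)
print(pipeline.decode(stored) == chunk)
```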
Looking for a champion on:
- variable chunking
- synchronizers
- h5py compat
## Agenda
- Update from Joe + Davis on refactor progress
## November 22, 2023
### Attendees
- Joe Hamman / @jhamman (JH)
- Charles Stern / @cisaacstern (CS)
- Jack Kelly / @JackKelly (JK)
- Sanket Verma / @MSanKeys963 (SV)
- Davis Bennett (DB)
## Agenda
- Zarr-Python 3.0 design doc: https://hackmd.io/0DVKP6d9QI-VaHc0zvOuxw
### Meeting notes:
- JH: We can start working from store interface - kind of leaky abstraction
- JK: Looking to read million chunks - sharding helps with that - discussions around batching requests in Zarr-Python - requesting million chunks in a single request - if Zarr V3 is a good place to pull in these performance bumps? (don't want to delay the existing work)
- DB: To make `get` more efficient, you need to wrap it in something - mostly users are getting multiple chunks at a time
- JH: In Dask/Xarray world you map a single chunk of Zarr at a time - At Earthmover there is 1-to-1 reads - to handle big size chunks we have rechunker - sharding codec sits above the store interface
- JH: https://github.com/scalableminds/zarrita/blob/async/zarrita/sharding.py#L309 - indexing for sharding - the sharding codec will need access to store API whereas the other codecs don't need it
- DB: Like the idea - add a new abstraction - we have leaky abstraction and we can use it
- JH: Norman is willing to help but only if Zarr-Python is first class citizen
- JH: https://docs.xarray.dev/en/stable/roadmap.html - publish the roadmap on [Zarr-Python docs](https://zarr.readthedocs.io/) for the community
- JH: Jack can help us in fast concurrent loading problem
- JH: Meeting with Davis and Norman in 1.5 weeks to work on Zarr-Python 3.0
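On DB's point that `get` needs to be wrapped in something because users fetch multiple chunks at a time: one possible wrapper issues the requests concurrently with a cap on how many are in flight. `get_many` and `fake_get` are hypothetical names for this sketch, and the sleep stands in for network latency.

```python
import asyncio


async def get_many(get, keys, max_concurrency: int = 8) -> dict:
    """Fetch many keys concurrently through a single-key async `get` callable."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch(key):
        async with semaphore:  # cap the number of in-flight requests
            return key, await get(key)

    pairs = await asyncio.gather(*(fetch(key) for key in keys))
    return dict(pairs)


async def fake_get(key: str) -> bytes:
    await asyncio.sleep(0.01)  # stand-in for a network round trip
    return key.encode()


chunks = asyncio.run(get_many(fake_get, [f"c/{i}" for i in range(4)]))
print(sorted(chunks))
```

A store that supports true batched requests could replace the inner loop with a single call; the wrapper only assumes a single-key `get`.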
## November 8, 2023
### Attendees
- Joe Hamman / @jhamman
- Charles Stern /
- Raphael Hagen / @norlandrhagen
- Sanket Verma / @MSanKeys963
- Martin Durant /
## Agenda
- Zarr-Python 3.0 design doc: https://hackmd.io/0DVKP6d9QI-VaHc0zvOuxw
- how would batching work across arrays
- use pydantic zarr
- other dependencies
- python 3.9 - drop in dec.?
- which sharding impl
## November 1, 2023
### Attendees
- Joe Hamman / @jhamman
- Davis Bennett / @d-v-b
- Sanket Verma / @msankeys963
- Raphael Hagen / @norlandrhagen
- Charles Stern
- Brian Davis
- Thomas Nicholas
## Agenda
- Request - can someone try to take some notes today?
- Request - can we move this meeting time to 8:30a PT (currently at 9a PT).
- V3 API migration
- Now that we are starting to work on implementing v3, we're faced with the question of what to do with the existing API
- Observation: the current v2/v3 polymorphism is unsustainable (and incorrectly prioritizes v2 internally)
- Proposal - we create a v3 namespace within zarr-python where we can develop in an isolated space toward a complete v3-spec implementation
- Included in this namespace:
- classes: `zarr.v3.{Array,Group,Store}`
- These classes implement an internal api that closely aligns with the v3 spec
-
- high level functions: `zarr.v3.{create, open, ...}`
- As much as possible, these functions should look and feel like the v2 equivalents but should not be tied to the exact implementation
- e.g. `zarr.create(shape=..., dtype=..., compressor=...) -> zarr.create(shape=..., data_type=..., codecs=..., attributes=...)`
- We may also want to deprecate and/or rename some of the existing top level functions
- backward compatibility:
- high-level functions in the v3 namespace should be able to `create` or `open` a v2 dataset
- The `Group` or `Array` does not need to be backward compatible though.
- All development toward v3 happens on the `main` branch in zarr-python
- Alternative proposal
- We avoid the `v3` namespace and instead take over the primary namespace in a development branch (e.g. `v3`)
- When we feel that the `v3` branch is complete, we merge to main and make a `3.0` release
- Folks have less time to test out the v3 implementation but we have a cleaner development process
- Ideas
- Idea of zarr array is to look like a numpy array
- could move all the zarr array details to a polymorphic metadata object
- trim things down to just the minimal array api interface
- declarative hierarchy specification
- type hints
-
#### Sanket Notes
- DB: Definition of Zarr and Dask chunks are different and that's not good
- JH: Benefits of generative chunk indexing
- Impacts with sharding, variable chunking and other shiny feature
- Large array with billions of chunks
- JH: Maintaining both V2 and V3 at the same time is not ideal
- DB: V2 has a lot of stuff that people don't use - stores
- TN: The current public facing APIs (V2 and V3) are conformant to the existing spec - but we're thinking of working on a new public facing API which is a wrapper of V2 and V3, and not conformant to V3 exactly - isn't that a bad thing?
- DB: The public-facing Zarr array object API is not covered by the spec anyway
- Also can't be, because syntax might be language-dependent
- Therefore we have full freedom in the public python API of the python zarr array type
- TN: Okay, in that case makes sense to follow python array API standard as much as possible
- TN: Array API has granular functionality which is super useful (e.g. you can say "we don't support the statistical functions")
- TN: Note that chunking is not part of the array API standard
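JH's point about generative chunk indexing for arrays with billions of chunks can be illustrated with a lazy key generator for a regular grid. `iter_chunk_keys` is a hypothetical helper; the `c` prefix and `/` separator follow the default v3 chunk key encoding, and variable chunking and sharding are not handled.

```python
from itertools import islice, product


def iter_chunk_keys(shape, chunk_shape, separator: str = "/"):
    """Lazily yield chunk keys for a regular grid, one key at a time."""
    # Integer ceiling division gives the number of chunks along each axis.
    grid = [range(-(-size // chunk)) for size, chunk in zip(shape, chunk_shape)]
    # product() holds each per-axis range but never the full cross product.
    for index in product(*grid):
        yield separator.join(("c", *map(str, index)))


small = list(iter_chunk_keys((4, 6), (2, 3)))
print(small)

# 10**10 chunks in total, but only the first three keys are ever built.
first = list(islice(iter_chunk_keys((10**6, 10**6), (10, 10)), 3))
print(first)
```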
## October 18, 2023
### Attendees
- Joe Hamman / @jhamman
- Max Jones / @maxrjones
- Davis Bennett / @d-v-b
- Tom Nicholas
- Charles Stern
- Sanket Verma
- Ryan Abernathey
## Agenda
- Proposal: just use Zarrita :)
- 0.1% done: https://github.com/jhamman/zarr-python/pull/1
- Ryan added memorystore to Zarrita: https://github.com/scalableminds/zarrita/pull/12
## September 20, 2023
### Attendees
- Joe Hamman / @jhamman
- Charles Stern / @cisaacstern
- Sanket Verma / @MSanKeys963
- Raphael Hagen / @norlandrhagen
## Agenda
- Review ZEP 6 proposal and proposed implementation
- https://github.com/zarr-developers/zeps/pull/46
- https://github.com/zarr-developers/zarr-python/pull/1526
- Goal with ZEP6 in Zarr-Python
- Clean up interface for Group/Array constructors from V2/V3 metadata
- Use ZOMs internally as part of the migration to V3 spec
- Use ZOMs in array/group constructors to consolidate initialization reads/writes
- https://github.com/zarr-developers/zarr-python/issues/538 (repeated writes to set attrs)
- https://github.com/pangeo-data/pangeo-eosc/issues/39 (many contains/iter calls)
- Expose ZOMs to third parties
## September 6, 2023
### Attendees
- Joe Hamman / @jhamman
- Ryan Abernathey / @rabernat
- Sanket Verma / @msankeys963
- Raphael Hagen / @norlandrhagen
- Ryan Williams
- Charles Stern / @cisaacstern
- Davis Bennett / @d-v-b
## Agenda
- review scoping section (below)
- performance
- zarr + pydantic (https://github.com/janelia-cellmap/pydantic-zarr)
- observation: Zarr-python is missing specific data models for Groups / Arrays
- price of depending on pydantic is probably not worth it
## Scoping V3 update (by @jhamman)
_Written by @jhamman on September 5, 2023_
In the Winter and Spring of 2022, while the V3 spec was still under development, an experimental V3 implementation was added to the Zarr-Python codebase ([#898](https://github.com/zarr-developers/zarr-python/pull/898)). This implementation followed the spec, as it was written at the time. However, in the months following these developments, major changes to the spec were made. This has left Zarr-Python out of sync with the V3 specification.
### Summary of current status
1. V3 support is behind an experimental API (accessed by passing `zarr_version=3` and setting the environment variable `ZARR_V3_EXPERIMENTAL_API=1`).
2. A separate code path for V3 stores was implemented in `zarr._storage.v3`.
Major changes to the spec since the experimental implementation include:
- Entrypoint metadata document (`zarr.json`) is no longer required
- Metadata keys were renamed (e.g. `meta/foo/bar.group.json -> /foo/bar/zarr.json`)
- Group and array metadata documents are no longer distinguished by their key name (everything is `zarr.json` and a `node_type` field is included in all documents)
- Various updates to metadata fields:
- `format_version` → `int`
- added `dimension_names`
- removed `chunk_memory_layout` (in favor of transpose codec)
- `codecs` now includes a list of codecs that was previously split between the `filters` and `compressor` fields
- etc.
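To make these changes concrete, here is a minimal sketch of what the updated metadata might look like, built as plain Python dictionaries. The helper names (`metadata_key`, `array_metadata`, `group_metadata`) are hypothetical. Field names beyond those listed above (for example `zarr_format`, `chunk_grid`, `chunk_key_encoding`) and the codec configurations are assumptions based on one reading of the V3 core spec and should be checked against it.

```python
import json


def metadata_key(node_path: str) -> str:
    """Map a node path to its metadata key; every node uses the name zarr.json."""
    # Dropping the leading slash is an assumption made for this sketch.
    node_path = node_path.strip("/")
    return f"{node_path}/zarr.json" if node_path else "zarr.json"


def array_metadata(shape, chunk_shape, dimension_names):
    """Build an illustrative array metadata document."""
    return {
        "zarr_format": 3,      # an int, not a string (assumed field name)
        "node_type": "array",  # distinguishes arrays from groups
        "shape": list(shape),
        "data_type": "float64",
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": list(chunk_shape)},
        },
        "chunk_key_encoding": {
            "name": "default",
            "configuration": {"separator": "/"},
        },
        "fill_value": 0.0,
        # One ordered list replaces the separate `filters` and `compressor` fields.
        "codecs": [
            {"name": "bytes", "configuration": {"endian": "little"}},
            {"name": "gzip", "configuration": {"level": 1}},
        ],
        "dimension_names": list(dimension_names),
        "attributes": {},
    }


def group_metadata():
    """Build an illustrative group metadata document."""
    return {"zarr_format": 3, "node_type": "group", "attributes": {}}


docs = {
    metadata_key(""): group_metadata(),
    metadata_key("foo"): group_metadata(),
    metadata_key("foo/bar"): array_metadata((100, 100), (10, 10), ("y", "x")),
}
for key, doc in docs.items():
    print(key, doc["node_type"], len(json.dumps(doc)))
```

Since all three keys end in `zarr.json`, a reader can only tell groups from arrays by opening the document and checking `node_type`.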
Open questions:
- fallback data types
### Actions
https://github.com/orgs/zarr-developers/projects/5/views/1
## Zarr refactor meeting
_Aug 16, 2023_
### Attendees
- Joe Hamman (Earthmover)
- Xarray and Zarr dev
- Sanket Verma (Zarr)
- Tom White (independent dev)
- SGKit and Cubed
- Max Jones (CarbonPlan)
- Data scientist
- Raphael Hagen (CarbonPlan)
- Data eng.
- Charles Stern (Columbia)
- Pangeo-forge
### Discussion
- Max: how do we view V3 extensions already in Zarr-Python?
- Charles: how does Zarr-Python register plugins?
- Zarrita (https://github.com/scalableminds/zarrita/) - reference implementation
- no baggage / tech debt of Zarr-python
- not production ready
- also has sharding
- Tom: Interop tests between implementations
### Timeline
Goal: by the end of the year, have a fully-functional implementation of V3 in Zarr Python
- Starting now: survey users to understand how a breaking change to the V3 implementation will affect them
- Next two weeks: Break [#1290](https://github.com/zarr-developers/zarr-python/issues/1290) into smaller chunks and set up project board
- September: start refactor efforts
- Oct-Dec: Integration and interop testing
### TODOs:
- add regular call to community calendar
- break out V3 implementation tasks into issues / project board (try to identify issues that can be picked up by others)
- publish a readout of this call