
Zarr-Python Developer Meeting Notes

formerly Zarr-Python Refactor Meeting Notes

April 4, 2025

  • Davis Bennett / @d-v-b
  • Joe Hamman / @jhamman
  • Ian Hunt-Isaak / @ianhi
  • Sanket Verma / @sanketverma1704
  • Tom Augspurger /
  • Ryan Abernathey /
  • Josh Moore / @joshmoore
  • Tom Nicholas / @TomNicholas

Agenda

  • release?
  • (davis) learnings from tensorstore
  • (davis) A rough idea for a zarr-format-aware store API

Minutes

  • (ryan a.) update on the state of the spec w.r.t. extension names
  • Data type extension names
    • datetime64
    • timedelta
    • string
  • The dtype plan:
    • open issue in extensions repo for each new datatype
    • get feedback on names / configuration
    • split davis' pr into pieces
      • registry framework
      • new dtypes

March 21, 2025

  • Davis Bennett / @d-v-b
  • Joe Hamman / @jhamman
  • Ian Hunt-Isaak / @ianhi
  • Sanket Verma / @sanketverma1704
  • Kyle Barron /

Minutes

Core topic for the day: https://github.com/zarr-developers/zarr-python/pull/2874

>>> from ml_dtypes import bfloat16
>>> import numpy as np
>>> np.zeros(4, dtype=bfloat16)
array([0, 0, 0, 0], dtype=bfloat16)

March 7, 2025

  • Davis Bennett / @d-v-b
  • Josh Moore / @joshmoore
  • Joe Hamman / @jhamman
  • Tom Nicholas / @TomNicholas
  • Deepak Cherian / @dcherian
  • Tom Augspurger
  • Ian Hunt-Isaak / @ianhi

Minutes

February 28, 2025

  • Davis Bennett / @d-v-b
  • Josh Moore / @joshmoore
  • Joe Hamman / @jhamman
  • Ian Hunt-Isaak / @ianhi

Minutes

  • Josh and Davis on extension dtype naming
  • Davis is working on extension dtypes in zarr-python
    • need to add support for parametric dtypes and extension dtypes
  • Akshay and co have been hacking on zarr/gpus this week

February 21, 2025

  • Davis Bennett / @d-v-b
  • Josh Moore / @joshmoore
  • Sanket Verma / @sanketverma1704
  • Ian Hunt-Isaak / @ianhi

Minutes

February 14, 2025

  • Deepak Cherian / @dcherian
  • Josh Moore / @joshmoore
  • Norman Rzepka / @normanrz
  • Davis Bennett / @d-v-b
  • Ian Hunt-Isaak / @ianhi

Agenda

  • release?

February 7, 2025

  • Deepak Cherian / @dcherian
  • Josh Moore / @joshmoore
  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Max Jones / @maxrjones
  • Sanket Verma / @sanketverma1704

Agenda

January 31, 2025

  • Deepak Cherian / @dcherian
  • Norman Rzepka / @normanrz
  • Josh Moore / @joshmoore
  • Akshay Subramaniam / @akshaysubr
  • Joe Hamman / @jhamman

Agenda

January 24, 2025

Agenda

January 10, 2025

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz
  • Josh Moore
  • Sanket Verma / @MSanKeys963

Agenda

  • Object data type - https://github.com/zarr-developers/zarr-python/issues/2617
  • Next steps after 3.0
    • variable chunking?
    • deprecating more api?
    • numcodecs thing that could use some thought / design work?
      • semi-circular dependency
        • what do we do with the v2 codecs
        • what do we do with the v3 things
      • future directions

January 3, 2025

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz
  • Deepak Cherian
  • Sanket Verma / @MSanKeys963

Agenda

  1. 3.0 release schedule update (Joe)
    a. 3.0.0-rc.1 went out yesterday
    b. we will publish and socialize the migration guide today
    c. we will make the full 3.0.0 release on Thursday Jan 9 at 10a ET
    -> during this time, we will focus on documentation and bug fixes (no major feature additions)

  2. release announcement
    a. Joe has written a blog post. The full zarr-dev team has comment access here: https://www.notion.so/earthmover/Zarr-Python-3-Release-Blogpost-14b492ee309f80d28af3ebfdeedf96f7
    b. sanket will prepare a social media thread

  3. shape of array after the addition of filters/compressors to top-level api

    • davis is opening an issue on this
  4. Norman will write a docs section on sharding

create_array API design notes

We are struggling to find a user-facing API for creating new arrays.
We have decided to create a new top-level API function (create_array)
to handle this but questions remain about how to provide a simple / intuitive
API that covers both v2 and v3 arrays, and sharded/non-sharded arrays in one API.
This short design note lays out the goals and options we are considering.

goals

  1. provide a single function to create v2 and v3 arrays
  2. make it easy to create sharded arrays
  3. provide a way to configure codecs (ala compressors and filters from v2)

non goals

  1. extending sharding to v2 arrays
  2. ?

current proposal

async def create_array(
    store: str | StoreLike,
    *,
    name: str | None = None,
    shape: ShapeLike,
    dtype: npt.DTypeLike,
    chunk_shape: ChunkCoords | Literal["auto"] = "auto",
    shard_shape: ChunkCoords | None = None,
    filters: FiltersParam = "auto",
    compression: CompressionParam = "auto",
    fill_value: Any | None = 0,
    order: MemoryOrder | None = "C",
    zarr_format: ZarrFormat | None = 3,
    attributes: dict[str, JSON] | None = None,
    chunk_key_encoding: ChunkKeyEncoding | ChunkKeyEncodingParams | None = None,
    dimension_names: Iterable[str] | None = None,
    storage_options: dict[str, Any] | None = None,
    overwrite: bool = False,
    config: ArrayConfig | ArrayConfigParams | None = None,
    data: npt.ArrayLike | None = None,
) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:

This function signature includes parameters that fall into the following categories:

Store parameters

  • store
  • storage_options

Runtime parameters

  • order
  • overwrite
  • config
  • data

V3-only parameters

  • dimension_names
  • shard_shape
  • chunk_key_encoding

Generic parameters

  • name
  • shape
  • dtype
  • chunk_shape
  • filters
  • compression
  • fill_value
  • attributes

Note 1: the focus of this document is on the parameters that control how the core array metadata is configured:

  • shape
  • dtype
  • chunk_shape
  • shard_shape
  • compression
  • filters

Note 2: it may be worth grouping the parameters in create_array using a
similar framework to above. This will help users navigate this fairly large
parameter space.
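
One thing such a grouping would make easy is rejecting V3-only parameters when a v2 array is requested. A minimal sketch, assuming hypothetical helpers `group_parameters` and `reject_v3_only_parameters` (neither exists in zarr-python); the group membership is copied from the lists above.

```python
# Parameter categories as listed in this design note.
PARAMETER_GROUPS: dict[str, frozenset[str]] = {
    "store": frozenset({"store", "storage_options"}),
    "runtime": frozenset({"order", "overwrite", "config", "data"}),
    "v3_only": frozenset({"dimension_names", "shard_shape", "chunk_key_encoding"}),
    "generic": frozenset(
        {"name", "shape", "dtype", "chunk_shape", "filters",
         "compression", "fill_value", "attributes"}
    ),
}


def group_parameters(kwargs: dict[str, object]) -> dict[str, dict[str, object]]:
    """Split keyword arguments into the categories above (hypothetical helper)."""
    grouped: dict[str, dict[str, object]] = {group: {} for group in PARAMETER_GROUPS}
    for key, value in kwargs.items():
        for group, names in PARAMETER_GROUPS.items():
            if key in names:
                grouped[group][key] = value
                break
        else:
            raise TypeError(f"unexpected parameter {key!r}")
    return grouped


def reject_v3_only_parameters(zarr_format: int, kwargs: dict[str, object]) -> None:
    """Raise if a V3-only parameter is set while creating a v2 array."""
    if zarr_format == 3:
        return
    used = sorted(
        key for key, value in group_parameters(kwargs)["v3_only"].items()
        if value is not None
    )
    if used:
        raise ValueError(f"{used} are only supported for zarr_format=3")


grouped = group_parameters({"store": "memory://", "shape": (10,), "shard_shape": (5,)})
print(grouped["v3_only"])
```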

Usage examples

  1. minimal example w/o sharding:
    this creates an array using default / inferred parameters for zarr_format, chunk_shape, etc.

    create_array(store=store, shape=(1000, 1000), dtype='f8')
    
  2. create sharded array
    this creates a sharded array where chunks are compressed with Zstd

    create_array(
        store=store,
        shape=(1000, 1000),
        shard_shape=(100, 100),
        chunk_shape=(10, 10),
        compressors=[ZstdCodec(level=3)],
        dtype='f8',
    )
    

questions

  1. what is the value/justification for providing arguments for filters and compressors instead of a single codecs parameter? Will we enforce that all filters are array->array codecs and all compressors are bytes->bytes?

    @d-v-b > this seems like the question we need to answer first. How will we des

    (d-v-b): filters and compressors map on to the two types of variadic codecs allowed in the v3 codecs attribute. This makes those parameters simple to parse: filters must resolve to tuple[ArrayArrayCodec, ...], and compressors must resolve to tuple[BytesBytesCodec, ...]. I think we could have just 1 codecs parameter, but it would need to take a form that allowed separately specifying the ArrayArray and BytesBytes codecs. Something like this:

    class CodecParams(TypedDict):
        filters: NotRequired[tuple[ArrayArrayCodec, ...]]
        compressors: NotRequired[tuple[BytesBytesCodec, ...]]
        array_serializer: NotRequired[ArrayBytesCodec]
    

    any missing keys would resolve to the defaults set in the config.

    But if codecs was tuple[Codec, ...] then users would be confused, and parsing it would be a headache.

    (NR): I like filters and compressors because imo they better convey what the codecs are used for instead of "array->array" or "bytes->bytes" codecs. We should enforce that only the right type is used for both kwargs.

  2. what is correct type for the filters / compressors argument? Options include:

    a. list of strings, e.g. ['gzip']
    b. list of dicts, e.g. [{"name": 'gzip', "configuration": {"level": 4}}]
    c. list of objects, e.g. [GZipCodec(level=4)]

    (b) and (c) seem like a reasonable choice.

    (d-v-b) IMO the only option here is something that unambiguously represents a codec instance, which rules out a. If we can make constructing the dict representation of the codecs ergonomic (i.e., autocomplete), then I think b is a pretty nice option, because users don't need to import a bunch of classes to use the create function. but we should also accept the complete codec class instances as well, so c.

    (NR): I like c best, but also fine with b. Agree that a is too ambiguous. I also cleaned that up for the default codecs https://github.com/zarr-developers/zarr-python/commit/5cb6dd8f62ad6ed5391a08535dc05ef9ac50bbad

  3. How do we want to parametrize the partitioning of the array? Right now the PR in question takes two parameters, chunk_shape: tuple[int, ...] | Literal["auto"] and shard_shape: tuple[int, ...] | Literal["auto"] | None. In the interest of backwards compatibility and brevity I would support the names chunks and shards. An alternative API would be to have a single parameter, e.g., chunking, that takes:

    • tuple[int, ...], (no sharding, regular chunking),
    • A dict like {"chunks": tuple[int, ...] | Literal["auto"], "shards": tuple[int, ...] | Literal["auto"]}
    • and maybe more complicated types? This basically pushes complexity into a single parameter, but it's convenient given that chunk shape and shard shape have to be defined together.

    (NR): I'd prefer chunks and shards. chunking: tuple[int, ...] | {"chunks": tuple[int, ...] | Literal["auto"], "shards": tuple[int, ...] | None | Literal["auto"]} would also be fine. Not really a fan of auto, though.

    (JH): would it help reduce scope to remove auto chunking / chunk/shard alignment from this first version?

    (DVB): I don't think the auto chunking / sharding adds a lot of complexity here, and I think it's a big win for usability to have some defaults that "just work" (whether the defaults in my pr actually "just work" is another question). As for auto, we need some way of expressing "pick chunks / shards automatically". Often we use None to mean "default", but if we are using shards=None to denote "no sharding", None can't mean "default" anymore, and we need to pick another value. I think auto is short and literate but I'd be up for alternatives.

December 20, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz
  • Deepak Cherian
  • Josh Moore / @joshmoore
  • Sanket Verma / @MSanKeys963
  • Akshay Subramaniam / @akshaysubr

Release topics:

  • top level api
  • documentation
    • add to migration guide
      • use create_foo and open_foo functions
      • create() and open() will be deprecated soon
    • new page in user guide on runtime config

December 13, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz
  • Deepak Cherian

Notes:

  • Davis worked on docstrings
  • Davis is working on concurrent array creation
  • Norman takes over write_empty_chunks
  • Joe will work on docs next week
  • Rename RemoteStore to FsspecStore
  • Array.__iter__ is slower compared to v2 because v2 loaded the entire array in memory upfront. not a release blocker
  • Next meeting next Wednesday 5pm CET, 8am PST

December 6, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz
  • Sanket Verma / @MSanKeys963
  • Deepak Cherian /

Notes:

  • Davis will be working on some of the blocked PRs
  • Joe will share a V3 blog post next week for review
  • Deepak has been working on tests

Discussion points for today:

  1. ✅ beta release -> https://github.com/zarr-developers/zarr-python/releases/tag/v3.0.0-beta.3
  2. Default codec -> https://github.com/zarr-developers/zarr-python/issues/2267
  3. What's off spec?
  • string codecs and dtypes
  • consolidated metadata

November 29, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz

Notes:

November 22, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Tom Augspurger / @TomAugspurger
  • Norman Rzepka / @normanrz
  • Sanket Verma / @MSanKeys963

Notes:

November 15, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Tom Augspurger / @TomAugspurger
  • Sanket Verma / @MSanKeys963
  • Theodore Visvikis /
  • Josh Moore / @joshmoore
  • Akshay Subramaniam / @akshaysubr
  • Norman Rzepka / @normanrz

Notes:

November 8, 2024

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Tom Augspurger / @TomAugspurger
  • Sanket Verma / @MSanKeys963

Notes:

Discussion points

November 1, 2024

  • Joe Hamman / @jhamman

October 25, 2024

  • Joe Hamman / @jhamman
  • Tom Augspurger / @TomAugspurger
  • Sanket Verma / @MSanKeys963

Notes:
  • Updates from Tom: working on info, size and tree properties
  • Joe: hopefully wrapping store mode refactor up today

October 18, 2024

  • Tom Augspurger / @TomAugspurger
  • Davis Bennett / @d-v-b
  • Ryan Abernathey / @rabernat
  • Joe Hamman / @jhamman
  • Norman Rzepka / @normanrz
  • Matt Iannucci / @mpiannucci
  • Akshay Subramaniam

Notes:

  • Updates
    • Joe: getting icechunk out, interested in speaking about release blockers
    • Davis: moving v3 tests, working on store api
    • Norman: working on the numcodecs, sharding bug, filter/codecs for v2 arrays
    • Ryan: worked on strings (out of spec), committed to dealing with spec problems (specifically on extensions)
    • Tom: Xarray compat (probably ready to merge)
    • Akshay: focusing on gpu compression codecs (nvcomp)
    • Matt: working on getting v3 working with kerchunk and virtualizarr

Topics:

  • store api
    • DB: two phases of IO: reading/writing chunks or initializing an array or a group
      • mode was added to the store
      • pathological situations where clear() is happening on reopen
    • NR: Be able to use StorePath in zarr.open, e.g. zarr.open(LocalStore("...", mode="a") / "testdata.zarr")
    • JH: https://github.com/zarr-developers/zarr-python/issues/2359
  • v2 filters/codecs
    • https://github.com/zarr-developers/zarr-python/issues/2325
    • NR: hasn't done much here yet
    • MI: same, just looked at the issue
      • codec naming
    • NR: numcodecs codec namespace will be just for v3 arrays
      • we may also want to split the kerchunk filters into compressor/filter categories
    • RA: what would give us more developer velocity here?
  • release blockers
  • extensions
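
NR's suggestion under "store api" (building a path with `/` on a store) could be sketched as below. `SketchStore` and `SketchStorePath` are hypothetical stand-ins written for this note, not zarr-python's `LocalStore` or `StorePath`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SketchStore:
    """Hypothetical store: just a root location and an open mode."""

    root: str
    mode: str = "r"

    def __truediv__(self, other: str) -> SketchStorePath:
        # store / "a" yields a path object rooted at this store.
        return SketchStorePath(self, other.strip("/"))


@dataclass(frozen=True)
class SketchStorePath:
    """Hypothetical (store, path) pair that can be extended with `/`."""

    store: SketchStore
    path: str = ""

    def __truediv__(self, other: str) -> SketchStorePath:
        parts = [part for part in (self.path, other.strip("/")) if part]
        return SketchStorePath(self.store, "/".join(parts))


location = SketchStore("/tmp/data", mode="a") / "testdata.zarr" / "group"
print(location.path)
```

An open function could then accept either a store or such a path object, which is the ergonomics the note is asking for.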

October 11, 2024

  • Tom Augspurger / @TomAugspurger
  • Davis Bennett / @d-v-b
  • Sanket Verma / @MSanKeys963
  • Josh Moore / @joshmoore
  • Ryan Abernathey / @rabernat
  • Joe Hamman / @jhamman
  • Norman Rzepka / @normanrz

Notes:

  • Summary
  • Big items
    • Strings (RA) - braindump
      • confusing how it (ever) used to work
      • np<2 had no notion of varlen str
        • fixlen (utf-8)
        • or object array
      • zarr allowed both as valid dtypes, e.g. u4 or object
      • now np has a varlen
      • question of dtype+codec
      • PRs merged
        • 236: new dtype "string" and "bytes". also zarrs implements them.
          • NR: v2 compatibility? pickling mode and another mode.
          • RA: two questions
          • RA: assumption is always vlen, and impl can use the appropriate in memory data structure (i.e., different in c)
            • in python, decoding to varlen if np>2 (breaking from zarr v2); decoding to object if np<2
          • NR: zarr array v2 a year ago will that work? believe so. tom working on the v2 side of it.
          • NR: for v3, bit of concern since there's no spec. a ZEP? (see stuck https://github.com/zarr-developers/zeps/pull/47)
            • RA: two components need to be addressed in the spec (dtype+codec; leaky abstraction?)
            • integer array interpreted as bytes?! unsure.
            • RA: far from being able to make changes to the spec.
            • JM: good to confer with John Kirkham
            • JH: pickled dtypes are not interoperable with other languages
          • NR: how do we want to deal with experiments?
            • can't write specs without implementation and python is a great place to do that
            • but maintain a few other implementations. not useful if the main implementation blazes ahead and everyone must follow
            • suggest we be cautious about that.
            • various ways to handle that
              • previously environment variables
              • issue a warning on non-standard dtypes
            • some discussion then it'll be fine
            • RA: see https://github.com/zarr-developers/zarr-specs/pull/312#issuecomment-2407444223
            • SV: Isaac stepped down from https://github.com/zarr-developers/zeps/pull/47 (no one to steer)
            • RA: different opinion just everything through extensions.
              • trying to get to parity. can we do the same things.
              • still struggling to evolve it. we decided that it's unversioned, so it's unchangeable.
              • would need to move to v4.
              • immutability was to be balanced by extensions.
                • haven't managed to develop a robust ecosystem/process for extensions. ZEPs have failed. nothing adopted. we can't agree
              • extensions and then let's go make them
              • see TA's https://github.com/zarr-developers/zarr-specs/issues/316
              • let them be free
              • practical way forward
            • DB: tried to address that in https://github.com/zarr-developers/zarr-specs/pull/312
              • are we willing to change the parts of the spec that are blatantly contradictory
              • if it's immutable, so be it.
              • RA: clarifications are ok
            • NR: state of zarr-specs is terrible. ZEPs are a symptom. people are fatigued. process broken.
              • spec core team is a good path.
              • will have the same issue treating everything as an exception. names need to be coordinated. two string dtypes?!
              • who controls the namespace. need a process. even a repository, pypi style.
            • JM: feedback on zarr versioning from other implementations
            • RA: namespacing
              • extensions need to be namespaced. URI ok. absolute. resolves to the document.
              • need to figure out what are the extensions and what's their scope
              • 2 different extensions (different URIs) that define the same codec. that's ok.
              • make whatever changes are needed to have that process, socialize it, etc. shutdown ZEPs.
            • NR: that's not how they were meant to work in v3. extension points. lets you create, e.g., codecs. (nothing wraps it)
              • RA's is a new concept. might work. there might be issues with composability.
              • comfortable in zarr-python if you need to actively opt in.
            • RA: ask everyone if the original intention of zarr-spec work in practice.
              • haven't been able to make forward progress. incorporate learning
              • face reality of how things work in the real world and adapt.
              • look at others where it's working
            • JH: laser focused on getting zarr-python out
              • can set config (don't need environment variable). go off-road
              • consolidated metadata, few codecs, etc. to add (not much more coming)
            • SV: lack of implementations was definitely an issue. Tried to work that into https://github.com/zarr-developers/zeps/pull/59
            • JH: on spec process, v3 is in accepted not final. missed that date by a year.
              • reasonable to say that changes are going to have to be made
              • change the status of the spec for a while?
            • SV: previous conversation about when to set it to final
      • Technical things
        • Beta release (JH): ok when smashing merge on string PR?
          • NR: v2 filters in beta or after? JH: weird kerchunk
            • RA: to make xarray work we had to special case everything ("working" isn't accurate)
            • convenience for our users
            • NR: backwards compatibility. (lack of) v2 spec are out in the wild that we have to define indefinitely. (see extensions above)
              • 2 camps: people that think it through for a long time and the others that want to forge ahead. a tension that we have to work out.
            • JH: filters just land in the array metadata. never seen that in the wild.
        • Endianness (DB)
          • https://github.com/zarr-developers/zarr-python/issues/2324
          • no longer part of the dtype. if someone creates an array with a specific endianness, the zarr array doesn't report it (have to check the codec)
          • when creating a new array, the endianness gets dropped
          • don't care? in memory representation is decoupled from how it is stored.
          • NR: yes, that's what we did in zarrita. you can control how it lands out, but not how it is read into memory.
          • RA: prevent memmapping data? NR: that's what we have the metadata for. RA: can imagine an impl without codecs that wants memmap to access the data (though zarr-python doesn't work that way) NR: zarr-spec requires a bytes codec which defines the endianness.
          • JH: if you pass in a big-endian array (e.g., zarr.save(np.array)), it should round-trip so you get a big-endian array back out the other side
          • NR: zarr.save() would need to handle.
          • JH: yes, interpret at the top-level and do something smart about the bytes codec
          • DB: if we keep it, what was the point of parameterizing it in the codec.
          • NR: compatibility, you need to store it somewhere
          • DB: what comes out is undefined.
          • RA: use platform's preferred endianness
          • DB: then users won't round-trip. won't come out in the chosen endianness
          • NR: struggled with this, but it is an implementation detail (incl. exposing it to the user). matters only for some performance issues.
          • DB: what is the dtype of a zarr array relative to what the user puts in.
          • NR: zarr only cares about how it looks on disk (not in-memory)
          • JM: zarr_array.dtype calculated from zarr data_type and checking the codec ("dynamic dtype")
          • NR: need to check the read path and what it is doing
          • RA: similar to the strings that it's coupling dtype and codec
          • DB: in v4 would like to see codec & dtype together
            • also want to put shape and chunk together (JM: plus shard)

October 4, 2024

  • Tom Augspurger / @TomAugspurger
  • Davis Bennett / @d-v-b
  • Sanket Verma / @MSanKeys963
  • Josh Moore / @joshmoore
  • Ryan Abernathey / @rabernat

Notes:

September 20, 2024

  • Joe Hamman / @jhamman
  • Tom Augspurger / @TomAugspurger
  • Davis Bennett / @d-v-b
  • Sanket Verma / @MSanKeys963
  • Akshay Subramaniam

Notes:

September 13, 2024

Attendees

  • Joe Hamman / @jhamman
  • Tom Augspurger / @TomAugspurger
  • Davis Bennett / @d-v-b
  • Sanket Verma / @MSanKeys963

Notes

  • Updates

    • Davis is happy about recent improvements to
  • on people's minds:

    • Davis is going to look at the synchronizer api
    • Sanket: Doc sprint in September? Dates? How many days? Async?
      • Let's try for Sept. 30-Oct 1
      • Yes, Async with a kickoff on Sept. 30
    • Tom: consolidated metadata is getting pretty close
      • reworking metadata layout
      • first iteration will support reading/writing v3 consolidated metadata and reading v2
      • should be possible to write new v2 metadata as well
      • will need to do more thinking on future proofing for metadata schemas
      • also thinking about the maximum depth of consolidation
    • storage transformers issue: https://github.com/zarr-developers/zarr-python/issues/2178
      • may need to update the spec language around optional metadata fields
import zarr

kwargs = zarr.codecs.make_sharding_pipeline(
    read_chunks={...},
    write_chunks={}, 
    compressor=Gzip(),
)

zarr.create_array(shape=(...), **kwargs)

September 6, 2024

Attendees

  • Joe Hamman / @jhamman
  • Tom Augspurger / @TomAugspurger
  • Norman Rzepka / @normanrz
  • Josh Moore / @joshmoore
  • Davis Bennett / @d-v-b
  • Akshay Subramaniam

Notes

  • https://github.com/orgs/zarr-developers/projects/5/views/2 big lifts?

    • d-v-b: shape for sharding? does it have to change for 3.0.0?
      • shape is currently dependent on which codec
      • should encourage thinking about it as a new interpretation of chunking
      • JH: define some preset pipelines? NR: similarly. doesn't have to change the array. i.e., top-level API.
      • DB: people want easy access to the configuration for looping
      • JM: .writing_chunks to go with .reading_chunks. (would dask also adopt?)
      • DB: agreed, might be the right level of detail for users
        • would also help to guard against other implementations (transformers, etc.)
      • NR: also produce an ergonomic way of creating them
      • DB: you'd also want to pass as an argument
      • NR: ok, and doesn't have to be set forever.
      • JH: xarray/dask zarr-readers didn't need the attribute. (just from_array needs as an argument)
      • JM: default? which one wouldn't fail.
      • JH: default today is write chunk? NR: yes. but can be too big.
    • consolidated metadata
      • JH: reading/writing v2 metadata as a blocker
      • TA: writing, too? kinda yeah.
      • TA: status update
        • pretty straight-forward
        • issues with del item: do we synchronize out to the consolidated? (i.e. doing more IO)
        • relationship between group & store objects is just "call save metadata"?
      • JM: writing down the v2 schema in the v3 (since no v2 process)
      • JH: just do it in the v2 schema. people are using it.
    • docs (NR)
      • sprint? still happening.
      • issue raised about the formatting. not using the left pane. (sad & empty)
    • synchronizer API (JH)
      • issues ("it doesn't work")
      • hot potato: v2 has one but without distributed version
      • DB: how does it plug in in V2?
      • JH: mucked up the v2 code _set_item_nosync
      • DB: property of an array (i.e. high-up in the API)
      • JH: could go further down. store level?
      • DB: every store has a locking class
      • JH: zip store requires thread/asyncio locks (not-merged)
      • NR: not using synchronizers
      • JH: frequent bug reports in xarray
      • DB: does it have to live at arrays and groups because stores didn't know the key names
        • does it tie in to having the names knowledge in the store?
        • JH: possibly a high-level and a low-level store API
        • NR: higher-order store so that you can compose them
          • zip store always has it
    • mutable mapping
      • JH: use memory store to adapt anything (no async stuff though)
    • GPU (AS)
      • merged
      • testing with the codec interface. things look to be working.
      • JH: v3 branch works. other big lifts for 3.0.0?
      • batched store API. few minor issues. (rust under the hood?)
      • JH: definitely lots of small chunk calls. add to store API (bunch of keys)
        • DB: allow fetches to run out of order then it changes the API
        • gather runs sequentially
        • NR: async iterator? or as_completed
        • JH: streaming approach is the most powerful but also the most complicated
        • JM: add in delete and then it's approaching transactions
        • DB: no lazy execution model right now. leverage futures?
        • AS: gpu batch in kvikio does that, collects all the futures and then waits
        • DB: path is open for that. (if we leave the mutable mapping API) not too painful.
        • AS: also async events needs separate codec pipeline. affects more.
        • DB: (dreaming) if txn as a context manager, then it could take a region
        • NR: (image not available)
    • DB: docs as the most important
      • NR: agreed
      • DB: pay attention to what sucks.
      • NR: migration guide.
    • CLI tools (convert v2 to v3) - DB
      • NR: not difficult for small arrays
      • TA: zarr v3 metadata refer to v2 data?
      • NR: most of the time. only if the codec is compatible. zarrita had a function for that. could do that.
  • time-permitting (Josh)

    • impl tests, netcdf-c, bluesky/tiled
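
The ordering trade-off in the batched store API discussion above (results in request order versus handling each fetch as it finishes, as in NR's `as_completed` suggestion) can be sketched with plain asyncio. `fetch` and its `delay` argument are stand-ins for a real store read and are not zarr-python API.

```python
import asyncio


async def fetch(key: str, delay: float) -> tuple[str, bytes]:
    """Stand-in for a store read; `delay` simulates request latency."""
    await asyncio.sleep(delay)
    return key, key.encode()


async def fetch_in_order(requests: list[tuple[str, float]]) -> list[tuple[str, bytes]]:
    """Run all fetches concurrently; results come back in request order."""
    return list(await asyncio.gather(*(fetch(key, delay) for key, delay in requests)))


async def fetch_as_completed(requests: list[tuple[str, float]]) -> list[tuple[str, bytes]]:
    """Run all fetches concurrently; results are handled as each one finishes."""
    tasks = [asyncio.create_task(fetch(key, delay)) for key, delay in requests]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results


requests = [("c/0", 0.15), ("c/1", 0.05), ("c/2", 0.10)]
in_order = asyncio.run(fetch_in_order(requests))
by_completion = asyncio.run(fetch_as_completed(requests))
print([key for key, _ in in_order])
print([key for key, _ in by_completion])
```

The second form lets decoding start on early chunks while slow requests are still in flight, at the cost of callers no longer being able to rely on result order.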

August 30, 2024

Attendees

  • Joe Hamman / @jhamman
  • Sanket Verma / @MSanKeys963
  • Akshay Subramaniam /
  • Davis Bennett / @d-v-b
  • Josh Moore /
  • Tom Augspurger / @TomAugspurger

Agenda

August 23, 2024

Attendees

  • Joe Hamman / @jhamman
  • Sanket Verma / @MSanKeys963
  • Josh Moore / @joshmoore
  • Norman Rzepka
  • Akshay Subramaniam
  • Gustavo Hidalgo

Agenda

  • https://github.com/zarr-developers/zarr-python/pull/2102

    • NR: important to have a written document
      • OME is also interested in support for reading v2 data
      • may be good to remove the v2 module asap
    • JM: crux is supporting v2 and v3 data
      • does it make sense to create a zarr3 library
    • NR: not a fan of the zarrv3
      • discoverability and aesthetics are not great
      • pitch weekly alpha releases
      • need to do the work and get the release out
    • JM: when do we go from alpha to beta to full release
    • SV: also address these questions: https://github.com/zarr-developers/zarr-python/discussions/2093#discussioncomment-10429985
  • alpha release frequency

    • proposal: weekly release on Monday
  • consolidated metadata

  • RemoteStore

    • PR1956
    • blocking: writing is completely broken because of the exists method
    • doing naive synchronous user thing
      • open_array(s3://...)
      • using sync in user code
      • accessing fsspec directly
    • store = await MyStore.open('s3://foo')
    • store = sync(MyStore.open('s3://foo'))
    • store = MyStore.open_sync('s3://foo', loop=...)
    • sync_store = SyncWrapper(MyStore, 's3://foo')
      • sync_store.set(filename, bytes)
  • GPU array progress

    • squashing bugs around merge conflicts
    • GPU CI is working now, need to sort out limiting the size of the matrix and installing cupy

August 9, 2024

Attendees

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Sanket Verma / @MSanKeys963
  • Eschal Najmi

Agenda

https://github.com/zarr-developers/zarr-python/issues/2008

July 26, 2024

Attendees

  • Joe Hamman / @jhamman
  • Norman Rzepka / @normanrz
  • Davis Bennett / @d-v-b
  • Hannes Spitz / @brokkoli71
  • Gustavo Hidalgo / @ghidalgo3
  • Sanket Verma /

Agenda

Notes:

  • sharding is complicating the .chunks attribute on Arrays
    • ideas
    Array.chunks -> tuple[int]  # raise if variable chunks or sharding
    Array.chunk_grid.read_chunks()  # or inner_chunks or chunks
    Array.chunk_grid.write_chunks() # or outer_chunks or shards
    
  • sharding configuration is pretty complicated today
    • template module
    • zarr.open_array remains the top level API where we can do user-friendly things
    • the Array.open method remains a low level entrypoint
  • async codec api
    • sharding is the only codec that needs to be async / do IO
    • NR: today we get scheduling in a threadpool
    • assumption that codecs release the GIL
    • need to do performance testing
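
A minimal sketch of the threadpool scheduling mentioned in the async codec notes above, assuming a synchronous codec that is dispatched with `asyncio.to_thread` (Python 3.9+). `ZlibCodec` and `decode_chunks` are hypothetical names for illustration; how much the decodes actually overlap depends on whether the underlying library releases the GIL, which is what the performance testing would need to measure.

```python
import asyncio
import zlib


class ZlibCodec:
    """Hypothetical synchronous bytes-to-bytes codec built on zlib."""

    def encode(self, chunk: bytes) -> bytes:
        return zlib.compress(chunk)

    def decode(self, chunk: bytes) -> bytes:
        return zlib.decompress(chunk)


async def decode_chunks(codec, encoded_chunks):
    # Each blocking decode call runs in the default thread pool.
    tasks = [asyncio.to_thread(codec.decode, c) for c in encoded_chunks]
    # gather preserves the order of the inputs.
    return await asyncio.gather(*tasks)


codec = ZlibCodec()
raw = [bytes([i]) * 1024 for i in range(8)]
encoded = [codec.encode(c) for c in raw]
decoded = asyncio.run(decode_chunks(codec, encoded))
```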

Deprecate in 2.18.3

  • h5py compat methods

TODOs from this meeting

  • performance test suite
    • dask + threaded scheduler
  • GPU runner billing

July 12, 2024

Attendees

  • Ryan Abernathey / @rabernat
  • Norman Rzepka / @normanrz
  • Davis Bennett / @d-v-b
  • Akshay Subramaniam / @akshaysubr

Agenda

June 6, 2024

Attendees

  • Joe Hamman / @jhamman
  • Juan Nunez-Iglesias / @jni

May 30, 2024

Attendees

  • Joe Hamman / @jhamman
  • Norman Rzepka / @normanrz
  • Davis Bennett / @d-v-b
  • Akshay Subramaniam
  • Max Jones / @maxrjones

Agenda

  • Upcoming alpha release

Quick topics:

  • Norman, do we have an accessible api for extracting a shard index?
  • chunkstore API
    • joe: ask Martin
  • MemoryStore has Buffer objects in it :(
  • out kwarg

Notes

  • Joe
    • Store open mode is in, but incomplete
    • Top level API is functional but needs a bunch of work
    • Working on sharding codec, using fsspec branch + top level API branch
      • slow for now
  • Norman
    • working on indexing, tests are working
      • stuck on typing
      • ready early next week
  • Davis
    • Store tests for Martin to get fsspec
    • Hierarchy api
    • codec pipeline API
    • typed dicts for metadata objects
  • Akshay
    • On vacation, keeping track of Buffer/Indexing PRs
  • Max
    • No updates, can contribute to the v3.0.0 docs task (starting with dev docs)

May 23, 2024

Attendees

  • Joe Hamman / @jhamman
  • John Kirkham / @jakirkham
  • Juan Nunez-Iglesias / @jni
  • Sanket Verma / @MSanKeys963

Agenda

May 17, 2024

Attendees

  • Joe Hamman / @jhamman
  • Norman Rzepka / @normanrz
  • Davis Bennett / @d-v-b
  • Max Jones / @maxrjones

Agenda

Notes

May 8, 2024

Attendees

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Norman Rzepka / @normanrz
  • Sanket Verma
  • Alden Keefe Sampson (AKS)
  • Akshay Subramaniam
  • John Kirkham

Notes:

  • Progress update (project board)
  • numcodecs codecs: numcodecs#524
  • zstd in numcodecs needs a review: numcodecs#519
  • HybridCodecPipeline (interleaved with configurable batch size) needs a review: #1670
  • Runtime configuration? #1772
  • Batched store discussion
  • Store metadata methods: zarr-python#1851
  • Initial NDBuffer implementation: zarr-python#1826
  • Proposed new meeting times
    • week 1: Friday 7a PT
    • week 2: Thursday 3p PT

Notes:
Major updates

  • JH:
    • implicit groups are gone :)
  • NR: codecs are getting into a good place
    • new rev on batched pipeline
  • DB:
    • out last week, getting back into it
    • group tests
    • need a decision about removing v2 code paths
      • should go now
  • AKS:
    • open PR generalizing array types

TODOs:

  • add tests for v2 and v3 arrays

April 24, 2024

Attendees

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Jack Kelly / @jackkelly
  • Ryan Abernathey / @rabernat
  • Max Jones / @maxrjones
  • Sanket Verma
  • Akshay Subramaniam
  • Norman Rzepka / @normanrz

Notes:

April 22, 2024

Attendees

  • Joe Hamman / @jhamman
  • Norman Rzepka / @normanrz
  • Josh Moore / @joshmoore
  • Sanket Verma
  • John Kirkham
  • Martin Durant
  • Ryan Abernathey
  • Davis Bennett
  • Akshay Subramaniam

Excused: Juan Nuñez-Iglesias

Goals

  1. Make sure we're all on the same page with what has been going on with the project
  2. Organize around v3 efforts

Agenda

  • Recent efforts
    • Updates to core team (JH)
      • Moved some to emeritus status, etc.
      • We should work to get more core devs. (Lots to do)
      • RA: candidates? JH: let's get people making commits for a while.
    • meeting (JM)
      • propose to make it the regular meeting but find a time where everyone can join.
      • all aboard
    • Zarr-Python 3 update (JH)
      • Design doc
      • Progress update (project board + notes below 👇)
      • Ambitious release schedule (#1777)
        • loose plan: May/alpha, June/release
        • roughly following the Pydantic 2 model (breaking API changes)
        • JM: all for getting pre-releases out
        • NR: need to get rid of the v3 folder (messes things up)
      • support/v2 branch may be useful (JM)
        • JM: we probably should be pushier
        • JH: sure, just our 100% confidence may lag
        • NR: define a window, e.g., by the end of the year everyone should move.
      • JH: pinned issue with the release plan? Yes.
      • v3 update
        • DB:
          • v3 metadata is done, i.e., can create spec compliant v3 arrays
          • working on groups that would work as expected, e.g., listing children (one of 2 big PRs). nearly done. required getting into the async implementation which is one of the biggest changes for the storage layer. Also means that we're not able to just paste in old code.
          • high-level convenience APIs are not there
          • only the nucleus of a testing strategy. using a different strategy from v2. bringing in what we can.
        • NR:
          • codecs are pretty advanced. async
            • MD: that means thread pools? NR: Yes. They can choose how they do that.
            • JH: core part of that is in the v3 release that will spend time on async/threading/scheduling. Lots of new behavior that we're going to learn about. But now we have an API that can be tuned.
          • arrays are missing indexing
          • documentation is largely open. (pushed to post alpha for the launch)
        • JH:
          • on our way towards having 100% type hints
          • abstract base classes for the Store and Codecs that allow people to implement their own (outside of zarr-python) with an entrypoint system. perhaps something for chunk grids as well
          • store is no longer a mutable mapping but a custom class. list methods are async generators, etc. all synchronization happens upstream. Synchronize wrapper of Arrays and Groups, but wait until you're at the top-level API for sync.
          • build is cleaner. using hatch.
          • need to discuss numcodecs. currently isn't taking part in the protocol system. what does it mean to Zarr going forward?
        • Discussion
          • JK: documented path for upgrading? JH: no, there's an open ticket. Need docs on upgrading code but also migrating data, e.g., metadata only changes. (Alistair did this for v1->v2).
          • MD: need to discuss what kerchunk is going to do. it will take some pretty deep working. the style of where the metadata is has changed (along with the codecs). JH: yeah, no filters, all one pipeline. i.e., just the metadata. MD: more involved (i.e. goes deeper) into Zarr than other things. RA: zarr data model that is independent of the spec could be super useful. DB: don't think there is an overlap of v2 and v3 arrays. i.e., it would be a UNION. you need to map between the names and the types. don't get that for free just with the hierarchies. RA: don't do it once off, but build something re-usable and then serialize those. JH: clearly separated metadata from the classes. can turn one dataclass into another dataclass. (work to be done)
          • DB: spec allows v2/v3 things to be mixed, so a coroutine of some form may need to be opinionated about what it prioritizes. JH: good point, since you might have to look for 4 things, or prioritize one or the other. we should just be clear and then let people suffer the consequences. RA: have some shim functions similar to the current open() which keeps things working. JH: zarr.open has a version flag. None could mean do both.
          • implicit groups (DB) basically anything is a group even if it has garbage in it. NR: haven't seen anyone who is against removing it.
          • DB: if so, also make mixing versions disallowed? JM: can we allow a complete mixing? JH: don't want to be polymorphic about children. DB: can't forbid having .zattrs in a v3 group/array. Agreed.
          • JM: if need be, can try to organize a ZIC meeting with SV.
        • Numcodecs (NR)
          • Opened PR today if someone wants to review that, but more generally where are we with numcodecs
          • have specialized codec classes in v3 branch. arrays-to-bytes, etc. etc. Different classes from in numcodecs. Do use it under the hood.
          • for v2 support in the v3 branch we use the code unchanged. we ask numcodecs to do it for us. we could pull that into the v3 arrays which would give us support for batching, async, etc. we will likely need some glue code. (that's with minimal effort). Do we move numcodecs in a direction such that it uses v3 abstract classes.
          • DB: I like the idea of there being a repo on github for people to go to. numcodecs should exist where we have these compression routines at a low level. It should be there to support zarr.
          • NR: closed list. How do we handle that?
            • DB: spec says that it just needs a URL.
            • NR: what if two implement blosc2 differently.
            • DB: people are going to do what they want.
            • NR: make it more difficult? or use the github URL to prefix?
            • JH: raise a warning that users can turn off if they aren't using an approved list of codecs. for experimentation, we definitely want to make it possible (and easy). That's what zarr-python is known for.
            • DB: what's the advantage of enumerating a list of codecs?
            • NR: when creating an implementation, you can just follow the list.
            • JM: allows us to say, "this implementation is not complete". plus also [image not available] for a schema where possible.
            • NR: ok to have optional codecs
          • JH: open issue with tifffile of a missing default flag (size parameter?)
          • RA: a way out of this is to outsource as much as we can with blosc, has an ambition of being a meta compressor
            • AS: blosc as the main library is that it also has sharding etc. under the hood. better IMO to just expose the compression stream formats. blosc is less flexible than numcodecs currently. (more difficult to add new compressors or options)
          • AS: gzip links to RFC not an implementation, i.e., a specific stream format. this is also an issue with numcodecs lz4. would be good to have these written down and link to a spec.
          • Continue conversation on https://github.com/zarr-developers/numcodecs/issues/502
    • NASA Funding Opportunity (JH)
      • Planning to submit a LOI next week
      • targeting Zarr-Python, v3 feature development, and support at least 3 years
  • TODOs
    • find a new time for the bi-weekly meeting. becomes zarr-python dev meeting but open invitation to anyone who would want to join.
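
To make the discussion above about mixed v2/v3 hierarchies and the `zarr.open` version flag concrete, here is a rough sketch. The metadata key names (`zarr.json` for v3; `.zarray`, `.zgroup` and `.zattrs` for v2) come from the specs, but `detect_format`, the plain-dict store, and the choice to prefer v3 when the flag is None are assumptions for illustration only.

```python
V3_KEY = "zarr.json"
V2_KEYS = (".zarray", ".zgroup")  # .zattrs alone does not identify a node


def detect_format(store, path="", zarr_format=None):
    """Return 2 or 3 depending on which metadata documents exist at `path`.

    `store` is any mapping of keys to bytes (a plain dict in this sketch).
    """
    prefix = f"{path}/" if path else ""
    has_v3 = (prefix + V3_KEY) in store
    has_v2 = any((prefix + key) in store for key in V2_KEYS)

    if zarr_format == 3:
        found = 3 if has_v3 else None
    elif zarr_format == 2:
        found = 2 if has_v2 else None
    elif zarr_format is None:
        # Assumption: look for both and prefer v3 when both are present.
        found = 3 if has_v3 else (2 if has_v2 else None)
    else:
        raise ValueError(f"unsupported zarr_format: {zarr_format!r}")

    if found is None:
        raise FileNotFoundError(f"no matching metadata found at {path!r}")
    return found


mixed = {"zarr.json": b"{}", ".zarray": b"{}", ".zattrs": b"{}"}
```

With `mixed`, the default call is opinionated about which version wins, while an explicit `zarr_format=2` still reaches the v2 documents.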

Notes

April 10, 2024

  • SV: Zarr-Python B&P meeting discontinued - rename refactor meeting to 'Zarr-Python meeting'?
  • JH: updates since last meeting

Active work

Apr 5, 2024

  • Davis Bennett / @d-v-b (DB)
  • Norman Rzepka / @normanrz (NR)
  • Joe Hamman / @jhamman
  • Deepak Cherian / @dcherian

Todo list

Note: the topics listed below have been converted to issues and placed on the v3 project board: https://github.com/orgs/zarr-developers/projects/5

p0 - must happen now
p1 - must happen before alpha release (target first week of May)
p2 - must happen before 3.0 release (target early June)
p3 - nice to have, can happen after 3.0 release

March 27, 2024

  • Davis Bennett (DB)
  • Alden Keefe Sampson (AKS)
  • Norman Rzepka / @normanrz (NR)
  • Sanket Verma / @MSanKeys963 (SV)
  • Akshay Subramaniam / @akshaysubr (AS)
  • Max Jones / @maxrjones (MJ)
  • Raphael Hagen / @norlandrhagen (RH)

Meeting notes:

  • Sanket: Bi-weekly meeting ends on May 1st, 2024 - shall we continue after that?

    • Yes! Schedule it until June end! - DONE
  • DB: Fleshing out the group API in v3 https://github.com/zarr-developers/zarr-python/pull/1726

  • NR: We need to find a common understanding of what we still need to work on for beta release. NR will create a tracking issue.

  • Akshay: Generalized array support

    • Where to create issue to track this? zarr-python or zarr-specs? Any direction for structuring the issue and proposal?
    • Create a native zarr NDArray class for typing and to interface with existing protocols. This includes:
      • Buffer protocol
      • __array_interface__
      • __cuda_array_interface__
      • DLPack
      • Raw pointers
    namespace zarr
    {

    namespace py = pybind11;
    using namespace py::literals;

    class Array
    {
    public:
        Array(zarrArrayInfo_t* array_info, int device_id);
        Array(py::object o, intptr_t cuda_stream = 0);

        py::dict array_interface() const;
        py::dict cuda_interface() const;

        py::tuple shape() const;
        py::tuple strides() const; // Strides of axes in bytes
        py::object dtype() const;

        zarrArrayBufferKind_t getBufferKind() const; // Device or Host buffer
        py::capsule dlpack(py::object stream) const; // Export to DLPack

        py::object cpu(); // Move array to CPU
        py::object cuda(bool synchronize, int device_id) const; // Move array to GPU

        const zarrArrayInfo_t& getArrayInfo() const
        {
            return array_info_;
        }
        static void exportToPython(py::module& m);
    };

    } // namespace zarr
    
    • Interoperability with Numpy
    ​​​​ascending = np.arange(0, 4096, dtype=np.int32)
    ​​​​zarray_h = zarr.ndarray.as_array(ascending)
    
    ​​​​print(ascending.__array_interface__)
    ​​​​print(zarray_h.__array_interface__)
    ​​​​print(zarray_h.__cuda_array_interface__)
    ​​​​print(zarray_h.buffer_size)
    ​​​​print(zarray_h.buffer_kind)
    ​​​​print(zarray_h.ndim)
    ​​​​print(zarray_h.dtype)
    
    • Interoperability with Cupy
    ​​​​data_gpu = cp.array(ascending)
    ​​​​zarray_d = zarr.ndarray.as_array(data_gpu)
    ​​​​print(data_gpu.__cuda_array_interface__)
    ​​​​print(zarray_d.__cuda_array_interface__)
    ​​​​print(zarray_d.buffer_kind)
    ​​​​print(zarray_d.ndim)
    ​​​​print(zarray_d.dtype)
    
    • Convert CPU to GPU
    ​​​​zarray_d_cnv = zarray_h.cuda()
    ​​​​print(zarray_d_cnv.__cuda_array_interface__)
    
    • Convert GPU to CPU
    ​​​​zarray_h_cnv = zarray_d.cpu()
    ​​​​print(zarray_h_cnv.__array_interface__)
    
    • Anything that supports the buffer protocol
    ​​​​with open('file.txt', "rb") as f:
    ​​​​    text = f.read()
    
    ​​​​zarray_txt_h = zarr.ndarray.as_array(text)
    ​​​​print (zarray_txt_h.__array_interface__)
    
    ​​​​zarray_txt_d = zarray_txt_h.cuda()
    ​​​​print(zarray_txt_d.__cuda_array_interface__)
    
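
As background for the buffer-protocol item above, a stdlib-only sketch of the information a wrapper such as the proposed `as_array` would need to read from a host buffer. `describe_buffer` is a hypothetical helper; it uses only `memoryview`, which exposes the buffer protocol from Python without copying the data.

```python
import array


def describe_buffer(obj):
    """Summarise any buffer-protocol object without copying its data."""
    view = memoryview(obj)
    return {
        "shape": view.shape,
        "strides": view.strides,  # strides of axes in bytes
        "format": view.format,    # struct-style element type code
        "itemsize": view.itemsize,
        "nbytes": view.nbytes,
        "readonly": view.readonly,
    }


text = b"hello zarr"                   # immutable bytes, e.g. from f.read()
ints = array.array("i", range(4))      # mutable typed buffer
info_text = describe_buffer(text)
info_ints = describe_buffer(ints)
```

The same fields (shape, strides, type code, writability) are what `__array_interface__` and `__cuda_array_interface__` report for NumPy-style and GPU arrays.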

March 13, 2024

  • Joe Hamman / @jhamman (JH)
  • Alden Keefe Sampson (AKS)
  • Norman Rzepka / @normanrz (NR)
  • Sanket Verma / @MSanKeys963 (SV)
  • Akshay Subramaniam / @akshaysubr (AS)
  • Max Jones / @maxrjones (MJ)
  • Agriya Khetarpal / @agriyakhetarpal (AK)

Meeting notes:

  • Alden / top level API
    • https://github.com/zarr-developers/zarr-python/issues/1598#issuecomment-1994729420
    • In the v3 library, are the top level methods
      • the primary way users interact with the library? or
      • the smoothest on ramp for v2 library users into v3? (and Array.xxx and Group.xxx become primary)
      • something else?
      • Notes:
        • Joe's thought: We want to provide a pretty similar interface, help massage or raise errors when args not compatible. We can start deprecating and changing behavior
        • Norman: likes the Array.xxx and Group.xxx entrypoints, but we need to have these top level entry points. promote the Array and Group classmethods in the docs
        • Joe: don't love polymorphism of .open, but it exists
    • The kwargs to any method that can create an array are currently v2 specific (filters vs codecs, etc), plus there are a number of performance/behavior modifying args (cache_[metadata|attrs], partial_decompress, write_empty_chunks, dim separator). Do we
      • try not to change the api at all and try to translate into spec V3 land
      • make the kwargs actually match those of the Array.xxx v3 library methods, but also take in **v2_kwargs and translate where possible, checking for conflicts with v3 kwargs if provided
      • make the top level methods kwargs align with those of Array.create, etc
        • Norman: for open: make it compatible, you get back your method, for create could make
          • Runtime parameters: many of these we didn't know existed, can debate case by case
          • Joe: if a runtime param is provided that v3 doesn't use, raise an error
    • Currently zarr.open and similar will return an array if it exists, even if the existing array's dtype, codecs, etc don't match those provided.
      • Keep this?
      • Joe: think we should raise an error, wide agreement on this
  • Norman / Batched and interleaved codec pipelines
    • https://github.com/zarr-developers/zarr-python/pull/1670
    • Hybrid interleaved batched codec pipeline
    • How to set runtime configuration?
    • Batched store API
    • BatchedCodecPipeline as abstract class that can be overridden by user code
    • Move thread dispatch from codec to pipeline to allow for coalescing and locality?
  • Akshay / Generalized array support
    • Open issue and tag with v3
  • Sanket / Summary for the core-devs → potential blog post in the future
    • JH: I can take this on; target April?
    • SV: Sounds good!
  • Agriya / Zarr Pyodide support, out-of-tree
    • Requires patching numcodecs, zarr here and there
      • Zarr is pure Python, so fewer patches there. Numcodecs needs more patches because it is Cython-based.
      • Already done by Pyodide devs per Pyodide/Emscripten release
        • The Emscripten and Pyodide versions are not decoupled yet
        • Leads to missing versions
    • Establish CI job that runs on PRs and nightly — or just nightly
    • Issue with this is maintainability and how to keep support?
    • Interactive documentation (end goal).
    • Action item: I will be opening an issue for this on the Zarr repository and link both previous discussions (the ones that I have found). Discussion may proceed there further
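
The agreement recorded above, that opening an existing array with non-matching options should raise rather than silently return the existing array, could look roughly like this. `open_or_create` and the dict-of-metadata store are hypothetical stand-ins, not the zarr-python API.

```python
def open_or_create(store, path, **requested):
    """Return array metadata at `path`, creating it if absent.

    `store` maps paths to metadata dicts in this sketch.
    """
    existing = store.get(path)
    if existing is None:
        store[path] = dict(requested)
        return store[path]

    # Compare only the options the caller actually passed.
    mismatched = {
        key: (existing.get(key), value)
        for key, value in requested.items()
        if existing.get(key) != value
    }
    if mismatched:
        raise ValueError(
            f"existing array at {path!r} conflicts with requested options "
            f"(existing, requested): {mismatched}"
        )
    return existing


store = {}
created = open_or_create(store, "a", dtype="int32", shape=(4,))
reopened = open_or_create(store, "a", dtype="int32")
```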

February 28, 2024

  • Joe Hamman / @jhamman (JH)
  • Tom Nicholas / @TomNicholas
  • Norman Rzepka / @normanrz (NR)
  • Davis Bennett / @d-v-b (DB)
  • Sanket Verma / @MSanKeys963 (SV)
  • Akshay Subramaniam / @akshaysubr (AS)
  • Charles Stern
  • Alden Keefe Sampson (AKS)

Meeting notes:

February 14, 2024

  • Norman Rzepka / @normanrz (NR)
  • Davis Bennett / @d-v-b (DB)
  • Sanket Verma / @MSanKeys963 (SV)
  • Akshay Subramaniam / @akshaysubr (AS)

Meeting notes:

January 31, 2024

Attendees

  • Sanket Verma / @MSanKeys963 (SV)
  • Norman Rzepka / @normanrz (NR)
  • Davis Bennett / @d-v-b (DB)
  • Max Jones / @maxrjones (MJ)
  • Alden Keefe Sampson / @aldenks (AS)
  • Raphael Hagen / @norlandrhagen (RH)
  • Charles Stern / @cisaacstern (CS)
  • Jeremy Maitin-Shepard / @jbms (JMS)

Meeting notes:

  • NR: Codec pipeline
    • Open question: merging partial and full versions?
    • Next - reading/writing partial chunks for uncompressed data
  • DB: Saransh helped with hatch and source layout updates
  • MJ: No updates, participating in Joe's virtual sprint on Zarr refactor
    • can test out setting up test env with Hatch, provide feedback
  • AS: Setup on dev environment, still intending to work on high-level methods.
  • RH: No updates
  • CS: Interested in participating in Zarr sprint remotely
  • JMS: No updates, analogous decisions in tensorstore

January 17, 2024

Attendees

  • Sanket Verma / @MSanKeys963 (SV)
  • Joe Hamman / @jhamman (JH)
  • Norman Rzepka / @normanrz (NR)
  • Davis Bennett / @d-v-b (DB)
  • Max Jones / @maxrjones (MJ)
  • Alden Keefe Sampson / @aldenks (AS)
  • Raphael Hagen / @norlandrhagen (RH)

Meeting notes:

  • SV: plans for numcodecs going forward
    • TODO: connect with JK about this
  • JH: made some good progress on the Store API
    • Thinking about what to do when keys are missing, raise KeyError or return None
    • Needs work: getsize, move, tree, rmdir, open, close
      • NR: move should only exist on a store if it's cheap
    • Open questions:
      • Store.list_* could change to return async generators
  • NR: working on codec api, removing array metadata
  • DB: working on a messy / unmergeable PR for the Array API
    • end goal: unify array/group apis for v2 and v3
    • added a new directory with v2 and v3 metadata
    • stuck on dataclass inheritance
    • JH: where will the normalization of metadata keys happen?
  • MJ: not much, going to pick up the hatch PR
  • AKS: playing with zarr in rust
    • very fast!
    • going to pick up the top level api this week
  • RH: No updates at this time
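
A small sketch of the two open Store API questions above: what `get` does for a missing key, and `list_*` methods that return async generators. `SketchStore` and `DictStore` are hypothetical; this version picks "return None" for missing keys purely to show one of the two options.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator, Optional


class SketchStore(ABC):
    """Hypothetical abstract store; not the zarr-python Store class."""

    @abstractmethod
    async def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    async def set(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def list_prefix(self, prefix: str) -> AsyncIterator[str]: ...


class DictStore(SketchStore):
    def __init__(self):
        self._data = {}

    async def get(self, key):
        # Option shown here: missing keys give None instead of KeyError.
        return self._data.get(key)

    async def set(self, key, value):
        self._data[key] = value

    async def list_prefix(self, prefix):
        # Async generator: callers consume it with `async for`.
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key


async def demo():
    store = DictStore()
    await store.set("a/zarr.json", b"{}")
    await store.set("a/c/0", b"\x00")
    await store.set("b/zarr.json", b"{}")
    keys = [key async for key in store.list_prefix("a/")]
    missing = await store.get("does/not/exist")
    return keys, missing


keys, missing = asyncio.run(demo())
```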

Discussion

What goes in the beta release

  1. core array, group, and store api
  2. thesis: almost feature complete but the api should be set
    • we want people to start using it so we can get some feedback

January 3, 2024

Attendees

  • Sanket Verma / @MSanKeys963 (SV)
  • Alden Keefe Sampson / @aldenks (AS)
  • Joe Hamman / @jhamman (JH)
  • Norman Rzepka / @normanrz (NR)
  • Davis Bennett / @d-v-b (DB)
  • Max Jones / @maxrjones (MJ)

Meeting notes:

  • JH: Still working on Group and Store APIs
  • NR: Left off with codec api, sharding api, and sharding layouts
  • DB: Still working on array api
    • considering a major change to indexing/slicing api (slicing a Zarr array gives NumPy array, which is weird and should give out a Zarr array)
    • thinking about serialization of nested objects
    • partial writes
  • MJ: Looking for a place to jump in
  • AKS: may want to jump in on the top level zarr.foo* api

December 20, 2023

Attendees

  • Charles Stern / @cisaacstern (CS)
  • Jack Kelly / @JackKelly (JK)
  • Sanket Verma / @MSanKeys963 (SV)
  • Alden Keefe Sampson / @aldenks (AS)
  • Joe Hamman / @jhamman (JH)

Meeting notes:

  • CS: nothing directly on zarr, keeping an eye on the zarr issues with help wanted tags
  • SV: looking at hatch pr from davis, zep 0 revisions, and zarr paper
  • JK: working on a light-speed-io project (rust), playing around with ideas for fast data loading
  • AS: seems too early to jump in, don't want to get in the way
  • JH: Many things in progress: Store, Codecs, Arrays, Groups

December 6, 2023

Attendees

  • Ryan Abernathey / @rabernat
  • Joe Hamman / @jhamman (JH)
  • Charles Stern / @cisaacstern (CS)
  • Jack Kelly / @JackKelly (JK)
  • Sanket Verma / @MSanKeys963 (SV)
  • Davis Bennett (DB) / @d-v-b
  • Raphael Hagen / @norlandrhagen
  • Alden Keefe Sampson / @aldenks
  • Norman Rzepka / @normanrz

Meeting notes:

  • Design doc for v3.0 has moved to GitHub. If you want to comment on the design then comment on the GitHub pull request.
  • Zarrita: Alistair started it as a place to experiment with the Zarr v3 spec (back in July 2020). Norman picked Zarrita up in Summer.
  • We've taken Zarrita, copied it, renamed & refactored things. There's a new Store interface (we're leaving behind the mutable mapping interface of Zarr stores.) Aiming for 100% coverage of static type checking.
  • Norman, Davis & Joe have been together for the last 3 days (in Berlin). JH has been working on the Group API (compatibility with Zarr-Python's group API). See this PR. For v2 there are two metadata docs which describe a group. Reads now happen async: which cuts the loading time for groups in half. Now working on listing contents of a group.
  • DB: In Zarrita we have representations of arrays for v2 and v3. DB has been working on a uniform interface to v2 and v3. Breaking lots of things :). Looking at how codecs decompress & compress. Overall strategy is to use the v3 way of doing things. See DB's PR.
  • RA: It's great that there are both async and sync APIs. But downstream datascience libraries will always want the sync API.
  • NR: The Zarrita code is based on fsspec, with some small changes.
  • JH: We're just using fsspec (for now). Very convinced that having an async API is the right call for Zarr-Python. Less convinced that the fsspec way of doing things will be the long-term solution.
  • NR: Adding sharding strategies. Customise how chunks are laid out in the file. e.g. if you want to do partial writes (where you can write to specific places). Instead of having to write entire shards at a time. Writing tests right now. First PR has been merged into the v3 branch (codecs).
  • JH: The codecs are now an entry point into the Zarr-Python code. Zarr-Python v2 basically supports any codec in numcodecs. Do we need to register all of those compressors and filters as codecs? Or should we limit them?
  • NR: Let's make a generic codec. bytes-to-bytes, and array-to-bytes.
  • JH: For now, we've decided not to work on variable chunk sizes. We could release a version of Zarr-Python without variable chunk sizes. Questions?
  • RA: Everyone's very supportive! This is what we need to get over the rut that zarr-python has been stuck in. A lot of folks would like to help, but don't know where. Are there concrete tasks that we can give to people? (The answer may be "no"! Some software projects are just not parallelizable like that.)
  • JH: Some of these first blocks of work have required us to already resolve conflicts. I'll jot down a couple of tasks which could be useful for folks to work on.
    • the top-level API has not been ported over yet (e.g. zarr.open()). Most people use that top-level API.
    • documentation! A lot of copy-and-pasting from Zarr-Python v2.0. But some function signatures will change.
    • type checking needs work! MyPy isn't happy right now.
  • DB: v3 introduces the concept of a codec pipeline. We build an object which is a sequence of transformations of chunk data, which leads to it being stored, or - in reverse - leads to it being opened. The documentation for this doesn't exist yet. If anyone has an idea for a transformation, then work through the process of doing this with the v3 machinery, and write up some docs for how to do this. v3 is a lot more explicit about how chunks are encoded.
  • NR: +1 to adding codecs, or wrapping numcodecs. Also:
    • try wiring the new Zarr-Python to downstream libraries. That would tell us what's missing in terms of the public API.
    • Also, it'd be useful to have a champion for variable chunking. The codec pipeline should know about variable chunking.
  • RA: The problem with the ZEP process is that it's hard for ZEP authors to just implement the ZEP.
  • JH: We should be able to find byte-sized chunks which folks can work on.
  • NR: It'd be great to keep up the momentum after this week. And it'd be great to have a beta by Jan 2024!
  • RA: Where is the discussion of the ZEP3 proposal (here's the PoC implementation PR)? And the discussion is here.
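
The codec pipeline idea described above (a sequence of transformations applied to chunk data on write and reversed on read) can be sketched with the standard library. All class names here are hypothetical and the chunk is a plain list of ints rather than an array; it only illustrates the array-to-bytes and bytes-to-bytes stages, not the zarr-python machinery.

```python
import struct
import zlib


class Int32BytesCodec:
    """array-to-bytes: list of ints <-> little-endian int32 bytes."""

    def encode(self, values):
        return struct.pack(f"<{len(values)}i", *values)

    def decode(self, data):
        return list(struct.unpack(f"<{len(data) // 4}i", data))


class ZlibCodec:
    """bytes-to-bytes: compress / decompress."""

    def encode(self, data):
        return zlib.compress(data)

    def decode(self, data):
        return zlib.decompress(data)


class CodecPipeline:
    def __init__(self, codecs):
        self.codecs = list(codecs)

    def encode(self, chunk):
        # Apply codecs in order on the way to storage.
        for codec in self.codecs:
            chunk = codec.encode(chunk)
        return chunk

    def decode(self, stored):
        # Apply the same codecs in reverse order when reading.
        for codec in reversed(self.codecs):
            stored = codec.decode(stored)
        return stored


pipeline = CodecPipeline([Int32BytesCodec(), ZlibCodec()])
chunk = list(range(16))
stored = pipeline.encode(chunk)
restored = pipeline.decode(stored)
```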

Looking for a champion on:

  • variable chunking
  • synchronizers
  • h5py compat

Agenda

  • Update from Joe + Davis on refactor progress

November 22, 2023

Attendees

  • Joe Hamman / @jhamman (JH)
  • Charles Stern / @cisaacstern (CS)
  • Jack Kelly / @JackKelly (JK)
  • Sanket Verma / @MSanKeys963 (SV)
  • Davis Bennett (DB)

Agenda

Meeting notes:

  • JH: We can start working from store interface - kind of leaky abstraction
  • JK: Looking to read a million chunks - sharding helps with that - discussions around batching requests in Zarr-Python - requesting a million chunks in a single request - is Zarr V3 a good place to pull in these performance bumps? (don't want to delay the existing work)
  • DB: To make get more efficient, you need to wrap it in something - mostly users are getting multiple chunks at a time
  • JH: In Dask/Xarray world you map a single chunk of Zarr at a time - At Earthmover there is 1-to-1 reads - to handle big size chunks we have rechunker - sharding codec sits above the store interface
  • JH: https://github.com/scalableminds/zarrita/blob/async/zarrita/sharding.py#L309 - indexing for sharding - the sharding codec will need access to store API whereas the other codecs don't need it
  • DB: Like the idea - add a new abstraction - we have leaky abstraction and we can use it
  • JH: Norman is willing to help but only if Zarr-Python is first class citizen
  • JH: https://docs.xarray.dev/en/stable/roadmap.html - publish the roadmap on Zarr-Python docs for the community
  • JH: Jack can help us in fast concurrent loading problem
  • JH: Meeting with Davis and Norman in 1.5 weeks to work on Zarr-Python 3.0
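
One way to picture the "wrap get in something" point above is a batched fetch that issues many chunk requests concurrently with a cap on in-flight requests. This is only a sketch: `SlowStore` simulates latency in memory, and `get_many` and its `max_concurrency` parameter are hypothetical names.

```python
import asyncio


class SlowStore:
    """In-memory store whose get() simulates network latency."""

    def __init__(self, data, latency=0.01):
        self._data = data
        self._latency = latency

    async def get(self, key):
        await asyncio.sleep(self._latency)
        return self._data.get(key)


async def get_many(store, keys, max_concurrency=100):
    """Fetch many keys concurrently, capping the requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch(key):
        async with semaphore:
            return key, await store.get(key)

    pairs = await asyncio.gather(*(fetch(key) for key in keys))
    return dict(pairs)


store = SlowStore({f"c/{i}": bytes([i]) for i in range(50)})
chunk_keys = [f"c/{i}" for i in range(50)]
chunks = asyncio.run(get_many(store, chunk_keys, max_concurrency=10))
```

With a sequential loop the simulated latency would add up per key; here it is paid roughly once per batch of `max_concurrency` keys.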

November 8, 2023

Attendees

  • Joe Hamman / @jhamman
  • Charles Stern /
  • Raphael Hagen / @norlandrhagen
  • Sanket Verma / @MSanKeys963
  • Martin Durant /

Agenda

November 1, 2023

Attendees

  • Joe Hamman / @jhamman
  • Davis Bennett / @d-v-b
  • Sanket Verma / @msankeys963
  • Raphael Hagen / @norlandrhagen
  • Charles Stern
  • Brian Davis
  • Thomas Nicholas

Agenda

  • Request - can someone try to take some notes today?
  • Request - can we move this meeting time to 8:30a PT (currently at 9a PT).
  • V3 API migration
    • Now that we are starting to work on implementing v3, we're faced with the question of what to do with the existing API

    • Observation: the current v2/v3 polymorphism is unsustainable (and incorrectly prioritizes v2 internally)

    • Proposal - we create a v3 namespace within zarr-python where we can develop in an isolated space toward a complete v3-spec implementation

      • Included in this namespace:
        • classes: zarr.v3.{Array,Group,Store}
          • These classes implement an internal api that closely aligns with the v3 spec
        • high level functions: zarr.v3.{create, open, }
          • As much as possible, these functions should look and feel like the v2 equivalents but should not be tied to the exact implementation
            • e.g. zarr.create(shape=..., dtype=..., compressor=...) -> zarr.create(shape=..., data_type=..., codecs=..., attributes=...)
          • We may also want to deprecate and/or rename some of the existing top level functions
        • backward compatibility:
          • high-level functions in the v3 namespace should be able to create or open a v2 dataset
          • The Group or Array does not need to be backward compatible though.
      • All development toward v3 happens on the main branch in zarr-python
    • Alternative proposal

      • We avoid the v3 namespace and instead take over the primary namespace in a development branch (e.g. v3)
      • When we feel that the v3 branch is complete, we merge to main and make a 3.0 release
      • Folks have less time to test out the v3 implementation but we have a cleaner development process
    • Ideas

      • Idea of zarr array is to look like a numpy array
        • could move all the zarr array details to a polymorphic metadata object
        • trim things down to just the minimal array api interface
      • declarative hierarchy specification
      • type hints
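
The `zarr.create(shape=..., dtype=..., compressor=...) -> zarr.create(shape=..., data_type=..., codecs=..., attributes=...)` example in the proposal above amounts to a keyword translation. The sketch below is a deliberate simplification with a hypothetical helper name: it only renames `dtype` and wraps the compressor in a list, and does not model filters or the rest of a real v3 codec pipeline.

```python
def translate_v2_create_kwargs(shape, dtype, compressor=None, attributes=None):
    """Map v2-style create() keywords onto the v3-style names from the notes."""
    return {
        "shape": tuple(shape),
        "data_type": str(dtype),
        # Simplification: the v2 compressor becomes the only codec entry.
        "codecs": [compressor] if compressor is not None else [],
        "attributes": dict(attributes or {}),
    }


v3_kwargs = translate_v2_create_kwargs(
    shape=(10, 10), dtype="int32", compressor="gzip"
)
```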

Sanket Notes

  • DB: Definition of Zarr and Dask chunks are different and that's not good
  • JH: Benefits of generative chunk indexing
    • Impacts with sharding, variable chunking and other shiny features
    • Large array with billions of chunks
  • JH: Maintaining both V2 and V3 at the same time is not ideal
    • DB: V2 has a lot of stuff that people don't use - stores
  • TN: The current public-facing APIs (V2 and V3) conform to the existing spec, but we're thinking of working on a new public-facing API that wraps V2 and V3 and does not conform exactly to V3. Isn't that a bad thing?
    • DB: The public-facing Zarr array object API is not covered by the spec anyway
    • Also can't be, because syntax might be language-dependent
    • Therefore we have full freedom in the public python API of the python zarr array type
    • TN: Okay, in that case makes sense to follow python array API standard as much as possible
  • TN: Array API has granular functionality which is super useful (e.g. you can say "we don't support the statistical functions")
  • TN: Note that chunking is not part of the array API standard
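
One reading of "generative chunk indexing" in the notes above is that chunk coordinates are produced lazily instead of being stored in a materialized index, which matters when an array has billions of chunks. A minimal sketch under that assumption, for a regular chunk grid only; `iter_chunk_coords` and `chunk_slices` are hypothetical names, not zarr-python functions.

```python
import math
from typing import Iterator


def iter_chunk_coords(
    shape: tuple[int, ...], chunks: tuple[int, ...]
) -> Iterator[tuple[int, ...]]:
    """Lazily yield chunk grid coordinates in row-major order."""
    # Number of chunks along each axis (ceiling division).
    grid = tuple(-(-s // c) for s, c in zip(shape, chunks))
    for flat in range(math.prod(grid)):
        coord = []
        # Unravel the flat index, last axis varying fastest.
        for n in reversed(grid):
            flat, i = divmod(flat, n)
            coord.append(i)
        yield tuple(reversed(coord))


def chunk_slices(
    coord: tuple[int, ...], shape: tuple[int, ...], chunks: tuple[int, ...]
) -> tuple[slice, ...]:
    """Array region covered by one chunk, clipped at the array edge."""
    return tuple(
        slice(i * c, min((i + 1) * c, s)) for i, c, s in zip(coord, chunks, shape)
    )


coords = list(iter_chunk_coords((4, 5), (2, 2)))
edge = chunk_slices((1, 2), (4, 5), (2, 2))
```

Because the coordinates come from a generator over a `range`, asking for the first chunk of a very large grid does not allocate anything proportional to the number of chunks.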

October 18, 2023

Attendees

  • Joe Hamman / @jhamman
  • Max Jones / @maxrjones
  • Davis Bennett / @d-v-b
  • Tom Nicholas
  • Charles Stern
  • Sanket Verma
  • Ryan Abernathey

Agenda

September 20, 2023

Attendees

  • Joe Hamman / @jhamman
  • Charles Stern / @cisaacstern
  • Sanket Verma / @MSanKeys963
  • Raphael Hagen / @norlandrhagen

Agenda

September 6, 2023

Attendees

  • Joe Hamman / @jhamman
  • Ryan Abernathey / @rabernat
  • Sanket Verma / @msankeys963
  • Raphael Hagen / @norlandrhagen
  • Ryan Williams
  • Charles Stern / @cisaacstern
  • Davis Bennett / @d-v-b

Agenda

  • review scoping section (below)
  • performance
  • zarr + pydantic (https://github.com/janelia-cellmap/pydantic-zarr)
    • observation: Zarr-python is missing specific data models for Groups / Arrays
    • price of depending on pydantic is probably not worth it
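
The observation about missing data models for Groups / Arrays can be illustrated without a pydantic dependency, using only standard-library dataclasses. This is a rough sketch, not the pydantic-zarr or zarr-python API: `ArrayModel`, `GroupModel`, and `walk` are hypothetical names, and the fields shown are a small assumed subset of what a real model would carry.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, Union


@dataclass(frozen=True)
class ArrayModel:
    """Declarative description of an array (no data, no store)."""
    shape: tuple[int, ...]
    data_type: str
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class GroupModel:
    """Declarative description of a group and its members."""
    members: dict[str, Union[ArrayModel, "GroupModel"]] = field(default_factory=dict)
    attributes: dict[str, Any] = field(default_factory=dict)


def walk(
    node: Union[ArrayModel, GroupModel], path: str = ""
) -> Iterator[tuple[str, Union[ArrayModel, GroupModel]]]:
    """Yield (path, node) pairs depth-first, starting at the root."""
    yield path or "/", node
    if isinstance(node, GroupModel):
        for name, child in node.members.items():
            yield from walk(child, f"{path}/{name}")


hierarchy = GroupModel(
    members={
        "a": ArrayModel(shape=(10,), data_type="int32"),
        "g": GroupModel(members={"b": ArrayModel(shape=(2, 2), data_type="float64")}),
    }
)
paths = [p for p, _ in walk(hierarchy)]
```

What pydantic would add on top of this is validation and JSON (de)serialization, which is the trade-off against taking on the dependency.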

Scoping V3 update (by @jhamman)

Written by @jhamman on September 5, 2023

In the Winter and Spring of 2022, while the V3 spec was still under development, an experimental V3 implementation was added to the Zarr-Python codebase (#898). This implementation followed the spec, as it was written at the time. However, in the months following these developments, major changes to the spec were made. This has left Zarr-Python out of sync with the V3 specification.

Summary of current status

  1. V3 support is behind an experimental API (accessed by setting zarr_version=3 and ZARR_V3_EXPERIMENTAL_API=1).
  2. A separate code path for V3 stores was implemented in zarr._storage.v3.

Major changes to the spec since the experimental implementation include:

  • Entrypoint metadata document (zarr.json) is no longer required
  • Metadata keys were renamed (e.g. meta/foo/bar.group.json -> /foo/bar/zarr.json)
  • Group and metadata documents are no longer distinguished by their key name (everything is zarr.json and a node_type field is included in all documents)
  • Various updates to metadata fields:
    • format_version: int
    • added dimension_names
    • removed chunk_memory_layout (in favor of transpose codec)
    • codecs now includes a list of codecs that was previously split between the filters and compressor fields
    • etc.
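
Two of the changes listed above, the metadata key rename and the merged codecs list, can be sketched as small pure functions. This is illustrative only: `new_metadata_key` and `merge_codecs` are hypothetical helpers, the `.array.json` suffix is an assumption by analogy with the `.group.json` example in the notes, and real codec entries carry more structure than the placeholder strings used here.

```python
from typing import Any, Optional


def new_metadata_key(old_key: str) -> str:
    """Map e.g. 'meta/foo/bar.group.json' to '/foo/bar/zarr.json'."""
    prefix = "meta/"
    if not old_key.startswith(prefix):
        raise ValueError(f"not a draft-era metadata key: {old_key!r}")
    path = old_key[len(prefix):]
    # Groups and arrays no longer differ by key name; both become zarr.json.
    for suffix in (".group.json", ".array.json"):
        if path.endswith(suffix):
            return f"/{path[: -len(suffix)]}/zarr.json"
    raise ValueError(f"unrecognised metadata suffix: {old_key!r}")


def merge_codecs(filters: Optional[list[Any]], compressor: Optional[Any]) -> list[Any]:
    """Combine the old filters and compressor fields into one codecs list."""
    codecs = list(filters or [])
    if compressor is not None:
        # Assumption: the compressor runs after the filters, so it goes last.
        codecs.append(compressor)
    return codecs


key = new_metadata_key("meta/foo/bar.group.json")
codecs = merge_codecs(["delta"], "gzip")
```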

Open questions:

  • fallback data types

Actions

https://github.com/orgs/zarr-developers/projects/5/views/1

Zarr refactor meeting

Aug 16, 2023

Attendees

  • Joe Hamman (Earthmover)
    • Xarray and Zarr dev
  • Sanket Verma (Zarr)
  • Tom White (independent dev)
    • SGKit and Cubed
  • Max Jones (CarbonPlan)
    • Data scientist
  • Raphael Hagen (CarbonPlan)
    • Data eng.
  • Charles Stern (Columbia)
    • Pangeo-forge

Discussion

  • Max: how do we view V3 extensions already in Zarr-python
  • Charles: how does Zarr python register plugins
  • Zarrita (https://github.com/scalableminds/zarrita/) - reference implementation
    • no baggage / tech debt of Zarr-python
    • not production ready
    • also has sharding
  • Tom: Interop tests between implementations

Timeline

Goal: by the end of the year, have a fully-functional implementation of V3 in Zarr Python

  • Starting now: survey users to get an understanding of how a breaking change to the V3 implementation will impact users
  • Next two weeks: Break #1290 into smaller chunks and set up project board
  • September: start refactor efforts
  • Oct-Dec: Integration and interop testing

TODOs:

  • add regular call to community calendar
  • break out V3 implementation tasks into issues / project board (try to identify issues that can be picked up by others)
  • publish readout of this call