---
tags: zarr, ZEPs, Meeting
---
# ZEPs Bi-weekly Meetings
### **Check out the website: https://zarr.dev/zeps/meetings**
Joining instructions: [https://openmicroscopy-org.zoom.us/j/82447735305?pwd=U3VXTnZBSk84T1BRNjZxaXFnZVQvZz09](https://openmicroscopy-org.zoom.us/j/82447735305?pwd=U3VXTnZBSk84T1BRNjZxaXFnZVQvZz09)
Meeting ID: 82447735305
Password: 016623
GitHub repo: https://github.com/zarr-developers/zeps
## 2023-03-16
**Note:** This was an impromptu ZEP meeting. Please see the corresponding message on Gitter [here](https://matrix.to/#/!nZLdXRRzIbkoDjkEvS:gitter.im/$oxM2UpzOTs--6P1Itl6gPWvLRCBEv_npvxJYi5m95l8?via=gitter.im&via=matrix.org&via=cadair.com).
**Attending:** Sanket Verma (SV), Norman Rzepka (NR), Jonathan Striebel (JS), Jeremy Maitin-Shepard (JMS)
**Meeting Minutes:**
- NR: Confused about sharding as a codec! - Making sharding a codec will change the interface -
- JMS: doesn’t agree with Martin’s point
- JS: doesn’t change the codec - will nest a new one - V3 doesn’t need to change anything
- NR: Need to change ZEP2 then
- NR: Zarr-Python API is a mess right now
~Jeremy joins in~
- NR: Should we send ZEP1 for voting?
- JMS: Martin wants to push everything as a storage transformer - keep storage transformers in the spec - we could also defer that decision to ZEP2
- NR: I wonder if we have too much implementation detail in V3? - whether we need partial reads or not?
- JMS: Partial reads are not required for sharding
- JS: https://github.com/zarr-developers/zarr-specs/issues/161
- JMS: Mostly concerned with JSON metadata - haven’t started doing the implementation
- NR: Some behaviour needs to be defined - everything goes beyond - doesn’t need to strip the spec
- JS: hierarchy discovery - what happens if you delete the chunks?
- JS: We’re fine as it is now!
- JMS: For someone reviewing ZEP0001 - JSON metadata is important - but it is buried in the middle - an editorial change to put it at the top would be helpful
- JS: Glossary defined at the top is not optimal
- JMS: Would look into restructuring the metadata to the top
- JMS: Would start working on V3 implementations in Neuroglancer and TensorStore
- JS: https://github.com/zarr-developers/zarr-specs/pull/222 -
- NR: This is something of an implementation detail
- JS: Maybe we can mark them as implementation details
- JMS: Josh and Ryan brought up the idea of codec vs transformers in the last ZEP meeting - so I wrote this PR
- NR: Move sharding to a codec?
- JMS: Josh was skeptical - Ryan was in favour - we should go ahead with the proposal for sharding as a codec (a toy sketch appears at the end of these minutes)
- NR: Will make the changes to ZEP2 by next week
- JS: Having sharding be similar to blosc2 and nvcomp is a strongly held opinion
- NR: Contributors for V3 so far?
- SV: Will make a contributor list after the voting email goes out!
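A minimal sketch of the "sharding as a codec" idea discussed above, assuming a shard is the concatenation of independently encoded inner chunks plus a trailing index of `(offset, nbytes)` pairs; the function names and binary layout are illustrative, not the actual ZEP2 proposal.
```python
import struct

def encode_shard(inner_chunks: list[bytes]) -> bytes:
    """Concatenate encoded inner chunks, then append an (offset, nbytes) index."""
    payload = b"".join(inner_chunks)
    index, offset = [], 0
    for chunk in inner_chunks:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    footer = b"".join(struct.pack("<QQ", o, n) for o, n in index)
    # The final 8 bytes record how many entries the index holds.
    return payload + footer + struct.pack("<Q", len(index))

def decode_inner_chunk(shard: bytes, i: int) -> bytes:
    """Read one inner chunk via the index -- no full-shard decode needed."""
    (n_entries,) = struct.unpack("<Q", shard[-8:])
    index_start = len(shard) - 8 - 16 * n_entries
    offset, nbytes = struct.unpack_from("<QQ", shard, index_start + 16 * i)
    return shard[offset:offset + nbytes]

shard = encode_shard([b"chunk-0", b"chunk-11", b"chunk-222"])
assert decode_inner_chunk(shard, 1) == b"chunk-11"
```
Framed this way, sharding nests an inner codec pipeline rather than changing the store interface, which is JS's point above.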
## 2023-03-09
**Attending:** Hailiang Zhang (HZ), Dieu My Nguyen (DMN), Josh Moore (JM), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JMS)
**Meeting Minutes:**
- JMS: sharding as transform vs. codec
- https://github.com/zarr-developers/zarr-specs/issues/220
- RA: been playing with sharding, trying to get the most out of it
- Key question (impl or spec?): if sharding is a codec, how does the outer layer communicate which range it wants? (general problem in zarr-python for blosc)
- requires passing context through to the codec
- transform is explicit; codec is less clear
- JMS: in zarrita he has the codec take an indexing expression (optional?). defer some of that for ZEP2? arrays vs. bytes vs. additional concept of arrays.
- RA: similar to Martin's request
- JMS: first codec is fine, but the next one is less clear. need to be more explicit about the interface.
- RA: need to solve this. what information needs to be passed in between (implementors and at spec level)
- RA: e.g. could be a codec that takes an HDF5 file (blosc2, etc.) missed a chance to build the right abstractions there.
- JMS: `codecs := array|bytestream in; array|bytestream out` (see the typed-codec sketch at the end of these minutes)
- JM: recursive zarrs all the way down?
- JMS: concatenation of other arrays
- RA: Norman's justification. JMS' proposal. re: how to integrate other things like referencing between arrays, shards defining own chunking, etc. (doesn't change anything in ZEP1)
- JMS: transforms as bytes, and codecs can access arrays
- JMS: NB: MD wants low level store to be aware of array indexing
- JM: always thought of codecs as the lowest thing that is unaware of arrays
- JMS: combined compression with filters (which can operate on arrays, transpose)
- RA: sharding fundamentally breaks the core abstraction between store / codec. at the impl. level, want efficient/fast code to fetch chunks of a shard, make smart decisions, close to the metal. but the naive thing isn't fast. do the core abstractions break down? no longer using the key/value store API. using offsets into storage.
- JMS: don't see byte range as breaking. addition to the interface.
- RA: not just a file format, but a protocol for addressing chunks.
- JMS: dimension names metadata
- https://github.com/zarr-developers/zarr-specs/issues/219
- RA: would be for keeping or making it easy to not use
- HZ/DMN: ZEP5 presentation (recorded)
- https://zarr.dev/zeps/draft/ZEP0005.html
- HZ: Tabling because of Zoom issues.
- RA: re: expectations -- very limited due to the number of people working on the spec. (it's taken *years*) so ... 6 months?
- HZ: this is an extension; it doesn't block anything.
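A sketch of JMS's `array|bytestream in; array|bytestream out` framing from above: typing each codec by what it consumes and produces makes the legal pipeline orderings checkable. The class names are illustrative, not from the spec.
```python
from abc import ABC, abstractmethod
import numpy as np

class ArrayArrayCodec(ABC):      # array in, array out (e.g. transpose)
    @abstractmethod
    def encode(self, a: np.ndarray) -> np.ndarray: ...

class ArrayBytesCodec(ABC):      # array in, bytes out (the serialization step)
    @abstractmethod
    def encode(self, a: np.ndarray) -> bytes: ...

class BytesBytesCodec(ABC):      # bytes in, bytes out (e.g. a compressor)
    @abstractmethod
    def encode(self, b: bytes) -> bytes: ...

class Transpose(ArrayArrayCodec):
    def __init__(self, order: tuple[int, ...]):
        self.order = order
    def encode(self, a: np.ndarray) -> np.ndarray:
        return np.transpose(a, self.order)

class ToBytes(ArrayBytesCodec):
    def encode(self, a: np.ndarray) -> bytes:
        return np.ascontiguousarray(a).tobytes()

# A valid pipeline crosses the array->bytes boundary exactly once:
# [ArrayArrayCodec ...] -> ArrayBytesCodec -> [BytesBytesCodec ...]
```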
## 2023-02-23
**Attending:** Sanket Verma (SV), Ward Fisher (WF), Ryan Abernathey (RA), Jonathan Striebel (JS)
**TL;DR:**
**Updates:**
- New extension by Hailiang Zhang, see here: https://github.com/zarr-developers/zeps/pull/31
**Meeting Minutes:**
- JS: No items in the "TODO" and "Needs PR" columns on the ZEP1 [project board](https://github.com/orgs/zarr-developers/projects/2/views/2) - spec is coming to its final stage 🎉 - last few days at scalable minds - needs to finish the remaining tasks soon!
- JS: Should we ship V3 with OGC?
- RA: We already have the Zarr V2 spec as an OGC standard - we can ask for v3 - but it's more of a take-it-or-leave-it thing
- JS: [chunk key encoding](https://github.com/zarr-developers/zarr-specs/issues/172) could be an extension - separate key in the metadata - may prepend 0 - two things should be configurable separately (a small sketch appears at the end of these minutes)
- RA: could it be an extension?
- JS: possibly!
- RA: we should have it - this could make it possible to see only metadata when opening a directory, without listing the whole array
- JMS: separator `/` would allow multiple possibilities
- JS: should be backwards compatible, not a breaking feature
- RA: working on an extension ZEP for non-listable stores -
- wants to run it by the community first - read only stores - no writes to these stores
- copied from STAC - link: https://github.com/radiantearth/stac-spec/blob/master/catalog-spec/catalog-spec.md#relation-types
- provide explicit links between parent and child documents - write a store and create links for the store -
- JMS: new group property, no reason to have parent - because you always know the parent
- RA: what if someone gives you the address of an array in the middle of a hierarchy? - it’s helpful then
- RA: Maybe we should not advertise V3 as we don't have a reference implementation
- SV: hack week is a good idea to get rid of the existing technical debt
- JMS: Looking to implement V3 in tensorstore and can help with Zarr-Python too!
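A small sketch of the configurable chunk key encoding discussed above. The `"c"` prefix and `"/"` vs `"."` separators match where the v3 spec eventually landed, but the function itself is only illustrative.
```python
def chunk_key(coords: tuple[int, ...], separator: str = "/",
              prefix: str = "c") -> str:
    """Build a chunk key; the separator and prefix are independent knobs."""
    parts = ([prefix] if prefix else []) + [str(c) for c in coords]
    return separator.join(parts)

assert chunk_key((0, 12)) == "c/0/12"          # v3-style, "/"-separated
assert chunk_key((0, 12), ".", "") == "0.12"   # v2-style, "."-separated
```
Keeping chunks under their own prefix is what lets a client open a directory and see only metadata, per RA's point above.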
## 2023-02-16
**Attending:** Sanket Verma (SV), Dieu My Nguyen (DMN), John A. Kirkham (JK), Hailiang Zhang (HZ), Johana Chazaro (JC), Jeremy Maitin-Shepard (JMS), Akshay Subramaniam (AS)
**TL;DR:**
**Updates:**
**Points of discussions:**
ZEP 1:
- Anything missing for https://github.com/zarr-developers/zarr-specs/pull/204?
- Change global storage transformer PR to group storage transformers:
https://github.com/zarr-developers/zarr-specs/pull/182
- Should we update or remove the "Storage – Operations" section?
https://github.com/zarr-developers/zarr-specs/issues/206
- ZEP 1 needs updates: https://github.com/zarr-developers/zeps/issues/32
- URL to groups and arrays:
https://github.com/zarr-developers/zarr-specs/issues/132#issuecomment-1433105652
- Prepare mail for the councils for the vote
**Meeting Minutes:**
- SV: Your thoughts on checksum for shards? Check the discussion [here](https://github.com/zarr-developers/zarr-specs/pull/152#issuecomment-1412688953)
- AS: Not really thought about the ZEP extension but it could be!
- AS: Want to support applications that don’t support Zarr - would be nice to support shards - send a shard over the network and decompress it at the other end
- AS: KvikIO doesn’t do compression - would be nice to support this
- JMS: There’s some tension with the Zarr model - shard has data and metadata - maybe duplicate the metadata and add info over to it
- AS: checksum issue is not critical - some more metadata to shard - applications in genomics and geospatial data - having the number of chunks would help - applications have the context for unpacking
- AS: Zarr can have wrapper which can put data in the right place
- JMS: having a container of the strings is an abstraction of what we want
- JK: Is it dependent on the data? - The compressor and the chunks and shards
- AS: what compressor works with what data is a subjective choice
- JK: Compressor could have branching logic?
- AS: Could be logical to create a new compressor
- JK: Branching logic would change for different datasets?
- AS: It could.
- SV: Dataset is public? Or can be made public?
- AS: We can make it public - I’ll look into it
- HZ: Brief summary of [ZEP0005](https://github.com/zarr-developers/zarr-specs/pull/205): GES DISC is looking at averaging the chunks - cost is high - introduce the algorithm - keep the overhead small - for 1 TB of regional data the extra overhead after accumulation would be ~5% - it is working really well - big improvement in performance - very accurate - Ryan suggested adding it as an extension during last year's ESIP meeting
- SV: Maybe add benchmarks?
- HZ: We can do that!
- JMS: Question on the specs PR
- HZ: User can specify any random range - corners would aggregate the chunk values - loading a single chunk is easy - this single chunk would contain the aggregation value and you load it - it would be transparent to the user
- JMS: makes sense to have a stride - for images you want to store a pyramid - you can represent a downsampling pyramid - you can do this with your proposal
- HZ: if you can zoom in and zoom out - regular stride - how do we set up the stride? - in theory is it possible? - is it possible to have a version 2 for this extension ZEP? - the separation would be a good idea
- JMS: Multiply stride by chunk size (a toy accumulation example appears after these minutes)
- HZ: Saving multiple chunks is not a problem - currently doesn’t support any random stride
- JMS: It would mean more work for implementations, but it would cover more generic use cases
- HZ: I will think hard and include the modification in the PR
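A 1-D toy of the chunk-level accumulation idea behind ZEP0005 as discussed above: store a prefix sum at chunk granularity so the mean over any chunk-aligned range needs two lookups instead of a read of the raw data. The real ZEP0005 layout differs; this only illustrates the arithmetic.
```python
import numpy as np

data = np.arange(100, dtype=np.float64)
chunk = 10

# Per-chunk sums, then a prefix sum with a leading zero.
chunk_sums = data.reshape(-1, chunk).sum(axis=1)
prefix = np.concatenate([[0.0], np.cumsum(chunk_sums)])

def range_mean(c0: int, c1: int) -> float:
    """Mean of data[c0*chunk : c1*chunk] using only the accumulation array."""
    return (prefix[c1] - prefix[c0]) / ((c1 - c0) * chunk)

assert range_mean(2, 5) == data[20:50].mean()
```
JMS's stride suggestion then amounts to storing such accumulations at strides that are multiples of the chunk size, which is what makes a downsampling pyramid representable.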
## 2023-02-09
**Attending:** Sanket Verma (SV), Ward Fisher (WF), Isaac Virshup (IV), Virginia Scarlett (VS), Jonathan Striebel (JS), Jeremy Maitin-Shepard (JMS)
**TL;DR:**
**Updates:**
**Open Agenda(add here 👇🏻):**
- IV: Strings + variable length binary – New ZEP?
## 2023-02-02
**Attending:** Sanket Verma (SV), Josh Moore (JM), Ryan Abernathey (RA), Jonathan Striebel (JS), Jeremy Maitin-Shepard (JMS)
**TL;DR:**
**Updates:**
**Open Agenda(add here 👇🏻):**
- SV: Is there an R implementation of Zarr?
- JM: Only [Rarr](https://github.com/grimbough/Rarr) (with active development)
- RA: A Rust implementation is a good place to put our efforts; it would be a good binary implementation that would be useful to communities of other languages
- JMS:
- RA: Took sharding for the test-drive
- RA: Storage transformers don't have `get_items` and `set_items`
- Is a good thing
- JS: Does have partial values and it could cover keys
- Martin thinks API is not clean now
## 2023-01-26
**Attending:** Jonathan Striebel (JS), Ward Fisher (WF), Josh Moore (JM), Sanket Verma (SV), Jeremy Maitin-Shepard (JMS)
**TL;DR:**
**Updates:**
- Weekly ZEP meetings until March, 2023
**Open Agenda(add here 👇🏻):**
- JS: Timeline
- https://github.com/orgs/zarr-developers/projects/2 (in discussion and TODO)
- Not voting by end of January
- More realistic? End of February
- Josh: agreed with handover e.g. end of February. (can be more active in March)
- JS: How long for the review?
- SV: 1 month?
- nodding...
- [Prefix](https://github.com/zarr-developers/zarr-specs/issues/177)
- Underscores and escaping. (needs to happen in group)
- [unicode](https://github.com/zarr-developers/zarr-specs/issues/56)
- allowed: +1
- recommended set of characters (lower case, digits, hyphens)
- normalization?
- filesystem does normalization on matching
- online there's no normalization
- default: we do nothing (see the example below)
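An example of the normalization issue above: two byte-wise different names can render identically. Some filesystems match them on lookup, object stores do not, and "we do nothing" means the spec treats them as distinct keys.
```python
import unicodedata

composed = "caf\u00e9"       # "café" with é as a single code point (NFC)
decomposed = "cafe\u0301"    # "café" as e + combining acute accent (NFD)

assert composed != decomposed                                # two distinct keys
assert unicodedata.normalize("NFC", decomposed) == composed  # same after NFC
```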
## 2023-01-12
**Attending:** Jonathan Striebel (JS), Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Ryan Abernathey (RA), Sanket Verma (SV), Ward Fisher (WF), Dennis Heimbigner (DH)
**TL;DR:**
**Updates:**
- https://github.com/zarr-developers/zeps/pull/27
- Any objections?
**Open Agenda(add here 👇🏻):**
- ZEP 1 issues that need attention:
- **prefix**: https://github.com/zarr-developers/zarr-specs/issues/177
- open question is the prefix for the chunk directory
- potentially to-be-used for json files, etc.
- also useful for extensions that add new folders
- Dennis: `zarr.chunks`? Good to identify them. (since there are arbitrary delimiters)
- with HDF5 people have experimented with accessing chunks directly, so need easy identification
- Options:
- `_`
- `__`
- `_z_`
- `z.`
- `zarr.`
- `_zarr_`
- Ryan: what's the goal of the prefix?
- JS: preventing node-name collision
- JM: preventing extension collision
- DH: DAP4 rule attempts to have self-assigned namespace, piggybacking on DNS (.ucar.edu)
- RA: work through use cases, e.g. nczarr files
- RA: wants the ability to offload metadata into a separate document
- DH: also apply it to keys within the attributes
- RA: see [ZEP4](https://github.com/zarr-developers/zeps/pull/28). special attributes is another discussion.
- WF: A convention that `_zarr` is reserved is longer, but feels less prone to collision than `_z`
- JMS: In general I think we want to reserve a prefix such as "_" for zarr itself and extensions, and then perhaps a subset of that should be reserved for just zarr itself (not extensions).
- NB/DH: would suggest a top-level group
- DH: Do you have sufficient metadata?
- What does it mean to access a raw variable?
- JM: there's still a directory. metadata+chunks (but a place to put extension files as well)
- JM: bidirectionality would need some work to make sure that a group doesn't magically appear
- RA: cf. how GDAL and Geo-tiff (etc) handle this
- DH: two purposes of group -- namespace and a place to store attributes that aren't part of the variable
- DH: example is group-level superblock marker
- chat
- RA: I worry that if we make `_` a disallowed prefix, lots of datasets may not work
- RA: I feel like there is plenty of data out there in the wild that has a `_` as the first character in a variable
- RA: Using ‘_’ as a convention in netCDF is a soft limit, not a hard; it’s part of the convention that it’s reserved, but if users disregard that, they can use ‘_’ with their own attributes. Whatever convention we decide upon can be phrased as guidance, without necessarily breaking extant datasets.
- deciding
- JM: https://app.rankedvote.co/decisions/29018/Zarr-v3-prefix/30274/vote ?
- JS: compatibility extension for v2 is valid or not
- JM: could see `zarr.json, zarr.chunks, zarr.extensions`
- **no root group**: https://github.com/zarr-developers/zarr-specs/issues/192
- explicit groups can have transformers
- do we always need to have a zarr.json for an array
- if an array has none, how do we open it?
- search up the hierarchy? too inefficient? (see the sketch after these minutes)
- JMS: with a group-level transformer you can't make any assumptions about what's underneath it (e.g. a redirect)
- JMS: without searching, the group with the transform will become a "root"
- RA: need URL syntax for group#path if we have transformers
- JM: still need search up for desktop clients. no URL syntax for searching down
- RA: formalize "**entrypoint**" what can and cannot be opened
- shouldn't be able to open a chunk (and figure out that it's part of an array)
- JS: alternatively, you MUST have a zarr.json
- RA: like that one. for an entry point, there should be a zarr.json
- That should be the **one** entrypoint definition.
- DH: defines a leaf, everything below doesn't exist externally.
- RA: would help to look at hierarchies. (too abstract)
- DH: a bit like posix mountpoints? driver is responsible for interpretation
- URL syntax: https://github.com/zarr-developers/zarr-specs/issues/132
- little bit via the previous conversation
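A sketch of the "search up the hierarchy" option from the no-root-group discussion above: given a deep path, walk toward the root until a zarr.json is found. The alternative on the table -- every openable node MUST have its own zarr.json -- makes the loop unnecessary. The store layout here is hypothetical.
```python
from pathlib import PurePosixPath

def find_entrypoint(keys: set[str], path: str) -> str | None:
    """Return the nearest ancestor of `path` (or `path` itself) with a zarr.json."""
    p = PurePosixPath(path)
    for candidate in (p, *p.parents):
        if str(candidate / "zarr.json") in keys:
            return str(candidate)
    return None

keys = {"root/zarr.json", "root/group/array/zarr.json"}
assert find_entrypoint(keys, "root/group/array/c/0/0") == "root/group/array"
assert find_entrypoint(keys, "root/group") == "root"  # implicit group: search up
```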
## 2022-12-15
**Attending:** Sanket Verma (SV), Josh Moore (JM), Ward Fisher (WF), Ryan Abernathey (RA), Jonathan Striebel (JS), Jeremy Maitin-Shepard (JMS), John Kirkham (JK)
**TL;DR:** The meeting started with a discussion on some pending issues regarding V3. Then, we opened the [ZEP1 project board](https://github.com/orgs/zarr-developers/projects/2) and went through the issues individually to decide their conclusion. As a result, consensus on some issues was achieved, while others are yet to be discussed in successive ZEP meetings.
The ZEP0001 has gone into feature freeze, as mentioned in the blog post [here](https://zarr.dev/blog/zep1-update/), and from now on, the community, ZSC and ZIC will be working on integrating and resolving existing features and issues, respectively.
**Meeting minutes:**
- Discussed with Jonathan on 12/9:
- Adding a `diff` w.r.t. the earlier version of V3
- Include filesystem in ZEP0001
- Sync V3 implementation in `zarr-python` with the recent changes in spec; see - https://github.com/zarr-developers/zarr-python/issues/1290
- https://github.com/orgs/zarr-developers/projects/2
- No issues added after the 19th
- All need to be solved by the vote
- Migrated NaN issue to zarr-specs
- dropping /meta prefix
- RA: make clear (at some spec locations) that iterative listing is necessary (see the sketch after these minutes)
- also make more use of async calls
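A sketch of what "iterative listing" means at the API level, per RA's point above: consume paginated listings incrementally (and asynchronously) rather than materializing one giant list. `fetch_page` stands in for a store's paginated list call and is hypothetical.
```python
from typing import AsyncIterator, Awaitable, Callable

# fetch_page(prefix, token) -> (keys, next_token or None), S3-style pagination.
PageFetcher = Callable[[str, str | None], Awaitable[tuple[list[str], str | None]]]

async def iter_keys(fetch_page: PageFetcher, prefix: str) -> AsyncIterator[str]:
    """Yield keys under `prefix` one page at a time."""
    token: str | None = None
    while True:
        keys, token = await fetch_page(prefix, token)
        for key in keys:
            yield key
        if token is None:
            break
```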
## 2022-12-01
**Attending:** Sanket Verma (SV), Josh Moore (JM), Jonathan Striebel (JS), Jeremy Maitin-Shepard (JMS), Ward Fisher (WF), Dennis Heimbigner (DH), Ryan Abernathey (RA)
**TL;DR:** RA started a discussion on dropping the `/meta` prefix (see the issue [here](https://github.com/zarr-developers/zarr-specs/issues/177)), which led to a chain reaction of several related conversations, mostly around lingering issues in the finalisation of the Zarr V3 spec.
RA, JMS and JS took some action items, which can be seen at the bottom.
**Updates:**
- Conversations (issues and feedback) on ZEP1 [PR](https://github.com/zarr-developers/zarr-specs/pull/149) are now resolved. Check [this](https://github.com/zarr-developers/zarr-specs/pull/149#issuecomment-1327605570). Thanks to Jonathan Striebel! 🙌🏻
- The conversations which need additional input have been moved to separate issues
- Jeremy Maitin-Shepard was promoted to be one of the authors of [ZEP0001](https://zarr.dev/zeps/draft/ZEP0001.html)
- Current status of ZEP1 can be viewed [here](https://github.com/orgs/zarr-developers/projects/2)
**Meeting minutes:**
- RA: suggest focusing on the meta/ prefix discussion
- https://github.com/zarr-developers/zarr-specs/issues/177
- JMS: not sure it's solving a problem (optimally). nice feature of v2 is copying out an array
- JS: was for performance, use exclusion mechanism
- RA: never need to list chunks (even if implementations do...)
- NB: don't like trying to open files to know things (404-based)
- JM: so we all agree? Yes. But what's the default?
- RA: suggest: drop meta, use .json on the array
- Can then drop the root metadata?
- DH: there is dataset-level metadata (superblock)
- JS: discuss those separately?
- Agreed
- JS: so to that suggestion, how do you list all metadata?
- RA: don't think we should plan for discovering all metadata (millions of arrays)
- JS:
- RA: listing recursively isn't ok?
- JS: not with implicit groups
- RA: use storage transformer to get the previous behavior
- RA: data is out there so need to provide a mechanism
- DH: don't think that's fair
- RA: nice feature of this proposal if we could keep it.
- JM: how is consolidated metadata related?
- RA: that's another problem.
- RA: had thought about explicitly listing the children (stac catalog)
- DH: nczarr does that as well.
- RA: downside is the concurrency issue
- JS: good extension for groups (listing children)
- JS: but could also have consolidated per group
- JM: different communities here. some are definitely asking for listability
- DH: not lots of formats that are listable without tools. They are asking for something powerful.
- RA: so that's the root feature of the separate hierarchies
- RA: we should look at some data. offer to write a script
- JS: one other alternative: chunks in an extra subfolder
- JMS: k/v versus directory based will have different performance behavior
- RA: does this require us to give up on implicit groups?
- summary: foo/array.json and foo/chunks/0/0 (both layouts are written out after these minutes)
- JMS: re: dropping of root metadata
- solution is perhaps that the storage transformer would need to write something in the _array_ metadata
- JS: but then everything is in the array metadata and you'd need to be able to read it
- JMS: don't _always_ need to access the metadata directly. more like a safety measure.
- JMS: could copy "global" metadata into each array
- JM: that works toward the design goal of being able to freely copy an array
- DH: that assumes no global extensions.
- RA: need to think through this separately:
- portability ("invariance" of v2 that any group is standalone)
- RA: danger is that you can build zarrs through the command-line (need no library)
- DH: sure you want to do that?
- RA: it was at least part of the design of V2
- DH: as a principle, if that's what you want it's really critical to the v3 process
- DH: cf. intellij -- fiddling in text file then going back (bypassing the tool)
- JM: perhaps CLI manipulations are inherently "extension-less" and therefore this is "safe"
- DH: one tool being a verification tool?
- JS: consolidated metadata as an example
- RA: primarily used as a way to allow easy listing
- but you know that you can't touch the store.
- JS: you need a root to do fancy things
- **action items:**
- RA to do some performance benchmarking
- JMS to propose a new storage layout for v3
- next time: root metadata discussion issue (JS)
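For reference, the two store layouts under discussion, written out as example keys (paths are illustrative, not normative):
```python
v3_draft_with_meta_prefix = [
    "meta/root.group.json",      # all metadata gathered under meta/
    "meta/root/foo.array.json",
    "data/root/foo/c0/0",        # all chunks gathered under data/
]
proposed_after_dropping_meta = [
    "foo/array.json",            # metadata sits next to the node it describes
    "foo/chunks/0/0",            # chunks in a subfolder, per the summary above
]
```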
## 2022-11-17
**Attending:** Sanket Verma (SV)...Jonathan Striebel (JS), Ryan Abernathey (RA), Ward Fisher (WF)
**TL;DR:** Apparently there was a snafu where JS, RA and WF joined a Zoom meeting whereas SV joined another one! 🥲
In the meeting there was a discussion on the V3 spec and some of its missing parts. Also, RA opened a PR on global transformers, which can be seen [here](https://github.com/zarr-developers/zarr-specs/pull/182).
**Updates:**
- ZEP1 Update, see [here](https://gitter.im/zarr-developers/community?at=6374fae6f9491f62c9b7ea61)
- Check out the ZEP1 GH Project board [here](https://github.com/orgs/zarr-developers/projects/2/views/2); maintained by Jonathan Striebel
**Meeting minutes:**
Same as TL;DR. 👆🏻
## 2022-11-03
**Attending:** Josh Moore (JM), Jonathan Striebel (JS), Jeremy Maitin-Shepard (JMS), Sanket Verma (SV)
**TL;DR:** Discussions were held on how to move forward with ZEP1 quickly. The summary can be viewed [here](https://github.com/zarr-developers/zarr-specs/pull/149#issuecomment-1302440391). Then the attendees discussed extensions in V3, and JMS is considering trying with non-zero origin. SV joined the meeting after 30 mins. After that, JS mentioned some high-level issues looming around V3 spec.
**Meeting Minutes:**
* JMS: number of PRs that could be merged into the working draft
- JS: don't want to just close it
- JM: can we cross link e.g. JMS' PR? Yes.
- ==> once all are cross-linked, close the PR.
* JS: when to merge?
- JM: when it matches the consensus?
- JS: ok, but don't have merge rights.
- ==> Let's merge proactively.
* see: https://github.com/zarr-developers/zarr-specs/pull/149#issuecomment-1302440391
* extensions
- JMS: thinking of trying with non-0-origin
- JM: think that's a general principle we should try for all issues/PR is "could it be an extension"
- JMS: thinking of extensions as plugins? Not exactly.
- JS: how to influence whether an implementation adopts an extension? if there's a concrete implementation / clear interface
- JMS: agreed and some obvious ones (codecs) but not clear there will be a broader abstraction
- JS: "index transformer" _perhaps_
- or as transformer _if_ multiple of chunking
- JMS: unfortunate limitation
- JMS: re: transformers - it doesn't make sense to compose a different storage transform _before_ sharding
- JS: depends. cache of chunks or shards? also checksum
- JM: codec is similar
- JMS: caching enabled in code, but not in zarr metadata
- JS: that's in spec, yes. "runtime-only" but still before or after
- JMS: when implementing sharding, would check if it's first
- want to be able to tell the user "this is the granularity to write"
- JS: good that it's flexible. like c/f order.
- JS: mention in implementation "sharding must be first" (a validation sketch appears after these minutes)
- JMS: composing makes for useful extension point
- JS: most important point: are we sure enough about the extension points?
- _Sanket joins_ 🧑🏻💻
- Jonathan: high-level issues looming
- paths/URL discussion (needs an issue)
- global transformers
- variable chunk length (possibly origin offset)
- indexing more abstract
- upgrade path! (`{“extension”: [“@v2-layout”]}`)
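A sketch of the "sharding must be first" check JMS describes above, so an implementation can fail fast and tell the user the write granularity; the transformer encoding is hypothetical.
```python
def validate_storage_transformers(transformers: list[dict]) -> None:
    """Reject configurations where sharding is not the first transformer."""
    for i, t in enumerate(transformers):
        if t.get("type") == "sharding" and i != 0:
            raise ValueError(
                f"sharding must be the first storage transformer, found at {i}"
            )

validate_storage_transformers([{"type": "sharding", "chunks_per_shard": [32, 32]}])
try:
    validate_storage_transformers([{"type": "cache"}, {"type": "sharding"}])
except ValueError as err:
    print(err)  # sharding must be the first storage transformer, found at 1
```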
## 2022-10-20
**Attending:** Ward Fisher (WF), John Kirkham (JK), Jeremy Maitin-Shepard (JMS)
**TL;DR:** WF is working on the maintenance NetCDF release candidate (`v4.9.1-rc1`), and JMS added CMake support to TensorStore. After this, JMS initiated a discussion on path structure, which stretched over the remainder of the meeting.
**Updates:**
- (WF) Working on maintenance netcdf release candidate (`v4.9.1-rc1`). No new features, just bug fixes and improvements.
- (JMS) Added CMake support to TensorStore
- Discussion about CMake, dependency management
- https://cmake.org/cmake/help/book/mastering-cmake/chapter/CDash.html
- https://github.com/cpm-cmake/CPM.cmake
**Meeting Minutes:**
* (JMS) Path structure
* Require or encourage root directory to end in .zarr
* How to name all the metadata files?
* Root metadata could contain extension information
* (JK) Mentioned `.zmeta` metadata file with paths to metadata file
* (JMS) About listing
* (WF) Possible issues with writing
* (WF) Spec vs. library tension
* (JK) Have file expire?
* (JMS) Handle as read-only
* (JK) Could also delete as part of writing?
* (JMS) HDF5 has a hierarchy and Zarr replicates this
* Have some array and non-array data next to each other
* (JK) Examples?
* (JMS) Segmentations & mesh representations
* (JMS) Collection of volumes with annotations related to them
* (WF) Have Zarr hierarchy with non-Zarr?
* (JMS) Only have single individual arrays
* (WF) Wouldn't have considered this structure
* (WF) Does there need to be something in the spec about interleaving data?
* (WF) Maybe interleaving poses some challenges
* (JMS) Doesn't NetCDF have extra files as well?
* (WF) Yes. Extra metadata used to map Zarr model to NetCDF model.
* (JMS) Reason to use this structure as opposed to Zarr metadata files?
* (WF) NetCDF supports different formats HDF5, Zarr, etc.
* (JMS) Have user defined attributes. Types are stored in metadata file? Could those be in zattrs?
* (WF) Yes. Not sure
* (JMS) Hierarchy becomes more apparent with V3 as opposed to V2
* (WF) Groups were a new feature that users were slow to pick up on
* (JK) Does adding more top-level metadata cause issues?
* (JMS) Could it contain the metadata?
* (WF) Maybe include subset of metadata
* (WF/JMS) Perhaps special case single array use case
* (JK) How does data relate in non-hierarchical form
* (JMS) Related, but not all Zarr data
* (JK) Would other kinds of chunk formats (standardizing on kerchunk) be useful
* (JMS) Meshes probably don't make sense in this way
* (JMS) Neuroglancer meshes are a good example
* (JMS) Sparse arrays seem similar in that they might be better handled by being their own file format
* (WF) NetCDF users mention performance issues in moving to new version. Usually suggest using old NetCDF. Maybe same with V2/V3?
* (JMS) Want to use V3 (sharding being of value).
* (JK) Including unstructured binary blobs in Zarr?
* (JMS) Has a group of files for mesh
* (JK) Maybe ignore specific paths?
* (WF) Having mixed media is valuable though can be logistically tricky
* (WF) What defines Zarr as a data model? At least need to say some behavior is undefined (mixed media). Ideally ignores mixed media files.
## 2022-10-06
**Attending:** Ward Fisher (WF), Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Greg Lee (GL), Jonathan Striebel (JS)
**TL;DR:** JM shared that there were some good conversations around OME-Zarr yesterday. The summary is available [here](https://forum.image.sc/t/ome-ngff-community-call-transforms-and-tables/71792/10). WF shared that Kitware is looking for partners and a link to the sign-up form. GL shared that during the CZI Open-Science Summit 2022, he worked on writing tests for Xarray. After this, there was an extensive discussion on URL syntax initiated by JMS.
**Updates:**
* miscellaneous reading before the meeting (JM)
- https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/
- https://github.com/kaitai-io/kaitai_struct/issues/125
* NGFF (JM)
- https://forum.image.sc/t/ome-ngff-community-call-transforms-and-tables/71792/10
- Good conversations around OME-Zarr yesterday
* Enthusiasm for Kitware (WF)
- Looking for partners. [Have form on webpage](https://www.kitware.com/contact/project/).
- Unidata an option. They've mentioned Zarr a couple of times (Kitware Blog).
* xarray test (GL)
- during czi conference.
- release of 2.13 hopefully fixed it all :tada:
**Meeting Minutes:**
* URL syntax? (JMS)
- helps to figure out the metadata location.
- Josh: great idea. have several ongoing discussions at the NGFF level
- current proposal would be to support URIs internally (relative, absolute, remote)
- however, in V2
- JMS: in v3 the root exists
- though not entirely clear that the new metadata organization is necessary
- designed for S3 where there's no directory, but other problems exist
- Josh: _summarized previous discussions for Greg_
- GL, thoughts on the V3 situation?
- GL: at the moment, you need helper methods to do that.
- JM: one proposal was to have the metadata be the main directory which lets you then bootstrap the chunk loading
- JMS: support multiple?
- JM: conceivably. as extension or configuration.
- JM: downside for consolidated metadata is that nothing exists in the metadata hierarchy
- workaround of having a thin-hierarchy only with references to where the metadata exists
- JS: losing the ability to nest any hierarchy. (everything is a root)
- JM: are we proposing rolling it back completely
- JMS: problem is the URI+rootpath metadata
- JS: walking up the hierarchy would be an option (URL doesn't actually point)
- JMS: would be nicer if you don't have to perform a search
- Use case
- URL case
- Desktop double click on something
- Similar issue: **Zips** :warning:
- JMS: have an additional level
- JM: except ZipStore v2 assumes the whole zip is a zgroup
- JS: propose zip is a special case which is _easier_
- JMS: unless you are mixing volumetric with a zarr then it wouldn't be at the top-level
- Btree (JMS): need to be able to compose multiple layers (similar to fsspec and double colons)
- Remote chunk store (or point to V2 chunks)
- Renaming folders (keep data with arrays)
- Options
- Keep "/meta", clients must know
- Drop "/meta", direct URLs
- `?param` syntax
- `#param` syntax
- Separator syntax (e.g. "`//`")
- root dir ends in .zarr
- fsspec `::` separator
- multiple protocols (git+ssh, zip+zarr) (see the parsing sketch after these minutes)
- further discussion
- JS: without /meta and .zarr requirement, you still don't know where the root is
- JS: if you drop "/meta" then you can't name anything "/data"
- JMS: could use something more obfuscated
- JS: why split?
- JMS: if you are not using the filesystem (s3 or gcs) and you want to list all the metadata, it's not (as) efficient
- JM: "data" could be registered in the metadata so it's a known (and configurable) thing
- WF: in NC, anything with a leading underscore is assumed reserved for the library
- permitted to create them, but the spec says "please don't"
- JM: `.z` prefix
- WF: utilities and tools can scrape everything with that
- WF: also don't have to put too much thought into new features
- JMS: would prefer not a `.` prefix because of archiving tools, etc.
- then `_z`?
- JMS: root metadata file doesn't really do anything
- JS: creates ambiguity
- JM: think it was largely for bootstrapping global plugins (e.g. transformers)
- JS: perhaps V2 compatibility
- JMS: not clear you would nest sharding with other transformers. it would be the thing applied to the chunks.
- JS: the metadata needs to be somewhere, and for that can be at the array level
- brief summary
- zarrs are essentially a metadata hierarchy
- that configure (possibly remote) chunk stores
- and the root is identified with .zarr
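A parsing sketch for two of the URL-syntax options listed above -- fsspec-style `::` chaining and the `#`-fragment path -- purely illustrative, not a proposal.
```python
from urllib.parse import urlsplit

def split_chained(url: str) -> list[str]:
    """fsspec-style chaining: each '::'-separated segment is one protocol layer."""
    return url.split("::")

layers = split_chained("zip::https://example.com/data.zarr.zip")
assert layers == ["zip", "https://example.com/data.zarr.zip"]

# "#param"-style: the fragment carries the path *inside* the hierarchy.
parts = urlsplit("https://example.com/store.zarr#group/array")
assert parts.fragment == "group/array"
```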
## 2022-09-22
**Attending:** Ward Fisher (WF), Josh Moore (JM), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JMS), Dennis Heimbigner (DH)
**TL;DR:** Consolidated metadata needs an extension for V3, which might result in a new ZEP. Next, JMS shared a document titled ‘Optionally-cooperative distributed b-tree for Tensorstore’, which the participants then discussed. After that, JM initiated the discussion on the codecs registry, which was built by one of the GSoC students this summer. The meeting ended with a discussion on the path to the metadata files.
**Meeting Minutes:**
- Java/NetCDF side:
- JM: Sanket met people
- WF: Unidata should be 3x the staff.
- JM: perhaps starting with a kerchunk implementation?
- WF: looking for more community involvement (like netcdf-c had)
- JM: Greg mentioned consolidated metadata needs an extension for V3
- RA: Iceberg issue, also see JMS' proposal
- https://github.com/zarr-developers/zarr-specs/issues/154
- JMS: touches on not needing a file per chunk (like discussed last night)
- https://docs.google.com/document/d/1PLfyjtCnfJRr-zcWSxKy-gxgHJHSZvJ2y4C3JEHRwkQ/edit?resourcekey=0-o0JdDnC44cJ0FfT8K6U2pw#heading=h.8g8ih69qb0v
- db format that stores a btree.
- uniquely: designed to allow distributed writes (s3, etc.) *but* doesn't need a persistent database
- can also read it in a non-distributed fashion
- downside: adds quite a bit of complexity (especially for a binary format)
- also good where sharding isn't appropriate (e.g. pre-defined shard size which is required for write)
- e.g. large number of small arrays (where sharding won't help)
- RA: nice document. comments:
- focused on big distributed writes, but with iceberg had a different main motivation: more flexibility in mapping keys to chunks. kerchunk-like. virtual concatenation. can you reference random chunks? yes.
- JMS: btree nodes have references to files (like kerchunk). but datafiles are identified with 128-bit path (not an fsspec URL)
- RA: different use case, so can have them be optional transformers/extensions
- RA: really similar to tiledb! why not use it?
- JMS: tiledb is organized by time not space.
- JM: need a compaction
- JMS: and even after that you still have a million files.
- DH: HDF5? internally it's btrees. (which is responsible for most of its complexity). Are you sure this is the path?
- JMS: not sure there's an alternative to btrees. used in databases, filesystems, etc.
- DH: if you don't want some ordered searches, then linear hashes are an alternative
- JMS: ordered is useful for a lot of use cases. but there wasn't an obvious solution for distributed writes
- DH: [extendable hashing](https://en.wikipedia.org/wiki/Extendible_hashing) is an easier data structure (old paper) works well with disk storage.
- JMS: think this is more a key-value store (like zip)
- RA: agreed. Nice that it's possible to experiment like this.
- RA: can the V3 spec support this experimentation? (right extension points?)
- RA: trying to do that with Iceberg. Martin suggested "IceChunk".
- See also: Hudi and others. Lots of smart ideas that we can copy.
- Goal is to provide some level of branching & transactions for/on a Zarr store
- Allow you to work on your staged area which all get written at once.
- Branch non-destructively (or rollback)
- The key is having a "manifest" (they all have some concept of that, even kerchunk)
- Don't depend on the object stores listing as the source of truth
- Need storage transformers at the top level, not array. But for JMS' idea array-level might suffice.
- JMS: wasn't planning on an extension. root metadata would be in the same data store.
- JM: basically writing DB/filesystem :+1: ZarrFS ;)
- JMS: planning on mongo? Yeah, or Dynamo. (They store JSON)
- JSON in S3 isn't ideal.
- metadata in document store and chunks on disk. Beyond just filesystem. It's a data lake.
- "meta-store"
- JMS: regarding versioning, how are you representing the delta?
- The chunk is the minimal writable unit. (out-of-scope)
- Every chunk write is uniquely ID'd (e.g. content-addressable). That gets a key. Write that to the DB.
- JMS: expecting the database to provide the versioning?
- RA: no, just a place for documents. versioning (in iceberg) has a branch or a tag that points to a specific chunk manifest. you can create a new one and point your HEAD at that. only rely on the database to atomically change the references. iceberg tracks a number for the transaction. (a toy version of this appears after these minutes)
- JMS: use kerchunk model? limitation on the number of chunks?
- RA: chunks are likely in a separate manifest. discussed that another extension with Martin.
- RA: but can just query a chunk from the database.
- JMS: 1M chunks in v1. then update to v2. What's the diff? A copy.
- RA: yeah need to play with it.
- JMS: when you get to wanting to update just a portion of it, then you get to b-trees :smile:
- RA: no db guys, trying to keep it hackable.
- RA: but megabyte kerchunk is already getting :heart: since it's so easy. looking for incremental improvement on _that_. (NASA will be pumping out GRIB forever...)
- JMS: looking forward to hearing more and exchanging info re: b-trees
- JMS: see also https://github.com/janelia-flyem/dvid (backed by KV database)
- JM: sharing layers with them?
- JMS: complicated by other priorities of the EM team. invite Bill to the Zarr meetings?
- RA: see https://lakefs.io/
- JM: API versus format
- RA: thinking about it more like an API
- JM: briefly codecs-registry
- https://zarr.dev/codecs-registry/
- https://github.com/zarr-developers/codecs-registry
- JMS: still want a schema per codec. JM: agreed!
- JMS: talks about codecs having URLs.
- would be an annoyance to have different V2 and V3 identifiers.
- e.g. just numeric constants in the JSON that are from the C API
- e.g. shuffle parameter which would be nicer as a string.
- support integer or string for a while (in order to deprecate)
- JM: have plans to have code in each language that checks for an id from the central registry
- DH: approx. that with nczarr. ncdump lists the actual codecs in the file
- would be good to have something more sophisticated
- have the disadvantage of C code and interpreted files
- 3 repositories on the C side. unidata + irvine + hdf5
- hdf5 only has names, hdf5-ids and a pointer (which is often out of date)
- something universal would be nice
- WF: roping in the HDF5 group would be a heavy lift
- JMS: **URL interface** :rocket:
- DH: :+1: for the REST API
- WF: NSF/CSSI solicitation has opened
- https://beta.nsf.gov/funding/opportunities/cyberinfrastructure-sustained-scientific-innovation-cssi
- perhaps something here
- WF: planning on getting to https://www.egu23.eu/
- tweet something from zarr_dev to see if there is interest :question:
- could collaborate something re: nc/zarr
- JMS: don't have clear resolution on the paths to the metadata files
- JM: re-capped the previous discussion and think it's still good.
- JMS: some details around the root array (the named files, etc.)
- JMS: consolidated metadata? duplicated?
- JM: would make it possible to have everything in the top-level
- JMS: pointers in the subdirectories? bit annoying.
- JMS: with iceberg & co. you likely don't need a consolidated metadata
- JM: so you'd push it to the store level?
- JMS: possibly, but not that simple
- JMS: there are cases where you need path separation anyway (Zips)
- JMS: so could see using a path separation strategy entirely
- JMS: Davis did have a use case ...
- (...details zip, consolidated brainstorming...)
- JM: need both solutions...
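A toy version of the manifest/versioning scheme RA sketches above: chunk writes are content-addressed, a manifest maps logical chunk coordinates to those keys, and a branch ref points at a manifest, so only the ref update needs to be atomic. Entirely illustrative; not Iceberg or "IceChunk" code.
```python
import hashlib

objects: dict[str, bytes] = {}    # immutable, content-addressed chunk store
manifests: dict[str, dict] = {}   # manifest id -> {chunk coord: chunk key}
refs: dict[str, str] = {}         # branch name -> manifest id

def write_chunk(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    objects[key] = data           # idempotent: identical bytes, identical key
    return key

def commit(branch: str, updates: dict[tuple, bytes]) -> None:
    base = manifests.get(refs.get(branch, ""), {})
    new = {**base, **{c: write_chunk(d) for c, d in updates.items()}}
    mid = hashlib.sha256(repr(sorted(new.items())).encode()).hexdigest()
    manifests[mid] = new
    refs[branch] = mid            # the single atomic pointer swap

commit("main", {(0, 0): b"v1"})
v1 = refs["main"]
commit("main", {(0, 0): b"v2"})   # old manifest and chunk remain readable
assert manifests[v1][(0, 0)] != manifests[refs["main"]][(0, 0)]
```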
## 2022-09-08
**Attending:** Sanket Verma (SV), Josh Moore (JM), Ward Fisher (WF), Jonathan Striebel (JS), Norman Rzepka (NR), Ryan Abernathey (RA), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS)
**TL;DR:** This was the first ZEP meeting ever. Representatives from the [Zarr Implementations Council](https://github.com/zarr-developers/governance/blob/main/GOVERNANCE.md#zarr-implementation-council-zic) joined the meeting. Most discussions revolved around [ZEP1](https://zarr.dev/zeps/draft/ZEP0001.html). One of the critical decisions about ZEP1 was to accept it ‘[Provisionally](https://zarr.dev/zeps/active/ZEP0000.html#review-and-resolution)’ and move forward with the implementation in various programming languages.
**Updates:**
- SV: open ("draft") ZEPs:
- https://zarr.dev/zeps/draft/ZEP0001.html
- https://zarr.dev/zeps/draft/ZEP0002.html
- SV: Author discussion on all the comments on ZEP0001
- Have proposed resolutions for a number of those
- Meeting again tomorrow to try to finish the list
- https://hackmd.io/sOos8rxrRvKCJPbbUKWtwA?view
- SV: Critically propose marking ZEP0001 as "provisionally accepted" after the above are handled and passed by the ZIC
- Implementations are free (and encouraged!) to start implementing.
- Any blocking changes could still be handled.
- Otherwise, "feature freeze".
- Feedback?
- RA: process to get to provisionally accepted
- SV: draft == under review.
- on vote, can move to provisionally or accepted state.
- once implemented, moves to final.
- could move to "deferred" state if the ZIC vetoes
- WF: "ready to implement" jumped out (and caused anxiety but only since there's too much to do)
- https://zarr.dev/zeps/active/ZEP0000.html#review-and-resolution
- JMS: no substantial changes since early draft
- JM: editors are preparing a rebuttal (Alistair's paper model)
- JMS: not sure a paper model is best
- RA: not in the sense that there's only one round and someone will decide. iterative
- good to have authors who are organizing.
- now in revision and we can continue until everyone is happy
- gone slowly for various reasons (availability, summer, and it's our first time & massive)
- would be useful to go through the outstanding issues
- JS: in this cycle it's not limited iterations, just limited time.
- but for now, trying to make batched changes
**Agenda:**
- JS: **review of memory order decision number-16 from list**
- zarr's goal is interoperability. therefore propose to keep C & F (benefit for community)
- could support read only, even with a transpose (if too slow, add a warning?)
- JMS: agree. but would like an arbitrary permutation.
- DH: good use case?
- JMS: dimension that represents time. order you display to the user is logical for them but need not be logical for compression/access patterns.
- JM/JS: core or extension?
- RA: that's a key question
- NR: re: backwards compatibility C/F is in V2 therefore that would need to be in core. but arbitrary could be an extension.
- RA: but v3 is a chance to break backwards compatibility (explicitly not a goal)
- NR: upgrade path? so be able to upgrade without re-writing the chunks.
- RA: v2 will still be supported.
- WF: that would be the hope, but worry about netcdf & archival -- assuming software will support it without it being expressed somewhere. aspirational sure but makes us nervous.
- e.g. will future software implement the v2 standard?
- RA: transform based solution? (but only if we support F) **if** we say the chunks should be backwards compatible.
- WF/DH: no one has ever asked for arbitrary. Someone at NOAA asks for things that would help their lab. Technical debt. (Won't even submit a pull request.) See the trap that the HDF group fell into (single-writer-multiple-reader; several orders of magnitude that they are trying to recover from.)
- JMS: arbitrary seems most natural. pass to `numpy.transpose()`
- WF: shocked at the assertion that there _wouldn't_ be a migration path
- JM: clarification -- we're only differentiating whether a _binary_ transformation is needed
- **can add _requirement_ to v3 that implementations read v2**
- WF: requirement of netcdf. can decide if that's a requirement.
- DH: depends if it's a lot. operational definition - "too painful to copy v2 to v3"
- (for RA): petabytes of data
- JS: RA proposed transformer strategy - essentially rewriting metadata **formalize it?**
- DH: how comfortable are you not supporting older version?
- JM: for OME, got agreement but that's a layer higher
- DH: will there be new implementations without V3 support?
- NR: think there will be
- JM: but it's so easy to implement
- WF: people won't do that...what do we do if a popular implementation doesn't support v2?
- other packages?
- RA: recommend storage layer / translation?
- JM: agreed but that's SHOULD (versus MUST)
- JMS: only way to force it is a standardization
- JM: agreed, but we can only do what the spec document allows us (i.e. labeling something as "compliant")
- JS: it's a new major version and people know what we mean. (as a user, I wouldn't expect support for v2 if an implementation says "v3")
- WF: convinced myself I'm worrying too much instead:
- WF: in 18 months how do you know which Zarr is used to open it.
- JM: metadata file is different (essentially the magic number). The proposal for `.zr3` was turned down for now.
- SV: [data type naming](https://github.com/zarr-developers/zarr-specs/pull/149#discussion_r929140806)
- JM: dropping the python-ness
- JMS: helps provide a more natural scheme for some datatypes (and endianness as a codec) (see the mapping sketch at the end of these minutes)
- no argument against (just "convenient in Python")
- JM: will need names
- JMS: in https://github.com/zarr-developers/zarr-specs/pull/155
- DH: netcdf ncchar type equivalent to 8-bit ascii, no equivalent in Zarr. Needed? NC uses it all the time. Why not in numpy?
- JMS: thought numpy has char.
- RA: revisit char question? JMS: different than varstring
- RA: where does the encoding go? DH: in an attribute. "ascii" (or "utf-8")
- RA: used for? DH: you see a lot of flags stored that way.
- also historical: NC-3 didn't have strings of any type. (arrays of chars workaround)
- JM: extension mechanism?
- DH: where the wheel hits the road
- JMS: just metadata?
- RA: disagree, influences ...
- JMS: agreed, but doesn't change how hard it is to implement
- JM: but need to feel confident that they are low cost so we can change *when* we discuss these things
- JMS: will changes start appearing?
- JM:
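A sketch of the data-type renaming direction above: v2 uses numpy-style strings that bundle endianness into the name ("<f4"), while the v3 direction is plain names ("float32") with byte order expressed separately (e.g. as a codec). The mapping below is illustrative, not the normative table.
```python
# (v2 dtype string) -> (v3-style name, byte order to be handled elsewhere)
V2_TO_V3 = {
    "<f4": ("float32", "little"),
    ">f4": ("float32", "big"),
    "<i8": ("int64", "little"),
    "|u1": ("uint8", None),   # single-byte types have no endianness
}

assert V2_TO_V3["<f4"] == ("float32", "little")
```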