tags: zarr, Meeting
# Zarr Bi-weekly Community Call
### **Check out the website: https://zarr.dev/community-calls/**
Joining instructions: [https://zoom.us/j/300670033 (password: 558943)](https://zoom.us/j/300670033?pwd=OFhjV0FHQmhHK2FYbGFRVnBPMVNJdz09#success)
GitHub repo: https://github.com/zarr-developers/community-calls
Previous notes: https://j.mp/zarr-community-1
**Attending:** Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Ahmet Can Solak (AS), Sanket & the Do-a-thon, Martin Durant (MD), Davis Bennett (DB)
**TL;DR:** With a new attendee joining from the CZI Open Science Summit, we took a deep dive into the best way to capture data directly from microscopes, comparing the pros and cons of Zarr/HDF5/Zip and more. Additionally, we worked through remotely visualizing a Zarr array created on a cluster from a Jupyter notebook.
- Sanket at CZI/NumFOCUS Summits
- Coming to San Fran next week, lunch!
**Open Agenda (add here 👇🏻):**
- Ahmet: BioHub
- Collaborators interested in Java implementation
- Need a good implementation
- ImageJ / BDV (folks at Janelia)
- V3: collaborators to help read it
- JMS: explicit opt-in for V3 (need to know _a priori_)
- Though auto-detection could be added
- neuroglancer likely has a stronger case for auto-detection
- AS: happy tensorstore users. Thanks a lot! :star:
- resize manually? more internal with a skinnier API
- JMS: assume things within old bounds are old?
- AS: perhaps request chunks (from last savepoint) more compute heavy
- keyword argument?
- MD: "don't bother writing where there's no new data"
- JM: see related https://github.com/zarr-developers/zarr-python/issues/1017
- JMS/MD: use selection to fill in the new bits
- AS: `append()` is only for one axis. This might be for arbitrary axes.
- perhaps `append_chunks()`
- use case
- instruments generating lots of data quickly.
- don't want to resize if not necessary. with fewer methods if possible.
- most efficient way?
- of course, better to know exact size.
- MD: just have the size much larger and have missing chunks?
- AS: only if we know when the biologists will stop
- Clarification: doesn't write the empty chunks
- MD: do edge chunks need special handling?
- JMS: no. always write the full chunk.
- (not in N5, and didn't implement in tensorstore)
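A minimal sketch (zarr-python v2 API; path, shapes, and dtype are made up) of the resize-then-fill pattern discussed above, growing only the time axis and writing only the new region:

```python
import numpy as np
import zarr

# Create a growable acquisition array; only the first (time) axis grows.
arr = zarr.open(
    "acquisition.zarr",
    mode="w",
    shape=(0, 512, 512),
    chunks=(1, 512, 512),
    dtype="u2",
)

def add_timepoint(frame: np.ndarray) -> None:
    """Grow the time axis by one and write only the newly added region."""
    t = arr.shape[0]
    arr.resize(t + 1, *arr.shape[1:])  # metadata-only change; no chunks rewritten
    arr[t] = frame                     # only the new chunks are written

add_timepoint(np.zeros((512, 512), dtype="u2"))
```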
- DB: wouldn't suggest having everything in one array
- 1 array per timepoint (doesn't work for NGFF)
- growable arrays
- or use HDF5 for the acquisition
- AS: why? faster than zarr-python. but tensorstore? Don't know.
- JM: let's do that benchmark
- DB: Windows doesn't like lots of small files
- MD: could write Zarr into Zip with no compression (basically what HDF5 does)
- DB: save data in the way that's most effective for the acquisition
- Zarr as a great format after that
- AS: that's what we were doing previously. but additional time adds up. people want the results faster. was asked to add ZarrWriter in the acquisition package. Can then easily transfer to data storage.
- DB: easier to transfer than HDF5? No, than the raw files. Compression is a benefit.
- AS: set chunk size bigger rather than using HDF5
- JM: per camera. but can't compress chunks.
- HDF5 can compress in parallel but not write in parallel
- JMS: eventually all use cases of HDF5 but not there yet
- granularity at which you can read and write
- AS: re-chunking is faster than converting camera offline
- AS: with two cameras we don't try to write to the same array with both, but to multiple places
- JM: zip support in tensorstore? JMS: not yet
- JMS: also thought about LMDB. single file. pretty efficient.
- zip e.g. doesn't support deleting.
- also only has one directory structure
- MD: HDF also has that problem.
- DB: re-writing isn't a problem for acquisition.
- JMS: do need to checkpoint the zip directory periodically.
- AS: saving single-array per timepoint, then zip might work quite well.
- converting to zip zarr showed somewhat worse performance. not sure where.
- MD: make sure the zip isn't compressed.
- JM: need Zip spec
- DB: would love to hear where this goes
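A minimal sketch of the uncompressed-zip idea above (zarr-python's `ZipStore`; path and shapes are made up) — the zip is only a container, so chunk compression stays with the Zarr codec:

```python
import zipfile
import zarr
from zarr.storage import ZipStore

# Store the whole hierarchy in a single uncompressed zip file.
store = ZipStore("acquisition.zip", mode="w", compression=zipfile.ZIP_STORED)
root = zarr.group(store=store)
frames = root.create_dataset(
    "frames", shape=(10, 512, 512), chunks=(1, 512, 512), dtype="u2"
)
frames[0] = 1
store.close()  # flushes the zip central directory
```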
- MD: **inverse problem**
- massive HDF5 files in tar file on S3 for the purpose of multi-file dataset
- desire to distribute them as individual files
- 20G tar containing HDF5
- Kerchunk's job was to point to these files within the tar
- or "find all the chunks in all of the files"
- works nicely!
- fetches are short but there are many of them.
- had to download it (for scanning) but don't want users to have to do that.
- i.e. if you push for a single file, perhaps you can get the best of both worlds.
- DB: lambda function? probably. (but this was custom S3)
- JM: need Java implementation of Kerchunk (for BDV)
- DB: generate from json-schema
- AS: with kerchunk can you point to your data centers...
- MD: each chunk is a key but is a URL
- JM: `"chunk-name"URL, offset, length)`
- JMS: can get the correct endpoint for a chunk
- add s3 syntax
- IPFS, mutable hashes, ...
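A hedged sketch of what such a kerchunk-style reference set looks like when read through fsspec's `reference` filesystem; the bucket, offsets, and lengths below are invented for illustration:

```python
import fsspec
import zarr

# Each chunk key maps to (URL, offset, length) inside some remote file.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "data/.zarray": (
        '{"zarr_format": 2, "shape": [2, 10], "chunks": [1, 10], "dtype": "<f8",'
        ' "compressor": null, "filters": null, "fill_value": null, "order": "C"}'
    ),
    "data/0.0": ["s3://bucket/archive.tar", 512, 80],
    "data/1.0": ["s3://bucket/archive.tar", 4608, 80],
}

fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3",
                       remote_options={"anon": True})
arr = zarr.open(fs.get_mapper(""), mode="r")["data"]
```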
- DB: interesting workflow. any help?
- couldn't get napari on cluster over VDI
- transforming images and saving them as zarr.
- starting static server and pointing neuroglancer at it.
- would prefer to do things programmatically in neuroglancer and it spits out a URL
- also convenient to have static file server as background process from main python (notebook)
- JMS: definitely convenient and it's "just a web server"
- DB: don't save that to disk? dask arrays in memory?
- JMS: neuroglancer-py does have a way to share numpy array or tensorstore object
- Socket based? Internally starting a web server.
- DB: and if it gets updated? does it block? No, background thread
- There is a method to invalidate the cache.
- Python API for making URLs? Yes.
- Could be attractive to people (Janelia) for when computing on the cluster
- JM: See also Wei's imjoy-rpc for the usability
- JMS: works as iframe in jupyter now (DB: desirable)
- JMS: possibly using jupyter protocols would work around firewall
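A minimal sketch of the background static-file-server approach mentioned above (standard library only; the path and port are placeholders, and pointing neuroglancer at it would additionally require CORS headers):

```python
import functools
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve a Zarr directory from a background thread in the same notebook process.
handler = functools.partial(SimpleHTTPRequestHandler, directory="/path/to/data.zarr")
server = HTTPServer(("127.0.0.1", 8000), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print("serving at http://127.0.0.1:8000/")
```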
**Attending:** Sanket Verma (SV), Josh Moore (JM), Davis Bennett (DB), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Hailey Johnson (HJ), Brianna Pagán (BP), Isaac Virshup (IV)
**TL;DR:** *Will be completed after the meeting.*
- ZEP meetings will take place bi-weekly on Thursdays @ `21:30 IST/18:00 CEST/17:00 BST/12:00 EDT/9:00 PDT`
- Instructions: https://zarr.dev/community-calls/
- More focused on the spec than these meetings
- Check out our new illustrations here: https://github.com/zarr-developers/zarr-illustrations-falk-2022
- More ideas welcome!
- `copy` button for code snippets in Zarr documentation, check here: https://zarr--1124.org.readthedocs.build/en/1124/ @ Altay Sansal [#PR1124](https://github.com/zarr-developers/zarr-python/pull/1124)
- Approaching end of GSOC (12th of Sep)
- Looking to participate in Outreachy (https://outreachy.org/)
- New potential users & developers
**Open Agenda (add here 👇🏻):**
- JMS: plan for resolving v3 spec?
- SV: more on this tomorrow but some progress looking at open issues
- Upcoming work
- JM: proposal to have ZEP0001 moved to a "provisional ZEP" state (only blockers allowed)
- JMS: idea is no spec discussion at this meeting? SV: no, but we'll communicate back and forth
- SV: updates on Brianna/Hailiang's ZEP? BP: Not yet. SV: also welcome to join tomorrow
- IV: zarrR?
- JM: idea of having a hierarchy that builds a virtual n-dim array
- DB: adds brittleness. would say no.
- IV: kind of like kerchunk but with more indirection
- JMS: sometimes have use cases.
- stacks of images that you want to view as an array, or multiple images acquired separately.
- Do have stack driver in tensorstore (with specified origins. No stored representation)
- DB: similar problem when acquired in HDF. wrote own layer.
- JMS: should perhaps be a layer higher than zarr.
- DB: for bioimaging, if your app depends on this then you can only open HDF and Zarr and not other stuff.
- doesn't need to be compiled code. API problem.
- NaN/inf/other special values in user-defined attributes: https://github.com/zarr-developers/zarr-specs/issues/141
- zarr-python supports by encoding in non-JSON-compliant way
- DB: nothing that can be stored as data should be impossible to store as an attribute
- DH: was dealing with this recently. found "NaN" (quoted string) in existing datasets, expecting it to be treated as such. added support to nczarr (as well as unquoted versions)
- JM: will likely need a deprecation/warning/error cycle (royal pain)
- IV: keep JSON and use them as special values? nice that it is all just JSON.
- DH: nczarr (netcdf API) got ahead of this because typing is stored for attributes ("double"). possibility for v3?
- JMS: good point. perhaps decide the model for attributes in v3 (i.e. proposed change to v3 spec)
- JM: will need an upgrade path
- DB: haven't seen untyped attributes, but just that JSON is missing values
- DH: so extend constants that are definable
- JM: BSON?
- IV: there are also things that can't be encoded in zarr
- DH: one problem with extending JSON is that there are JSON parsers in C code that would choke
- JMS: zarr-python has essentially already done that
- **Enumerating options:**
1. extend JSON parser (generally :-1:)
2. support existing JSON-variants (BSON) (generally :-1:)
3. encode objects in JSON
4. add type information somewhere else (like .nczarr)
- JM: (2) might be a metadata-driver like separate chunk-stores
- DH: if it's binary, then you need a good spec. and need to show equivalence between binary version and JSON.
- JMS: you might be writing the non-JSON attribute late in the process, which would cause problems
- DH: binary could help with speed since string level parsing is expensive
- DB: always thought of the metadata as the stuff you want to read with an editor and you don't want performance issues
- DH: have seen a number of examples of NC-4 files that are enormous (10s of MB of metadata)
- also abusing grouping for "namespaces" (even if not a good idea)
- IV: is this Zarr's responsibility? cf. Pydantic which can turn your values into something else. (i.e. external schema)
- DB: but Zarr is responsible for storing "fill_value"
- IV: that's .zarray rather than .zattrs
- DB: would assume that the `.attrs` property takes care of encoding/decoding
- JMS: would see saying ".zattrs supports JSON + these encodings"
- IV: do all the languages support this?
- JMS: `Array[UInt8Array]`
- SV: an extension?
- JMS: could fail on invalid JSON now and then add encoding/decoding later (since there's already the issue with V2)
- IV: Arrow requires everything to be an arrow type (everything else is string with encoding)
- DH: did that in netcdf-4
- DB: sqlite is the same way
- DH: include numpy with json type (from string)
- IV: almost done with PR on awkward arrays (using this). depends on the JSONs
- JMS: would make sense to standardize that (decide: pure JSON or extended JSON)
- IV: see https://github.com/scverse/anndata/pull/569
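A small illustration of the problem and of option 3 above ("encode objects in JSON"); the attribute names are made up:

```python
import json
import math

attrs = {"offset": float("nan"), "scale": float("inf")}

# Python's json module emits NaN/Infinity tokens by default; they are not
# valid JSON, so strict parsers (e.g. in C or JavaScript) reject them.
print(json.dumps(attrs))               # {"offset": NaN, "scale": Infinity}
# json.dumps(attrs, allow_nan=False)   # would raise ValueError

def encode(value):
    """Replace non-finite floats with sentinel strings; reverse on read."""
    if isinstance(value, float) and not math.isfinite(value):
        if math.isnan(value):
            return "NaN"
        return "Infinity" if value > 0 else "-Infinity"
    return value

print(json.dumps({k: encode(v) for k, v in attrs.items()}))
```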
- SV: heading to California next week for NumFOCUS & CZI summit (also NJ & NYC)
**Attending**: Sanket Verma (SV), Jeremy Maitin-Shepard (JMS), Davis Bennett (DB), Eric Perlman (EP), Ward Fisher (WF), Martin Durant (MD), Hailiang Zhang (HZ), Ryan Abernathey (RA), John A. Kirkham (JK)
**TL;DR:** The Zarr community discussed two open PRs in the zarr-specs repo: [Support a list of `codecs` in place of `compressor`](https://github.com/zarr-developers/zarr-specs/pull/153) and [Change data type names and endianness](https://github.com/zarr-developers/zarr-specs/pull/155). The discussion was extensive, covering many good points, and the community overall favours merging both of these PRs. After this, Hailiang Zhang from NASA Goddard asked a few questions about the ZEP extension he and his colleagues are working on. They are making progress on the ZEP and will submit a draft in the upcoming weeks.
Finally, there was a discussion on working on and finishing the pending ZEP1. John A. Kirkham proposed an idea that everyone was in favour of. Also, the community would like to step up and help in the completion of ZEPs whenever and wherever needed.
- Zarr is attending CZI and [NumFOCUS Summit 2022](https://numfocus.org/2022-project-summit-openmbee), if you're there feel free to say Hi! 👋🏻
- [Jonathan Striebel](https://github.com/jstriebel) is presenting a poster on Zarr @ [EuroScipy 2022](https://www.euroscipy.org/2022/) next week. If you're attending the conference, please say hi 👋🏻!
- Final decision on [ZEP1](https://github.com/zarr-developers/zarr-specs/pull/149) most probably next week. Please leave your feedback now!
- Suggestions for themes/tech stack for website revamping
- JMS: https://github.com/zarr-developers/zarr-specs/pull/153
- Does anyone have any feedback on this?
- RA: Hard to change across all the specs - changing filters is possible as it deals with NP arrays - and compressors don’t do that
- DB: we can look at unified API for these type of changes
- MD: Codec should take context - and all the info like position, size of array and this could potentially solve the problem - by the time you compress the array - the codec could do and know what you’ve told
- JMS: each codec should have bytes as output
- DB: for numcodecs this means promoting/changing the function signature - no reason this could not be done
- MD: it says buffer and it can be amended
- MD: it can tell you where you are in the array - chop things where it is specified - For e.g. I want this key because it is this key in this chunk - also biased because worked on storage and helps me out
- JMS: first you read the chunk and then data is read - codec wants to make partial read - then codec could decide what to do from there
- MD: blosc takes care of this - codec will itself won’t interact with the storage layer - use case - kerchunk example - netcdf file - needs to know what file and size we are - doesn’t need to think about storage layer
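For context, a minimal sketch of how v2 splits the two concepts that the PR proposes to unify into one ordered `codecs` list; the codec choices below are illustrative:

```python
import numpy as np
import zarr
from numcodecs import Blosc, Delta

# v2: filters (array -> array) and a single compressor (bytes -> bytes)
# are configured separately.
arr = zarr.create(
    shape=(100, 100),
    chunks=(10, 10),
    dtype="i4",
    filters=[Delta(dtype="i4")],
    compressor=Blosc(cname="zstd", clevel=5),
)
arr[:] = np.arange(100 * 100, dtype="i4").reshape(100, 100)
```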
- JMS: https://github.com/zarr-developers/zarr-specs/pull/155
- MD: Seems good to me 👍🏻
- JMS: rename the data types and keep the endianness - minimal change - using different names makes sense - you can add other names - and it makes sense to use more conventional names
- DB: in favour 👍🏻
- JMS: if big endian array - this change will return array with big endian
- RA: lil’ experience with this - ocean model gives big endian data - sometimes they don’t work - never wanted to have those types - just accepted because it was there - trade off: computational cost and cannot convert it on the fly
- MD: if you can do it in places, temporary duplication of the memory can be done - all astropy data is big endian
- RA: row major vs. column major - something which Zarr should take care of 👀
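A small numpy illustration of the endianness point: today the dtype string carries byte order, while the PR's logical names would leave byte order to a codec and hand users native-endian data:

```python
import numpy as np

stored = np.arange(4, dtype=">i2")                       # big-endian int16 as encoded
native = stored.astype(stored.dtype.newbyteorder("="))   # what most users want back
print(stored.dtype.str, native.dtype.str)                # '>i2' '<i2' on little-endian hosts
```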
- HZ: Ryan and HZ’s colleagues (Brianna and Mahabal from NASA Goddard) had a discussion in a meeting - a proposal to build chunk-level statistics for performance - the idea: each chunk will have some statistics to describe its characteristics and how it is performing
- *Planning to submit an extension ZEP soon!* 🙌🏻
- Not a single value - will be along certain dimensions - allows it to be a vector instead of a scalar - along which dimensions should it be done!?
- It’s been a month - what’s the timeline for the release of the next major version of the spec? - No timeline yet, we're still working on it!
- Statistics need to have some knowledge - https://github.com/zarr-developers/zarr-specs/issues/73 - this is helpful for us - couldn’t find this in V3 - did I miss something about this?
- Dimensions could be lat. long. time - the statistics - the dimensions could be switched
- MD: adding a codec could solve this!
- Thank you!
- JK: Get Alistair to finish ZEP1!
- Use comments and make an individual PR for those comments - Nice idea!
- MD: Different uses and perspectives - move towards a same goal - other groups have structured issues
- MD: make a list of things we can include and solve them
- JK: break larger problems into simpler ones and then solve them!
- Discussion on community to take charge for ZEPs
- Everybody seems to be in favour and ready to step up
- Hoping to close ZEP1 soon and move forward!
Open Agenda (add below 👇🏻):
- Zarr v3 spec open issues:
- by JMS
- fill value required: https://github.com/zarr-developers/zarr-specs/pull/145
- C order vs F order vs arbitrary order
- Rename of array metadata files
- Dimension labels
- NaN/inf/other special values in user-defined attributes: https://github.com/zarr-developers/zarr-specs/issues/141
- Storage transformations: sharding, consolidated metadata
- Can storage transformers operate on the entire store or just arrays? https://github.com/zarr-developers/zarr-specs/pull/149#pullrequestreview-1078722828
**Attending**: Sanket Verma (SV), Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Alex Merose (AM), Brianna Pagán (BP), Hailey Johnson (HJ), Hailiang Zhang (HZ), Jonathan Striebel (JS), Martin Durant (MD), Norman Rzepka (NR), Shivank Chaudhary (SC), Ward Fisher (WF), Mahabal Hegde (MH), John Kirkham (JK), Isaac Virshup (IV)
- Favorite sport!
- Feel free to add links to your work here
- [ZEP1](https://github.com/zarr-developers/zarr-specs/pull/149) & [ZEP2](https://github.com/zarr-developers/zarr-specs/pull/152) are open for feedback!
- Review under https://zarr.dev/zeps/draft_zeps/
- Comments on https://github.com/zarr-developers/zarr-specs/pulls
- Browse https://zarr.dev/community-calls/ for previous meetings notes
- [Jonathan Striebel](https://github.com/jstriebel) is presenting a poster on Zarr @ [EuroScipy 2022](https://www.euroscipy.org/2022/). If you're attending the conference, please say hi 👋🏻!
- 2.13.0a1 releasing soon with updates from [Davis](https://github.com/zarr-developers/zarr-python/pull/1094), [Mads R.B.](https://github.com/zarr-developers/zarr-python/pull/934) and [Jonathan](https://github.com/zarr-developers/zarr-python/pull/1096)
- Josh: 2.13 next alphas
- Phase 1 of GSoC 2022 completed! 🎉 Check progress [here](https://alt-shivam.github.io/Codecs-Registry/)
- JMS: would be good to have specification (json-schema?) for each
- Goal: have clients interact with the registry to give users info/feedback
- async zarr https://github.com/martindurant/async-zarr
- anaconda hackweek per quarter (2-day hack)
- for discussion https://gitter.im/zarr-developers/community?at=62f3ed24d020d223d36587d5
- http only and other simplifications
- JMS: targeting runtimes outside of the browser?
- AM: cloudflare worker? WebAssembly support. (MD: already in pyodide)
- lightweight VMs (e.g., for security)
- IV: story for downstream library developer to use? rewrite to use await
- MD: definitely must use await (can't go in and out of the event loop)
- use case: first chunk of several arrays
- MD: e.g. how would xarray use it
- MD: some things already work: bokeh, etc.
- ...add here ...
- AM: Question about the Zarr Spec v3 (ZEP1)
- Like that it's very bare bones
- Thought experiment: Could a video codec be implemented?
- Compress across time; key frames
- JMS: 3d xyt chunks would work (individually)
- MD: variable length chunks. Critical to video compression (per key frame)
- MD: if each chunk has all the time points
- MD: but also in favor of variable length chunks
- JMS: what's the connection?
- MD: video compression supports a large range of chunk sizes based on how quickly the video is changing
- JMS: just make time chunking big enough and internal is a detail
- JS: remove chunking across the time dimension
- currently inefficient, but with partial reads it could work
- let video codec request chunks of data from the store
- MD: internal to codec or explicit at the zarr level
- JM: difference in the fundamental model? (atomicity)
- MD: 1-dimensional delta codec, make it across an arbitrary dimension?
- AM: if Zarr intends to be the metadata format, this is a stress test.
- JMS: with fixed key rate,
- JM: see also https://mpeg-g.org/
- HZ: extension proposal
- implementation for multidimensional data analysis
- introducing auxiliary datasets in reduced datasets (non-scalar, accumulation value)
- helps to speed up computation. Ryan A. suggested a spec extension
- MH: averaging over time or spatial extensions
- JM: cf. https://github.com/zarr-developers/zarr-specs/issues/50
- IV: perhaps like transforms https://github.com/ome/ngff/issues/101
- JS: difference of whether or not it leads to an additional array
- IV: Non-uniform chunks – timing?
- conversation with JK at SciPy
- broad desire to have them exist. any objections?
- has several master's students to put on this
- JM: ZIC?
- IV: can discuss if in spec or as an extension
- JS: would still have a formal spec even if an extension (eases adoption, clear interface)
- IV: ZEP0001 timeline?
- SV: on me. working with Alistair to apply the modifications. ASAP.
- JMS: meta-issue for scheduling time to work on the V3 spec
- way to speed up progress? additional meetings?
- SV: ZIC meeting?
- JM: editorial meeting? Add JMS?
- JS: happy to be in discussions with AM but also open issues that need discussion
Attending: Josh Moore (JM), Brianna Pagán (BP), Ryan Abernathey (RA), Norman Rzepka (NR), Jeremy Maitin-Shepard (JMS), Greg Lee (GL), Trevor Manz (TM), Ward Fisher (WF), Davis Bennett (DB), Matt McCormick (MM), John Kirkham (JK), Parth Tripathi (PT)
- RA: ongoing review process :tada:
- JMS: long-list. perhaps we should just go through them
- NR: higher-level -- status of the extensions? going in? ZEP0002, 3, 4...
- RA: sets ground work for extensions. like the idea of keeping them narrow in scope
- NR: worried about a ZEP a month and the happiness of the ZIC. Perhaps batching them?
- JMS: review sharding as part of ZEP0001 since it was the motivation for many to have gotten involved. main benefit of V3
- MM: on sharding, would like to look towards the future (i.e. not necessarily finalization) to get it adopted across the implementations
- JM: not a lot of movement (speaking for others). definitely implementation needs work.
- NR: Jonathan is in stand-by waiting for decision. Could be ZEP0002. (He has a conflict at this time)
- MM: Great comments https://github.com/thewtex/shardedstore/issues/17
- Using sharding in a general way for simplicity, incl. with different stores.
- Looking to go through this in practice for large scale data.
- See working prototype. Pretty efficient. Works with v2 as well out of the box
- DB: understanding sharding as introducing an abstraction between the array and the store. Will that generalize to all non-sharded stores? (No-op shard?)
- NR: yes. Store shouldn't need to know about the storage transformers. Partial reads are helpful but not required.
- NR: at specification level (i.e. not just zarr-python) need to know how it will look like on disk.
- MM: could see trying to get ZEP0001 out. (**Proposal?**)
- but also: yes you can shard arrays, but what about groups (as additional need for the spec)
- useful for content-addressable/verifiable storage
- unrelated to all of the hierarchical formats
- separate shards per scale along with the related metadata (same for xarray)
- JMS: in the interest of getting ZEP0001, perhaps we hold off on sharding. as a delta to the
- **tl;dr** --
- ZEP0001: focus on getting current work done but include storage transformer (:+1:)
- ZEP0002: Jonathan to start ~next week (making necessary adjustments to ZEP0001)
- MM to comment on PR or open alternative proposal
- then in that same batch or ZEP0003 definition of extensions
- RA: process -- inventing it as we go. Bootstrapping so there will likely be a lot of feedback on how things work, but try to use that structure for the moment.
- JMS: on the outstanding issues
- fill values: consensus that it must be specified?
- JM: replaces DB's smart fill value logic?
- DB: clients can have a mapping or a callable, but it wasn't easy to make it work with the semantics (in the zarr-python)
- DB: easier if we make it required. gets past fundamental ambiguity
- JM: the upgrade scripts will need to be aware of this too (EOSS4)
- DB: 0 as sane default for uint8? etc. etc.
- consensus: require fill value be specified
- case sensitivity
- v3 says "case sensitive", reasonable except on e.g. OSX.
- Add a note?
- Alternative of escaping? (add-on)
- WF: file-system rather than OS bound (despite tight correlation per platform). NC doesn't assume the technical debt of working around file system limitations. Ergo? User consideration, not something that can be fixed technically.
- path structure
- see previous meetings
- (A) removing "meta/" as nicer paths to metadata files
- Con: doesn't work for the consolidated metadata path
- JM: Workaround with "symlinks"
- (B) require suffix on the root (".zarr")
- (C) syntax for combining path and key: `path//key`, `path#key`, etc.
- JM: recently ran into a need/use-case for something like (A). import into OMERO, easier to work with the metadata as the main hierarchy.
- RA: good to think about having kerchunk style references encodable in Zarr
- discussions at scipy
- vibe, "why a new format?"
- needed the ability to have references to blocks in other files
- MM: "composite store" (like a sharded store, could also add in kerchunking possibly)
- adds in layer of indirection, doing indexing. tells you what's present.
- would need to be more well-supported than consolidated metadata.
- JM: how does it differ? MM: more flexibility in how it is broken up
- MM: very large dataset and doing analysis on one part of the dataset then that can be updated independently.
- JM: similar to Dennis' consolidation per hierarchy level. Yeah, like an octree.
- NR: but doesn't solve path
- JMS: if consolidated is a concern, then (A) won't work
- JM: plus symlink should work.
- JMS: proposing to drop "meta".
- RA: currently fsspec handles it. could be a formal URL scheme.
- JMS: some issues with URLs if you're opening RW
- RA: could see only RO to begin with.
- root directory
- reasoning is having a non-empty name
- JMS: just have ".zarr" as the name? best to skip that. (hidden files were an issue for V2)
- RA: talked about having the root document be a json file.
- JMS: special name?
- difference between .zgroup and .zarray
- leads to potential race condition
- DB: used to attributes.json from n5 and have never had an issue with it
- NR: easier if it's one name to not need to do two look ups
- JMS: currently include types `<i2`, etc. more logical to say data types are logical (16-bit-signed-integers)
- JMS: make it a codec issue to deal with endianness since it only matters for raw encoding
- NR: need to specify it somewhere, even if just in the codec. blosc would need to know (anything byte based)
- JMS: filter rather than a data type
- NR: downside of having it in the datatype? codec could ignore it. (JMS: happens at the moment)
- JMS: numpy's endianness is a bit unusual. often you want to just use it in the native endianness.
- (Lots of nodding from Trevor)
- JMS: main benefit is to always give the user native endianness
- people were happy to have them (yes?)
- MM: boolean as 1 bit or byte? JK: one byte. no bitpacking. (That could be a codec)
- vector bool as an example of over-optimized
- rawcodec (DB)
- never need to say "None"
- "raw"? intuitive?
- "identifiy", "noop", "dummy", "pass-through"
- JMS: similar to endianness. combining 2 things in codec. codec gets an array and not a stream of bytes. could arguably be **split**
- DB: separate configuration for each?
- JK: similar to filter vs. codec, not well spelled out in the spec. See Categorize for an example.
- JM: would make the choice to avoid compression explosion (e.g. for images)
- JK: there's already a meteorological compressor...
- JMS: linear chain of filter with a codec has issues
- current way to do it would be to encode a byte stream and use a compressor
- perhaps want separate compressors for different parts
- could the filter itself have additional filters/compressors for the labelled data vs. the indices
- JK: use cases? JMS: variable length strings, multisets, downsampling segmentations (similar to large number of categories)
- JMS: should be easy to fit it in now. have a tree and the filter becomes the codec.
- DB: filter vs codec? why a tree rather than an array?
- JK: original use case of filter is categorize `(RGB) -> (012)`
- filter as a transformation (on ndarray)
- DB: different type signatures?
- JMS: effectively not different in V2
- JK: mostly a terminology thing
- DB: "pipeline"
- TM: is "raw" just an empty list?
- JK: look at how parquet does it? (Ask Martin perhaps)
- TM: one pipeline with inputs/outputs for each codec then you could encode numpy/bytes as desired and confirm that it's valid
- JMS: one codec location to an array? (nodding)
- JK: do we have chained codec use cases
- DB: someone at Janelia was working on that for segmentation of volumes
- similar to categorical
- see related paper. "gzip on top"
- JMS: similar to something in neuroglancer
- TM: bitshuffle/gzip for kerchunking? (to read HDF5 file)
- DB: semantics come from HDF5
* Cancelled for SciPy
Attending: Davis Bennett (DB), Sanket Verma (SV), Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Parth Tripathi (PT), Ward Fisher (WF), Hailey Johnson (HJ), Shivank Chaudhary (SC), Ryan Abernathey (RA) +30 min
- [Zarr-Python 2.12.0](https://github.com/zarr-developers/zarr-python/pull/1038) has been released! 🎉
- ZEP1: https://github.com/zarr-developers/zeps/pull/1
- Authored by Alistair and Jonathan
- includes details on sharding & transformers
- addresses pain points & lack of clarity in v2
- Alistair to open spec changes against zarr-specs repo
- see https://zarr.dev/zeps for these changes
- comment on PR as desired
- otherwise, merging very soon
- further discussion to take place on the zarr-specs PR
- Briefly (Josh), NFDI recommended for funding :tada:
- JMS spec discussions
- NB: right forum? JM: just need to communicate thoughts back on the PR since there is no requirement to be at the community calls
- Dimension labels
- there seemed to be interest in writing it up as a spec
- requirement that they are unique strings OR the empty string to say that they are unlabeled
- DB: motivation for unlabeled? Currently all are unlabeled. DB: disagree they are all labeled with integers.
- JMS: then strings are optional/additional alternatives.
- DB: see it leading to issues. potentially: "if you add labels then you must add all"
- JMS: case of automating inputs to outputs could lead to inventing fake labels but perhaps that's preferable to empty
- DB: drawback from type theory is that you want the unlabeled case to be a different type. JMS: Use Null? disallow `""` anyway?
- WF: dimensions are label and id parent? or conflating NC/H5
- JMS: was just thinking within a given array. goal would be to not need to know it's "dimension"
- DB: could see having arrays logically identical with different dimension ordering. want to enable use of, e.g., `"z"`
- WF: "dim_$N" gets assigned automatically.
- JM: need for buy-in from xarray and nczarr
- JM: in .zattrs? .zarray? JMS: don't really mind.
- DB: err on the side of having zarrs more like numpy arrays
- JM: names in numpy are part of the dtype
- DB: backwards-compatible way to specify the defaults if they don't exist
- JMS: and added to the zarr-python library? Yes.
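Until dimension names land in the core metadata, a minimal sketch of the existing xarray Zarr convention, which records them in the `_ARRAY_DIMENSIONS` attribute (path, shape, and names are made up):

```python
import zarr

z = zarr.open("labelled.zarr", mode="w", shape=(5, 256, 256),
              chunks=(1, 256, 256), dtype="u2")
z.attrs["_ARRAY_DIMENSIONS"] = ["z", "y", "x"]
```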
- Single string to identify zarr root path + zarr array/dataset within root
- SV: Greg left a [comment](https://github.com/zarr-developers/zarr-python/issues/1039#issuecomment-1170034733) today. See also shoyer [issue](https://github.com/zarr-developers/zarr-python/issues/1039)
- DB: an issue. problematic ergonomics
- JMS: was hoping to find a resolution
- JM: couple proposed
- sensible defaults
- DB: reason for separate hierarchy
- JM: possible extensions (like consolidated)
- JMS: range-requests to see full listings
- RA: strongly believe that V3 doesn't introduce such a breaking change
- RA: NC uses path/to/file.h5/path/to/group
- JM: would require an increased number of lookups for the root JSON
- WF: correction -- NC uses two strings
- JMS: neuroglancer has a data source URL. can make up a convention but it would be nice to preserve the single-string semantics
- RA: xarray only opens groups. more complicated for arrays.
- RA: good to formalize the URI/URL semantics (good to specify your data with a string)
- JMS: applies to groups, too.
- RA: xarray supports extra path to a sub-group. also gaining datatree functionality.
- DB: going into mainline? RA: Yes. DB: super cool.
- DB: couldn't you just pass the absolute path?
- JM: you don't pass "data" or "meta". only the logical group.
- DB: that means that code completion won't work. could irritate people.
- DB: would pass the array. job of library is to find the array.
- RA: use hash tag or standardized file ending (.zarr) to parse URL
- DB: .zarr seems 100% reasonable (since slash is taken)
- DB: recommendation for people who want to live their truth
- JMS: would like to make this a MUST
- DB: jpeg vs jpg vs ...
- RA: mimetype
- JM: make the .json files the default?
- RA: getting Zarr into STAC was problematic because it's to a URL rather than a file. i.e. it fundamentally becomes a JSON file. Becomes a catalog.
- DB: like it. Directories are not real, files are real.
- JMS: could define a different ending?
- RA: .json is good
- JM: it's .zarr.json which isn't bad
- DB: natural when moving from local file system to a KVS
- RA: opens up absolute paths to chunks potentially
- JMS: with more changes to the spec, yeah.
- JM: consolidated metadata will be problematic.
- DB: PR
- mypy issues
- annotations breaks linter
- JM: generally :+1: for type annotations, also ok to start looking at dropping 3.7 now
- Support for inf/nan/binary data in attributes
- Zarr's website
- What do you feel about our current website?
- What would you like to see in the new website?
- Any ideas for good Jekyll/any static website generator themes?
Attending: Sanket Verma (SV), Ryan Abernathey (RA), Jeremy Maitin-Shepard (JM), John Kirkham (JK), Jackson Maxfield Brown (JMB), Dennis Heimbigner (DH), Martin Durant (MD)
- ZEP acceptance criteria: https://zarr.dev/zeps/active/ZEP0000.html#how-does-a-zep-become-accepted
- GSoC 2022 coding period has officially started! Check the progress for [Registry Codecs](https://hackmd.io/@uTe8Vo8gSYeCbwHsQI2Z2Q/SypXtPRD9) and [Benchmarks](https://hackmd.io/@I9Hj1bLETn6QIva97pA3Hw/By7rlRXd5).
- MD: Weekly tracking of GSoC 2022 Kerchunk contributors here: https://github.com/fsspec/GSoC-kechunk-2022
- Non-zero origins: https://github.com/zarr-developers/zarr-specs/pull/144
- RA: JM's [proposal](https://github.com/zarr-developers/zarr-specs/pull/144) like a comments/suggestions to the ZEP1?
- JM: Yes, it’s like a comment but not a full suggestion
- JM: Having a non-zero origin as an extension will be fine. Zarr doesn’t have a well-defined coordinate space - if you add a non-zero origin you need to add stuff when dealing with other types of arrays like reading or writing it to other file systems or arrays
- RA: uses Zarr also as a lower stack array - comfortable with the idea raw array space and coordinate space - and good with zarr doesn’t know about coordinate space - Xarray can build coordinate space - works with the metadata concept - Zarr can’t make Julia use index base - coordinate space is not suggestion, here we are changing the array index
- JM: certainly see Ryan’s argument - can use other libraries like Xarray for doing the index manipulation - also value of having array where you can talk about position
- DH: NetCDF coordinate system talks about latitude and longitude - introduce notion of coordinate variables - agree with Ryan’s - index level needs to be pure and standardised - whole variety of coordinate system that can be imposed later on - there are arbitrarily number of coordinate system that people use and bad to pick-up one here
- MD: agree with Ryan - in Xarray we can define coordinate system using other variables
- RA: JM also commented on the issue that the risk of not having in the core would be that client opening the Zarr arrays and would not able to access the array
- JM: unfortunate how Julia changes index - if you don’t talk about base index it doesn’t hurt anyone
- RA: HDF group is used to this - zero-based indexing - the language determines how the array data is exposed - `Xarray` can do it because it has a data model in Xarray - diffuse this out of zarr - we have a primitive array storage system and on top of that we have various conventions of metadata and that’s the beauty - no explicit support is required for that - many tools can open that
- RA: We can put a convention to address the issue - a page of conventions on the website, something like https://zarr.dev/conventions can document that - processing software can use those - Zarr ontology to other array ontology - if we put it in the Zarr core why are we catering to the microscopy group only and why not the Geo community!?
- MD: The word `convention` is super useful - if you have tools which can leverage the indexing
- RA to JM: if we don’t support in the core array - it’s also about the implementation - have you thought about implementation?
- JM: very simple if `not` in core spec - pretty clear boundary on how transformation can be done in dense integer space of Zarr array - index by coordinate array and other method - different data types where indexes are latitude longitude - having an extra level of translational array
- MD: Zarr array core design would need to behave like every language
- JM: if array is small - it’s in the memory and you can do a lot of stuff like read it store it and play around with it! - Zarr array and memory works in other ways!
- MD: naively do it in any language - use the language rules - you’d the do the selection as the array is in the memory
- JM: shifting the coordinate space - what about negative indexing? - How does Xarray handles it?
- MD: not possible - each variable has a unique set of coordinates - the NetCDF conventions would not allow it - NetCDF conventions are far more rigid than anything - Xarray could certainly implement a wide range of mappings - Xarray is born out of the `NetCDF model`
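A minimal sketch of the "conventions, not core" position above: the Zarr array stays zero-based, and xarray layers a coordinate space on top via metadata (file name and coordinate values are made up):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.rand(3, 4),
    dims=("y", "x"),
    coords={"y": [100, 101, 102], "x": [50, 51, 52, 53]},  # arbitrary "origin"
    name="signal",
)
da.to_dataset().to_zarr("with_coords.zarr", mode="w")
print(da.sel(y=101, x=52).values)  # select by coordinate, not by raw index
```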
- Negative Indexing
- JK: negative indexing - logical indexing and coordinate indexing problem - data exists somewhere and how we map that to meaningful coordinate
- MD: negative indexing is problem in Python and means a different thing over there
- JK: big change - specifying changes by coordinate having to list them in metadata and to update the metadata for all the previous arrays
- MD: the reference file systems could do the renaming - but it is complicated
- Discussion on: https://github.com/zarr-developers/zarr-specs/issues/141
- DH: is the issue representing the floating point numbers?
- JM: the attribute model is `.json` - need some way to indicate a value is intended as a number
- DH: `.json` has that distinction
- JM: Python implements the extension - generally extension doesn’t support that - in JS you need to write your own parser to take care of this - no way to represent this and we need to discuss this
- DH: binary and nan will be represented as a bit pattern
- JK: would love to see how coordinate space stack would look like - interesting to have it in extension - if Xarray would be interested in that - recasting the coordinate? - coordinate space extension? - changes the metadata? - graduate the metadata and see how to it behaves when the coordinate system is changed?
- JM: a few things that needs to be discussed:
- Zarr attributes and .json and infinity values binary things - wonder if there’s a solution to that in zarr v3 (https://github.com/zarr-developers/zarr-specs/issues/141)
- Zarr V2 array creation has an easy way to create arrays - whereas you need to mention the path in V3; Zarr v3 array creation is a pain because of the path - could be handled by having a `.v3` extension - `//` or any other special character to handle it
Attending: Davis Bennett (DV), Sanket Verma (SV), John A. Kirkham (JK), Trevor Manz (TM), Brianna Pagán (BP), Parth Tripathi (PT), Gregory Lee (GL), Ward Fisher (WF)
- Release 2.12 - [Blog](https://zarr.dev/blog/pre-release-2-12/) and [Tweet](https://twitter.com/zarr_dev/status/1529430764563013632)
- ZEP website: https://zarr.dev/zeps
- ZEP 1 - Alistair and Jonathan working on it, check [here](https://github.com/alimanfoo/zeps/blob/zep-1-2022-05-03/zep-1.md). PR [here](https://github.com/zarr-developers/zeps/pull/1)
- David(*xtensor*), Ward(*NetCDF-C & Java*), Trevor(*zarr.js*) and Gregory Lee(*zarr-python*) added to the [ZIC](https://github.com/zarr-developers/governance/blob/main/GOVERNANCE.md#zarr-implementation-council-zic)
- GSoC 2022 contributors updates - [Shivank](https://hackmd.io/@uTe8Vo8gSYeCbwHsQI2Z2Q/SypXtPRD9) and [Parth](https://hackmd.io/@I9Hj1bLETn6QIva97pA3Hw/By7rlRXd5)
- TM: https://excalidraw.com/ for making graphic representations, TM used them for the Zarr [slides](https://docs.google.com/presentation/d/1bKE3BYp9FEPcL7ZUyWkyyguRE5ptSiJYHDqcIn_nmkU/edit?usp=sharing) 🖼
- `fill_value` for empty chunks:
- DV: How could it be implemented to avoid breaking changes?
- DV: Is there a clear description of `fill_value` in Zarr Spec V2?
- JK: V3 spec aims to mitigate the issue of `fill_value` and have clear text about it
- DV: Haven't looked at V3 but if it specifies `fill_value` by default then it's good
- TM: Not writing `fill_value` is semantically correct; on the implementation side, having to implement something like this is more flexible and easy in JS as compared to Python
- DV: Problems representing numbers as `.json` metadata. There should not be constraints to `fill_data`/`fill_value` in `.json` as it's metadata
- TM: Serialising the scalar arrays as containers
- JK: Slightly hard to read metadata with `fill_value` as there are ambiguity issues
- JK: How does NetCDF handle `fill_value`?
- WF: Couple of different approaches to handle it.
- We do it for standard and compound data types
- In absence of user-input we have a default `fill_value`
- Mostly users need to specify the data type as it's difficult to come up with it on its own
- There's also a flag to suppress `fill_value`
- As NetCDF is mostly used for data archival and for that we need to maintain data integrity, it's difficult to know what's done if there are no defined data types or no `fill_values`
- We can also explicitly write `fill_value`
- Store in metadata
- JK: Is there a concept of not storing it?
- WF: `fill_value` can be changed, the default behaviour is to replace the old `fill_value` with the new `fill_value`, not the best case but yeah, it is what it is!
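A minimal sketch of the `fill_value` semantics under discussion (zarr-python v2; shapes and values are made up): chunks that were never written are synthesized from `fill_value` on read, so an explicit value removes the ambiguity.

```python
import zarr

arr = zarr.full((4, 4), fill_value=-1.0, chunks=(2, 2), dtype="f4")
arr[:2, :2] = 7.0               # only this one chunk is actually stored
print(arr[3, 3])                # -1.0, synthesized from fill_value
print(arr.nchunks_initialized)  # 1
```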
- TM: `FSSPEC` reference ability to record `base64` to `.json`. How data `URLs` work: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs
- SV: Conferences 📣
- BP: Speaking at [ESIP](https://2022esipjulymeeting.sched.com/info)(in-person) on July 19th-22nd on cloud processing and how they process internal Zarr & COGs data at NASA. Christine will be talking about Zarr and almost 50% of the content will be focused on Zarr. Link: (https://2022esipjulymeeting.sched.com/event/12etJ) and (https://2022esipjulymeeting.sched.com/event/12etV)
- WF: Will be talking about NCZarr at Rocky Mountain (hybrid), current state of NetCDF. Link to follow in the Gitter chat.
- JK: GTC(virtual) happened in spring, NVIDIA is presenting a few things at SciPy
- SV: CFP for EuroSciPy is open until 6/6 AoE, submit [here](https://www.euroscipy.org/2022/program.html)
Attending: Josh Moore (JM), Sanket Verma (SV), Dennis Heimbigner (DH), Eric Perlman (EP), Ward Fisher (WF), Martin Durant (MD), Vinasco Juan (VJ), John Kirkham (JK), Isaac Virshup (IV),
- New repo: https://github.com/zarr-developers/zeps (available from https://zarr.dev/zeps later this week)
- Work on ZEP1 started!
- Lays the ground work for V3
- More updates coming from Alistair in the next week.
- ZEPs have a lead author but multiple co-authors
- So feel free to add to the ZEP PR that will be opened soon!
- GSoC contributors coming on 20th May
- Recently merged PRs (Thanks! 💐)
- NumPy twitter poll: 1.20 or newer used by 60%+
- JK: looking to drop some versions (currently 1.17+)
- xarray and dask are on 1.18 or even later
- conda-forge is at 1.19
- hard to get older versions
- NEP provides a schedule for deprecating. In June 1.20 becomes the minimum
- also: want to use newer features to support more array types (numpy + dask + ...)
- MD: it's been stable for so long, so someone can use an older version.
- Also: Python 3.6 dropped
- https://github.com/zarr-developers/zarr-specs/pull/16 (only open PR)
- updated by Jonathan in https://github.com/zarr-developers/zarr-specs/pull/142
- good for everyone to take a look at that.
- JM: https://github.com/zarr-developers/zarr-python/pull/1020 and how conservative to be
## May the 4th
Attending: Sanket Verma (SV), Ryan Abernathey (RA), Josh Moore (JM), Ishan Bansal (IB), John Kirkham (JK), Hailey Johnson (HJ), Shivank (SH), Brianna Rita Pagán (BRP), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Parth Tripathi (PT), Eric Perlman (EP), Jonathan Striebel (JS), Martin Durant (MD), Greg Lee (GL), Matt McCormick (MM), Davis Bennett (DB)
Introductions (new things & favorite places)
- Brianna: tech. lead at NASA for migrating data & services to the cloud
- ZEP is on the verge of acceptance & merging. Check [here](https://github.com/zarr-developers/governance/pull/16)
- ZIC invites sent. Check issues [here](https://github.com/zarr-developers/governance/labels/ZIC)
- Merged PRs:
- Adding environment variable for Zarr V3 by Greg. Check [here](https://github.com/zarr-developers/zarr-python/pull/1007)
- Performance improvement while appending data to Zarr array in S3 by Hailiang Zhang. Check [here](https://github.com/zarr-developers/zarr-python/pull/1014)
- Pre-commit check by Shivank. Check [here](https://github.com/zarr-developers/zarr-python/pull/1015) and [here](https://github.com/zarr-developers/zarr-python/pull/1016)
- Suggestions for the Zarr Community. Add [here](https://hackmd.io/7jV1cE3pQeWXWI4e8siXng)
- Please have a look at the recent poll [here](https://twitter.com/zarr_dev/status/1521755830051389441)
- Thoughts on recording the community calls?
- EP: more no (except presentations)
- JK: more no
- IV: R implementation?
- JM: a libzarr out of netcdf-c
- DH: possibly. (need to look at turning NC3 off)
- WF: if we had someone to maintain it, then it's a no-brainer
- DH: what API would we use. NC API is a pretty good match.
- IV: good tooling for wrapping python in R. works almost seamlessly.
- JM: libzarr would also get us MATLAB
- WF: It would be a lot of work, and the code in the netCDF-C repo is available for poaching. Collaborating to create a pure C Zarr library would be in our (Unidata/netCDF-C) interest and an easier lift than splitting it out/maintaining it ourselves
- WF: license, etc. should not be an issue.
- IV: there were some C++ folks on the bioconductor side
- JK: invite bioconductor to next meeting?
- ZEP process:
- MD: approve :tada:
- etc. etc.
- _ergo_ ZEP 0 merged :tada:
- BRP: geospatial standards (matching to https://cfconventions.org/)
- RA: read the [OGC document](https://portal.ogc.org/files/?artifact_id=100727&version=1)?
- BRP: No, gone through geozarr
- RA: started mapping via xarray cf to zarr
- RA: OGC is voting on accepting zarr, wrapper with a preface on conventions (named dimensions, netcdf data model, coordinate reference systems)
- RA: NASA really cares about OGC and zarr is on track to be accepted
- RA: [geozarr](https://github.com/christophenoel/geozarr-spec/) is newer and more prescriptive
- BRP: battling with CRS since it's not in cf
- RA: unidata would say there's a way.
- BRP: but it's not required
- RA: suggest getting behind geozarr (1 person at ESA)
- ...add stuff here...
- RA: What's ZEP 1 going to be? start with JMS' comments on breakingness?
- see: https://github.com/zarr-developers/zarr-specs/issues/140
- MD: list of chunks!
- non-breaking is passing a range of a chunk to the backend storage ("simple sharding")
- RA: like being able to push selections to the store
- MD: in v2 getitems (that's fsstore only)
- JS: in sharding proposal, there are other methods for getting ranges for keys & multiple keys at once and as a combination (pre-requisite for efficient sharding)
- MD: want to uncompress things that you don't know anything about
- JS: there are hooks for blosc, e.g.. Adding this interface would help, since it's currently quite hacked.
- MD: simple enough and nice that it will enable sharding? good for prototyping a ZEP
- JMS: there are breaking changes that don't change the data model in a significant way; **"feature-flags"** as important breaking addition
- JS: transformer infrastructure also
- RA: does anyone expect current v3 implementations to break?
- JMS: v3 isn't a huge change; mostly isomorphic
- RA: _explicit_ extensibility of the protocol? (on top of the re-org)
- JK: `None` fill value etc. that just needed cleaning up
- also moving towards sparse arrays. a plea for people to explore.
- RA: tl;dr
- ZEP1 to get motivation (co-editors welcome)
- ZEP2 e.g. sharding
- ZEP3 e.g. variable chunks
- SV: doesn't need to be sequential editing but sequential merging!
- RA: worried about fragmentation
- will say in ZEP1, but want a strong core that all should implement
- avoid driving people away
- JS: happy to help with the spec, but not great for the ZEP
- JMS: also happy to help with the spec
- RA: will reach out to Alistair (SV: was waiting on ZEP0 to be merged)
- RA: THANKS SANKET!
- SV: Davis from April meeting, propose to add "auto" setting
- DB: "inaction item". Perhaps by the end of the week.
- new teams any comments? :thumbsup:
- build docs for PRs? good idea.
- SV: pyscript!
- MD: super interesting for Zarr. friendly for the browser. no sockets. no threads. suggestion that it might lead to a lot of hype. involved for the IO conversations. (couple of years until its really usable for heavy data workloads)
- DB: performance? MD: good, except for populating the browser (it's a VM)
- DB: had seen 2.5x in favor of native
- MD: more browser as the interface so that you don't need an ipython kernel running somewhere
- MD: long-term talking about how to run numba in the browser (acceleration tricks that make regular python fast). could be that numpy in the browser is <50% slower
- JK: fortran is a problem (e.g. scipy is hard to build)
- tile servers are good enough for pure visualization
- and/or people using zarr are doing data processing (parallelism)
- DB: an
Attending: Ryan Abernathey, Josh Moore, Eric Perlman, Sanket Verma, Jeremy Maitin-Shepard, Jonathan Striebel, Gregory Lee, Jim Pivarski, Ishan Bansal, Isaac Virshup, Parth Tripathi, Martin Durant, Ward Fisher, Dennis Heimbigner, Matthew McCormick
- GSoC deadline ended, we have 3 proposals this year! Between April 19 and May 12 we can decide how many slots we can take.
- Cloud Native Outreach Event went great. Videos will be live shortly!
- If you have any videos to share, let us know!
- intros: https://www.youtube.com/playlist?list=PLvkeNUPrCU04Xvcph4ErxsRkZq28Oucr7
- applications: https://www.youtube.com/playlist?list=PLvkeNUPrCU05qHkZso_T74yoayqLFHzkI
- Using https://github.com/orgs/zarr-developers/discussions
- higher level of repo discussions (specifically show up on the "community" repository.)
- ZEP final update!
- JM: implementation council to be invited
- JS: great to have the implementors on board to not fragment the landscape
- MD: some may not implement though, right? JM: true. multiple states of votes:
- will-implement, may-implement, wont-implement, breaks-us-veto
- MD: no clear status of what's up to date
- RA: veto power since would be bad to lead to forks. worth discussing that provision.
- MD: since we aim for consensus anyway (and veto is used rarely) should work fine
- JS: don't want to end in a place where the spec says something that will never be implemented.
- JS: separation on veto for core or extension. JM: agreed, focus all ZEPs on core for the moment
- SV: extensions are V3, which isn't done, so it's all core.
- JMS: only V3? JM: what about C/F order? RA: don't have to limit it (but we want to focus on V3)
- MD: agreed, the place to expose breakages
- RA: core vs. extension
- is core something that everyone must (eventually) implement
- MD: some things are already optional like filter
- MD: extensions were originally synonymous with conventions but dataset is openable without
- RA: convention is distinct from optional extension (cf. variable length chunks)
- JMS: another way of seeing extensions is the evolution of the spec. signalling to implementations that they are seeing new data. "must understand"
- JP: agree about the distinction. can't-read-data vs. might-need-a-library. have wanted to frame this as an extension that *labels* a convention, like an annotation.
- IV: would be useful to specify convention. if you don't have a way to store the metadata, then goes into the .zattrs
- good to have a field in the structural metadata to specify conventions
- JS: separate from convention. orthogonal questions.
- RA: hierarchy (or ontology)
- JP: A different example: I've seen HDF5 data files, from gravitational waves, that are valid HDF5 but can only be "understood" by the LIGO collaboration's code. It would have been good to have a label on that HDF5 file warning haphazard users.
- DH: is wont-implement a veto? or if they say wont-implement and breaks, then is veto? That's a lot of power.
- JMS: non-zero origin & data-orders other than C are both examples that cause issues (with e.g. Julia). potential vetos.
- DH: solvable, but they are saying the cost is high.
- RA: take sharding. major enhancement but pita to implement. is it core? need to show in ZEP? higher bar for core proposal?
- DH: have been looking at how to implement. it will be a challenge. decided with Ward that it's worth doing.
- RA: meta goal is to have that discussion before the ship has sailed.
- WF: feels a lot like an internal NetCDF conversation. What is an NC file? vs. what's in the doc
- NC file has to be more than a file written by netcdf-c library (the first party implementation)
- goal of tech. spec. is to take it and write software in any lang. that can write/read a NC
- needs to specify permissible deviations
- MD: how to go from v3 to v3.1? (sharding or variable length chunks)
- WF: have made many mistakes.... (e.g always refer to specific versions, NetCDF 3...)
- note: unidata doesn't yet have the iron clad backwards compatibility for nczarr
- v3 to v3.1 could potentially *not* be backwards compatible
- behavior versus definition (this message may self-destruct...)
- MD: parquet example. people mention v2 but that doesn't really exist
- still features that aren't implemented!
- JMS: similar in HTML, https://caniuse.com/ -- we need the same
- JS: agreed, important to know. needed to read the data or not.
- Even more core, must-understand flag & warnings about not being supported. All MUST have this for V3.
- sharding storage-transformer proposal: sharding could be an extension that uses transformers.
- then impl. council decides
- WF: NetCDF isn’t a great analog here; we have no forward-compatibility promise, and the solution when an old version cannot read a newer file is to suggest they upgrade to the latest version. But this is because there are not a lot of independent implementations in the wild.
- NetCDF is also fortunate to have a number of independently-developed utilities and tools (NetCDF Operators (NCO), and pnetcdf spring to mind). Perhaps a zdump (similar to ncdump or h5dump), provided by the core project, that could provide summary information for a file? This information could then be used to determine if a specific dataset could be read by an implementation in question.
- RA: useful concept here -- netcdf is focused on interoperability & preservation. Parquet is for performance. Zarr is mainly high-performance copy of data. But for sharing, might make different choices. Use different extensions then. i.e. have it both ways. Still need the minimal, most-operable version. And need to be clear & upfront about that.
- MD: perhaps a "maximal-flag" setting? IV: perhaps flags. JMS: agreed.
- MD: perhaps "conversative" to cover several of these
- JM: https://xmpp.org/extensions/xep-0115.html
- JP: version/flag-objects within spec, then could have storage-typed and performance-typed objects
- IV: do that in AnnData. Sparse array v1 or v2. (i.e. at the object level)
- RA: are we at the point that the core is the necessary stuff in v3 and we can go with that?
- JMS: expect to add non-optional features in the future
- JM: various data types are probably missing now
- JS: storage transformer falls under this too
- JMS: (...Josh missed a comment from JMS here...)
- JS: not clear how optional the extensions are.
- JM: strip extensions and add it back later? Agreed.
Attending: Josh Moore (JM), Sanket Verma (SV), Norman Rzepka (NR), Parth Tripathi (PT), Davis Bennett (DB), Shivank, Gregory Lee (GL), Ishan Bansal (IB), Hailey Johnson (HJ), Martin Durant (MD), Isaac Virshup (IV), Ward Fisher (WF), John Kirkham (JK)
- Introductions (incl. favorite food!)
- [Cloud Native Outreach Day](https://www.ogc.org/ogcevents/cloud-native-geospatial-outreach-event) is happening in 2 weeks i.e. April 19th. Register [here](https://na.eventscloud.com/website/36829/) if you still haven't.
- GSoC contributor proposals are open now until April 19th ([ideas-lists](https://github.com/zarr-developers/gsoc/blob/main/2022/ideas-list.md))
- Guest blog posts are welcome for http://zarr.dev/blog
- A small surprise for y'all! 🎉 ()
- `write_empty_chunks` debacle (DB)
- Wrong choice? If we can't handle the edge cases, probably.
- DB: make fill_value required? (breaking change). don't see zero as the obvious fill value even for numeric types.
- MD: would want a default since there is data that doesn't _need_ a `fill_value`
- "auto" which sets write_empty_chunks to False if a fill_value is passed
- DB: and/or a global mapping from data_type to fill_value
- WF: that's how NetCDF handles it (on by default, can be turned off by API). Anecdotally, few complaints.
- DB: proposing "auto" if no one objects (post-2.11.3)
- WF: is there a way to know what was requested? NC captures provenance metadata (now).
- MD: other provenance metadata that could be included: "written in python with zarr-python 2.11" (JK: great spec issue!)
- multi-resolution images (MD)
- JM: description of the xarray/datatree work
- MD: ok to concatenate multiple non-T volumes into a time-series? JM: think so
- DB: some clients may not be able to
- TBD in gitter: fsspec (MD)
- can fetch many chunks in an array concurrently
- not an API for fetching from many zarrs in a group
- DB: sounds like it is uncovering a larger issue (`Futures`)
- MD: like dask
- moving the sharding spec forward (NR)
- JM: getting the abstraction correct / we seem to be risk-averse
- JM: editor group? but regardless major decision is does sharding get defined before or after v3.
- NR: (voluntary) implementors group? Then a question of voting vs. consensus vs. ...
- JK: looks like there are different asks at the moment, so maybe the roadmap is the most important thing
- Davis Bennett to submit a follow-up PR to [this](https://github.com/zarr-developers/zarr-python/pull/1005) to propose an "auto" setting for `write_empty_chunks` (rough sketch below)
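- A rough sketch of the proposed "auto" behaviour, assuming it simply keys off whether a `fill_value` was given; the `write_empty_chunks` keyword exists in zarr-python 2.11, but the `"auto"` option and the helper below are hypothetical:

```python
import numpy as np
import zarr

def resolve_write_empty_chunks(fill_value, write_empty_chunks="auto"):
    """Hypothetical 'auto' rule: only skip writing all-fill chunks when a
    fill_value exists to stand in for them."""
    if write_empty_chunks == "auto":
        return fill_value is None  # no fill_value -> every chunk must be materialised
    return bool(write_empty_chunks)

# Usage sketch with the existing zarr-python 2.11 keyword:
z = zarr.open(
    "example.zarr", mode="w", shape=(100, 100), chunks=(10, 10),
    dtype="f4", fill_value=0.0,
    write_empty_chunks=resolve_write_empty_chunks(fill_value=0.0),
)
z[:10, :10] = np.arange(100, dtype="f4").reshape(10, 10)  # only this chunk gets stored
```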
**Attending:** Sanket V., Josh M., Hailey J., Greg L., Dennis H., Jim Pivarski, John K, Jeremy Maitin-Shepard
- GSoC and GSoD Updates
- Several people are showing up looking for good first issues.
- Feel free to help them out, but SV & JM will be chatting with them this weekend.
- Anyone who is interested in spearheading GSoD (deadline tomorrow)
- [ZEP](https://github.com/zarr-developers/governance/pull/16) feedback: Governance, template,
- JMS: propose/offer using an existing issue to test the ZEP. (:+1:)
- e.g. Data type rename issue: https://github.com/zarr-developers/zarr-specs/issues/131
- Description of some dtypes currently supported in Zarr-Python's v2 implementation, but are not part of the core v3 spec: https://github.com/zarr-developers/zarr-specs/pull/135
- Now or later? SV: will ping after another round of changes
- V3 / awkward arrays: extension mechanism
- JP: requires 5 1-D arrays, an integer with the size, the array names, and some JSON describing the types (see the sketch at the end of these notes)
- edits are not supported. memory is shared in python implementation
- one option might be to have arrow as a subspec of zarr
- DH: potentially having a different return type.
- JM: that might be used for xarray as well
- JP: looking to warn people about the need for awkward arrays (or even making the storage opaque)
- JM: e.g. having the extension mechanism allow/enable:
- mandatory (MUST): throws exception if not installed
- suggested (SHOULD): which raises a warning
- optional (MAY): which silently ignores
- JMS: would see `open_with` as a better pattern. e.g. choosing a backend without disallowing access to the details
- DH: this seems to follow the pattern of being "part of something bigger" while still made of parsable units (e.g. dimension arrays, multiscale, etc.)
- JMS: would be interesting to make use of chunking in the layout. JP: agreed, e.g. to enable parallel writes
- xarray/nczarr (Josh) - anyone interested from the xarray side? (tabled)
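- Not the proposed extension itself, just a minimal sketch of the idea behind it: a ragged array decomposes into a few flat 1-D buffers plus a small piece of JSON describing the types, which map naturally onto a Zarr group. The group layout and the `awkward_form` attribute below are illustrative names only:

```python
import numpy as np
import zarr

ragged = [[1.0, 2.0, 3.0], [], [4.5], [6.0, 7.0]]

# Flatten into two buffers: concatenated values plus offsets into them.
values = np.concatenate([np.asarray(x, dtype="f8") for x in ragged])
offsets = np.cumsum([0] + [len(x) for x in ragged])  # [0, 3, 3, 4, 6]

root = zarr.open_group("ragged.zarr", mode="w")
root.array("values", values)
root.array("offsets", offsets)
root.attrs["awkward_form"] = {"class": "ListOffsetArray", "content": "float64"}

# Reconstruct one row without reading the others:
g = zarr.open_group("ragged.zarr", mode="r")
off = g["offsets"][:]
fourth = g["values"][off[3]:off[4]]  # -> [6.0, 7.0]
```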
**Attending**: Jeremy Maitin-Shepard, Josh Moore, Isaac Virshup, Ward
Fisher, Gregory Lee, Sanket Verma, Dennis Heimbigner
- Xarray/NetCDF (Josh)
- Trying to get everyone at a (virtual) table
- Ward happy to be involved
- Dennis: nczarr already supports the xarray convention
- Josh: development of https://github.com/xarray-contrib/datatree
> may suggest more conversations around this
- Multiscale representation? (Jeremy)
- any relationship between the dimensions? No. Not yet.
- Typically done at a higher level (like cfconventions)
- Updates from Sanket
- Release 2.11.1
- outstanding 2.11 issue:
- Thoughts on changing the default branch?
- Jeremy: code doesn't look dangerous. Josh: agreed, more a
- Jeremy: another option would be branching by configuration.
- **Isaac: status page for v3 on what’s being developed would
- **Greg: consensus on file-ending? .zr3?**
- Jeremy: could see some benefit for inspecting paths
- GSoC 2022
- Awkward Arrays
- Isaac: all of AA or ragged arrays / vlen? TBD
- Dennis: another thing that’s not in the specification
- Josh: all agreed. Having a v3 to cover the ragged arrays
would be a great outcome
- Jeremy: only on a chunk? (since a codec)
- comparison to HDF5
- Cloud Native Outreach Day: CFP Deadline 15th March
- CZI EOSS Cycle 5
**Attending**: Dennis, Josh, Sanket, Jeremy, Ward, Hailey Johnson, John K., Eric Perlman, Greg Lee
- Updates from Sanket
- [*gsoc*](https://github.com/zarr-developers/gsoc) (cf.
- Open call for mentors
- Cloud Native Outreach Day -
> (Apr 19th/20th)
- talks & workshops
- [*lightning talk
- Release of numcodecs incl.
- Entrypoints (if Martin shows up)
> flatten: *need an attribute?*
- *John: Unsure.* zfpy remembers internally.
- HJ: NetCDF meeting on codecs soon. Good to know prioritization
- blosc of course
- JM: Twitter poll? “What compressor do you frequently use in
- EP: discussed with d-v-b that lossy compressors would be nice.
- jpeg chunk storage
- JK: tried zfpy? No. from seung lab. (Meteorological data)
- JMS: imagecodecs from Gohlke.
> more explicit std. in v3 (need json for each parameter)
- Jeremy (in order of importance)
- 1\. [*consistency in referring to coordinates / dimension
- a\. Unambiguously referring to dimensions / coordinates
- Ward: in NC order of dimensions is under the hood,
> indexing into set of arrays independently of the
> underlying order. file written by netcdf-fortran
> should be indistinguishable from one written by netcdf-c (due to
> work in the library to be
- JMS: tension cf. Julia’s desire of order
- WF: similar issue with endianness? JMS: big endian
> likely dead WF: sadly no. see netcdf repo for redhat
- b\. Support for different storage orders:
- c\. Support for non-zero origin
- 2\. appetite for zarr multiscale spec?
- Move metadata from .zattrs to .zarray?
- JMS: primarily getting it outside of OME, in discrete state
- 3\. Data type syntax
- 4\. URL syntax
- notes in [*https://hackmd.io*](https://hackmd.io)?
- Yes: Josh, Greg, Ward, …
- John: not all in one document
- “Motion carries”
- Dennis: could use help on how to work struct into extension
- JMS: one of the big changes in V3 is no support for structs
> (numpy array)
- GL: several datatypes are not in v3. currently missing a document
> describing those types. e.g.
- JMS: will numpy be supported in V3?
- GL: was easy to implement but they aren’t written up as
- Untested are: unicode and bytestring (only indirectly in
- JMS: numpy doesn’t support variable length strings
- JMS: numpy struct datatypes lead to interleaved data in memory. not great for compression. perhaps better to transpose them (sketch at the end of these notes). would be nice to not be tied to the numpy model.
- JK: need something in the spec, which is why it was left out of
> the spec so far.
- GL: just a few types currently in v3. complex would be easy to
> support. can choose what goes in or not. (or warn or error …)
- DH: trying to figure out how to support as much of NC/HDF5 core
> data model as possible. **Big missing piece are: structs,
> enumerations, vlenstrings, vlenobjects (sequences)**
- JMS: as multiple arrays? DH: question of how to specify in the
> extension mechanism. (lots of possible implementations)
- Feedback process (!)
- Sanket: spoke to Alistair
- Suggestions welcome.
- Fill_value issues
- [*change in indexing
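- A small sketch of the "transpose the struct" idea: instead of one interleaved structured array, each field goes into its own Zarr array, which keeps like bytes together for the compressor. Purely illustrative, not a spec proposal:

```python
import numpy as np
import zarr

# An interleaved numpy struct (record) array ...
records = np.zeros(1_000_000, dtype=[("x", "f4"), ("y", "f4"), ("label", "u2")])

# ... stored column-wise: one contiguous, homogeneous array per field.
root = zarr.open_group("columns.zarr", mode="w")
for name in records.dtype.names:
    root.array(name, np.ascontiguousarray(records[name]), chunks=100_000)

# Reading a single field only touches that field's chunks:
x_head = zarr.open_group("columns.zarr", mode="r")["x"][:1000]
```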
**Attending**: Jonathan Striebel, Davis Bennett, Eric Perlman, Josh
Moore (JAM), Sanket Verma, Jeremy Maitin-Shepard, Hailey Johnson, Erik
Welch (Anaconda → NVIDIA), John Kirkham, Dennis Heimbigner, Dave
Mellert, Gregory Lee, Matt McCormick
- Davis: gdoc → hackmd? Sure!
- Sanket: suggestions
- updated webpage, blog update
- Intros & various links here
- Erik: [*Mr.
- C(++) implementations: quick report to Dennis from
- v3 support for non-zero origin
- Dennis: would attribute specifying the origin be enough?
have benefited from cfconventions which standardize the
meaning of attributes (in atmospheric/related domains)
- JMS: benefit of allowing indexing to be affected.
- DH: second system syndrome problem
- DB: see the utility (working with cutouts) but the workflow
puts us outside what’s in the core spec. zarr array
shouldn’t have metadata that references another array.
perhaps a nice formalization of transforms, since it defines
a coordinate space.
- JMS: julia has offset arrays (numpy is always 0-origin’ed)
- DB: meaning of 0-origin is that there’s a coordinate space
- EP: like the functionality, but it can be at a different
- DH: specify the array it came from as well as the origin?
- JMS: have translate-to-origin (cf. the hypothetical sketch at the end of these notes)
- DB: in xarray use piece of data to coordinate-aware
indexing, two methods of getting into an array.
- JAM: prioritizing the many recent spec proposals
- DB: suggest talking to other consumers and see what they
- Addition of sharding in spec v3
- Status: 2 prototype PRs (plus a [*minimal
> in zarrita)
- Moving forward with v3 would be fine.
- Try a PR on the v3 spec for a translation layer (i.e.
- sharding, checksumming, IPFS, etc.
- JMS: don’t see the relationship between sharding & checksumming
- JAM: due to content-addressable storage
- DH: just as a compressor/filter that attaches the checksum
- JS: partial read would need to be handled somewhere that’s
not in the compression
- JAM: kerchunk API of `key → (uri, offset, length)`
- JMS: for the write path it is more complicated
- DH: makes me nervous when we worry about limitations of the underlying store w.r.t. the specification. spec should be
- DH: v2 is agnostic of how chunks/metadata are laid out on disc. Don’t need to be “next” to one another. This is introducing pieces **and** that they are
- Be sure you want to get rid of the independence property
- JK: simpler way of describing it. (An ordering).
- DH: the proposal needs to specify the relationships between
chunks that are supported.
- DB: agreed complicated but 100% worth it for some domains.
- re-writing methods?
- only if uncompressed?
- JS: not yet, but doesn’t currently exist for simplicity
- rewrite index
- v3 (Greg)
- In terms of the dtypes supported, I have not worked on those
> protocol extensions related to that. Is that something I
> should spend time on? The other thing I could do is make a WIP
> PR to Dask and Xarray with minimal changes for how they could
> support v3 as currently implemented in that branch.
- remote implementations: http/s3/etc
- good point
- EP to create an issue
> / [*datatree*](https://github.com/TomNicholas/datatree)
> ([*issue*](https://github.com/spatial-image/spatial-image-multiscale/issues/8)) -
> (if Matt McCormick shows up)
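- A hypothetical sketch of the "attribute specifying the origin" option discussed above: the offset lives in user attributes (the `origin` key here is made up, not a standardized convention) and a thin wrapper applies it while indexing:

```python
import zarr

z = zarr.open("cutout.zarr", mode="w", shape=(100, 100), chunks=(50, 50), dtype="u1")
z.attrs["origin"] = [2000, 3000]  # hypothetical attribute, not part of the core spec

def read_world(arr, slices):
    """Index `arr` with world coordinates, shifting by the stored origin."""
    origin = arr.attrs.get("origin", [0] * arr.ndim)
    shifted = tuple(slice(s.start - o, s.stop - o) for s, o in zip(slices, origin))
    return arr[shifted]

block = read_world(z, (slice(2010, 2020), slice(3000, 3050)))  # shape (10, 50)
```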
**Attending**: Josh Moore, Ryan Abernathey, Eric Perlman, Hanka Medová,
Sanket Verma, John Kirkham, Ward Fisher, Dennis Heimbigner, Greg Lee,
Matt McCormick, Fabian Gans, Jackson Maxfield Brown
- C/F ordering
- C-ordering is the default (natural in Python)
- in Julia, typically looks reversed (in metadata) but without
- Seems to be what most people who are doing it are used to
- But there is confusion if you save in Python (10 rows, 5
- when in Julia it’s different.
- Ryan: is it important to be able to *write* F-ordered data?
- What’s important is what bits are close together not the
> language convention. May not have understood the
- Didn’t transpose but only change the metadata. Then can
> read as is.
- Dennis: fortran impl. reverses the dimensions before it
calls the C library which gives the correct order for the
data. netcdf-c stores everything (by default) as row-major
order. For nczarr, must look at the ordering and then decide
whether or not to reverse.
- Ryan: the fact that it’s in the spec but not supported is a
- Don’t see the use case for it right now.
- ….lots of discussion…
- Fabian: remember the reason for F-ordering is because
compression in some orders is more efficient
- John: not sure that’s (still) the case, since carefully try
to **never** transpose.
- Dennis: should be talking about column- and row-major ordering rather than C/F (see the ordering sketch at the end of these notes)
- John: need to also discuss with N5
- Xtensor-zarr implementation status (Matt)
- Being applied here:
- API / testing issues in xtensor-zarr
- Matt: running into some bugs. Working with team in France.
- Josh: did you try tensorstore or z5? c++ netcdf is also an
- C++ implementations:
- e.g. also
- Josh: would love to find a way to share code (Constantin
could see moving z5py onto xtensor-zarr)
- NetCDF WebAssembly support (Matt)
- “Support hdf5debug.c compilation with Emscripten”
- Todo: follow-up on netcdf-c repository GitHub
- C#, etc.
- Dennis: interested in using WebAssembly (was thinking about
using it for compressors for Java). Perhaps move to
- Xarray/Zarr (Josh & Jackson, minimally)
- BOpen consultants working on:
- NetCDF / xarray / Zarr compatibility
- Hierarchical support
- Data trees
- Multiscale conventions in zarr repo:
- Jackson: aicsimageio only supports xarray for different
locations (scenes, position or OME:images. some other
- then multiscales, i.e. pyramids
- datatree stuff should help. we have different groups and
> they *MAY* have multiscales.
- seems like
> would work
- don’t know if it’s “enough” for aicsimageio. (Jackson’s) dream would be to have multiple datatrees where each is a position:
  - Dataset (diff position / scene / image)
    - Resolution 0
    - Resolution 1_2D …
    - Resolution 1_3D …
  - Dataset (diff position…)
- Fabian: would be nice to have **multiple chunkings** of the same
> data (Also from Brockmann Group in their viewer)
- Jackson: currently do this (fake it) with TIFF, CZI, LIF by allowing
> the user to say *how* they want to chunk (zyx, tzyx, …)
- (old reminder)
- Time permitting
- Meeting slot: poll to be opened.
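- A small sketch of what the ordering question means in practice with zarr-python: the `order` keyword changes how bytes are laid out inside a chunk (and therefore what compresses well), not how the array is indexed:

```python
import numpy as np
import zarr

data = np.arange(50, dtype="f8").reshape(10, 5)  # 10 rows, 5 columns

zc = zarr.array(data, chunks=(10, 5), order="C", store="c_order.zarr", overwrite=True)
zf = zarr.array(data, chunks=(10, 5), order="F", store="f_order.zarr", overwrite=True)

# Indexing is identical regardless of the storage order ...
assert (zc[3, :] == zf[3, :]).all()

# ... but the bytes inside each chunk are row-major vs. column-major.
# A column-major reader (Julia, Fortran) must either transpose or read the
# dimensions in reverse, which is where the confusion above comes from.
```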
*Happy New Year!* 🎉
**Attending**: Davis Bennett, Josh Moore, John Kirkham, Greg Lee, Eric
Perlman, Tobias Kölling, Hailey Johnson, Ward Fisher
- Davis: Move forward with set_write_empty_chunks?
- Eric: any metadata to say that it was written this way?
- John: not fatal just will be empty value.
- DVB: *different* problem with shards.
- Have a script to test performance
- Depends on latency, compressor, etc.
- Testing of emptiness isn’t the most performant
- **No objections. Moving forward with 2.11.**
- Anything for spec? It is good to go.
- Greg ok with 2.11? Yes. Nothing needs pulling or adding.
- Need a release note. (TBD)
- Spent some time on V3 PRs.
- Sixth passes CI.
- Consolidated metadata now works (but we want that to be an
- Unimplemented stores
- ABSStore & N5Store
- Davis to review N5Store
- Do we need a Hierarchy object? (e.g. when you create a group you have to give it a path) – main difference (causes issues in tests due to create_store needing a path)
- JK: Davis’ PR for trimming chunks?
- DVB: Languishing. Synchronization issue in appending with
- JK: design decision to make appending easy. But we need to
> design whether or not we will handle it in sharding.
- JM: Moving forward with community manager position (2 years)
- checksumming (structure looks like IPLD; store sketch at the end of these notes)
- JSON document with hash for chunks
- how could we write out [*content addressable
- MutableMapping that’s write only (remembers content
- DVB: use case?
- nice if you want to calculate checksums, even for part
> of an array. (helps if it’s in a merkle tree)
- makes it easier to share the data in multiple pieces.
> you can have copies everywhere.
- interesting optimizations for big datasets: to get
> difference between two variables, you want them
> defined on the same grid (even if you don’t care what
> the grid is)
- JK: cloud store that’s write-once, since you only need
> to write each hash once. (bonus of time exploration
> feature with history of the chunks)
- TK: discovered over the holidays that the
> content-addressable & the IPFS solutions could be the
- [*IPLD*](https://ipld.io/) has a special type that’s
a link via a content-identifier
- JAM: “fsspec concern”? TK: reading yes, but for writing
> it gets more difficult. No key-value concept. May also
> be useful to express the content-identifiers to a
> higher-level for optimizations.
- JK: explore implementing on top of MutableMapping
> interface and that’s what Zarr uses. Naive idea, Zarr
> special key that has metadata but address of each
> chunk. Gets difficult since the top-special-key needs
> to be writable. Josh: perhaps that key *is the* Zarr.
> JK: grab all of them? request small JSON things from
> the cloud. Takes place of consolidated metadata? (“one
> big request”) TK: you would also need to write it out
> as well (only visible afterwards). JK: start running
> in to ACID. Tertiary concern, but lot of writing might
> create chunks that you no longer care about that need
> cleaning. ZHierarchy could be like UTC time (time
> since last debug) → git for the cloud.
- On non-zarr stuff. (Big funded push is done now)
- Zarr API is properly working there.
- housekeeping, etc.
- Josh: interesting issue from the Julia community on order swapping
- Nvidia GPU-based Zarr to avoid host-memory transfer
- Requires 2 pieces
- CuPy part (JK needs to review that
- Compression (someone working on that now)
- DVB: which codecs? most basic ones. blosc is unclear.
> DVB: super interested! preferably a simple one incl.
> Java clients, but primarily transfer to GPU is a
> bottleneck. Happy to test.
> (snappy, etc.)
- DVB: unsigned bits. no compression automation. use case is getting output of an ML model, making predictions on 3D arrays. (writing use case includes multi-resolution pyramids)
- TK: climate models - too much data that needs
> compression (GRIB backed …) Trying to convince them to
> use Zarr. But **floats**.
- JK: happy to get a list of compressors. (on an issue or
> via email) TK: currently investigating.
- JAM: data-apis?
- JK: not sure we want to be a provider, but as a consumer. being
> a provider gets us into providing consumption.
- Ward: That is a long-standing issue with netCDF as well. re:
> pressure to start performing computations w/in the netCDF
- DVB: what happens with array inception? (dask-zarr-dask)
- JK: Martin adding entrypoint support into numcodecs, perhaps
> something similar to say a zarr.array entrypoint
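- An illustrative sketch (not an existing Zarr API) of the "write-only MutableMapping that remembers content" idea from the checksumming discussion: chunk bytes are stored under their own hash and a manifest records logical key → content identifier:

```python
import hashlib
from collections.abc import MutableMapping

import numpy as np
import zarr

class ContentAddressedStore(MutableMapping):
    """Toy content-addressed store: values live under their sha256 digest;
    a manifest maps Zarr keys (e.g. '0.0', '.zarray') to digests."""

    def __init__(self):
        self.objects = {}   # digest -> bytes (effectively write-once)
        self.manifest = {}  # zarr key -> digest

    def __setitem__(self, key, value):
        value = bytes(value)
        digest = hashlib.sha256(value).hexdigest()
        self.objects.setdefault(digest, value)  # identical content stored only once
        self.manifest[key] = digest

    def __getitem__(self, key):
        return self.objects[self.manifest[key]]

    def __delitem__(self, key):
        del self.manifest[key]  # objects are retained; garbage-collect separately

    def __iter__(self):
        return iter(self.manifest)

    def __len__(self):
        return len(self.manifest)

# zarr-python 2.x accepts any MutableMapping as a store:
store = ContentAddressedStore()
z = zarr.open_array(store=store, mode="w", shape=(4, 4), chunks=(2, 2), dtype="i4")
z[:] = np.eye(4, dtype="i4")
print(len(store.manifest), "keys ->", len(store.objects), "unique objects")
```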
**Attending**: Josh Moore, Eric Perlman, Tobias Kölling, Ryan Williams, John
- no outreachy
- dozen or so applications for comm. mgr
- scalable minds working on sharding
- b-open for xarray, multiscales, and more extension-like stuff
- State of the world, Eric? “why doesn’t Zarr…”
- (overall goal of replacing TIFF stacks with chunked zarr)
- [*axes*](https://github.com/ome/ngff/pull/57) (OME-Zarr)
- EP: read-only? still open
- EP: simple read-only would be a great start. (\~20 files)
- some overlap with HDF5. (Josh: some benchmarking)
- JM: kerchunk in front of 20 HDF5s
- EP: JSON too large
- TK: hierarchy of indices so you don’t have to load all
- TK: and recursive shards/zarr?
- JM: Jeremy said it would be difficult for the client to make
use of more than 2
- EP: dask lazy-arrays of dask lazy-arrays of ….
- JM: worried that something is in the python implementation
as opposed to the “protocol” (shape + chunks +
dimension_separator := all keys for chunks)
- EP: “logical encoding” versus actual “path”
- JK: indexing a point in an octree (paths on each dimension)
- separate issues. (optimal) access patterns for
> particular use
- two different things going on:
- how deeply to split
- how it gets implemented
- almost like a compression
- EP: logical key v. physical key
- …. lots of talking (unfortunately not recorded)
- JM: love the hierarchical “smart-mutable mapping” but Zarr
protocol *avoids* knowing how to split byte streams into
- This gets us to the issue of (offset, length)
- JK: spent lots of time with dask serialization protocol
- “have a header on the byte string” “how many do I have?”
- shift all the metadata somewhere else? **.zidx** (for the whole array?) (sketched at the end of these notes)
- mostly thinking of archival data (frequent read)
- JM: Allowing `Array._chunk_key` to return (key, offset,
- Tabled: netcdf-java (AGU permitting) - Josh
> [*Maven Central*](https://search.maven.org/search?q=g:edu.ucar) v.
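- A rough sketch of the logical-key versus physical-key idea (in the spirit of kerchunk's references and the **.zidx** suggestion above): a small index maps each Zarr chunk key to a byte range inside some larger object. Paths and numbers are made up for illustration:

```python
# Hypothetical index: logical Zarr key -> where the bytes physically live.
refs = {
    ".zarray": {"path": "big.shard", "offset": 0,       "length": 350},
    "0.0":     {"path": "big.shard", "offset": 350,     "length": 120_000},
    "0.1":     {"path": "big.shard", "offset": 120_350, "length": 118_432},
}

def fetch(key):
    """Resolve a logical chunk key with a single ranged read."""
    ref = refs[key]
    with open(ref["path"], "rb") as f:
        f.seek(ref["offset"])
        return f.read(ref["length"])
```

- fsspec's "reference" filesystem (kerchunk) implements essentially this mapping on the read side; the open question above is what the write path and an `Array._chunk_key`-style hook would look like.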