---
tags: zarr, Meeting
---
# Zarr Bi-weekly Community Call
### **Check out the website for previous meeting notes and other information: https://zarr.dev/community-calls/**
Joining instructions: [https://zoom.us/j/300670033 (password: 558943)](https://zoom.us/j/300670033?pwd=OFhjV0FHQmhHK2FYbGFRVnBPMVNJdz09#success)
GitHub repo: https://github.com/zarr-developers/community-calls
Previous notes: https://j.mp/zarr-community-1
## 2023-03-22
**Attending:** Hailiang Zhang (HZ), Sanket Verma (SV), Dieu My Nguyen (DMN), Dennis Heimbigner (DH), John Kirkham (JK), Norman Rzepka (NR), Jeremy Maitin-Shepard (JMS), Johana Chazaro-Haraksin (JCH), Davis Bennett (DB)
**TL;DR:**
SV started the community meeting with a few updates. One of the most important updates was that we have a new page on the website for the Zarr adopters. Check here: https://zarr.dev/adopters/.
After this, HZ presented his ZEP, followed by a Q&A session. After the Q&A, we concluded the meeting with a few action items for HZ, which he'll take care of in the upcoming weeks.
**Updates:**
- New blog post: https://zarr.dev/blog/ome-2022/ 🎉
- Zarr Adopters live at: https://zarr.dev/adopters/
- Zarr office hours
- Josh fixed the conda-forge error, check here: https://github.com/zarr-developers/zarr-python/pull/1364
**Meeting minutes:**
- Hailiang's [ZEP0005](https://zarr.dev/zeps/draft/ZEP0005.html) presentation - check the recording [here](https://drive.google.com/file/d/13xkl-i8pCSnv42KeqX6KLtIRFln5sf6k/view?usp=share_link) 🎥
- Q&A from the session is below 👇🏻:
- JMS: Mathematical equation would be helpful
- HZ: Working on a paper for more details - internal for now - will be publishing it soon!
- DB: Is it similar to summed area table? - https://en.wikipedia.org/wiki/Summed-area_table
- HZ: Not exactly - trying to achieve something more Zarr-centric - making the accumulation flexible and dimension agnostic
- JMS: I think yours is trying to solve the same problem as a [summed-area table](https://en.wikipedia.org/wiki/Summed-area_table) solves. But I think you’re trying to do it without storing something the same size as the original array - perhaps by imposing some additional restrictions on the type of queries you can do (see the sketch at the end of this section)
- HZ: This is more Zarr related - Jeremy has more comments on the PR - stride boundary aligned with the Zarr boundary
- JMS: Don’t understand it fully - a mathematical equation would be helpful
- SV: Is there a reference implementation?
- HZ: I’m working on the code - will open-source it by the end of the summer
- HZ: What is not clear, Jeremy?
- JMS: Does the chunk need to be aligned?
- HZ: For this implementation the chunks need to be aligned
- JMS: Trying to understand the proposal in a general sense
- HZ: We can achieve the accumulation/averaging service faster - [Giovanni](https://giovanni.gsfc.nasa.gov/giovanni/) will be using this accumulation
- DB: Would you share some cloud backed example?
- HZ: Sure, can do!
- SV: Having example .zarr data from before and after the accumulation, with the chunk attributes, would be good for everyone!
- HZ: Sure, can do and can also add the script for the data generation!
- DB: Having a reference of summed area table would be good thing!
- HZ: Sure, can do that!
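For reference, the summed-area table trick that DB and JMS mentioned: precompute one cumulative sum per axis, and any axis-aligned box sum becomes a handful of lookups. A minimal numpy sketch (2-D case; the names are illustrative, not from the ZEP):

```python
import numpy as np

# Build a summed-area table: one cumulative sum per axis.
data = np.random.rand(100, 100)
sat = data.cumsum(axis=0).cumsum(axis=1)

def box_sum(sat, r0, r1, c0, c1):
    """Sum of data[r0:r1, c0:c1] via four lookups (inclusion-exclusion)."""
    total = sat[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= sat[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= sat[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total
```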
## 2023-03-09
**Attending:** Josh Moore (JM), Dennis Heimbigner (DH), Alan Liddell (AL), Dieu My Nguyen (DMN), Davis Bennett (DB), Brianna Pagán (BP), Isaac Virshup (IV), Sanket Verma (SV), Martin Durant (MD), Jeremy Maitin-Shepard (JMS)
**TL;DR:**
SV started the meeting by announcing that the Outreachy phase was a success! 🚀 After this, we briefly discussed why sharding should be a codec, which JMS initiated. Then, DMN and BP had some questions about how you can continuously update your Zarr stores; everyone from the community had good insights on that. Next, DB showcased what he has been working on for some time, and IV brought up the discussion on strings as codecs or dtypes. Lastly, MD informed everyone about his recent PR to the numcodecs repo and showcased his work on a Rust-backed Python filesystem.
**Updates:**
- Outreachy ended on 3/9
- Check the work here: https://github.com/caviere/testing_zipstore
- [ZEP0001](https://zarr.dev/zeps/draft/ZEP0001.html) entering the review period soon!
- Sharding to a codec (JMS)
- MD: push it down to the storage layer
- IV: Documentation for transformers
- SV: open issue with purl URLs
**Meeting notes:**
- Slow but some progress on geozarr spec (BP)
- Anyone have contacts we (NASA) can chat with on how continuously updated zarr stores are handled? i.e. someone is in the middle of a calculation on a zarr store, but the zarr store is updated... best practices? https://github.com/CCI-Tools/zarr-cache (BP + DMN)
- DMN: trying to keep a consistent view for the viewer.
- MD: user should see old or new but nothing in-between
- IV: updating attributes which are out-of-date with the arrays
- see Ryan's Iceberg-based solution. Single kerchunk like file.
- MD: see recent blog post. IV: see Ryan's issue.
- https://martindurant.github.io/blog/berg/
- https://martindurant.github.io/blog/mutable-kerchunk/
- DMN: updating old data as easy? MD: writing a chunk is fine
- JMS: versioning would be more of an issue. MD: again see iceberg-y.
- https://earthmover.io/
- DB: example of pydantic classes for OME-NGFF
- helps with validation
- https://github.com/JaneliaSciComp/pydantic-ome-ngff
- IV: strings as codecs or dtypes
- JM: v2 issue is because lack of extension points
- JMS: implementation issue?
- IV: potentially meta-array? object arrays as they are now
- IV: dtype as the final codec (the buffer in memory)
- JMS: different for sparse data types (when/if)
- IV: sparse arrays aren't compatible with v3 data type system (and that's fine)
- JMS: saw having each chunk a sparse array
- IV: may over-complicate what zarr is. probably going to do sparse array == zarr group
- how to represent a sparse array is hard to agree on
- JMS: was attempting to move v3 towards dtype being more of an abstract representation
- IV: what gets written per dtype with no compression
- JMS: default for integer is little-endian
- IV: on big-endian machine would want to skip the final codec
- JMS: if you have an array after the codecs run, then you apply the default codec for the dtype (see the sketch at the end of this section)
- MD: Added a PR in numcodecs: https://github.com/zarr-developers/numcodecs/pull/422
- And also talked about: https://github.com/martindurant/rfsspec
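A rough sketch of the "dtype as the final codec" idea from the discussion above: the last codec in the chain fixes the on-disk byte layout (little-endian by default for integers), so an implementation on a machine with matching endianness could skip it. Helper names are made up for illustration, not taken from any spec draft:

```python
import numpy as np

def final_codec_encode(arr: np.ndarray) -> bytes:
    # fix the on-disk layout: little-endian, C order
    le = arr.astype(arr.dtype.newbyteorder("<"), copy=False)
    return np.ascontiguousarray(le).tobytes()

def final_codec_decode(buf: bytes, dtype) -> np.ndarray:
    # interpret the bytes using the same fixed layout
    return np.frombuffer(buf, dtype=np.dtype(dtype).newbyteorder("<"))
```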
## 2023-02-22
**Attending:** Davis Bennett (DB), Sanket Verma (SV), Ward Fisher (WF), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Norman Rzepka (NR), Eric Perlman (EP), Dieu My Nguyen (DMN), Virginia Scarlett (VS)
**TL;DR:**
We started the meeting by discussing some possible performance improvements for sharding. [Ryan Abernathey](https://github.com/rabernat) did some [benchmarks](https://github.com/zarr-developers/zarr-python/discussions/1338) a few weeks ago, which turned out well. Then, the community offered a few solutions to make it even better. Jonathan is working on a PR for the same. After that, DB wondered how he could store only the indexes in the cache using Zarr.
SV asked what everyone thinks about the [all-contributors](https://github.com/all-contributors/all-contributors) bot. He also asked whether the community would like to participate if Zarr were to organise a hack week. Both ideas received a 👍🏻 from everyone.
Lastly, DB showed us what he’s working on, and JMS discussed the `zarr-specs` issue he submitted recently.
**Updates:**
- Zarr-Python 2.14.1 release with sharding and new docs theme (PyData-Sphinx)
- SciPy 2023 deadline next week (03/01), if you want to collaborate on a tutorial with us, now's the time
**Meeting Notes:**
- DB: Sharding is slow - according to https://github.com/zarr-developers/zarr-python/issues/1343
- NR: Jonathan is working on a PR
- DB: Using MutableMapping as the store interface for `getitems` and `setitems` is the probable cause - the Zarr array API has some batching logic, which may also lead to slowing down
- DH: What type of caching model is it using?
- NR: There is no caching! - individual chunks are loaded through byte ranges
- DH: having cache would be fine
- NR: maybe! depends on the use case
- JMS: Zarr array API does have support for batch reading and writing - if you’re reading multiple chunks from a single shard, the overhead would be big - you need to tell the user to cache the index - maybe twice the read requests because of the index and the shard
- NR: Having index in cache makes sense
- DB: How to store the indexes in the cache? Maybe JSON? What would be a good storage format? How about SQLite? (for my use case)
- JMS: Unrelated to cache - you can list the chunk
- DB: recursive and expensive
- JMS: S3 can give you a flat list, may not solve the problem but the abstraction would help
- JMS: are you doing re-encoding?
- DB: yes
- JMS: SQLite could be a pretty reasonable solution (see the sketch at the end of this section)
- DB: Is there a clever way to do it using Zarr itself?
- JMS: keys can be ordered lexicographically
- DB: If your data is too big then your metadata becomes another type of data
- SV: What do you think about using [all-contributors](https://github.com/all-contributors/all-contributors) and hosting a Zarr hack week to sync `zarr-python` implementation to V3?
- All: Sounds good for `all-contributors` :+1:
- EP: Having something online would be good and I'd participate
- DB: Sounds good and I'd participate
- WF and JMS: :+1:
- DB: Defining abstract structure of zarr array like keys, properties, metadata - OME-Zarr has a try catch block - Have put some together with [pydantic](https://docs.pydantic.dev/) - it’s simple to generate zarr group - taking abstract tree representation and running it backwards to create Zarr groups
- Repo: https://github.com/JaneliaSciComp/pydantic-ome-ngff
- SV: Could be something like `pip install ztree`
- DB: Trying to define protocol for HDF5 as well - could dump the hierarchy into Zarr or HDF5 container
- DB: Structural subtyping stuff
- SV: Would it be good to show a demo?
- DB: Yes!
- JMS: https://github.com/zarr-developers/zarr-specs/issues/212
- DB: Maybe similar to what vanilla N5 does!
- JMS: Reasonable to add a metadata option - planning to add an extension ZEP for this
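On DB's question about where to keep shard indexes, a minimal sketch of the SQLite option JMS suggested (the schema and names are made up for illustration): each chunk key maps to its (offset, length) within a shard.

```python
import sqlite3

# Hypothetical cache of shard indexes: chunk key -> (offset, length).
con = sqlite3.connect("shard_index.sqlite")
con.execute(
    "CREATE TABLE IF NOT EXISTS chunk_index "
    "(key TEXT PRIMARY KEY, offset INTEGER, length INTEGER)"
)

def put(key, offset, length):
    with con:  # commits on success
        con.execute("INSERT OR REPLACE INTO chunk_index VALUES (?, ?, ?)",
                    (key, offset, length))

def get(key):
    row = con.execute("SELECT offset, length FROM chunk_index WHERE key = ?",
                      (key,)).fetchone()
    return row  # None if the index entry is not cached yet
```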
## 2023-02-08
**Attending:** Davis Bennett (DB), Josh Moore (JM), Sanket Verma (SV), Virginia Scarlett (VS), John Kirkham (JK), Dieu My Nguyen (DMN), Hailey Johnson (HJ), Isaac Virshup (IV), Martin Durant (MD)
**TL;DR:**
SV started the meeting by letting everyone know he is working on adding a webpage on our website to showcase Zarr adopters. So if you’re using Zarr in any way and want to showcase your logo, please follow the instructions in the issue here: https://github.com/zarr-developers/community/issues/60.
He also asked around if submitting a paper in [JOSS](https://joss.theoj.org/) for Zarr-Python would be a good idea, and overall there was a positive response. After this, we discussed submitting tutorials for SciPy 2023 and adding sharding in the next release.
The meeting ended with MD sharing what he was building during the HackWeek at Anaconda and DB asking about the performance difference with/without sharding.
**Updates:**
- Zarr Adopters - get your logos in!
- https://github.com/zarr-developers/community/issues/60
**Meeting Minutes:**
- [JOSS](https://joss.theoj.org/) (SV)
- DB: takes time but can't hurt.
- IV: JOSS review took over a year for anndata
- SciPy 2023 Tutorials (SV)
- Deadline: Feb 22 (Conference in July)
- Talk focuses on the evolution of the spec since 2019
- Duration: 4 hours
- Basics but also different domains (geospatial, bioimaging, ...)
- HJ: planning on being there
- DMN: hybrid? Unsure. Hope to attend. Earth science data.
- SV: New Zarr-Python release (2.14) including sharding and new theme?
- small issues before release
- MD: https://github.com/martindurant/rfsspec
- hack week at anaconda. async fetching of list of urls, start/stop ranges
- can load zarr data using rust native concurrency (not asyncio; python overhead)
- fsspec currently works around asyncio re-entrance issues by using a dedicated io thread
- i.e. the dask use case suffers from GIL contention
- some parquet use cases have substantial overhead (see the range-fetching sketch at the end of this section)
- also would enable pyodide or could be compiled to wasm
- a step towards browser-ability
- see https://github.com/fsspec/filesystem_spec/pull/1180
- DB: perf. difference? worst case the same. with lots of threads, could see speed up but unclear how much (maybe 2x)
- JK: with sharding? sure.
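The batched byte-range fetching MD described is roughly the pattern fsspec exposes through `cat_ranges`; a small sketch with made-up URLs (the Rust backend in rfsspec aims to serve the same access pattern without the asyncio overhead):

```python
import fsspec

fs = fsspec.filesystem("http")
# hypothetical chunk URLs and byte ranges
urls = [
    "https://example.org/data.zarr/0.0",
    "https://example.org/data.zarr/0.1",
]
# one call fetches all ranges concurrently: (paths, starts, ends)
blobs = fs.cat_ranges(urls, [0, 0], [65536, 65536])
```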
## 2023-01-25
**Attending:** Davis Bennett (DB), Sanket Verma (SV), Josh Moore (JM), Ethan Davis (ED) - Unidata, Ward Fisher (WF), Alan Liddell (AL), Brianna Pagán (BP), Erik Welch (EW), Hailey Johnson (HJ), Jeremy Maitin-Shepard (JMS), Isaac Virshup (IV)
**TL;DR:**
**Release update**: [Zarr Python](https://github.com/zarr-developers/zarr-python) 2.13.6 along with 2.13.4 and 2.13.5 is out! Check the release notes [here](https://zarr.readthedocs.io/en/stable/release.html#release-notes). [SciPy 2023](https://www.scipy2023.scipy.org/present) CFP is open until 22nd February. If you’re planning to submit a proposal involving Zarr, feel free to reach out to our [community manager](mailto:svsanketverma5@gmail.com), and he’ll help you out. The meeting started with DB discussing his recently opened [PR #1323](https://github.com/zarr-developers/zarr-python/pull/1323). Next, JM gave us updates on [AI4Life](https://ai4life.eurobioimaging.eu/) and initiated a discussion on GeoJSON. After this, we discussed [ZEP0004](https://github.com/zarr-developers/zeps/pull/28) and [Unicode names](https://github.com/zarr-developers/zarr-specs/issues/56). Next, BP gave an update on GeoZarr bi-weekly meetings, and lastly, SV asked a general question on visualising N-dimensional arrays.
**Updates:**
- Zarr-Python [2.13.4](https://zarr.readthedocs.io/en/stable/release.html#release-2-13-4) (outreachy updates), [2.13.5](https://zarr.readthedocs.io/en/stable/release.html#release-2-13-5) and [2.13.6](https://zarr.readthedocs.io/en/stable/release.html#release-2-13-6) is out!
- Josh: sorry, I stuttered.
- Migration to PyData Sphinx theme almost complete; check here: https://github.com/zarr-developers/zarr-python/pull/1242
- SciPy 2023 CFP is open until 22nd February
- https://www.scipy2023.scipy.org/
**Open Agenda (add here 👇🏻):**
- [automated formatting](https://github.com/zarr-developers/zarr-python/pull/1323) - Davis
- what's the line in the sand to get it in?
- ping the core devs, getting sharding in
- Declarative hierarchy (Davis)
- Create object then serialize it to disk
- JMS: maybe a layer on top of zarr (multi-formats)
- https://pypi.org/project/h5py-like/0.5.1/
- [AI4Life](https://ai4life.eurobioimaging.eu/), GeoJSON, etc. (Josh)
- JM: "GeoJSON is JSON. Zarr has JSON. Can you put GeoJSON in Zarr?"
- IV: something more like GeoArrow
- ED: cfconventions (trajectories, etc.) for bio?
- probably not
- IV: https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation
- IV: convention for ragged array (start in Zarr space) - least-common denominator
- IV: shapely support for GeoArrow.
- [get_items](https://github.com/zarr-developers/zarr-python/pull/1131) (Sanket) - Tabled (comments welcome)
- ZEP4 (EW)
- looking to make an embeddable spec for sparse arrays and sparse tensors
- natural fit for a ZEP.
- JM: previously we had discussed an extension, right?
- EW: convention so it can be stored along with other things.
- IV: don't feel super strongly about zarr being aware. but want a consistent way to structure things.
- ZEP4 is more collecting what people might be doing
- IV: break ZEP4 into a completed set of conventions and then we can identify later.
- https://github.com/zarr-developers/geozarr-spec and https://github.com/zarr-developers/geozarr-spec/issues/2 (Brianna)
- BP: initial meeting last week on a governance group.
- Repo: https://github.com/zarr-developers/geozarr-spec
- Meeting Notes: https://hackmd.io/@MSBYE-SmSS-O706S4WXH0Q/geozarr-spec-swg-20230119
- meeting biweekly. invite open. Wednesday mornings 11am, Eastern.
- submitting to OGC approval by the summer.
- Unicode for names (Ethan): https://github.com/zarr-developers/zarr-specs/issues/56
- JMS: very helpful. normalization important.
- tension of zarr not being its own container format
- normalization by default breaks that in a subtle way
- ED: normalize on serialization, on query, etc. (see the normalization sketch at the end of this section)
- can see optional, but also advice on whether to do it and why.
- world going to UTF-8
- Visualization of Zarr arrays
- DB: neuroglancer (client-side website)
- HTTP/S3 accessible (even static fileserver)
- python code that spins up a browser
- http://vitessce.io
- https://imagej.net/plugins/bdv/
- https://www.unidata.ucar.edu/software/idv/
- https://github.com/google/neuroglancer
- https://github.com/bigdataviewer
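On the normalization point: the same visible name can be more than one codepoint sequence, which is why normalizing at serialization and query time matters. A quick illustration in Python:

```python
import unicodedata

# "é" can be one codepoint (U+00E9) or "e" plus a combining accent (U+0301)
a = "caf\u00e9"
b = "cafe\u0301"
assert a != b  # distinct keys in a plain store, though they render the same
assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```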
## 2023-01-11
**Attending:** Jeremy Maitin-Shepard (JMS), Josh Moore (JM), John Kirkham (JK), Sanket Verma (SV), Brianna Pagán (BP), Martin Durant (MD)
**TL;DR:** Happy New Year, folks! 🎉
Welcome to the first Zarr community meeting of 2023! The discussion started with John’s briefing about his recent visit to the Allen Institute. After this, Brianna initiated a discussion on the GeoZarr spec, which led to discussing the various .zarr datasets NASA is using and how they are storing them. Brianna has been working on .zarr datasets, and her team will publish them shortly. Finally, we concluded the meeting by discussing async-zarr, Zarr’s new R implementation, and PR #1131 in Zarr-Python.
**Updates:**
- Happy New Year! 🥂
- next zarr release
- release notes: https://hackmd.io/_P8Q0_cFT-6ymtYJSm9wnA?view#Final-Release-Notes-for-21342140
- for John: https://github.com/zarr-developers/zarr-python/pull/1285
- Mads added 2 commits
- followed by getting 1096 and 1111 out
**Open Agenda (add here 👇🏻):**
- Allen Institute Cell/Brain meeting (JK)
- day of meetings (7)
- deep learning, image processing (Zarr as input)
- storage management for multiple groups (TIFFs to Zarrs)
- invited people to this meeting
- "what should (we) do with our workflow"
- used poster from neuroscipy and some notebooks
- also explanations of HDF5 (hierarchical storage)
- interest in benchmarks
- daily mouse brains! ergo throughput!
- HHMI -> Allen -> CZI: write c++/rust (typed compiled) support
- Nathan Clack: https://github.com/nclack
- no one slide deck would have helped
- maybe "basic getting started" from the poster
- questions:
- pyramidal / ome-zarr (extensions)
- SV: showing Henning's drawings at NASA
- geo-zarr spec (BP)
- want to push for a v1 in the next 6 months
- organizing with Ryan, Chris Holmes (Planet) and folks from OGC
- BP: had group email chain
- branching / forking his repo
- specs for making zarr stores compatible with e.g. xarray
- MD: coordinate transforms. own interpretations in e.g. gdal
- BP: searchability of the zarr stores. keeping them inline with other collections
- MD: including bounding box? yes. like stac. QC units.
- BP: worried about the spec not being done and needing to re-publish
- MD: have a ZEP/discussion place about "should this be handled by Zarr or not?"
- SWG: steering working group for OGC to write the proposal
- MD: _could_ start on top of affine transform (but exists in cfconventions)
- CRS as a short fall of cfconventions
- SV: https://search.earthdata.nasa.gov/search ?
- BP: yes. where it's in AWS.
- first time having duplicate datasets (by format)
- what needs to be updated in the Common Metadata Repository
- Jennifer Wei discussing with SV, don't see Zarrs
- BP: don't have the ability to make them available
- giovanni-in-the-cloud: on-prem caches converted to zarr this year
- earthdata is the GUI of CMR (looking at AWS)
- MD: doing auth signing? LONG CONVERSATION. (HTTP Tokens via proxy)
- BP: https://github.com/nsidc/earthaccess/issues/188#issuecomment-1371626230
- BP: https://hackmd.io/T73AtFTnS4C_Ez9JfGNldA?view
- MD: anaconda interested in being a conda frontend for data via intake
- do search, pull credentials, redirect, etc.
- BP: people are moving to zarr stores with or without them. archives are alive. "static zarr store is not true"
- 14000 collections. tie into CMR API.
- https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
- MD: intake no listing
- BP: earthaccess is voluntarily maintained (APIs across NASA: https://github.com/nsidc/earthaccess)
- "store" vs "collection"
- store is the .zarr directory
- collection is a product
- files under that are granules
- NC and TIFF can be archived via granule
- (pangeo-forge is calling this a store)
- want to associate the store with the collection
- STAC calls those ... Need a glossary.
- JK: https://numpy.org/doc/stable/user/numpy-for-matlab-users.html
- async-zarr: https://github.com/martindurant/async-zarr
- Blog by Martin: http://martindurant.github.io/blog/async-zarr/
- Steps needed to release it as a package
- Transferring the repo under /zarr-developers?
- Writing tests?
- Adding Github actions?
- Testing the browser is tricky, but not something for MD (i.e. requires effort from someone else) but useful independently for the two use cases
- FYI: https://github.com/grimbough/Rarr
- getitems: https://github.com/zarr-developers/zarr-python/pull/1131
- JK: might be useful for other types of arrays (see the batched-store sketch at the end of this section)
- JMS: worried that every line of code needs to change. do it as core?
- JK: plugin pieces - store & compressors
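For context on the `getitems` discussion, a toy sketch of what a batched store entry point can look like (a hypothetical class, not PR #1131's actual interface): the store receives all keys at once and is free to fetch them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

class BatchedStore(dict):
    """Toy store: a dict with a batched read method."""

    def getitems(self, keys):
        # fetch every key in one call; a remote store could issue
        # concurrent range requests here instead of dict lookups
        with ThreadPoolExecutor() as pool:
            return dict(zip(keys, pool.map(self.__getitem__, keys)))

store = BatchedStore({"c/0": b"...", "c/1": b"..."})
print(store.getitems(["c/0", "c/1"]))
```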
## 2022-12-14
**Attending:** Davis Bennett (DB), Josh Moore (JM), Norman Rzepka (NR), Dieu My Nguyen (DMN), Dennis Heimbigner (DH), Sanket Verma (SV), Hailey Johnson (HJ), John Kirkham (JK)
**TL;DR:** [Numcodecs 0.11](https://github.com/zarr-developers/numcodecs/) has been released with support for Python 3.11. In addition, we’re planning to migrate the Zarr documentation to the PyData Sphinx Theme; please look at [PR #1242](https://github.com/zarr-developers/zarr-python/pull/1242). After sharing these updates, NR started a discussion on `ensure_bytes` by referring to [PR #1285](https://github.com/zarr-developers/zarr-python/pull/1285). Then DB asked whether anyone had tried GPU direct storage, to which JK referred to a blog post, which can be seen [here](https://xarray.dev/blog/xarray-kvikio). And we closed the year 2022 by asking everyone what they have been working on recently and their plans for the New Year!
**Updates:**
- Meetings cancelled for next round and will start again in the new year
- New game: something exciting/helpful from the last two weeks
- violin, escape rooms, snow, trains, car parties, travels!
- numcodecs 0.11 release with Python 3.11 support; see [#377](https://github.com/zarr-developers/numcodecs/issues/377)
- PyData sphinx theme migration [#1242](https://github.com/zarr-developers/zarr-python/pull/1242)
- see https://zarr--1242.org.readthedocs.build/en/1242/
**Open Agenda (add here 👇🏻):**
- (Tabled) async-zarr: https://github.com/martindurant/async-zarr
- Blog by Martin: http://martindurant.github.io/blog/async-zarr/
- Steps needed to release it as a package
- Transferring the repo under /zarr-developers?
- Writing tests?
- Adding Github actions?
- NR: `ensure_bytes` PR [#1285](https://github.com/zarr-developers/zarr-python/pull/1285)
- JK: trying to avoid copies
- F/C order wackiness ensues
- tldr: add a try/catch block with a fallback that does a copy (see the sketch at the end of this section)
- DB: anyone try GPU direct storage?
- JK: https://xarray.dev/blog/xarray-kvikio
- DB: and blosc, etc.?
- JK: blog didn't compress. but the library supports some standard ones
- JM: some interest arising in ZFS. benchmarking needed.
- DB: anything that indexes _existing_ chunks?
- NR: have a utility function over the storage keys (by listing the filesystem)
- in webknossos library
- SV: https://github.com/zarr-developers/zarr-python/issues/538 follow up?
- DB: will open a PR
- DB: discord?
- SV: using it for outreachy
- publicly opening a discussion on how/what/why/when/etc.
- DMN: will get in touch re: ZEP soon
- HJ: working on filters, scales & offsets
- :christmas_tree:
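A minimal sketch of the try/catch-with-copy-fallback pattern mentioned under #1285 (a hypothetical helper, not the PR's actual code): take a zero-copy bytes view when the buffer is C-contiguous, and copy otherwise — the Fortran-order case is exactly the "wackiness" noted above.

```python
import numpy as np

def ensure_bytes(buf):
    """Try a zero-copy bytes view first; fall back to a copy."""
    try:
        # works only for C-contiguous buffers; no copy is made
        return memoryview(buf).cast("B")
    except TypeError:
        # e.g. Fortran-ordered or non-contiguous arrays: copy as a fallback
        return np.ascontiguousarray(buf).tobytes()
```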
## 2022-11-30
**Attending:** Josh Moore (JM; aikido/books/coffee), John Kirkham (JK; hiking, reading), Sanket Verma (SV; video games & pixel art), Dennis Heimbigner (DH; drawing/guitar), Eric Perlman (EP; travel/bikes/trains), Martin Durant (MD; who has the time), Norman Rzepka (NR; kids & cooking), Ward Fisher (WF; video games/3d printers/wood & leather working), Isaac Virshup (IV; cooking < eating, bouldering), Davis Bennett (DB; birds & computer games), Brianna Pagán (BP; ultra running & my doggo)
**and welcoming**: Dieu My Nguyen (DMN; NASA, replacing Brianna; CS in computational bio; travel, house plants, hiking)
**TL;DR:** We have a new community member who joined today: Dieu My Nguyen from NASA. Thanks for attending the community meeting Dieu! 🙌🏻
Two Outreachy interns have been selected to work with Zarr for the next 3 months. BP asked if anyone is working on partial Zarr stores. After that, DB raised some opinions on the zarr-python slicing API, and lastly DH had some questions related to the V2 spec.
**Updates:**
- Outreachy Interns to work on Tutorials (AWA Brandon) and Testing Zip Stores (Weddy)! 🎉
- ZEP1 [Project board](https://github.com/orgs/zarr-developers/projects/2/views/2) and PRs: https://github.com/orgs/zarr-developers/projects/2/views/2
- PyData Global 2022:
- Talk: https://global2022.pydata.org/cfp/talk/DQSXAX/
- Sprint: https://pydata.org/global2022/sprints/#zarr
**Open Agenda (add here 👇🏻):**
- BP: Question around if anyone is doing work with partial zarr stores or live archives in zarrs.
- Growing in at least one dimension
- Good chunking strategies but not
- Pushing updates
- MD: Zarr has been _sort of_ archivey. Good to talk about "append, append, append". Updating data is less common. See Ryan's strategy.
- BP: hackathon during AGU. Hub in LA if anyone wants to join.
- DB: if you have a well-defined shape then it should be ok. Chunks are never partial on disk. (At least that's not variable)
- Changing origin would be expensive. (Rename everything)
- JK: if you are only using part of a chunk, it will fill with fill_value. Resizing could lead to zeroes. (MD: yes, recent shock) (see the resize sketch at the end of this section)
- IV: [ZEP3](https://zarr.dev/zeps/draft/ZEP0003.html)? BP: Yes. Should be contributing. Have done some work internally.
- DB: zarr-python slicing API: aged well? alternatives?
- critique points:
- process of adding static types to PRs led to looking at the slicing API
- https://github.com/zarr-developers/zarr-python/blob/main/zarr/indexing.py
- numpy has expressive slicing (integer, tuple, slice object, tuple of slice object, tuple of arrays)
- in Zarr, every type in the polymorphism is a class. no interface in common
- spits out 2 things: the set of chunks and, with the same arity, the set of operations to do on them
- maybe three levels of indexing, but perhaps order of operations depends on the compressor
- MD: looked at numpy's implementation, or is it in C?
- DB: assumed they were encumbered by organic growth
- MD: likely, but lots of testing. Also: normalizing to in-memory indexing (set of) then might be just one implementation
- DB: then imagined a package that solved the problem which could be shared with napari, etc.
- DH: looked at HDF5? (difficult) DB: maybe h5py since they handle the polymorphism.
- JM: possibly https://pypi.org/project/h5py-like/ too
- JK: dask array slicing? np slicing does a bunch of stuff. led to vindex and oindex.
- IV: views on arrays in Julia is really nice. JK: multidispatch helps (IV: and n-dim in the lang.)
- JK: there was a performance issue on slicing, now fixed, but likely just a subset.
- DH: questions about v2.
- feedback for JS about the current nczarr extensions
- shape of an array isn't fixed? stated anywhere (JM: explained recent issues)
- JK: https://github.com/zarr-developers/zarr-specs/issues/188
- scalars (arrays of rank 0) are ok?
- char type (in numpy)? (like a scalar string)
- IV: strings? DH: fixed length is in, but varlength proposed and close to varlen arrays.
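JK's point about resizing and fill_value (flagged above) is easy to demonstrate with zarr-python 2.x; a small sketch:

```python
import zarr

z = zarr.zeros((4,), chunks=(2,), dtype="i4")  # fill_value defaults to 0
z[:] = 1
z.resize(8)
print(z[:])  # [1 1 1 1 0 0 0 0] - the grown region reads back as fill_value
```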
## 2022-11-16
**Attending:** Sanket Verma (SV), Josh Moore (JM), Davis Bennett (DB), Ryan Abernathey (RA), Dennis Heimbigner (DH), Hailey Johnson (HJ), Ward Fisher (WF), Jonathan Striebel (JS)
**TL;DR:** SV is going to speak at [PyData Global 2022](https://pydata.org/global2022/) next week and is also going to run a sprint along with JM. Details for the talk are [here](https://global2022.pydata.org/cfp/talk/DQSXAX/) and the sprint [here](https://pydata.org/global2022/sprints/#zarr). During the meeting, DB raised a proposal to host a Discord server for chunked formats, which later turned into creating a Discord server for the Zarr community. JM gave updates revolving around zarr-java, and lastly SV initiated the discussion on how we can separate Zarr (the storage format) from zarr-python (the Python implementation of the Zarr spec).
**Updates:**
- ZEP1 Update, see [here](https://gitter.im/zarr-developers/community?at=6374fae6f9491f62c9b7ea61)
- Check out the ZEP1 GH Project board [here](https://github.com/orgs/zarr-developers/projects/2/views/2); maintained by Jonathan Striebel
- PyData Global 2022 Sprint next-to-next week(1st-3rd Dec.), anyone interested in helping out?
- Need to know by this week
- Also: Sanket giving a talk, "The Beauty of Zarr"
**Open Agenda (add here 👇🏻):**
- DB: Discord for Chunked Formats?
- Had problem and TypeScript Community was _very_ helpful.
- New thread gets created per issue
- Downside: not indexed by google
- WF: like it, use it socially, want this to be a solution
- WF: have people pushing people to _github_
- RA: pangeo uses discourse. bringing more dialogue together?
- critical is the granularity so we get take-homes
- DB: activation energy for a forum post is 10x that of a discord message
- WF: discord seems more synchronous
- wouldn't work for NC since there wouldn't be enough critical mass
- gitter straddles the line, since there's a lot to catch up on
- but one for sync and async would be good
- RA: don't see the other chunked formats being psyched to be in a channel with us. workflow might be the more useful framing
- zarr-java: discussions tomorrow about bringing back two forks of jzarr
- JM (re-surfacing) care with Python discord? Yes.
- HJ: always start by clarifying library vs file format
- WF: good feedback on Zarr at unidata user committee meeting
- being asked for in THREDDS
- DH: recent discussion around ragged arrays
- most people encounter zarr through python
- that leads to an imprinting
- when they run into nczarr, they are perplexed (things missing)
- still a big problem
- JS: incompatibilities between libraries could bring us down
- v3 is spec first with feedback from different implementors
- hopefully dropping python-specifics and making it possible to know what the incompatibilities are
- claim: "Zarr is X", not just "a community project"
- posters, repos, webpages, etc...
- currently hard to grasp
- SV: how to separate from Python
- DH: go through tutorial and move things that are not in the v2 spec
- DH: probably many go through the tutorial
- DH: e.g. fortran community
- JM: good points, but the same will be true for nczarr. v3 will give us the chance to label things more clearly.
- WF: have to be clear about "NetCDF". Was specific when talking to the user committee about the "Zarr data model", or "Zarr data storage"
- specificity in language. data model and the format should be cross-language.
- DH: ultimately goal is the nczarr extensions to be v3 extensions
- JM: wonderful :tada: when would be a good time to plan for that?
- DH: currently working on DAP4, but will shoehorn some time for bullet point list of extensions (and why)
## 2022-11-02
**Attending:** Sanket Verma (SV), Josh Moore (JM), Dennis Heimbigner (DH), John Kirkham (JK), Norman Rzepka (NR), Jeremy Maitin-Shepard (JMS), Davis Bennett (DB), Ward Fisher (WF), Martin Durant (MD), Isaac Virshup (IV)
**TL;DR:** The Outreachy contribution phase is ending after four weeks, and we had a fantastic time working with the applicants. We'll select 1-2 interns to work with us for three months (December-February). SV is speaking @ PyData Global 2022 next month and planning to host a Zarr Sprint. Special kudos to JK for working actively on numcodecs! After that, IV started a discussion on memory mapping.
**Updates:**
- Outreachy contribution phase ending on 11/4 (Thanks to all who helped us! 🙌🏻)
- 2 more days then time to choose an intern to work with us for 3 months
- JK: choose more than 1? Possible. Feedback welcome.
- #beautifulzarr born out of Outreachy: https://github.com/zarr-developers/beautiful-zarr
- Crowd-sourced collection of pretty stuff
- Feel free to add stuff under https://github.com/zarr-developers/beautiful-zarr/tree/main/_data
- Speaking at [PyData Global 2022](https://pydata.org/global2022/) 📣
- "The Beauty of Zarr"
- Planning a Zarr Sprint! 🏃🏻♂️ Anyone like to volunteer?
- Collect issues and attendees (some 2000+) get involved
- 3-4 maintainers/contributors should be present
- 1-2 hour commitment
- 1st-3rd of December
- ZEP0003 by [Martin](https://github.com/martindurant) and [Isaac](https://github.com/ivirshup) is in draft; read [here](https://zarr.dev/zeps/draft/ZEP0003.html) :tada:
**Open Agenda (add here 👇🏻):**
- MD: kudos to the push on numcodecs
- JK: fixing build things
- People _very_ excited about building for 3.11
- Want to get all of the build fixes into the upcoming release
- https://github.com/zarr-developers/numcodecs/issues/377
- pyproject.toml allows us to unvendor header
- _issue moved to private issue_
- IV: Question about where memory mapping is at (https://github.com/zarr-developers/zarr-python/pull/377, https://github.com/zarr-developers/zarr-python/pull/1131)
- MD: related to passing contexts down to the reading
- JK: added `_from_file()` in DirectoryStore to define how the reading is done (see the memory-mapping sketch at the end of this section)
- JK: https://github.com/zarr-developers/zarr-python/pull/377#issuecomment-915030522
- JK: https://github.com/zarr-developers/zarr-python/pull/377#issuecomment-1301159210
- MD: with codec, produces a regular array
- JK: previously updated to use pybuffer protocol (codec with decompression will do a copy)
- DB: use case? for large amounts of single cell data. resampling for neural networks. DB: chunk size doesn't help. IV: not because of the sparseness. but even if dense, the random access
- MD: even better: on slice selection of a zarr object, pass the byte range into the loader (done with blosc blocks in v2 - cheap sharding). Memory mapping just exposes an extra layer that you might not need.
- IV: "fast as possible reading" (where disk size isn't an issue)
- DB: using zarr array API? doesn't seem like that would work
- MD: similar to kerchunk. want to build a utility, pretend N chunks for random-access.
- IV: how much work until we can pass the byte-range down?
- MD: discussed in several places. 1131 is likely to win.
- JM: does FSStore support it? We _think_ so.
- IV: and ZipStore? MD: requires some work.
- IV: .. and gets passed to pytorch and is multi-threaded ...
- JK (from chat): Maybe this Dask PyTorch loader is useful?
- JK (from chat): https://github.com/rapidsai/cucim/pull/120
- DH: effect on caches? (general "good question" nodding)
- JMS: page cache? No. e.g. nczarr's chunk cache (DB problem)
- MD: if you're caching, then cache whole chunks and read partially from them rather than reading partial chunks. fsstore has something for parts of files, but it's messy
- DH: potential for optimizations.
- MD: good question when we get to subselections of a chunk
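A sketch of the memory-mapping hook JK pointed to in the #377 comments, assuming zarr-python 2.x's `DirectoryStore` (where the override is spelled `_fromfile`): chunk files are memory-mapped rather than read eagerly; a decompressing codec will still copy on decode.

```python
import numpy as np
import zarr

class MemoryMappedDirectoryStore(zarr.DirectoryStore):
    # read chunk files via mmap instead of loading them up front
    def _fromfile(self, fn):
        return memoryview(np.memmap(fn, mode="r"))

z = zarr.open(MemoryMappedDirectoryStore("data.zarr"), mode="r")
```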
## 2022-10-19
**Attending:** Sanket Verma (SV), Eric Perlman (EP), Erik Welch (EW), Hailey Johnson (HJ), Ward Fisher (WF), Jonathon Striebel (JS), Martin Durant (MD)
**TL;DR:** The Outreachy contribution phase has started, and the contribution PRs are coming in hard! JS eagerly seeks feedback on storage transformers PR; check [here](https://github.com/zarr-developers/zarr-python/pull/1111). There was a visible concern from the Zarr Community around the progress of ZEP0001. SV assured everyone that he’d communicate with the author and resolve any pending issues to move forward as quickly as possible. After that, there was a discussion on ZEP0003 (authored by Martin and Isaac).
**Updates:**
- Outreachy contributions coming in hard! 💪🏻
- Any more ideas how to engage the applicants?
- You can help us by reviewing the PRs and interacting on Gitter chat
- New ZEP by Martin Durant, check [here](https://github.com/zarr-developers/zeps/pull/18)
- TensorStore CMake integration now available: https://google.github.io/tensorstore/installation.html#cmake-integration
**Agenda:**
- JS: Made 2 PRs for storage transformers
- Looking for feedback
- https://github.com/zarr-developers/zarr-python/pull/1111
- https://github.com/alimanfoo/zarr-specs/pull/1
- Martin gave some feedback
- Looking to drive it forward
- SV
- Working with Alistair to get the feedback
- Moving things forward ASAP, taking into account the feedback and comments on the V3 PR
- MD: https://github.com/fsspec/kerchunk/pull/237
- JMS: What functionality is not in V2 that's in V3?
- MD: Zarr chunks can be made by concatenating - storage layer can be used to do this - similar to sharding storage layer - sharding is done at some levels
- Kerchunk would really like to do this
- JS: Sharding has no problem (in principle)
- JMS: Maybe create a fork `zarr-specs` repo
- Better way to work forward and make progress
- JS: Opening PR against the repo is fine
- MD: lot of changes would be hard to merge back in the main PR
- JS: Don’t personally mind merging things - ZIC has to decide
- JMS: Valuable to make incremental changes
- MD: JS can make changes to zarr-python
- JK: Make PR on Alistair’s PR and merge them
- MD: Sounds good!
- JMS: Make a place/fork and merge things over there and have working draft
- SV: Try to resolve things with Alistair and set a plan to move forward
- JS: Triage the PR which are on Alistair’s fork
- JK: Have a single place to have all the discussion to and not diverge things
- MD: wants to have variable length chunking (see the chunk-grid sketch at the end of this section)
- It’s a draft for now
- Edit it before merging if it doesn’t make sense
- JMS: no objections to Martin’s proposal
- V2 is obvious and V3 has multiple chunk grid types - but no example yet of having multiple chunk types
- MD: complete not separate
- JMS: do we want to have those special types?
- MD: can’t imagine how easy it is to put up these things - can have overlapping chunks - values can be duplicated over multiple chunks - different things can overlap, like the surface of a sphere - Dask may have a reference implementation for Martin’s ZEP
- MD: Do you have partial chunk?
- JK: How would the append work on this ZEP?
- JMS: A chunk should not resize after it’s created - if you append you should append the same size chunk
- MD: Merging and joining chunks would be easy to do - don’t really need to think about it for now - can rewrite a whole new array - ability to manipulate the chunks - once it’s in the spec we can work on the API
- JMS: Hadn’t had use for variable size chunk
- MD: streaming data - maybe in those cases you would want to have variable size chunks
- JS: Address is same? MD: Right!
- JS: may not have use for variable size chunks - current complexity is enough - other cases: move data, resize your chunk and write that chunk
- MD: or delete an index or 0
- JMS: Don’t want to implement non-zero index
- MD: now have a way to do it! Voila!
- JMS: We’re using a special structure at Google for non-zero origin - boxloses or something
- JS: If you use Zarr API
- EP: Using Jeremy’s implementation and then using Zarr’s metadata - can use same indexing for arrays
- JS: Fair for many things - don’t know a lot about offsets - you start at 0, 0, 0 - can ignore
- EP: involved with OME data and can fit into other use cases as well
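To make the variable-length chunking idea concrete, a toy lookup for one dimension (illustrative only, not ZEP0003's spec text): cumulative chunk boundaries turn "which chunk holds index i" into a binary search.

```python
from bisect import bisect_right

# Hypothetical variable chunk grid for one dimension: chunk sizes 10, 25, 5
sizes = [10, 25, 5]
bounds = [0]
for s in sizes:
    bounds.append(bounds[-1] + s)  # cumulative boundaries: [0, 10, 35, 40]

def chunk_of(index):
    """Map an array index to (chunk number, offset within that chunk)."""
    c = bisect_right(bounds, index) - 1
    return c, index - bounds[c]

assert chunk_of(12) == (1, 2)  # index 12 lives in chunk 1 at offset 2
```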
## 2022-10-05
**Attending:** Josh Moore (JM), Norman Rzepka (NR), Davis Bennett (DB), Ward Fisher (WF), Hailey Johnson (HJ), Martin Durant (MD), Jeremy Maitin-Shepard (JMS), Isaac Virshup (IV), John Kirkham (JK)
**TL;DR:** Scalableminds has forked `jzarr` and has started working on it. There were discussions around how the Java implementation (`jzarr`) would move forward and implement the V3 spec. After that, IV discussed the R implementation of Zarr, and we planned to discuss ragged arrays at the ZEP meeting tomorrow.
**Updates:**
- vEM sidenote
- NR: did Mathworks reach out to Unidata?
- WF: about NaN?
- NR: reached out and asked about Zarr for Matlab. They wanted a c/c++ library they could build on top of
- WF: not about that. Ellen Johnson is the main contact. Will add that to the list.
- :tada:
**Agenda:**
- Java and async
- scalableminds: Java backend. Forked jzarr and implementing in scala (read only)
- layers: fsspec, pure zarr, parallel (and async at all of those)
- (Tischi) imglib? NR: interop would be cool
- NR: Shared effort would be appreciated
- HJ: filters are pluggable and shareable
- HJ: definitely want to implement async for S3
- HJ: breaking THREDDS into microservices. also an option. (weren't breaking netcdf-java)
- MD: python async. zarr is sync. barrier in the storage layer.
- NR: important to have the lower levels async. then join/wait
- NR: but don't need cutouts. just interface to get specific chunks and (optionally) decode them
- abstractions: filesystems, zarr/n5/precomputed variants.
- MD: https://github.com/martindurant/async-zarr was a Python POC (targeted at Pyscript)
- https://github.com/bluesky/tiled (server)
- DB: would love async zarr in python
- JMS: tensorstore has python async. And s3 writing? No.
- JM: wrapping in tensorstore with Java?
- JMS: people tend to like native Java. Unsure how the memory measurement might work. Managing persistent buffers. Happy to help someone to look into it. (Little experience)
- MD: someone doing it for spark? pyspark talking to tensorstore
- JM: netcdf
- HJ: netcdf-java uses netcdf-c for writing NC4.
- WF: JNI/JNA tricky so didn't do more, but for the writing there was no alternative.
- MD: HDFS drivers in Java land saw lots of barriers. Seemed to be the case that the C native interface bridge raised uncertainties.
- R (IV)
- graphblas would also like to help get the libzarr c implementation outside of netcdf-c
- goal is a binary sparse format
- MD: Jim works on team. Thought community wasn't behind Zarr.
- IV: from last Tuesday. Tried Zarr. Had issues. Tried HDF5 had more issues.
- MD: see also awkward arrays (had prototype of AA in zarr. bit clunky. work on convention for v3. would also work for sparse)
- WF: There is a lot of modularization in the C library source, in terms of a proof of concept, we might be able to create a separate library. But I also recognize I'm looking for something to work on besides spinning up a grant proposal.
- WF: internal considerations. capacity. etc.
- but from code organization standup, code is in separate library. compiled as object that is linked against.
- likely wouldn't take a ton of work. need to examine it.
- HJ: the netcdf-java implementation also has the zarr package separated from CDM (i.e. ditto)
- WF: precedent with the udunits package. similar to how nczarr is now. (now maintained by a team outside unidata)
- IV: good to hear. maybe also what Jim wanted to hear. segfaults.
- WF: putting new release out soon. bug fixes & messaging. (IV: need conda package)
- MD: sparse data needs a binary storage. so good time to write down those requirements
- MD: gets us into variable chunk sizes which are needed in zarr for these uses.
- IV: don't care just yet. (not accessing by chunk right now)
- DB: why not just numpy sparse storage? MD: you want configurability
- DB: sparse as a codec? you don't want to build that in memory. i.e. you want a sparse representation
- also: what does it even mean
- JMS: think we can shoe horn sparse into Zarr but don't think just encoding them as dense arrays is the best way. need to add variable sized chunks to do something reasonable. maybe based on store abstraction but not on array abstraction.
- MD: if there are other reasons for variable sized chunks. JMS: even then it's shoehorning.
- IV: encoding as zgroup? JMS: depends on operations (e.g. chunks in spatial region)
- MD: you can definitely design a binary format (but there isn't one). zarr is already around though rather making a new binary format.
- IV: all formats are a collection of dense 1-D or maybe 2-D arrays
- JMS: in-memory arrays have different constraints. parallel writes?
- JMS: you don't really care about the linear mapping.
- IV: number of sparse formats is due to the number of use cases
- MD: cf. Arrow's feather format. dump of each buffer that it takes to reproduce it. and it's fine. (no chunk-wise access. no parallel reads from remote)
- i.e. Arrow is not better or worse
- Zarr with a convention is Arrow
- IV: chunking in arrow
- IV: another use case is variable length strings (in V3)? MD: not as numpy arrays
- JMS: v2 does but you can't index in the string
- MD: numpy object arrays
- JMS: most obvious zarr encoding is having each chunk sparse. DB: why isn't that the "zarrthonic" way?
- JMS: not on top of zarr-python. MD: don't want `foo[:]`. IV: most libraries don't comply to numpy. dtypes of each 1D array.
- JM: https://data-apis.org/
- MD: see cupy work
- DB: in dask you should be able to use. there's a meta keyword argument that gets passed around that contains a description of the soul/truth of the array
- JM: generally good :+1: let's work on it
- IV: have already done this for some sparse array types in anndata (see the group-layout sketch at the end of this section)
- JM: should capture as convention/extension
- IV: variable size chunks would help
- JM: Martin, someone already working on that spec?
- MD: spec change is easy, but implementation work is harder. (dask array may have some of the logic)
- DB: any objections?
- MD: there were some, but can't summarize
- JMS: boundaries in separate place or in metadata
- MD: Ryan Abernathey thinks only the storage layer should know about it (not in metadata)
- any read would need to pass not just a key but also where in the array it is
- IV: https://github.com/zarr-developers/zarr-specs/issues/62
- JMS: doesn't seem very natural in the storage layer. i.e. not MutableMapping API.
- MD: don't expect chunks in one dimension to get so large that the metadata is a problem
- IV: have students who could work on this
- JMS: what happens on re-sizing
- MD: API is to be determined but that's an implementation question
- JMS: unless you need a "default chunk size" metadata or similar.
- IV: ZEP meeting (tomorrow) the place to bring up ragged arrays
- JMS: harder than sparse
- IV: ok not being able to index into the variable length arrays
- vlencodecs ... need spec ...
- running out of steam ...
- IV: was looking at dtype extensions and thought perhaps ragged arrays could go there. but perhaps another one. which one is right?
- JMS: dtype, generic extensions, ... would need custom codec and generic.
- IV: might make sense to _reduce_ what's allowed for the dtype extensions
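A sketch of the "sparse array == zarr group" convention IV described (the attribute keys here are made up; anndata's actual on-disk convention differs in detail): each CSR buffer becomes a dense 1-D array inside one group.

```python
import scipy.sparse as sp
import zarr

csr = sp.random(1000, 1000, density=0.01, format="csr")

# One group, one dense 1-D array per CSR buffer, shape/format in attributes.
g = zarr.group(store="sparse.zarr")
g.attrs["encoding-type"] = "csr_matrix"  # hypothetical convention key
g.attrs["shape"] = list(csr.shape)
for name in ("data", "indices", "indptr"):
    g.array(name, getattr(csr, name))    # three dense 1-D arrays
```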
## 2022-09-21
**Attending:** Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Ahmet Can Solak (AS), Sanket & the Do-a-thon, Martin Durant (MD), Davis Bennett (DB)
**TL;DR:** Having a new attendee from the CZI Open Science Summit, we took a deep dive into the best way to capture data directly from microscopes, comparing the pros and cons of Zarr/HDF5/Zip and more. Additionally, we worked through remotely visualizing a Zarr when it's been created on the cluster in a Jupyter notebook.
**Updates:**
- Sanket at CZI/NumFOCUS Summits
- Coming to San Fran next week, lunch!
**Open Agenda (add here 👇🏻):**
- Ahmet: BioHub
- Collaborators interested in Java implementation
- Need a good implementation
- ImageJ / BDV (folks at Janelia)
- V3: collaborators to help read it
- JMS: explicit opt-in for V3 (need to know _a priori_)
- Though auto-detection could be added
- neuroglancer likely has a stronger case for auto-detection
- AS: happy tensorstore users. Thanks a lot! :star:
- https://github.com/zarr-developers/zarr-python/issues/1140
- resize manually? more internal with a skinnier API
- JMS: assume things within old bounds are old?
- AS: perhaps request chunks (from last savepoint) more compute heavy
- keyword argument?
- MD: "don't bother writing where there's no new data"
- JM: see related https://github.com/zarr-developers/zarr-python/issues/1017
- JMS/MD: use selection to fill in the new bits
- AS: `append()` is only for one axis. This might be for arbitrary axes.
- perhaps `append_chunks()`
- use case
- instruments generating lots of data quickly.
- don't want to resize if not necessary. with fewer methods if possible.
- most efficient way?
- of course, better to know exact size.
- MD: just make the size much larger and have missing chunks?
- AS: only if know when biologists will stop
- Clarification: doesn't write the empty chunks
- MD: do edge chunks need special handling?
- JMS: no. always write the full chunk.
- (not in N5, and didn't implement in tensorstore)
- DB: wouldn't suggest having everything in one array
- 1 array per timepoint (doesn't work for NGFF)
- growable arrays
- or use HDF5 for the acquisition
- AS: why? faster than zarr-python. but tensorstore? Don't know.
- JM: let's do that benchmark
- DB: Windows doesn't like lots of small files
- MD: could write Zarr into Zip with no compression (basically what HDF5 does)
- DB: save data in the way that's most effective for the acquisition
- Zarr as a great format after that
- AS: that's what we were doing previously. but additional time adds up. people want the results faster. was asked to add a ZarrWriter in the acquisition package. Can then easily transfer to data storage.
- DB: easier to transfer than HDF5? No, than the raw files. Compression is a benefit.
- AS: set chunk size bigger rather than using HDF5
- JM: per camera. but can't compress chunks.
- HDF5 compress in parallel but not write in parallel
- JMS: eventually all use cases of HDF5 but not there yet
- granularity at which you can read and write
- AS: re-chunking is faster than converting camera offline
- AS: with two cameras we don't try to write to the same array with both, but to multiple places
- JM: zip support in tensorstore? JMS: not yet
- JMS: also thought about LMDB. single file. pretty efficient.
- zip e.g. doesn't support deleting.
- also only has one directory structure
- MD: HDF also has that problem.
- DB: re-writing isn't a problem for acquisition.
- JMS: do need to checkpoint the zip directory periodically.
- AS: saving single-array per timepoint, then zip might work quite well.
- converting to zip zarr saw some worse performance. not sure where.
- MD: make sure the zip isn't compressed.
- JM: need Zip spec
- DB: would love to hear where this goes
- MD: **inverse problem**
- massive HDF5 files in tar file on S3 for the purpose of multi-file dataset
- desire to distribute them as individual files
- 20G tar containing HDF5
- Kerchunk's job was to point to these files within the tar
- or "find all the chunks in all of the files"
- works nicely!
- fetches are short but there are many of them.
- had to download it (for scanning) but don't want users to have to do that.
- i.e. if you push for a single file, perhaps you can get the best of both worlds.
- DB: lambda function? probably. (but this was custom S3)
- JM: need Java implementation of Kerchunk (for BDV)
- DB: generate from json-schema
- AS: with kerchunk can you point to your data centers...
- MD: each chunk is a key but is a URL
- JM: `"chunk-name"URL, offset, length)`
- JMS: can get the correct endpoint for a chunk
- add s3 syntax
- IPFS, mutable hashes, ...
- DB: interesting workflow. any help?
- couldn't get napari on cluster over VDI
- transforming images and saving them as zarr.
- starting static server and pointing neuroglancer at it.
- would prefer to do things programmatically in neuroglancer and it spits out a URL
- also convenient to have static file server as background process from main python (notebook)
- JMS: definitely convenient and it's "just a web server"
- DB: don't save that to disk? dask arrays in memory?
- JMS: neuroglancer-py does have a way to share numpy array or tensorstore object
- Socket based? Internally starting a web server.
- DB: and if it gets updated? does it block? No, background thread
- There is a method to invalidate the cache.
- Python API for making URLs? Yes.
- Could be attractive to people (Janelia) for when computing on the cluster
- JM: See also Wei's imjoy-rpc for the usability
- JMS: works as iframe in jupyter now (DB: desirable)
- JMS: possibly using jupyter protocols would work around firewall
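The reference layout JM sketched above, written out as a kerchunk-style mapping (paths and offsets invented for illustration); fsspec's "reference" filesystem can consume such a dict:

```python
import fsspec

# each chunk key points at (URL, byte offset, length) inside a bigger file
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "data/.zarray": "...",  # array metadata JSON (elided here)
        "data/0.0": ["s3://bucket/archive.tar", 1024, 65536],
    },
}
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3")
```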
## 2022-09-07
**Attending:** Sanket Verma (SV), Josh Moore (JM), Davis Bennett (DB), Dennis Heimbigner (DH), Jeremy Maitin-Shepard (JMS), Hailey Johnson (HJ), Brianna Pagán (BP), Isaac Virshup (IV)
**TL;DR:** SV announced new ZEP bi-weekly meetings apart from the regular bi-weekly community meetings. Now, the Zarr community will be meeting twice every two weeks. New illustrations by Henning Falk, check them [here](https://github.com/zarr-developers/zarr-illustrations-falk-2022). Google Summer of Code 2022 is finally coming to an end. After the updates, JM discussed having a hierarchy that builds a virtual n-dimensional array. Then, JMS started discussing one of the open issues in the `zarr-specs` repo; check the issue [here](https://github.com/zarr-developers/zarr-specs/issues/141).
**Updates:**
- ZEP meetings will take place bi-weekly on Thursdays @ `21:30 IST/18:00 CEST/17:00 BST/12:00 EDT/9:00 PDT`
- Instructions: https://zarr.dev/community-calls/
- More focused on the spec than these meetings
- Check out our new illustrations here: https://github.com/zarr-developers/zarr-illustrations-falk-2022
- More ideas welcome!
- Sharding...
- `copy` button for code snippets in Zarr documentation, check here: https://zarr--1124.org.readthedocs.build/en/1124/ @ Altay Sansal [#PR1124](https://github.com/zarr-developers/zarr-python/pull/1124)
- Approaching end of GSOC (12th of Sep)
- https://alt-shivam.github.io/Codecs-Registry
- Looking to participate in Outreachy (https://outreachy.org/)
- New potential users & developers
**Open Agenda (add here 👇🏻):**
- JMS: plan for resolving v3 spec?
- SV: more on this tomorrow but some progress looking at open issues
- Upcoming work
- JM: proposal to have ZEP0001 moved to a "provisional ZEP" state (only blockers allowed)
- JMS: idea is no spec discussion at this meeting? SV: no, but we'll communicate back and forth
- SV: updates on Brianna/Hailiang's ZEP? BP: Not yet. SV: also welcome to join tomorrow
- IV: zarrR?
- https://github.com/fsspec/kerchunk/issues/212
- Josh:
- JM: idea of having a hierarchy that builds a virtual n-dim array
- DB: adds brittleness. would say no.
- IV: kind of like kerchunk but with more indirection
- JMS: sometimes have use cases.
- stacks of images that you want to view as an array, or multiple images acquired separately.
- Do have stack driver in tensorstore (with specified origins. No stored representation)
- DB: similar problem when acquired in HDF. wrote own layer.
- JMS: should maybe be a layer higher than zarr.
- DB: for bioimaging, if your app depends on this then you can only open HDF and Zarr and not other stuff.
- doesn't need to be compiled code. API problem.
- NaN/inf/other special values in user-defined attributes: https://github.com/zarr-developers/zarr-specs/issues/141
- zarr-python supports them by encoding in a non-JSON-compliant way (see the encoding sketch at the end of this section)
- DB: anything that can be stored as data shouldn't be impossible to store as an attribute
- DH: was dealing with this recently. found "NaN" (quoted string) in existing datasets, expecting it to be treated as such. added support to nczarr (as well as unquoted versions)
- JM: will likely need a deprecation/warning/error cycle (royal pain)
- IV: keep JSON and use them as special values? nice that it is all just JSON.
- DH: nczarr (netcdf API) got ahead of this because typing is stored for attributes ("double"). possibility for v3?
- JMS: good point. perhaps decide the model for attributes in v3 (i.e. proposed change to v3 spec)
- JM: will need an upgrade path
- DB: haven't seen untyped attributes, but just that JSON is missing values
- DH: so extend constants that are definable
- JM: BSON?
- IV: there are also things that can't be encoded in zarr
- DH: one problem with extending JSON is that in C code that there are JSON parsers that would choke
- JMS: zarr-python has essentially already done that
- **Enumerating options:**
1. extend JSON parser (generally :-1:)
2. support existing JSON-variants (BSON) (generally :-1:)
3. encode objects in JSON
a. `{"attribute":...`
a. `{"@type":...`
4. add type information somewhere else (like .nczarr)
- JM: (2) might be a metadata-driver like separate chunk-stores
- DH: if it's binary, then you need a good spec. and need to show equivalence between binary version and JSON.
- JMS: you might be writing the non-JSON attribute late in the process, which would cause problems
- DH: binary could help with speed since string level parsing is expensive
- DB: always thought of the metadata as the stuff you want to read with an editor, and you don't want performance issues
- DH: have had a number of examples of NC-4 files that are enormous (10s of MB of metadata)
- also abusing grouping for "namespaces" (even if not a good idea)
- IV: is this Zarr's responsibility? cf. Pydantic which can turn your values into something else. (i.e. external schema)
- DB: but Zarr is responsible for storing "fill_value"
- IV: that's .zarray rather than .zattrs
- DB: would assume that the `.attrs` property takes care of encoding/decoding
- JMS: would see saying ".zattrs supports JSON + these encodings"
- IV: do all the languages support this?
- Javascript?
- JMS: `Array[UInt8Array]`
- SV: an extension?
- JMS: could fail on invalid JSON now and then add encoding/decoding later (since there's already the issue with V2)
- IV: Arrow requires everything to be an arrow type (everything else is string with encoding)
- DH: did that in netcdf-4
- DB: sqlite is the same way
- DH: include numpy with json type (from string)
- IV: almost done with PR on awkward arrays (using this). depends on the JSONs
- JMS: would make sense to standardize that (decide: pure JSON or extended JSON)
- IV: see https://github.com/scverse/anndata/pull/569
- SV: heading to California next week for NumFOCUS & CZI summit (also NJ & NYC)
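To make option (3b) above concrete, here is a minimal sketch of tagged-object encoding; the `@type` key is purely illustrative and not part of any Zarr spec. For contrast, zarr-python's current non-compliant behaviour corresponds to `json.dumps`'s default `allow_nan=True`, which emits bare `NaN`/`Infinity` tokens.

```python
import json
import math

def encode_attr(value):
    """Recursively replace non-finite floats with tagged, JSON-safe objects."""
    if isinstance(value, float) and not math.isfinite(value):
        return {"@type": "float", "value": str(value)}  # "nan", "inf", "-inf"
    if isinstance(value, dict):
        return {k: encode_attr(v) for k, v in value.items()}
    if isinstance(value, list):
        return [encode_attr(v) for v in value]
    return value

def decode_attr(value):
    """Inverse of encode_attr: restore tagged objects to Python floats."""
    if isinstance(value, dict):
        if value.get("@type") == "float":
            return float(value["value"])
        return {k: decode_attr(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode_attr(v) for v in value]
    return value

attrs = {"scale_factor": float("nan"), "valid_max": float("inf")}
text = json.dumps(encode_attr(attrs), allow_nan=False)  # strictly valid JSON
print(text)                           # {"scale_factor": {"@type": "float", "value": "nan"}, ...}
print(decode_attr(json.loads(text)))  # {'scale_factor': nan, 'valid_max': inf}
```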
## 2022-08-24
**Attending**: Sanket Verma (SV), Jeremy Maitin-Shepard (JMS), Davis Bennett (DB), Eric Perlman (EP), Ward Fisher (WF), Martin Durant (MD), Hailiang Zhang (HZ), Ryan Abernathey (RA), John A. Kirkham (JK)
**TL;DR:** The Zarr community discussed two open PRs in the zarr-specs repo: [Support a list of `codecs` in place of `compressor`](https://github.com/zarr-developers/zarr-specs/pull/153) and [Change data type names and endianness](https://github.com/zarr-developers/zarr-specs/pull/155). The discussion was extensive, covering many good points, and the overall community favours merging both of these PRs. After this, Hailiang Zhang from NASA Goddard asked a few questions about the ZEP extension he and his colleagues are working on. They are making progress on the ZEP and will submit a draft in the upcoming weeks.
Finally, there was a discussion on working on and finishing the pending ZEP1. John A. Kirkham proposed an idea that everyone was in favour of. Also, the community would like to step up and help in the completion of ZEPs whenever and wherever needed.
**Updates:**
- Zarr is attending CZI and [NumFOCUS Summit 2022](https://numfocus.org/2022-project-summit-openmbee), if you're there feel free to say Hi! 👋🏻
- [Jonathan Striebel](https://github.com/jstriebel) is presenting a poster on Zarr @ [EuroScipy 2022](https://www.euroscipy.org/2022/) next week. If you're attending the conference, please say Hi 👋🏻!
- Final decision on [ZEP1](https://github.com/zarr-developers/zarr-specs/pull/149) most probably next week. Please leave your feedback now!
- Suggestions for themes/tech stack for website revamping
**Meeting minutes:**
- JMS: https://github.com/zarr-developers/zarr-specs/pull/153
- Does anyone have any feedback on this? (see the metadata sketch at the end of these minutes)
- RA: Hard to change across all the specs - changing filters is possible as it deals with NumPy arrays - compressors don't do that
- DB: we can look at a unified API for these types of changes
- MD: the codec should take context - all the info like position and size of the array - this could potentially solve the problem - by the time you compress the array, the codec could act on what it has been told
- JMS: each codec should have bytes as output
- DB: for numcodecs this means promoting/changing the function signature - no reason this could not be done
- MD: it says buffer and it can be amended
- MD: it can tell you where you are in the array - chop things where specified - e.g. I want this key because it is this key in this chunk - also biased because I've worked on storage and this helps me out
- JMS: first you read the chunk and then the data is read - if the codec wants to make a partial read, the codec could decide what to do from there
- MD: blosc takes care of this - the codec itself won't interact with the storage layer - use case: the kerchunk example - a netcdf file - needs to know which file and at what size we are - doesn't need to think about the storage layer
- JMS: https://github.com/zarr-developers/zarr-specs/pull/155
- MD: Seems good to me 👍🏻
- JMS: rename the data types and keep the endianness - minimal change - using different names makes sense - you can add other names - and it makes sense to use more conventional names
- DB: in favour 👍🏻
- JMS: if the array is big endian, this change will return a big-endian array
- RA: lil' experience with this - ocean models give big-endian data - sometimes they don't work - never wanted to have those types - just accepted them because they were there - trade-off: computational cost, and you cannot convert on the fly
- MD: if you can do it in place, temporary duplication of the memory can be done - all astropy data is big endian
- RA: row major vs. column major - something which Zarr should take care of 👀
- HZ: Ryan and HZ's colleagues (Brianna and Mahabal from NASA Goddard) had a discussion in a meeting - a proposal to build chunk-level statistics for performance - the idea: each chunk will have some statistics to describe its characteristics and how it is performing
- *Planning to submit an extension ZEP soon!* 🙌🏻
- Not a single value - it will be along certain dimensions - allowing it to be a vector instead of a scalar - along which dimensions should it be done!?
- It's been a month - what's the timeline for the release of the next major version of the spec? - No timeline yet, we're still working on it!
- Statistics need to have some knowledge - https://github.com/zarr-developers/zarr-specs/issues/73 - this is helpful for us - couldn't find this in V3 - did I miss something about this?
- Dimensions could be lat., long., time - the statistics - the dimensions could be switched
- MD: adding a codec could solve this!
- Thank you!
- JK: Get Alistair to finish ZEP1!
- Use comments and make an individual PR for those comments - Nice idea!
- MD: Different uses and perspectives - move towards the same goal - other groups have structured issues
- MD: make a list of things we can include and solve them
- JK: break the larger problem into simpler ones and then solve them!
- Discussion on community to take charge for ZEPs
- Everybody seems to be in favour and ready to step up
- Hoping to close ZEP1 soon and move forward!
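As referenced above, here is a rough sketch of how array metadata might look before and after the two PRs: the single `compressor` (plus `filters`) becomes one ordered `codecs` list (in the spirit of PR #153), and the data type gets a logical name, with byte order shown here as a codec concern rather than part of the type. All key names are illustrative, not quoted from the spec.

```python
# v2-style metadata: endianness is baked into the dtype string, and the
# transformation chain is split across "filters" and a single "compressor".
v2_style = {
    "dtype": "<i2",
    "filters": [{"id": "delta", "dtype": "<i2"}],
    "compressor": {"id": "blosc", "cname": "zstd", "clevel": 5},
}

# Sketch of the proposed shape: one ordered list of codecs, logical dtype name.
v3_style_sketch = {
    "data_type": "int16",  # logical name, no byte order
    "codecs": [
        {"name": "delta"},                                          # array -> array
        {"name": "endian", "configuration": {"endian": "little"}},  # array -> bytes
        {"name": "blosc", "configuration": {"cname": "zstd", "clevel": 5}},  # bytes -> bytes
    ],
}
```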
Open Agenda (add below 👇🏻):
- Zarr v3 spec open issues:
- by JMS
- Tabled:
- fill value required: https://github.com/zarr-developers/zarr-specs/pull/145
- C order vs F order vs arbitrary order
- Rename of array metadata files
- Dimension labels
- NaN/inf/other special values in user-defined attributes: https://github.com/zarr-developers/zarr-specs/issues/141
- Storage transformations: sharding, consolidated metadata
- Can storage transformers operate on the entire store or just arrays? https://github.com/zarr-developers/zarr-specs/pull/149#pullrequestreview-1078722828
## 2022-08-10
**Attending**: Sanket Verma (SV), Josh Moore (JM), Jeremy Maitin-Shepard (JMS), Alex Merose (AM), Brianna Pagán (BP), Hailey Johnson (HJ), Hailiang Zhang (HZ), Jonathan Striebel (JS), Martin Durant (MD), Norman Rzepka (NR), Shivank Chaudhary (SC), Ward Fisher (WF), Mahabal Hegde (MH), John Kirkham (JK), Isaac Virshup (IV)
Introductions:
- Favorite sport!
- Feel free to add links to your work here
Updates:
- [ZEP1](https://github.com/zarr-developers/zarr-specs/pull/149) & [ZEP2](https://github.com/zarr-developers/zarr-specs/pull/152) are open for feedback!
- Review under https://zarr.dev/zeps/draft_zeps/
- Comments on https://github.com/zarr-developers/zarr-specs/pulls
- Browse https://zarr.dev/community-calls/ for previous meeting notes
- [Jonathan Striebel](https://github.com/jstriebel) is presenting a poster on Zarr @ [EuroScipy 2022](https://www.euroscipy.org/2022/). If you're attending the conference, please say Hi 👋🏻!
- 2.13.0a1 releasing soon with updates from [Davis](https://github.com/zarr-developers/zarr-python/pull/1094), [Mads R.B.](https://github.com/zarr-developers/zarr-python/pull/934) and [Jonathan](https://github.com/zarr-developers/zarr-python/pull/1096)
- Josh: 2.13 next alphas
- https://zarr.readthedocs.io/en/latest/release.html#release-2-13-0
- Phase 1 of GSoC 2022 completed! 🎉 Check progress [here](https://alt-shivam.github.io/Codecs-Registry/)
- JMS: would be good to have specification (json-schema?) for each
- Goal: have clients interact with the registry to give users info/feedback
- async zarr https://github.com/martindurant/async-zarr
- Anaconda hackweek per quarter (2-day hack)
- for discussion https://gitter.im/zarr-developers/community?at=62f3ed24d020d223d36587d5
- http only and other simplifications
- JMS: targeting runtimes outside of the browser?
- AM: cloudflare worker? WebAssembly support. (MD: already in pyodide)
- lightweight VMs (e.g., for security)
- https://github.com/zarr-developers/community/issues/14
- IV: story for downstream library developer to use? rewrite to use await
- MD: definitely must use await (can't go in and out of the event loop)
- use case: fetching the first chunk of several arrays (see the sketch below)
- MD: e.g. how would xarray use it
- MD: some things already work: bokeh, etc.
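A minimal sketch of the concurrency pattern behind the "first chunk of several arrays" use case, assuming a hypothetical async store coroutine; this is the general asyncio pattern, not async-zarr's actual API:

```python
import asyncio

async def fetch_chunk(store, key):
    """Hypothetical coroutine standing in for an async store's get()."""
    ...  # e.g. an HTTP range/GET request for `key`

async def first_chunks(store, array_paths):
    # Assume 2-D arrays with v2-style chunk keys, so "0.0" is the first chunk.
    keys = [f"{path}/0.0" for path in array_paths]
    # gather() overlaps the requests instead of awaiting them one by one
    return await asyncio.gather(*(fetch_chunk(store, k) for k in keys))

# chunks = asyncio.run(first_chunks(store, ["temp", "salt", "pressure"]))
```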

Open agenda:
- ...add here ...
- AM: Question about the Zarr Spec v3 (ZEP1)
- Like that it's very bare bones
- Thought experiment: Could a video codec be implemented?
- Compress across time; key frames
- JMS: 3d xyt chunks would work (individually)
- MD: variable length chunks. Critical to video compression (per key frame)
- MD: if each chunk has all the time points
- MD: but also in favor of variable length chunks
- JMS: what's the connection?
- MD: video compression supports a large range of chunk sizes based on how quickly the video is changing
- JMS: just make the time chunking big enough and the internals become a detail
- JS: remove chunking across the time dimension
- currently inefficient, but with partial reads it could work
- let video codec request chunks of data from the store
- MD: internal to codec or explicit at the zarr level
- JM: difference in the fundamental model? (atomicity)
- MD: 1-dimensional delta codec, make it across an arbitrary dimension?
- AM: if Zarr intends to be the metadata format, this is a stress test.
- JMS: with fixed key rate,
- JM: see also https://mpeg-g.org/
- HZ: extension proposal
- implementation for multidimensional data analysis
- introducing auxiliary datasets in reduced datasets (non-scalar, accumulation value)
- helps to speed up computation. Ryan A. suggested a spec extension
- MH: averaging over time or spatial extensions
- JM: cf. https://github.com/zarr-developers/zarr-specs/issues/50
- IV: perhaps like transforms https://github.com/ome/ngff/issues/101
- JS: difference of whether or not it leads to an additional array
- IV: Non-uniform chunks – timing? (see the sketch after this agenda)
- conversation with JK at SciPy
- broad desire to have them exist. any objections?
- have several master's students to put on this
- JM: ZIC?
- IV: can discuss if in spec or as an extension
- JS: would still have a formal spec even if an extension (eases adoption, clear interface)
- IV: ZEP0001 timeline?
- SV: on me. working with Alistair to apply the modifications. ASAP.
- JMS: meta-issue for scheduling time to work on the V3 spec
- way to speed up progress? additional meetings?
- SV: ZIC meeting?
- JM: editorial meeting? Add JMS?
- JS: happy to be in discussions with AM but also open issues that need discussion
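Since non-uniform chunks came up twice above (variable-length chunks for video, and IV's question), here is a small sketch of one possible representation: an explicit list of chunk extents per dimension, which keeps the chunk grid well defined. Nothing here is spec'd; it is purely illustrative.

```python
import itertools

# Axis 0 is non-uniform (chunks of 10, 20, 15 elements); axis 1 has one 64-wide chunk.
chunk_sizes = [[10, 20, 15], [64]]

def chunk_slices(chunk_sizes):
    """Yield the n-dimensional slice covered by each chunk in the grid."""
    per_axis = []
    for sizes in chunk_sizes:
        offsets = [0, *itertools.accumulate(sizes)]
        per_axis.append([slice(offsets[i], offsets[i + 1]) for i in range(len(sizes))])
    yield from itertools.product(*per_axis)

for sl in chunk_slices(chunk_sizes):
    print(sl)  # (slice(0, 10), slice(0, 64)), (slice(10, 30), slice(0, 64)), ...
```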
## 2022-07-27
Attending: Josh Moore (JM), Brianna Pagán (BP), Ryan Abernathey (RA), Norman Rzepka (NR), Jeremy Maitin-Shepard (JMS), Greg Lee (GL), Trevor Manz (TM), Ward Fisher (WF), Davis Bennett (DB), Matt McCormick (MM), John Kirkham (JK), Parth Tripathi (PT)
Agenda:
- ZEP0001
- RA: ongoing review process :tada:
- JMS: long-list. perhaps we should just go through them
- NR: higher-level -- status of the extensions? going in? ZEP0002, 3, 4...
- RA: sets the groundwork for extensions. like the idea of keeping them narrow in scope
- NR: worried about a ZEP a month and the happiness of the ZIC. Perhaps batching them?
- JMS: review sharding as part of ZEP0001 since it was the motivation for many to have gotten involved. main benefit of V3
- MM: on sharding, would like to look towards the future (i.e. not necessarily finalization) to get it adopted across the implementations
- JM: not a lot of movement (speaking for others). definitely implementation needs work.
- NR: Jonathan is in stand-by waiting for decision. Could be ZEP0002. (He has a conflict at this time)
- MM: Great comments https://github.com/thewtex/shardedstore/issues/17
- Using sharding in a general way for simplicity, incl. with different stores.
- Looking to go through this in practice for large scale data.
- See working prototype. Pretty efficient. Works with v2 as well out of the box
- DB: understanding sharding as introducing an abstraction between the array and the store. Will that generalize to all non-sharded stores? (No-op shard?)
- NR: yes. Store shouldn't need to know about the storage transformers. Partial reads are helpful but not required.
- NR: at specification level (i.e. not just zarr-python) need to know how it will look like on disk.
- MM: could see trying to get ZEP0001 out. (**Proposal?**)
- but also: yes you can shard arrays, but what about groups (as additional need for the spec)
- useful for content-addressable/verifiable storage
- unrelated to all of the hierarchical formats
- separate shards per scale along with the related metadata (same for xarray)
- JMS: in the interest of getting ZEP0001 out, perhaps we hold off on sharding, as a delta to the…
- **tl;dr** --
- ZEP0001: focus on getting current work done but include storage transformer (:+1:)
- ZEP0002: Jonathan to start ~next week (making necessary adjustments to ZEP0001)
- MM to comment on PR or open alternative proposal
- then, in that same batch or in ZEP0003, the definition of extensions
- RA: process -- inventing it as we go. Bootstrapping so there will likely be a lot of feedback on how things work, but try to use that structure for the moment.
- JMS: on the outstanding issues
- fill values: consensus that it must be specified?
- JM: replaces DB's smart fill value logic?
- DB: clients can have a mapping or a callable, but it wasn't easy to make it work with the semantics (in the zarr-python)
- DB: easier if we make it required. gets past fundamental ambiguity
- JM: the upgrade scripts will need to be aware of this too (EOSS4)
- DB: 0 as sane default for uint8? etc. etc.
- consensus: require fill value be specified
- case sensitivity
- v3 says "case sensitive", reasonable except on e.g. OSX.
- Add a note?
- Alternative of escaping? (add-on)
- WF: file-system rather than OS bound (despite tight correlation per platform). NC doesn't assume the technical debt of working around file system limitations. Ergo? A user consideration, not something that can be fixed technically.
- path structure
- see previous meetings
- Options
- (A) removing "meta/" as nicer paths to metadata files
- Con: doesn't work for the consolidated metadata path
- JM: Workaround with "symlinks"
- (B) require suffix on the root (".zarr")
- (C) syntax for combining path and key: `path//key`, `path#key`, etc.
- JM: recently ran into a need/use-case for something like (A). import into OMERO, easier to work with the metadata as the main hierarchy.
- RA: good to think about having kerchunk style references encodable in Zarr
- discussions at scipy
- vibe, "why a new format?"
- needed the ability to have references to blocks in other files
- MM: "composite store" (like a sharded store, could also add in kerchunking possibly)
- adds in layer of indirection, doing indexing. tells you what's present.
- would need to be more well-supported than consolidated metadata.
- JM: how does it differ? MM: more flexibility in how it is broken up
- MM: very large dataset and doing analysis on one part of the dataset then that can be updated independently.
- JM: similar to Dennis' consolidation per hierarchy level. Yeah, like an octree.
- NR: but doesn't solve path
- JMS: if consolidated is a concern, then (A) won't work
- JM: plus symlink should work.
- JMS: proposing to drop "meta".
- RA: currently fsspec handles it. could be a formal URL scheme.
- JMS: some issues with URLs if you're opening RW
- RA: could see only RO to begin with.
- root directory
- reasoning is having a non-empty name
- JMS: just have ".zarr" as the name? best to skip that. (hidden files were an issue for V2)
- RA: talked about having the root document be a json file.
- JMS: special name?
- difference between .zgroup and .zarray
- leads to potential race condition
- DB: used to attributes.json from n5 and have never had an issue with it
- NR: easier if it's one name to not need to do two look ups
- endianness
- JMS: currently include types `<i2`, etc. more logical to say data types are logical (16-bit-signed-integers)
- JMS: make it a codec issue to deal with endianness since it only matters for raw encoding
- NR: need to specify it somewhere, even if just in the codec. blosc would need to know (anything byte based)
- JMS: filter rather than a data type
- NR: downside of having it in the datatype? codec could ignore it. (JMS: happens at the moment)
- JMS: numpy's endianness is a bit unusual. often you want to just use it in the native endianness.
- (Lots of nodding from Trevor)
- JMS: main benefit is to always give the user native endianness
- boolean/complex
- people were happy to have them (yes?)
- MM: boolean as 1 bit or 1 byte? JK: one byte. no bit-packing. (That could be a codec)
- `vector<bool>` as an example of over-optimization
- rawcodec (DB)
- never need to say "None"
- "raw"? intuitive?
- "identifiy", "noop", "dummy", "pass-through"
- JMS: similar to endianness. combining 2 things in codec. codec gets an array and not a stream of bytes. could arguably be **split** (see the pipeline sketch at the end of these notes)
- DB: separate configuration for each?
- JK: similar to filter vs. codec, not well spelled out in the spec. See Categorize for an example.
- JM: would make the choice to avoid compression explosion (e.g. for images)
- JK: there's already a meteorological compressor...
- JMS: linear chain of filter with a codec has issues
- current way to do it would be to encode a byte stream and use a compressor
- perhaps want separate compressors for different parts
- could the filter itself have additional filters/compressors for the labelled data vs. the indices
- JK: use cases? JMS: variable length strings, multisets, downsampling segmentations (similar to large number of categories)
- JMS: should be easy to fit it in now. have a tree and the filter becomes the codec.
- DB: filter vs codec? why a tree rather than an array?
- JK: original use case of filter is categorize `(RGB) -> (012)`
- filter as a transformation (on ndarray)
- DB: different type signatures?
- JMS: effectively not different in V2
- JK: mostly a terminology thing
- DB: "pipeline"
- TM: is "raw" just an empty list?
- JK: look at how parquet does it? (Ask Martin perhaps)
- TM: one pipeline with inputs/outputs for each codec, then you could encode numpy/bytes as desired and confirm that it's valid
- JMS: one codec location to an array? (nodding)
- JK: do we have chained codec use cases
- DB: someone at Janelia was working on that for segmentation of volumes
- similar to categorical
- see related paper. "gzip on top"
- JMS: similar to something in neuroglancer
- TM: bitshuffle/gzip for kerchunking? (to read HDF5 file)
- DB: semantics come from HDF5
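As an illustration of the **split** JMS suggests above (endianness and array-to-bytes conversion pulled out of both the data type and the compressors), here is a hedged sketch of a linear codec pipeline; class and method names are assumed for illustration and this is not zarr-python's API:

```python
import zlib
import numpy as np

class EndianBytes:
    """Single array -> bytes stage that owns the on-disk byte order."""
    def __init__(self, byteorder="<"):
        self.byteorder = byteorder
    def encode(self, arr):
        return arr.astype(arr.dtype.newbyteorder(self.byteorder)).tobytes()
    def decode(self, buf, dtype, shape):
        stored = np.dtype(dtype).newbyteorder(self.byteorder)
        return np.frombuffer(buf, dtype=stored).reshape(shape).astype(dtype)

class ZlibCodec:
    """bytes -> bytes compressor."""
    def encode(self, buf):
        return zlib.compress(buf)
    def decode(self, buf):
        return zlib.decompress(buf)

def encode_chunk(arr, to_bytes, compressors):
    buf = to_bytes.encode(arr)       # array -> bytes (endianness fixed here)
    for c in compressors:            # then any number of bytes -> bytes steps
        buf = c.encode(buf)
    return buf

def decode_chunk(buf, to_bytes, compressors, dtype, shape):
    for c in reversed(compressors):  # undo compressors in reverse order
        buf = c.decode(buf)
    return to_bytes.decode(buf, dtype, shape)

chunk = np.arange(12, dtype="int16").reshape(3, 4)
encoded = encode_chunk(chunk, EndianBytes("<"), [ZlibCodec()])
decoded = decode_chunk(encoded, EndianBytes("<"), [ZlibCodec()], "int16", (3, 4))
assert np.array_equal(decoded, chunk)
```

The point of the split is that the data type stays logical (`int16`) while exactly one stage in the chain decides byte order, and everything after it works on plain bytes.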
## 2022-07-13
* Cancelled for SciPy