---
tags: zarr, Meeting
---
# Zarr Bi-weekly Community Calls
### **Check out the website for previous meeting notes and other information: https://zarr.dev/community-calls/**
Joining instructions: [https://zoom.us/j/300670033 (password: 558943)](https://zoom.us/j/300670033?pwd=OFhjV0FHQmhHK2FYbGFRVnBPMVNJdz09#success)
GitHub repo: https://github.com/zarr-developers/community-calls
Previous notes: https://j.mp/zarr-community-1
## 2024-07-24
**Attending:** Davis Bennett (DB), Josh Moore (JM), Sanket Verma (SV), Fernando Cervantes (FC), Eric Perlman (EP), Ward Fisher (WF), Thomas Nicholas (TN)
**TL;DR:**
**Updates:**
- SciPy 2024 was great! 🎉
- DB: Zarr-Python updates
- Sharding codec is pickleable
- A decision needs to be made about the array API
- How should the sharding codec look to the user?
- DB: Easy to find if your array is sharded
- JM: Partial reading this in Zarr V2
- TIFFfile sets a bunch of flags - wonder whether those features are friendly for Zarr
- DB: All the arrays should have sharding configuration
- JM: Working with Tensorstore, the order of codecs didn't matter --> read_chunks / write_chunks
- DB: some weirdness when it comes to different backends when uncompressed
- New release - Numcodecs 0.13.0 - https://numcodecs.readthedocs.io/en/stable/release.html#release-0-13-0 - Thanks, Ryan!
- New codec added - Pcodec
- JM: Conda is unhappy
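
The read_chunks / write_chunks distinction raised in the sharding discussion can be illustrated with plain index arithmetic. This is a minimal sketch and not the Zarr-Python or Tensorstore API: `locate`, `write_chunk` (the shard shape) and `read_chunk` (the inner chunk shape) are hypothetical names, and it assumes a regular grid in which every shard dimension is a multiple of the inner chunk dimension.

```python
def locate(index, write_chunk, read_chunk):
    """Map an N-d element index to (shard, inner chunk, offset in inner chunk).

    write_chunk: shape of one shard (the unit written to storage).
    read_chunk:  shape of one inner chunk (the smallest unit that can be read).
    Hypothetical helper for illustration only.
    """
    if len(index) != len(write_chunk) or len(index) != len(read_chunk):
        raise ValueError("index and chunk shapes must have the same rank")
    for w, r in zip(write_chunk, read_chunk):
        if w % r != 0:
            raise ValueError("each shard dimension must be a multiple of the inner chunk")

    # Which shard holds the element, and where inside that shard it sits.
    shard = tuple(i // w for i, w in zip(index, write_chunk))
    within_shard = tuple(i % w for i, w in zip(index, write_chunk))
    # Which inner chunk of the shard, and the offset inside that inner chunk.
    inner = tuple(p // r for p, r in zip(within_shard, read_chunk))
    offset = tuple(p % r for p, r in zip(within_shard, read_chunk))
    return shard, inner, offset


# Element (130, 70) of an array with 128x128 shards and 32x32 inner chunks.
print(locate((130, 70), write_chunk=(128, 128), read_chunk=(32, 32)))
# -> ((1, 0), (0, 2), (2, 6))
```

A reader only needs to fetch inner chunk `(0, 2)` of shard `(1, 0)`, while a writer has to rewrite the whole shard, which is why the two sizes are worth exposing separately to the user.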
**Open agenda (add here 👇🏻):**
- Intros
- SV: Yosemite National Park
- JM: National Seashore in Florida - Gulf of Mexico
- FC: Jackson Lab working in ML - Acadia National Park
- EP: Zion National Park
- WF: Yellowstone National Park
- DB: Yellowstone National Park
- TN: Want to open issues on a bunch of ideas
- 1. Zarr reader to read chunk manifest and bytes offset - currently Xarray handles this
- Can use Zarr to open NetCDF directly
- 2. VirtualiZarr has lazy concatenation of arrays - Xarray has lazy indexing operations for arrays
- Long standing issue in Xarray to separate the lazy indexing machinery from Xarray - https://github.com/pydata/xarray/issues/5081
- DB: Could be handled and should be a priority now
- TN:
- JM: Agree with Davis with indexing - not sure if the abstraction layer for concatenation is correct!
- JM: Talked to 2 Napari maintainers - on a problem of chunking
- TN: A lot of people want to solve the indexing problem but neither Zarr nor Xarray exposes that
- JM: Finding more people with similar interests would help us provide more engineering power
- DB: Create a PR by copy-pasting code from Xarray!? - This could unlock a lot of use cases
- TN: VirtualiZarr does actually do that - but at the level of chunks rather than indices
- DB: Slicing and concatenation are duals - if you have both, it's complete
- DB:
- JM: Query optimisation can be tweaked as we move forward
- TN: When you do concat and slice you have identified a directed graph - you can optimise that plan - you can also hand off that plan to some reader
- JM: What does user do with the plan? Do they do something with it?
- TN: Array API folks have deliberately made arrays lazy
- GPU CI for Zarr-Python - https://github.com/zarr-developers/zarr-python/issues/2041
- GitHub and Cirun sound good and easy to set up
- Who pays? - Earthmover is ready to pay the cost for initial months and then switch to NF
- NF has money reserved for projects in the infrastructure committee for similar costs
- JM: Good to have it!
- SV: Need to get it sooner rather than later
- Zarr paper - https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/Appetite.20for.20a.20Zarr.20paper.3F
- JM: My poster was cited multiple times in the last few weeks
- JM: JOSS is a potential venue - IETF is more work
- TN: Submitting to a computing journal - W3C, IEEE, etc.
- TN: Xarray: https://openresearchsoftware.metajnl.com/articles/10.5334/jors.148
- JM: NetCDF: https://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg00087.html
- **TABLED**
- Using MyST for Zarr webpages - https://ossci.zulipchat.com/#narrow/stream/423692-Zarr/topic/Moving.20from.20Jekyll.20.E2.80.94.20Zarr.20webpages
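
The discussion above about slicing and concatenation being duals, and about optimising the resulting plan before handing it to a reader, can be sketched in a few lines. This is an illustrative toy for 1-D arrays only, not the Xarray or VirtualiZarr implementation; `Source`, `Slice`, `concat`, `slice_plan` and `materialize` are all hypothetical names.

```python
class Source:
    """A leaf of the plan: a named 1-D array whose data is only touched by read()."""

    def __init__(self, name, data):
        self.name = name
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def read(self, start, stop):
        return self._data[start:stop]


class Slice:
    """Lazy view of source[start:stop]; nothing is read until materialize()."""

    def __init__(self, source, start, stop):
        self.source, self.start, self.stop = source, start, stop

    def __len__(self):
        return self.stop - self.start


def concat(parts):
    """Lazy concatenation: an ordered list of Slice nodes."""
    return [p if isinstance(p, Slice) else Slice(p, 0, len(p)) for p in parts]


def slice_plan(plan, start, stop):
    """Push a [start, stop) slice through a concatenation.

    Only the pieces overlapping the request survive, so sources outside
    the requested range are never read.
    """
    out, pos = [], 0
    for piece in plan:
        lo, hi = max(start, pos), min(stop, pos + len(piece))
        if lo < hi:
            out.append(Slice(piece.source, piece.start + lo - pos, piece.start + hi - pos))
        pos += len(piece)
    return out


def materialize(plan):
    """Execute the plan: the only place where data is actually read."""
    result = []
    for piece in plan:
        result.extend(piece.source.read(piece.start, piece.stop))
    return result


a = Source("a", range(0, 10))
b = Source("b", range(10, 20))
c = Source("c", range(20, 30))

plan = slice_plan(concat([a, b, c]), 8, 13)
print([(p.source.name, p.start, p.stop) for p in plan])  # [('a', 8, 10), ('b', 0, 3)]
print(materialize(plan))                                  # [8, 9, 10, 11, 12]
```

The plan is plain data until `materialize` runs, so it could equally be inspected by the user or handed to a different reader, which is the question JM raises above.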
## 2024-07-10
**Attending:** Josh Moore (JM), Davis Bennett (DB), Fernando Cervantes (FC)
**Updates:**
- SciPy! :tada:
- Josh: testing zarr v3
- issue for each problem? Davis: sure
- Davis: to be fixed:
- no validation of fill value
- multiple bugs with sharding: 1d
- Josh: missing "attributes"
- Josh: but neuroglancer working?
- Davis: not for all static file servers. need PR.
- Davis: various forks. Josh: plugins? Davis: tough
- or: neuroglancer as a component that can be embedded
- Janelia NG is a React component.
- "Visualization is tough."
- Motion for food :knife_fork_plate: Seconded.
## 2024-06-26
**Attending:** Brianna Pagán (BP), Thomas Nicholas (TN), Dennis Heimbigner (DH), Eric Perlman (EP), Sanket Verma (SV), Davis Bennett (DB)
**TL;DR:**
**Updates:**
- Zarr-Python 3.0.0a0 out
- https://pypi.org/project/zarr/3.0.0a0/
- Good momentum and lots of things happening with ZP-V3 - aiming for mid July release
- SV represented Zarr at CZI Open Science 2024 meeting - various groups looking forward to V3 - https://x.com/MSanKeys963/status/1801073720288522466
- R users at Bioconductor looking to develop bindings for ZP-V3
- New blog post: https://zarr.dev/blog/nasa-power-and-zarr/
- ARCO-ERA5 got updated this week - ~6PB of Zarr data available - check: https://x.com/shoyer/status/1805732055394959819
- https://dynamical.org/ - making weather data easy and accessible to work with
- Check: https://dynamical.org/about/
- Video tutorial: https://youtu.be/uR6-UVO_3k8?si=cp0jOxrtKL_I6LfV
**Open agenda (add here 👇🏻):**
- BP: Will be talking about how Zarr is utilised at NASA!
- _starts screen sharing and presenting_
- BP: I work at Goddard GES DISC - deputy manager at one of the centres - manages team of developers and engineers - **not representing all the data centres**
- BP: A lot of people are coming into Zarr from the SMD (Science Mission Directorate)
- BP: Earth Science Division - EOSDIS and Distributed Active Archive Centres (DAACs) - DAACs focus on data distribution and management
- BP: All the centres are coming up with suggestions on best practices and the best format - we discuss with them the possibility of what they can, and should, use
- BP: Moving to cloud-optimized formats - DAACs have a ton of archival data in various formats
- BP: Projected growth for the entire Zarr store across all EOSDIS by 2030: 60 PB -> 600 PB!
- BP: GES DISC holds 7 PB of data - we have 3000 different collections of datasets - really diverse!
- BP: Giovanni - an interactive web-based program with 20+ services associated with it - taking the existing data and grooming the metadata so it's accessible and useful across a broader range
- BP: Over at NASA, we do a lot of Zarr stuff...
- Zarr V2 spec is approved data format convention for use in NASA Earth Science Data Systems (ESDS)
- Giovanni in cloud - duplicates Zarr (variable based)
- Open issue: continuously updating Zarr stores - Exploring lakeFS for managing dynamic data
- ZEP0005
- Brianna is leading the GeoZarr work
- VEDA - a number of Zarr/STAC-related things going on in VEDA
- TN: Does Giovanni read Zarr directly? If so, which reader does it use? (Can Giovanni use VirtualiZarr?)
- BP: Giovanni promotes variable-first search - most of Giovanni has OPeNDAP attached to it - it's built with overhead from the GES DISC pipeline - in hindsight - Yes!
- TN: From the slides - Xarray can take care of some of the stuff that Giovanni does
- TN: Very curious about the exact difference between the LakeFS idea and EarthMover’s ArrayLake
- BP: LakeFS is OS ArrayLake - no vendor lock-in
- SV: What does Giovanni actually do when you say, ‘it grooms metadata’?
- BP: Standardizes the grid - flip the grid - naming mechanism - smoothing the metadata so that it works across various services
- BP: Other metadata grooming: for example, we have a lot of time dimension issues. That's because of scattered best practices for how to store time metadata
- TN: Can we do the flipping with Zarr/VirtualiZarr?
- DB: If you flip at the store level - you'd need to find out how deep you'd need to go
- BP: Will try to make time standard across the datasets
- BP: https://github.com/briannapagan/quirky-data-checker
- BP: _from the Zoom chat_
- Zarr Storage Specification V2 is an approved data format convention for use in NASA Earth Science Data Systems (ESDS). https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices
- Giovanni in the Cloud, duplicate archive, zarr, variable-based: https://cmr.earthdata.nasa.gov/search/variables.umm_json?instance-format=zarr&provider=GES_DISC&pretty=True
- Open issue: continuously updating zarr stores. Exploring lakeFS for managing dynamic data
- ZEP 0005: Zarr accumulation extension for optimizing data analysis
- Looking into a GIS service for zarr stores
- POWER https://power.larc.nasa.gov/data-access-viewer/
- https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
- https://discourse.pangeo.io/t/metadata-duplication-on-stac-zarr-collections/3193/7
- EP: Converting OME datasets to V3 in the upcoming months - the quirky tool can be useful
- DB: V2 chunk encoding matches the V3 encoding - you just need to rewrite the JSON document
- DB: Playing with sharding - tensorstore is fast - need to figure out the nomenclature
- EP: The bio and geo world have parallel tracks and working in silos
- EP: https://forum.image.sc/t/ome2024-ngff-challenge/97363
- DB: The challenge doesn't seem interesting to me! - converting `JSON` documents - instead we should be focusing on converting existing data to sharded stores - a much more interesting problem
- EP: A bunch of data is non-Zarr - will be working to push it to the cloud and convert it to Zarr
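
DB's remark above that a conversion only needs to rewrite the JSON document can be sketched as a metadata translation. This is a hedged, partial sketch and not a Zarr-Python function: `v2_to_v3_metadata` is a hypothetical helper, it handles only C-ordered arrays with no filters and either no compressor or gzip, and replacing a null V2 fill value with `0` (or `False` for booleans) is an assumption made here because V3 requires a fill value. The `"v2"` chunk key encoding is what lets the existing chunk files stay where they are.

```python
import json

# Partial mapping from NumPy-style V2 dtype strings to V3 data type names.
_DTYPES = {
    "|b1": "bool", "|i1": "int8", "|u1": "uint8",
    "<i2": "int16", "<i4": "int32", "<i8": "int64",
    "<u2": "uint16", "<u4": "uint32", "<u8": "uint64",
    "<f4": "float32", "<f8": "float64",
}


def v2_to_v3_metadata(zarray, zattrs=None):
    """Translate a parsed V2 `.zarray` (plus optional `.zattrs`) into a V3 `zarr.json` dict."""
    if zarray.get("order", "C") != "C" or zarray.get("filters"):
        raise NotImplementedError("sketch handles only C order and no filters")
    dtype = zarray["dtype"]
    if dtype not in _DTYPES:
        raise NotImplementedError(f"dtype {dtype!r} is not covered by this sketch")

    # The bytes codec needs an explicit endianness only for multi-byte types.
    if dtype.startswith("<"):
        codecs = [{"name": "bytes", "configuration": {"endian": "little"}}]
    else:
        codecs = [{"name": "bytes"}]

    compressor = zarray.get("compressor")
    if compressor is not None:
        if compressor.get("id") != "gzip":
            raise NotImplementedError("sketch handles only gzip or no compressor")
        codecs.append({"name": "gzip", "configuration": {"level": compressor["level"]}})

    fill_value = zarray.get("fill_value")
    if fill_value is None:
        # Assumption: V3 requires a fill value, so pick a neutral default.
        fill_value = False if dtype == "|b1" else 0

    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": list(zarray["shape"]),
        "data_type": _DTYPES[dtype],
        "chunk_grid": {"name": "regular",
                       "configuration": {"chunk_shape": list(zarray["chunks"])}},
        # Keep the V2 key layout so existing chunk objects do not move.
        "chunk_key_encoding": {"name": "v2",
                               "configuration": {"separator": zarray.get("dimension_separator", ".")}},
        "fill_value": fill_value,
        "codecs": codecs,
        "attributes": dict(zattrs or {}),
    }


example = {"zarr_format": 2, "shape": [100, 100], "chunks": [10, 10], "dtype": "<f4",
           "compressor": {"id": "gzip", "level": 5}, "fill_value": None,
           "order": "C", "filters": None, "dimension_separator": "/"}
print(json.dumps(v2_to_v3_metadata(example, {"units": "K"}), indent=2))
```

Anything outside that narrow subset (Blosc, filters, F order, structured dtypes) raises instead of guessing, since those cases need a closer reading of the V3 codec definitions.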