# GeoZarr Spec Steering Working Group
Note that a [meetings summary](#Meetings-Summary) is provided at the end of the document.
Repository: https://github.com/zarr-developers/geozarr-spec
## Meetings Info
> :warning: The official meeting is scheduled for the **first Wednesday of each month** at 4:00 PM UTC. An optional second meeting may be held on the third Wednesday of the month, if needed.
**Next**: Wednesday, **November 6th**, 2024
Time: EDT: 11:00AM - PDT: 8:00AM - UTC: 4:00 PM - CET: 5:00PM - [[more](https://www.worldtimebuddy.com/?qm=1&lid=100,12,5128581,5368361&h=100&date=2024-4-17&sln=15-16&hf=1)]
Video call link: https://meet.google.com/jth-rstn-fwb (or dial: [phone numbers](https://tel.meet/jth-rstn-fwb?pin=9845739928663))
## November 6th, 2024
### Attendees
* Tyler Erickson
* Raphael Hagen
* Piotr Zaborowski
* Chris Little
* Christophe Noël
* Max Jones
* Colby Fisher
* Ryan Avery
### Agenda
1. **Actions**
- Agreed to create a dedicated wiki page on GitHub to track actions raised during the meeting. This page will ensure that all action items are recorded and accessible to team members.
2. **Workplan / Roadmap**
- [Workplan Wiki](https://github.com/zarr-developers/geozarr-spec/wiki/GeoZarr-Roadmap)
- **Topics**: Requested a link to any related discussions or specs for the various topics under the workplan.
- **Participation**: Asked all interested members to update the topic list, indicate their interest, and specify the topics they will work on.
- **CF Mapping**: Suggested that CF (Climate and Forecast) metadata mapping should be identified, including any minimum required properties. However, the use of multiple terminologies is encouraged and is a common practice in recent specifications.
3. **GeoTransform Conclusions**
- Awaiting input from Brianna. Noted that conclusions should consider OGC (Open Geospatial Consortium) guidelines, and discussions are on hold.
4. **OGC Tools (GoToMeetings)**
- Recording meetings and transcriptions are common practice, and the materials should be made available afterward.
- Some members prefer open tools like Google Meet, although it lacks recording functionality. GoToMeeting may be used but only if it does not restrict attendance.
5. **Resampling method**
- Max Jones presented warp resampling methods, comparing performance and memory usage across libraries. Key findings include that GDAL-based tools like Open Data Cube and Rioxarray are efficient for local datasets. Virtualizing NetCDF files as Zarr enhances performance, and web-optimized formats, especially those with overviews, dramatically reduce processing time. Future improvements should focus on supporting Zarr overviews, pre-generating weights, and optimizing Xarray imports, as these enhancements could significantly boost resampling efficiency.
6. **AOB**
- Christophe asked if the issue with the number of objects generated by Zarr, which poses a cost barrier to its adoption by the Copernicus Data Space Ecosystem, is addressed by Zarr v3.
- Note (after the meeting): the sharding codec can mitigate this problem. [Read more here](https://zarr-specs.readthedocs.io/en/latest/v3/codecs/sharding-indexed/v1.0.html).
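Note (illustrative sketch): the effect of sharding on object count can be estimated with simple arithmetic, since a sharded array writes one object per shard rather than one per chunk. The array shape, chunk shape, and shard shape below are hypothetical values chosen for illustration, and metadata objects are not counted.

```python
import math


def grid_size(shape, unit):
    """Number of units (chunks or shards) along each dimension, using ceiling division."""
    return tuple(math.ceil(s / u) for s, u in zip(shape, unit))


def object_count(shape, chunks, shard=None):
    """Stored data objects for one array: one per chunk, or one per shard if sharding is used."""
    unit = shard if shard is not None else chunks
    return math.prod(grid_size(shape, unit))


# Hypothetical daily global grid (assumption, not taken from the meeting).
shape = (365, 21600, 43200)
chunks = (1, 1080, 1080)      # inner chunk shape
shard = (1, 10800, 10800)     # each shard groups 10 x 10 inner chunks

unsharded = object_count(shape, chunks)
sharded = object_count(shape, chunks, shard)
print(unsharded, sharded, unsharded // sharded)
```

With these assumed shapes, grouping 10 x 10 chunks per shard reduces the object count by a factor of 100 while keeping the inner chunk as the unit of partial reads.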
### Geo Transform
Objective: To link pixel coordinates to geographic coordinates in a condensed way
* Relationship and complementarity with Map/Coverage standards (e.g. old and new APIs)? See [this summary](https://github.com/zarr-developers/geozarr-spec/issues/17#issuecomment-2328167623) (OGC core axis-aligned ?)
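
Note (illustrative sketch): a six-coefficient affine geotransform in the GDAL ordering is one condensed way to link pixel coordinates to geographic coordinates. The example grid below (0.25 degree global grid) is an assumption for illustration, not something agreed in the meeting.

```python
def pixel_to_geo(gt, col, row):
    """Map a (col, row) pixel index to (x, y) using a GDAL-ordered geotransform.

    gt = (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height)
    """
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y


# Hypothetical north-up global grid at 0.25 degree resolution (assumption).
gt = (-180.0, 0.25, 0.0, 90.0, 0.0, -0.25)

upper_left = pixel_to_geo(gt, 0, 0)
lower_right = pixel_to_geo(gt, 1440, 720)
print(upper_left, lower_right)
```

Six numbers describe the position of every pixel corner, regardless of how large the raster is.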
## August 28th, 2024
- Brianna Pagan
- Kevin Sampson
- Ethan Davis
- Felix Cremer
### Saving of geo transform
Brianna showed a saved geozarr with geotransform
### Chunk position information
What do you want to show to the user
User should not think about chunking
## July 24th, 2024
- Kevin Sampson
- Colby Fisher
- Brianna Pagán
- Martin Durant
### Agenda
- Follow up on: https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140/33
- Martin: Need to store it, need to transform, and need to be implemented before you can demo. It's the tool's responsibility to describe the world. If you allow for different ways to describe the coordinates you don't need to feel locked in.
- Brianna: to-do on me to create a zarr store with the suggested metadata described above in the pangeo forum.
- https://github.com/pydata/xarray/issues/6448
## July 10th, 2024
Cancelled, no attendees.
## June 26th, 2024
### Attendees
- Kevin Sampson
- Steve Olding
- Alexey Shiklomanov
- Brianna Pagán
- Ethan Davis
- Felix Cremer
### Agenda
- Moved all notes from 2023 to: https://hackmd.io/@briannapagan/geozarr-swg-2023 as we exceeded hackmd note length
- Go over Brianna's response to: https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140/32
- TLDR is to add:
- GeoTransform
- number of lat/lon
- lat_bounds/lon_bounds <> offset? AREA_OR_POINT for geotiff
- [Alexey]: Here's the [GeoTiff standard](https://docs.ogc.org/is/19-008r4/19-008r4.html). The relevant discussion of whether pixels refer to corners or centers is, maybe, "Raster Space" heading: `PixelIsArea` raster space assumes the data are _between_ indices, while `PixelIsPoint` raster space means the data correspond exactly to the data.
- [Alexey]: `gdalinfo` on an example (cloud optimized) GeoTiff shows `AREA_OR_POINT=Area` (under `Metadata:`) for this.
- For chunks, chunk reference index
- This would allow going from implicit to explicit coordinates: each chunk refers to the original GeoTransform and uses its chunk reference index to calculate explicit coordinates when needed.
- https://github.com/zarr-developers/geozarr-spec/pull/19 related.
- Live demo of going roundtrip geotiff -> zarr -> geotiff: https://github.com/briannapagan/geozarr-validator/blob/main/tiff-roundtrip.ipynb
- Discussion:
- Alexey: Yes, to me the issue was always getting xarray/CF to understand affine transforms. Need to iron out preferred ways to define the position in a pixel, refer to GDAL. Might have some critics with how to store CRS/WKT
- [Alexey] Recommendation:
1. Explicit dictionary referring to EPSG; e.g., `{"crs": {"epsg": 4326}}`
2. Explicit dictionary including WKT (as a string?; e.g., `{"crs": {"wkt": "<long multi-line WKT string...>"}}`)
- Brianna: Martin's case below where a small GRIB file opening in xarray becomes huge trying to explicitly list out coordinates.
- Ethan: Going GRIB to netcdf, converting to explicit lat/lon.
- Alexey: This solution unlocks the ability to leverage the speed of raster subsetting for netcdf like data.
- GDC SWG meetings will be started again in combination with the Testbed 20 GDC group. They'll start June 27th, 16:00 CEST and take place every week. Meeting links are on the OGC portal now.
- Brianna will try to join these meetings to discuss overlap; she has had some preliminary discussions with Matthias Mohr.
- With regards to "file/data formats", the GDC work has been pretty unspecific (or let's say format agnostic), many people did actually export GeoTiffs in TB19, haven't seen a lot of netCDF
- How to handle multiple resolutions / overviews?
- [Alexey]: GeoTiff metadata includes `Overviews: 1830x1830, 915x915, 458x458, 229x229` and possibly `OVR_RESAMPLING_ALG=NEAREST`
- Felix: suggested setting up a TMS server on top of zarr
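
Note (illustrative sketch): the "chunk reference index" idea from the TLDR above can be sketched as deriving a per-chunk geotransform from the dataset-level GeoTransform. The function name, the `(chunk_row, chunk_col)` index layout, and the example values are assumptions for illustration; the notes do not define a concrete encoding.

```python
def chunk_geotransform(gt, chunk_index, chunk_shape):
    """Derive the geotransform of one chunk from the dataset-level geotransform.

    gt          : GDAL-ordered geotransform of the full array
    chunk_index : (chunk_row, chunk_col) position of the chunk in the chunk grid
    chunk_shape : (rows, cols) of a regular chunk
    """
    row0 = chunk_index[0] * chunk_shape[0]   # first pixel row covered by the chunk
    col0 = chunk_index[1] * chunk_shape[1]   # first pixel column covered by the chunk
    x0 = gt[0] + col0 * gt[1] + row0 * gt[2]
    y0 = gt[3] + col0 * gt[4] + row0 * gt[5]
    # Pixel size and rotation terms are unchanged; only the origin moves.
    return (x0, gt[1], gt[2], y0, gt[4], gt[5])


# Hypothetical 10 m projected grid with 512 x 512 chunks (assumption).
gt = (500000.0, 10.0, 0.0, 4600000.0, 0.0, -10.0)
chunk_gt = chunk_geotransform(gt, chunk_index=(2, 3), chunk_shape=(512, 512))
print(chunk_gt)
```

Only the dataset-level transform needs to be stored; explicit coordinates for any chunk can be computed on demand from its index.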
## June 12th, 2024
### Attendees
- Brianna Pagán
- Kevin Sampson
- Colby Fisher
- Chris Little
- Felix Cremer
- Emma Marshall
CNL: not available
Felix Cremer: Try to join, but will be 15 minutes late
### Agenda
- CNL: If there are no objections, I'd suggest merging it so we can start creating pull requests on specific topics and sections. This should help optimize improvements and collaboration.
- https://github.com/zarr-developers/geozarr-spec/pull/47
- Working with GDC SWG
- Brianna reached out to Matthias Mohr, testbed 20 will start June 21st, next GDC SWG June 20th, 9:00am - 10:30am EDT, see https://portal.ogc.org/meet/
- Deadline passed for testbed proposals, Brianna will see if/how GeoZarr can fit into any existing one
- Question of how/if GDC supporting netcdf-like data
- OGC Code sprint for the OGC APIs sitting in on the webinar
- June 13, we will run the pre-event welcome webinar, for the Open Standards Code Sprint. The webinar will run from 14:00 to 15:00 UTC+1, and it will set the context for the upcoming code sprint, by presenting the overviews and sprint goals. If you are planning to attend the sprint next week, we highly recommend attending the webinar, as this information will not be repeated in the kick-off session.
The webinar will take place online, at the #Main Stage of the OGC Events Discord server. For logis
- Storing ```affine_transform``` OR ```coords```
- Need more than just geotransform, need to know whether x/y is trying to represent center or elsewhere on the grid, so need to store offset
- For us, we can store GeoTransform and the offset between large box and tiles
- CL: GIS assumes it's an image, CF conventions say whether it's in the center or edge.
- Geotransform is corner of cells
- https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140/31
- Brianna: propose to store GeoTransform and offset
- Round tripping between data formats, Felix opened an issue https://github.com/zarr-developers/geozarr-spec/issues/50
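
Note (illustrative sketch): the corner versus center question above amounts to a fractional pixel offset applied when expanding a transform into explicit coordinates. The helper name and the `offset` parameter are hypothetical; `0.5` stands for cell centers and `0.0` for the leading cell edge.

```python
def axis_coords(origin, step, n, offset=0.5):
    """Expand one axis of a geotransform into explicit coordinates.

    origin : coordinate of the outer edge of the first cell
    step   : cell size along the axis (negative for a north-up y axis)
    n      : number of cells
    offset : 0.5 for cell centers, 0.0 for the leading cell edge (corner)
    """
    return [origin + (i + offset) * step for i in range(n)]


# Hypothetical 1 degree longitude axis (assumption).
lon_centers = axis_coords(-180.0, 1.0, 360)              # cell centers
lon_edges = axis_coords(-180.0, 1.0, 360, offset=0.0)    # cell corners
print(lon_centers[0], lon_centers[-1], lon_edges[0], lon_edges[-1])
```

Storing the offset next to the GeoTransform lets a reader reproduce either interpretation without guessing.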
### Action Item
- [ ] Organize notes and action items, keep Meeting Summary updated, clean up open issues so that people can async help better. (Brianna)
## May 29th, 2024
### Attendees
- Christophe Noel
- Christine Smit
- Kevin Sampson
- Emma Marshall (University of Utah)
- Brianna Pagán
- Max Jones
- Colby Fisher
- Ryan Abernathey
- Chris Little
- Martin Durant
### Agenda
- Christophe: created PR to discuss the table of contents and some core principles of the OGC spec: [PR-47](https://github.com/zarr-developers/geozarr-spec/pull/47). Each topic discussed during meetings should have a PR within a section (+ zarr example)
- Check all topics are covered
- Check that it supports all relevant source formats
- Check that it reconciles GDAL/CF ecosystems for the encoding of rasters (coordinates and projection encoding, n-d dimensions, etc.)
- CL: WMO standards have the same status as ISO standards and say a lot about coordinates in a non-GIS, very meteorological way. Wants to raise as an issue: translate WMO to GIS terminology.
- RA: To be successful, we need a standard that's useful for both EO satellite and climate/weather data.
- CL: GIS community has an assumption that anything raster is numeric and 2D.
- RA: If take gdal, quite flexible in how it handles coordinates. xarray can't do GIS style referencing.
- BP: Discussion about saying terms like 'regular' grid, 2D
- CN: grid_mapping can include geotransform, raster 2D, in multi-dimensional array, is it still the right way to describe the projection, rasterio creates grid mapping within coverage variable which might be more than 2D
- Christophe: A coverage (2D or more) generally has a corresponding affine transformation (projection) for the latitude/longitude dimensions. The projection can be described in the grid_mapping variable. The coordinates for each pixel might be encoded/described as labelled arrays (NetCDF), origin offset (GDAL), vectors.
* I propose using the conformance class (property `conformsTo`) in the coordinate variable of the GeoZarr model to indicate which type of encoding is used.
- Brianna: For spec want to propose the ability to check whether data has ```affine_transform``` OR ```coords```. Original thought was some type of flag to distinguish between the two (Max Jones and I met last week and had the same independent idea), but if range index ability is supported in xarray, this can be the same field with different metadata associated with it
- Until then, modified validator to fit within pydantic-zarr
- Max looking into how abstracting would work for pyramiding
- MD: affine is not the only one, but good start. Astro coord system is capable of doing all the things we talked about, what it doesn't have is the set of complex coordinate systems, need an extensible way
- RA: GDAL has [affine geotransform](https://gdal.org/user/raster_data_model.html#affine-geotransform), aim to support this, don't need to go beyond at this point.
- MD: aim for extension mechanism for other people to bring solutions. Affine is straightforward
- RA: Need new custom index, the problem is when you go to write this, the presence of explicit coords will trump; once you write data with explicit coords, it will use explicit rather than transform
- MJ: Broader than just how to write, if you read it and write with rounding errors coords are less precise, people are also manipulating data, that affine transform is not accurate, the minute you do isel, propagating as fixed attribute is not going to work.
- RA: How does xarray deal with time, in its encoded and decoded states?
- MJ: https://github.com/carbonplan/xrefcoord/blob/f7c46c845cb34175ab56a49a26941257a457c87c/xrefcoord/coords.py#L22-L46
- RA: can we create an index, the other part is serialization. Right now xarray is special w.r.t. time. Can put custom decoding in the back ends
- RA: can we enumerate on different options
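
Note (illustrative sketch): MJ's point about `isel` can be seen in a small example. Once an array is subset, the stored transform no longer describes it unless the origin is recomputed. The helper below is hypothetical and only handles contiguous slices with non-negative starts; it is not xarray or rioxarray behaviour.

```python
def slice_geotransform(gt, row_slice, col_slice):
    """Return the geotransform describing array[row_slice, col_slice].

    Only contiguous slices (step of None or 1) with non-negative starts are handled.
    """
    for s in (row_slice, col_slice):
        if s.step not in (None, 1):
            raise ValueError("strided slices are not handled in this sketch")
    r0 = row_slice.start or 0
    c0 = col_slice.start or 0
    x0 = gt[0] + c0 * gt[1] + r0 * gt[2]
    y0 = gt[3] + c0 * gt[4] + r0 * gt[5]
    return (x0, gt[1], gt[2], y0, gt[4], gt[5])


# Hypothetical 30 m grid (assumption).
gt = (0.0, 30.0, 0.0, 3000.0, 0.0, -30.0)
subset_gt = slice_geotransform(gt, slice(10, 20), slice(5, 15))
print(subset_gt)   # the origin has moved, so the original attribute would be stale
```

This is why a transform carried as a fixed attribute is fragile: any indexing operation has to update it, which is the kind of bookkeeping a custom index could take over.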
## May 15th, 2024
### Attendees
- Ryan Abernathey
- Christophe Noel
- Christine Smit
- Ethan Davis
- Brianna Pagan
- Wietze Suijker
- Kevin Sampson
- Felix Cremer
- Colby Fisher
### Agenda
- Changing how to call meetings
- Calendar invites? Some folks are receiving them, some are not. I'm inclined to cancel and have folks use this page to know how/when to join. -Brianna
- I have two entries in my calendar, looks strange - Christophe
![image](https://hackmd.io/_uploads/r1_7jfMmR.png)
- First iterations of validator (Brianna): https://github.com/briannapagan/geozarr-validator/
- grid_mapping?
- time bounds
- Inspo: https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr
- Should validation be in terms of JSON-schema? https://github.com/zarr-developers/zarr-specs/pull/262#issuecomment-1729053211
- Or Zarr Object Model: https://github.com/zarr-developers/zeps/pull/46
- Coordinate System Models: https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140/27
- How is geozarr different from CF?
- _ARRAY_DIMENSIONS is not zarr but comes from xarray to make zarr look netcdfish
- Zarr lets you put whatever you want into these attributes
- zarr v3 has dimensions built in
- Decide on which level the validator sets on
- do not open the raw json but use the zarr package
- We could operate on a netcdf data model
- We don't have to validate that it is a netcdf but look at the attributes and dimensions
- Whatever we agree on in the validator should be also included in the spec
- Current is in the details and it needs to be either the zarr data model or the netcdf or CF data model and not go into the details of the encoding
- What is the goal of GeoZarr and how does it fit in with the existing efforts?
- CS: CF is a very complex standard and we wouldn't want to write a new CF validator
- ED: 90 % of CF data uses a very basic subset of the CF standard
- if you build a validator you have to look at the weird cases
- CF validator will fail for the planet data
- Mix between TIFF and NetCDF world
- We have two opposite objectives
- Wide range of data sizes
- facilitate a wide range of clients
- Do we want to start with basic examples and then open this to a wide range of conventions?
- Or do we want to be compliant with certain conventions like CF?
- Dataset contains coordinates, variables
- Start with dataset then you can open it with xarray
- Start with a minimal subset of CF conventions
Make up the standard that we need to make this happen
GDAL already has a spec to make this possible
Can we open that on the python side and do something with it
That happens in rioxarray
This conversion is not reversible and we would need to see which information is needed to make it reversible
Explicit versus affine transformation
Make the software do what it needs to do and get the spec out of that
- Do we want to be the meeting ground to find this out.
- Should this be more an implementation meeting?
- There are other conventions that build on CF and specify which parts of CF are required
We need to decide what we want to do
Whether we want to include geotiff like data in the first iteration
Delivering something is important
specs are not updated very often
Maybe we should rather take our time to confront the root issue
If a library says it supports geozarr does it need to support both different data models?
Enumerate the different tools
There is python and gdal which cover 90% of usage
Does GDAL understand both?
If GDAL is doing both so the only thing we would get is on the python side
You can't really write that spec starting with data from xarray
Interoperability means round tripping the data without losing information
Is it obvious how GDAL does it and how to translate that to zarr?
Is there any reason to not take this and implement it in python?
Attrs are not removed in V3 only how it is saved on disk is changed
- Progress on conforming to OGC template (Christophe/Brianna?)
- Multi-scale PR updates?
- Tried implementing on julia issues with
- Difference in webmapping world and geotiff overview model of aggregating data, not sure we want to mix this, if you want a tile matrix set, there is another layer that needs to be specific on top
- Serving up a TMS on top of a zarr might be a different tooling than just overviews of data
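
Note (illustrative sketch): the GeoTIFF-style overview model mentioned above aggregates the data by a fixed factor per level. The overview sizes quoted in the June 26th notes (1830, 915, 458, 229) are consistent with halving and rounding up; the 3660 x 3660 base size and the ceiling-division rule below are assumptions chosen to reproduce those numbers.

```python
import math


def overview_sizes(width, height, levels, factor=2):
    """Sizes of successive overviews, each reduced by `factor` with ceiling division."""
    sizes = []
    for _ in range(levels):
        width = math.ceil(width / factor)
        height = math.ceil(height / factor)
        sizes.append((width, height))
    return sizes


# Assumed full-resolution size; only the overview sizes appear in the notes.
sizes = overview_sizes(3660, 3660, levels=4)
print(sizes)
```

A tile matrix set adds further constraints on top of this (fixed tile sizes and a defined grid per zoom level), which is the extra layer the discussion refers to.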
## April 17th, 2024
> `CNL`: not available today (still plan to bootstrap the OGC template with core definitions - during April)
### Attendees
Brianna Pagan (NASA GES DISC)
Felix Cremer (MPI BGC Julia programmer)
Christine Smit (NASA GES DISC)
Martin Durant (Anaconda)
Kevin Sampson ()
Doug Newman (NASA ESDIS)
Ryan Abernathey
Colby Fisher
### Agenda
- Ethan and Brianna tag-up on compression algorithm support
- Had some discussion before that "default zarr compression (blosc) is not a standard compression that netcdf has used so it's not available with NCZarr" but we found documentation showing it should: https://www.unidata.ucar.edu/blogs/developer/entry/nczarr-support-for-zarr-filters Ethan is following up
- Martin: Two major versions of blosc with different codecs, so could not be full support
- Brianna and Christophe tag-up on branch for refactoring existing write-up to OGC template
- Any updates on discussion from Ryan's demo/blog from last meeting: https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140
- Martin: GRIB <1000 bytes, but always adds coordinates; load with xarray and it will be over 100MB in memory.
- Felix: is there anything julia ecosystem can assist/learn from this
- Ryan: We need a concrete proposal, every time something simple is suggested, many responses as to why it's more complicated, feel stuck. https://gdal.org/user/raster_data_model.html
- Martin: implement something like affine transform, and this is the implementation and a way from going from standard tags in geotiff to your explicit implementation would go a long way.
- Ryan: what would success look like, write code that we want to work, then ask where should it be implemented. Ultimately new index type in xarray, but need to define success within the group. Keep analytical not float info. Preserve analytic coordinates
- load data from geotiff save to zarr and load it again and save to geotiff and the coordinates should be the same
- Felix: idea is to save it as GDAL saves it
- Martin: how gdal defines attributes is fine
- Ryan: forget geo, just 1-D, defined analytically, A->B, shouldn't need to save every coordinate. Don't have to treat it as data.
- Felix: in Julia you can use whatever array as dimension, not sure how these are read/saved
- Martin: not language issue, a library problem, astronomers don't have this problem although they have analytic. Need POC
- Christine: dimensions matter when you're trying to query. when would a tool use this information, you need a function to decide when you need indexes
- Martin: Yes, Logical to analytical indexes function is needed
- Christine: 1) i just want to open the file, i want xarray to do the right thing. 2) actual implementation people who want to get more in the weeds
- Brianna: it's easy enough A->B, but when we add the geospatial, that's where the convo gets blocked
- Ryan: PROJ does this, the difficult part is with serialization, how can we tell that the coordinate is present, how do we identify and is that interoperable for non-python, non-xarray software. Xarray developers need to show that I can create an xarray dataset that has this type of analytic coordinate system and query it. After that we can tackle encoding
- Felix: we have this in julia, if we save to zarr just an array as integer, just a vector, why we need to talk about serialization
- Ryan: can we create a 1D xarray, save to zarr, open in julia, get a properly encoded, save in zarr, pass it back and forth.
- Felix: how would you save it.
- Ryan: for range it's start, stop, # of points, metadata variance you want to save, is it for the center of pixel. Start, stop, or offset/scale, you need to know how many points, already known if it's describing another array. Encode these floating point numbers in a lossless way, in zarr you can put them in metadata or another array, in another array probably not needed, but would encode in an optimal way, putting it into metadata as json, you want to put full bytes rather than txt based rep as number.
- Christine: advantage of separate array, can take CF approach for describing in the metadata with additional attributes
- Ryan: push as much encoding as possible in zarr, virtualizarr. If xarray sees a variable already opened by zarr that has units of days since some day, it recognizes that it's time and decodes it.
- Christine: time is more of a pain.
- Ryan: motivated here to figure out decode index, first step is in xarray dev supporters
- Pangeo/NASA funding discussion scheduled for later this afternoon: https://discourse.pangeo.io/t/nasa-funding-and-the-pangeo-ecosystem/4136
- https://github.com/zarr-developers/geozarr-spec/pull/44
- Felix: trying to build and save to disc, not exactly sure how tile matrix set is going to work, if we have some dataset, would we be able to add
- Set up dedicated agenda item
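
Note (illustrative sketch): Ryan's description of a range coordinate (start, stop, number of points, with floats kept as bytes rather than text) could look like the following. The metadata field names and the hex encoding are assumptions for illustration, not a proposed GeoZarr encoding.

```python
import json
import struct


def encode_range(start, stop, num):
    """Describe a regularly spaced coordinate by its endpoints and length.

    Endpoints are stored as hex strings of their little-endian IEEE-754 float64
    bytes, so the JSON document carries the exact bit patterns.
    """
    return {
        "start": struct.pack("<d", start).hex(),
        "stop": struct.pack("<d", stop).hex(),
        "num": num,
    }


def decode_range(meta):
    """Rebuild the explicit coordinate values from the range description."""
    start = struct.unpack("<d", bytes.fromhex(meta["start"]))[0]
    stop = struct.unpack("<d", bytes.fromhex(meta["stop"]))[0]
    num = meta["num"]
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Round trip through JSON text, as it would be stored in array metadata.
meta = json.loads(json.dumps(encode_range(0.1, 359.9, 3600)))
coords = decode_range(meta)
print(len(coords), coords[0])
```

The stored description stays a few dozen bytes however long the axis is, which is the contrast with Martin's GRIB example where explicit coordinates dominate memory use.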
## April 3rd, 2024
### Attendees
- Brianna Pagán
- Ryan Abernathey
- Ethan Davis
- Tadd Bindas
- Max Jones
- Anthony Cak
### Agenda
- New branch for conforming issue #34 with OGC template (Brianna)
- Ethan: for CF just need groups, arrays and attributes. An extension of Zarr changes the encoding, whereas a convention is just an extra metadata that is visible to any zarr. CF is completely visible to anything that understands netcdf, whereas an extension you need not just zarr, but you need the zarr extension
- Ryan: still in the process of figuring it out for zarr, we can define some conventions, extensions are things that require changes or augmentations to the core data model. If extension is not understood, you cannot decode the data, an example that needs to be an extension, variable size chunks, if you try and go in to read zarr data, and your implementation doesn't know how to decode. Conventions should still be operable under vanilla zarr, multi-scale is an example, people use this. xarray came up with its own convention for putting convention names. This group should try to do everything through conventions, cross post with OGC. Dont get into the data model layer, like how jsons are structured.
- https://zarr.dev/zeps/draft/ZEP0004.html
- This needs to be linked to the PR for the zarr spec that implements the ZEP https://github.com/zarr-developers/zarr-specs/pull/262
- https://github.com/zarr-developers/zarr-specs/pull/262/files#diff-cacd72e8200bb6b7fb7e9ee8709abb11ecd292bb6c462f0fe402fdc46bb77927
- Describes the xarray-zarr convention
- Ryan: there are aspects where some domains want to adopt units without adopting all CF
- Still would like an example zarr file for https://github.com/zarr-developers/geozarr-spec/pull/44
- Demo from Ryan and blog: https://discourse.pangeo.io/t/example-which-highlights-the-limitations-of-netcdf-style-coordinates-for-large-geospatial-rasters/4140
- In memory information and serialized information, needs to go somewhere in metadata
- Could create custom xarray index that understands
- Ryan will make a post on pangeo discourse, discuss possible solutions, implementing an xarray custom index that supports this type of coordinate system
- Tadd: lazy coordinate system?
- Ryan: maybe implicit, lazy implies there is data just not loading yet. Lazy concept is useful but slightly different
- Tony: This is my exact use case, i have errors trying to create a zarr store using xarray from a large
- Developing tool for checking compliance
- Max: interested in a checker beyond just the convention but also looking at the data, https://discourse.pangeo.io/t/nasa-funding-and-the-pangeo-ecosystem/4136/3
## March 20th, 2024
### Attendees
- Brianna Pagán
- Christine Smit
- Colby Fisher
- Kevin Sampson
- Max Jones
- Christophe Noel
- Steve Olding
### Agenda
- Still haven't heard back on sub-group creation from Scott
- Have folks requesting 'observer' status, this will only come into play when we're voting on items
- Presentation at NOAA Enterprise Data Management Workshop in May
- Can the summary from Christophe: https://github.com/zarr-developers/geozarr-spec/issues/34 be submitted as PR?
- Adapt definitions, in a more agnostic way and map to zarr model, will be good start to adapt
- Writing this with OGC template
- Check that OGC templates are auto converting to pdfs successfully
- Ethan and Kevin as reviewers
- CN: Maximize interoperability of format, tradeoff of datasets encoded in zarr and maximizing the tools, recommendation versus requirement classes, do we allow a zarr to have multiple datasets, dataset with children datasets, this complicates how tools read the files. Have a requirement class which says this geozarr is complex, or the opposite, this geozarr is a simple dataset, maps to native-format
- CS: CF didn't accept group structure, recently added
- Compressor lit review (open action item, still need to tag up)
- From Christine last meeting "default zarr compression (blosc) is not a standard compression that netcdf has used so it's not available with NCZarr," however I am seeing [NCZarr Filter Support](https://docs.unidata.ucar.edu/netcdf-c/current/filters.html#filters_nczarr) a reference to blosc. Can Ethan confirm?
- Will merge: https://github.com/zarr-developers/geozarr-spec/pull/44
- Developing tool for checking compliance
- any lessons learned from CF checkers or COG checkers?
- https://cogeotiff.github.io/rio-cogeo/CLI/
### Action Items
- [ ] Brianna to try to adapt issue-34 as PR using OGC template, Ethan and Kevin (kmsampson) to review
- [ ] Brianna scheduling alternative bi-weekly coworking session to develop a tool that checks if a zarr store is compliant with existing specification
## March 6th, 2024
### Attendees
Max Jones
Michelle Roby
Ethan Davis
Brianna Pagán
Christophe Noël
Tadd Bindas
Ryan Abernathey
Felix Cremer
Lars Barring
Sean Harkins
### Agenda
- Presentation to OGC netCDF SWG last week (Ethan)
- Specifying the Organizational Structure of GeoZarr (title edited) [#34](https://github.com/zarr-developers/geozarr-spec/issues/34) (Christophe)
- Brianna: this is the same point i bring up below in agenda for how to handle zarr_format
- Christophe: never added the mapping between geozarr and zarr, always has been implied, but we can follow NCZarr approach of how this is mapped.
- Ethan: I found current spec confusing, spelling out dataset, data array etc there's some parts that are CF, and other parts that seem to replace CF. If you look at CF-data model, it has a lot of details on how CF works with these kinds of things. Would be good to have netCDF OGC SWG to have more people from CF world to look at this and how to clarify how much CF is used. In terms of NCZarr and xarray, wondering if GeoZarr shouldn't be too specific, if it can handle xarray dimensions, allow for NCZarr construct that will represent same. Big differences between NCZarr zarr-v2 implementation and zarr-v3
- Christophe: its opinionated if we say that GeoZarr should use all same approaches as CF, until now just using CF for not reinventing the wheel, but minimize the size of the specification
- Ethan: CF doesn't have a lot of requirements for metadata, advantage for allowing whatever CF is in the file and building on top of that, and having the pieces that are making it geozarr compliant. Lots of existing profiles of CF, WMO- for sounding data some example,
- Christophe: if we define something and its aligned with CF, that's a good approach
- Ryan: between v2 and v3 data model is not hugely changing, we shouldn't get too hung up on that. GeoZarr should specify the zarr model, how that is encoded, that's zarr job to manage. Doesn't need to go under the hood.
- Ethan: Not that CF has to be the end all, GeoZarr can reference CF data model and build off of that. Where does CF line up where does GeoZarr need to diverge.
- Tile Matrix https://github.com/zarr-developers/geozarr-spec/pull/44
- Compressor lit review
- From Christine last meeting "default zarr compression (blosc) is not a standard compression that netcdf has used so it's not available with NCZarr," however I am seeing [NCZarr Filter Support](https://docs.unidata.ucar.edu/netcdf-c/current/filters.html#filters_nczarr) a reference to blosc. Can Ethan confirm?
- https://ui.adsabs.harvard.edu/abs/2021AGUFMIN35D0418H/abstract
- Do we have to explicitly add zarr version specs to GeoZarr specs?
- i.e. [zarr-v2 arrays](https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html#arrays)
- This came to my mind when thinking of how to handle for example consolidated metadata, so we need to make some statement on compatibility of GeoZarr with zarr v2 and v3
- CN: I think so, OGC extension must align specific version (in particular on breaking changes). Zarr v3 is still under development, we should target v2 before v3 is released.
- Some test zarr stores so far have been using v2 consolidated metadata, need some examples of zarr v3 generated zarr stores
- CN: .zmetadata only concatenates metadata of all children, so even in v2, we can specify all out of it (and possibly already add indexes to speed up coordinate and variable discovery).
- Tuesday March 12 Coworking Hour! EST Time: 11:00 AM (UTC-5)
- AOB: consider shifting for a more worldwide time slot in April (EDT: 10AM, CEST: 4PM, UTC: 2PM) ?
- BRP: That is fine, I am working US West hours,
![image](https://hackmd.io/_uploads/rk1VkGIap.png)
- Interoperability issues with opening the example zarr stores in Julia
- What would we want to see from zarr sparse array support https://github.com/zarr-developers/zarr-specs/issues/245
- Tadd: You have a sparse array, and a command to write to zarr, and some codec would save it and read out of, workaround right now is a wrapper that would break up your sparse array in different parts. that's individually stored in zarr. Question: if anyone in this meet uses sparse arrays, what an interface of what they are looking for would be
- Ryan: zarr specifies on disk format, so we can imagine how to store sparse arrays, but to use sparse arrays effectively you need an in memory representation that allows you to query, most programming languages have a sparse array type, and once you have that sparse array you can use for useful things. Hash regridding between two different grids. Would like to compute this once and save it, and open it quickly. The stumbling block is figuring out what in memory would look like. Can agree on serialization but what are implementations going to do when seeing a sparse
- Sean: What is your primary analysis env? Deepak's prior art here https://ncar.github.io/esds/posts/2022/sparse-PFT-gridding/
- Tadd: A combo of zarr, xarray, and dask, depending on how big the problem is. For smaller problems, xarray. The hypersparse matrices we use are, similar to Ryan's, mapping matrices used for calculations, or we use sparse.COO
- Sean: with EO data, issues with sparse data cube problem. You have a storage problem as well
- Ryan: Proposal to CZI to implement sparse encoding in Zarr
- https://github.com/ivirshup/binsparse-python/ (the binsparse spec proposed in Python)
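The workaround Tadd described, splitting a sparse array into parts that are each stored as an ordinary array, can be sketched as below. This is a minimal coordinate-format (COO) decomposition in pure Python, with plain lists standing in for the Zarr arrays; the `to_coo` and `from_coo` helpers and the `weights` matrix are hypothetical, and the layout is not taken from the binsparse spec.

```python
def to_coo(dense):
    """Split a dense 2-D list into COO parts: row indices, column indices, values."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                values.append(value)
    shape = [len(dense), len(dense[0]) if dense else 0]
    # Each entry could be written as its own dense array (plus a shape attribute).
    return {"row": rows, "col": cols, "data": values, "shape": shape}


def from_coo(parts):
    """Rebuild the dense 2-D list from the COO parts."""
    n_rows, n_cols = parts["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in zip(parts["row"], parts["col"], parts["data"]):
        dense[i][j] = value
    return dense


# A tiny regridding-style weight matrix with two non-zero entries.
weights = [
    [0, 0, 0.5],
    [0, 0, 0],
    [0.25, 0, 0],
]
parts = to_coo(weights)
restored = from_coo(parts)
```

The serialization side is the easy part, as Ryan noted; the sketch deliberately sidesteps the harder question of which in-memory sparse type an implementation should hand back.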
### Action Items
- [ ] Felix opening new issue for Julia compatibility
- [ ] Ethan and Brianna tag-up on compression algorithm support
- [ ] Ethan and Brianna tag up on more explicit open questions for CF community
## Feb 21st, 2024
### Attendees
- Brianna Pagán (NASA)
- Ryan Abernathey (EarthMover)
- Ethan Davis (UCAR/NCAR Unidata)
- Amit Kapadia (Planet)
- Michelle Roby (Radiant Earth)
- Tadd Bindas (Penn State PhD Candidate)
- Christophe Noel (Spacebel)
- Kevin Sampson (NCAR, WRF-Hydro, WRF)
- Christine Smit (NASA)
- Colby Fisher
### Agenda
- Co-chairs for OGC sub group: Christophe and Brianna
- Repo updated with OGC formatting
- Zarr Sprint summaries (https://github.com/zarr-developers/geozarr-spec/issues/33)
- HTTP extension, traverzarr mock-up. File browsing. Kevin Booth, same time tomorrow if folks want to join that conversation via Radiant Earth/Source
- Rust object store to be able to query data, replacing fsspec.
- Chunk manifest/virtual concat
- Ryan: chunk-manifest: referencing/pointing to existing chunks from the Zarr metadata. Virtual-concat of Zarr arrays: stacked Zarr arrays exposed, similar to NcML, combining into one larger virtual object.
- Ethan: would love to share best practices with ncml.
- Ryan: folks are already kerchunking PBs of data and opening with zarr, but no spec.
- https://github.com/zarr-developers/zarr-specs/issues/288
- Amit: do people want improvements for kerchunking tiffs?
- Ryan: the issue with normal TIFFs, as opposed to COGs, is too many files; sharding can assist with this.
https://github.com/fsspec/kerchunk/issues/325
- Amit: high error rate in storage at a specific scale. More worried about whether the cloud service provider can keep up with the request rate.
- Ethan: errors coming from servers; OPeNDAP & co. have dealt a lot with this.
- GeoZarr Interoperability https://github.com/zarr-developers/geozarr-spec/blob/main/geozarr-interop-table.md
- Christine: a few open issues. On compression: the default Zarr compression (Blosc) is not a standard compression that netCDF has used, so it's not available with NCZarr; that impacts NCO, netcdf-python and Panoply. NCO cannot access things from S3. Panoply has a dev branch that can read Zarr stores.
- Ryan: there is no inherent or default compression for Zarr; there is one for zarr-python. This is just a downside of how pluggable Zarr is: there is no standards profile. In GeoZarr, we could state the minimum set of compression options that we aim to support. Make narrower recs.
- Christine: would be nice to target the default one in the zarr-python library.
- Ryan: Make a recommendation of min set of compression options that make it compliant.
- Brianna: it would be easier to get a list from netCDF/NCO/NCZarr etc. of compressions that work and use that for recommendations, rather than waiting for those tools to support Blosc. But it looks like some lit review of the current state is needed.
- Ryan: would look at conda forge netcdf
- GeoTiff -> GeoZarr PR! https://github.com/zarr-developers/geozarr-spec/pull/42
- Overviews need some coordination with Max/the tile matrix PR below?
- Tile Matrix PR: https://github.com/zarr-developers/geozarr-spec/pull/44
- Move away from consolidated metadata in the spec itself, so no zattrs
- Ryan: just want whatever is chosen to be in the spec, can keep it if folks want, wouldn't get rid of it. Before making a rec, understanding how it could impact existing tools. Push it at the spec level.
- Ethan: linked hierarchies can be fragile
- Ryan: both can be fragile, maybe not one better or worse
- Working session? Monday March 4, 10am-noon EST
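The chunk-manifest and virtual-concat ideas Ryan described in the sprint summary can be sketched as follows. A manifest maps each chunk key to a byte range in an existing file, and concatenation only renumbers chunk keys without touching any bytes. The manifest layout, file names and helper functions here are hypothetical and are not taken from the linked zarr-specs issue.

```python
# Hypothetical manifests: chunk key -> location of the chunk bytes in an existing file.
manifest_a = {
    "0": {"path": "file_a.nc", "offset": 0, "length": 4},
    "1": {"path": "file_a.nc", "offset": 4, "length": 4},
}
manifest_b = {
    "0": {"path": "file_b.nc", "offset": 2, "length": 4},
}


def read_chunk(manifest, key, files):
    """Resolve a chunk key to bytes by slicing the referenced file."""
    ref = manifest[key]
    blob = files[ref["path"]]
    return blob[ref["offset"] : ref["offset"] + ref["length"]]


def virtual_concat(manifests):
    """Stack 1-D chunked arrays end to end by renumbering their chunk keys.

    Assumes every chunk of every array is present in its manifest.
    """
    combined = {}
    shift = 0
    for manifest in manifests:
        for key, ref in manifest.items():
            combined[str(int(key) + shift)] = ref
        shift += len(manifest)
    return combined


# In-memory stand-ins for the archival files.
files = {"file_a.nc": b"AAAABBBB", "file_b.nc": b"xxCCCCyy"}

combined = virtual_concat([manifest_a, manifest_b])
last_chunk = read_chunk(combined, "2", files)
```

Nothing is copied when the arrays are combined, which is the property that makes this approach attractive for the petabytes of existing archival data mentioned above.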
### Action Items
- [ ] Ethan and Brianna to tag-up about compression issues.
- [X] Brianna schedule a follow up chat with Wietze and Max [Brianna made comments to open PR]
- [ ] Colby/Amit add input into PR-42
## Jan 24th, 2024
### Attendees
- Brianna Pagán
- Matt Hanson
- Michelle Roby
- Amit Kapadia
- Forrest Williams
- Kevin Booth
- Sean Harkins
- Ryan Abernathey
- Christophe Noel
- Ethan Davis
- Kevin Sampson
- Patricia Fricke
- Wietze Suijker
### Notes
- Charter approved November 2023, now an official SWG
- Currently waiting for the OGC to create a subgroup work environment which would allow us to nominate and elect chairs.
- Scott recommended using this meeting to get the list of nominees, then will be voted on
- Tentative; Christophe and Brianna as co-chairs
- Upcoming Zarr sprint with GeoZarr focus: https://lu.ma/Zarr-NYC
- Logistics
- Should we support virtual?
- Virtual as second class citizens?
- Matt H: don't waste time trying to combine in-person and virtual
- Brianna setting up a hub with example code/zarr stores
- What are the real blockers for some open issues?
- Understanding concerns with CF encoding of CRS https://github.com/zarr-developers/geozarr-spec/issues/20
- How to encode typical origin / offset coordinate variables in ZARR? https://github.com/zarr-developers/geozarr-spec/issues/17
- Ryan: Concrete outcome: be able to write raster data in Python and read it back in GDAL, and write it from GDAL and read it back in Python. Round trip for CRS. Not possible today because GDAL and the xarray/Python world have chosen different ways to represent CRS info.
- Matt: what about case in zarr where you don't have a valid crs, can still read it in gdal, we have gcp for every pixel, interesting exercise how to read zarr data that doesn't have crs assigned and reproject data.
- Ethan: Huge software ecosystem based around netcdf/cf making sure it works with netcdf implementation of zarr support would be a big win. Ethan is chair of OGC netcdf group organizing a meeting end of Feb, one agenda item is looking at geozarr. Dave Blodgett and Brianna will join for this discussion, can be a follow-up to zarr sprint.
- Christophe: GDAL 3.9 per OSGeo/gdal#9108 will be able to infer CRS in a Zarr dataset using a CF-1 grid_mapping variable (basically raw conversion of netCDF CF-1 to Zarr)
- Wietze: Played around with Zarr in QGIS, which works with the latest GDAL. It is quite slow because the data is large and it doesn't load the data efficiently; there is no pyramid concept in Zarr like in GeoTIFFs
- Max: Happy to get together virtually for half a day to work on the pyramiding component; the core decision is whether everything goes in the geozarr stac or in a separate ZEP. Meeting with Sanket next week
- Single entry point, can have POC by end of sprint
- Ryan: We don't want to be spec-first. Focus on making demos: something that didn't work on day one that now works on day two. There is a convention; biologists also use it heavily.
- Christophe: Yes, focus on POCs, but as a balance, let's not reinvent the wheel: with COG, the usual web map viewers (OpenLayers) typically provide BMPs for those formats. People experienced in this are welcome to share their knowledge here.
Some good starting points for standard "conventions" for pyramids that are well supported by map viewers:
- COG (GeoTiff Overviews: https://docs.ogc.org/is/21-026/21-026.html#_conformance_class_geotiff_overviews)
- Zoom Levels https://wiki.openstreetmap.org/wiki/Zoom_levels
- OGC WMTS : align with tiling service (typically what is implemented with COG) https://www.ogc.org/standard/wmts/
- Ryan: need to be realistic: we won't accomplish as much as we want, with only a few hours to actually code, so we should be very targeted. How shareable are some of these projects? Wanting to make pyramiding work in QGIS? Can more than one person work on that at a time? Get a candidate list of projects with level of difficulty and required skills, then rank and select 3-4.
- Ryan: do we need to pay someone with gdal expertise? Contact Evan? My inclination is to leave gdal out just based on who has RSVPed so far.
- Sean: writing CF compliant metadata, later verify it works with gdal
- Suggest divide into focus groups at the sprint to address
- Going back to the template:
```
As a [type of User], I need to [do something] with Zarr using [tool X]
```
- https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A?both#Let%E2%80%99s-brainstorm
- Sprint focus groups:
- Michelle/Brianna go through the use cases
- Max (virtual): pyramiding
- Ryan/Kevin: http browsable zarr
- Joe (virtual): v3 for zarr-python
- Bidirectional gdal?
- probably after sprint
- Integration of Zarr with STAC Catalogs https://github.com/zarr-developers/geozarr-spec/issues/32
- Ryan: would be awesome to have a translator between Zarr and STAC; we would have to populate some required attributes in the Zarr metadata, but that is easily done
- Christophe: The data store POC created they typically hold level-1 and 2 products which includes hierarchy like STAC, but without the product you may have different assets
- Ryan: in zarr there is no hidden metadata, all in json
- Sean: Curious from use case perspective, that use existing STAC cube metadata to configure zarr stores, I am against a full STAC based hierarchy.
- Ryan: Something we could do at the sprint: the idea of a zarr-http browsable extension. When you can't list the directories, this was solved with consolidated metadata, but that is not scalable and probably won't propagate to Zarr v3; instead we want links between nodes.
- Sean: what would transition look like for this? what would happen to older stores?
- Ryan: consolidated metadata is a v2 feature; once we start writing v3 in production, there will be a new extension. Consolidated metadata was originally a workaround for slow reads from cloud storage, but major improvements have been made since, and now it is only needed for unlistable stores. But now there are PBs of data with this hack. Fortunately, migrating Zarr data doesn't involve rewriting chunks; there is the option to migrate the data, or to expose the same data via both v2 and v3 metadata. Might be a pain, but it is not fundamentally expensive to rewrite JSON files.
- zarr-python is in flux, not put into zarr-v3. Might be released by the sprint or working off v3 feature branch where new things are living.
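Ryan's point that migration only means rewriting JSON, not chunks, can be sketched as below: a v2 `.zarray` document is translated into a v3-style `zarr.json` that keeps the v2 chunk key layout, so the existing chunk objects stay where they are. The field names follow my reading of the Zarr v3 core specification and should be checked against it; the `v2_array_to_v3` helper and the `DTYPES` table are hypothetical, and the sketch covers only C-order arrays with no filters and either no compressor or gzip.

```python
# Assumed mapping from a few little-endian v2 dtype strings to v3 data type names.
DTYPES = {"<i4": "int32", "<i8": "int64", "<f4": "float32", "<f8": "float64"}


def v2_array_to_v3(zarray, zattrs=None):
    """Translate a v2 .zarray (plus optional .zattrs) into a v3-style document."""
    if zarray.get("order", "C") != "C" or zarray.get("filters"):
        raise NotImplementedError("sketch covers C-order arrays without filters only")

    # v3 expresses encoding as a codec chain; raw little-endian bytes come first.
    codecs = [{"name": "bytes", "configuration": {"endian": "little"}}]
    compressor = zarray.get("compressor")
    if compressor is not None:
        if compressor.get("id") != "gzip":
            raise NotImplementedError("sketch only translates the gzip compressor")
        codecs.append({"name": "gzip", "configuration": {"level": compressor["level"]}})

    fill_value = zarray.get("fill_value")
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        "data_type": DTYPES[zarray["dtype"]],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        # Keeping the v2 key encoding means existing chunk objects are reused as-is.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        # v2 allows a null fill value; 0 is an assumed default for this sketch.
        "fill_value": 0 if fill_value is None else fill_value,
        "codecs": codecs,
        "attributes": dict(zattrs or {}),
    }


zarray = {
    "zarr_format": 2,
    "shape": [4, 4],
    "chunks": [2, 2],
    "dtype": "<f4",
    "compressor": {"id": "gzip", "level": 5},
    "fill_value": None,
    "order": "C",
    "filters": None,
}
v3 = v2_array_to_v3(zarray, {"units": "K"})
```

Writing the result next to the existing `.zarray` would be one way to expose the same chunks through both v2 and v3 metadata, as suggested above.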
### Action Items
- [ ] Open new issue on GitHub with proposed tasks, get community feedback by Wednesday Jan 31, reach out to the list of RSVPs for the Zarr sprint to ask which task people want to join
- [ ] Create a template for what the task plan would be, assign leader, write AC
- [ ] Create a template for the structure of the spec, those templates to collect after the sprint, will happen in these discussions