Towards a Unified EO Data Model (Christophe): This presentation will review core EO data system concepts: abstract data models, file formats, and encodings. It will explain how HDF, CF, and GDAL work, aiming to support a proposed meta-model as a bridge to Zarr.
Ryan: xarray is reconstructing the netCDF data model. xarray needed an explicit definition of a coordinate. It makes some assumptions: if a variable matches a dimension name, it is assumed to be a coordinate. Beyond that, grid mapping.
Matthias: OGC GeoDataCube is trending towards not being a spec, just referring to other specs, like Zarr. Reprocessing Sentinel into Zarr for ESA; zipped Zarr, we don't know yet. Concerns that this is the wrong direction.
Ryan: Getting Evan involved with GDAL. We've talked about doing this round-trip exercise for a while. Same pathway as the multi-dimensional or raster pathway in GDAL. Bypass xarray: open the GeoTIFF, manually gather metadata, use the Zarr API directly and just post it. Anyone that has a tool; specifically asking someone from GDAL.
Brianna: Propose test bed with Max's demos and coordinate with GeoDataCube
Let's push this via some deadline of CNG conference, work with DevSeed folks, look into OGC funding to pay someone like Evan looking at the roundtrip example.
Include the group to help with the scope of work.
Do we think geozarr as a single array will ever be a flavor? Limit of how big a COG could be, 100GB COGs.
Testbed call from OGC looking for funding for testbeds:
CNL: End of March, the funding will be available and the Call for Participation published (a personal contribution of around 50% is requested)
Emmanuel: Some updates on ESA Zarr EOPF (if time allows)
EOPF is an abstraction for Copernicus: a data model fitting the Sentinel datasets. On top of this sit different encodings: Zarr, SAFE. Zarr was selected for all missions because it can cover the abstractions. Will last until mid-2026. Still don't know how it will be encoded; room for research here. It will NOT be zipped Zarr; that is just a temporary workaround to deal with the very high number of files. A 1GB dataset producing 4k files is impossible to manage on object storage. In April, a service will emerge to play with sample data, available for one year, with user feedback and a JupyterLab available with Dask.
To achieve concrete objectives quickly, it is essential to proceed step by step and focus on the essentials to produce a first version of the specifications. A suggested approach is as follows for a first milestone:
Define a minimal convention for converting a base file ("1 band") from NetCDF/GDAL formats.
Define rules for handling files containing multiple bands
Specify conventions for encoding pyramiding (overviews)
Define conventions for supporting coordinates and GeoTransform (optional for the first version)
In further milestones: 3D+ arrays, GeoTransform optimisations, etc.
AOB
For information, this comparison table is available on the Copernicus page: EOPF Copernicus
Could be a way to get examples because RadiantEarth is submitting a proposal
Trying to make sure there's not redundant Zarr proposals
Where to put examples
Simple examples in GeoZarr repository
Small Python script
Setting of grid_mapping
One multiscale example (Web Mercator)
One custom TMS with simple downsampling
Discuss next steps next month (rio-cogeo for GeoZarr?)
Next time - decide on a different platform for chats? talk with CNG folks
December 4th, 2024
Attendees
Tyler Erickson
Piotr Zaborowski
Chris Little
Colby Fisher
Sai Cheemalapati
Agenda
?
Round of intros for attendees and sharing of slack channel, hackmd notes, github links and Roadmap/Contribution doc. Chris notes we should add workflows/pipelines to the examples/case studies section.
Adjourned
November 6th, 2024
Attendees
Tyler Erickson
Raphael Hagen
Piotr Zaborowski
Chris Little
Christophe Noël
Max Jones
Colby Fisher
Ryan Avery
Agenda
Actions
Agreed to create a dedicated wiki page on GitHub to track actions raised during the meeting. This page will ensure that all action items are recorded and accessible to team members.
Topics: Requested a link to any related discussions or specs for the various topics under the workplan.
Participation: Asked all interested members to update the topic list, indicate their interest, and specify the topics they will work on.
CF Mapping: Suggested that CF (Climate and Forecast) metadata mapping should be identified, including any minimum required properties. However, the use of multiple terminologies is encouraged and is a common practice in recent specifications.
GeoTransform Conclusions
Awaiting input from Brianna. Noted that conclusions should consider OGC (Open Geospatial Consortium) guidelines, and discussions are on hold.
OGC Tools (GoToMeetings)
Recording meetings and transcriptions are common practice, and the materials should be made available afterward.
Some members prefer open tools like Google Meet, although it lacks recording functionality. GoToMeeting may be used but only if it does not restrict attendance.
Resampling method
Max Jones presented warp resampling methods, comparing performance and memory usage across libraries. Key findings include that GDAL-based tools like Open Data Cube and Rioxarray are efficient for local datasets. Virtualizing NetCDF files as Zarr enhances performance, and web-optimized formats, especially those with overviews, dramatically reduce processing time. Future improvements should focus on supporting Zarr overviews, pre-generating weights, and optimizing Xarray imports, as these enhancements could significantly boost resampling efficiency.
AOB
Christophe asked if the issue with the number of objects generated by Zarr, which poses a cost barrier to its adoption by the Copernicus Data Space Ecosystem, is addressed by Zarr v3.
Note (after the meeting): the sharding codec can mitigate this problem. Read more here.
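To make the object-count concern concrete, here is a rough sketch of the arithmetic. The array shape, chunk shape, and shard layout are hypothetical values chosen for illustration, not figures from the Copernicus data, and the function only counts chunk or shard objects (metadata objects are ignored).

```python
import math


def count_objects(shape, chunk_shape, chunks_per_shard=None):
    """Count stored data objects for one array (metadata objects excluded)."""
    # Number of chunks along each axis, rounding up for partial edge chunks.
    chunks_per_axis = [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]
    if chunks_per_shard is None:
        # Without sharding, every chunk is its own object.
        return math.prod(chunks_per_axis)
    # With sharding, several chunks are packed into one shard object.
    shards_per_axis = [
        math.ceil(n / k) for n, k in zip(chunks_per_axis, chunks_per_shard)
    ]
    return math.prod(shards_per_axis)


# Hypothetical single-band array: 10980 x 10980 pixels in 512 x 512 chunks.
unsharded = count_objects((10980, 10980), (512, 512))
sharded = count_objects((10980, 10980), (512, 512), chunks_per_shard=(8, 8))
print(unsharded, sharded)  # 484 9
```

The chunk size seen by readers stays the same; only the number of objects in the store changes.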
Geo Transform
Objective: To link pixel coordinates to geographic coordinates in a condensed way
Relationship and complementarity with Map/Coverage standards (e.g. old and new APIs)? See this summary (OGC core axis-aligned ?)
August 28th, 2024
Brianna Pagan
Kevin Sampson
Ethan Davis
Felix Cremer
Saving of geo transform
Brianna showed a saved geozarr with geotransform
Chunk position information
What do you want to show to the user? The user should not think about chunking.
Martin: Need to store it, need to transform, and it needs to be implemented before you can demo. It's the tool's responsibility to describe the world. If you allow for different ways to describe the coordinates you don't need to feel locked in.
Brianna: to-do on me to create a zarr store with the suggested metadata described above in the pangeo forum.
lat_bounds/lon_bounds <> offset? AREA_OR_POINT for geotiff
[Alexey]: Here's the GeoTiff standard. The relevant discussion of whether pixels refer to corners or centers is, maybe, under the "Raster Space" heading: PixelIsArea raster space assumes the data lie between indices, while PixelIsPoint raster space means the data correspond exactly to the indices.
[Alexey]: gdalinfo on an example (cloud optimized) GeoTiff shows AREA_OR_POINT=Area (under Metadata:) for this.
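The practical consequence of the two raster spaces is a half-pixel shift. A minimal 1-D sketch, assuming `origin` is the coordinate that the transform assigns to raster index 0 (the function name and arguments are illustrative, not taken from any library):

```python
def pixel_center(origin, pixel_size, index, raster_type="Area"):
    """Return the coordinate of the centre of pixel `index` along one axis."""
    if raster_type == "Area":
        # PixelIsArea: index 0 is the outer edge of the first pixel,
        # so the centre sits half a pixel further in.
        return origin + (index + 0.5) * pixel_size
    if raster_type == "Point":
        # PixelIsPoint: index 0 is already the centre of the first pixel.
        return origin + index * pixel_size
    raise ValueError(f"unknown raster type: {raster_type!r}")


print(pixel_center(100.0, 10.0, 0, "Area"))   # 105.0
print(pixel_center(100.0, 10.0, 0, "Point"))  # 100.0
```

Whatever GeoZarr stores, a reader needs this flag (or an equivalent offset) to place coordinates correctly.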
For chunks, chunk reference index
This would allow going from implicit to explicit: each chunk refers to the original GeoTransform and uses the chunk reference index to calculate coordinates explicitly when needed
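One way to read the chunk reference index idea is sketched below. It assumes the GDAL six-element GeoTransform ordering (x origin, pixel width, row rotation, y origin, column rotation, pixel height) and a (row, col) chunk index; the function name is hypothetical.

```python
def chunk_geotransform(gt, chunk_index, chunk_shape):
    """Derive the GeoTransform of one chunk from the array-level GeoTransform.

    gt          -- GDAL-style tuple (x0, dx, rx, y0, ry, dy)
    chunk_index -- (chunk_row, chunk_col) position of the chunk in the grid
    chunk_shape -- (rows, cols) of a full chunk
    """
    row_off = chunk_index[0] * chunk_shape[0]
    col_off = chunk_index[1] * chunk_shape[1]
    # Apply the affine transform to the chunk's first pixel corner.
    x0 = gt[0] + col_off * gt[1] + row_off * gt[2]
    y0 = gt[3] + col_off * gt[4] + row_off * gt[5]
    # Pixel size and rotation terms are unchanged within the chunk.
    return (x0, gt[1], gt[2], y0, gt[4], gt[5])


array_gt = (500000.0, 10.0, 0.0, 4600000.0, 0.0, -10.0)
print(chunk_geotransform(array_gt, (1, 2), (512, 512)))
# (510240.0, 10.0, 0.0, 4594880.0, 0.0, -10.0)
```

Only the array-level transform needs to be stored; per-chunk origins are computed on demand.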
Alexey: Yes, to me the issue was always getting xarray/CF to understand affine transforms. Need to iron out preferred ways to define the position in a pixel, refer to GDAL. Might get some criticism about how to store CRS/WKT
[Alexey] Recommendation:
Explicit dictionary referring to EPSG; e.g., {"crs": {"epsg": 4326}}
Explicit dictionary including WKT (as a string?); e.g., {"crs": {"wkt": "<long multi-line WKT string...>"}}
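A small sketch of how a reader might handle both recommended forms. The attribute layout is the one written in the two lines above; the helper name `crs_label` and the placeholder WKT text are illustrative assumptions.

```python
import json


def crs_label(attrs):
    """Return a CRS description from either recommended attribute form."""
    crs = attrs.get("crs", {})
    if "epsg" in crs:
        # Compact form: an EPSG code.
        return f"EPSG:{int(crs['epsg'])}"
    if "wkt" in crs:
        # Verbose form: the WKT string is passed through unchanged.
        return crs["wkt"]
    raise ValueError("no recognised CRS entry in attributes")


epsg_attrs = json.loads('{"crs": {"epsg": 4326}}')
# Placeholder text stands in for a real WKT string.
wkt_attrs = json.loads(json.dumps({"crs": {"wkt": "<WKT string goes here>"}}))

print(crs_label(epsg_attrs))  # EPSG:4326
```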
Brianna: Martin's case below where a small GRIB file opening in xarray becomes huge trying to explicitly list out coordinates.
Ethan: Going GRIB to netcdf, converting to explicit lat/lon.
Alexey: This solution unlocks the ability to leverage the speed of raster subsetting for netcdf like data.
GDC SWG meetings will be started again in combination with the Testbed 20 GDC group. They'll start June 27th, 16:00 CEST and take place every week. Meeting links are on the OGC portal now.
Brianna will try to join these meetings to discuss overlap; had some preliminary discussions with Matthias Mohr
With regards to "file/data formats", the GDC work has been pretty unspecific (or let's say format agnostic), many people did actually export GeoTiffs in TB19, haven't seen a lot of netCDF
How to handle multiple resolutions / overviews?
[Alexey]: GeoTiff metadata includes Overviews: 1830x1830, 915x915, 458x458, 229x229 and possibly OVR_RESAMPLING_ALG=NEAREST
Felix: suggesting setting up a TMS server on top of zarr
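The overview sizes quoted above are consistent with repeated halving and rounding up. A short sketch, assuming a 3660 x 3660 base raster (the base size is not stated in the notes and is inferred from the first overview):

```python
import math


def overview_sizes(width, height, levels):
    """Sizes of power-of-two overviews, rounding partial pixels up."""
    sizes = []
    for level in range(1, levels + 1):
        factor = 2 ** level
        sizes.append((math.ceil(width / factor), math.ceil(height / factor)))
    return sizes


print(overview_sizes(3660, 3660, 4))
# [(1830, 1830), (915, 915), (458, 458), (229, 229)]
```

A multiscale convention for Zarr would need to record either these shapes or the rule that produces them.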
June 12th, 2024
Attendees
Brianna Pagán
Kevin Sampson
Colby Fisher
Chris Little
Felix Cremer
Emma Marshall
CNL: not available
Felix Cremer: Try to join, but will be 15 minutes late
Agenda
CNL: If there are no objections, I'd suggest merging it so we can start creating pull requests on specific topics and sections. This should help optimize improvements and collaboration.
Brianna reached out to Matthias Mohr, testbed 20 will start June 21st, next GDC SWG June 20th, 9:00am - 10:30am EDT, see https://portal.ogc.org/meet/
Deadline passed for testbed proposals, Brianna will see if/how GeoZarr can fit into any existing one
Question of how/if GDC supporting netcdf-like data
OGC Code sprint for the OGC APIs sitting in on the webinar
June 13, we will run the pre-event welcome webinar, for the Open Standards Code Sprint. The webinar will run from 14:00 to 15:00 UTC+1, and it will set the context for the upcoming code sprint, by presenting the overviews and sprint goals. If you are planning to attend the sprint next week, we highly recommend attending the webinar, as this information will not be repeated in the kick-off session.
The webinar will take place online, at the #Main Stage of the OGC Events Discord server. For logis
Storing affine_transform OR coords
Need more than just geotransform, need to know whether x/y is trying to represent center or elsewhere on the grid, so need to store offset
For us, we can store GeoTransform and the offset between large box and tiles
CL: GIS assumes it's an image, CF conventions say whether it's in the center or edge.
Organize notes and action items, keep Meeting Summary updated, clean up open issues so that people can async help better. (Brianna)
May 29th, 2024
Attendees
Christophe Noel
Christine Smit
Kevin Sampson
Emma Marshall (University of Utah)
Brianna Pagán
Max Jones
Colby Fisher
Ryan Abernathey
Chris Little
Martin Durant
Agenda
Christophe: created PR to discuss the table of contents and some core principles of the OGC spec: PR-47. Each topic discussed during meetings should have a PR within a section (+ zarr example)
Check all topics are covered
Check that it supports all relevant source formats
Check that it reconcile GDAL/CF ecosystems for the encoding of rasters (coordinates and projection encoding, n-d dimensions, etc.)
CL: If you look at WMO, its standards have the same status as ISO; there is a lot about coordinates in a non-GIS way, in a very meteorological way. Want to raise this as an issue: translate WMO to GIS terminology.
RA: To be successful, this needs to be a standard that's useful for EO satellite and climate/weather.
CL: The GIS community has an assumption that anything raster is numeric and 2D.
RA: If you take GDAL, it's quite flexible in how it handles coordinates. xarray can't do GIS-style referencing.
BP: Discussion about saying terms like 'regular' grid, 2D
CN: grid_mapping can include geotransform, raster 2D, in multi-dimensional array, is it still the right way to describe the projection, rasterio creates grid mapping within coverage variable which might be more than 2D
Christophe: A coverage (2D or more) generally has a corresponding affine transformation (projection) for the latitude/longitude dimensions. The projection can be described in the grid_mapping variable. The coordinates for each pixel might be encoded/described as labelled arrays (NetCDF), origin offset (GDAL), vectors.
I propose using the conformance class (property conformsTo in the coordinate variable of the GeoZarr model) to indicate which type of encoding is used.
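To illustrate the relationship between the two encodings, the sketch below expands an origin-offset description (GDAL-style GeoTransform) into labelled coordinate arrays (NetCDF style). It assumes a north-up raster without rotation and labels pixel centres; the function name is illustrative.

```python
def coords_from_geotransform(gt, width, height):
    """Expand a north-up GDAL-style GeoTransform into pixel-centre coordinates."""
    if gt[2] != 0 or gt[4] != 0:
        raise ValueError("rotated grids need 2-D coordinate arrays")
    # Add half a pixel so that each label refers to the pixel centre.
    xs = [gt[0] + (col + 0.5) * gt[1] for col in range(width)]
    ys = [gt[3] + (row + 0.5) * gt[5] for row in range(height)]
    return xs, ys


xs, ys = coords_from_geotransform((0.0, 1.0, 0.0, 10.0, 0.0, -1.0), 3, 2)
print(xs)  # [0.5, 1.5, 2.5]
print(ys)  # [9.5, 8.5]
```

The transform is six numbers regardless of raster size, while the labelled arrays grow with the grid.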
Brianna: For spec want to propose the ability to check whether data has affine_transform OR coords. Original thought was some type of flag to distinguish between the two (Max Jones and I met last week and had the same independent idea), but if range index ability is supported in xarray, this can be the same field with different metadata associated with it
Until then, modified validator to fit within pydantic-zarr
Max looking into how abstracting would work for pyramiding
MD: affine is not the only one, but good start. Astro coord system is capable of doing all the things we talked about, what it doesn't have is the set of complex coordinate systems, need an extensible way
RA: GDAL has affine geotransform, aim to support this, don't need to go beyond at this point.
MD: aim for extension mechanism for other people to bring solutions. Affine is straightforward
RA: Need a new custom index. The problem is when you go to write this: the presence of explicit coords will trump; once you write data with explicit coords, it will use those rather than the transform.
MJ: Broader than just how to write: if you read it and write it with rounding errors, coords are less precise. People are also manipulating data, so that affine transform is no longer accurate; the minute you do isel, propagating it as a fixed attribute is not going to work.
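A 1-D sketch of why a fixed attribute breaks under positional selection: after slicing, both the origin and the step of the transform change. The function is hypothetical and handles positive strides only.

```python
def slice_transform_1d(origin, pixel_size, length, index_slice):
    """Return (origin, pixel_size, length) of an axis after integer slicing."""
    start, stop, stride = index_slice.indices(length)
    if stride <= 0:
        raise ValueError("only positive strides are handled in this sketch")
    new_length = len(range(start, stop, stride))
    # The origin moves to the first selected pixel; the step scales with stride.
    return origin + start * pixel_size, pixel_size * stride, new_length


original = (100.0, 10.0, 20)
selected = slice_transform_1d(*original, slice(5, None, 2))
print(selected)  # (150.0, 20.0, 8)
```

An attribute copied unchanged from the source array would still say origin 100.0 and step 10.0, which no longer describes the selected data.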
RA: How does xarray deal with time, encoded and decoded state?
_ARRAY_DIMENSIONS is not zarr but comes from xarray to make zarr look netcdfish
Zarr lets you put whatever you want into these attributes
zarr v3 has dimensions built in
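The difference can be sketched with plain dictionaries standing in for the metadata documents. The metadata below is abbreviated (other required fields are omitted), and the helper name is an assumption for illustration.

```python
def get_dimension_names(array_meta, attrs=None):
    """Look up dimension names from abbreviated Zarr v2 or v3 metadata."""
    if array_meta.get("zarr_format") == 3:
        # Zarr v3: dimension names are part of the array metadata itself.
        return array_meta.get("dimension_names")
    # Zarr v2: xarray stores them as a user attribute.
    return (attrs or {}).get("_ARRAY_DIMENSIONS")


# Abbreviated stand-ins for .zarray/.zattrs (v2) and zarr.json (v3).
v2_zarray = {"zarr_format": 2, "shape": [10, 20]}
v2_zattrs = {"_ARRAY_DIMENSIONS": ["y", "x"]}
v3_zarr_json = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [10, 20],
    "dimension_names": ["y", "x"],
}

print(get_dimension_names(v2_zarray, v2_zattrs))  # ['y', 'x']
print(get_dimension_names(v3_zarr_json))          # ['y', 'x']
```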
Decide on which level the validator sets on
do not open the raw json but use the zarr package
We could operate on a netcdf data model
We don't have to validate that it is a netcdf but look at the attributes and dimensions
Whatever we agree on in the validator should be also included in the spec
Current is in the details and it needs to be either the zarr data model or the netcdf or CF data model and not go into the details of the encoding
What is the goal of GeoZarr and how does it fit in with the existing efforts?
CS: CF is a very complex standard and we wouldn't want to write a new CF validator
ED: 90 % of CF data uses a very basic subset of the CF standard
if you build a validator you have to look at the weird cases
CF validator will fail for the planet data
Mix between TIFF and NetCDF world
We have two opposite objectives
Wide range of data sizes
facilitate a wide range of clients
Do we want to start with basic examples and then open this to a wide range of conventions?
Or do we want to be compliant with certain conventions like CF?
Dataset contains coordinates, variables
Start with dataset then you can open it with xarray
Start with a minimal subset of CF conventions
Make up the standard that we need to make this happen
GDAL already has a spec to make this possible
Can we open that on the python side and do something with it?
That happens in rioxarray
This conversion is not reversible and we would need to see which information is needed to make it reversible
Explicit versus affine transformation
Make the software do what it needs to do and get the spec out of that
Do we want to be the meeting ground to find this out.
Should this be more an implementation meeting?
There are other conventions that build on CF that specify which parts of CF are required
We need to decide what we want to do
Whether we want to include geotiff like data in the first iteration
Delivering something is important; specs are not updated very often
Maybe we should rather take our time to confront the root issue
If a library says it supports geozarr, does it need to support both data models?
Enumerate the different tools
There is python and gdal which cover 90% of usage
Does GDAL understand both?
If GDAL is doing both, then the only thing we would get is on the python side
You can't really write that spec starting with data from xarray
Interoperability means round-tripping the data without losing information
Is it obvious how GDAL does it and how to translate that to zarr?
Is there any reason to not take this and implement it in python?
Attrs are not removed in V3 only how it is saved on disk is changed
Progress on conforming to OGC template (Christophe/Brianna?)
Multi-scale PR updates?
Tried implementing on julia issues with
Difference in webmapping world and geotiff overview model of aggregating data, not sure we want to mix this, if you want a tile matrix set, there is another layer that needs to be specific on top
Serving up a TMS on top of a zarr might be a different tooling than just overviews of data
April 17th, 2024
CNL: not available today (still plan to bootstrap the OGC template with core definitions - during April)
Attendees
Brianna Pagan (NASA GES DISC)
Felix Cremer (MPI BGC Julia programmer)
Christine Smit (NASA GES DISC)
Martin Durant (Anaconda)
Kevin Sampson
Doug Newman (NASA ESDIS)
Ryan Abernathey
Colby Fisher
Agenda
Ethan and Brianna tag-up on compression algorithm support
Martin: GRIB file is <1000 bytes, but coordinates are always added; load it with xarray and it will be over 100MB in memory.
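The order of magnitude is easy to reproduce. The sketch below assumes a hypothetical 3000 x 3000 grid with explicit 2-D latitude and longitude arrays stored as float64; the grid size is not from the notes.

```python
def explicit_coordinate_bytes(ny, nx, itemsize=8):
    """Memory for explicit 2-D latitude and longitude arrays."""
    # Two arrays (lat and lon), one value per grid cell each.
    return 2 * ny * nx * itemsize


def analytic_coordinate_bytes(n_parameters=6, itemsize=8):
    """Memory for an analytic description, e.g. six affine parameters."""
    return n_parameters * itemsize


print(explicit_coordinate_bytes(3000, 3000))  # 144000000
print(analytic_coordinate_bytes())            # 48
```

The explicit form costs about 144 MB for this grid, against a few dozen bytes for the analytic description.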
Felix: is there anything julia ecosystem can assist/learn from this
Ryan: We need a concrete proposal, every time something simple is suggested, many responses as to why it's more complicated, feel stuck. https://gdal.org/user/raster_data_model.html
Martin: implement something like affine transform, and this is the implementation and a way from going from standard tags in geotiff to your explicit implementation would go a long way.
Ryan: what would success look like, write code that we want to work, then ask where should it be implemented. Ultimately new index type in xarray, but need to define success within the group. Keep analytical not float info. Preserve analytic coordinates
load data from geotiff save to zarr and load it again and save to geotiff and the coordinates should be the same
Felix: idea is to save it as GDAL saves it
Martin: how gdal defines attributes is fine
Ryan: forget geo, just 1-D, defined analytically, A->B, shouldn't need to save every coordinate. Don't have to treat it as data.
Felix: in Julia you can use whatever array as dimension, not sure how these are read/saved
Martin: not language issue, a library problem, astronomers don't have this problem although they have analytic. Need POC
Christine: dimensions matter when you're trying to query. when would a tool use this information, you need a function to decide when you need indexes
Martin: Yes, Logical to analytical indexes function is needed
Christine: 1) i just want to open the file, i want xarray to do the right thing. 2) actual implementation people who want to get more in the weeds
Brianna: it's easy enough A->B, but when we add the geospatial, that's where the convo gets blocked
Ryan: PROJ does this, the difficult part is with serialization: how can we tell that the coordinate is present, how do we identify it, and is that interoperable for non-python, non-xarray software? Xarray developers need to show that I can create an xarray dataset that has this type of analytic coordinate system and query it. After that we can tackle encoding
Felix: we have this in julia, if we save to zarr just an array as integer, just a vector, why we need to talk about serialization
Ryan: can we create a 1D xarray, save to zarr, open in julia, get a properly encoded, save in zarr, pass it back and forth.
Felix: how would you save it.
Ryan: For a range it's start, stop, number of points, plus the metadata variants you want to save, e.g. whether it is for the center of the pixel. Start, stop, or offset/scale; you need to know how many points, which is already known if it's describing another array. Encode these floating point numbers in a lossless way. In zarr you can put them in metadata or in another array; another array is probably not needed, but would encode them in an optimal way. If putting them into metadata as JSON, you want to store the full bytes rather than a text-based representation of the number.
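A minimal sketch of the range idea, assuming a made-up metadata layout: start and stop are packed as little-endian float64 bytes and base64-encoded so they survive JSON unchanged. The key name `start_stop_f64le_b64` is hypothetical and not part of any spec.

```python
import base64
import json
import struct


def encode_range(start, stop, num):
    """Store start/stop as raw float64 bytes (base64) plus a point count."""
    raw = struct.pack("<dd", start, stop)
    return {
        "start_stop_f64le_b64": base64.b64encode(raw).decode("ascii"),
        "num": num,
    }


def decode_range(meta):
    """Recover (start, stop, num) bit-for-bit from the metadata entry."""
    raw = base64.b64decode(meta["start_stop_f64le_b64"])
    start, stop = struct.unpack("<dd", raw)
    return start, stop, meta["num"]


def expand(start, stop, num):
    """Materialise the coordinate values only when they are needed."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Round trip through JSON text, as the metadata would be on disk.
meta = json.loads(json.dumps(encode_range(0.1, 0.7, 4)))
start, stop, num = decode_range(meta)
values = expand(start, stop, num)
```

The three stored numbers describe the whole axis; the explicit values are derived, never saved.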
Christine: advantage of a separate array: can take the CF approach of describing it in the metadata with additional attributes
Ryan: push as much encoding as possible into zarr, virtualizarr. If xarray sees a variable already opened by zarr that has units of days since some day, that triggers it to treat it as time and decode.
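The trigger described here can be sketched as follows. This is a simplified stand-in for CF time decoding: it assumes the standard calendar, handles only "days since" units with an ISO-formatted reference date, and is not xarray's implementation.

```python
from datetime import datetime, timedelta


def decode_cf_days(values, units):
    """Decode numeric offsets with CF-style 'days since <date>' units."""
    prefix = "days since "
    if not units.startswith(prefix):
        raise ValueError("this sketch only handles 'days since' units")
    # The reference date is parsed from the units attribute.
    epoch = datetime.fromisoformat(units[len(prefix):])
    return [epoch + timedelta(days=v) for v in values]


decoded = decode_cf_days([0, 1.5], "days since 2000-01-01")
print(decoded[1])  # 2000-01-02 12:00:00
```

The encoded state (numbers plus a units attribute) is what lives on disk; the decoded state exists only in memory.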
Christine: time is more of a pain.
Ryan: motivated here to figure out decode index, first step is in xarray dev supporters
Felix: trying to build and save to disk, not exactly sure how tile matrix set is going to work, if we have some dataset, would we be able to add
Set up dedicated agenda item
April 3rd, 2024
Attendees
Brianna Pagán
Ryan Abernathey
Ethan Davis
Tadd Bindas
Max Jones
Anthony Cak
Agenda
New branch for conforming issue #34 with OGC template (Brianna)
Ethan: for CF just need groups, arrays and attributes. An extension of Zarr changes the encoding, whereas a convention is just an extra metadata that is visible to any zarr. CF is completely visible to anything that understands netcdf, whereas an extension you need not just zarr, but you need the zarr extension
Ryan: Still in the process of figuring it out for zarr. We can define some conventions; extensions are things that require changes or augmentations to the core data model. If an extension is not understood, you cannot decode the data. An example that needs to be an extension is variable-size chunks: if you try to read the zarr data and your implementation doesn't know about them, it cannot decode. Conventions should still be operable under vanilla zarr; multi-scale is an example, and people use this. xarray came up with its own convention for putting convention names. This group should try to do everything through conventions, cross-post with OGC. Don't get into the data model layer, like how JSONs are structured.
In memory information and serialized information, needs to go somewhere in metadata
Could create custom xarray index that understands
Ryan will make a post on pangeo discourse, discuss possible solutions, implementing an xarray custom index that supports this type of coordinate system
Tadd: lazy coordinate system?
Ryan: maybe implicit, lazy implies there is data just not loading yet. Lazy concept is useful but slightly different
Tony: This is my exact use case, i have errors trying to create a zarr store using xarray from a large
Adapt definitions, in a more agnostic way and map to zarr model, will be good start to adapt
Writing this with OGC template
Check that OGC templates are auto converting to pdfs successfully
Ethan and Kevin as reviewers
CN: Maximize interoperability of the format: there is a tradeoff between the datasets encoded in zarr and maximizing the tools. Recommendation versus requirement classes: do we allow a zarr to have multiple datasets, or a dataset with children datasets? This complicates how tools read the files. Have a requirement class which says this geozarr is complex, or the opposite, this geozarr is a simple dataset that maps to a native format
CS: CF didn't accept group structure, recently added
Compressor lit review (open action item, still need to tag up)
From Christine last meeting "default zarr compression (blosc) is not a standard compression that netcdf has used so it's not available with NCZarr," however I am seeing NCZarr Filter Support a reference to blosc. Can Ethan confirm?
Brianna to try to adapt issue-34 as PR using OGC template, Ethan and Kevin (kmsampson) to review
Brianna scheduling alternative bi-weekly coworking session to develop a tool that checks if a zarr store is compliant with existing specification
March 6th, 2024
Attendees
Max Jones
Michelle Roby
Ethan Davis
Brianna Pagán
Christophe Noël
Tadd Bindas
Ryan Abernathey
Felix Cremer
Lars Barring
Sean Harkins
Agenda
Presentation to OGC netCDF SWG last week (Ethan)
Specifying the Organizational Structure of GeoZarr (title edited) #34 (Christophe)
Brianna: this is the same point i bring up below in agenda for how to handle zarr_format
Christophe: never added the mapping between geozarr and zarr, always has been implied, but we can follow NCZarr approach of how this is mapped.
Ethan: I found the current spec confusing: spelling out dataset, data array, etc., there are some parts that are CF and other parts that seem to replace CF. If you look at the CF data model, it has a lot of detail on how CF works with these kinds of things. Would be good for the netCDF OGC SWG to have more people from the CF world look at this and clarify how much CF is used. In terms of NCZarr and xarray, wondering if GeoZarr shouldn't be too specific: if it can handle xarray dimensions, allow for the NCZarr construct that represents the same. Big differences between the NCZarr zarr-v2 implementation and zarr-v3
Christophe: It's opinionated if we say that GeoZarr should use all the same approaches as CF; until now we have just been using CF to avoid reinventing the wheel while minimizing the size of the specification
Ethan: CF doesn't have a lot of requirements for metadata. There is an advantage in allowing whatever CF is in the file, building on top of that, and adding the pieces that make it GeoZarr compliant. Lots of existing profiles of CF; WMO for sounding data is one example.
Christophe: if we define something and its aligned with CF, that's a good approach
Ryan: between v2 and v3 data model is not hugely changing, we shouldn't get too hung up on that. GeoZarr should specify the zarr model, how that is encoded, that's zarr job to manage. Doesn't need to go under the hood.
Ethan: Not that CF has to be the end all, GeoZarr can reference CF data model and build off of that. Where does CF line up where does GeoZarr need to diverge.
From Christine last meeting "default zarr compression (blosc) is not a standard compression that netcdf has used so it's not available with NCZarr," however I am seeing NCZarr Filter Support a reference to blosc. Can Ethan confirm?
This came to my mind when thinking of how to handle, for example, consolidated metadata, so we need to make some statement on compatibility of GeoZarr with zarr v2 and v3
CN: I think so, an OGC extension must align with a specific version (in particular on breaking changes). Zarr v3 is still under development, we should target v2 before v3 is released.
Some test zarr stores so far have been using v2 consolidated metadata, need some examples of zarr v3 generated zarr stores
CN: .zmetadata only concatenates metadata of all children, so even in v2, we can specify all out of it (and possibly already add indexes to speed up coordinate and variable discovery).
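The concatenation can be sketched with a dictionary standing in for a Zarr v2 store. The store contents are made-up examples; the output follows the v2 consolidated metadata layout (a `metadata` mapping keyed by the original metadata paths).

```python
import json

METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


def consolidate(store):
    """Gather every metadata document of a v2 store into one mapping."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        # Keep only metadata documents; chunk keys are skipped.
        if key.rsplit("/", 1)[-1] in METADATA_NAMES
    }
    return {"zarr_consolidated_format": 1, "metadata": metadata}


# Toy store: keys are object names, values are their (text) contents.
store = {
    ".zgroup": '{"zarr_format": 2}',
    "temperature/.zarray": '{"zarr_format": 2, "shape": [10, 20], "chunks": [10, 20]}',
    "temperature/.zattrs": '{"_ARRAY_DIMENSIONS": ["y", "x"]}',
    "temperature/0.0": "<chunk bytes>",
}
zmetadata = consolidate(store)
print(sorted(zmetadata["metadata"]))
# ['.zgroup', 'temperature/.zarray', 'temperature/.zattrs']
```

Because nothing is added beyond the children's own metadata, a GeoZarr rule can be written against the individual documents and still hold for the consolidated file.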
Tuesday March 12 Coworking Hour! EST Time: 11:00 AM (UTC-5)
AOB: consider shifting to a more worldwide time slot in April (EDT: 10AM, CEST: 4PM, UTC: 2PM)?
BRP: That is fine, I am working US West hours,
Interoperability issues with opening the example zarr stores in Julia
Tadd: You have a sparse array, and a command to write to zarr, and some codec would save it and read out of, workaround right now is a wrapper that would break up your sparse array in different parts. that's individually stored in zarr. Question: if anyone in this meet uses sparse arrays, what an interface of what they are looking for would be
Ryan: zarr specifies the on-disk format, so we can imagine how to store sparse arrays, but to use sparse effectively you need an in-memory representation that allows you to query. Most programming languages have a sparse array type, and once you have that sparse array you can use it for useful things. Hash regridding between two different grids. Would like to compute this once and save it, and open it quickly. The stumbling block is figuring out what the in-memory representation would look like. Can agree on serialization, but what are implementations going to do when seeing a sparse array?
Tadd: A combo of zarr, xarray, dask, depending on how big the problem is. Smaller problems xarray; the hypersparse matrices we use, similar to Ryan, are some mapping matrix used for calculations, or using sparse.COO
Sean: with EO data, issues with sparse data cube problem. You have a storage problem as well
Ryan: Proposal to czi to implement sparse encoding in zarr
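As a reference point for the serialization side of this discussion, a coordinate-list (COO) layout reduces a sparse matrix to a few dense pieces that could each be stored as an ordinary array. The sketch below uses plain lists and hypothetical function names; it says nothing about what the in-memory type should be.

```python
def to_coo(dense):
    """Convert a 2-D list of numbers into a COO-style dictionary."""
    coords, data = [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                # Only non-zero entries are recorded.
                coords.append((i, j))
                data.append(value)
    shape = (len(dense), len(dense[0]) if dense else 0)
    return {"shape": shape, "coords": coords, "data": data}


def from_coo(coo):
    """Rebuild the dense 2-D list from the COO-style dictionary."""
    rows, cols = coo["shape"]
    dense = [[0] * cols for _ in range(rows)]
    for (i, j), value in zip(coo["coords"], coo["data"]):
        dense[i][j] = value
    return dense


weights = [[0, 0, 0.5], [0, 0, 0], [0.25, 0, 0]]
coo = to_coo(weights)
print(coo["coords"], coo["data"])  # [(0, 2), (2, 0)] [0.5, 0.25]
```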
HTTP extension, traverzarr mock-up. File browsing. Kevin Booth, same time tomorrow if folks want to join that conversation via Radiant Earth/Source
Rust object store to be able to query data, replacing fsspec.
Chunk manifest/virtual concat
Ryan: chunk-manifest, referencing/pointing to existing chunks from the zarr metadata. virtual-concat of zarr arrays, stacked zarr arrays exposed, similar to ncml, combining into one larger virtual object.
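Both ideas can be sketched with dictionaries. The manifest layout (chunk key mapped to path, offset, and length) and the file paths are illustrative assumptions, since no spec existed at the time of this discussion.

```python
def shift_manifest(manifest, offset, axis=0):
    """Shift every chunk key of a manifest by `offset` chunks along `axis`."""
    shifted = {}
    for key, ref in manifest.items():
        index = [int(part) for part in key.split(".")]
        index[axis] += offset
        shifted[".".join(str(i) for i in index)] = ref
    return shifted


def virtual_concat(first, second, first_chunks_along_axis, axis=0):
    """Combine two manifests into one larger virtual array, without copying data."""
    combined = dict(first)
    combined.update(shift_manifest(second, first_chunks_along_axis, axis))
    return combined


# Hypothetical byte ranges inside two existing files.
file_a = {
    "0.0": {"path": "s3://example-bucket/a.nc", "offset": 1000, "length": 500},
    "1.0": {"path": "s3://example-bucket/a.nc", "offset": 1500, "length": 500},
}
file_b = {
    "0.0": {"path": "s3://example-bucket/b.nc", "offset": 1000, "length": 500},
}
combined = virtual_concat(file_a, file_b, first_chunks_along_axis=2)
print(sorted(combined))  # ['0.0', '1.0', '2.0']
```

Only the references are rewritten; the chunk bytes stay where they are.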
Ethan: would love to share best practices with ncml.
Ryan: folks are already kerchunking PBs of data and opening with zarr, but no spec.
Christine: a few open issues, compression, default zarr compression (blosc) is not a standard compression that netcdf has used so it's not available with NCZarr, that impacts NCO, netcdf-python and panoply. For NCO, cannot access things from S3. Panoply has a dev branch that can read zarr stores.
Ryan: no inherent or default compression for zarr, there is one for python-zarr. This is just a downside of how pluggable zarrs are, there are no standards profile. If in geozarr, we state the min set of compression options that aimed to support. Make narrower recs.
Christine: would be nice to target the default one in the zarr-python library.
Ryan: Make a recommendation of min set of compression options that make it compliant.
Brianna: it would be easier to get a list from netcdf/NCO/NCZarr etc of compressions that work and have that for recommendations, rather than waiting for those tools for blosc. But looks like there needs to be some lit review over current.
Move away from consolidated metadata in the spec itself, so no zattrs
Ryan: just want whatever is chosen to be in the spec, can keep it if folks want, wouldn't get rid of it. Before making a rec, understanding how it could impact existing tools. Push it at the spec level.
Ethan: linked hierarchies can be fragile
Ryan: both can be fragile, maybe not one better or worse
Working session? Monday March 4, 10am-noon EST
Action Items
Ethan and Brianna to tag-up about compression issues.
Brianna schedule a follow up chat with Wietze and Max [Brianna made comments to open PR]
Colby/Amit add input into PR-42
Jan 24th, 2024
Attendees
Brianna Pagán
Matt Hanson
Michelle Roby
Amit Kapadia
Forrest Williams
Kevin Booth
Sean Harkins
Ryan Abernathey
Christophe Noel
Ethan Davis
Kevin Sampson
Patricia Fricke
Wietze Suijker
Notes
Charter approved November 2023, now an official SWG
Currently waiting for the OGC to create a subgroup work environment which would allow us to nominate and elect chairs.
Scott recommended using this meeting to get the list of nominees, then will be voted on
Ryan: Concrete outcome: be able to write raster data in Python, read it back in GDAL, and write it from GDAL, read it back in Python. Round trip for CRS. Not possible today because the GDAL and xarray/Python worlds have chosen different ways to represent CRS info.
Matt: what about case in zarr where you don't have a valid crs, can still read it in gdal, we have gcp for every pixel, interesting exercise how to read zarr data that doesn't have crs assigned and reproject data.
Ethan: Huge software ecosystem based around netcdf/cf making sure it works with netcdf implementation of zarr support would be a big win. Ethan is chair of OGC netcdf group organizing a meeting end of Feb, one agenda item is looking at geozarr. Dave Blodgett and Brianna will join for this discussion, can be a follow-up to zarr sprint.
Christophe: GDAL 3.9 per OSGeo/gdal#9108 will be able to infer CRS in a Zarr dataset using a CF-1 grid_mapping variable (basically raw conversion of netCDF CF-1 to Zarr)
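For reference, the CF-1 pattern links a data variable to its CRS through a `grid_mapping` attribute that names another variable. The sketch below uses nested dictionaries as a stand-in for a dataset; the variable names and the placeholder WKT are assumptions, and the lookup is not GDAL's implementation.

```python
def find_grid_mapping(dataset, variable_name):
    """Follow a variable's grid_mapping attribute to the CRS attributes."""
    name = dataset[variable_name]["attrs"].get("grid_mapping")
    if name is None:
        return None
    # The grid mapping variable carries the CRS description in its attributes.
    return dataset[name]["attrs"]


dataset = {
    "reflectance": {
        "dims": ("y", "x"),
        "attrs": {"grid_mapping": "spatial_ref"},
    },
    # Scalar variable whose only purpose is to hold CRS attributes.
    "spatial_ref": {
        "dims": (),
        "attrs": {
            "grid_mapping_name": "latitude_longitude",
            "crs_wkt": "<WKT string for the CRS>",
        },
    },
}

crs_attrs = find_grid_mapping(dataset, "reflectance")
print(crs_attrs["grid_mapping_name"])  # latitude_longitude
```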
Wietze: Played around with zarr in QGIS, which works with the latest gdal; quite slow because the data is large, doesn't load the data efficiently, no pyramid concept in zarr like geotiffs
Max: Happy to get together virtually for half a day to work on the pyramiding component; core decision is whether everything goes in the geozarr stac or a separate ZEP. Meeting with Sanket next week
Single entry point, can have POC by end of sprint
Ryan: We don't want to be spec-first. Focus on making demos: something that didn't work on day one that now works on day two. There is a convention, biologists also use it heavily.
Christophe: Yes, focus on POC, but to balance, do not reinvent the wheel with COG; usual webmap viewers such as OpenLayers typically provide BMPs for those formats. If people experienced in this can share their knowledge here. Some good start for standard "conventions" for pyramids well supported by map viewers:
Ryan: need to be realistic, we won't accomplish as much as we want only a few hours to actually code and we should be very targeted. How shareable are some of these projects. Wanting to make pyramiding work in QGIS? Can more than one person work on that at a time? Get a candidate list of projects, what level of difficulty, what skills, rank and select 3-4.
Ryan: do we need to pay someone with gdal expertise? Contact Evan? My inclination is to leave gdal out just based on who has RSVPed so far.
Sean: writing CF compliant metadata, later verify it works with gdal
Suggest divide into focus groups at the sprint to address
Going back to the template:
As a [type of User], I need to [do something] with Zarr using [tool X]
Ryan: would be awesome to have translator between ZARR and STAC, would have to populate some required attributes in zarr metadata but easily
Christophe: The data store POC created they typically hold level-1 and 2 products which includes hierarchy like STAC, but without the product you may have different assets
Ryan: in zarr there is no hidden metadata, all in json
Sean: Curious from use case perspective, that use existing STAC cube metadata to configure zarr stores, I am against a full STAC based hierarchy.
Ryan: Something we could do at the sprint: the idea of a zarr-http browsable extension. Can't list the directories; solved this with consolidated metadata, but it is not scalable and probably won't propagate to zarr v3; instead we want links between nodes.
Sean: what would transition look like for this? what would happen to older stores?
Ryan: Consolidated metadata is a v2 feature; once we start writing v3 in production, new extension. Consolidated metadata was originally a workaround that solved slow reads from the cloud, but major improvements have since been made, and now it is only a solution for unlistable stores. But now there are PBs of data with this hack. Fortunately, migrating zarr data doesn't involve rewriting chunks: there is the option to migrate data, or to have the same data exposed via v2 and v3 metadata. Might be a pain, but it is not fundamentally expensive to rewrite JSONs.
zarr-python is in flux, not put into zarr-v3. Might be released by the sprint or working off v3 feature branch where new things are living.
Action Items
Open new issue on github with proposed tasks, get community feedback by Wednesday Jan 31, reach out to list of RSVPs for zarr sprint for which task people want to join
Create a template for what the task plan would be, assign leader, write AC
Create a template for the structure of the spec, those templates to collect after the sprint, will happen in these discussions