# Forward Stream Processing + Zarr Stores

Call-in information: Monday, March 20 · 2:00 – 3:00pm

Google Meet joining info
Video call link: https://meet.google.com/shq-zenj-xqm
Or dial: (US) +1 314-384-5292 PIN: 691 877 052#
More phone numbers: https://tel.meet/shq-zenj-xqm?pin=2854704778130

## Attendees

- Brianna Pagán (NASA GES DISC)
- Joe Hamman (Earthmover)
- Orestis Herodotou (Earthmover)
- David Auty (Raytheon, EED)
- Hailiang Zhang (NASA GES DISC)
- Christine Smit (NASA GES DISC)
- Dieu M. Nguyen (NASA GES DISC)
- Amy Steiker (NSIDC)
- Tom Augspurger (Microsoft)
- Patrick Quinn (NASA ESDIS)
- Matt Savoie (NSIDC / NASA ESDIS)
- Jeremy Maitin-Shepard (Zarr Community)
- Chris Durbin (EED)
- Martin Durant (Anaconda)

## Agenda

* Introductions + Meeting Context
  * Some background: https://github.com/zarr-developers/zarr-specs/issues/154
* NASA Use Case: Christine Smit
* Existing Best Practices / Solutions
  * Jeremy Maitin-Shepard, Zarr community, OCDBT (Optionally-cooperative distributed B+tree)
    * Performance driven: petabyte-scale stores with billions of chunks, GB/s throughput, etc.
    * You can find a high-level description of the format and its motivations here: https://docs.google.com/document/d/1PLfyjtCnfJRr-zcWSxKy-gxgHJHSZvJ2y4C3JEHRwkQ/edit?resourcekey=0-o0JdDnC44cJ0FfT8K6U2pw#heading=h.8g8ih69qb0v
    * Allows full versioning and snapshots; can be layered on top of S3 and underneath Zarr.
    * Christine: Would users have to have access to this database? Are there public libraries? How easy would it be for an end user to work with this?
    * Jeremy: Currently implemented only in TensorStore. Opening it is the same as opening a regular Zarr store; instead of saying the data is on S3, you pass a JSON object that describes the data source. It would be possible to support it in zarr-python as well.
    * A description of the current on-disk format is here: https://google.github.io/tensorstore/kvstore/ocdbt/index.html#storage-format
    * Jeremy: There is a simple solution if a reader only reads the metadata once.
      * Zarr uses fixed-size chunks, so Zarr would not look for extra data in a chunk beyond what the metadata describes.
      * Martin: Need to check on this.
      * Jeremy: Fairly confident.
    * Hailiang: The chunk will be overwritten; people are reading while we are writing.
    * Jeremy: Writes are atomic on S3 and on atomic file systems, so a reader sees either the old object or the new one.
    * Joe: There are still opportunities for inconsistent state. You are not going to get atomic updates across the whole store, just on a single key; there is no way to lock the whole thing at once in S3. There are also lots of open questions about what consolidated metadata will look like in the next spec.
    * Christine: Even if it works now, will it work in the future?
    * Jeremy: Unwritten regions will be padded with the fill value, so the data that is there is still readable. A future version might add support for non-resizable dimensions, but the current behavior will continue to be supported.
    * Hailiang: Based on the metadata, nothing prevents a reader from getting data that has already been appended. We are not doing any versioning of the objects; it is the exact same object being rewritten, so a reader can load into memory a mix of old and new data, with some chunks old and some new.
    * Christine: The reading application thinks the array stops earlier (it still has the old shape), so this would be prevented.
    * Martin: On load, even if the final piece is a full chunk, anything beyond the declared shape is dropped.
    * Joe: Do you only care about atomic updates, or also about time travel?
    * Christine: Time travel is not useful. (edit: Actually, I totally see the value of time travel, or at least versioning, but versioning has been treated a bit differently at EOSDIS (the NASA earth science archives). My imperfect understanding is that EOSDIS currently handles versioning at a high level. Generally "version" here refers to the high-level algorithm that is being run rather than to the files themselves or even the software version used to create those files. A new version comes out maybe every few years when there is a major algorithm change. Generally speaking, we get new granules (files) for the most recent version of a dataset as new data is produced from new observations. Every so often, however, we do replace files that have mistakes in them. The fixed files have new filenames. So, for reproducibility, a user would have to record the filenames of the files they used. This is not necessarily ideal, especially since I don't believe EOSDIS keeps the old/incorrect files around. For the tool that Hailiang, Dieu My, and I are working on, we generate zarr stores in order to have efficient services. Our zarr stores are not the archived version of the product. So we rely on lineage/provenance to point back to the archived version.)
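    * A minimal sketch of the append pattern discussed above (resize the metadata, then write the new chunks), assuming zarr-python with a local path standing in for the real S3 store; the array name, sizes, and chunking are invented for illustration:

      ```python
      import numpy as np
      import zarr

      # Hypothetical array: hourly global grids, one day per chunk.
      z = zarr.open(
          "forward_stream_demo.zarr",
          mode="a",
          shape=(0, 180, 360),
          chunks=(24, 180, 360),
          dtype="f4",
          fill_value=np.nan,
      )

      def append_day(arr, block):
          """Grow the time axis in the metadata, then write the new chunks.

          A reader that already loaded the old metadata never requests the new
          chunk keys; a reader that picks up the new shape before the chunk
          writes finish sees the fill value (NaN) until they land.
          """
          t0 = arr.shape[0]
          arr.resize((t0 + block.shape[0],) + arr.shape[1:])  # metadata-only update when growing
          arr[t0:t0 + block.shape[0]] = block                 # writes the new chunk objects

      append_day(z, np.random.rand(24, 180, 360).astype("f4"))
      ```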
Generally "version" here refers to the high level algorithm that is being run rather than to the files themselves or even the software version used to create those files. A new version comes out maybe every few years when there is a major algorithm change. Generally speaking, we get new granules (files) for the most recent version of a dataset as new data is produced from new observations. Every so often, however, we do replace files that have mistakes in them. The fixed files have new filenames. So, for reproducibility, a user would have to record the filenames of the files they used. This is not necessarily ideal, especially since I don't believe EOSDIS keeps the old/incorrect files around. For the tool that Hailiang, Dieu My, and I are working on, we generate zarr stores in order to have efficient services. Our zarr stores are not the archived version of the product. So we rely on lineage/provenance to point back to the archived version.) * Joe Hammam, Earthmover, https://gist.github.com/rabernat/112edcee71005d69e03ca835db3694fc * I have a few slides and/or a demo to show if it turns out to be useful for the conversation * Static versus dynamic intermediary. static solutions only use s3, dynamic maintain state in seperate db * Static: pro, simple no additional cloud service. con, more complex on client side, possible conflicts between writiers. all solutions experimental. * Dyanmic: pro,purpose build solutions. fewer changes required for clients. cons, overhead of running seperate service * Demo * Christine: question on performance. If you're going in there very frequently, and start changing chunks, you can end up with some version that has performance issues. * Joe: Definitely, more work needed. * Hailiang: Before we finish processing the whole time slices, that was a mistake, don't want to expose them yet. If updating just 1 year of 20 years, might not have committed previous change, array lake handles this. * Joe: additional version control utilities, like full version tagging which is just like in git, commit, and special commit which is a tag. can use those constructs. * Joe: no difference between append or insert in middle of array. * Hailiang: Real concern, we don't want public to see appending. static intermediate, they will see it no matter what. Second solution is more valid for dynamic. What kind of database? What costs? * Joe: can me Mongo, postgres, all are options, all built for scale, confident you can build a database that could handle NASA size data. * Joe: you can use a version of this with file reference system... * Hailiang: Before you commit and make it public, how do you make sure public doesn't see it. * Joe: Part of array lake, is access control, that's a feature in the platform. More generally, the keys might be in s3, but they are hashed and not available, client has no way of finding those keys. Client talking to service does not know these keys exist. * Hailiang: we have access control on those prefix * Demo notebook: https://gist.github.com/rabernat/112edcee71005d69e03ca835db3694fc * Martin Durant, Anaconda, kerchunk + iceberg = icechunk * Kerchunk + fsspect, not about versions of datasets, aggregating files not in zarr format * It's true for a dataset that is represented by kerchunk, store references (i.e. jsons) tell you what chunk lives in what absolute URL, paths of other files. You can in theory update any of these references, which is very much like the manifest Joe was showing. Basically the same thing. 
  * Tom Augspurger, Microsoft
    * Not a lot of experience with Zarr, more with TIFFs, but it might be relevant: experience running a service of STAC items. Whether it is that, or an Earthmover-style service for querying Zarr metadata, it is very doable as a public service: if you have a database holding the metadata and some web servers taking requests, it can be scaled and it can be done.
    * Theo (?) from the UK Met Office wrote a post about what they are doing with high-momentum datasets:
      * https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671
- Open Discussions/Questions

**Scoping questions (from Joe/Earthmover)**

- What are the requirements on the client side? Can we assume the client is using Python? Can we ask the client to install special packages? Or does the client expect spec-compliant HTTP / S3 access to the data?
- What are the security and auth requirements? Does the solution have to integrate with Earthdata Login? AWS IAM roles?
- What are the performance requirements? E.g., is S3 performance the upper/lower bar?
- How many reads / writes per second are expected? What are the throughput and latency requirements? How many S3 objects does NASA need to store? How many versions? How often will versions change? How many writers?

**Open Questions from Alexey**

- One thing I didn't see in the notes was a discussion of lower-level language APIs (e.g., C, Fortran). In the past, it seems like the go-to approach was to solve these kinds of problems once with a C/Fortran API; adding higher-level language support was then just a matter of adding lightweight wrappers around that API (while maintaining support for performance-critical applications like numerical modeling). A lot of the work for Zarr (and Zarr-adjacent things like kerchunk) seems to be focused on native implementations in Python -- that has its advantages, but it also means that support for any other language requires re-implementing the library from scratch every time.
- I hear this concern most from numerical modeling / data assimilation people, who do everything (reads and writes) through the NetCDF C/Fortran APIs. To some extent, I also hear murmurs of discontent from R, Julia, and Matlab users, where Zarr support is less mature (and fsspec support is absent altogether).

- Chat Notes
  - Dieu M. Nguyen, 2:27 PM
    - Back a little to the previous point on updating the metadata after all updating is done. I understand that if we're appending to the end, it sounds simple enough. What about if we need to update data in the middle? The metadata would still include what we are in the middle of updating, right? Just wondering if you think it'd still be as simple as appending.
  - Martin Durant, 2:28 PM
    - Zarr metadata tells you the shape and chunking, so there is no update to make when you write to the middle of a dataset.

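As a small illustration of Martin's answer above: overwriting a region in the middle of an existing Zarr array rewrites only the chunk objects that intersect that region, and the array metadata (shape, chunking, dtype) is untouched. This is a hedged sketch assuming zarr-python, with an invented local path, array size, and chunking; note that until all affected chunks are rewritten, concurrent readers may see a mix of old and new chunks for that region, which ties back to Hailiang's earlier concern.

```python
import numpy as np
import zarr

# Hypothetical array: 10 days of hourly global grids, one day per chunk.
z = zarr.open(
    "middle_update_demo.zarr",
    mode="a",
    shape=(240, 180, 360),
    chunks=(24, 180, 360),
    dtype="f4",
    fill_value=np.nan,
)

# Replace the fifth day (hours 96-120, aligned to a chunk boundary): only the
# single affected chunk object is rewritten; the array metadata does not change.
corrected_day = np.random.rand(24, 180, 360).astype("f4")
z[96:120] = corrected_day
```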