# Data TWG Breakout Ideas

## Comments from TWG session (Day 1)

- We need new leadership!
- Existing Mission Statement:

  > Our mission is to define a set of best practices around "cloud native weather and climate data." These practices include:
  >
  > - the creation of cloud-optimized datasets,
  > - cataloging cloud-based datasets, and
  > - user-facing services for interacting with these datasets and catalogs.
  >
  > As much as possible, this work should leverage existing standards and technologies, focusing innovation wherever cloud technology requires a departure from legacy practices.

- Lists all of the datasets managed by Pangeo
- How to upload data to the cloud in Zarr? (https://github.com/pangeo-data/pangeo-datastore/issues/8)
- Cataloging tools and standards (intake & STAC)
- Pangeo Datastore service
- What else falls into the mission statement? What else _should_ fall into the mission statement?
  - I/O Benchmarking?
  - Zarr on HPC (https://github.com/pangeo-data/pangeo/issues/659)

## Comments from TWG breakout (Day 2)

### Discussion notes:

- Discoverability of datasets that can be used on Pangeo is important
  - Can be solved with STAC and searchable catalogs (Element 84)
- Zarr seems to be the de facto standard in Pangeo, but there are too few benchmarking efforts to support that
  - We welcome benchmarking other formats, etc. (https://github.com/pangeo-data/benchmarking)
  - There are also low-level I/O benchmarking options using IOR (https://github.com/NCAR/ior)
- Data formats are not set in stone yet; we are always looking for and testing better formats
  - Can we get better cloud interoperability?
- BIG data is also being discussed (e.g., CMIP6 on the cloud)
- Big pain points for upload:
  - No single script works to upload all datasets (every dataset has its own personality)
  - Do we faithfully move data to the cloud or do we clean it up in the process?
    - Clean it up?
- Do we have sufficient standards?
  - CF convention has issues
  - No standards for file collections
- Volunteers to Lead:
  - Matthew Hanson (Element 84), Norman Barker (TileDB), Charles Blackmon-Luca (Columbia), Luke Madaus (Jupiter), Anderson Banihirwe (NCAR), Rich Signell (USGS)
- Desired Outcomes (near-term):
  - STAC Sprint Proposal (extension for model data / CMIP)
  - Benchmarking use cases (formats vs formats)
    - Find use cases from people here
  - Best practices for converting to Zarr (https://github.com/pangeo-data/pangeo-datastore/issues/8); chapter in Pangeo book; should also include chunking best practices
  - Discuss how to organize

---

### Should the mission (statement) of the WG change? If so, how?

**"Official" Mission Statement (change at will):**

Our mission is to define a set of best practices around "cloud native weather and climate data." These practices include:

- the creation of cloud-optimized datasets,
- cataloging cloud-based datasets, and
- user-facing services for interacting with these datasets and catalogs.

As much as possible, this work should leverage existing standards and technologies, focusing innovation wherever cloud technology requires a departure from legacy practices.

### Pangeo Datastore includes datasets that Pangeo manages... but should we link to other datasets that we _do not_ manage?

### What datasets (if any) are missing from the Pangeo Datastore?

### Are we ready to collect our "best practices" on how to upload data to the cloud?

**Sprint Idea:** Consolidate/summarize https://github.com/pangeo-data/pangeo-datastore/issues/8 and add it to the Pangeo website ("Pangeo & Data" and "Pangeo Data Catalog")

_We would need one or more volunteers._

### What data storage services are missing? What's missing from the Pangeo Datastore?

### What else falls into the mission statement? What else _should_ fall into the mission statement?

Some examples might be "I/O Benchmarking" (https://github.com/pangeo-data/benchmarking) and "Zarr on HPC" (https://github.com/pangeo-data/pangeo/issues/659).

### Ryan's Talk

https://speakerdeck.com/rabernat/pangeo-cloud-datastore-lightning-talk