Data cube guide brainstorming

# Goals

- Create an open, community-maintained resource that clearly communicates what *not to do* when building multi-dimensional data products

## Location

- Cloud Native Geospatial "Datacube" guide?
    - `datacube.cloudnativegeo.org`?
- How does this relate to the new CNG education organization?
    - https://github.com/cng-education/curriculum

## Notes

- The *best* practices for data cube production will depend on how that data will be used. As an analogy, consider *best practices* for boat building. If that boat will be used to carry people from one side of a river to another, you would choose very different design patterns than if it will be used to carry cargo across the ocean. Similarly, you will need to choose different design patterns when building a data cube to support sub-millisecond time-series visualization versus fully saturating a GPU when training a neural network. This guide will walk you through different design patterns and best practices for various use-cases.

## Commonalities

- Follow metadata standards
- Use a widely-supported format that meets the requirements in the CNG formats guide
- Consider what languages/APIs/interfaces your users frequent
- Use a widely supported compression algorithm
- Consider how easy it'll be to harmonize your data with other sources
- Be specific about no-data values
- Be specific about projection and datum
- Be specific about cell definition
- Plan for how you will catalog your datacubes
- Consider precision and bitrounding (only store information you need)
- Aim for a Goldilocks range of chunk sizes (neither too small nor too large) in high-latency environments
- Don't anger your HPC sys-admin with too many files
- Consider the anticipated costs of GET and HEAD requests when deciding how to split your data over files
- Consider your risk model for adopting new technologies with additional helpful features, at the potential expense of stability and/or stress testing

## Common gotchas with the Pangeo ecosystem

- Using older versions of libraries and missing out on optimizations or bug fixes
- Using the default caching with fsspec (see https://tutorial.xarray.dev/intermediate/remote_data/remote-data.html)
- Using the default parameters with xarray `open_mfdataset`, `concat`, and `merge` (see https://github.com/pydata/xarray/issues/8778 and https://github.com/pydata/xarray/pull/10062)
- Concatenation leading to 1D coordinate variables with lots of tiny chunks

## Rules of thumb to test

- A compressed chunk should be between 1 MB and 100 MB for optimal reads.

## Recommendations for specific use-cases

### Time series

- show an example of why churro-like chunking provides faster access
- consider whether regional aggregation will be common
- how to handle variable-length months and years

### Machine learning on spatial chips

- lessons learned from https://github.com/pangeo-data/ncar-hackathon-xarray-on-gpus

### Spatial mapping

- lessons learned from https://github.com/developmentseed/warp-resample-profiling, https://github.com/developmentseed/tile-benchmarking
- COG/GDAL configurations

## References

- Earthmover blog posts
    - https://earthmover.io/blog
- https://ncar.github.io/dask-tutorial/notebooks/06-dask-chunking.html
- https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
- https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes
- [notes from Pangeo community discussion](https://docs.google.com/document/d/1BkL0arf1Lz6fHgVBEJNxKbFmVN8glNQXmDKdsuT0GcU/edit?tab=t.0#heading=h.mgt4rbxr710f)

## Parking lot of ideas

- Question from the EOPF Zarr webinar
    - "Do you have recommendations for chunking when working with (1) time-series of S2 and (2) an area covered by multiple adjacent tiles?"
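
One possible starting point for the chunking examples mentioned in the rules of thumb and the time-series notes is a back-of-the-envelope calculation. The sketch below uses only the Python standard library. The cube dimensions, the two chunk shapes, the `compression_ratio` parameter, and all function names are hypothetical and chosen for illustration. The 1 MB and 100 MB bounds are the untested rule of thumb from these notes, not an established limit. It counts how many chunks a read would touch under "pancake" chunking (one full map per time step) versus "churro" chunking (the full time axis over a small spatial footprint).

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size of one chunk in bytes."""
    return math.prod(chunk_shape) * itemsize


def in_goldilocks_range(chunk_shape, itemsize, compression_ratio=1.0,
                        min_bytes=1_000_000, max_bytes=100_000_000):
    """Check the rule of thumb: compressed chunk size strictly between the bounds.

    compression_ratio is an assumed value (uncompressed size / compressed size);
    measure it on real data rather than trusting the default of 1.0.
    """
    compressed = chunk_nbytes(chunk_shape, itemsize) / compression_ratio
    return min_bytes < compressed < max_bytes


def chunks_touched(selection_shape, chunk_shape):
    """Chunks read for a selection that starts on a chunk boundary."""
    return math.prod(
        math.ceil(sel / chunk) for sel, chunk in zip(selection_shape, chunk_shape)
    )


# Hypothetical cube with dimensions (time, y, x) = (3650, 1800, 3600), float32.
ITEMSIZE = 4
pancake = (1, 1800, 3600)   # one full map per time step
churro = (3650, 30, 60)     # full time axis, small spatial footprint

pixel_series = (3650, 1, 1)   # every time step at a single pixel
single_map = (1, 1800, 3600)  # one time step over the full domain

print(chunk_nbytes(pancake, ITEMSIZE))        # → 25920000
print(chunk_nbytes(churro, ITEMSIZE))         # → 26280000
print(chunks_touched(pixel_series, pancake))  # → 3650
print(chunks_touched(pixel_series, churro))   # → 1
print(chunks_touched(single_map, pancake))    # → 1
print(chunks_touched(single_map, churro))     # → 3600
```

Both chunk shapes are roughly 26 MB uncompressed, so either could pass the size check, yet the number of requests per read differs by three orders of magnitude depending on the access pattern. This is the trade-off the use-case sections would need to demonstrate with real timings.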