Goals

  • Create an open, community-maintained resource that clearly communicates what not to do when building multi-dimensional data products

Location

Notes

  • The best practices for data cube production depend on how the data will be used. As an analogy, consider best practices for boat building. If the boat will be used to carry people from one side of a river to the other, you would choose very different design patterns than if it will be used to carry cargo across the ocean. Similarly, you will need to choose different design patterns when building a data cube to support sub-millisecond time-series visualization versus fully saturating a GPU when training a neural network. This guide will walk you through different design patterns and best practices for various use cases.

Commonalities

  • Follow metadata standards
  • Use a widely-supported format that meets the requirements in the CNG formats guide
  • Consider what languages/APIs/interfaces your users frequent
  • Use a widely supported compression algorithm
  • Consider how easy it'll be to harmonize your data with other sources
  • Be specific about no data values
  • Be specific about projection and datum
  • Be specific about cell definition
  • Plan for how you will catalog your datacubes
  • Consider precision and bitrounding (only store information you need)
  • Aim for a Goldilocks range of chunk sizes in high-latency environments (neither too small nor too large)
  • Don't anger your HPC sys-admin with too many files
  • Consider the anticipated costs of GET and HEAD requests when deciding how to split your data over files
  • Consider your risk model for adopting new technologies with additional helpful features, at the potential expense of stability and/or stress testing
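
To illustrate the precision and bitrounding item above, here is a minimal sketch of rounding float32 values to a fixed number of mantissa bits before compression. It uses only the Python standard library; the function name `bitround`, the choice of 7 kept bits, and the synthetic values are assumptions made for illustration, not recommendations. In practice you would more likely use a bitrounding filter provided by your storage library.

```python
import math
import random
import struct
import zlib


def bitround(value: float, keepbits: int) -> float:
    """Round a float32 value to `keepbits` explicit mantissa bits.

    Hypothetical helper for illustration. Rounds to nearest (ties to even)
    and zeroes the discarded mantissa bits so a lossless compressor can
    remove them.
    """
    if not 1 <= keepbits <= 23:
        raise ValueError("keepbits must be between 1 and 23 for float32")
    if not math.isfinite(value):
        return value  # leave NaN and infinities untouched

    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - keepbits  # number of trailing mantissa bits to discard
    if drop > 0:
        half = 1 << (drop - 1)
        # Add just under half a unit, plus the last kept bit (ties to even).
        bits += (half - 1) + ((bits >> drop) & 1)
        # Zero the discarded bits.
        bits &= (0xFFFFFFFF >> drop) << drop
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Synthetic data (assumption): 10,000 values in a temperature-like range.
random.seed(0)
values = [random.uniform(250.0, 310.0) for _ in range(10_000)]
rounded = [bitround(v, 7) for v in values]

# Compare compressed sizes of the raw and the bitrounded float32 buffers.
raw_size = len(zlib.compress(struct.pack(f"<{len(values)}f", *values)))
rounded_size = len(zlib.compress(struct.pack(f"<{len(rounded)}f", *rounded)))
print(f"raw: {raw_size} bytes, bitrounded: {rounded_size} bytes")
```

With `keepbits` mantissa bits retained, the relative rounding error is bounded by about 2^-(keepbits + 1), so the number of bits to keep should follow from the real precision of your measurements.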

Common gotchas with the Pangeo ecosystem

Rules of thumb to test

  • A compressed chunk should be larger than 1 MB and smaller than 100 MB for optimal reads.
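
A rough way to check a candidate chunk shape against this rule of thumb is sketched below. The helper names, the dtype size table, and the compression ratios are assumptions for illustration; the real ratio depends on your data and codec and should be measured. The sketch treats 1 MB as 10^6 bytes.

```python
from math import prod

# Bytes per element for a few common dtypes (illustrative subset).
DTYPE_BYTES = {"uint8": 1, "int16": 2, "float32": 4, "float64": 8}


def uncompressed_chunk_bytes(chunk_shape, dtype):
    """Size of one chunk in bytes before compression."""
    return prod(chunk_shape) * DTYPE_BYTES[dtype]


def within_rule_of_thumb(chunk_shape, dtype, compression_ratio,
                         lower=1e6, upper=100e6):
    """True if the estimated compressed chunk size is between 1 MB and 100 MB.

    `compression_ratio` is uncompressed size divided by compressed size;
    it is an assumed input here, not something this sketch measures.
    """
    compressed = uncompressed_chunk_bytes(chunk_shape, dtype) / compression_ratio
    return lower < compressed < upper


# A single 256 x 256 float32 tile is far too small at an assumed 2x ratio.
print(within_rule_of_thumb((1, 256, 256), "float32", compression_ratio=2))
# 100 time steps of 512 x 512 float32 lands inside the range at an assumed 4x ratio.
print(within_rule_of_thumb((100, 512, 512), "float32", compression_ratio=4))
```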

Recommendations for specific use-cases

Time series

  • Show an example of why churro-like chunking (long in time, narrow in space) provides faster access
  • Consider whether regional aggregation will be common
  • Decide how to handle variable-length months and years
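
One way to see why churro-like chunking helps time-series access is to count how many chunks a read has to touch. The sketch below does that arithmetic for a hypothetical cube; the array dimensions, the two chunk shapes, and the function name `chunks_touched` are all assumptions chosen for illustration. Each touched chunk typically costs at least one request plus a full decompression, so fewer chunks usually means faster reads.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a rectangular selection.

    `selection` is a sequence of (start, stop) index pairs, one per
    dimension, with `stop` exclusive. `chunk_shape` gives the chunk
    length along each dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk hit
        last = (stop - 1) // size    # index of the last chunk hit
        count *= last - first + 1
    return count


# Hypothetical cube: 3650 daily time steps on a 4096 x 4096 grid (time, y, x).
pancake = (1, 1024, 1024)   # one time step per chunk, wide in space
churro = (3650, 32, 32)     # full time axis per chunk, narrow in space

# Full time series at a single pixel.
pixel_series = ((0, 3650), (100, 101), (200, 201))
print(chunks_touched(pixel_series, pancake))  # 3650 chunks
print(chunks_touched(pixel_series, churro))   # 1 chunk

# Full spatial map at a single time step: the trade-off reverses.
single_map = ((0, 1), (0, 4096), (0, 4096))
print(chunks_touched(single_map, pancake))    # 16 chunks
print(chunks_touched(single_map, churro))     # 16384 chunks
```

The same count shows the cost of the opposite access pattern, which is why the right chunk shape depends on the dominant use case rather than on a single universal setting.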

Machine learning on spatial chips

Spatial mapping

References

Parking lot of ideas

  • Question from the EOPF Zarr webinar - "Do you have recommendations for chunking when working with (1) time-series of S2 and (2) an area covered by multiple adjacent tiles?"