The best practices for data cube production depend on how the data will be used. As an analogy, consider best practices for boat building. If the boat will carry people from one side of a river to the other, you would choose very different design patterns than if it will carry cargo across the ocean. Similarly, you will need to choose different design patterns when building a data cube to support sub-millisecond time-series visualization versus fully saturating a GPU while training a neural network. This guide will walk you through different design patterns and best practices for various use cases.
Commonalities
Follow established metadata standards (for example, the CF conventions and STAC)
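As a minimal sketch, assuming an xarray-based workflow, CF-style metadata can be attached directly to a variable's attributes (the variable and values here are hypothetical):

```python
import numpy as np
import xarray as xr

# Hypothetical 2 m air temperature cube.
da = xr.DataArray(
    np.zeros((2, 3, 4), dtype="float32"),
    dims=("time", "lat", "lon"),
    name="t2m",
)
# CF-style attributes; standard_name comes from the CF standard-name table.
da.attrs["standard_name"] = "air_temperature"
da.attrs["units"] = "K"
da.attrs["long_name"] = "2 m air temperature"
```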
Use a widely supported format that meets the requirements in the CNG formats guide
Consider which languages, APIs, and interfaces your users most frequently work with
Use a widely supported compression algorithm
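For Zarr stores written through xarray, the compressor can be set per variable at write time. A minimal sketch, assuming the Zarr v2 data model (under zarr-python 3 the encoding key becomes `compressors`); Blosc-wrapped Zstd is broadly implemented across the ecosystem:

```python
import numpy as np
import numcodecs
import xarray as xr

ds = xr.Dataset(
    {"air": (("time", "y", "x"), np.random.rand(10, 64, 64).astype("float32"))}
)
# Blosc-wrapped Zstd with byte shuffling: a widely supported default.
compressor = numcodecs.Blosc(cname="zstd", clevel=3, shuffle=numcodecs.Blosc.SHUFFLE)
ds.to_zarr("air.zarr", mode="w", encoding={"air": {"compressor": compressor}})
```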
Consider how easily your data can be harmonized with other sources
Be specific about nodata values
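In xarray, the sentinel can be declared explicitly through the `_FillValue` encoding rather than leaving readers to guess. A minimal sketch with a hypothetical NDVI variable and a -9999 sentinel:

```python
import numpy as np
import xarray as xr

ndvi = xr.DataArray(
    np.full((4, 4), -9999.0, dtype="float32"), dims=("y", "x"), name="ndvi"
)
# Record the sentinel explicitly so every reader masks the same value.
ndvi.to_dataset().to_zarr(
    "ndvi.zarr", mode="w", encoding={"ndvi": {"_FillValue": -9999.0}}
)
```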
Be specific about projection and datum
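With rioxarray, the CRS (which pins down both projection and datum) can be written onto the array itself. A minimal sketch; EPSG:32633 is a placeholder, not a recommendation:

```python
import numpy as np
import rioxarray  # noqa: F401  (registers the .rio accessor)
import xarray as xr

da = xr.DataArray(np.zeros((4, 4)), dims=("y", "x"))
# EPSG:32633 (WGS 84 / UTM zone 33N) stands in for your grid's actual CRS.
da = da.rio.write_crs("EPSG:32633")
print(da.rio.crs)
```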
Be specific about cell definition (do coordinates label cell centers or cell edges?)
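Getting this wrong silently shifts the entire grid by half a cell. A small illustration:

```python
import numpy as np

# Five edges bound four 10-degree cells spanning 0..40.
edges = np.arange(0.0, 41.0, 10.0)        # [ 0. 10. 20. 30. 40.]
centers = (edges[:-1] + edges[1:]) / 2    # [ 5. 15. 25. 35.]
# Storing edges where a reader expects centers misplaces every cell by half a width.
```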
Plan for how you will catalog your datacubes
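If STAC is your catalog of choice, a data cube can be registered as an Item whose asset points at the store. A minimal sketch with pystac; the ID, geometry, href, and media type are all placeholders to adapt to your catalog's conventions:

```python
from datetime import datetime, timezone

import pystac

item = pystac.Item(
    id="example-cube",
    geometry={
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
    bbox=[0, 0, 1, 1],
    datetime=datetime(2024, 1, 1, tzinfo=timezone.utc),
    properties={},
)
# Point the asset at the (hypothetical) Zarr store.
item.add_asset(
    "data",
    pystac.Asset(href="s3://bucket/example-cube.zarr",
                 media_type="application/vnd+zarr"),
)
print(item.to_dict()["assets"]["data"]["href"])
```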
Consider precision and bit rounding (only store the information you need)
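One way to do this in the Pangeo stack is numcodecs' `BitRound` filter, which zeroes low-order mantissa bits so they compress away. A minimal sketch; `keepbits=10` is a placeholder, and tools such as xbitinfo can estimate how many bits actually carry information for your variable:

```python
import numpy as np
import numcodecs
import xarray as xr

ds = xr.Dataset(
    {"sst": (("y", "x"), np.random.rand(64, 64).astype("float32"))}
)
# Round away low-order mantissa bits before compression; lossy but tunable.
ds.to_zarr(
    "sst.zarr", mode="w",
    encoding={"sst": {"filters": [numcodecs.BitRound(keepbits=10)]}},
)
```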
Aim for the Goldilocks range of chunk sizes in high-latency environments: large enough to amortize per-request latency, small enough that readers don't fetch far more than they need
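A quick way to sanity-check a layout is to compute the uncompressed chunk size. A sketch assuming a hypothetical one-year daily cube; rules of thumb in the Pangeo community often land in the tens of megabytes per chunk:

```python
import dask.array as da
import xarray as xr

# Lazy one-year daily cube; nothing is materialized in memory.
air = da.zeros((365, 2048, 2048), dtype="float32", chunks=(10, 1024, 1024))
ds = xr.Dataset({"air": (("time", "y", "x"), air)})

nbytes = 10 * 1024 * 1024 * air.dtype.itemsize
print(ds.air.data.chunksize, f"-> {nbytes / 2**20:.0f} MiB uncompressed per chunk")
```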
Don't anger your HPC sysadmin by creating too many files
Consider the anticipated costs of GET and HEAD requests when deciding how to split your data across files
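A back-of-the-envelope calculation makes the trade-off concrete; the numbers below are illustrative assumptions, not any provider's actual rates:

```python
# Illustrative request-cost estimate for one full scan of a cube.
total_bytes = 2 * 2**40          # 2 TiB dataset (assumed)
chunk_bytes = 32 * 2**20         # 32 MiB objects (assumed)
get_price = 0.40 / 1_000_000     # assumed dollars per GET request

n_requests = total_bytes // chunk_bytes
print(f"{n_requests:,} GETs -> ${n_requests * get_price:.2f} per full scan")
# Halving the object size doubles the request count (and cost) for the same scan.
```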
Consider your risk model when adopting new technologies whose additional helpful features may come at the expense of stability and/or thorough stress testing
Common gotchas with the Pangeo ecosystem
Using older versions of libraries and missing out on optimizations or bug fixes
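A quick first diagnostic is to print the versions actually installed in your environment; xarray ships a helper for exactly this:

```python
import xarray as xr

# Reports the versions of xarray and its key dependencies (zarr, dask,
# netCDF4, ...); compare against the libraries' release notes.
xr.show_versions()
```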
Question from the EOPF Zarr webinar - "Do you have recommendations for chunking when working with (1) time-series of S2 and (2) an area covered by multiple adjacent tiles?"
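The trade-off behind this question is well known: per-pixel time series favor chunks that are long in time and small in space, while single-date mosaics across adjacent tiles favor the opposite, and no single layout is optimal for both. A hedged sketch (the shapes are placeholders; 10980 px matches a Sentinel-2 10 m tile), with the caveat that in practice you would maintain two stores or use a tool such as the rechunker library rather than rechunking on the fly:

```python
import dask.array as da
import xarray as xr

# Hypothetical Sentinel-2-like red band: 200 acquisitions over one tile.
cube = xr.Dataset(
    {"b04": (("time", "y", "x"),
             da.zeros((200, 10980, 10980), dtype="uint16",
                      chunks=(200, 256, 256)))}
)
ts_friendly = cube                                             # fast per-pixel series
map_friendly = cube.chunk({"time": 1, "y": 2048, "x": 2048})   # fast single-date mosaics
```

For areas spanning multiple adjacent tiles, note that the chunk grid of a mosaicked cube need not align with the original tile boundaries; a continuous grid lets readers cross tile seams transparently.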