# Zarr data manipulation strategies: quality and performance

## Zarr's Architecture Review

Zarr is designed as a cloud-native format for storing multi-dimensional **array** datasets. Its performance and usability depend heavily on how the data is organized and manipulated *before and during writing*. Manipulation strategies impact:

- Query performance
- Interoperability
- API responsiveness
- Developer & user experience

Its architecture is built around three fundamental concepts: chunked storage, hierarchical organization, and metadata management.

## Zarr example

Consider a Zarr array of shape (`T`, `Y`, `X`), i.e. (`time`, `lat`, `lon`):

- `T` = number of time steps (e.g. days, hours, years)
- `Y` = number of latitude points (e.g. grid cells North-South)
- `X` = number of longitude points (e.g. grid cells East-West)

### Grid Example (like ERA5 or CMIP6)

Suppose we have:

| Dimension | Meaning | Count |
|:-----------:|:----------------------------------:|:-----:|
| `time` | Daily data for 10 years | 3650 |
| `lat` | Grid every 0.25° from -90 to +90 | 721 |
| `lon` | Grid every 0.25° from -180 to +180 | 1440 |

So the full data shape is `(3650, 721, 1440)`, i.e. **~3.79 billion** values in total.

### Pluvial flooding [WIP] example

| Dimension | Meaning | Count |
|:-------------:|:---------------------------------------:|:-----:|
| `scenario` | Climate/emissions (e.g. SSPs) | 4 |
| `return_period` | Statistical return period (e.g. 100-yr) | 1 |
| `year` | Future years (e.g. 2030, 2040, ...) | 6 |
| `percentile` | Uncertainty quantile (e.g. P10/P50/P90) | 3 |
| `y` | 30m-resolution rows | 6000 |
| `x` | 30m-resolution cols | 6000 |

Shape: `(4, 1, 6, 3, 6000, 6000)`
Chunks: `(4, 1, 6, 3, 100, 100)`
Total values: **~2.59 billion** per array

### Hail [WIP] example

| Dimension | Meaning | Count |
|:----------:|:----------------------------------:|:-----:|
| `percentile` | Uncertainty quantile (P10/P50/P90) | 3 |
| `year` | Projection years (e.g. 2040/2080) | 2 |
| `latitude` | Grid cells (20° to 60°) | 161 |
| `longitude` | Grid cells (230° to 300°) | 281 |
| `scenario` | CMIP6 emission pathways | 3 |

Shape: `(3, 2, 161, 281, 3)`
Chunks: `(2, 2, 81, 281, 1)`

### Chunked Storage

Chunking is critical: it affects lazy loading, parallel access, and cloud costs.

- Data is divided into small, manageable pieces called chunks
- Each chunk can be read and written independently

> [*Disclaimer*: the Zarr performance guide recommends chunks of at least 1 MB (uncompressed) for larger arrays.](https://zarr.readthedocs.io/en/latest/user-guide/performance.html)

#### Chunking optimization strategy example

The strategy depends on the access pattern, so the question is **how users will access the data**.

> In our case: how the metrics-api service will manage *user* requests and query the data.
> Users typically pin **one location** (lat/lon) and query over:
> - a range of years (e.g. 2020–2080),
> - a range of percentiles (e.g. P10, P50, P90),
> - or scenarios (e.g. SSP126 vs SSP585).

Let's ask ourselves the right questions to find the optimal trade-off between performance, cost, and flexibility.

| Scenario | Example Question | Implication |
|:-----------------:|:---------------------------------------------------------------:|:----------------------------------------------:|
| Time series | "Will users query values over time at a **single lat/lon**?" | Prefer temporal chunking `[T, 1, 1]` |
| **Regional maps** | "Will users retrieve full maps for a year or percentile?" | Prefer spatial chunking `[1, Y, X]` |
| Ensemble analysis | "Do we need fast access to **different scenarios or percentiles?**" | Include scenario/percentile in fast dimensions |
| Bulk reprocessing | "Will pipelines scan entire datasets (e.g. AI training)?" | Fewer, larger chunks to reduce open/read ops |

##### Temporal chunks: Efficient for time series
Example:

- Shape: `(90, 180, 360)`
  - 90 days
  - 180 latitude points (1° grid)
  - 360 longitude points (1° grid)

```
Chunk shape: [time, lat, lon] = [90, 1, 1]

Time axis
    ↓
┌───────┐
│ day 0 │ ← one value per day at a fixed (lat, lon)
├───────┤
│ day 1 │
├───────┤
│  ...  │
└───────┘
(90 values stacked per chunk)
```

- Efficient for time series queries (e.g. temperatures at a single location)
- Chunks are tiny (`[90, 1, 1]` is 360 B for float32); grouping small spatial blocks (e.g. `[90, 10, 10]`) moves closer to the ≥1 MB guidance
- Works well with compression (e.g. Blosc)

##### Spatial chunks: Efficient for regional analysis

Spatial chunking optimizes data access for regional analysis and mapping operations. This strategy is especially effective when working with large geospatial datasets where queries often target specific areas rather than full time series.

Example:

- Shape: `(90, 180, 360)`
  - 90 days
  - 180 latitude points (1° grid)
  - 360 longitude points (1° grid)

```
Chunk shape: [time, lat, lon] = [1, 180, 360]

One full [180 x 360] map per time step:

┌────────────────────────────────────────┐
│ day 0: full [180 x 360] map            │
├────────────────────────────────────────┤
│ day 1: ...                             │
└────────────────────────────────────────┘
(90 map chunks stacked along time)

Smaller spatial tiles (e.g. [1, 60, 120]) are an alternative:

  Latitude ↑
┌────────────┬────────────┬────────────┐
│  chunk 1   │  chunk 2   │  chunk 3   │
├────────────┼────────────┼────────────┤   → Longitude →
│  chunk 4   │  chunk 5   │  chunk 6   │
├────────────┼────────────┼────────────┤
│  chunk 7   │  chunk 8   │  chunk 9   │
└────────────┴────────────┴────────────┘
```

- One time step per chunk enables fast full-map reads
- Enables efficient spatial subsetting for regional queries
- Balances chunk size to optimize memory and I/O
- Compatible with compression schemes to improve storage efficiency

| Strategy | Use Case | Chunk Shape Example | Benefits |
|:---------------:|:-------------------:|:-------------------:|:-------------------------------:|
| Temporal | Time series queries | `[90, 1, 1]` | Fast time slicing, small memory |
| Spatial | Regional queries | `[1, 180, 360]` | Fast spatial subset, good I/O |
| Spatio-temporal | Mixed queries | `[10, 90, 180]` | Balanced performance |

### Our use cases (hail and flooding)

#### Flooding (Pluvial): optimized for location + year/percentile

| Dimension | Meaning | Count | Suggested Chunk |
|:-------------:|:-----------------------------:|:-----:|:-------------------:|
| `scenario` | Emission pathways (SSPs) | 4 | 1–2 |
| `return_period` | Flood frequency | 1 | 1 |
| `year` | Future projection years | 6 | 2 |
| `percentile` | Uncertainty bounds (P10/P50/P90) | 3 | 3 |
| `y` | Grid rows (30m resolution) | 6000 | 1 ← location pinned |
| `x` | Grid columns (30m resolution) | 6000 | 1 ← location pinned |

Chunk shape: `[scenario, return_period, year, percentile, y, x]` = `[1, 1, 2, 3, 1, 1]`

```
(x, y) pinned at 1 pixel
┌─────────────────────────────┐
│ scenario      = SSP245      │
│ return_period = 100-yr      │
│                             │
│ year = [2030, 2040, 2050]   │ → chunked
│ percentile = [P10, P50, P90]│ → chunked
└─────────────────────────────┘
```

Efficient for:

- "Give me flood depth at (x, y) from 2030–2080"
- "Return all percentiles for this pixel"
- Small spatial chunks → reduced read I/O
- Blosc compression handles small blocks well

#### Hail

| Dimension | Meaning | Count | Suggested Chunk |
|:----------:|:-------------------------------:|:-----:|:-------------------:|
| `percentile` | P10, P50, P90 | 3 | 3 |
| `year` | Projection years (e.g. 2040, 2080) | 2 | 2 |
| `latitude` | 20°–60° (0.25° resolution) | 161 | 1 ← location pinned |
| `longitude` | 230°–300° (0.25° resolution) | 281 | 1 ← location pinned |
| `scenario` | Emission pathways | 3 | 1–3 |

### Hierarchical Organization

- Data is organized using stores and groups
- Stores define where data physically lives
- Groups provide logical organization of related data

#### Hierarchy optimization strategy example

Organize logically related variables (arrays) into structured Zarr groups.

Example: `zarr://hazards/hail/era5/temperature.zarr`

```
Group: hail/
├── metadata.json
├── era5/
│   ├── temperature/
│   │   ├── .zarray
│   │   ├── 0.0.0
│   │   └── ...
│   └── reflectivity/
│       ├── .zarray
│       └── ...
└── gfs/
    └── ...
```
```
/hail                  (Zarr group)
│
├── era5               (group)
│   ├── temperature    (array)
│   └── reflectivity   (array)
│
└── gfs                (group)
```

- Easy to navigate by model/source/variable
- Works like a filesystem
- Allows independent chunking and compression per array

### Metadata Management

- Metadata is embedded at every level of the hierarchy
- Each array and group has its own metadata file
- This enables lazy loading and efficient navigation

#### Metadata optimization strategy example

Example: store standardized metadata as `.zattrs` following CF conventions. Typical `.zattrs` fields for a variable:

```
long_name:     a descriptive name for the variable
units:         physical units (e.g. mm, K, 1)
standard_name: controlled vocabulary describing the physical meaning
coordinates:   names of the spatial/temporal axes (e.g. lat, lon, time)
grid_mapping:  the CRS or projection
missing_value: fill value
```

```json
{
  "long_name": "2m air temperature",
  "units": "K",
  "standard_name": "air_temperature",
  "coordinates": "time lat lon",
  "grid_mapping": "crs",
  "missing_value": -9999
}
```

```
Array: temperature/
├── .zarray
├── .zattrs   ← describes meaning, units, dimensions
├── 0.0.0     ← chunk files
├── ...
```

- Lets tools like xarray, CDO, and netCDF libraries read the metadata
- Improves searchability and interoperability
- Supports semantic data validation

## Key Strategies per Data Flow Stage (Overview)

### Data Generation Stage

- **Key Technical Checks**
  - Coordinate systems: ensure spatial consistency, prevent errors
  - Data types: avoid type-related bugs, optimize storage
  - Missing values: early gap detection, supports quality control
- **Why It Matters**
  - Prevents costly downstream fixes
  - Enables reliable processing
  - Supports data quality monitoring

### Data Ingestion Stage

- **Chunk Optimization**
  - Enables efficient parallel processing
  - Optimizes storage & retrieval patterns
  - Supports scalable access
- **Azure Blob Storage**
  - Cloud scalable & distributed
  - High-performance I/O
- **Metadata Management**
  - Efficient data discovery & organization
  - Enables lineage tracking
- **Why It Matters**
  - Directly impacts query speed & scalability
  - Enables efficient navigation & management

### Data Reading Stage

- **Spatial** reading: by geo-coordinates, chunk spatially
- **Time series** reading: by time intervals, chunk temporally
- **Variable** reading: chunk by variable for selective access
- Benefits
  - Reduces memory footprint
  - Enables handling huge datasets
  - Prevents crashes & improves resource use

### Data Processing Stage

- **CRS Transformations**
  - Ensures spatial consistency
  - Supports integration of diverse data sources
  - Accurate spatial calculations
- **Performance Optimization**
  - Improves computational efficiency
  - Reduces processing overhead
  - Enhances responsiveness

### Data Serving Stage

- **API Access**
  - Standardized data retrieval
  - Supports distributed apps
  - Consistent query interface
- **Location-Based Access**
  - Optimizes data retrieval regionally
  - Reduces latency
  - Supports global distribution

> Well-organized `.zattrs` and chunk-level metadata enable APIs to quickly locate and read the right chunks without scanning entire datasets.
## Team trigger questions and analysis on-the-fly

### Primary Access Pattern

- Do most queries focus on time, scenario, and percentile ranges at one fixed location (y, x pinned)?
- Do most of them focus on a specific dimension/range (e.g. all scenarios, a range of years)?
- How often do we expect requests for full spatial maps (all y & x) for specific years or scenarios?
- Are there use cases where bulk scans over all locations and dimensions (e.g. for AI training or analytics) will soon be frequent?

### Dimensional Prioritization

- Given that location is pinned, should chunking optimize for fast access across the time, scenario, and percentile dimensions?
- Should the spatial dimensions y and x always be chunked at size 1 (single pixel)?
- When spatial chunks are bigger (e.g. for bulk access), how do we balance performance?

### API Design Implications

- Should the API prioritize rapid responses to "all variables at one location" queries?
- How should caching and parallel reads be designed given these access patterns?
- Async! (depends on the Zarr version; zarr-python 3 adds an async API)

### Chunking Strategy Rubric

| Question | Yes | No | Unsure | Recommended Chunking Strategy |
|--------------------------------------------------------------|:---:|:--:|:------:|------------------------------------------------------|
| Queries mostly request data for a single lat/lon location | x | | | Spatial dims chunked at size 1 (pin location) |
| Queries request ranges of scenarios, years, percentiles | x | | | Chunk scenario/year/percentile dims for fast slicing |
| Bulk or full-region queries are rare now | x | | | Use small spatial chunks, moderate temporal chunks |
| Future bulk or regional queries are anticipated | | x | x | Consider larger spatial chunks for easier regional access |
| Spatial chunk size (y, x) = 1 is preferred | x | | | Optimized for single-pixel access |

### Hierarchy Strategy Rubric

| Question | Yes | No | Unsure | Recommended Hierarchy Depth & Structure |
|--------------------------------------------------------------|:---:|:--:|:------:|------------------------------------------------------|
| Easy navigation by scenario, year, percentile per location | x | | | Medium depth: group by dataset → scenario → variable |
| Many related variables within groups | | x | x | Avoid overly deep hierarchy to prevent navigation overhead |

### Metadata Strategy Rubric

| Question | Yes | No | Unsure | Metadata Recommendations |
|-------------------------------------------------------------|:---:|:--:|:------:|--------------------------------------------------------|
| Store detailed metadata at array and group levels | x | | | Embed `.zattrs` with CF conventions for all arrays |
| Metadata helps fast filtering by scenario/year/percentile | x | | | Use chunk-level metadata to speed API queries |
| Metadata size should be balanced to avoid query slowdowns | | x | x | Optimize metadata detail to balance speed vs completeness |
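To answer the rubric's size questions numerically, a small pure-Python helper can report bytes per chunk and total chunk count for a candidate layout (shapes taken from this document; a 4-byte float32 itemsize is assumed):

```python
from math import ceil, prod

def chunk_stats(shape, chunks, itemsize=4):
    """Return (uncompressed bytes per chunk, total number of chunks)."""
    n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))
    return prod(chunks) * itemsize, n_chunks

# Flooding layout: (scenario, return_period, year, percentile, y, x)
bytes_per_chunk, n_chunks = chunk_stats(
    shape=(4, 1, 6, 3, 6000, 6000),
    chunks=(1, 1, 2, 3, 1, 1),
)
print(bytes_per_chunk)  # 24 (far below the ~1 MB guidance)
print(n_chunks)         # 432000000
```

Single-pixel chunks make point reads cheap but produce hundreds of millions of objects per array, which drives up object-store request counts; this is exactly the trade-off behind the rubric's bulk-access rows.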