# Data Compression Project
<!-- Add the tag for the current cycle number in the top bar -->
- Shaped by: Nikki, Christos, Praveen, Mauro
- Appetite (FTEs, weeks):
- Developers: Nikki, Christos
## Problem
<!-- The raw idea, a use case, or something we’ve seen that motivates us to work on this -->
Shape up for the Data Compression Project.
We want to bring together many datasets from different sources (e.g. ECMWF, MCH, EXCLAIM) so that scientists have them readily available. Ideally, they would be homogenized with respect to data format, retrieval, and parsing. We would also like to have them in NetCDF format, with each variable available individually (as is standard in the ESGF data format convention) and at various temporal resolutions.
1. Data layout for users
    - We have two main user classes:
        - Scientists analyzing results
            - Identifying and tracking phenomena in the data
            - Standard statistical tests such as bias, RMSE, and correlation for quick model verification. Ideally, the EXCLAIM dataset and the observational/reanalysis datasets would be on the same grid for this purpose.
            - Similarity measures (e.g. PAA, DWT, DFT) for datasets that cover the same variables but come from different sources, e.g. EXCLAIM and ECMWF (https://www.ifi.uzh.ch/dam/jcr:bac88dda-56d4-4a86-b0b4-d351beba07f5/ReportMScProjectProkofjevsFarabullini.pdf)
            - Extracting and analyzing geographical time series
        - Machine Learning training
            - Global or geographically localized
2. Data storage layout as output of the model
    - Options available:
        - Native model layout (ICON grid, domain decomposed and serialized)
        - Lat-long (this is the data at 5 km we have for the aquaplanet)
        - Lat-long (this is the pSST (prescribed SST) dataset at 2.5 km)
        - Native grid (this is the pSST dataset, with 3D variables)
        - Save icosahedral-preserving layout
            - Enables blocking and more effective data compression
            - Reshape the data from native model layout back to icosahedral in "post-processing" after the model runs
3. Data compression scheme
    - We aim to use Langwen's algorithm
        - JPEG2000 (blocks of horizontal planes) + compression of the residual
        - This is needed to preserve properties like rotation to enable data analysis, and possibly ML
        - This has implications for the choice of data storage layout and for user experience
    - Do we use the same compression strategy for all variables? Water quantities, for example, need particular care.
    - Power spectrum analysis (power loss) is a good way to judge whether the compression quality is acceptable
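
The verification metrics named above (bias, RMSE, correlation) and the power-loss check could be sketched as follows. This is a minimal sketch, assuming NumPy is available and that both fields already sit on the same regular (lat, lon) grid. All function names are hypothetical, and the low-pass filter is only a stand-in for a real lossy codec; it is not Langwen's algorithm.

```python
import numpy as np


def bias(reference, candidate):
    """Mean signed difference (candidate minus reference)."""
    return float(np.mean(candidate - reference))


def rmse(reference, candidate):
    """Root-mean-square error between the two fields."""
    return float(np.sqrt(np.mean((candidate - reference) ** 2)))


def correlation(reference, candidate):
    """Pearson correlation over all grid points."""
    return float(np.corrcoef(reference.ravel(), candidate.ravel())[0, 1])


def power_spectrum(field):
    """Power per zonal wavenumber, averaged over latitude, for a (lat, lon) field."""
    coeffs = np.fft.rfft(field, axis=-1)
    return np.mean(np.abs(coeffs) ** 2, axis=0)


def relative_power_loss(reference, candidate, rel_floor=1e-12):
    """Fraction of spectral power lost per wavenumber (0 means nothing lost)."""
    p_ref = power_spectrum(reference)
    p_cand = power_spectrum(candidate)
    # Ignore wavenumbers whose reference power is only numerical noise.
    significant = p_ref > rel_floor * p_ref.max()
    loss = np.zeros_like(p_ref)
    loss[significant] = 1.0 - p_cand[significant] / p_ref[significant]
    return loss


# Synthetic (lat, lon) field: a large-scale wave plus small-scale detail.
lon = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
lat = np.linspace(-1.0, 1.0, 32)[:, None]
original = np.cos(lat) * np.sin(2 * lon) + 0.1 * np.sin(20 * lon)

# Stand-in for a lossy codec: drop all zonal wavenumbers above 10.
coeffs = np.fft.rfft(original, axis=-1)
coeffs[:, 11:] = 0.0
reconstructed = np.fft.irfft(coeffs, n=original.shape[-1], axis=-1)

loss = relative_power_loss(original, reconstructed)
print(f"bias={bias(original, reconstructed):.3e}")
print(f"rmse={rmse(original, reconstructed):.3e}")
print(f"corr={correlation(original, reconstructed):.4f}")
print(f"power loss at k=2: {loss[2]:.3f}, at k=20: {loss[20]:.3f}")
```

In this toy case the point-wise metrics look harmless while the spectrum shows that the small-scale wave is gone entirely, which is the kind of damage the power-loss check is meant to expose.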
## Appetite
<!-- Explain how much time we want to spend and how that constrains the solution -->
...
## Solution
<!-- The core elements we came up with, presented in a form that’s easy for people to immediately understand -->
Links for further investigation:
1. https://github.com/climet-eu/compression-lab-notebooks
2. https://docs.google.com/document/d/1Q30yGIQDySCtF2pKGAMLsympV36BW0TGZKUSgvX3jyU/edit?tab=t.0#heading=h.dygv08xj9ltk
The initial idea was to use Langwen's data compression approach, which uses DWT and Zarr. However, since ECMWF has already implemented several other compression methods, it might be a good idea to run some test cases and see which one fits us best.
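
One way to set up such test cases is a small harness that runs every candidate on the same field and records compression ratio, maximum error, and encode time. The following is a minimal sketch, assuming NumPy. The uniform quantization step and the standard-library back ends (`zlib`, `bz2`, `lzma`) are placeholders chosen only because they need no extra dependencies; they do not represent Langwen's approach or the ECMWF methods. `benchmark` and `BACKENDS` are hypothetical names.

```python
import bz2
import lzma
import time
import zlib

import numpy as np


def quantize(data, abs_error):
    """Round to integer multiples of 2*abs_error, so the error stays within abs_error."""
    step = 2.0 * abs_error
    return np.round(data / step).astype(np.int32), step


def dequantize(codes, step):
    """Map integer codes back to floating-point values."""
    return codes.astype(np.float64) * step


# Placeholder candidates: name -> (encode, decode), both operating on raw bytes.
BACKENDS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data, abs_error):
    """Quantize once, then compare each back end on the same integer codes."""
    codes, step = quantize(data, abs_error)
    raw = codes.tobytes()
    results = []
    for name, (encode, decode) in BACKENDS.items():
        start = time.perf_counter()
        packed = encode(raw)
        elapsed = time.perf_counter() - start
        # Decode again so the reported error reflects the full round trip.
        restored = np.frombuffer(decode(packed), dtype=np.int32).reshape(data.shape)
        max_err = float(np.max(np.abs(dequantize(restored, step) - data)))
        results.append({
            "codec": name,
            "ratio": data.nbytes / len(packed),
            "max_abs_error": max_err,
            "encode_seconds": elapsed,
        })
    return results


# Synthetic test field: smooth structure plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0 * np.pi, 256)
field = np.sin(x)[None, :] * np.cos(x)[:, None] + 0.01 * rng.standard_normal((256, 256))

for row in benchmark(field, abs_error=1e-3):
    print(row)
```

Real candidates could be plugged in by adding entries to the registry, as long as each one exposes an encode/decode pair; the error and ratio columns then stay comparable across methods.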
Proposed structure:
```
data_compression_folder/
├── command_line_tool.py
├── parameter_file.txt
└── src/    # compression filters
```
The user would tweak the compression algorithm parameters in `parameter_file.txt` and then run `command_line_tool.py`, specifying the compression algorithm and the file to compress. The script should then output the compressed file.
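
The workflow described above could look roughly like the sketch below. Everything beyond the two file names is an assumption made for illustration: the `key = value` format of `parameter_file.txt`, the `--algorithm`, `--params`, and `--output` flags, and the parameter keys are all hypothetical. The `zlib` and `lzma` entries are placeholders for the real compression filters that would live under `src/`.

```python
import argparse
import lzma
import zlib
from pathlib import Path


def compress_zlib(payload, params):
    """Placeholder filter; 'zlib.level' is a hypothetical parameter key."""
    return zlib.compress(payload, int(params.get("zlib.level", 6)))


def compress_lzma(payload, params):
    """Placeholder filter; 'lzma.preset' is a hypothetical parameter key."""
    return lzma.compress(payload, preset=int(params.get("lzma.preset", 6)))


# Placeholder registry: algorithm name -> function(payload, params) -> bytes.
ALGORITHMS = {"zlib": compress_zlib, "lzma": compress_lzma}


def read_parameters(path):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got: {line!r}")
        params[key.strip()] = value.strip()
    return params


def main(argv=None):
    parser = argparse.ArgumentParser(description="Compress a file with a chosen algorithm.")
    parser.add_argument("input", type=Path, help="file to compress")
    parser.add_argument("--algorithm", choices=sorted(ALGORITHMS), required=True)
    parser.add_argument("--params", type=Path, default=Path("parameter_file.txt"))
    parser.add_argument("--output", type=Path, help="defaults to <input>.<algorithm>")
    args = parser.parse_args(argv)

    params = read_parameters(args.params)
    compressed = ALGORITHMS[args.algorithm](args.input.read_bytes(), params)
    output = args.output or args.input.with_name(f"{args.input.name}.{args.algorithm}")
    output.write_bytes(compressed)
    print(f"{args.input} -> {output} ({len(compressed)} bytes)")
    return output
```

Keeping the filters behind a name-to-function registry means a new algorithm only needs one entry there, and the parameter file stays the single place where users tune it. Adding the usual `if __name__ == "__main__":` guard that calls `main()` would expose the same entry point on the command line.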
## Rabbit holes
<!-- Details about the solution worth calling out to avoid problems -->
## No-gos
<!-- Anything specifically excluded from the concept: functionality or use cases we intentionally aren’t covering to fit the ## appetite or make the problem tractable -->
## Progress
<!-- Don't fill during shaping. This area is for collecting TODOs during building. As first task during building add a preliminary list of coarse-grained tasks for the project and refine them with finer-grained items when it makes sense as you work on them. -->
- [x] Task 1
    - [x] Subtask A
    - [x] Subtask X
- [ ] Task 2
    - [x] Subtask H
    - [ ] Subtask J
- [ ] Discovered Task 3
    - [ ] Subtask L
    - [ ] Subtask S
- [ ] Task 4