# Data Compression Project

- Shaped by: Nikki, Christos, Praveen, Mauro
- Appetite (FTEs, weeks):
- Developers: Nikki, Christos

## Problem

Shape up for the Data Compression Project. We want to bring together data sets from different sources (e.g. ECMWF, MCH, EXCLAIM, ...) so that scientists have them readily available. Ideally, they would be homogenized with respect to data format, retrieval, and parsing. We would also like to have them in NetCDF format, with each individual variable (following the ESGF data format convention) available at various temporal resolutions.

1. Data layout for users
    - We have two main user classes
        - Scientists analyzing results
            - Identifying and tracking phenomena in the data
            - Standard statistical tests such as bias, RMSE, and correlation for quick model verification. Ideally, the EXCLAIM dataset and the observational/reanalysis datasets would be on the same grid for this purpose.
            - Similarity measures, e.g. PAA, DWT, DFT, ... for data sets that describe the same quantities but come from different sources (e.g. EXCLAIM and ECMWF) (https://www.ifi.uzh.ch/dam/jcr:bac88dda-56d4-4a86-b0b4-d351beba07f5/ReportMScProjectProkofjevsFarabullini.pdf)
            - Extracting and analyzing geographical time series
        - Machine Learning training
            - Global or geographically localized
2. Data storage layout as output of the model
    - Options available:
        - Native model layout (ICON grid, domain decomposed and serialized)
        - Lat-long (this is the data at 5km we have for the aquaplanet)
        - Lat-long (this is the pSST (prescribed SST) dataset at 2.5km)
        - Native grid (this is the pSST dataset, with 3D variables)
    - Save icosahedral-preserving layout
        - Enables blocking and more effective data compression
        - Reshape the data from native model layout back to icosahedral in "post-processing" after the model runs
3. Data compression scheme
    - We aim at Langwen's algorithm
        - JPEG2000 (blocks of horizontal planes) + compression of the residual
    - This is needed to preserve properties like rotation to enable data analysis, and possibly ML
    - This has implications for the choice of data storage layout and for the user experience
    - Do we use the same compression strategy for all variables? For example, water quantities need particular care.
    - Power spectrum analysis (power loss) is a good way to judge whether a compression is acceptable

## Appetite

...

## Solution

Links for further investigation:

1. https://github.com/climet-eu/compression-lab-notebooks
2. https://docs.google.com/document/d/1Q30yGIQDySCtF2pKGAMLsympV36BW0TGZKUSgvX3jyU/edit?tab=t.0#heading=h.dygv08xj9ltk

The initial idea was to use Langwen's data compression approach, which uses DWT and Zarr. However, since ECMWF has already implemented several other compression methods, it might be a good idea to run some test cases and see which one fits us best.

Proposed structure:

```
data_compression_folder
L command_line_tool.py
  parameter_file.txt
L src/
    --compression filters--
```

The user would tweak the compression algorithm parameters in `parameter_file.txt` and then run `command_line_tool.py`, specifying the compression algorithm and the file to compress. The script should output the compressed file.

## Rabbit holes

## No-gos

## Progress

- [x] Task 1
    - [x] Subtask A
    - [x] Subtask X
- [ ] Task 2
    - [x] Subtask H
    - [ ] Subtask J
- [ ] Discovered Task 3
    - [ ] Subtask L
    - [ ] Subtask S
- [ ] Task 4
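
The quick-verification statistics (bias, RMSE, correlation) and the power-spectrum check mentioned under Problem could be prototyped roughly as below. This is only a sketch under assumptions: the function names are hypothetical, the fields are assumed to be 2D `(lat, lon)` NumPy arrays already on the same grid, and the lossy step is a plain uniform quantization used as a stand-in, not Langwen's algorithm or JPEG2000.

```python
import numpy as np


def verification_metrics(reference, candidate):
    """Bias, RMSE and Pearson correlation between two fields on the same grid."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    diff = cand - ref
    return {
        "bias": float(diff.mean()),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "correlation": float(np.corrcoef(ref, cand)[0, 1]),
    }


def zonal_power_spectrum(field):
    """Mean power per zonal wavenumber for a 2D (lat, lon) field."""
    coeffs = np.fft.rfft(np.asarray(field, dtype=np.float64), axis=-1)
    return (np.abs(coeffs) ** 2).mean(axis=0)


def relative_power_loss(original, reconstructed):
    """Fraction of spectral power lost per wavenumber (0 = none, 1 = all)."""
    p_orig = zonal_power_spectrum(original)
    p_rec = zonal_power_spectrum(reconstructed)
    # Guard against division by zero where the original carries no power.
    safe = np.where(p_orig > 0, p_orig, 1.0)
    return np.where(p_orig > 0, 1.0 - p_rec / safe, 0.0)


# Synthetic test field: a zonal wave plus small-scale noise (illustrative only).
rng = np.random.default_rng(0)
lon = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
original = np.sin(3 * lon)[None, :] + 0.1 * rng.standard_normal((32, 64))

# Placeholder lossy step (assumption): uniform quantization with a fixed step.
step = 0.05
reconstructed = np.round(original / step) * step

metrics = verification_metrics(original, reconstructed)
loss = relative_power_loss(original, reconstructed)
print(metrics)
print("largest power loss over wavenumbers:", float(loss.max()))
```

Running the same two functions for each candidate compressor and each variable would give a common basis for comparing methods, and for checking whether water quantities need a different setting than the other variables.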
