At CarbonPlan we frequently use terabyte-scale public climate datasets in our projects. This can be difficult because of the sheer data volumes involved. Instead of trying to download massive datasets, we use cloud resources that let us perform computations close to the data. However, the data format can also cause difficulty if it isn't optimized for cloud-based computation. People often work around this by creating a separate analysis-ready, cloud-optimized (ARCO) copy of the dataset, but that requires storing a second copy of the data, which can be expensive. By creating a reusable Kerchunk reference of the NASA-NEX-CMIP6 dataset, we avoided duplicating 36 terabytes of data and sped up our extreme heat work by an order of magnitude. Creating and sharing Kerchunk references of public datasets in a data commons can help speed up everyone's work and enable open science.
### Two models for scientific computing
#### Download model
Many remote-sensing, model, and assimilation datasets live in NASA DAACs or similar data centers. Gridded datasets are commonly distributed in NetCDF or GeoTIFF formats. In the case of NetCDF, a dataset is usually composed of hundreds or thousands of individual files. Working with an entire dataset commonly involves downloading it to a local university cluster, or a subset of it to a laptop. In some archival systems, you enter a global queue and wait for a robot arm to retrieve and read the data from a tape archive. This **download -> subset -> analyze** model moves the data to your compute resources. It generally requires a lot of transfer and processing time and a lot of storage space (since data is duplicated), and it makes reproducibility difficult.
#### Data-proximate model
##### Cloud-optimized datasets
Another model is to store datasets in *Analysis-Ready, Cloud-Optimized (ARCO)* formats. This flips the paradigm and moves the computation to the data, which allows for quicker analysis: individual chunks can be read asynchronously, the computation happens next to the data, and the work scales easily. Zarr, Cloud-Optimized GeoTIFF (COG), and Parquet are examples of ARCO formats.
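As a rough illustration, data-proximate access to an ARCO Zarr store with xarray might look like the following sketch; the store path and variable name are placeholders, not a real dataset.

```python
# A minimal sketch of data-proximate access to an ARCO (Zarr) dataset.
# The store path and the "tasmax" variable are hypothetical placeholders.
import xarray as xr

ds = xr.open_zarr(
    "s3://example-bucket/example-dataset.zarr",  # hypothetical Zarr store
    storage_options={"anon": True},              # anonymous, public read access
    consolidated=True,                           # fetch consolidated metadata in one request
)

# Access is lazy and chunked: only the chunks touched by this selection are read.
monthly_mean = ds["tasmax"].sel(time="2015-01").mean(dim="time").compute()
```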
##### Archival datasets
One can also access archival data formats, such as NetCDF, with data-proximate computing. This provides some flexibility around your compute resources, but it may have performance drawbacks compared to an ARCO dataset.
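For comparison, a data-proximate read of a remote archival NetCDF file might look like the sketch below (again with placeholder paths); without chunk-level metadata available up front, this pattern often reads more bytes than strictly necessary.

```python
# A minimal sketch of data-proximate access to a remote archival NetCDF file.
# The object path and the "tasmax" variable are hypothetical placeholders.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)  # anonymous access to a public bucket

# Stream the remote file into xarray's HDF5-based backend.
with fs.open("s3://example-bucket/tasmax_day_2015.nc") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    monthly_mean = ds["tasmax"].sel(time="2015-01").mean(dim="time").load()
```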
[Figure: Data Access Modes in Science (figshare)](https://figshare.com/articles/figure/Data_Access_Modes_in_Science/11987466?file=22017009)
### Kerchunk
This is all great if your data is already in an ARCO format, but there are many reasons why that might not be the case. Perhaps you don't have the expertise or the resources to process and store an ARCO copy of the archival dataset.
This is where Kerchunk comes in. Kerchunk lets you create a reusable reference file for an archival dataset so that it can be read as if it were in an ARCO format such as Zarr. Not only does this offer significant performance benefits, it also allows you to build a ‘virtual reference dataset’ by merging across variables and concatenating along a time dimension.
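In practice, Kerchunk scans each file once, records the byte ranges of its chunks, and can then combine the per-file references into a single virtual dataset. A minimal sketch, using placeholder file paths rather than any real archive, looks like this:

```python
# A minimal sketch of creating Kerchunk references for two remote NetCDF files
# and combining them along time. The S3 paths are hypothetical placeholders.
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

urls = [
    "s3://example-bucket/tasmax_day_2015.nc",
    "s3://example-bucket/tasmax_day_2016.nc",
]

single_refs = []
for url in urls:
    # Scan each NetCDF4/HDF5 file and record the byte ranges of its chunks.
    with fsspec.open(url, mode="rb", anon=True) as f:
        single_refs.append(SingleHdf5ToZarr(f, url, inline_threshold=300).translate())

# Concatenate the per-file references along the time dimension into one
# virtual dataset that can be read through the Zarr interface.
mzz = MultiZarrToZarr(
    single_refs,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
    remote_protocol="s3",
    remote_options={"anon": True},
)
combined_refs = mzz.translate()  # an in-memory dict of references
```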
By using Kerchunk, you can have the best of both worlds* – data providers keep doing the work of maintaining, updating, storing, and publishing the dataset in stable, common formats, while you get the cloud-optimized read performance of an ARCO dataset. The Kerchunk references themselves are tiny and easy to share, which helps enable open science and reproducibility. In this model, data providers can keep using their stable, time-tested data formats, and users still get cloud-native access patterns.
\* *provided the dataset's native chunking suits your use case*
### Real-world analysis using Kerchunk
Our climate impacts work at CarbonPlan relies on access to high-quality climate datasets. As part of our recent work on [Extreme-Heat](https://carbonplan.org/research/extreme-heat-explainer), we created a global dataset of wet bulb globe temperature (WBGT) for multiple Shared Socioeconomic Pathways (SSPs). To calculate WBGT, we used the NASA-NEX-CMIP6 dataset, a spatially downscaled version of the CMIP6 archive. This dataset is composed of over 7,000 NetCDF files and is nearly 36 terabytes in size.
Instead of creating an ARCO copy of the NASA-NEX-CMIP6 dataset, we created a [repo](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6) demonstrating how Kerchunk can be used to build a virtual reference dataset, which allowed us to speed up our WBGT calculation.
As shown in this [notebook](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/blob/main/generation/parallel_reference_generation.ipynb), Kerchunk references can be generated in parallel if you have access to a distributed computing framework such as Dask. Here, we spun up 500 tiny cloud workers to create references for all of the NetCDF files. With that many workers, it took about 30 minutes to create Kerchunk reference files for the entire 36-terabyte NASA-NEX-CMIP6 dataset, and the resulting references only take up about 290 megabytes. The beauty of this approach is that the reference files only have to be created once. Now, anyone can use these references to read the NASA-NEX-CMIP6 dataset as if it were an ARCO dataset.
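The pattern in that notebook amounts to mapping the single-file scan across a Dask cluster. The sketch below shows the general idea, not the exact notebook code; the scheduler address and file paths are placeholders.

```python
# A rough sketch of generating Kerchunk references in parallel with Dask.
# The scheduler address and NetCDF paths are hypothetical placeholders.
import json
import fsspec
from dask.distributed import Client
from kerchunk.hdf import SingleHdf5ToZarr

client = Client("tcp://scheduler-address:8786")  # hypothetical Dask scheduler

urls = [
    "s3://example-bucket/tasmax_day_2015.nc",
    "s3://example-bucket/tasmax_day_2016.nc",
]


def make_reference(url: str) -> dict:
    """Create a Kerchunk reference dict for one remote NetCDF file."""
    with fsspec.open(url, mode="rb", anon=True) as f:
        return SingleHdf5ToZarr(f, url, inline_threshold=300).translate()


# The per-file scans are embarrassingly parallel, so fan them out to the workers.
futures = client.map(make_reference, urls)
references = client.gather(futures)

# Persist each reference as a small JSON sidecar file.
for url, ref in zip(urls, references):
    name = url.rsplit("/", 1)[-1].replace(".nc", ".json")
    with open(name, "w") as out:
        json.dump(ref, out)
```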
## Performance
Once we had created our references, we wanted to see whether it was all worth it: with this virtual ARCO dataset, could we speed up our WBGT calculation? In [this section of the repo](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/tree/main/comparison) we have two notebooks: [heat_datatree.ipynb](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/blob/main/comparison/heat_datatree.ipynb) and [heat_open_mfdataset.ipynb](https://github.com/carbonplan/kerchunk-NEX-GDDP-CMIP6/blob/main/comparison/heat_openmfdataset.ipynb). They detail two approaches to calculating WBGT – one loading all of the NetCDF files directly and one using the Kerchunk references to read the dataset as if it were an ARCO dataset.
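The Kerchunk-based approach opens the virtual dataset through fsspec's "reference" filesystem with a single call along these lines (a sketch with a placeholder reference file, not the exact notebook code):

```python
# A minimal sketch of opening a combined Kerchunk reference with xarray.
# "combined_refs.json" is a placeholder for the combined reference file.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined_refs.json",        # the Kerchunk reference file
            "remote_protocol": "s3",           # where the original NetCDF files live
            "remote_options": {"anon": True},  # anonymous, public access
        },
    },
)

# From here on it behaves like any lazy, chunked xarray dataset.
print(ds)
```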
| Method | # of Input Datasets | Temporal Extent | # of Workers | Worker Instance Type | Time |
| ------------------- | ------------------- | --------------- | ------------ | -------------------- | --------------------- |
| Archival dataset | 20 | 365 days | 10 | m7i.xlarge | 20 minutes 24 seconds |
| Cloud-optimized dataset (Kerchunk) | 20 | 365 days | 10 | m7i.xlarge | 2 minutes 49 seconds |
As the table above shows, the ARCO/Kerchunk method took under three minutes, compared to roughly 20 minutes for reading the archival NetCDF files directly. These time differences might not seem dramatic, but keep in mind that this comparison used only a small subset of the entire dataset. A nearly 10x speedup can take a computation from weeks to days.
### Try it yourself!
Feel free to use or redistribute the Kerchunk references we made. Keep in mind that the Kerchunk project is under active development, so you may run into sharp edges and breaking changes. Additionally, if your use case requires a different chunking schema than that of the underlying files, you will want to look at projects such as [pangeo-forge-recipes](https://pangeo-forge.readthedocs.io/en/latest/) and [xarray-beam](https://xarray-beam.readthedocs.io/en/latest/), which are ETL tools for creating ARCO datasets whose chunking can be modified.
At the time of writing, Kerchunk supports NetCDF 3 and 4, GRIB2, TIFF/GeoTIFF, and FITS. Examples of using Kerchunk can be found in the [official docs](https://fsspec.github.io/kerchunk/) as well as the [Project Pythia Kerchunk cookbook](https://projectpythia.org/kerchunk-cookbook/README.html). We hope this example shows how useful Kerchunk can be for large-scale analysis of Earth science data.
## Attribution
This type of work is possible because of the development work of:
- [Martin Durant (Kerchunk)](https://github.com/martindurant)
- [Tom Nicholas (Xarray-Datatree)](https://github.com/TomNicholas)
- [Oriana Chegwidden (WBGT - Heat Risk Analysis)](https://github.com/orianac)
- [Max Jones](https://github.com/maxrjones)
- [Andrew Huang](https://github.com/ahuang11)

This work was supported by the Pangeo-ML Kerchunk grant augmentation.