# Intake / Pangeo Catalog: Making It Easier To Consume Earth’s Climate and Weather Data
Submission site: https://www.conftool.org/earthcube2020/
Pangeo Discourse: https://discourse.pangeo.io/t/earthcube-annual-meeting-call-for-abstracts-due-apr-15/556
All abstract submissions are due on April 15, 2020 and will be limited to 300 words in length.
-----
## Abstract
Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. Before loading and analyzing a specific data set, the data user needs to know which data sets are available and which attributes describe each of them.
In this notebook, we demonstrate the integration of data discovery tools such as [intake](https://intake.readthedocs.io/en/latest/) and [intake-esm](https://intake-esm.readthedocs.io/en/latest/) (an intake plugin) with data stored in cloud-optimized formats (zarr). We highlight (1) how these tools provide transparent access to local and remote catalogs and data, and (2) the API for exploring arbitrary metadata associated with the data and for loading data sets into data array containers.
We also showcase the [Pangeo catalog](https://catalog.pangeo.io/), an open-source project to enumerate and organize cloud-optimized climate data stored across a variety of providers, and a place where several intake-esm collections are now publicly available. We use one of these public collections as an example to show how an end user would explore and interact with the data, and conclude with a short overview of the catalog's online presence.
## Notes/Outline
- The Pangeo ecosystem allows us to leverage data stored in the cloud in a variety of different ways
- Low-level interaction through zarr or netCDF has always been possible
- Cataloging tools such as Intake and its ESM extension allow us to open and consolidate several homogeneous datasets with minimal user effort
### Tools/Packages Used
- xarray - allows for simpler interfacing with datasets in a variety of formats (netCDF, zarr, csv, etc.)
- zarr - our **preferred** format for cloud-ready datasets - typically all metadata is consolidated in one `.zmetadata` file, which allows users to get an overview of the dataset with minimal egress
- intake - primary package behind data cataloging (grouping of different datasets into categories); offers a flexible plugin system to access new data paradigms
- intake-esm - intake plugin used to create and share ESM collections, which aggregate several datasets with homogeneous metadata; used heavily for the cataloging of CMIP and CESM large ensembles
- etc...
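
To illustrate why a consolidated `.zmetadata` file keeps egress low, here is a minimal sketch using only the Python standard library. It assumes the zarr v2 consolidated layout, in which `.zmetadata` is a single JSON document whose `metadata` mapping holds every `.zgroup`, `.zarray`, and `.zattrs` entry. The sample contents (variable names, shapes, attributes) are invented for illustration and abbreviated, and `summarize` is a hypothetical helper, not part of zarr or xarray.

```python
import json

# Hypothetical, abbreviated contents of a `.zmetadata` file.
# Variable names, shapes, and attributes are made up for this example.
SAMPLE_ZMETADATA = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        ".zattrs": {"title": "example surface temperature"},
        "tas/.zarray": {"shape": [120, 90, 180], "chunks": [12, 90, 180],
                        "dtype": "<f4"},
        "tas/.zattrs": {"units": "K",
                        "_ARRAY_DIMENSIONS": ["time", "lat", "lon"]},
        "time/.zarray": {"shape": [120], "chunks": [120], "dtype": "<i8"},
        "time/.zattrs": {"_ARRAY_DIMENSIONS": ["time"]},
    },
})


def summarize(zmetadata_text):
    """Describe every array in a group using only the one JSON document."""
    entries = json.loads(zmetadata_text)["metadata"]
    summary = {}
    for key, array_meta in entries.items():
        # Array metadata lives under "<name>/.zarray"; skip everything else.
        if not key.endswith("/.zarray"):
            continue
        name = key[: -len("/.zarray")]
        attrs = entries.get(name + "/.zattrs", {})
        summary[name] = {
            "shape": array_meta["shape"],
            "dtype": array_meta["dtype"],
            "dims": attrs.get("_ARRAY_DIMENSIONS"),
            "units": attrs.get("units"),
        }
    return summary


for name, info in summarize(SAMPLE_ZMETADATA).items():
    print(name, info)
```

A single small read is enough to list every variable with its shape and dimensions, without touching any data chunks. In practice xarray does this work when a store is opened with `xr.open_zarr(store, consolidated=True)`.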
### Catalog Website
- To demonstrate the utility of these tools, the [online catalog](https://catalog.pangeo.io/) relies on them to list the data made available through the [Pangeo Datastore](https://github.com/pangeo-data/pangeo-datastore)
- By default the online catalog points to a ["master" catalog](https://github.com/pangeo-data/pangeo-datastore/blob/master/intake-catalogs/master.yaml), which in turn can be explored to access and open all the datasets within its child catalogs
- Because these datasets vary in the format of their contents, the online version of a dataset can vary in its appearance:
    - A "standard" zarr-based dataset can be opened directly in xarray, with contents and metadata being displayed in a similar manner to netCDF
    - For larger homogeneous datasets, intake-esm is able to display an overview of all associated data in a Pandas-like dataframe, which can be searched and sorted
- Reach goals of the website
    - To have functionality for generic intake catalogs via file upload or direct link
    - Integration of Pangeo's JupyterHubs (i.e. "Open this in Jupyter" option)
    - Overview of data status (size in GB, if it can be opened without issue, version of packages used to create it?)
    - Directing users to datasets in their region (currently all Pangeo buckets are located in the same location)
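
As a rough sketch of how the master catalog mentioned above can be browsed from Python: the GitHub link points at an HTML page, so the helper below converts it into the raw-file URL that `intake.open_catalog` can read. The names `github_blob_to_raw` and `list_master_catalog` are illustrative and not part of intake. The second function needs the `intake` package and network access, so it is defined here but not called.

```python
from urllib.parse import urlsplit

MASTER_CATALOG_PAGE = (
    "https://github.com/pangeo-data/pangeo-datastore"
    "/blob/master/intake-catalogs/master.yaml"
)


def github_blob_to_raw(url):
    """Map a github.com '/blob/' page URL to its raw.githubusercontent.com URL.

    Assumes the branch name contains no '/' characters.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, branch, *path])


def list_master_catalog():
    """Open the master catalog and return the names of its entries.

    Requires the `intake` package and network access.
    """
    import intake  # imported lazily so the helper above works without it

    cat = intake.open_catalog(github_blob_to_raw(MASTER_CATALOG_PAGE))
    return list(cat)


print(github_blob_to_raw(MASTER_CATALOG_PAGE))
```

Each name returned by `list_master_catalog()` is either a dataset entry or a child catalog, which can be explored in the same way.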
### Intake-esm & ESM collection specification
- Enable data providers
    - To direct data users with the help of catalogs
    - To enable data users to easily onboard new data with structured workflows
    - To provide complete information about the data assets at a user’s fingertips
- Enable data users
    - To find, discover, and investigate existing data assets
    - To load data assets into compute-ready containers (`xarray.Dataset` objects)
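
The idea behind an ESM collection can be sketched with the standard library alone: a JSON descriptor names the metadata columns and the column holding asset locations, and a CSV table lists one asset per row. The field names below follow the ESM collection specification as we understand it; every value (collection id, models, bucket paths) is invented, and `search` is a simplified stand-in for illustration, not intake-esm's implementation.

```python
import csv
import io
import json

# Hypothetical collection descriptor; all values are invented.
DESCRIPTOR = json.loads("""
{
  "esmcat_version": "0.1.0",
  "id": "example-collection",
  "description": "Illustrative collection with invented entries",
  "catalog_file": "example.csv",
  "attributes": [
    {"column_name": "source_id"},
    {"column_name": "experiment_id"},
    {"column_name": "variable_id"}
  ],
  "assets": {"column_name": "path", "format": "zarr"}
}
""")

# Hypothetical catalog table: one row per data asset.
CATALOG_CSV = """\
source_id,experiment_id,variable_id,path
model-a,historical,tas,gs://example-bucket/model-a/historical/tas
model-a,historical,pr,gs://example-bucket/model-a/historical/pr
model-b,historical,tas,gs://example-bucket/model-b/historical/tas
model-b,ssp585,tas,gs://example-bucket/model-b/ssp585/tas
"""


def search(rows, **query):
    """Keep rows matching every keyword; a value may be a string or a list."""
    def matches(row):
        for column, wanted in query.items():
            allowed = [wanted] if isinstance(wanted, str) else list(wanted)
            if row[column] not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


rows = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
asset_column = DESCRIPTOR["assets"]["column_name"]

hits = search(rows, experiment_id="historical", variable_id="tas")
paths = [row[asset_column] for row in hits]
print(paths)
```

With intake-esm itself, the corresponding steps are `intake.open_esm_datastore(...)` on the descriptor, `.search(...)` on the resulting collection, and `.to_dataset_dict()` to load the matching assets into xarray datasets. The keyword arguments of the last call have changed between releases, so check the documentation for the installed version.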
### Use Cases of Pangeo Catalog
- [Older use cases](https://github.com/pangeo-data/pangeo-example-notebooks), which do not necessarily reflect the current state of Pangeo Catalog
- [Using intake-esm to access TRACMIP data](https://github.com/pangeo-data/pangeo-tracmip-examples)