# Hackathon goals
## Goals for these events
1. Improve how we work together to build analysis tools.
1. Begin to imagine and build the next generation of "model diagnostics."
## Vision: An interactive numerical laboratory for Earth system science
In the context of (2) above, we might imagine some high-level requirements.
- Seamless integration of routine model evaluation and cutting-edge research
- Enable novel means of data interactivity and visualization
- Scalable
  - Capable of handling Big Data performantly
  - Enabling new applications, entraining communities, etc.
- Component models are not necessarily a natural organizing principle
  - Be as model-agnostic as possible
- Fluid integration of observations and models
- Enable reproducible science
- Cloud forward perspective
- Community-developed and open-source
## Diagnostic frameworks: the path forward
We might break the problem into two very big parts:
- `Analysis elements`
- `Workflow`
### Analysis elements
`Analysis elements` are scripts or [Jupyter Notebooks](https://jupyter.org/); they typically follow a sequence (sketched below):
1. Read some data (preferably via an API)
2. Apply `operators` that transform the data, perform dimension reductions, compute derived quantities, etc.
3. Produce visualizations: could be static plots or interactive visualizations
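A minimal sketch of such an element, using an Xarray tutorial dataset as a stand-in for model output (the dataset, variable, and output file names are placeholders):

```python
import matplotlib.pyplot as plt
import xarray as xr

# 1. Read some data (the Xarray tutorial dataset stands in for model output)
ds = xr.tutorial.open_dataset("air_temperature")

# 2. Apply an operator: reduce over time to get a long-term mean
ds_mean = ds.mean(dim="time")

# 3. Produce a visualization (a static plot here)
ds_mean["air"].plot()
plt.savefig("air_temperature_mean.png")
```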
### Workflows
`Workflows` might be composed of multiple `analysis elements`; the workflow framework automates, parameterizes, and executes these elements.
[Netflix has some interesting ideas and tools for Notebook based workflows](https://netflixtechblog.com/notebook-innovation-591ee3221233).
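One tool highlighted there is [papermill](https://papermill.readthedocs.io/), which executes parameterized notebooks. A minimal sketch of the pattern (the notebook names and the `case` parameter are hypothetical):

```python
import papermill as pm

# Run the same analysis element for several cases by injecting parameters
# into a template notebook ("element_template.ipynb" is a hypothetical name).
for case in ["case_A", "case_B"]:
    pm.execute_notebook(
        "element_template.ipynb",
        f"executed/element_{case}.ipynb",
        parameters={"case": case},
    )
```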
### Data access APIs
An API wraps messy details behind a standardized interface. Using APIs for data access has several advantages.
- Data access based on hard-coded paths is fragile
- The desired product sometimes involves rote computation
  - e.g., [`pop_tools.get_grid(...)`](https://pop-tools.readthedocs.io/en/latest/examples/get-model-grid.html) reads `CESM INPUTDATA` files via web protocol (see the sketch after this list)
- Access details are messy
  - Arbitrary number of files
  - Standardization steps may need to be applied en route
- An API can be parameterized (i.e., it can accept arguments, enabling control and automation)
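For instance, the `pop_tools` call mentioned above amounts to a one-liner that hides the file handling (the grid name shown is one of the standard POP grids):

```python
import pop_tools

# Retrieve the POP gx1v7 grid as an Xarray Dataset; pop_tools locates and
# reads the underlying CESM INPUTDATA files behind this call.
grid = pop_tools.get_grid("POP_gx1v7")
print(grid)
```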
#### Example data API: Intake-esm catalog
```python
import intake
# Open the catalog and search for the desired datasets
col = intake.open_esm_datastore("glade-cmip6.json")
cat = col.search(activity_id="CMIP", source_id="CESM2",
                 experiment_id="1pctCO2", variable_id="co2",
                 table_id="Amon", grid_label="gn")
# Load matching data as a dictionary of Xarray Datasets (chunked along time)
dsets = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": 36}})
```
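The resulting `dsets` is a dictionary of Xarray Datasets, keyed by the catalog's grouping attributes, so downstream `operators` can iterate over whatever the search returned without hard-coding paths.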
### Operators
`Operators` do the actual computational work of the analysis. We might think of these as functions that consume and produce Xarray objects:
```python
ds_out = operator_func(ds_in)
```
In reality, it's often more complicated, but some key points for `operators` are (a sketch follows the list):
- Consume and produce [Xarray](http://xarray.pydata.org/en/stable/api.html) objects.
- Be [Dask](https://docs.dask.org/en/latest/)-friendly to enable scalability
- Be as model-agnostic as possible, leveraging layers of abstraction
- [eventually] Enable provenance tracking, caching state, etc. (see [xpersist](https://github.com/matt-long/xpersist), [xpublish](https://github.com/jhamman/xpublish))
- Collaborate with [GeoCAT](https://geocat.ucar.edu/) to build domain-specific operators
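As a purely illustrative example, an operator computing annual means might look like the following; it consumes and produces Xarray objects and stays lazy when the input is Dask-backed (the function name and docstring are hypothetical):

```python
import xarray as xr

def annual_mean(ds: xr.Dataset) -> xr.Dataset:
    """Operator: reduce a time-resolved dataset to annual means.

    Consumes and produces Xarray objects; the reduction remains lazy
    (Dask-friendly) as long as the input variables are Dask-backed.
    """
    return ds.groupby("time.year").mean(dim="time")

# Usage, following the pattern above: ds_out = annual_mean(ds_in)
```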
## Initial plan of attack
### Build `analysis element` prototypes
- Focus on problems of interest to you now: functionality we might want to see in routine validation, or particular research questions
- Aim for best coding practices (modularity!), but focus on scientific functionality first
- Get it working, then refactor
- Communicate!
- Share what you're up to, challenges and achievements
- Ask for help
### Identify and build key `operators`
A good operating principle might be "write new code as a last resort." However, sometimes the thing you want to do isn't well supported by existing packages. In these cases, we need to identify the right place to build and support functionality.
#### When to create a package?
[Pangeo has developed some guidelines for packages](https://pangeo.io/packages.html#guidelines-for-new-packages). Paraphrasing some of these, packages should:
1. Solve a general problem
1. Have a clearly defined, relatively narrow scope
1. Avoid duplication, leverage existing packages as much as possible
1. Consume and produce Xarray objects: Xarray data structures facilitate interoperability between packages
1. Operate lazily (i.e., be Dask-friendly; see the quick check below)
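A quick sanity check for that last guideline; the array shapes and chunk sizes here are arbitrary:

```python
import dask.array as da
import xarray as xr

# Build a small Dask-backed dataset (shapes and chunks are arbitrary)
sst = da.random.random((120, 180, 360), chunks=(12, 180, 360))
ds = xr.Dataset({"sst": (("time", "lat", "lon"), sst)})

# A lazy reduction: nothing is computed until .compute() is called
out = ds.mean(dim="time")
print(isinstance(out["sst"].data, da.Array))  # True: the result is still lazy
```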
### Coordinate sharing and standardization
As we accumulate a body of work, we might conceive of synthesis and refactoring, ultimately aiming to build workflow automation. We should think carefully about how to maximize the likelihood that prototypes will be useful.
- We might consider building prototypes around some common high-priority datasets, standardizing data access with `intake`.
- We might want to use a project template and consolidate projects in a single GitHub organization.
## Relevant packages
We should leverage existing functionality!
- [**xarray**](https://xarray.pydata.org): netCDF-like data model in Python
- [**dask**](https://dask.org): Intrinsic parallelism for analytics
- [**xgcm**](https://xgcm.readthedocs.io): General Circulation Model postprocessing with Xarray
- [**pop-tools**](https://pop-tools.readthedocs.io/): nascent support of POP-specific functionality
- [**intake-esm**](https://intake-esm.readthedocs.io): data catalog utility for loading ESM datasets as Xarray objects
- [**xhistogram**](https://xhistogram.readthedocs.io): Fast, flexible, label-aware histograms for numpy and Xarray
- [**regionmask**](https://regionmask.readthedocs.io): create masks of geographical regions
- [**climpred**](https://climpred.readthedocs.io/en/stable/): analysis of ensemble forecast models for climate prediction
- [**metpy**](https://unidata.github.io/MetPy/latest/index.html): tools for reading, visualizing, and performing calculations with weather data
[And many more...](http://xarray.pydata.org/en/stable/related-projects.html)