# Xdev Project Specification: Funnel

[<img style="float: right;" src="https://hackmd.io/F4m79FUGQiuP_8X1haXR0Q/badge" />](https://hackmd.io/F4m79FUGQiuP_8X1haXR0Q)

**Author:** Kevin Paul ([@kmpaul](https://github.com/kmpaul))

## Links

[Project Trello Board](https://trello.com/b/FfQmlXoh/xdev-project-funnel)

Repositories:

- [xdev](https://github.com/NCAR/xdev) (for this document)
- [intake-esm](https://github.com/intake/intake-esm)
- [xcollection](https://github.com/NCAR/xcollection)
- [funnel prototype](https://github.com/marbl-ecosys/cesm2-marbl/tree/main/notebooks/funnel)
- [funnel prototype example](https://github.com/matt-long/ocean-metabolisms)
- [provenance](https://provenance.readthedocs.io/en/latest/)
- [prefect](https://www.prefect.io/core)

## Disclaimer

**This *functional specification* is incomplete.** Until this specification has been agreed upon by the entire team, it should be considered a work in progress and should be revisited frequently for updates. Until this specification is considered complete, decisions about how to code the solutions to this specification should be assumed to be *unmade*. *This document should be complete before development actually begins.*

Things to keep an eye out for while reading this document include:

- ***Technical Notes or Issues*** should be made regarding suggestions and details that might pertain to the technical implementation of each individual feature. Put all of the jargon and technical mumbo-jumbo in *Technical Notes*!
- ***Issues*** should be detailed in each feature for discussion and decision before moving forward with development.

## Background

Funnel is motivated by the need for better, modern CESM diagnostics. Thus, to understand Funnel, you need to understand the problem that Funnel is trying to solve.

### Traditional CESM Diagnostics

CESM diagnostic packages are stand-alone utilities that, when run, produce new *data assets* (i.e., NetCDF files) as well as a pre-defined collection of *visualizations* of the model data. The new data assets are permanently stored on disk "next to" the original model output, essentially extending the model output data. The visualizations are saved as image files in a hierarchy of folders also containing simple HTML files that allow easy "browsing" of the visualizations. These diagnostic visualizations can then be shared with other scientists by serving the visualization folder with a webserver (e.g., Apache).

A basic diagram of this approach is shown below. The diagnostic processing steps are run in sequence immediately after the CESM model run is complete, and the two steps can be understood as first "compute the diagnostic data" and then "generate the visualizations."

<img style="width: 350px; display: block; margin: 0 auto;" src="https://raw.github.com/NCAR/xdev/hackmd/projects/funnel/images/diagnostics-workflow.svg" alt="Diagnostics Workflow">

In essence, the traditional diagnostics approach acknowledges that some of the necessary data for the diagnostic visualizations is *intrinsic* to the data produced directly from the model, and some of the necessary data is *derived* from the intrinsic data. One could view this as *parent* data (i.e., intrinsic) and *child* data (i.e., derived).

#### Weaknesses of Traditional Diagnostics

- Traditional diagnostic packages are not very *modular*. That is, the "recipes" for how to compute *child* data cannot be easily "extracted" from the diagnostics package and used in separate analyses.
- Traditional diagnostic packages are not very *extensible*. That is, if a scientist wants to visualize a different variable than one found in the *parent* or *child* data, that scientist needs to write their own analysis and visualization script. It would be better if the scientist could just "add their own recipe" to the diagnostics package, instead of having to write a stand-alone script. (Then their custom analyses could even be shared with others.)
- Traditional diagnostic packages are not very *customizable*. That is, there is usually no mechanism to finely tune the diagnostics run to accommodate different kinds of workflows or different kinds of model runs.
- Child data produced by the diagnostic pre-processing steps is saved *permanently* to disk, alongside the parent data. Since disk space is expensive, it would be better to save the "recipe for producing" that data, instead of the data itself.
- Traditional diagnostics packages are expected to be run in *batch mode*. That is, traditional diagnostic packages are usually run as "post-processing" steps immediately after the model run is finished. This means that traditional diagnostic package analysis doesn't lend itself to interactive, curiosity-driven analysis. It would be better if the same code used for diagnostic analysis could be used by scientists in interactive analysis, too.
- Traditional diagnostics packages typically do not work with Pangeo-friendly packages, such as Intake-ESM, Xarray, and Dask. This is also partly the reason they are not geared toward interactive analysis (see the previous bullet point). It would be better if the diagnostics packages were built on the same technology as is used for interactive analysis.

## Overview

Funnel is an approach to resolving all of the above weaknesses. In essence, Funnel is a modern reinterpretation of the traditional diagnostic package.

Most significantly, Funnel takes a new view of the traditional diagnostic workflow. The traditional diagnostic workflow makes *data assets* (i.e., files) "first-class citizens," meaning that the data needs to exist *on disk* in order for any diagnostic operations (e.g., visualization) to be performed. The Funnel workflow makes *recipes* (i.e., functions) first-class citizens, meaning that the highest priority is saving the functions for computing the diagnostic data to disk instead of the data itself.

The consequences of treating *recipes* (i.e., Python functions) as first-class citizens are numerous.

1. The storage space needed to save a recipe is much smaller than the storage space needed to save the data produced by the recipe.
2. Recipes can be "chained" together, making complex analyses modular and easier to manage.
3. Recipes can be easily developed by scientists and added to the diagnostics package, making the entire diagnostics package extensible.
4. Recipes make it possible for scientists to use the components of a "diagnostics suite" in their own interactive analyses.

Below is a depiction of how a Funnel-based diagnostics workflow might operate.

<img style="width: 350px; display: block; margin: 0 auto;" src="https://raw.github.com/NCAR/xdev/hackmd/projects/funnel/images/funnel-workflow.svg" alt="Funnel Workflow">

In the Funnel depiction, the "Diagnostic Data" is never written to disk. Instead, `ecgtools` is used to create an Intake-ESM catalog for the "Model Data." Then, instead of "Diagnostic Data" being generated and written to disk, a "Virtual" Catalog object is created: an Intake-ESM catalog with "virtual" data assets. This "Virtual" Catalog object is a valid Intake-ESM catalog that contains keys pointing to *recipes*, which represent "virtual data assets," taking advantage of new features added to Intake-ESM through this project. Scientists can then use the "Virtual" Catalog knowing that some data assets (variables) might be generated "on the fly."
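To make this concrete, here is a minimal sketch of how a scientist might interact with such a "Virtual" Catalog using standard Intake-ESM calls; the catalog file name and the query columns (`component`, `variable`) are hypothetical placeholders, not decisions made by this specification.

```python
import intake

# Open the "Virtual" Catalog (hypothetical file name; a real catalog would
# define its own columns and asset schema).
cat = intake.open_esm_datastore("cesm-virtual-catalog.json")

# Query the catalog as usual. Both original and "virtual" (derived) variables
# can appear in the query results.
subset = cat.search(component="ocn", variable=["TEMP", "SALT"])

# Converting to a dictionary of datasets triggers any recipes needed to
# produce the "virtual" data assets on the fly.
dsets = subset.to_dataset_dict()
```

The query-and-load workflow is the same whether a variable is stored on disk or generated on the fly.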
This "Virtual" Catalog object is a valid Intake-ESM catalog taht contains keys pointing to *recipes* which represent "virtual data assets," taking advantage of new features added to Intake-ESM through this project. Scientists can then use the "Virtual" Catalog knowing that some data assets (variables) might be generated "on the fly." ### Recipes, Preprocessing, Derived Variables and Operators Matt Long's [overview of the project](https://docs.google.com/presentation/d/19M_2el9y9P15YyBHOaXVbD_iosbhZoLsyKPq78znmJU/edit#slide=id.ge6486cb2d6_0_73) describes 3 main kinds of functions used to generate the "virtual data assets" (i.e., the *recipes*) on the fly: pre-processing functions, derived variable functions and operators. **Pre-processing functions** are functions that are passed to Xarray's `open_*` functions when constructing the Xarray `Dataset`. These functions are useful for modifying or amending metadata in the files before Xarray attempts to merge `Datasets` together. **Derived variable functions** are functions that create Xarray `DataArray` objects from other `DataArray` objects. The idea of these is that these functions require specificly-named variables as "input" to the function, and they create a new variable from the input variables. For example, a derived variable function `Z(X,Y)` might compute the new variable `Z` given the variables `X` and `Y`. There are many different considerations for these functions. > ***Technical Note:*** New features in Intake-ESM make it possible to define how to compute *derived variables* from variables contained in the existing Intake-ESM catalog. **Operators** are functions that create new Xarray `Datasets` from another Xarray `Dataset`. An example might be the computation of climatologies or anomalies of variables in a given `Dataset`. The resulting variables computed from the application of the operator do not necessarily share the same coordinates as the input variables, and they are many times data reductions (e.g., temporal and spatial means). As such, with operators, the assumption is that the returned `Dataset` cannot be merged or concatenated with the input `Dataset`. > ***Technical Notw:*** Operators are functions that become `tasks` in Prefect `Flow` objects. There is not fixed API for these functions; they can be anything. Thus, guidance is required to help the user know how to write new operators and use existing operators in an effective and efficient way. > ***Technical Note:*** Operators should be chainable, so that the output of one operator can be input of another operator. This is satisfied by the use of general Prefect `tasks`. > ***Technical Note:*** All recipes (functions) should be usable *outside* the Funnel framework. That is, these functions should be usable without Prefect! ## Nongoals This project will not address the "Diagnostics Output." That is, we will not concern ourselves with what the format and mechanisms for displaying the diagnostics results will be (e.g., web pages with static images, Jupyter Notebooks, Jupyter Book). > ***Note:*** The JupyterBook approach described in the ESDS blog post ["Reimagining Diagnostics Through the Use of the Jupyter Ecosystem"](https://ncar.github.io/esds/posts/2021/jupyter-based-diagnostics-overview/) could be the template for this. 
## Nongoals

This project will not address the "Diagnostics Output." That is, we will not concern ourselves with the format and mechanisms for displaying the diagnostics results (e.g., web pages with static images, Jupyter Notebooks, Jupyter Book).

> ***Note:*** The JupyterBook approach described in the ESDS blog post ["Reimagining Diagnostics Through the Use of the Jupyter Ecosystem"](https://ncar.github.io/esds/posts/2021/jupyter-based-diagnostics-overview/) could be the template for this.

## Requirements

### "Virtual" Intake-ESM Catalogs

**The user should interact directly with a valid Intake-ESM catalog.** It should allow the user to "query" the catalog (subselecting the total data assets based on the query) and then convert the subselected catalog into a "dictionary of datasets." Both original and derived variables should be queryable, and the derived variables should be computed on the fly when the "dictionary of datasets" is produced.

Operators should exist as functions that act on datasets returned by Intake-ESM. Hence, they are not explicitly contained in the catalog.

### Recipe Precedence

By design, recipes can only be applied to Xarray data structures, and therefore **recipes can only be applied at the time of `to_dataset_dict` or soon thereafter.** Given the 3 *kinds* of "recipes" defined above, **there is an order of operations that needs to take place for each of these kinds.**

1. **The *pre-processing functions* act first**, and they generally act on the "raw data" being read from the file. Hence, they act *before* the data has been converted into any Xarray data structure.

   > ***Technical Note:*** Intake-ESM can only accept 1 pre-processing function to be used across all Datasets.

2. Next, **the *derived variable functions* are constructed and applied to `Datasets` after their construction.**

   > ***Technical Note:*** This is now handled by Intake-ESM natively. Intake-ESM catalogs can define recipes for computing derived variables, which get computed "on the fly" and added to the datasets at this time.

3. Finally, ***operators* are applied to the `Datasets` *after* the derived variable functions are applied.**

This defines the precedence of the *kinds* of recipes, but what determines the precedence of multiple recipes *of the same kind*? That is, if there are multiple derived variable functions to be applied to a `Dataset`, in what order should they be applied? To determine this, **we will need some sort of *dependency* information to indicate that one derived variable function needs to be applied before another.** This is also possibly true for *operators*.

> ***Technical Note:*** We will not worry about derived variable precedence. Derived variables must be computed from the originating variables in the catalog.

### Query-Dependent Recipes

As mentioned above, all of the *recipes* should, in principle, be *restrictable* to *only* `Datasets` that match a certain Intake-ESM query. That is, each recipe needs to be associated with a "query range" (i.e., a range of values for each key in the Intake-ESM catalog).

### Cacheable Recipes

When a *recipe function* is executed with a specific input that can be uniquely distinguished from other inputs, **the data returned by the function should be cached so that subsequent applications of the function *with the same input data* can simply be read from disk.** A sketch of what such caching might look like follows this section.

> ***Technical Note:*** This implies that a function that returns `None`, but operates on the input "in place," might be hard to "automatically" cache, because the cacheable data is buried somewhere inside the function and not returned from it.

> ***Technical Note:*** This also implies that functions should *probably* act on the minimal amount of data needed. For example, if a derived variable function "inserts" a new variable into the input `Dataset`, and there are 50 other data variables in the `Dataset` (that are read directly from disk with Intake-ESM), it would be inefficient to cache the entire 51-data-variable `Dataset`. It would be better to simply cache the newly created variable.
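As one illustration of the idea (not a design decision), the following sketch shows a hypothetical caching decorator that keys a recipe's output on the recipe name and its keyword arguments and stores the result as Zarr. `CACHE_DIR` is a placeholder, and a real implementation would also need to fold the identity of the input `Dataset` (e.g., the Intake-ESM query that produced it) into the cache key.

```python
import functools
import hashlib
import json
import os

import xarray as xr

CACHE_DIR = "/glade/scratch/funnel-cache"  # hypothetical location


def cached_recipe(func):
    """Cache a recipe's returned Dataset to Zarr, keyed on the recipe name
    and a JSON-serializable description of its keyword arguments."""

    @functools.wraps(func)
    def wrapper(ds, **kwargs):
        # Build a unique key from the recipe name and its keyword arguments.
        # (A real implementation would also hash a description of `ds`.)
        token = hashlib.sha256(
            json.dumps({"recipe": func.__name__, "kwargs": kwargs}, sort_keys=True).encode()
        ).hexdigest()
        store = os.path.join(CACHE_DIR, f"{token}.zarr")

        if os.path.exists(store):
            # Subsequent calls with the same inputs read from disk.
            return xr.open_zarr(store)

        result = func(ds, **kwargs)
        result.to_zarr(store)
        return result

    return wrapper


@cached_recipe
def monthly_climatology(ds):
    return ds.groupby("time.month").mean("time")
```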
## Use Cases

TBD

## Prototype Architecture

1. **Derived variables** are defined in Intake-ESM catalogs using the new capabilities implemented in [intake/intake-esm#379](https://github.com/intake/intake-esm/issues/379) and [intake/intake-esm#389](https://github.com/intake/intake-esm/issues/389).

   > ***Technical Issue:*** Where should the functions defining the derived variables be defined? In the same packages where the Operators are grouped?

2. **Operators** are generic functions that can be used as Prefect tasks in a Prefect Flow. These operator functions can be adorned with the Prefect `task` decorator to be hooked into a Prefect Flow. Operators should be grouped into packages that define specific diagnostic workflows. Hence, a different package can be defined for a different diagnostic workflow. The operators can also be used in their `task`-undecorated form (see the sketch after this list).

3. **Recipes** for diagnostic workflows are contained in Prefect Flow objects. Flow objects can be created from `task`-decorated **Operator** functions.

   > ***Technical Issue:*** Where should the Flow objects be defined? In the same packages where the Operators are grouped?

4. **Caching** is provided by a separate package that can be hooked into Prefect.

   > ***Technical Issue:*** What is the name of the new **caching** package? What is its state, and what needs to be done to get it published?
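To illustrate how these pieces might fit together, here is a hedged sketch using the Prefect 1.x ("Prefect Core") API linked above. The operator functions, catalog path, cache directory, and query columns are all hypothetical, and the use of a checkpointed `LocalResult` is only one possible caching hook, not a decision.

```python
import intake
from prefect import Flow, task
from prefect.engine.results import LocalResult


def open_model_data(catalog_path, **query):
    """Load a dataset from a (virtual) Intake-ESM catalog."""
    cat = intake.open_esm_datastore(catalog_path)
    dsets = cat.search(**query).to_dataset_dict()
    # For simplicity, assume the query selects exactly one dataset.
    (ds,) = dsets.values()
    return ds


def monthly_climatology(ds):
    """Operator: monthly climatology (usable with or without Prefect)."""
    return ds.groupby("time.month").mean("time")


def anomaly(ds, clim):
    """Operator: deviation of each month from its climatology."""
    return ds.groupby("time.month") - clim


# Wrap the plain functions as Prefect tasks only when composing the Flow.
# `checkpoint`/`result`/`target` ask Prefect to persist a task's output,
# which is one possible hook for recipe caching.
cache = LocalResult(dir="/glade/scratch/funnel-cache")  # hypothetical location
open_task = task(open_model_data)
climatology_task = task(monthly_climatology, checkpoint=True, result=cache, target="climatology.out")
anomaly_task = task(anomaly)

with Flow("ocean-diagnostics") as flow:
    ds = open_task("cesm-virtual-catalog.json", component="ocn", variable=["TEMP"])
    clim = climatology_task(ds)
    anom = anomaly_task(ds, clim)  # operators chain: output of one feeds the next

# flow.run() executes the whole recipe; the undecorated functions remain
# usable directly, e.g., monthly_climatology(some_dataset).
```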