# Seasonal Monitor processing with Dask

The "Seasonal Monitor in the cloud" is set to process and refine potentially terabytes of EO data, generating analysis-ready data that can be served to various applications and stakeholders through the WFP RAM ODC. To handle such a significant amount of data in a timely fashion, we need software for scheduling and executing concurrent processing tasks in a distributed fashion. This software is [Dask](https://dask.org/). Dask is an obvious choice: it is a Python-based library that is tightly integrated with many of the tools the Seasonal Monitor uses (Jupyter, xarray, etc.) and is in fact one of the core dependencies of the Open Data Cube library.

### What is Dask?

According to its homepage, Dask "is a flexible library for parallel computing in Python" that "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". Essentially, Dask lets you read data in a "lazy" fashion (*i.e.* create a Python object without fully loading the data) and combine it with different processing steps into a "task graph", which can be submitted to a scheduler for computing. Depending on the input data, the computation can be performed on a variety of processing infrastructures, all the way from a personal laptop up to "clusters with 1000s of cores".

Summing up, Dask has the following advantages:

* integrated with core software of the Seasonal Monitor and ships with ODC
* easy-to-use API
* can be used on a personal laptop and distributed in the cloud in the same way
* enables distributed processing of large multidimensional arrays

For a more conceptual overview and the "virtues" Dask emphasizes, please see the [relevant docs](https://docs.dask.org/en/latest/).

### How will it be used?

Dask will be used for three different tasks:

1. Scaling the periodic processing of EO data within the Seasonal Monitor
2.
Scaling the ad-hoc processing tasks that might come from applications such as a stats-extraction API
3. Serving as a processing backend to the JupyterHub user platform

### What are the options?

Dask is open-source software, so it can be used freely. The options can therefore be split into two general categories:

1. **Owning the Dask processing cluster**

The first and most obvious option is to maintain a Dask processing cluster ourselves. The cluster can be deployed on AWS using either a Kubernetes cluster or a helper library such as [Dask cloudprovider](https://github.com/dask/dask-cloudprovider). Using a Kubernetes cluster is probably the more "professional" and scalable option of the two, while Dask cloudprovider seems to be intended more for individual use. However, the first implementation of the Seasonal Monitor processing used Dask cloudprovider with reasonable success.

Upsides would be:

* Full control over the infrastructure: how and when it scales, and whether anything needs to be added or changed
* The deployment could be replicated by someone else without having to subscribe to another service besides AWS

There are two obvious downsides to owning the infrastructure:

* The infrastructure needs to be set up first, which can be quite complex and time-intensive (managing all the permissions, logging, etc.)
* All owned infrastructure needs to be maintained: if something doesn't work, there's no support hotline. Version upgrades need to be managed, and usage needs to be monitored carefully.

2. **Using Dask as a service**

At the end of 2020, [Pangeo released a blogpost about new funding and new directions](https://medium.com/pangeo/pangeo-2-0-2bedf099582d) where they write:

> A Pangeo-style cloud environment is more than just a vanilla JupyterHub — it also means access to Dask clusters on demand, plus a specialized software environment.
> When we started operating JupyterHubs in the cloud three years ago, there were few commercial options available for purchasing these services, and we had to roll our own. Now the situation has changed. Some exciting new companies have recently launched to provide Jupyter together with scalable Dask clusters in the cloud. These include
>
> * Coiled: Founded by the Dask creators, Coiled provides Dask as a service to both individuals and enterprise.
> * Saturn Cloud: one of the first to offer Jupyter + Dask as a service.

So instead of maintaining the infrastructure ourselves, we could use the services of one of the available providers. Of the two mentioned above, **[Coiled](https://coiled.io/)** has already been tested for processing within the Seasonal Monitor. While the company was only founded recently and its service is still officially in a "beta" state, the people behind the company are some of the main Dask developers, including Matt Rocklin, one of the "fathers" of Dask. **[Saturn Cloud](https://www.saturncloud.io/s/home/)** has not been tested yet, but looks to be an intriguing option due to its offer of Jupyter as a service and its integration with AWS: the service can be purchased directly from the AWS marketplace. Considering the upsides of Dask, it's fair to assume that a managed Dask cluster will at some point be offered as a service by the cloud providers themselves.

There are some obvious upsides:

* No need to maintain or worry about the processing infrastructure. Version upgrades and implementation of new features are done "automatically". If anything doesn't work, there's dedicated support.
* Potentially smoother scaling of processing power
* Good out-of-the-box monitoring and cost tracking
* Integrated management of users and dedicated resources

Of course, there are downsides as well:

* Additional costs compared to running it ourselves: most likely we would pay for computation plus a service fee (still TBD what the costs would be)
* Using a service adds another dependency that not everyone can be assumed to subscribe to. However, Dask is universal, so in theory most of the code can be used as-is with a different Dask backend.
* Might be an issue in light of being a digital public good
* Procurement?

### Recommendation

We need to gather more information and clarify details. But using Dask as a service has upsides that are hard to ignore, such as no maintenance burden and (almost) no setup costs.

### Next steps

In the coming weeks, we'll likely schedule a call with Coiled and maybe Saturn Cloud to get an idea of how the service would fit and what the costs involved would be. Additionally, we should get a picture of the cost and work required to set up a cluster ourselves. This could potentially be sub-contracted to a developer / company (_e.g._ cloudflight).
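As a concrete illustration of the lazy / task-graph model described under "What is Dask?", the snippet below builds a computation without loading any data and only executes it on `.compute()`. This is a minimal sketch with illustrative array sizes, not code from the Seasonal Monitor itself; by default it runs on the local scheduler, and connecting a `dask.distributed.Client` to a remote cluster (self-hosted or via a service such as Coiled) beforehand would run the same code distributed, unchanged.

```python
import dask.array as da

# "Lazy" array: no data is allocated yet. Dask only records a task
# graph describing 100 chunks of ones (sizes are illustrative).
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Still lazy: these operations just extend the task graph.
y = (x + 1).mean()

# Only now is the graph handed to a scheduler and executed. By default
# this uses the local machine; a dask.distributed.Client pointed at a
# remote cluster would execute the exact same graph on remote workers.
result = y.compute()
print(result)  # 2.0
```

This "build the graph first, choose the scheduler later" design is what makes the laptop-to-cloud portability listed above possible.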