# OGGM on the cloud - a preliminary roadmap

Some technical notes about the deployment of OGGM on the cloud.

Context:

- The GlacioCloud project: https://oggm.org/2019/04/17/glaciocloud/
- This github issue: https://github.com/pangeo-data/pangeo/issues/521

Here are some notes I gathered after playing around with JupyterHub myself a little bit; they will be refined as I/we learn more about how to do it properly. Comments welcome!

## Goal

From the proposal:

> **Set-up and deploy the Open Global Glacier Model in a scalable cloud environment, and make this computing environment available for everyone.** ... We envision an online platform where people can log in and get access to a fully functional computing environment where they can run the OGGM model. This environment will scale according to resource demand. It will be personalized and persistent, so that a user can prepare some computations, start them, log out, then log back in and still find the computing environment he or she left earlier. The advantages for the user will be considerable: scalable computing resources, no installation burden, no data download, no version issues, and a user-friendly development environment, all in a web browser.

More concretely: we wish to have a JupyterHub installed on our own cloud resources, with a working Python environment where OGGM can run. User authentication will be handled the same way as Pangeo does it, i.e. via a GitHub organisation. According to Ryan, the sessions can be persistent: users will develop a script or a notebook, let it run, log out, come back, and see the results.

## Timeline

I *need* to have a **working prototype by mid-July 2019** because I promised this for an invited talk I will give in Montreal (this is usually how I get things done: promising things before they actually exist). Details (see below) can be dealt with later, in the longer run.
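To make the authentication idea more concrete, here is a minimal sketch of what GitHub-organisation log-in could look like in a JupyterHub configuration file, using the [oauthenticator](https://github.com/jupyterhub/oauthenticator) package as Pangeo does. This is a sketch, not a tested deployment: the OAuth credentials, callback URL, and organisation name are placeholders, and the exact option names should be checked against the version of oauthenticator we end up installing.

```python
# jupyterhub_config.py -- a sketch, assuming `oauthenticator` is installed
# alongside JupyterHub; all values below are placeholders.
from oauthenticator.github import GitHubOAuthenticator

c = get_config()  # provided by JupyterHub when loading this file

c.JupyterHub.authenticator_class = GitHubOAuthenticator

# Credentials of a GitHub OAuth application (placeholders)
c.GitHubOAuthenticator.client_id = 'replace-with-client-id'
c.GitHubOAuthenticator.client_secret = 'replace-with-client-secret'
c.GitHubOAuthenticator.oauth_callback_url = 'https://hub.example.org/hub/oauth_callback'

# Restrict log-in to members of a GitHub organisation
# (option name to be checked against the installed oauthenticator version)
c.GitHubOAuthenticator.github_organization_whitelist = {'OGGM'}
```

The same settings would later move into the Helm chart's `config.yaml` if we deploy with Zero2JupyterHub, but the idea is identical: the Hub delegates log-in to GitHub and only lets organisation members through.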
## Tools

Before we start, here are some definitions for the newcomers:

- [JupyterHub](https://jupyter.org/hub) will provide the platform where users can log in and work in their own [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) environment with OGGM installed.
- [Binder](https://mybinder.readthedocs.io/en/latest/) is a layer on top of JupyterHub which automates the process of providing *any* kind of environment in a Hub. It defines a set of rules about how to specify the environment, and uses [repo2docker](https://github.com/jupyter/repo2docker) to build it. Binder is open source, and a free service is provided at [mybinder](https://mybinder.org/). MyBinder is what we currently use for OGGM-Edu. It works *very* well; the only drawback is performance, since users only get small machines, of course. I don't think we want to use Binder here, at least not at first. We will use repo2docker, though.

  **NB @kaedonkers:** Have you considered using https://binder.pangeo.io/? It is Binder with much more resources. It has even been used for hackathons with multiple users.
- [kubernetes](https://kubernetes.io/) is what does the hard job of scaling in the background. Don't ask me more, I don't know about it. [Helm](https://helm.sh/) is the tool which makes installing things on kubernetes easier.
- [Pangeo](http://pangeo.io/) is not really a "tool": it is more a community of people developing, promoting, and documenting how to use the tools listed above in a geosciences context. We will be part of this community, use their expertise, and extend their toolbox to the specifics of OGGM.

## Cloud resources

I need to get access to free cloud resources. This should not be an issue, and I will write a proposal ASAP. I have no preference for any cloud provider, but my first experience with Azure was not very good. I've heard good things about Digital Ocean (see [this](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1192)), and Pangeo uses Google.
**Which providers should we choose?** Here's a wishlist:

- it should provide free resources for research and teaching, because we have no money to pay
- it should support kubernetes for scaling
- it should be easy to use

Can someone help me choose here?

## repo2docker and OGGM

This is going to be quite easy, I think. We need to provide JupyterHub with Docker containers where OGGM can run. We have good [installation instructions](https://docs.oggm.org/en/v1.1/installing-oggm.html), and I have been able to use repo2docker to build a working environment for OGGM-Edu: see [this](https://github.com/OGGM/oggm-edu/tree/master/binder). Here, we will have to see how we can use the Pangeo way of doing things to be more efficient and follow their protocol.

These links provide more context about things I don't fully understand yet:

- https://discourse.jupyter.org/t/repo2docker-make-it-easy-to-start-from-arbitrary-docker-image/502
- https://github.com/jupyter/repo2docker/issues/487

## Set-up JupyterHub in the Pangeo style

This is where I don't yet know what things are going to look like. We will have to follow the instructions on Pangeo (http://pangeo.io/setup_guides/index.html) and maybe start smaller with [Zero2JupyterHub](http://zero-to-jupyterhub.readthedocs.io/). This is where kubernetes and helm come into play. **This is where I'll need most help!**

## OGGM specifics

Most of the things I described until now amount to a "standard JupyterHub set-up with a Pangeo flavor", as far as I understand. In practice, there will be interesting questions related to OGGM itself:

- we rely on a lot of input data, which will have to live on the cloud as well. How do we organize it, and will it be a bottleneck? I have some ideas to start with, but this will need some brainstorming
- OGGM is embarrassingly parallel, but uses a lot of I/O for practical reasons and modularity. How will it perform on the cloud?
- etc.
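To illustrate the "embarrassingly parallel" point above: OGGM applies the same chain of tasks to every glacier independently, so the workload maps naturally onto a pool of workers, and the open question is mostly the per-glacier I/O. Here is a toy sketch of that pattern; `run_glacier` is an invented stand-in for a chain of real OGGM entity tasks, not OGGM's actual API.

```python
# Toy sketch of an embarrassingly parallel per-glacier workflow.
# `run_glacier` stands in for real OGGM entity tasks, which would
# read and write a per-glacier working directory (hence the heavy I/O).
from multiprocessing import Pool

def run_glacier(rgi_id):
    """Pretend to process one glacier and return a result for it."""
    return rgi_id, len(rgi_id)  # placeholder "result"

def run_all(rgi_ids, processes=4):
    # Each glacier is independent, so a plain process pool is enough;
    # on the cloud this role could be played by dask workers instead.
    with Pool(processes=processes) as pool:
        return dict(pool.map(run_glacier, rgi_ids))

if __name__ == '__main__':
    results = run_all(['RGI60-11.00897', 'RGI60-11.00787'], processes=2)
    print(results)
```

The scaling question is then less about the computation itself and more about whether hundreds of workers hammering the shared input data and working directories become an I/O bottleneck.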