# repo2docker-service - A JupyterHub service to use repo2docker
BinderHub and tljh-repo2docker are two pieces of software that both build Docker images using repo2docker. Those who directly deploy a [JupyterHub Helm chart](https://github.com/jupyterhub/zero-to-jupyterhub-k8s) don't have a way to let their users build images using repo2docker, though.
## What this text does and doesn't include
This text includes two main sections: first a technical background of tljh-repo2docker, and then a proposal to develop a new JupyterHub service to use repo2docker that should function as a building block via its REST APIs but also be useful on its own.
This text does not include a section about how the proposed JupyterHub repo2docker service could be used beyond its quite narrow scope, for example to provide a "click link -> build & push image -> launch user server" workflow. A workflow like that is meant to be facilitated by a service like this, though.
## Technical background of [tljh-repo2docker](https://github.com/plasmabio/tljh-repo2docker)
This text aims to summarize the code in tljh-repo2docker in order to evaluate what code could be adjusted and shared for a JupyterHub service (accessed under `/services/repo2docker`, and scoped to build images with repo2docker).
### [`__init__.py`](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/__init__.py)
`__init__.py` includes `tljh_custom_jupyterhub_config` and `tljh_extra_hub_pip_packages` that are detected as [TLJH plugin hooks](https://tljh.jupyter.org/en/latest/contributing/plugins.html), and this is how installing tljh-repo2docker can influence a [TLJH](https://github.com/jupyterhub/the-littlest-jupyterhub) distribution of JupyterHub.
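For reference, a TLJH plugin hook is just a Python function marked with TLJH's pluggy-based `hookimpl` decorator. Below is a minimal sketch of the shape of such a plugin (illustrative only, not the actual tljh-repo2docker code).

```python
# Minimal sketch of a TLJH plugin module exposing two of the hooks mentioned
# above. TLJH discovers functions decorated with `hookimpl`.
from tljh.hooks import hookimpl


@hookimpl
def tljh_extra_hub_pip_packages():
    # extra pip packages to install into the hub's Python environment
    return ["some-extra-package"]


@hookimpl
def tljh_custom_jupyterhub_config(c):
    # mutate the JupyterHub config object before JupyterHub starts
    c.Spawner.default_url = "/lab"
```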
The `tljh_custom_jupyterhub_config` registers additional _tornado web request handlers_ like below.
```python
c.JupyterHub.extra_handlers.extend(
    [
        (r"environments", ImagesHandler),
        (r"api/environments", BuildHandler),
        (r"api/environments/([^/]+)/logs", LogsHandler),
        (r"environments-static/(.*)", CacheControlStaticFilesHandler, ...),
    ]
)
```
### Extra JupyterHub web request handlers
JupyterHub is a [tornado](https://www.tornadoweb.org/) based web application that responds with HTML/JSON when HTTP web requests arrive. Web requests to different paths (`/hub/home` vs `/hub/admin` etc.) are handled by different handlers, which can also behave differently based on the web request's HTTP method (`GET`, `POST`, `DELETE`, ...).
A [tornado web request handler](https://www.tornadoweb.org/en/stable/web.html) is a class with functions (`get`, `post`, ...) that react to an incoming web request and provide an HTTP response, often containing HTML or JSON.
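As an illustration (not code from tljh-repo2docker), a minimal tornado web request handler and its registration could look like this.

```python
from tornado import web


class HelloHandler(web.RequestHandler):
    """Responds to HTTP GET requests with a small JSON message."""

    async def get(self):
        # tornado serializes a dict to JSON and sets the Content-Type header
        self.write({"message": "hello"})


# handlers are registered by mapping URL patterns to handler classes
app = web.Application([(r"/hello", HelloHandler)])
```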
tljh-repo2docker registers a few additional tornado web request handlers with JupyterHub. These handlers also rely on the Python decorators [`@web.authenticated`](https://www.tornadoweb.org/en/stable/web.html#tornado.web.authenticated) and [`@admin_only`](https://github.com/jupyterhub/jupyterhub/blob/75e03ef1d977dfee680708289c98432e2893ed5a/jupyterhub/utils.py#L342-L354), provided by tornado and JupyterHub for use within JupyterHub's own tornado application (they can't be used in a standalone web application).
- **[images.py](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/images.py)**
The `ImagesHandler` is registered to handle requests arriving to `/environments` and renders the [jinja2](https://jinja.palletsprojects.com) HTML template [images.html](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/templates/images.html).
This location is available only for authenticated JupyterHub admin users.
```python
class ImagesHandler(BaseHandler):
    @web.authenticated
    @admin_only
    async def get(self):
        images = await list_images()
        containers = await list_containers()
        result = self.render_template(
            "images.html",
            images=images + containers,
            default_mem_limit=self.settings.get("default_mem_limit"),
            default_cpu_limit=self.settings.get("default_cpu_limit"),
        )
        if isawaitable(result):
            self.write(await result)
        else:
            self.write(result)
```
- **[builder.py](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/builder.py)**
The `BuildHandler` is registered to handle HTTP `POST` and HTTP `DELETE` requests arriving to `api/environments`. When handling a POST request, it will build an image, and when handling a DELETE request it will delete an already built image.
As an API, the handlers respond with basic JSON messages like `{"status": "ok"}` once they have finished the task they were meant to accomplish (a rough sketch of this handler style follows this list).
- **[logs.py](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/logs.py)**
The `LogsHandler` is registered to handle HTTP `GET` requests arriving to `api/environments/.../logs` where `...` is a name associated with a build container to get repo2docker build logs from.
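To make the shape of these API handlers a bit more concrete, here is a rough sketch in the same style (not the actual tljh-repo2docker code; `build_image` and `remove_image` stand in for hypothetical helpers like those in docker.py described below).

```python
import json

from jupyterhub.handlers.base import BaseHandler
from jupyterhub.utils import admin_only
from tornado import web


class BuildHandler(BaseHandler):
    """Sketch: POST starts an image build, DELETE removes a built image."""

    @web.authenticated
    @admin_only
    async def post(self):
        data = json.loads(self.request.body.decode("utf8"))
        # build_image is a hypothetical helper, similar to docker.py's build_image
        await build_image(repo=data["repo"], ref=data.get("ref", "HEAD"))
        self.write({"status": "ok"})

    @web.authenticated
    @admin_only
    async def delete(self):
        data = json.loads(self.request.body.decode("utf8"))
        await remove_image(name=data["name"])  # hypothetical helper
        self.write({"status": "ok"})
```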
### Jinja HTML templates
When JupyterHub responds with HTML to a user, the HTML is rendered from a jinja template, given data such as the name of the user and more.
- **[page.html](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/templates/page.html) - A JupyterHub template**
By providing this template, it overrides the template provided by JupyterHub. By doing so, `tljh-repo2docker` is able to add an `Environments` link leading to `/environments`.
Adding that link is this template's sole purpose. This strategy is [in conflict with for example jupyterhub-nativeauthenticator](https://github.com/plasmabio/tljh-repo2docker/issues/58), which also overrides `page.html` in order to provide a link to its custom pages and handlers.
- **[admin.html](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/templates/admin.html) - A JupyterHub template**
The `/hub/admin` view relies on the `admin.html` template, which is expanded to also list the image used by each spawned server.
- **[images.html](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/templates/images.html) - A dedicated `tljh-repo2docker` template.**
This is the only template that is directly rendered by `tljh-repo2docker` itself. It provides a user interface to interact with `api/environments` paths via the bundled [images.js](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/static/js/images.js) javascript.
### Logic to execute repo2docker
tljh-repo2docker doesn't execute repo2docker directly, but instead starts a docker container using the image `quay.io/jupyterhub/repo2docker:main`, and builds a new image from within that container using the host machine's docker runtime, which is mounted into the container. `docker.py` is the code that starts that build container.
- **[docker.py](https://github.com/plasmabio/tljh-repo2docker/blob/master/tljh_repo2docker/docker.py)**
This file has three functions.
- `list_images` lists built images.
- `list_containers` lists build containers running the repo2docker image to build images.
- `build_image` runs the repo2docker image to build a given repository.
The file is small with very little source code, centered around calling `docker` to run a container with repo2docker installed, which in turn builds a new image.
The `build_image` function has a signature like below.
```python
async def build_image(
    repo,
    ref,
    name="",
    memory=None,
    cpu=None,
    username=None,
    password=None,
    extra_buildargs=None,
):
```
Note how `memory` is passed. It is used to label the image being built (and the container building it) with `tljh_repo2docker.mem_limit`. The `list_...` functions are also coupled like this to `mem_limit` and `cpu_limit`.
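To make the approach concrete, here is a rough sketch of this kind of `build_image` (not the actual docker.py code; it shells out to the `docker` CLI for brevity and skips most of the options shown in the signature above).

```python
import asyncio


async def build_image(repo, ref, name="", memory=None, cpu=None):
    """Rough sketch of the approach, not the actual tljh-repo2docker code.

    Starts a build container from the repo2docker image, mounted against the
    host's Docker socket, so the image it builds ends up on the host.
    """
    image_name = name or "r2d-example:latest"  # hypothetical default name
    cmd = [
        "docker", "run", "--rm",
        # give the build container access to the host machine's Docker runtime
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
    ]
    if memory:
        # label the build container so list_containers can report the limit;
        # tljh-repo2docker labels the built image in a similar way
        cmd += ["--label", f"tljh_repo2docker.mem_limit={memory}"]
    cmd += [
        "quay.io/jupyterhub/repo2docker:main",
        "jupyter-repo2docker",
        "--ref", ref,
        "--no-run",
        "--image-name", image_name,
        repo,
    ]
    proc = await asyncio.create_subprocess_exec(*cmd)
    await proc.wait()
    return image_name
```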
### Summary
As we consider creating a general purpose JupyterHub service to build images using repo2docker, I'd like to highlight how the parts of tljh-repo2docker we could hope to reuse are coupled in ways that prevent us from sharing them directly.
1. Extra web request handlers, but we need a dedicated web application
As we plan to build a dedicated JupyterHub service exposed via `/services/repo2docker`, we must also start a dedicated web server separate from the JupyterHub tornado server. We can't easily take these web request handlers and re-use them due to that.
2. HTML templates are coupled to JupyterHub's web application
The HTML templates are tightly coupled with JupyterHub's other provided templates. For example, images.html leads with the line `{% extends "page.html" %}`, and the provided page.html leads with the line `{% extends "templates/page.html" %}`, which I think refers to a JupyterHub bundled template.
3. UI, API, and built images are coupled with spawning options cpu_limit and mem_limit
The HTML/JS based user interface presented under `/environments`, the API it interacts with, and the logic to build an image all couple specifically to `cpu_limit` and `mem_limit`.
## Proposal of a jupyterhub repo2docker service
This is a proposal to develop a [JupyterHub service](https://jupyterhub.readthedocs.io/en/stable/reference/services.html) tightly scoped to do the few things needed to build images using repo2docker. The idea is that it can be usable on its own, but also serve as a building block for more advanced composed functionality.
### What the service should and shouldn't do
#### Should
- Be possible to use with a z2jh based JupyterHub where it should also be able to facilitate pushing of images to a container registry.
- Be its own web server.
- Work specifically against JupyterHub as an identity provider, and if needed use its RBAC system with custom scopes to determine which users are allowed to perform which actions.
Users must be logged in to view anything, and users with further permissions, for example to build images, are identified as having a JupyterHub RBAC scope.
- Provide a REST API (a hypothetical route sketch follows this list) to:
  - build (and optionally push) images
  - provide relevant information:
    - list built images
    - list images currently building
    - logs of a recently built or currently building image
- Support being run with or without a pre-configured container registry.
- Provide a HTML/JS based user interface to interact with the REST API.
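As a purely hypothetical sketch of what the REST API above could map to in terms of routes (paths and handler names are made up; nothing is decided):

```python
# Hypothetical routes for the service, mounted under /services/repo2docker/
api_handlers = [
    # GET: list built images, POST: build (and optionally push) an image
    (r"/services/repo2docker/api/images", ImagesAPIHandler),
    # GET: list images currently being built
    (r"/services/repo2docker/api/builds", BuildsAPIHandler),
    # GET: logs of a recently built or currently building image
    (r"/services/repo2docker/api/builds/([^/]+)/logs", BuildLogsAPIHandler),
]
```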
#### Shouldn't
- Assume it is running on the same machine as JupyterHub.
- Persist state to the local file system or JupyterHub's database, but instead rely on its in-memory state, the local docker runtime, and the optional remote container registry.
- Couple directly with things related to how the built images are used. As an example, consider a JupyterHub Spawner that wants to use information from this service about built images, because the Spawner thinks of them as images that users can start containers from. Then it is the Spawner that is responsible for asking this service for that information.
- Build images in a distributed manner like BinderHub can do by creating dedicated build Pods. This service should allow itself to run on a single machine.
### Choice of web server (tornado, fastapi, flask, etc.)
We need a web server, but what software should we build from and why? Our needs are probably not very advanced, so we can focus on more basic aspects.
- Smooth handling of authentication and authorization with JupyterHub
- Something the JupyterHub team overall is already used to
JupyterHub uses Tornado, [jupyter_server](https://github.com/jupyter-server/jupyter_server) uses Tornado, and [jupyverse](https://github.com/jupyter-server/jupyverse) uses FastAPI. For the time being, let's assume we use Tornado and compare against it if we consider something else.
### Managing authentication (who?) and authorization (allowed?)
This service should be developed to rely on JupyterHub to validate a visitor's identity, and any decision about whether a user is allowed to do something or not should be made using the new JupyterHub RBAC system. Like this, we can define new kinds of RBAC scopes that we tie to various permissions in the service. These scopes could then be granted to various JupyterHub groups of users as well.
A great thing is that, from the perspective of this repo2docker service, it can just ask JupyterHub what user is interacting with it and what permissions that user has. I think that a user could also be assigned the relevant permissions indirectly by belonging to a JupyterHub group, and then management of permissions could be done by managing the users in a group.
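A minimal sketch of what this could look like in a tornado-based service, assuming the `jupyterhub.services.auth` helpers discussed further down (the scope name and routes are made up for illustration):

```python
from jupyterhub.services.auth import HubOAuthCallbackHandler, HubOAuthenticated
from tornado import web


class BuildAPIHandler(HubOAuthenticated, web.RequestHandler):
    """Only reachable by users that JupyterHub can authenticate."""

    @web.authenticated
    async def post(self):
        user = self.current_user  # user model fetched from JupyterHub
        # With RBAC, the model includes the scopes the user holds for this
        # service; "custom:repo2docker:build" is a made-up custom scope.
        if "custom:repo2docker:build" not in user.get("scopes", []):
            raise web.HTTPError(403, "missing permission to build images")
        self.write({"status": "accepted"})  # a build would be started here


app = web.Application(
    [
        (r"/services/repo2docker/api/builds", BuildAPIHandler),
        # completes JupyterHub's OAuth flow for the service
        (r"/services/repo2docker/oauth_callback", HubOAuthCallbackHandler),
    ],
    cookie_secret=b"replace-with-a-real-secret",
)
```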
### Q/A: Could code be shared between a new repo2docker service and tljh-repo2docker?
Yes, hopefully. The vision is that this service is developed tightly scoped to function as a general purpose building block for use by more feature rich tooling like tljh-repo2docker.
### Open questions
Note that there are more open questions beyond these that I have not managed to formulate, such as details on the UI to be provided, etc.
- **Avoiding image name conflicts**
With users building and pushing images via this service, it can be security critical that they can't replace images built by other users. How do we ensure that this doesn't happen? I think the resolution is to ensure that the image name has a section for the username. A configuration like `image_name_template` could be relevant (illustrated after this list).
- **Container registry API assumptions**
What assumptions do we make about the container registry? I think there are standards for container registries describing what REST API they provide etc., and the answer to this question should at least clarify which standard we target. We need this to keep track of what is available in the container registry.
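To illustrate the first question (purely hypothetical; no such configuration exists yet), an `image_name_template` namespacing images per user could expand like this:

```python
# Hypothetical configuration value, namespacing built images per user so that
# one user can't overwrite images built by another user.
image_name_template = "{registry}/{username}/{repo_name}:{ref}"

image_name = image_name_template.format(
    registry="registry.example.org",
    username="erik",
    repo_name="my-analysis-repo",
    ref="a1b2c3d",
)
# -> "registry.example.org/erik/my-analysis-repo:a1b2c3d"
```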
# Followup, 2i2c slack's #binderhub-jupyterhub channel on Friday September 23rd ([link](https://2i2c.slack.com/archives/C03RLNFM43F/p1663928687936299))
I just had a meeting with Min RK who kindly helped out. Here are some notes from that and some next steps planned just in my head.
## My on-the-spot planned next steps
- practically start building something functioning as a hello world of the relevant techniques
- detail a REST API to cover all relevant interactions we want to support with this service to function as a reliable building block
- detail a primitive UI for /services/repo2docker for direct use of the service
- validate that a JupyterHub admin can manage whether a user is part of a group or not, which by the initial setup can imply permissions to do certain things in the service.
## My refined notes from the meeting with Min
- **Ideas about how the JupyterHub RBAC system can be used by the service were verified.** I wanted to make sure it is planned so that, for example, JupyterLab extensions could be developed to interact with the service as well, and I wasn't sure this would be viable - but it is practically viable!
- **Understanding that a dedicated webserver is needed was verified.** This is needed no matter what when we work to provide this service as a JupyterHub service (either managed or external). What tljh-repo2docker does would not require this though, as it just registers additional web request handlers for the main JupyterHub application to also handle.
- **Input about the use of the HubOAuth class (not tornado specific) and the HubOAuthenticated class (tornado specific).** I now understand what Python code is around that could be reused related to OAuth, and I'm confident it would be fine to implement this anew if needed, as it's mainly a standard OAuth2 procedure that's involved. JupyterHub related documentation: https://jupyterhub.readthedocs.io/en/stable/api/services.auth.html.
- **Tornado and FastAPI were discussed**, and both are serious options as the webserver for the service, with some known and some unknown pros/cons, making it not obvious which to go for. There are various examples of services developed with various webservers that can be used as input: https://github.com/jupyterhub/jupyterhub/tree/main/examples.
On a very technical level, making a service that relies on JupyterHub RBAC to manage a custom set of permissions would involve something like the steps below (a hypothetical `jupyterhub_config.py` sketch follows the list)...
1. Declare custom RBAC scopes (`JupyterHub.custom_scopes`). See
https://jupyterhub.readthedocs.io/en/stable/rbac/scopes.html#custom-scopes.
2. Declare that the service exists (`JupyterHub.services`). See
https://jupyterhub.readthedocs.io/en/stable/api/service.html.
3. Declare a role with scopes, and a group of users to have that role, and add
users to that group. See also
https://jupyterhub.readthedocs.io/en/stable/rbac/scopes.html#custom-scopes.
4. Declare that the service allowed scopes involve the custom scopes (?)
5. Optionally declare that the Spawned servers should request the custom scopes
as well to help a JupyterLab extension make requests to the service. This
would be `spawner.oauth_client_allowed_scopes`. If the user has such scopes,
a JupyterLab extension could access it via a "pageconfig token" or similar.
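A hypothetical `jupyterhub_config.py` sketch of steps 1-5 (the scope, service, role, and group names are all made up, and I'd double check the exact configuration keys against the linked documentation):

```python
# jupyterhub_config.py - a hypothetical sketch of steps 1-5 above

# 1. Declare custom RBAC scopes (custom scopes must be prefixed with "custom:")
c.JupyterHub.custom_scopes = {
    "custom:repo2docker:build": {
        "description": "Build images with the repo2docker service",
    },
}

# 2. Declare that the service exists, and (step 4) that its oauth client may
#    request the custom scope on behalf of users
c.JupyterHub.services = [
    {
        "name": "repo2docker",
        "url": "http://127.0.0.1:10101",
        "api_token": "should-be-a-real-secret",
        "oauth_client_allowed_scopes": ["custom:repo2docker:build"],
    },
]

# 3. Declare a role granting the custom scope to a group, and add users to it
c.JupyterHub.load_roles = [
    {
        "name": "image-builder",
        "scopes": ["custom:repo2docker:build"],
        "groups": ["image-builders"],
    },
]
c.JupyterHub.load_groups = {
    "image-builders": ["a-user-allowed-to-build"],
}

# 5. Optionally let spawned servers request the custom scope as well, so that
#    a JupyterLab extension could call the service with the user's credentials
c.Spawner.oauth_client_allowed_scopes = ["custom:repo2docker:build"]
```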