# Self-hosting JupyterHub on GPU workstations
Like many research groups in the domain sciences, my lab owns a few GPU workstations which we rely upon for much of our computational experimentation and heavy lifting. If you're coming from industry or the IT sector, you might wonder why we'd even buy hardware in an era when renting compute from one of three global multi-nationals is all the rage. If you come from an academic research group, it will probably be no surprise that we buy hardware, but you may be wondering why the heck I would bother with something like Kubernetes. So by way of introduction, let me try to address my motivation in both of these contexts for those respective audiences. And if you're like me and already know why you'd want to manage local hardware with today's user-friendly and familiar cloud-native environments, you've probably already skipped past this text and are now copy-pasting from the code chunks below anyway.
**Why buy hardware?** For research teams like mine, owning a few compute workstations just makes sense on a number of levels. (Though importantly, this does not mean that we do not _also_ use compute from cloud providers -- we do so all the time!) (1) Economically, we aren't start-ups with VC funds to burn that must either grow and scale or die in a few years. An academic research lab is a long-lived organism which seeks to maintain a consistent level of research output despite fluctuating funding. Buying hardware with a lifetime that is typically 2-3x the duration of a grant helps ride out those fluctuations. (2) In an academic context, larger hardware purchases tend to be exempt from overhead rates (typically 60% or more), though options like [cloudbank](https://cloudbank.org) have finally started to address this. (Ironic, given that expenses such as electricity and networking are bundled into cloud prices, but come free with purchased hardware since they are covered by overhead.) (3) The economics of GPUs are also particularly distorted. GPUs for the consumer market are substantially cheaper for comparable computational speeds than those licensed for the data center market. Perhaps relatedly, cloud provider costs for GPU instances are steep -- around $3/hour -- and free-tier GPU instances that might be viable for prototyping are virtually non-existent. (4) But the most important factor is the marginal cost of experimentation. Yes, there are plenty of horror stories of some student or intern accidentally racking up huge charges on cloud platforms, and yes, some platforms have additional services they can sell you to decrease that risk. But from the past decade of my own experience, it's not true accidents that get me, but the nature of research itself. When I'm experimenting, I don't want the voice in the back of my head saying "well, that was $200 for nothing." And we run the same things again and again to make sure results are robust. Really, how many open source projects would use CI/CD if they were charged by the minute? (In fact, while it is a topic for a different post, running self-hosted runners on GitHub Actions is another primary use of our owned hardware.) Our group relies on the amazing and reliable [Thelio lineup](https://system76.com/desktops) of desktop workstations from System76.
**But why Kubernetes?** I think the case for this is more subtle. I suspect that most computational labs which buy GPU workstations expect their users to interact with them in the traditional 'bare-metal' way, i.e. `ssh`-ing into the server. (Yes, VSCode is now a decent alternative for those who want a more visual interface than the classic terminal experience.) This assumes everyone has ssh keys, and that someone has the unix-admin responsibility for handling user accounts, permissions, dependencies, etc. But classic Unix cluster administration is pretty different from the DevOps of cloud platforms. The bar to being a user in this environment is already pretty high -- managing ssh keys and working with ssh-compatible interfaces. It also imposes a massive gulf between users, who often lack basic permissions like installing system libraries, and the system administrator, who is the all-powerful `root`. Cloud-native DevOps patterns provide a lot more nuance.
## K3s
First, we must set up [k3s](https://k3s.io/) on one or more nodes. Importantly, we'll disable `traefik` on K3s, since Z2JH will be handling our HTTPS certificates using `letsencrypt`. The [K3s docs](https://docs.k3s.io/) are quite solid, but this comes down to:
```bash
curl -sfL https://get.k3s.io | sh -s - --disable=traefik
```
(also provided in the `install-reset-K3s.sh` script in this repo). Useful things to know:
- The scripts `k3s-killall.sh` and `k3s-uninstall.sh` are installed and added to the path by the installation method above, and do what they say. (This is great -- nuking everything and getting a fresh start is not always easy on other K8s setups.)
- Use `systemctl restart k3s.service` to restart the thing without re-installing. (Good when we update configurations later on.)
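To check that the cluster actually came up, the `kubectl` bundled with K3s should report the node as `Ready` (a quick sanity check; `sudo` is needed here because we haven't set up credentials for a regular user yet):
```bash
# kubectl ships with K3s; the node should eventually show STATUS "Ready"
sudo k3s kubectl get nodes
```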
### Helm
[Helm](https://helm.sh/) is the de facto package manager for Kubernetes, and it is what we will use to install the software we want. There are [many ways](https://helm.sh/docs/intro/install/) to install it, but for convenience you can just copy-paste the following into your terminal:
```bash
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
K3s sets up credentials for talking to the Kubernetes server in `/etc/rancher/k3s/k3s.yaml`, and you have to tell Helm to look for these credentials there. You can do that by setting the `KUBECONFIG` environment variable, e.g. in your `.bashrc`.
```bash
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
```
Note that this file is owned by the *root* account, so if you are running as a non-root user, you may need to grant yourself rights to read that file. Something like `sudo chown $(id -u) /etc/rancher/k3s/k3s.yaml` may do the trick.
> Yuvi: Should we make a note here about security implications?
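To confirm Helm can now talk to the cluster, listing releases should succeed (at this point the list will simply be empty):
```bash
# Should return without error; no releases are installed yet
helm list --all-namespaces
```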
## Z2JH
The [Zero To JupyterHub for Kubernetes](https://z2jh.jupyter.org/en/stable/) docs are excellent. They cover some Kubernetes and Helm setup in various contexts, but we're already good to go there and can jump right into [Setup JupyterHub](https://z2jh.jupyter.org/en/stable/jupyterhub/installation.html).
First, let's tell `helm` where to find the JupyterHub helm chart:
```bash
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
```
Next, let's set up a simple JupyterHub!
```bash
helm upgrade --cleanup-on-fail \
  --install testjupyter jupyterhub/jupyterhub \
  --namespace testjupyter \
  --create-namespace \
  --version=3.2.1 \
  --wait
```
TODO: We should tell people how to find the version number
TODO: We should tell people how to pick the namespace / name
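(For the version: one way to see which chart versions are available, given the repo added above, is `helm search repo`. The namespace and release name are up to you; here we use `testjupyter` for both.)
```bash
# List published versions of the JupyterHub helm chart
helm search repo jupyterhub/jupyterhub --versions
```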
This should set up a simple but *working* JupyterHub that you can access by going to your machine's public IP address! **Any** username and any password will let you in. Go, try it!
TODO: Insert screenshot here
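If the page doesn't come up, checking on the hub and proxy pods is a good first debugging step (using the namespace chosen above):
```bash
# The hub and proxy pods should eventually show STATUS "Running"
kubectl -n testjupyter get pods
```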
Now let's secure this so that only the people we want can access it.
### HTTPS
We first set up a domain name and HTTPS.
TODO: Add a note here about pointing DNS record?
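(In short: create an A record for the hostname you plan to use, pointing at this machine's public IP address, and confirm it resolves before enabling HTTPS. One way to check, if `dig` is available, substituting your own hostname:)
```bash
# Should print this machine's public IP address
dig +short {your-hostname-here}
```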
Create a file called `config.yaml` - this will contain the complete configuration for our JupyterHub. The full [reference documentation](https://z2jh.jupyter.org/en/latest/resources/reference.html) for this has a lot of details you can look through! But first, let's set up automatic HTTPS.
```yaml
proxy:
  https:
    enabled: true
    hosts:
      - {your-hostname-here}
    letsencrypt:
      contactEmail: {your-email-here}
```
This config will use the wonderful, free and not-for-profit [Let's Encrypt](https://letsencrypt.org/) to automatically provision HTTPS certificates, and renew them when necessary!
Launch (or re-launch) JupyterHub with this configuration by upgrading the helm release. This is the same command you will re-run each time `config.yaml` changes:
```bash
helm upgrade --cleanup-on-fail \
  --install testjupyter jupyterhub/jupyterhub \
  --namespace testjupyter \
  --create-namespace \
  --version=3.2.1 \
  --values config.yaml \
  --wait
```
where `config.yaml` is the file containing the block above. You should now have HTTPS access at your domain name.
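A quick way to confirm the Let's Encrypt certificate was issued (it can take a minute or two after the upgrade) is to fetch the headers for your hostname:
```bash
# Should return HTTP headers over a valid TLS connection, with no certificate warnings
curl -I https://{your-hostname-here}
```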
### GitHub-based Authentication
The default authentication is for testing purposes only! Any user can log in with any name and password. Let's set up an authenticator that allows only users who are members of a given GitHub org. The [official Z2JH docs](https://z2jh.jupyter.org/en/stable/administrator/authentication.html) are once again a great guide, but here's the quick version. You'll need to create an OAuth application for your org on GitHub, then add the block below to `config.yaml` and re-run the `helm upgrade` command from above.
```yaml
hub:
  config:
    GitHubOAuthenticator:
      allowed_organizations:
        - {github-org}
      scope:
        - read:org
      client_id: {oauth_id}
      client_secret: {oauth_secret}
      oauth_callback_url: https://{your-hostname-here}/hub/oauth_callback
    JupyterHub:
      authenticator_class: github
```
Now, only users who are members of the given GitHub org can authenticate. You can also use the `github-org:github-team` syntax so that only members of a particular GitHub Team can authenticate (see the snippet below). Again, see the [official Z2JH docs](https://z2jh.jupyter.org/en/stable/administrator/authentication.html) for details on the large array of configuration options available, including other identity provider services.
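For example, a sketch reusing the placeholders above (the team placeholder is hypothetical; use your team's slug):
```yaml
hub:
  config:
    GitHubOAuthenticator:
      allowed_organizations:
        - {github-org}:{github-team}
```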
## Working with the GPU
Our first step is to enable GPU support in k3s before we worry about JupyterHub. Unfortunately, as anyone who works with GPUs can tell you, this can get janky. The drivers and CUDA packages required to support NVIDIA GPUs aren't entirely open source, which often leads to a bunch of manual work trying to figure out which versions go where.
CB: I think we should link to the k3s docs here.
1. Install the latest version of the nvidia drivers that will work for your graphics card. Currently, the version with broadest availability seems to be 535 - this will change with time.
```bash
sudo apt install nvidia-container-runtime cuda-drivers-fabricmanager-535 nvidia-headless-535-server nvidia-utils-535-server
```
2. Validate that the GPU is recognized by running `nvidia-smi`
```
exouser@test-k3s:~/z2jh$ nvidia-smi
Fri Jan 12 23:45:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 GRID A100X-8C Off | 00000000:04:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 1MiB / 8192MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```
3. Next, we install the nvidia-device-plugin to allow Kubernetes to *selectively* expose the GPU. Create a file named `nvidia-device-plugin-config.yaml` to store our configuration for this, and for now set its contents to the following:
```yaml
runtimeClassName: nvidia
```
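(Note: depending on your K3s version, the `nvidia` RuntimeClass this refers to may not exist in the cluster yet; the K3s docs show creating one. Treat the manifest below as a sketch, apply it with `kubectl apply -f`, and check the K3s docs for your version.)
```yaml
# RuntimeClass pointing Kubernetes at the nvidia container runtime handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```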
4. Now that we have a config file, install the device plugin using helm. We pass in the config file we created via the `--values` (or `-f`) parameter.
```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.3 \
  --wait \
  --values nvidia-device-plugin-config.yaml
```
You can check if this succeeded with `kubectl -n nvidia-device-plugin get pod`.
```bash
$ kubectl -n nvidia-device-plugin get pod
NAME READY STATUS RESTARTS AGE
nvdp-nvidia-device-plugin-6mx5j 1/1 Running 0 15m
```
5. Now, we need to tell our JupyterHub to use the GPU as well. In the `config.yaml` file you are using, add the following:
```yaml
hub:
  config:
    KubeSpawner:
      environment:
        NVIDIA_DRIVER_CAPABILITIES: compute,utility
      extra_resource_limits:
        nvidia.com/gpu: "1"
      extra_pod_config:
        runtimeClassName: "nvidia"
```
And run the `helm upgrade` command from earlier again (TODO: link to the command). This should give *all* users access to the GPU, and you can test that by running `nvidia-smi` in a terminal in JupyterLab, as shown below.
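For example, from a terminal inside a running JupyterLab session (the second command assumes your user image ships PyTorch; substitute whichever framework your image actually includes):
```bash
# The GPU should be visible from inside the user pod
nvidia-smi

# Optional framework-level check, only if PyTorch is installed in the image
python -c "import torch; print(torch.cuda.is_available())"
```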
But there's only one GPU on the machine, and this user is already using it! We want to allow users to *select* between GPU and non-GPU environments, as well as allow many of them to share the GPU.
6. NVIDIA has a [timeslicing](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html) feature that allows one GPU to be shared between multiple users. This is not as seamless as the way a CPU is shared, but it is better than not being able to share the GPU at all.
As we set it up, we will need to *predetermine* how many 'slices' to create, and then up to that many users can use the GPU at the same time. The overall power of the GPU will be shared between them. See the [NVIDIA docs](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html) for more details.
In this case, let's split our GPU into 8 slices. Open the `nvidia-device-plugin-config.yaml` file you created in step 3, and add the following lines:
```yaml
gfd:
  enabled: true
config:
  map:
    config: |
      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy: "envvar"
          deviceIDStrategy: "uuid"
        gfd:
          oneshot: false
          noTimestamp: false
          outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
          sleepInterval: 60s
      sharing:
        timeSlicing:
          resources:
            - name: nvidia.com/gpu
              replicas: 8
```
The very last line determines how many slices of this GPU are made.
After saving this file, run the `helm upgrade` command from Step 4 to apply this configuration. You can see how many GPU slices are available by running the following command:
```bash
$ kubectl get nodes -o=custom-columns=GPUs:.status.capacity.'nvidia\.com/gpu'
GPUs
8
```
This means up to 8 users can use the GPU at the same time on your JupyterHub!
CB: Based on my testing, I'm pretty sure this is not necessary when using images derived from the NVIDIA base image (perhaps it just requires some of the env var exports already found there). I seem to be able to launch GPU-enabled pods without an explicit GPU resource allocation, and thus without triggering the timeSlicing, on these images (e.g. `rocker/ml`), but not on other base images that have some GPU libraries added (e.g. `pangeo/torch-notebook`).
TODO: Introduce profileList, allow users to choose GPU / no GPU, and also do rocker
With our Kubernetes environment configured for GPU use, we can let users choose between GPU and non-GPU environments at spawn time via a profile list.
Not sure how much profile list config to show. My current public config is https://github.com/boettiger-lab/k8s/blob/main/jupyterhub/public-config.yaml
A minimum entry, I think, is merely:
```yaml
extra_pod_config:
  runtimeClassName: "nvidia"
```
With rocker-based ML images, this is enough (and we don't need time-slicing). With other images, we also need:
```yaml
extra_resource_limits:
  nvidia.com/gpu: "1"
```
(Showing snippets like this is concise, but it can be confusing where they belong in `config.yaml`; see the sketch below.)
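As a rough sketch of how these snippets might fit together (display names and image choices here are hypothetical, adapted from KubeSpawner's `profileList` option rather than copied from the public config linked above), the relevant part of `config.yaml` could look something like:
```yaml
singleuser:
  profileList:
    - display_name: "CPU only"
      description: "Standard environment with no GPU attached"
      default: true
    - display_name: "GPU (rocker/ml image)"
      description: "Uses the nvidia runtime; no explicit GPU resource request needed"
      kubespawner_override:
        image: "rocker/ml:latest"
        extra_pod_config:
          runtimeClassName: "nvidia"
    - display_name: "GPU (other images)"
      description: "Requests one (time-sliced) GPU explicitly"
      kubespawner_override:
        environment:
          NVIDIA_DRIVER_CAPABILITIES: compute,utility
        extra_resource_limits:
          nvidia.com/gpu: "1"
        extra_pod_config:
          runtimeClassName: "nvidia"
```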