Like many research groups in the domain sciences, my lab owns a few GPU workstations which we rely upon for much of our computational experimentation and heavy lifting. If you're coming from industry or the IT sector, you might wonder why we'd even buy hardware in an era when renting compute from one of three global multi-nationals is all the rage. If you come from an academic research group, it will probably be no surprise that we buy hardware, but you may be wondering why the heck I would bother with something like Kubernetes. So by way of introduction, let me try to address my motivation in both of these contexts for those respective audiences. And if you're like me and already know why you'd want to manage local hardware with today's user-friendly and familiar cloud-native environments, you've probably already skipped past this text and are now copy-pasting from the code chunks below anyway.
Why buy hardware? For research teams like mine, owning a few compute workstations just makes sense on a number of levels. (Though importantly, this does not mean that we do not also use compute from cloud providers – we do so all the time!) (1) Economically, we aren't start-ups with VC funds to burn that must either grow and scale or die in a few years. An academic research lab is a long-lived organism which seeks to maintain a consistent level of research output despite fluctuating funding. Buying hardware with a lifetime that is typically 2-3x the duration of a grant helps ride out those fluctuations. (2) In an academic context, larger hardware purchases tend to be exempt from overhead rates (typically 60% or more), though options like cloudbank have finally started to address this. (Ironic, given that expenses such as electricity and networking are bundled into cloud prices, but are effectively free for purchased hardware since they are covered by overhead.) (3) The economics of GPUs are also particularly distorted. GPUs for the consumer market are substantially cheaper, at comparable computational speeds, than those that are licensed for the data center market. Perhaps relatedly, cloud provider costs for GPU instances are steep – around $3/hour – and free-tier GPU instances that might be viable for prototyping are virtually non-existent. (4) But the most important factor is the marginal cost of experimentation. Yes, there are plenty of horror stories of some student or intern accidentally racking up huge charges on cloud platforms, and yes, some platforms have additional services they can sell you to decrease that risk. But from the past decade of my own experience, it's not true accidents that get me, but the nature of research itself. When I'm experimenting, I don't want the voice in the back of my head saying "well, that was $200 for nothing." And we run the same things again and again to make sure results are robust. Really, how many open source projects would use CI/CD if they were charged by the minute? (In fact, while it is a topic for a different post, running self-hosted runners on GitHub Actions is another primary use of our owned hardware.) Our group relies on the amazing and reliable Thelio lineup of desktop workstations from System76.
But why Kubernetes? I think the case for this is more subtle. I suspect that most computational labs which buy GPU workstations expect their users to interact with them in the traditional 'bare-metal' way, i.e. `ssh`-ing into the server. (Yes, VSCode is now a decent alternative for those that want a more visual interface than the classic terminal experience.) This assumes everyone has ssh keys, and someone has the unix-admin responsibility for handling user accounts, permissions, dependencies, etc. But classic Unix cluster administration is pretty different from the DevOps of cloud platforms. The bar to being a user in this environment is already pretty high – managing ssh keys and working with ssh-compatible interfaces. It also imposes a massive step between users, who often lack basic permissions like installing system libraries, and the system administrator, who is the all-powerful `root`. Cloud-native DevOps patterns provide a lot more nuance.

## K3s
First, we must set up k3s on one or more nodes. Importantly, we'll disable `traefik` on K3s, since Z2JH will be handling our HTTPS certificates using `letsencrypt`. The K3s docs are quite solid, but this comes down to the one-line install sketched after the list below (also in the `install-reset-K3s.sh` script in this repo). Useful things to know:

- `k3s-killall.sh` or `k3s-uninstall.sh` are installed and added to the path when using the K3s installation method above, and each does what it says. (This is great – nuking everything and getting a fresh start is not always easy on other K8s setups.)
- `systemctl restart k3s.service` restarts the thing without re-installing. (Good when we update configurations later on.)
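Here is a sketch of that install one-liner; check the K3s docs for current options, but the `--disable traefik` server flag is the key piece for our setup:

```bash
# Install K3s on this node, disabling the bundled traefik ingress controller
# since Z2JH's own proxy will handle HTTPS for us.
curl -sfL https://get.k3s.io | sh -s - --disable traefik
```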
## Helm
Helm is the de facto package manager for Kubernetes, and that is what we will use to install the software we want to use. There are many ways to install it, but for convenience you can just copy-paste the following into your terminal:
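One option (of several described in the Helm docs) is the official installer script, sketched here:

```bash
# Download and run the official Helm 3 install script
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```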
K3s sets up credentials for talking to the Kubernetes server in `/etc/rancher/k3s/k3s.yaml`, and you have to tell Helm to look for these credentials here. You can do that by setting the `KUBECONFIG` environment variable, e.g. in your `.bashrc`.

Note that this file is owned by the root account, so if you are running as a non-root user, you may need to grant yourself rights to read that file. Something like `sudo chown $(id -u) /etc/rancher/k3s/k3s.yaml` may do the trick.

Helm is already installed with K3s, just set the env var:
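For example (using the credentials path K3s created above):

```bash
# Add to ~/.bashrc so Helm (and kubectl) can find the K3s cluster credentials
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
```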
## Z2JH
The Zero To JupyterHub for Kubernetes docs are excellent. They cover some Kubernetes and Helm setup in various contexts, but we're already good to go there and can jump right in to Setup JupyterHub.
First, let's tell `helm` where to find the JupyterHub helm chart. Next, let's set up a simple JupyterHub! A sketch of both commands follows.
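A minimal sketch, assuming a release and namespace both named `jupyterhub` (any names work, but remember what you choose – the later `helm upgrade` commands must reuse them):

```bash
# Register the JupyterHub chart repository
helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update

# Install a bare-bones hub; pick a chart version, e.g. by browsing
# the output of `helm search repo jupyterhub/jupyterhub -l`
helm upgrade --install jupyterhub jupyterhub/jupyterhub \
  --namespace jupyterhub --create-namespace \
  --version <chart-version>
```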
TODO: We should tell people how to find the version number
TODO: We should tell people how to pick the namespace / name
This should set up a simple but working JupyterHub that you can access by going to your machine's public IP address! Any username and any password will let you in. Go, try it!
TODO: Insert screenshot here
Now let's secure this so only people we want to can access it.
## HTTPS
We first set up a domain name and HTTPS.
TODO: Add a note here about pointing DNS record?
Create a file called `config.yaml` – this will contain the complete configuration for our JupyterHub. The full reference documentation for this has a lot of details you can look through! But first, let's set up automatic HTTPS.

This config will use the wonderful, free and not-for-profit Let's Encrypt to automatically provision HTTPS certificates, and renew them when necessary!
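A sketch of the HTTPS block for `config.yaml`, following the Z2JH automatic-HTTPS settings (substitute your own domain name and contact email):

```yaml
# config.yaml: have the Z2JH proxy obtain and renew certificates via Let's Encrypt
proxy:
  https:
    enabled: true
    hosts:
      - hub.example.org          # your domain (DNS must point at this machine)
    letsencrypt:
      contactEmail: you@example.org
```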
Launch/re-launch jupyterhub with this configuration by upgrading the helm chart:
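For example, reusing the release name, namespace, and chart version from the initial install above:

```bash
# Re-run this same command any time config.yaml changes
helm upgrade --install jupyterhub jupyterhub/jupyterhub \
  --namespace jupyterhub \
  --version <chart-version> \
  --values config.yaml
```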
TODO: Describe this as command we can re-run each time our config changes
where `config.yaml` is your config.yaml file with the above block. You should now have HTTPS access at your domain name.

## GitHub-based Authentication
The default authentication is for testing purposes only! Any user can log in with any name and password. Let's set up an authenticator to allow only users who are members of my GitHub org. The official Z2JH docs are once again a great guide, but here's the quick version. You'll need to create an OAuth application for your org on GitHub, and then add a block like the one sketched below to `config.yaml` and run the `helm upgrade` command from above.
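A sketch of that block, following the Z2JH GitHub authenticator settings (the client id/secret, callback URL, and org name are placeholders for your own values):

```yaml
hub:
  config:
    JupyterHub:
      authenticator_class: github
    GitHubOAuthenticator:
      client_id: "<your-oauth-client-id>"
      client_secret: "<your-oauth-client-secret>"
      oauth_callback_url: "https://hub.example.org/hub/oauth_callback"
      allowed_organizations:
        - my-github-org
      scope:
        - read:org
```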
Now, only users who are members of the given GitHub org can authenticate. You can also use the syntax `github-org:github-team` so only members of a particular GitHub Team can authenticate. Again, see the official Z2JH docs for details on the large array of configuration options available, including other identity provider services.

## Working with the GPU
Our first step is to enable GPU support with k3s before we worry about JupyterHub. Unfortunately, as anyone who works with GPUs can tell you, this can get janky. The drivers and CUDA packages required to support NVIDIA GPUs aren't entirely open source, which often leads to a bunch of manual work trying to figure out which versions go where.
CB: I think we should link to the k3s docs here.
Install the latest version of the nvidia drivers that will work for your graphics card. Currently, the version with broadest availability seems to be 535 - this will change with time.
Validate that the GPU is recognized by running `nvidia-smi`.
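A sketch of these two steps, assuming an Ubuntu-based host (System76 machines typically run Pop!_OS, which uses apt; adjust the package name for your distro and driver series):

```bash
# Install the NVIDIA driver (535 series here; pick what matches your card)
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot

# After rebooting, confirm the driver sees the GPU
nvidia-smi
```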
Next, we install the nvidia-device-plugin to allow Kubernetes to selectively expose the GPU. Create a file named `nvidia-device-plugin-config.yaml` to store our configuration for this, and set its contents for now to the following:
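A minimal sketch of that file, treated here as Helm values for the nvidia-device-plugin chart; the `config.map` key for embedding a plugin configuration is an assumption about the chart's values layout (we extend the embedded config for time-slicing later):

```yaml
# nvidia-device-plugin-config.yaml -- Helm values for the device plugin chart
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
```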
Now that we have a config file, install the device plugin using helm. We pass in the config file we created via the `--values` (or `-f`) parameter.
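For example, using NVIDIA's device-plugin chart repository (the release and namespace names are just examples, chosen to match the `kubectl` check below):

```bash
# Register NVIDIA's device-plugin chart repo and install the plugin
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --values nvidia-device-plugin-config.yaml
```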
You can check if this succeeded with `kubectl -n nvidia-device-plugin get pod`.

In the `config.yaml` file you are using for JupyterHub, add the following:
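A sketch using the Z2JH `singleuser.extraResource` settings; `nvidia.com/gpu` is the resource name the device plugin advertises:

```yaml
# In the JupyterHub config.yaml: give every user server one GPU (or GPU slice)
singleuser:
  extraResource:
    limits:
      nvidia.com/gpu: "1"
```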
And run the `helm upgrade` command from earlier again (TODO: link to the command). This should give all users access to the GPU, and you can test that by running `nvidia-smi` in the terminal in JupyterLab!

But there's only one GPU on the machine, and this user is already using it! We want to allow users to select between GPU and non-GPU machines, as well as allow many of them to share GPUs.
NVIDIA has a timeslicing feature that allows one GPU to be shared between multiple users. This is not as seamless as the way CPUs are shared, but it is better than not being able to share the GPU at all.
As we set it up, we will need to predetermine how many 'slices' to create, and then that many total users can use the GPU at the same time. The overall power of the GPU will be shared between them. See the NVIDIA docs for more details.
In this case, let's slice our GPU into 8 slices. Open the `nvidia-device-plugin-config.yaml` file you created in step 3, and add the `sharing:` lines shown below; the very last line determines how many slices of this GPU are made.
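A sketch of the updated file, with the time-slicing block added to the embedded plugin configuration (same assumptions about the chart's values layout as before):

```yaml
# nvidia-device-plugin-config.yaml with time-slicing enabled
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          resources:
            - name: nvidia.com/gpu
              replicas: 8   # how many slices this GPU is divided into
```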
After saving this file, run the `helm upgrade` command from Step 4 to apply this configuration. You can see how many GPU slices are available by running the command sketched below; it should now report 8, meaning up to 8 users can use the GPU at the same time on your JupyterHub!
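One simple way to check is to grep the node's Capacity/Allocatable listing for the GPU resource:

```bash
# The node should now advertise 8 allocatable nvidia.com/gpu resources
kubectl describe node | grep nvidia.com/gpu
```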
CB: Based on my testing, I'm pretty sure this is not necessary when using images derived from the NVIDIA base image (perhaps it just requires some of the env var exports already found there). I seem to be able to launch GPU-enabled pods without explicit GPU resource allocation, and thus without triggering the timeSlicing, on these images (e.g. `rocker/ml`), but not on other base images that have some GPU libraries added (e.g. `pangeo/torch-notebook`).

TODO: Introduce profileList, allow users to choose GPU / no GPU, and also do rocker
With our Kubernetes environment configured for GPU use, we can bring this back to the JupyterHub configuration.
Not sure how much profile list config to show. My current public config is https://github.com/boettiger-lab/k8s/blob/main/jupyterhub/public-config.yaml
A minimum entry I think is merely:
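A sketch of such a minimal `profileList` in `config.yaml` (display names and image tag are illustrative):

```yaml
singleuser:
  profileList:
    - display_name: "CPU only"
      default: true
    - display_name: "GPU (rocker/ml)"
      kubespawner_override:
        image: rocker/ml:latest
```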
With rocker-based ml images, this is enough (and we don't need time-slicing). With other images, we also need:
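That is, an explicit GPU resource request in the profile's `kubespawner_override` (an additional `profileList` entry; image tag again illustrative):

```yaml
    - display_name: "GPU (pangeo/torch-notebook)"
      kubespawner_override:
        image: pangeo/torch-notebook:latest
        extra_resource_limits:
          nvidia.com/gpu: "1"
```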
(showing snippets like this is concise but confusing about where they belong in the config.yaml)