# Tutorials

## Set up your machine with the right environment and access

- Full local setup to use everything
  - Install tools
    - sops
    - kubectl
    - gcloud
    - aws
    - az
    - terraform
    - python3
    - docker
    - repo2docker
  - Local Python environment setup
    - How do you set up to run deploy.py?
  - Authenticate
    - sops auth

# Topic guides

## Logging

What are logs?

Where do they come from?

- JupyterHub
- Notebook server
- Proxy
- nginx-ingress
- Kubernetes control plane
- GitHub Actions logs (image builds and deploys)
- Autoscaler
- Console of the cloud provider
- dask-gateway controller, API server, schedulers, workers

### Links to how-tos about accessing / using logs

*(these might be how-tos)*

- Logs of things that are currently running
  - `kubectl`
- Logs of things that aren't currently running
  - cloud-specific tooling
- Looking at *events* (`kubectl describe`)

## Configuration

### What kinds of configuration do we use, and where is it located?

Config cascade and hierarchy:

- upstream software (JupyterHub, authenticator, KubeSpawner)
- z2jh
- our Helm templates (basehub, daskhub)
- deployer customizations (which we are reducing)
- per-hub overrides (viewable in cluster.yaml)
- configurator
- admin panel
- notebook-specific user config in user home directories

Starting at the most specific level and looking upwards (not sure how to)

Common locations where config lives:

- where admin users are configured
- where resource usage is configured
- other common settings and where they are

## Support charts

What gets deployed on each cluster?

- prometheus (prometheus-server, kube-state-metrics, node-exporter), grafana, nginx-ingress
- (autoscaler on AWS)
- List their functions, with upstream info on what they are

## Dask and `dask-gateway`

- What is Dask? (link outs)
- Parts of Dask (distributed, dask-gateway, (alt: dask-kubernetes), dask-labextension)
- What our deployment of Dask looks like (dask-gateway)
- Our node labeling conventions for Dask
- The different parts of dask-gateway
- Note that you don't have to be a Dask user to help debug Dask, the same way you don't have to be a NumPy user to help debug Jupyter!

## Cluster design

Node pools we have and why we have them:

- core nodes
- user nodes (notebook pods; dask schedulers also land here)
- dask worker nodes (ephemeral)

Considerations for *sizing* node pools on CPU & RAM

Considerations for the type of disk on node pools:

- cost vs. speed of image pulls

Differences between cloud providers: autoscaling

- GKE
- EKS + autoscaler
- Azure

Network policy and why we have it

*Where* to put the cluster (location)

Making the cluster master highly available (when, why, cost)

## Image building and management process

- What's required inside the image?
  - repo2docker docs
- ???

## Home directory storage

- Cloud provider specific
  - EFS on AWS <3
  - Filestore on GCP
  - Azure Files on Azure
  - custom NFS VM in some places, and why
- How Unix permissions work
  - They don't, really! uid 1000 and gid 1000 on everything
  - How is access restricted to just the user's own home directory?
- What are the shared/ and shared-readwrite/ directories?
- What is nfs-share-creator and why do we have it?

# How-tos

## Shared home directory

- Enable the shared-readwrite/ & shared/ directories

## Image building

- Set up an image to be built on quay.io from a new GitHub repo

## Autoscaling

- Manually scale a node pool up (and down) on each cloud provider

## Support charts

- How to deploy them
- How to decommission them
- How to look at their logs (link to our how-to logs section)

## Debugging

### How to access logs

- Things that are currently running
  - `kubectl logs`
  - Table of options to try, with quick pointers to different things
- Things that may not be running anymore
  - Cloud provider UIs
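The `kubectl logs` how-to for currently running things might start from a sketch like this. The namespace (`staging`) and the z2jh-style label (`component=singleuser-server`) are assumptions; find the real ones with `kubectl get pods --all-namespaces --show-labels`.

```shell
# Follow the hub's logs, addressed via its deployment so you don't
# need the generated pod name (namespace is an assumption)
kubectl logs --namespace staging deploy/hub --follow

# Logs from pods selected by label, e.g. all user notebook servers
kubectl logs --namespace staging -l component=singleuser-server --tail=100

# Logs of the *previous* container instance -- useful when a pod is
# crash-looping and the current container hasn't logged anything yet
kubectl logs --namespace staging deploy/hub --previous
```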
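Looking at *events* with `kubectl describe`, as mentioned in the logging topic guide, could be sketched like this; the namespace and pod name are placeholders:

```shell
# Events for one pod: scheduling failures, image pull errors, OOM
# kills, and volume mount problems all surface here
kubectl describe pod --namespace staging jupyter-example-user

# Events for the whole namespace, oldest first
kubectl get events --namespace staging --sort-by=.metadata.creationTimestamp
```

Events are the first place to look when a pod is stuck `Pending` and therefore has no logs at all.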
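The "manually scale a node pool on each cloud provider" how-to could build on commands like these. Cluster, resource group, zone, and pool names are illustrative, not real deployments:

```shell
# GKE: resize a node pool to an explicit node count
gcloud container clusters resize my-cluster \
  --node-pool user-pool --num-nodes 3 --zone us-central1-b

# EKS: scale a managed node group with eksctl
eksctl scale nodegroup --cluster my-cluster --name user-pool --nodes 3

# AKS: scale a node pool (AKS pool names are lowercase alphanumeric)
az aks nodepool scale --resource-group my-rg \
  --cluster-name my-cluster --name userpool --node-count 3
```

Remember that the cluster autoscaler may fight a manual scale-down: it only removes nodes it considers underutilized, and it will re-add nodes if pending pods need them.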
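For the configuration cascade, one concrete way to see what a hub actually ended up with is to ask Helm for the release's values. The release name `hub` and namespace `staging` are assumptions; `helm list --all-namespaces` shows the real ones.

```shell
# Values explicitly supplied to this release -- the merged result of
# our templates and per-hub overrides, but not chart defaults
helm get values hub --namespace staging

# Same, but with computed chart defaults (e.g. z2jh's) included
helm get values hub --namespace staging --all
```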
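The "Authenticate" step of the local-setup tutorial can be smoke-tested with each CLI's "who am I" command; the sops file path below is purely illustrative, not a real file in this repo:

```shell
# Confirm each cloud CLI is logged in as the account you expect
gcloud auth list
aws sts get-caller-identity
az account show

# Confirm sops can decrypt (requires access to the right KMS key)
sops --decrypt secrets/staging.yaml
```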