TLJH on DataProc

---
tags: blog, jupyterhub
---

# TLJH on DataProc

https://the-littlest-jupyterhub.readthedocs.io/en/latest/install/google.html notes

Step 10
- would appreciate some kind of recommendation for the, say, 1-10 user case and explaining how / why resource use would increase
- .....
- never mind it's at the end of the section :D
- could be maybe up a little further

Step 15
- we need to give rights to the additional resources we want the VM to have access to.
    - GCS - for mounting storage
    - kubernetes for dask?
    - dataproc for dask and spark?

Notes
- jlab - get to hub via left menu
- gcsfuse:
    - install as per instructions https://cloud.google.com/storage/docs/gcs-fuse
    - make a directory that user has access to
    - gcsfuse bucket directory
    - do that via jupyterhub config:
        - sudo su
        - cd /opt/tljh/config/jupyterhub_config.d
        - make mount.py

```
c.SystemdSpawner.unit_extra_properties = {
    'ExecStartPre': '/bin/bash -c "mkdir -p $HOME/analysis && gcsfuse birdsarah-cluster-storage analysis"'
}
```

- make a README.md and put it in /etc/skel if you want users to see it

RecommendedCPU = (MaxConcurrentUsers × MaxCPUUsagePerUser) + 20%

MaxConcurrentUsers = 3
MaxCPUUsagePerUser = 1
3 × 1 = 3, plus 20% (3 × 0.2 = 0.6) = 3.6

---> end up with n2-standard-2 2vCPU, 8GB

https://the-littlest-jupyterhub.readthedocs.io/en/latest/howto/admin/resource-estimation.html#howto-admin-resource-estimation
- let's make a little js calculator
- add minimum number to Disk Sizing

## We weren't going crazy

jlab menu is not there

## On DataProc

- make version of script that's shell script and put in a google bucket, add this as initialization script
- update firewall settings to allow http / https traffic (not sure if you can do on create)
- install gcsfuse
- manually mount bucket for everyone
    - gcsfuse -o allow_other -file-mode=777 -dir-mode=777 birdsarah-cluster-storage /mnt/bucket-birdsarah-cluster-storage/
- permissions errors to do with .config ( sudo chown bird:bird .config )
- maybe sudo jupyter lab stuff
    - as server user (not jhub user)
    - conda env list (make sure you're working with /opt/conda/miniconda3)
    - sudo /opt/conda/miniconda3/bin/conda update -n base conda
    - sudo /opt/conda/miniconda3/bin/conda install jupyterhub jupyterlab -c conda-forge
    - sudo /opt/conda/miniconda3/bin/conda install <whatever other packages you want>
    - /opt/conda/conda-meta/pinned has pinned conda!
    - sudo make an environment and then add the kernel spec
        - sudo /opt/conda/miniconda3/envs/anaconda/bin/python -m ipykernel install --prefix=/opt/conda/default --name 'anaconda'
    - there may be something weird with fuse and old tarballs from conda pack being used instead of the one we just made with the same name
- mount.py

```python
c.SystemdSpawner.unit_extra_properties = {
    'ExecStartPre': [
        '/bin/bash -c "cd $HOME && mkdir -p analysis"',
        '/bin/bash -c "cd $HOME && ln -sf /mnt/bucket-birdsarah-cluster-storage/* analysis/"'
    ]
}
```

- dataproc_conda_path.py

```python
c.SystemdSpawner.extra_paths = ["/opt/conda/default/bin", ]
```

- reload hub
- sudo conda create
- trouble with conda environments
    - there's two because dataproc set one up

had a hard time with fuse per user remounting and detecting, lots of "Transport endpoint is not connected" when trying to access with python

allow_other mount and then symlink

### problems

- my master node keeps dying as i'm trying to do dask work, i'm not sure why.
- add a script that does the gcsfuse mount to a startup script so restarting the vm recovers without intervention
- managing conda environments is annoying with sudo stuff, think about this more
- scaling up feels very slow
    - look at possible settings
    - https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
- with so many environments i forgot where i need to install jupyterlab plugins i'm interested in

### to do

- run a spark job (may need auto-scaling of non-pre-emptible workers)
- jupyter-server-proxy so i can see dask status

old shit

```
script = """
import os
import subprocess
import getpass

HOME = os.environ['HOME']
print('My home is', HOME, flush=True)
print(getpass.getuser(), flush=True)
DIR = os.path.join(HOME, 'analysis')
CHECK_FILE = os.path.join(DIR, '.mounted')

# Note that os.path.ismount does not seem to work for gcsfuse
if os.path.exists(CHECK_FILE):
    print('All mounted and good to go.', flush=True)
else:
    os.makedirs(DIR, exist_ok=True)
    subprocess.check_call([
        'gcsfuse',
        'birdsarah-cluster-storage',
        DIR
    ])
"""

c.SystemdSpawner.unit_extra_properties = {
    'ExecStartPre': f'/usr/bin/python3 -c "{script}"'
}
```
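
A possible starting point for the "gcsfuse mount in a startup script" item from the problems list, as a rough sketch rather than something that has been run on the cluster. It reuses the bucket name, mount point and gcsfuse flags from the manual mount above. The function names (`is_mounted`, `gcsfuse_command`, `ensure_mounted`) are made up for illustration, and it assumes the gcsfuse mount shows up in `/proc/mounts` like other FUSE mounts, which would sidestep both `os.path.ismount` and the `.mounted` check file.

```python
import os
import subprocess

# Values taken from the manual mount command in these notes.
BUCKET = "birdsarah-cluster-storage"
MOUNT_POINT = "/mnt/bucket-birdsarah-cluster-storage"


def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is listed as a mount target in /proc/mounts text.

    Assumption: the gcsfuse mount appears in /proc/mounts. Paths containing
    spaces are escaped there (\\040) and are not handled by this sketch.
    """
    target = os.path.normpath(mount_point)
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and os.path.normpath(fields[1]) == target:
            return True
    return False


def gcsfuse_command(bucket, mount_point):
    """Build the same gcsfuse invocation as the manual mount in the notes."""
    return [
        "gcsfuse",
        "-o", "allow_other",
        "-file-mode=777",
        "-dir-mode=777",
        bucket,
        mount_point,
    ]


def ensure_mounted(bucket=BUCKET, mount_point=MOUNT_POINT, mounts_file="/proc/mounts"):
    """Mount the bucket unless it is already mounted.

    Returns True if a mount was performed, False if it was already there.
    """
    with open(mounts_file) as f:
        if is_mounted(mount_point, f.read()):
            return False
    os.makedirs(mount_point, exist_ok=True)
    subprocess.check_call(gcsfuse_command(bucket, mount_point))
    return True
```

Calling `ensure_mounted()` as root from whatever runs at boot (for example the VM startup script) should make a restart recover the shared mount, after which the symlink approach in mount.py keeps working per user.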
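
The CPU estimate from the resource estimation notes further up can also be written as a tiny calculator. This is a Python stand-in for the js calculator idea, a minimal sketch: the function names and the rounding up to whole vCPUs are assumptions, only the formula itself (users × CPU per user, plus 20%) comes from the notes.

```python
import math


def recommended_cpu(max_concurrent_users, max_cpu_per_user, overhead=0.20):
    """(max concurrent users x max CPU usage per user) plus overhead (20% by default)."""
    return max_concurrent_users * max_cpu_per_user * (1 + overhead)


def recommended_vcpus(max_concurrent_users, max_cpu_per_user, overhead=0.20):
    """Round the estimate up to whole vCPUs (an assumption, not part of the formula)."""
    estimate = recommended_cpu(max_concurrent_users, max_cpu_per_user, overhead)
    # Round first so floating point noise does not push the result up by one.
    return math.ceil(round(estimate, 6))


# The case from the notes: 3 concurrent users, 1 CPU each.
estimate = recommended_cpu(3, 1)
whole_vcpus = recommended_vcpus(3, 1)
```

For the 3 user case this gives an estimate of about 3.6, which rounds up to 4 whole vCPUs.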