---
tags: blog, jupyterhub
---
# TLJH on DataProc
Notes on https://the-littlest-jupyterhub.readthedocs.io/en/latest/install/google.html
Step 10
- would appreciate some kind of recommendation for the, say, 1-10 user case, and an explanation of how/why resource use would increase - ..... - never mind, it's at the end of the section :D - could maybe be moved up a little further
Step 15
- we need to grant the VM rights to the additional resources we want it to have access to (see the sketch after this list):
  - GCS - for mounting storage
  - kubernetes for dask?
  - dataproc for dask and spark?
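A rough gcloud equivalent of the console steps; untested, and the instance name, zone, machine type, and image are placeholders:
```bash
# Untested sketch - adjust name/zone/machine type/image to taste.
# --scopes grants the access rights mentioned above; the broad
# cloud-platform scope covers GCS, Dataproc, and GKE in one go.
gcloud compute instances create tljh-server \
  --zone us-central1-a \
  --machine-type n2-standard-2 \
  --image-family ubuntu-2204-lts \
  --image-project ubuntu-os-cloud \
  --tags http-server,https-server \
  --scopes cloud-platform
```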
Notes
- jlab - get to hub via left menu
- gcsfuse:
  - install as per instructions https://cloud.google.com/storage/docs/gcs-fuse
  - make a directory that the user has access to
  - `gcsfuse bucket directory`
  - do that via the jupyterhub config:
    - `sudo su`
    - `cd /opt/tljh/config/jupyterhub_config.d`
    - make `mount.py`:
```python
c.SystemdSpawner.unit_extra_properties = {
    # runs before the user's server starts: make the mount point,
    # then mount the bucket there (absolute path, so cwd doesn't matter)
    'ExecStartPre': '/bin/bash -c "mkdir -p $HOME/analysis && gcsfuse birdsarah-cluster-storage $HOME/analysis"'
}
```
- make a README.md and put it in /etc/skel if you want users to see it
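For example (standard /etc/skel behavior - new accounts get a copy in their home directory):
```bash
# files in /etc/skel are copied into each newly created user's home
sudo cp README.md /etc/skel/README.md
```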
RecommendedCPU = (MaxConcurrentUsers × MaxCPUUsagePerUser) + 20%

With MaxConcurrentUsers = 3 and MaxCPUUsagePerUser = 1:
(3 × 1) × 1.2 = 3.6
---> end up with n2-standard-2: 2 vCPU, 8 GB
https://the-littlest-jupyterhub.readthedocs.io/en/latest/howto/admin/resource-estimation.html#howto-admin-resource-estimation
- let's make a little JS calculator
- add a minimum number to Disk Sizing
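Until there's a calculator, the same arithmetic as a quick shell stand-in (my numbers from above):
```bash
# stand-in for the eventual calculator; plug in your own numbers
awk -v users=3 -v cpu_per_user=1 \
  'BEGIN { print "RecommendedCPU =", users * cpu_per_user * 1.2 }'
```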
## We weren't going crazy: the jlab menu is not there
## On DataProc
- make a version of the install script that's a shell script, put it in a google bucket, and add it as an initialization script (sketch below)
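Something like this, untested; cluster name, region, and bucket path are placeholders:
```bash
# Untested sketch: point --initialization-actions at the shell script
# in the bucket; it runs on each node as the cluster comes up
gcloud dataproc clusters create tljh-dataproc \
  --region us-central1 \
  --initialization-actions gs://my-bucket/install-tljh.sh
```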
- update firewall settings to allow HTTP/HTTPS traffic (not sure if you can do this on create)
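After the fact it can be done with a firewall rule; untested sketch, rule name and network are placeholders:
```bash
# Untested sketch: open 80/443 on the default network
gcloud compute firewall-rules create allow-jupyterhub-web \
  --network default \
  --allow tcp:80,tcp:443 \
  --source-ranges 0.0.0.0/0
```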
- install gcsfuse
- manually mount the bucket for everyone:
```bash
gcsfuse -o allow_other -file-mode=777 -dir-mode=777 birdsarah-cluster-storage /mnt/bucket-birdsarah-cluster-storage/
```
- permissions errors to do with .config (`sudo chown bird:bird .config`) - maybe from running jupyter lab stuff with sudo
- as the server user (not the jhub user):
  - `conda env list` (make sure you're working with /opt/conda/miniconda3)
  - `sudo /opt/conda/miniconda3/bin/conda update -n base conda`
  - `sudo /opt/conda/miniconda3/bin/conda install jupyterhub jupyterlab -c conda-forge`
  - `sudo /opt/conda/miniconda3/bin/conda install <whatever other packages you want>`
- `/opt/conda/conda-meta/pinned` has conda pinned!
- sudo make an environment and then add the kernel spec:
  - `sudo /opt/conda/miniconda3/envs/anaconda/bin/python -m ipykernel install --prefix=/opt/conda/default --name 'anaconda'`
- there may be something weird with fuse and old tarballs from conda pack being used instead of the one we just made with the same name
- mount.py:
```python
c.SystemdSpawner.unit_extra_properties = {
    'ExecStartPre': [
        # make a per-user analysis dir ...
        '/bin/bash -c "cd $HOME && mkdir -p analysis"',
        # ... and symlink the shared allow_other mount's contents into it
        '/bin/bash -c "cd $HOME && ln -sf /mnt/bucket-birdsarah-cluster-storage/* analysis/"'
    ]
}
```
- dataproc_conda_path.py:
```python
c.SystemdSpawner.extra_paths = ["/opt/conda/default/bin"]
```
- reload the hub (`sudo tljh-config reload`)
- sudo conda create
- trouble with conda environments
  - there are two because dataproc set one up
- had a hard time with per-user fuse remounting and detecting it: lots of "Transport endpoint is not connected" errors when trying to access with python
- went with the allow_other mount and then symlinks instead
### problems
- my master node keeps dying as I'm trying to do dask work, I'm not sure why
- add a script that does the gcsfuse mount to a startup script so restarting the VM recovers without intervention - maybe something like the sketch below
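Untested sketch: on GCE, `startup-script` metadata runs on every boot; the instance name is a placeholder:
```bash
# Untested: re-create the allow_other mount on every boot
gcloud compute instances add-metadata my-master-node \
  --metadata startup-script='#! /bin/bash
mkdir -p /mnt/bucket-birdsarah-cluster-storage
gcsfuse -o allow_other -file-mode=777 -dir-mode=777 birdsarah-cluster-storage /mnt/bucket-birdsarah-cluster-storage/'
```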
- managing conda environments is annoying with sudo stuff, think about this more
- scaling up feels very slow - look at possible settings - https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
- with so many environments I forgot where I need to install the jupyterlab plugins I'm interested in
### to do
- run a spark job (may need auto-scaling of non-pre-emptible workers)
- jupyter-server-proxy so I can see dask status
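Probably something like this, untested - assuming user servers run out of the /opt/conda/default env from above:
```bash
# Untested: install into the env that serves the notebooks, then reload the hub
sudo /opt/conda/default/bin/pip install jupyter-server-proxy
```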
### old shit
```python
script = """
import os
import subprocess
import getpass

HOME = os.environ['HOME']
print('My home is', HOME, flush=True)
print(getpass.getuser(), flush=True)

DIR = os.path.join(HOME, 'analysis')
CHECK_FILE = os.path.join(DIR, '.mounted')

# Note that os.path.ismount does not seem to work for gcsfuse
if os.path.exists(CHECK_FILE):
    print('All mounted and good to go.', flush=True)
else:
    os.makedirs(DIR, exist_ok=True)
    subprocess.check_call([
        'gcsfuse', 'birdsarah-cluster-storage', DIR
    ])
"""

c.SystemdSpawner.unit_extra_properties = {
    'ExecStartPre': f'/usr/bin/python3 -c "{script}"'
}
```