---
tags: binder, hub23
---
# BinderHub Access to Private GitHub Repos
[Configuration and Source Code Reference for BinderHub](https://binderhub.readthedocs.io/en/latest/reference/ref-index.html)
## Goals
Enable access to private repos for users who have authenticated into Hub23 via GitHub without providing the BinderHub config with a read/write Personal Access Token to all private repos.
## The Problem
[As the documentation currently stands](https://binderhub.readthedocs.io/en/latest/setup-binderhub.html#accessing-private-repositories), BinderHub is granted access to private GitHub repositories through a Personal Access Token (PAT), and the git clone is performed "as BinderHub". This is undesirable for two reasons. First, the BinderHub gains access to every private repo that the person who created the PAT can access. Second, the BinderHub does not discriminate between users when launching a Binder instance of a private repo, so users would be able to access private repos through BinderHub that they could not access through the GitHub API. Also, [GitHub's OAuth scopes for private repos](https://developer.github.com/apps/building-oauth-apps/understanding-scopes-for-oauth-apps/#available-scopes) are not very granular - it's full access or nothing!
## A Potential Solution
The GitHub API facilitates [checking if a user is a collaborator on a repo](https://developer.github.com/v3/repos/collaborators/#check-if-a-user-is-a-collaborator) which, in the case of private repos, would mean they have permission to access it. This method is desirable as it does not require full write access to check.
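A minimal sketch of that collaborator check, using only the standard library. The helper names (`collaborator_url`, `interpret_status`, `is_collaborator`) are made up for illustration, and which permissions the token itself needs has not been verified here. The endpoint answers with `204` when the user is a collaborator and `404` when they are not.

```python
import urllib.error
import urllib.request

GITHUB_API = "https://api.github.com"


def collaborator_url(owner, repo, username):
    """Build the 'check if a user is a collaborator' endpoint URL."""
    return f"{GITHUB_API}/repos/{owner}/{repo}/collaborators/{username}"


def interpret_status(status):
    """Map the endpoint's HTTP status onto a boolean answer."""
    if status == 204:
        return True   # user is a collaborator
    if status == 404:
        return False  # user is not a collaborator (or the repo is not visible)
    raise RuntimeError(f"Unexpected status from GitHub: {status}")


def is_collaborator(owner, repo, username, token):
    """Ask GitHub whether `username` is a collaborator on `owner/repo`."""
    request = urllib.request.Request(
        collaborator_url(owner, repo, username),
        headers={"Authorization": f"token {token}"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return interpret_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; 404 is an expected answer here
        return interpret_status(err.code)
```

Something like `is_collaborator("alan-turing-institute", "test-repo-private", "sgibson91", token)` would then return `True` or `False`, given a valid token.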
We may then be able to propagate the token we receive when the user logs in to the BinderHub via GitHub so that the git clone can be performed as the user, rather than "as BinderHub". (This would also be a step towards providing git push access back to the repo!)
Here is the source code for the [JupyterHub OAuthenticator](https://github.com/jupyterhub/oauthenticator). The GitHub-specific code lives in [oauthenticator/github.py](https://github.com/jupyterhub/oauthenticator/blob/master/oauthenticator/github.py).
## Other Conversations
Please see the [discussion in this issue](https://github.com/alan-turing-institute/the-turing-way/issues/291) for links to some work done in this area by GESIS. I believe their BinderHub instance is private, but I don't know if their code is public and how much they contributed back - but they seem to have a good relationship with the Binder team so we may be able to unearth some things.
## JupyterHub OAuth flow
[JupyterHub Authentication Reference](https://jupyterhub.readthedocs.io/en/latest/api/auth.html)
JupyterHub has two levels of authentication:

1. Internally, services and single-user servers use OAuth with the Hub itself as a provider to authenticate requests. This is what's in jupyterhub/jupyterhub. This is always on and not optional.
2. The Hub itself has a pluggable authentication mechanism for authenticating users with an external identity provider (e.g. PAM, GitHub, Google, Active Directory, etc.). This is the Authenticator API, and could be a username+password form or OAuth in the case of `oauthenticator`.

So when a JupyterHub is running with GitHub OAuth and a totally fresh browser (no jupyterhub cookies yet) tries to visit a user server, it goes:
1. notebook server: I don't know who this browser is, begin the OAuth process with the Hub to identify which JupyterHub user this is. <-- this oauth is in jupyterhub
2. during OAuth, the Hub: I don't know who this browser is, begin OAuth with GitHub to identify which GitHub user this is. <-- this oauth is in oauthenticator
3. GitHub: this is @sgibson91
4. JupyterHub, completing OAuth with GitHub: Thanks, GitHub! Welcome, @sgibson91 (or potentially a different username for the Hub)! <-- oauthenticator again
5. notebook server, completing OAuth with the Hub: Thanks, Hub! Welcome, @sgibson91! <-- jupyterhub again
Form-based login replaces steps 2-4 with presenting and submitting a login form instead of oauth, but steps 1 and 5 are always oauth, no matter what external Authenticator system is used.
If you have [auth state enabled](https://zero-to-jupyterhub.readthedocs.io/en/latest/reference.html#auth-state), the GitHub token is stored in the JupyterHub database (as `user.auth_state['access_token']`).
What we'll then need is to pass this into the spawner's environment (`spawner.environment`) in a pre-spawn hook, e.g.
```python
async def add_github_token(spawner):
    # retrieve auth state (None if auth state is not enabled or nothing is stored)
    auth_state = await spawner.user.get_auth_state()
    if not auth_state:
        return
    # pass token to notebook env:
    spawner.environment['GITHUB_TOKEN'] = auth_state.get('access_token', '')

c.Spawner.pre_spawn_hook = add_github_token
```
We'll probably want to also specify some git environment variables with e.g. username, email, etc.
These should be in `auth_state['github_user']` which is the [github user json](https://developer.github.com/v3/users/#response).
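A rough sketch of how the git identity variables could be derived from that user JSON. `git_identity_env` is a hypothetical helper, not part of JupyterHub or oauthenticator. It assumes the `login`, `name` and `email` fields of the GitHub user response, where `name` and `email` may be null.

```python
def git_identity_env(github_user):
    """Build git identity environment variables from a GitHub user JSON dict.

    Falls back to the login when no display name is set, and leaves the
    email variables out entirely when the user has no public email.
    """
    login = github_user.get("login") or ""
    name = github_user.get("name") or login
    email = github_user.get("email")

    env = {}
    if name:
        env["GIT_AUTHOR_NAME"] = name
        env["GIT_COMMITTER_NAME"] = name
    if email:
        env["GIT_AUTHOR_EMAIL"] = email
        env["GIT_COMMITTER_EMAIL"] = email
    return env
```

Inside the pre-spawn hook this could be applied with something like `spawner.environment.update(git_identity_env(auth_state.get('github_user', {})))`, though that wiring is untested.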
By default, BinderHub uses credentials set up at "configuration time" when it clones a repository. We need to arrange for the "per user" credentials to be passed in [here](https://github.com/jupyterhub/binderhub/blob/01b1c59b9e7dc81250c1ed579c492ec2fd6baaf6/binderhub/repoproviders.py#L85-L90) or somehow override that bit.
* Learn where the above class is instantiated: just once at configuration time, or once per user/request?
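To make the "per user" credentials idea concrete, here is a sketch that formats one user's token as git-credential style `key=value` lines. `git_credential_lines` is a hypothetical helper. The token-as-username with `x-oauth-basic` as password convention is GitHub's documented way of using an OAuth token over HTTPS. How BinderHub expects this string to be escaped or handed to the build has not been checked here.

```python
def git_credential_lines(access_token, host="github.com"):
    """Return git-credential style key=value lines for one user's token."""
    if not access_token:
        raise ValueError("no access token available for this user")
    fields = {
        "protocol": "https",
        "host": host,
        # GitHub accepts an OAuth token as the username with this fixed password
        "username": access_token,
        "password": "x-oauth-basic",
    }
    return "".join(f"{key}={value}\n" for key, value in fields.items())
```

The point of the sketch is that the credentials are built per request from the logged-in user's token, rather than once from a token in the BinderHub config.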
## Tasks
### "Baby Steps" Tasks
* Write a "Hello World" GitHub app that asks for basic permissions. This would allow us to interrogate what information/token is returned.
* https://developer.github.com/apps/building-oauth-apps/
* https://requests-oauthlib.readthedocs.io/en/latest/examples/real_world_example.html
* The idea of using a GitHub App instead of an OAuth app is that the GitHub App provides more fine-grained permissions when accessing private info. Things to investigate are:
* Can we achieve read-only access to private repos?
* Can we read a person's organisation membership when it is not publicly visible (without reading the whole organisation membership list)?
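As a starting point for the "Hello World" app, this sketch builds the URL that sends a browser to GitHub's authorization page, which is the first step of the OAuth web flow. The client ID, redirect URI and scopes are placeholders, and `build_authorize_url` is a made-up helper name. As far as I understand, a GitHub App would ignore the `scope` parameter because its permissions are set on the app itself.

```python
import secrets
from urllib.parse import urlencode

AUTHORIZE_URL = "https://github.com/login/oauth/authorize"


def build_authorize_url(client_id, redirect_uri, scopes, state=None):
    """Return (url, state) for step one of GitHub's OAuth web flow.

    `state` is a random value that should be checked again when GitHub
    redirects back, to guard against cross-site request forgery.
    """
    state = state or secrets.token_urlsafe(16)
    query = urlencode(
        {
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            "scope": " ".join(scopes),
            "state": state,
        }
    )
    return f"{AUTHORIZE_URL}?{query}", state
```

Visiting the returned URL while logged in to GitHub should show the consent screen. The code GitHub sends back to the redirect URI can then be exchanged for a token, which is the part we want to interrogate.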
### Ultimate Goals
I have cloned the BinderHub source code into the Turing's GitHub: [alan-turing-institute/binderhub](https://github.com/alan-turing-institute/binderhub). We should work on branches of that repo.
~~Implement a call to the GitHub API to verify that a Binder user has collaborator status on the repo they are trying to build. This call should be implemented somewhere around the [GitHub API request](https://github.com/alan-turing-institute/binderhub/blob/01b1c59b9e7dc81250c1ed579c492ec2fd6baaf6/binderhub/repoproviders.py#L403) in the BinderHub source code. It should be nested under a `self.auth` conditional statement and should raise an error if the user is **not** a collaborator on the repo.~~ If we clone with the user's own GitHub token, the clone will carry the correct permissions by default, so this check is not necessary.
1) Propagate the token we receive at login so that BinderHub clones as the user, instead of itself.
## Testing the `alan-turing-institute` fork of BinderHub
I have a test BinderHub with GitHub OAuth running at https://testbinder.hub23.turing.ac.uk
## Permissions testing
I don't think this will be necessary if we have the user's credentials.
> There are different levels of permissions we may want to test in this scenario:
>
> * organisation members that automatically have read/write access to _all_ repos in the org
> * team members that have read/write access to some but not others
> * external collaborators added to a specific repo
>
> Repositories to test permissions with (loaded with the requirements.txt example):
>
> * [alan-turing-institute/test-repo-private](https://github.com/alan-turing-institute/test-repo-private)
> * [alan-turing-institute/test-repo-public](https://github.com/alan-turing-institute/test-repo-public)
## Things We've Learned
* The PAT mentioned in [The Problem](https://hackmd.io/X_Hkb4YkRmiLp74IpZKkMA?view#the-problem) would be one that Sarah (as Hub23 admin) generates with full repo scope access and passes into the k8s config that deploys the BinderHub.
* [name=Sarah] - I think I have solved the dummy authentication for testing issue. Need to be really strict about properly tearing down and redeploying hubs - and clearing your browsing data/cookies helps too.
* What's a JupyterHub? https://jupyterhub.readthedocs.io/en/stable/
* Spawners are what provide the instance the user wants to access. Different spawners for different scenarios: e.g. DockerSpawner, KubeSpawner... Code for Spawners is still within JupyterHub.
## Unanswered Questions
* [name=Sarah] - Where does the code [name=minrk] suggested in [JupyterHub OAuth flow](#JupyterHub-OAuth-flow) go? Tried:
```yaml
hub:
  extraConfig:
    extraAuth: |
      async def add_github_token(spawner):
          # retrieve auth state
          auth_state = await spawner.user.get_auth_state()
          # pass token to notebook env:
          spawner.environment['GITHUB_TOKEN'] = auth_state.get('access_token', '')
      c.Spawner.pre_spawn_hook = add_github_token
```
But this caused Internal Server Errors for the Hub.
* Think the `extraAuth` line needs to be removed - needs testing
* Link to Auth with Microsoft AD?
* Once we've built a private repo, will **anyone** be able to pull it from the repository?
* What order does the OAuth happen in when checking the commit hash to locate the image?
* Related issue: https://github.com/alan-turing-institute/hub23-deploy/issues/19
## Probot
* Migrating OAuth Apps to GitHub Apps: https://developer.github.com/apps/migrating-oauth-apps-to-github-apps/
* A scaffold/framework for deploying GitHub Apps: https://probot.github.io
* GitHub-Flask app: https://github-flask.readthedocs.io/en/latest/
Louise and I used the "Hello World" example from Probot in the glitch environment to try and see what it could do.
It seems that GitHub Apps listen for events on the repository, which isn't necessarily what we want. We just want an authentication endpoint.
I may try writing a Python script that uses the requests lib to send these requests to the API independently and see what I can achieve. Then the GitHub Apps step becomes a separate issue. [But this will still require an OAuth app! :scream_cat:]