# wheelhouse-concept-for-colab
A clear, **step-by-step** guide to implementing the "Repo on Drive + pip cache on Drive + shared wheelhouse (GCS) + two-pass install" pattern, with the **why** after each step.
---
# Goal (1 sentence)
Make Colab sessions **predictably fast** and **reproducible** by caching installs and serving prebuilt wheels from a fast CDN, while keeping the actual Python environment ephemeral (safer across Colab updates).
---
# Part A — Instructor one-time setup
## 1) Create a small “env” repo
**Do**
* Make a GitHub repo (public or private) with:
```
requirements.txt # top-level, pinned (==)
constraints.txt # sub-dep pins; keeps transitive deps stable
notebooks/ # (optional) your course notebooks
```
* Keep pins tight. Example:
```
# requirements.txt
openai==1.51.0
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.5.1
matplotlib==3.9.0
tiktoken==0.7.0
transformers==4.44.2
```
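One reliable recipe for `constraints.txt` is to install the pinned tops into a clean Linux environment and capture the full resolution with `pip freeze`. The entries below are illustrative placeholders only, not real pins to copy:

```
# constraints.txt — generated with `pip freeze` in a known-good env,
# then trimmed to the transitive deps that matter (illustrative):
requests==2.32.3
urllib3==2.2.2
typing_extensions==4.12.2
```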
**Why**
* Colab changes under you. Pinned tops + constraints prevent surprise upgrades mid-semester.
---
## 2) Build a **wheelhouse** once (on Linux, matching Colab’s Python)
**Do**
```bash
python -m pip install --upgrade pip wheel setuptools
mkdir -p wheels
pip wheel -r requirements.txt -c constraints.txt -w wheels
```
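Wheels only install where the CPython version and platform match, so compare the build machine against Colab before uploading. A rough sanity check (the helper name and tag format are mine, deliberately simpler than pip's real wheel-tag logic):

```python
import platform
import sys

def env_tag():
    """Rough compatibility fingerprint: CPython minor version + OS + arch.

    Run this on the build machine and again in a Colab cell; if the two
    strings differ, the wheels you build may not install on Colab.
    """
    return "cp{}{}-{}-{}".format(
        sys.version_info.major,
        sys.version_info.minor,
        platform.system().lower(),
        platform.machine(),
    )

print(env_tag())  # e.g. "cp311-linux-x86_64" on a Linux build box
```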
**Special notes**
* Don’t compile big frameworks (Torch). Download the **official wheels** that match Colab’s CUDA/Python and drop them in `wheels/`.
**Why**
* Installing from local wheels avoids slow builds and PyPI latency. It’s the main speed win.
---
## 3) Host the wheelhouse on **Google Cloud Storage (GCS)**
**Do (one time)**
```bash
# Create a bucket (pick a unique name and region)
gsutil mb -l US-CENTRAL1 gs://your-course-bucket
# Upload the wheels
gsutil -m rsync -r wheels gs://your-course-bucket/wheels
# Make them public-read (objects only)
gsutil iam ch allUsers:objectViewer gs://your-course-bucket
```
Now your wheelhouse lives at:
```
https://storage.googleapis.com/your-course-bucket/wheels/
```
**Why**
* GCS is fast, CDN-backed, and handles many simultaneous downloads. Drive often throttles.
---
## 4) Put the **bootstrap code** in your notebooks (Cell 1)
Use the snippet below. It:
* sets a **persistent pip cache** on Drive,
* fetches your latest `requirements.txt` / `constraints.txt` from GitHub,
* does a **two-pass install** (wheelhouse first, then PyPI fallback, both respecting constraints).
```python
import os, subprocess

# Drive must already be mounted (see the mount + clone cell).
assert os.path.isdir("/content/drive/MyDrive"), "Mount Drive first."
os.environ["PIP_CACHE_DIR"] = "/content/drive/MyDrive/colab_cache/pip-cache"

WHEELHOUSE_URL = "https://storage.googleapis.com/your-course-bucket/wheels/"
REQ_URL = "https://raw.githubusercontent.com/your-org/your-course-repo/main/requirements.txt"
CON_URL = "https://raw.githubusercontent.com/your-org/your-course-repo/main/constraints.txt"

# Always fetch the latest pins from GitHub.
!curl -sSL {REQ_URL} -o /tmp/req.txt
!curl -sSL {CON_URL} -o /tmp/con.txt

subprocess.check_call(["python", "-m", "pip", "install", "--upgrade", "pip", "wheel", "setuptools"])
base = ["python", "-m", "pip", "install", "-c", "/tmp/con.txt", "--prefer-binary", "--no-build-isolation"]
# Pass 1: wheelhouse first (fast, CDN-backed, no source builds).
subprocess.check_call(base + ["--find-links", WHEELHOUSE_URL, "-r", "/tmp/req.txt"])
# Pass 2: PyPI fallback for anything the wheelhouse lacks.
subprocess.check_call(base + ["-r", "/tmp/req.txt"])
print("✅ Ready.")
```
**Why**
* **Cache on Drive**: cuts repeated downloads.
* **Wheelhouse first**: fetches at CDN speed, avoids builds.
* **Fallback pass**: fills any gaps from PyPI, still locked by `constraints.txt`.
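The two-pass logic is easier to reuse (and to unit-test) if the commands are built by a small pure function. This is a sketch of the same behavior as Cell 1; the function name is my own:

```python
def build_install_cmds(wheelhouse_url, req_path, con_path):
    """Build the two pip invocations for the two-pass install.

    Pass 1 prefers the wheelhouse via --find-links; pass 2 falls back to
    PyPI. Both share the same constraints file, so they resolve to the
    same versions either way.
    """
    base = ["python", "-m", "pip", "install",
            "-c", con_path, "--prefer-binary"]
    pass1 = base + ["--find-links", wheelhouse_url, "-r", req_path]
    pass2 = base + ["-r", req_path]
    return pass1, pass2
```

Running `subprocess.check_call` on each returned command in order reproduces the bootstrap's behavior.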
---
## 5) Give students a tiny **“mount + clone (to Drive)”** pre-cell
```python
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive
!git clone https://github.com/your-org/your-course-repo.git 2>/dev/null || git -C your-course-repo pull --ff-only
%cd your-course-repo
```
**Why**
* Cloning into **Drive** makes your code persist across sessions. (Speed comes from wheelhouse+cache, not the clone.)
---
## 6) Version and freeze
* Tag releases: `f25-v1`, `f25-v2` when you update pins.
* Keep the GCS wheelhouse in versioned folders if you need parallel cohorts (e.g., `wheels/f25-v1/`).
**Why**
* Reproducibility for grading and late submissions; clean rollback if something breaks.
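With versioned folders, only one string changes between cohorts. A trivial helper (name is mine) that encodes the URL scheme used above:

```python
def wheelhouse_url(bucket, tag):
    """Public HTTPS URL for a versioned wheelhouse folder on GCS."""
    return f"https://storage.googleapis.com/{bucket}/wheels/{tag}/"

print(wheelhouse_url("your-course-bucket", "f25-v1"))
# https://storage.googleapis.com/your-course-bucket/wheels/f25-v1/
```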
---
# Part B — Student flow (every session)
1. Open the course notebook in Colab.
2. Run the **mount + clone** cell once per machine (persists in Drive).
3. Run **Cell 1 (bootstrap)**.
* First pass: installs from GCS wheelhouse (fast).
* Second pass: fills gaps from PyPI, using the persistent pip cache.
4. Start the lab.
**Reset if broken**: `Runtime → Factory reset runtime` then re-run Cell 1. (Stateless by design.)
---
# Why these pieces matter (short rationale)
* **Repo on Drive** → persists your code & notebooks (not for speed, for convenience).
* **`PIP_CACHE_DIR` on Drive** → big speedup on repeated sessions; avoids re-downloading.
* **GCS wheelhouse** → the biggest speed win; parallel, CDN-backed downloads; no builds.
* **Two-pass install** → deterministic (constraints) and resilient (PyPI fallback).
* **No shared site-packages** → avoids ABI/CUDA breakage when Colab updates Python/GPU.
---
# Optional: persist a few **pure-Python** libs
If you truly want some tiny pure-Python packages to persist across sessions:
```python
TARGET = "/content/drive/MyDrive/colab_cache/site-packages"
%pip install some_pure_python_lib==1.2.3 -t $TARGET
import sys; sys.path.insert(0, TARGET)
```
**Use sparingly.** Don't do this for NumPy, Torch, or anything else with compiled extensions; those binary wheels will break when Colab updates.
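Since the cell above re-runs every session, making the `sys.path` tweak idempotent avoids stacking duplicate entries. A small sketch (the helper name is mine):

```python
import sys

def ensure_on_path(target, paths=None):
    """Prepend `target` to a path list (sys.path by default) exactly once."""
    if paths is None:
        paths = sys.path
    if target not in paths:
        paths.insert(0, target)
    return paths

# ensure_on_path("/content/drive/MyDrive/colab_cache/site-packages")
```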
---
# Ongoing maintenance (simple)
* If something breaks upstream, **only edit `constraints.txt`** to pin a safe sub-dep. Students automatically pick it up next session.
* If Colab bumps Python/CUDA, **rebuild the wheelhouse** once and re-sync to GCS.
* Keep a quick **smoke test** cell (imports + `__version__`) to fail fast.
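The smoke-test cell can be as small as this; the function and the example module list are mine, so swap in whatever your course actually imports:

```python
import importlib

def smoke_test(modules):
    """Import each named module and print its version; raise if any fail."""
    failures = []
    for name in modules:
        try:
            mod = importlib.import_module(name)
            print(name, getattr(mod, "__version__", "(no __version__)"))
        except Exception as exc:
            failures.append((name, exc))
    if failures:
        raise RuntimeError(f"Broken imports: {failures}")

# e.g. smoke_test(["numpy", "pandas", "sklearn", "transformers"])
```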
---
# Quick troubleshooting
* **“Failed building wheel …”** → pip is trying to compile a source distribution; make sure a binary wheel exists for that version (the bootstrap already passes `--prefer-binary`), or pin a version that ships wheels.
* **Torch/CUDA mismatch** → use the exact `torch`/`cu*` wheels that match Colab that week.
* **Drive feels slow** → wheels already come from GCS for exactly this reason. If `PIP_CACHE_DIR` grows huge, occasionally clear old cache entries.
---
# What students should remember (TL;DR)
* Always run **Cell 1** (the bootstrap) first.
* If errors: **Factory reset runtime** and re-run bootstrap.
* Installs should take **~30–90 s** for light stacks, **2–4 min** for data-science stacks, and **4–8 min** for Torch/CUDA (from the wheelhouse).
---
# Why this approach (philosophy)
* Keep the **environment ephemeral** (safer vs. Colab changes), but move **heavy I/O** (wheels + pip cache) to **persistent, fast storage**.
* Centralize control in your repo + constraints; students simply re-run one cell and always land on the blessed environment.