# wheelhouse-concept-for-colab

Here's a clear, **step-by-step** guide to implementing the "Repo on Drive + pip cache on Drive + shared wheelhouse (GCS) + two-pass install" pattern, with the **why** after each step.

---

# Goal (1 sentence)

Make Colab sessions **predictably fast** and **reproducible** by caching installs and serving prebuilt wheels from a fast CDN, while keeping the actual Python environment ephemeral (safer across Colab updates).

---

# Part A — Instructor one-time setup

## 1) Create a small "env" repo

**Do**

* Make a GitHub repo (public or private) with:

```
requirements.txt   # top-level, pinned (==)
constraints.txt    # sub-dep pins; keeps transitive deps stable
notebooks/         # (optional) your course notebooks
```

* Keep pins tight. Example:

```
# requirements.txt
openai==1.51.0
pandas==2.2.2
numpy==1.26.4
scikit-learn==1.5.1
matplotlib==3.9.0
tiktoken==0.7.0
transformers==4.44.2
```

**Why**

* Colab changes under you. Pinned top-level packages plus constraints prevent surprise upgrades mid-semester.

---

## 2) Build a **wheelhouse** once (on Linux, matching Colab's Python)

**Do**

```bash
python -m pip install --upgrade pip wheel setuptools
mkdir -p wheels
pip wheel -r requirements.txt -c constraints.txt -w wheels
```

**Special notes**

* Don't compile big frameworks (Torch). Download the **official wheels** that match Colab's CUDA/Python and drop them in `wheels/`.

**Why**

* Installing from local wheels avoids slow builds and PyPI latency. It's the main speed win.
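To make the "download the official wheels" note concrete: rather than compiling Torch, you can ask `pip download` to fetch prebuilt wheels from PyTorch's own wheel index into `wheels/`. A hypothetical helper sketch — the version and `cu121` tag below are placeholders; match whatever Colab currently ships:

```python
def torch_download_cmd(version, cuda_tag, dest="wheels"):
    """Build the `pip download` command that fetches official prebuilt
    torch wheels from PyTorch's wheel index (no local compilation).

    `cuda_tag` (e.g. "cu121") must match Colab's CUDA that week --
    an assumption you verify per semester, not a fixed value.
    """
    return [
        "python", "-m", "pip", "download", f"torch=={version}",
        "--index-url", f"https://download.pytorch.org/whl/{cuda_tag}",
        "--dest", dest,
    ]

# Run it with: subprocess.check_call(torch_download_cmd("2.4.0", "cu121"))
```

Building the command as a list keeps it easy to inspect or log before the (long) download actually runs.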
---

## 3) Host the wheelhouse on **Google Cloud Storage (GCS)**

**Do (one time)**

```bash
# Create a bucket (pick a unique name and region)
gsutil mb -l US-CENTRAL1 gs://your-course-bucket

# Upload the wheels
gsutil -m rsync -r wheels gs://your-course-bucket/wheels

# Make them public-read (objects only)
gsutil iam ch allUsers:objectViewer gs://your-course-bucket
```

Now your wheelhouse lives at:

```
https://storage.googleapis.com/your-course-bucket/wheels/
```

* Note: pip's `--find-links` expects a URL that serves an **HTML page linking to the wheels**; a bare GCS prefix returns an XML listing that pip can't parse. Generate a simple `index.html` for the `wheels/` folder and upload it alongside the wheels.

**Why**

* GCS is fast, CDN-backed, and handles many simultaneous downloads. Drive often throttles.

---

## 4) Put the **bootstrap code** in your notebooks (Cell 1)

Use the snippet below. It:

* sets a **persistent pip cache** on Drive,
* fetches your latest `requirements.txt` / `constraints.txt` from GitHub,
* does a **two-pass install** (wheelhouse first, then PyPI fallback, both respecting constraints).

```python
import os, subprocess

os.environ["PIP_CACHE_DIR"] = "/content/drive/MyDrive/colab_cache/pip-cache"

WHEELHOUSE_URL = "https://storage.googleapis.com/your-course-bucket/wheels/"
REQ_URL = "https://raw.githubusercontent.com/your-org/your-course-repo/main/requirements.txt"
CON_URL = "https://raw.githubusercontent.com/your-org/your-course-repo/main/constraints.txt"

!curl -sSL {REQ_URL} -o /tmp/req.txt
!curl -sSL {CON_URL} -o /tmp/con.txt

base = ["python", "-m", "pip", "install", "-c", "/tmp/con.txt",
        "--prefer-binary", "--no-build-isolation"]

subprocess.check_call(["python", "-m", "pip", "install",
                       "--upgrade", "pip", "wheel", "setuptools"])

# Pass 1: wheelhouse first (fast, prebuilt)
subprocess.check_call(base + ["--find-links", WHEELHOUSE_URL, "-r", "/tmp/req.txt"])

# Pass 2: PyPI fallback for anything the wheelhouse is missing
subprocess.check_call(base + ["-r", "/tmp/req.txt"])

print("✅ Ready.")
```

**Why**

* **Cache on Drive**: cuts repeated downloads.
* **Wheelhouse first**: fetches at CDN speed, avoids builds.
* **Fallback pass**: fills any gaps from PyPI, still locked by `constraints.txt`.
---

## 5) Give students a tiny **"mount + clone (to Drive)"** pre-cell

```python
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive
!git clone https://github.com/your-org/your-course-repo.git || true
%cd your-course-repo
```

**Why**

* Cloning into **Drive** makes your code persist across sessions. (Speed comes from the wheelhouse + cache, not the clone.)

---

## 6) Version and freeze

* Tag releases (`f25-v1`, `f25-v2`) when you update pins.
* Keep the GCS wheelhouse in versioned folders if you need parallel cohorts (e.g., `wheels/f25-v1/`).

**Why**

* Reproducibility for grading and late submissions; clean rollback if something breaks.

---

# Part B — Student flow (every session)

1. Open the course notebook in Colab.
2. Run the **mount + clone** cell once per machine (persists in Drive).
3. Run **Cell 1 (bootstrap)**.
   * First pass: installs from the GCS wheelhouse (fast).
   * Second pass: fills gaps from PyPI, using the persistent pip cache.
4. Start the lab.

**Reset if broken**: `Runtime → Factory reset runtime`, then re-run Cell 1. (Stateless by design.)

---

# Why these pieces matter (short rationale)

* **Repo on Drive** → persists your code & notebooks (not for speed, for convenience).
* **`PIP_CACHE_DIR` on Drive** → big speedup on repeated sessions; avoids re-downloading.
* **GCS wheelhouse** → the biggest speed win; parallel, CDN-backed downloads; no builds.
* **Two-pass install** → deterministic (constraints) and resilient (PyPI fallback).
* **No shared site-packages** → avoids ABI/CUDA breakage when Colab updates Python/GPU.

---

# Optional: persist a few **pure-Python** libs

If you truly want some tiny pure-Python packages to persist across sessions:

```python
TARGET = "/content/drive/MyDrive/colab_cache/site-packages"
%pip install some_pure_python_lib==1.2.3 -t $TARGET
import sys; sys.path.insert(0, TARGET)
```

**Use sparingly.** Don't do this for NumPy/Torch/etc. — binary wheels will break on Colab updates.
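The versioned-folder idea from step 6 can be sketched as a small helper the bootstrap cell could use to pick the right `--find-links` URL per cohort tag. This is a hypothetical function, not part of the snippets above; the bucket name is the same placeholder used in step 3:

```python
BUCKET = "your-course-bucket"  # placeholder bucket name from step 3

def wheelhouse_url(tag=None):
    """Return the --find-links URL for a cohort tag like 'f25-v1'.

    With no tag, fall back to the flat wheels/ folder from step 3,
    so untagged notebooks keep working unchanged.
    """
    base = f"https://storage.googleapis.com/{BUCKET}/wheels/"
    return f"{base}{tag}/" if tag else base
```

Swapping `WHEELHOUSE_URL = wheelhouse_url("f25-v1")` into the bootstrap is then a one-line change per cohort.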
---

# Ongoing maintenance (simple)

* If something breaks upstream, **only edit `constraints.txt`** to pin a safe sub-dep. Students automatically pick it up next session.
* If Colab bumps Python/CUDA, **rebuild the wheelhouse** once and re-sync to GCS.
* Keep a quick **smoke test** cell (imports + `__version__`) to fail fast.

---

# Quick troubleshooting

* **"Failed building wheel …"** → make sure a binary wheel exists; add `--prefer-binary`; pick a version that ships wheels.
* **Torch/CUDA mismatch** → use the exact `torch/cu*` wheels that match Colab that week.
* **Drive feels slow** → wheels already come from GCS; that's why we chose it. If your `PIP_CACHE_DIR` grows huge, occasionally clear old cache entries.

---

# What students should remember (TL;DR)

* Always run **Cell 1** (the bootstrap) first.
* On errors: **Factory reset runtime** and re-run the bootstrap.
* Installs should take **~30–90 s** for light stacks, **2–4 min** for DS stacks, **4–8 min** for Torch/CUDA (from the wheelhouse).

---

# Why this approach (philosophy)

* Keep the **environment ephemeral** (safer against Colab changes), but move **heavy I/O** (wheels + pip cache) to **persistent, fast storage**.
* Centralize control in your repo + constraints; students simply re-run one cell and always land on the blessed environment.
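The smoke-test cell mentioned under maintenance can be a minimal sketch like this. The module list in the comment is illustrative only; adjust it to your `requirements.txt`:

```python
import importlib

def smoke_test(modules):
    """Import each module and report its version.

    Returns {name: version string, "unknown" if the module has no
    __version__, or None if the import failed} so one glance at the
    report shows exactly what is broken.
    """
    report = {}
    for name in modules:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = None
    return report

# In the notebook, fail fast if anything is missing (illustrative list):
# report = smoke_test(["numpy", "pandas", "sklearn"])
# missing = [k for k, v in report.items() if v is None]
# assert not missing, f"Missing: {missing}"
```

Because the function returns a dict instead of raising on the first failure, one run reports every broken import at once.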