# Two Paradigms of Science: From Physics to Machine Learning

### Jeff M. Phillips

---

## Type A: Mechanistic Science

- **Goal:** Describe simple physical processes or mechanisms
- **Example:** $F = ma$
- **Characteristics:**
  - Simple, elegant hypotheses can be correct (theoretical physics)
  - All components can be accurately measured (experimental physics)

**Classical Workflow:**
1. Make a (simple) hypothesis
2. Collect relevant data
3. Use statistics to evaluate the hypothesis

---

## Type B: Predictive Science

- **Goal:** Learn to predict the future (mechanism comes later)
- **Key assumption:** The future will look like the past
- **Characteristics:**
  - No mechanism required initially
  - Components may only allow noisy, incomplete measurements
  - Some mechanistic aspects may be missing from observations
  - Can collect large amounts of data (the "big data" era)

---

## Data Science Paradigm

### The Type B Workflow

1. **Collect** lots of data
2. **Define** what a good prediction looks like
3. **Search** for any function that performs well on this prediction rubric

### The Shift

**Machine learning, and much of modern science, has moved to Type B over the last 20 years.**

---

## The ML Game: Rule #1

### Data Assumptions

**$X \sim \text{iid from } \mu$**

- **iid:** independently and identically distributed
- **$\mu$:** represents "the world" — you make observations under a fixed mechanism
  - $x_i \sim \mu$
  - Can be repeated many times ($n$ observations)

---

## The ML Game: Rule #2

### Paired Data

**Dataset structure:** $X = \{(x_i, y_i)\}_i$

Each data point $x_i$ is paired with an outcome $y_i$.

**Example:**
- $x_i$ = [stock price on days: now$-j$, now$-(j-1)$, ..., now]
- $y_i$ = stock price tomorrow

---

## The ML Game: Rule #3

### The Learning Goal

**Learn a function $f: \mathcal{X} \to \mathcal{Y}$ such that $f(x_i) \approx y_i$**

Where:
- $\mathcal{X}$ = space of input data $x_i$
- $\mathcal{Y}$ = space of outcomes $y_i$

**Note:** Five years ago, we'd spend time on function spaces $\mathcal{F}$. Now? We care much less about formal constraints.

---

## The ML Game: Rule #4

### Cross-Validation: The Great Equalizer

**No need for statistics!** (sort of)

**Process:**
1. **Randomly split** the data into:
   - Training set: $X_T$
   - Evaluation set: $X_E$
   - where $X_T \cap X_E = \emptyset$
2. **Train:** Learn $f$ using $X_T$ (do whatever you like!)
3. **Evaluate:** Measure how well $f$ "generalizes" by testing $f(X_E) \approx y_E$

---

## Why Cross-Validation Works

### The Benefits

- **Natural** with "big data"
- **The great equalizer:** allows apples-to-apples comparison for ANY function $f$
- **Makes sense** when data is truly iid from $\mu$

No complex statistics needed — just split, train, test!

---

## The ML Game: Rule #5

### Computability Matters

**Key requirement:** Invoking $f(x_i)$ must be *computable*, not just existential

- Efficient computation is a plus, but not strictly required
- **Bonus:** If you can parameterize $f_a$ with a parameter vector $a$
  - and compute the gradient of the loss, $\nabla_a (f_a(x_i) - y_i)^2$
  - But auto-diff works, and non-gradient methods are possible too

---

## Example: Linear Regression

### The Classic ML Example

**Setup:**
- $x_i \in \mathbb{R}^d$
- $y_i \in \mathbb{R}$
- $f_a(x_i) = \langle x_i, a \rangle$ with parameters $a \in \mathbb{R}^d$
- (Can force the first coordinate of each $x_i$ to be $1$ to encode an offset)

**Closed-form solution** when "close" means $(f(x_i) - y_i)^2$:

$a = (X^T X)^{-1} X^T Y$

(or use gradient descent)

---

## The Interpretability Problem

### Linear Regression is NOT Interpretable!

**This is Type B science, not Type A** (without many more assumptions)

**Especially problematic when $d$ is large:**
- Collinearity issues
- Confounders
- Complex parameter interactions

We can predict, but understanding *why* is much harder.

---

## Frictionless Reproducibility

### The Secret to Type B Science's Success

*"Data Science at the Singularity"* — David Donoho

**Question:** How has Type B science become so effective and influential?
**Answer:** Frictionless Reproducibility (FR)

---

## The Challenge of Optimization

### Science is Hard!

**Problems:**
- Optimizing for the "best" $f$ is difficult
- "Graduate student descent"
- 1000 AI monkeys at keyboards

**Solution:** Engage the community in your quest!

Think: bumper-bowling guardrails for science

---

## FR-1: Data Accessibility

### Share Your Data

**Why share?**
- Others can confirm your results
- Others can try their own approaches

**Best practices:**
- The easier to access and use, the better
- Example: HuggingFace
- ML conferences have "dataset" tracks dedicated to new datasets

---

## FR-2: Re-Execution

### Share Your Function $f$

**Don't just share $X$, also share $f$**

**What to provide:**
- Executable code (not just descriptions in papers)
- Weights/parameters for neural networks
- Full source code (e.g., on GitHub)

**Why it matters:** Easier confirmation that the $f(X)$ predictions match your reported results

---

## The Common Workflow

### The Modern Research Cycle

**Step A:** Download a recent paper and its code
- Confirm $f(X)$ matches $y$ about as well as the paper claims

**Step B:** Make (often small) modifications to the code
- Improve it so $f(X)$ is closer to $y$

**Step C:** Write a new paper!

**Repeat every 4 months**

---

## FR-3: Challenges

### Define What "Close" Means

**Provide a scoring function:**
- Example: $S(f) = \sum_i (f(x_i) - y_i)^2$

**Hold out evaluation data:**
- Hide $X_E$ from the world (or release it with an honor code)
- $S(f; X_E)$ is the arbiter of which $f$ is better

**Provide code:**
- To compute $S$ and $S(\cdot\,; X_E)$

---

## Kaggle: Crowdsourced Optimization

### The Competition Model

**Setup:**
1. Release $X$ and the scoring function $S$ (but not $X_E$)
2. Offer to compute $S(\cdot\,; X_E)$
   - Sometimes limited to once per day
3. Announce a deadline

**The twist:**
- Keep a secret $X_E'$ (different from $X_E$, but also drawn iid from the same $\mu$ as the original $X$)
- At the deadline: compute $S(\cdot\,; X_E')$ for each team once
- Announce the winner based on $X_E'$, not $X_E$

---

## ImageNet: A Game Changer

### The Dataset That Launched Deep Learning

**What they did:**
- Collected a huge dataset $X, y$ where:
  - $x_i$ = images
  - $y_i$ = text labels
- Released it publicly

**Result:** Deep learning (complex $f$) shined!

This challenge fundamentally transformed computer vision and ML.

---

## LLMs and Self-Supervision

### A Breakthrough Realization

**Key insight:** Any text is an example of valid language (a label!)

**Example:**
- $x_i$: "...it was a dark and [MASK] night, and the wind howled..."
- $y_i$: MASK = "stormy"

**Impact:** Massive amounts of unlabeled text became training data

---

## What's Next?

### Four Frontiers

---

## Frontier #1: Merging Type A and Type B

### Scientific Machine Learning

**Goal:** The intersection of mechanistic and predictive science

**Approach:** Constrain the function space $\mathcal{F}$ so that $f \in \mathcal{F}$ conforms to physics

**Impact:** Functions that both predict well AND respect physical laws

---

## Frontier #2: Scientific Conclusions from $f$

### From Prediction to Understanding

**Approaches:**
- Use $f$ to simulate new $(x_i, y_i)$ pairs
- Fit mechanisms to $f$, or approximate $f$ with simple functions
- Interpret the parameters $a$

**(Good luck! — I am trying)**

---

## Frontier #3: Distribution Shift

### When the iid Assumption Fails and Predictions Affect the World

**Approaches:**
- Time-series modeling: predict the effect of a change
- Reinforcement learning (where AI differs from ML/data science)
- Distribution-shift algorithms (treat old data as a weaker prior, then fine-tune)

---

## Frontier #4: AI's Role

### Can AI Do Science?
**Current capabilities:**
- **Finding $f$:** If we set up FR-1, FR-2, and FR-3, AI can participate fairly well

**Setting up the infrastructure:**
- **FR-1 (Data):** AI can help (with some human guidance)
- **FR-2 (Code):** Yes! AI is good at this
- **FR-3 (Challenges):** This is harder

---

## AI-Generated Science

### The Current State

**Milestone:** First paper accepted to a real AI conference written almost entirely by AI

**Industry trend:** Several companies have this as their explicit goal

**Question:** What does this mean for the future of scientific research?

---

## Key Takeaways

### The Big Picture

1. **Science has evolved** from Type A (mechanistic) to Type B (predictive)
2. **Frictionless Reproducibility** enables community-driven optimization
   - Share data (FR-1)
   - Share code (FR-2)
   - Share challenges (FR-3)
3. **The future** lies in:
   - Merging Type A and Type B
   - Extracting scientific understanding from predictions
   - AI as a scientific collaborator

---

## Discussion

### Questions?

**Thank you!**
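---

## Backup: The ML Game in Code

A minimal, hypothetical sketch (pure Python, no libraries) tying the game's rules together: iid samples from a toy $\mu$, a random train/eval split (Rule #4), gradient descent on the squared loss for $f_a(x) = \langle x, a \rangle$ (Rules #3 and #5), and the scoring function $S(f) = \sum_i (f(x_i) - y_i)^2$ as the arbiter (FR-3). The distribution, the helper names, and all constants are illustrative assumptions, not from the talk.

```python
import random

random.seed(0)

# Rule #1: n iid observations from a fixed "world" mu.
# Toy mu (an assumption): x ~ Uniform[-1, 1]^d, y = <x, a_true> + noise.
d, n = 3, 200
a_true = [2.0, -1.0, 0.5]

def sample():
    x = [random.uniform(-1, 1) for _ in range(d)]
    y = sum(xj * aj for xj, aj in zip(x, a_true)) + random.gauss(0, 0.05)
    return x, y

data = [sample() for _ in range(n)]          # Rule #2: paired (x_i, y_i)

# Rule #4: randomly split into disjoint training and evaluation sets.
random.shuffle(data)
X_T, X_E = data[:150], data[150:]

# Rules #3 and #5: a computable, parameterized f_a(x) = <x, a>.
def f(a, x):
    return sum(aj * xj for aj, xj in zip(a, x))

# FR-3: scoring function S(f) = sum_i (f(x_i) - y_i)^2 over a given set.
def S(a, pairs):
    return sum((f(a, x) - y) ** 2 for x, y in pairs)

# Train: full-batch gradient descent on the squared loss over X_T only.
a, lr = [0.0] * d, 0.005
for _ in range(500):
    grad = [0.0] * d
    for x, y in X_T:
        r = f(a, x) - y                      # residual
        for j in range(d):
            grad[j] += 2 * r * x[j]          # d/da_j of (f_a(x) - y)^2
    a = [aj - lr * gj for aj, gj in zip(a, grad)]

# Evaluate: generalization is judged only on the held-out X_E.
print("learned a:", [round(aj, 3) for aj in a])
print("eval score S(f; X_E):", round(S(a, X_E), 3))
```

With matrix algebra (e.g., NumPy) the closed form $a = (X^T X)^{-1} X^T Y$ from the linear regression slide recovers the same parameters in one line; the point here is only that a computable $f$, a random split, and a scoring function $S$ are all the game requires.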