# Two Paradigms of Science: From Physics to Machine Learning
### -- Jeff M. Phillips
---
## Type A: Mechanistic Science
- **Goal:** Describe simple physical processes or mechanisms
- **Example:** $F = ma$
- **Characteristics:**
  - Simple, elegant hypotheses can be correct (theoretical physics)
  - All components can be accurately measured (experimental physics)
**Classical Workflow:**
1. Make a (simple) hypothesis
2. Collect relevant data
3. Use statistics to evaluate hypothesis
---
## Type B: Predictive Science
- **Goal:** Learn to predict the future (mechanism comes later)
- **Key assumption:** The future will look like the past
- **Characteristics:**
  - No mechanism required initially
  - Components may only allow noisy, incomplete measurements
  - Some mechanistic aspects may be missing from observations
  - Can collect large amounts of data (the "big data" era)
---
## Data Science Paradigm
### The Type B Workflow
1. **Collect** lots of data
2. **Define** what a good prediction looks like
3. **Search** for any function that performs well on this prediction rubric
### The Shift
**Machine Learning, and most of modern science, has moved to Type B over the last 20 years**
---
## The ML Game: Rule #1
### Data Assumptions
**$X \sim \text{iid from } \mu$**
- **iid:** independently and identically distributed
- **$\mu$:** represents "the world" — you make observations under a fixed mechanism
  - $x_i \sim \mu$
  - Can be repeated many times ($n$ observations)
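A minimal sketch of this rule in code (assuming NumPy, with a made-up 2-d Gaussian standing in for $\mu$):

```python
import numpy as np

# Rule #1 as code: "the world" mu is one fixed distribution, and each
# observation x_i is an independent draw from it (the Gaussian here is
# just an illustrative stand-in for mu).
rng = np.random.default_rng(0)

def sample_from_mu(n, d=2):
    # draw n iid observations x_1, ..., x_n ~ mu
    return rng.normal(loc=0.0, scale=1.0, size=(n, d))

X = sample_from_mu(n=1000)   # repeatable: ask for as many observations as you like
print(X.shape)               # (1000, 2)
```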
---
## The ML Game: Rule #2
### Paired Data
**Dataset structure:** $\{(x_i, y_i)\}_i = X$
Each data point $x_i$ is paired with an outcome $y_i$
**Example:**
- $x_i$ = [stock price on days: now-$j$, now-$(j-1)$, ..., now]
- $y_i$ = stock price tomorrow
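A minimal sketch of how the stock example above becomes paired data (NumPy, with made-up prices and a hypothetical window length $j$):

```python
import numpy as np

# Turn a raw price series into pairs (x_i, y_i): x_i is a window of j+1 past
# prices, y_i is the next day's price (prices and j are illustrative).
prices = np.array([101.0, 102.5, 101.8, 103.2, 104.0, 103.5, 105.1])
j = 2

X = np.array([prices[i:i + j + 1] for i in range(len(prices) - j - 1)])
y = np.array([prices[i + j + 1] for i in range(len(prices) - j - 1)])
print(X.shape, y.shape)   # (4, 3) (4,): four (x_i, y_i) pairs
```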
---
## The ML Game: Rule #3
### The Learning Goal
**Learn a function $f: \mathcal{X} \to \mathcal{Y}$ such that $f(x_i) \approx y_i$**
Where:
- $\mathcal{X}$ = space of input data $x_i$
- $\mathcal{Y}$ = space of outcomes $y_i$
**Note:** Five years ago, we'd spend time on function spaces $\mathcal{F}$. Now? We care much less about formal constraints.
---
## The ML Game: Rule #4
### Cross-Validation: The Great Equalizer
**No need for statistics!** (sort of)
**Process:**
1. **Randomly split** data into:
   - Training set: $X_T$
   - Evaluation set: $X_E$
   - Where $X_T \cap X_E = \emptyset$
2. **Train:** Learn $f$ using $X_T$ (do whatever you like!)
3. **Evaluate:** Measure how $f$ "generalizes" by testing $f(X_E) \approx y_E$
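A minimal sketch of Rule #4 (NumPy; the 80/20 split and the synthetic data are just illustrative choices):

```python
import numpy as np

# Randomly split into disjoint X_T and X_E, train on X_T however you like,
# then measure how f generalizes on X_E.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

perm = rng.permutation(n)
n_train = int(0.8 * n)                        # illustrative 80/20 split
T, E = perm[:n_train], perm[n_train:]
X_T, y_T, X_E, y_E = X[T], y[T], X[E], y[E]   # X_T and X_E are disjoint

a = np.linalg.lstsq(X_T, y_T, rcond=None)[0]  # "do whatever you like" on X_T
print(np.mean((X_E @ a - y_E) ** 2))          # generalization error on X_E
```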
---
## Why Cross-Validation Works
### The Benefits
- **Natural** with "big data"
- **The great equalizer:** Allows apples-to-apples comparison for ANY function $f$
- **Makes sense** when data is truly iid from $\mu$
No complex statistics needed — just split, train, test!
---
## The ML Game: Rule #5
### Computability Matters
**Key requirement:** Invoking $f(x_i)$ must be *computable*, not just existential
- Efficient computation is a plus, but not strictly required
- **Bonus:** If you can parameterize $f_a$ with a parameter vector $a$
  - and compute the gradient of the loss, e.g., $\nabla_a (f_a(x_i) - y_i)^2$, then gradient descent is available
  - auto-diff can supply this gradient, and gradient-free methods are possible too
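A minimal sketch of Rule #5 (NumPy; the gradient is written by hand for $f_a(x) = \langle x, a \rangle$ with squared error, where in practice auto-diff would supply it):

```python
import numpy as np

# If f_a is parameterized and we can get a gradient of the loss with respect
# to a, plain gradient descent is available.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.05 * rng.normal(size=100)

a = np.zeros(3)
lr = 0.05                                # illustrative step size
for _ in range(500):
    residual = X @ a - y                 # f_a(x_i) - y_i for every i
    grad = 2 * X.T @ residual / len(y)   # gradient of the mean squared loss
    a -= lr * grad
print(a)                                 # close to [1.0, -1.0, 2.0]
```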
---
## Example: Linear Regression
### The Classic ML Example
**Setup:**
- $x_i \in \mathbb{R}^d$
- $y_i \in \mathbb{R}$
- $f_a(x_i) = \langle x_i, a \rangle$ with parameters $a \in \mathbb{R}^d$
- (Can force the first coordinate of each $x_i$ to be $1$ to encode an offset)
**Closed-form solution** when "close" means the squared error $\sum_i (f_a(x_i) - y_i)^2$:
$a = (X^T X)^{-1} X^T y$
(or use gradient descent)
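A minimal sketch of the closed-form fit (NumPy; the data is synthetic and the true parameters are made up):

```python
import numpy as np

# Linear regression with an offset: append a constant first coordinate,
# then solve the normal equations a = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
n = 100
X_raw = rng.normal(size=(n, 2))
y = 3.0 + X_raw @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=n)

X = np.hstack([np.ones((n, 1)), X_raw])   # force first coordinate to 1
a = np.linalg.solve(X.T @ X, X.T @ y)     # closed-form least-squares solution
print(a)                                  # approximately [3.0, 1.5, -0.5]
```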
---
## The Interpretability Problem
### Linear Regression is NOT Interpretable!
**This is Type B science, not Type A**
(without many more assumptions)
**Especially problematic when $d$ is large:**
- Collinearity issues
- Confounders
- Complex parameter interactions
We can predict, but understanding *why* is much harder.
---
## Frictionless Reproducibility
### The Secret to Type B Science Success
*"Data Science at the Singularity"* — David Donoho
**Question:** How has Type B Science become so effective and influential?
**Answer:** Frictionless Reproducibility (FR)
---
## The Challenge of Optimization
### Science is Hard!
**Problems:**
- Optimizing for the "best" $f$ is difficult
- "Graduate student descent"
- 1000 AI-monkeys at keyboards
**Solution:** Engage the community in your quest!
Think: bumper bowling guardrails for science
---
## FR-1: Data Accessibility
### Share Your Data
**Why share?**
- Others can confirm your results
- Others can try their own approaches
**Best practices:**
- The easier to access & use, the better
- Examples: HuggingFace
- ML conferences have "dataset" tracks dedicated to new datasets
---
## FR-2: Re-Execution
### Share Your Function $f$
**Don't just share $X$, also share $f$**
**What to provide:**
- Executable code (not just descriptions in papers)
- Weights/parameters for neural networks
- Full source code (e.g., on GitHub)
**Why it matters:** Easier confirmation that $f(X)$ predictions match your reported results
---
## The Common Workflow
### The Modern Research Cycle
**Step A:** Download recent paper and code
- Confirm $f(X)$ matches $y$ about as well as the paper claims
**Step B:** Make (often small) modifications to code
- Improve so $f(X)$ is closer to $y$
**Step C:** Write new paper!
**Repeat every 4 months**
---
## FR-3: Challenges
### Define What "Close" Means
**Provide a scoring function:**
- Example: $S(f) = \sum_i (f(x_i) - y_i)^2$
**Hold out evaluation data:**
- Hide $X_E$ from the world (or release with honor code)
- $S(f, X_E)$ is the arbiter of which $f$ is better
**Provide code:**
- To compute $S$ and $S(\cdot, X_E)$
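A minimal sketch of FR-3 (NumPy; the organizer/participant split and all names are illustrative):

```python
import numpy as np

# Publish the scoring function S, hold back X_E, and let S(., X_E) decide.
def S(f, X, y):
    # sum of squared errors, as in the example score above
    return float(np.sum((f(X) - y) ** 2))

# organizer side: held-out evaluation data, hidden from participants
rng = np.random.default_rng(0)
X_E = rng.normal(size=(50, 3))
y_E = X_E @ np.array([1.0, 0.0, -1.0])

# participant side: submit any callable f
f_submitted = lambda X: X @ np.array([0.9, 0.1, -1.1])
print(S(f_submitted, X_E, y_E))   # the number that ranks submissions
```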
---
## Kaggle: Crowdsourced Optimization
### The Competition Model
**Setup:**
1. Release $X$ and scoring function $S$ (but not $X_E$)
2. Offer to compute $S(\cdot, X_E)$
   - Sometimes limited to once per day
   - Announce deadline
**The twist:**
- Have a secret $X_E'$ (different from $X_E$, but also drawn iid from the same distribution as $X$)
- At deadline: compute $S(\cdot, X_E')$ for each team once
- Announce winner based on $X_E'$, not $X_E$
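A minimal sketch of the public/private twist (NumPy; the teams, data, and linear "world" are all made up):

```python
import numpy as np

# Daily feedback is scored on X_E; the winner is decided once, at the
# deadline, on a secret X_E' drawn from the same world.
def S(f, X, y):
    return float(np.sum((f(X) - y) ** 2))

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])
def draw(n):                           # both eval sets come iid from the same mu
    X = rng.normal(size=(n, 2))
    return X, X @ w_true

X_E, y_E = draw(100)                   # scored (perhaps once per day) during the contest
X_E2, y_E2 = draw(100)                 # secret second set X_E', used only at the deadline

teams = {"team_a": lambda Z: Z @ np.array([0.95, -1.05]),
         "team_b": lambda Z: Z @ np.array([1.20, -0.80])}
final = {name: S(f, X_E2, y_E2) for name, f in teams.items()}
print(min(final, key=final.get))       # winner announced on X_E', not X_E
```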
---
## ImageNet: A Game Changer
### The Dataset That Launched Deep Learning
**What they did:**
- Collected huge dataset $X, y$ where:
  - $x_i$ = images
  - $y_i$ = text labels
- Released publicly
**Result:** Deep learning (complex $f$) shined!
This challenge fundamentally transformed computer vision and ML
---
## LLMs and Self-Supervision
### A Breakthrough Realization
**Key insight:** Any text is an example of valid language (a label!)
**Example:**
- $x_i$: "...it was a dark and [MASK] night, and the wind howled..."
- $y_i$: MASK = "stormy"
**Impact:** Massive amounts of unlabeled text became training data
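A minimal sketch of turning raw text into $(x_i, y_i)$ pairs (plain Python; the whitespace tokenizer and single mask are deliberately toy choices):

```python
import random

# Any sentence becomes a labeled example: hide one token, and that token
# is the label the model must predict.
random.seed(0)

def make_masked_pair(text):
    tokens = text.split()                        # toy tokenizer
    i = random.randrange(len(tokens))
    y = tokens[i]                                # the hidden word is y_i
    x = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    return " ".join(x), y

x_i, y_i = make_masked_pair("it was a dark and stormy night and the wind howled")
print(x_i)   # the sentence with one word replaced by [MASK]
print(y_i)   # the hidden word
```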
---
## What's Next?
### Four Frontiers
---
## Frontier #1: Merging Type A and Type B
### Scientific Machine Learning
**Goal:** Intersection of mechanistic and predictive science
**Approach:** Constrain the function space $\mathcal{F}$ so that $f \in \mathcal{F}$ conforms to physics
**Impact:** Functions that both predict well AND respect physical laws
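A minimal sketch of one simple way to do this (NumPy; the "physics" here is a made-up linear constraint that the coefficients of $f_a$ sum to one, standing in for a real physical law):

```python
import numpy as np

# Soft-constrain the search: minimize the data loss plus a penalty that pushes
# f_a(x) = <x, a> toward the (hypothetical) physical relation sum(a) = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.normal(size=200)

lam = 10.0                            # weight on the physics penalty (illustrative)
a = np.zeros(3)
for _ in range(2000):
    data_grad = 2 * X.T @ (X @ a - y) / len(y)              # fit the observations
    physics_grad = 2 * lam * (a.sum() - 1.0) * np.ones(3)   # respect the law
    a -= 0.02 * (data_grad + physics_grad)
print(a, a.sum())                     # predicts well and sums to (nearly) 1
```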
---
## Frontier #2: Scientific Conclusions from f
### From Prediction to Understanding
**Approaches:**
- Use $f$ to simulate new $(x_i, y_i)$ pairs
- Fit mechanisms to $f$, or approximate $f$ with simple functions
- Interpret parameters $a$ **(Good luck! — I am trying)**
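A minimal sketch of the "approximate $f$ with a simple function" idea (NumPy; the black-box $f$ and the linear surrogate are both illustrative):

```python
import numpy as np

# Query an opaque learned predictor on simulated inputs, then fit a simple,
# interpretable surrogate to its answers.
rng = np.random.default_rng(0)

def f_black_box(X):                   # stands in for a complex learned f
    return np.tanh(X @ np.array([2.0, -1.0, 0.5]))

X_sim = rng.normal(size=(500, 3))                           # simulate new x_i
y_sim = f_black_box(X_sim)                                  # let f label them
a_surrogate = np.linalg.lstsq(X_sim, y_sim, rcond=None)[0]  # simple stand-in for f
print(a_surrogate)                    # coefficients that roughly mimic f
```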
---
## Frontier #3: Distribution Shift
### When the iid Assumption Fails
### When Predictions Affect the World
**Approaches:**
- Time series modeling: predict the effect of a change
- Reinforcement learning (where AI differs from ML/Data Science)
- Distribution-shift algorithms (treat old data as a weaker prior, then fine-tune)
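A minimal sketch of the last bullet above (NumPy; the 0.2 weight on old data and the shifted "world" are made up for illustration):

```python
import numpy as np

# Treat old data as a weaker prior: keep it, but down-weight it relative to
# fresh data from the shifted world, then refit.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(500, 2)); y_old = X_old @ np.array([1.0, 1.0])
X_new = rng.normal(size=(50, 2));  y_new = X_new @ np.array([1.0, 2.0])  # world shifted

w = np.concatenate([0.2 * np.ones(len(y_old)), np.ones(len(y_new))])  # old data weaker
X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])
sw = np.sqrt(w)
a = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]   # weighted least squares
print(a)   # pulled toward the new relationship, with old data as a weak prior
```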
---
## Frontier #4: AI's Role
### Can AI Do Science?
**Current capabilities:**
- **Finding $f$:** If we set up FR-1, FR-2, FR-3, AI can participate fairly well
**Setting up the infrastructure:**
- **FR-1 (Data):** AI can help (with some human guidance)
- **FR-2 (Code):** Yes! AI is good at this
- **FR-3 (Challenges):** This is harder
---
## AI-Generated Science
### The Current State
**Milestone:** First paper accepted to a real AI conference written almost entirely by AI
**Industry trend:** Several companies have this as their explicit goal
**Question:** What does this mean for the future of scientific research?
---
## Key Takeaways
### The Big Picture
1. **Science has evolved** from Type A (mechanistic) to Type B (predictive)
2. **Frictionless Reproducibility** enables community-driven optimization
   - Share data (FR-1)
   - Share code (FR-2)
   - Share challenges (FR-3)
3. **The future** lies in:
   - Merging Type A and Type B
   - Extracting scientific understanding from predictions
   - AI as a scientific collaborator
---
## Discussion
### Questions?
**Thank you!**
{"title":"Two Paradigms of Science: From Physics to Machine Learning","description":"Goal: Describe simple physical processes or mechanisms","contributors":"[{\"id\":\"677425a9-5b30-410a-a1f6-95db22cd3f6e\",\"add\":8787,\"del\":1,\"latestUpdatedAt\":1762195207369}]"}