# resource-on-sutton
# RL vs. LLMs in Sutton’s view — what he means, why he says it, and where it leaves us
## TL;DR
Richard Sutton argues that **reinforcement learning (RL)** is “basic AI” because it learns from **experience**: an agent takes actions, observes **what actually happens**, and improves toward **goals** defined by **reward**. By contrast, he sees today’s **large language models (LLMs)** as primarily trained to **mimic what people write** (next-token prediction), which lacks grounded goals and **environmental feedback** about consequences. Hence, LLMs are excellent at *what a person would say*, not *what the world will do*. ([Dwarkesh][1])
---
## What Sutton is saying (in his own recent words)
* “**I consider reinforcement learning to be basic AI.** … RL is about understanding your world, whereas large language models are about **mimicking people**… They’re not about figuring out what to do.” ([Dwarkesh][1])
* LLMs “**have the ability to predict what a person would say. They don’t have the ability to predict what will happen.**” ([Dwarkesh][1])
* Without **goals and ground truth**, there’s “no right thing to say” in the LLM setup; in RL, there *is* a right thing to do—**the thing that gets reward**. ([Dwarkesh][1])
These remarks come from Sutton’s September 26, 2025 interview with Dwarkesh Patel, where he repeatedly contrasts **mimicking text** with **learning from real-world experience**. ([Dwarkesh][1])
---
## How this fits his lifetime of work
Sutton’s long-standing program centers on **agents that learn from interaction**:
* **Temporal-Difference (TD) learning** formalized how to learn predictions by bootstrapping from successive predictions—one of the field’s fundamental ideas. ([Incomplete Ideas][2])
* **Dyna** (1991) unified **learning, planning, and reacting**: learn a model from experience, use it to plan (simulate), and keep improving via trial-and-error. (A minimal Dyna-Q sketch, combining a TD-style update with Dyna-style planning, follows this list.) ([ACM Digital Library][3])
* The widely used textbook ***Reinforcement Learning: An Introduction*** (Sutton & Barto) codifies RL’s agent–environment formulation, rewards, value/policy learning, and model-based vs. model-free methods. (2nd ed., 2018). ([Incomplete Ideas][4])
* **“The Bitter Lesson”** (2019) argues that **general methods that scale with computation** (search & learning) beat approaches that build in human knowledge; eventually, **learning from data/experience** wins. ([Incomplete Ideas][5])
* **“Reward is Enough”** (2021, with Silver, Singh, Precup) advances the hypothesis that maximizing reward can, in principle, produce most of intelligence’s facets—perception, memory, planning—when grounded in interaction. (This view has published counter-arguments, too.) ([ScienceDirect][6])
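To make the first two ideas concrete, here is a minimal tabular Dyna-Q sketch (a toy example added here, not code from the papers; the chain environment and hyperparameters are illustrative assumptions): `td_update` is the TD-style bootstrapped update, and the inner planning loop replays transitions from a learned model, which is the heart of Dyna.

```python
import random
from collections import defaultdict

# Toy deterministic chain: states 0..N-1, actions move left/right, reward 1 for reaching the right end.
N = 10
ACTIONS = (-1, +1)

def step(state, action):
    nxt = min(max(state + action, 0), N - 1)
    return nxt, (1.0 if nxt == N - 1 else 0.0), nxt == N - 1   # next state, reward, done

Q = defaultdict(float)   # Q[(state, action)] -> value estimate
model = {}               # Dyna's learned model: (state, action) -> (reward, next state)
alpha, gamma, eps, n_planning = 0.1, 0.95, 0.1, 20

def td_update(s, a, r, s2):
    # TD-style bootstrapped update: move Q(s, a) toward r + gamma * max_b Q(s', b).
    target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

for episode in range(200):
    s, done = 0, False
    while not done:
        # Act epsilon-greedily on current estimates, then observe what actually happens.
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        td_update(s, a, r, s2)        # learn from real experience
        model[(s, a)] = (r, s2)       # update the model from that same experience
        for _ in range(n_planning):   # plan: replay simulated experience drawn from the model
            (ps, pa), (pr, ps2) = random.choice(list(model.items()))
            td_update(ps, pa, pr, ps2)
        s = s2

print([max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N - 1)])  # greedy action per non-terminal state
```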
He also leans on **John McCarthy’s** definition of intelligence as “**the computational part of the ability to achieve goals in the world**,” which places **goal-achievement** at the center of intelligence—and therefore at the center of RL. ([Formal Reasoning Group][7])
---
## The core technical distinction he’s drawing
| Aspect | RL (Sutton’s “real AI”) | LLM pretraining (next-token prediction) |
| ----------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **Learning signal** | **Reward** from environment (success/failure toward a **goal**). | **Cross-entropy** loss on text: predict the next token correctly. |
| **Interaction** | **Active**: agent acts, observes **what happens**, updates. | **Passive**: learn from **static corpora** (what people wrote in the past). |
| **Ground truth** | **Outcomes in the world** (or environment/simulator) define truth. | “Truth” is **how humans wrote**; feedback is linguistic correctness/plausibility. |
| **World model** | Can be **explicit** (model-based RL) or implicit via values/policies tied to **consequences**. | A statistical model of **language**; any “world knowledge” is **mediated by text**, not by interventions. |
| **Continual/online learning** | Central; **learn during life**. | Typically **frozen** after training; alignment (RLHF) is post-hoc and still text-based. |
Sutton’s claim is **not** that LLMs are useless; it’s that **their training objective is misaligned with learning about the external world’s causal dynamics and with achieving goals through action**. ([Dwarkesh][1])
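To make the table's first two rows concrete, here is a schematic, dependency-free sketch (an illustration added here, not anything from Sutton) of the two learning signals: a next-token cross-entropy computed against a static corpus, versus a REINFORCE-style policy update driven by reward that only arrives after acting. The toy vocabulary, environment, and learning rates are assumptions.

```python
import math
import random

# (a) The LLM pretraining signal: cross-entropy on the next token from a static corpus.
def next_token_loss(logits, target_id):
    # -log p(target token | context): "truth" is whatever a human wrote next.
    z = max(logits)
    log_norm = z + math.log(sum(math.exp(x - z) for x in logits))
    return -(logits[target_id] - log_norm)

# (b) An RL signal: act, observe the reward that actually arrives, and reinforce accordingly
#     (a one-step REINFORCE-style update on a tabular softmax policy).
def reinforce_update(theta, state, env_step, actions=(0, 1), lr=0.1):
    logits = [theta[(state, a)] for a in actions]
    z = max(logits)
    probs = [math.exp(x - z) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    a = random.choices(actions, weights=probs)[0]
    reward = env_step(state, a)                        # ground truth = what the environment did
    for i, b in enumerate(actions):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[i]
        theta[(state, b)] += lr * reward * grad_log_pi
    return reward

# Toy usage: a 3-token "vocabulary" for (a), and a 3-state bandit where action 1 pays off for (b).
print(next_token_loss(logits=[2.0, 0.5, -1.0], target_id=0))
theta = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
for _ in range(500):
    reinforce_update(theta, random.randrange(3), lambda s, a: 1.0 if a == 1 else 0.0)
print({s: max((0, 1), key=lambda a: theta[(s, a)]) for s in range(3)})  # learned greedy action per state
```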
---
## “But LLMs use RL too, right?” The nuance
* **RLHF / RLAIF** fine-tunes LLMs with preference-based reward models (human or AI feedback). This does add an **RL loop**, but the reward is still **about text quality/behavior**, not about **state changes in the world**. It aligns style and helpfulness; it doesn’t supply the **physics or consequences** Sutton cares about. (A minimal sketch of this preference objective appears below.) ([arXiv][8])
* **Agentic LLMs**: when you couple LLMs to tools, browsers, or robots, you *approximate* ground truth by tying outputs to **observable outcomes** (task success, robot success rates). This **moves toward** Sutton’s desiderata but is **not** standard LLM pretraining. Examples include Google/DeepMind’s **SayCan** and **RT-2**, which use language models as high-level planners/controllers **grounded in robotic affordances** and sensorimotor feedback. ([SayCan][9])
**Bottom line:** Sutton’s critique targets the **core training regime** (next-token prediction on text), not the *possibility* of building **RL-grounded agents** that *use* language models. He expects **systems that learn from experience** to **supersede** large pretrained imitators. ([Dwarkesh][1])
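To see concretely why the RLHF reward is "about text," here is a minimal sketch of the Bradley–Terry-style preference objective commonly used to fit RLHF reward models: the model is trained so preferred text scores above rejected text, so the learned reward encodes human judgments of language, not state changes in an environment. The linear feature function, the data, and the finite-difference training step are invented for illustration.

```python
import math

# Illustrative reward model: a linear score over hand-made text features (purely illustrative).
def features(text):
    return [float(len(text.split())), float(text.count("please")), 1.0]

def reward(w, text):
    return sum(wi * xi for wi, xi in zip(w, features(text)))

def preference_loss(w, preferred, rejected):
    # Bradley-Terry-style objective: -log sigma(r(preferred) - r(rejected)).
    # The training signal is a human judgment about two pieces of text.
    margin = reward(w, preferred) - reward(w, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def sgd_step(w, preferred, rejected, lr=0.01, h=1e-5):
    # One gradient step on a single comparison; finite differences keep the sketch dependency-free.
    base = preference_loss(w, preferred, rejected)
    grads = []
    for i in range(len(w)):
        w_hi = w[:]
        w_hi[i] += h
        grads.append((preference_loss(w_hi, preferred, rejected) - base) / h)
    return [wi - lr * g for wi, g in zip(w, grads)]

w = [0.0, 0.0, 0.0]
for _ in range(200):
    w = sgd_step(w, "Sure, here is a careful, sourced answer.", "No.")
print(reward(w, "Sure, here is a careful, sourced answer."), reward(w, "No."))
```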
---
## Where his view meets counter-evidence or counter-arguments
1. **LLMs do encode a lot of world regularities.** Language is a lossy but vast record of the world; next-token models often **generalize** to causal/physical reasoning tests and can **simulate futures** in narrative form. Sutton would reply: without **acting** and **being corrected by outcomes**, this knowledge **isn’t anchored** to consequences. ([Dwarkesh][1])
2. **Hybrid systems increasingly succeed.** Vision-Language-Action models (e.g., **RT-2**) and frameworks like **SayCan** demonstrate that **text-trained models + embodied feedback** can do purposeful things in the world—precisely the direction Sutton champions (though he’d say: make **RL/experience** the foundation, not the add-on). (A schematic of SayCan-style selection follows this list.) ([Google DeepMind][10])
3. **“Reward is Enough” is debated.** Some argue that **scalar reward alone** is insufficient (multi-objective values, safety, ethics). So even on Sutton’s home turf, there’s an ongoing research conversation. ([arXiv][11])
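As an illustration of the grounding in point 2, here is a schematic of the SayCan-style selection rule: a language model scores how useful each skill description is for the instruction, a learned value function scores whether that skill is currently feasible (its affordance), and the agent executes the skill with the highest combined score. The helpers `llm_skill_score` and `affordance_value` are hypothetical stand-ins, not the real SayCan interfaces.

```python
from typing import Callable, Dict, List

def saycan_select(
    instruction: str,
    skills: List[str],
    llm_skill_score: Callable[[str, str], float],   # hypothetical: how useful the LM thinks the skill is
    affordance_value: Callable[[str], float],       # hypothetical: learned value fn, "can I do this now?"
) -> str:
    # SayCan-style combination: usefulness (language) x feasibility (grounded value function).
    scored: Dict[str, float] = {
        skill: llm_skill_score(instruction, skill) * affordance_value(skill)
        for skill in skills
    }
    return max(scored, key=scored.get)

# Toy usage with hard-coded stand-ins for the two scorers.
skills = ["find a sponge", "pick up the sponge", "go to the spill", "wipe the spill"]
llm = lambda instr, skill: {"find a sponge": 0.4, "pick up the sponge": 0.3,
                            "go to the spill": 0.2, "wipe the spill": 0.9}[skill]
affordance = lambda skill: 0.1 if skill == "wipe the spill" else 0.8   # can't wipe yet: no sponge in hand
print(saycan_select("clean up the spilled drink", skills, llm, affordance))
```

In this toy run the language model alone would pick "wipe the spill," while the grounded affordance estimate steers the agent to fetch a sponge first; that interplay is what "grounding in robotic affordances" buys.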
---
## Why he’s doubling down now (context)
* Sutton and Andrew Barto won the **2024 A.M. Turing Award** (announced in 2025) for their foundational work on RL, which puts a spotlight on **agents that learn from experience** versus **generative imitation**. In coverage, Sutton distinguishes **learning from people’s data** vs. **learning from one’s own life/experience**. ([AP News][12])
* His 2019 **Bitter Lesson** presaged today’s scaling-heavy paradigms, but in the recent interview he questions whether **LLMs are really “bitter-lesson-pilled,”** since they **inject human knowledge** and may hit **data limits** relative to open-ended **experiential learning**. ([Dwarkesh][1])
---
## Practical takeaways (for designing work, courses, or research)
1. **If you want “real-world” learning, give models consequences.**
   Build assignments and lab projects where agents **act** (in simulation or the physical studio) and receive **reward signals**—even simple ones. (Classic **Dyna**/TD setups scale from toy tasks to robots/simulated environments; a minimal environment-with-reward sketch follows this list.) ([ACM Digital Library][3])
2. **Treat LLMs as powerful priors, not end states.**
Use LLMs for **planning, language, and perception glue**, but **close the loop** with **environmental feedback** (robotics, web agents with success metrics, evaluators tied to measurable outcomes). ([SayCan][9])
3. **Teach the difference between imitation and experience.**
Side-by-side demos: (a) LLM answers “what to do” purely from text vs. (b) an RL agent that must learn by **trying, failing, and improving**. Students see why **ground truth via outcomes** matters. ([Incomplete Ideas][4])
4. **Make “goals” explicit.**
Even in classroom LLM projects, define **task-level rewards** (e.g., rubric-derived scoring proxies, automated checks, simulation returns) so systems are **optimizing** toward something beyond stylistic plausibility. (Note: RLHF optimizes **text preferences**, not **state changes**.) ([arXiv][8])
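A starting point for item 1 (and for the explicit rewards in item 4): a minimal, classroom-scale sketch of an environment that dispenses an explicit reward, plus an invented rubric-style automated check that turns a text answer into a scalar score. The hallway task, step cost, and rubric checks are all illustrative assumptions, not a prescription.

```python
import random

class GridHallway:
    """Minimal environment with explicit goals and consequences: reach the right end of a hallway."""
    def __init__(self, length=8):
        self.length = length
        self.reset()

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else -0.01      # small step cost makes "faster" part of the goal
        return self.pos, reward, done

def rubric_reward(answer: str) -> float:
    # Illustrative automated check (item 4): a rubric-derived scoring proxy for a text answer.
    checks = ["cites a link" if "http" in answer else "",
              "states a number" if any(c.isdigit() for c in answer) else ""]
    return sum(1.0 for c in checks if c) / len(checks)

env = GridHallway()
state, done, total = env.reset(), False, 0.0
while not done:                              # a random policy; students replace this with a learner
    state, r, done = env.step(random.choice((-1, +1)))
    total += r
print("episode return:", round(total, 2), "| rubric reward:", rubric_reward("See https://example.com, n=42."))
```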
---
## A short reading list (Sutton-centric + key debates)
* **Sutton & Barto, *Reinforcement Learning: An Introduction* (2nd ed., 2018)** – canonical RL text. ([Incomplete Ideas][4])
* **Sutton (1988), “Learning to Predict by the Methods of Temporal Differences.”** – TD learning foundations. ([Incomplete Ideas][2])
* **Sutton (1991), “Dyna: an integrated architecture for learning, planning, and reacting.”** – unifies model learning with planning. ([ACM Digital Library][3])
* **Sutton (2019), “The Bitter Lesson.”** – why general, compute-scaling methods win. ([Incomplete Ideas][5])
* **Silver, Singh, Precup, Sutton (2021), “Reward is Enough.”** – argues reward suffices to drive intelligence (plus **responses** critiquing scalar reward). ([ScienceDirect][6])
* **Dwarkesh Patel interview (2025)** – Sutton’s latest, explicit statements on **LLMs vs. RL**. ([Dwarkesh][1])
* **SayCan (2022), RT-2 (2023)** – examples of grounding language models in **robotic action**. ([arXiv][13])
---
## My synthesis
Sutton’s position is philosophically clean and technically coherent: **intelligence = goal-directed improvement from interaction.** LLMs, as trained today, **lack the action-consequence loop** that defines this kind of learning. The most compelling path forward is **hybrid**: keep the strengths of language models (prior knowledge, reasoning, interfaces) **but embed them inside agents that learn from experience**. That’s where current research in **robotics** and **tool-using web agents** seems to be heading—and it’s exactly the terrain Sutton has argued for since the beginning.
---
### Recent coverage (context)
* [AP News](https://apnews.com/article/83db773712dd3abccd21e3782d9059ec)
* [Financial Times](https://www.ft.com/content/d8f85d40-2c5b-4a2b-b113-87fa8e30f61b)
**Sources cited throughout:** Sutton interview (Sep 26, 2025), “The Bitter Lesson” (2019), RL textbook (2018), TD/Dyna papers, “Reward is Enough” (2021) and responses, RLHF/RLAIF papers, SayCan/RT-2 robotics results.
[1]: https://www.dwarkesh.com/p/richard-sutton "Richard Sutton – Father of RL thinks LLMs are a dead end"
[2]: https://incompleteideas.net/papers/sutton-88-with-erratum.pdf "Learning to predict by the methods of temporal differences"
[3]: https://dl.acm.org/doi/pdf/10.1145/122344.122377 "Dyna, an integrated architecture for learning, planning, and ..."
[4]: https://incompleteideas.net/sutton/book/the-book-2nd.html "Reinforcement Learning: An Introduction"
[5]: https://www.incompleteideas.net/IncIdeas/BitterLesson.html "The Bitter Lesson"
[6]: https://www.sciencedirect.com/science/article/pii/S0004370221000862 "Reward is enough"
[7]: https://www-formal.stanford.edu/jmc/whatisai.pdf "What is Artificial Intelligence - Formal Reasoning Group"
[8]: https://arxiv.org/abs/2203.02155 "Training language models to follow instructions with human feedback"
[9]: https://say-can.github.io/ "SayCan: Grounding Language in Robotic Affordances"
[10]: https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/ "RT-2: New model translates vision and language into action"
[11]: https://arxiv.org/abs/2112.15422 "Scalar reward is not enough: A response to Silver, Singh, Precup and Sutton (2021)"
[12]: https://apnews.com/article/83db773712dd3abccd21e3782d9059ec "AI pioneers who channeled 'hedonistic' machines win computer science's top prize | AP News"
[13]: https://arxiv.org/abs/2204.01691 "Do As I Can, Not As I Say: Grounding Language in Robotic ..."