# resource-on-sutton
# RL vs. LLMs in Sutton’s view — what he means, why he says it, and where it leaves us
## TL;DR
Richard Sutton argues that **reinforcement learning (RL)** is “basic AI” because it learns from **experience**: an agent takes actions, observes **what actually happens**, and improves toward **goals** defined by **reward**. By contrast, he sees today’s **large language models (LLMs)** as primarily trained to **mimic what people write** (next-token prediction), which lacks grounded goals and **environmental feedback** about consequences. Hence, LLMs are excellent at *what a person would say*, not *what the world will do*. ([Dwarkesh][1])
---
## What Sutton is saying (in his own recent words)
* “**I consider reinforcement learning to be basic AI.** … RL is about understanding your world, whereas large language models are about **mimicking people**… They’re not about figuring out what to do.” ([Dwarkesh][1])
* LLMs “**have the ability to predict what a person would say. They don’t have the ability to predict what will happen.**” ([Dwarkesh][1])
* Without **goals and ground truth**, there’s “no right thing to say” in the LLM setup; in RL, there *is* a right thing to do—**the thing that gets reward**. ([Dwarkesh][1])
These remarks come from Sutton’s September 26, 2025 interview with Dwarkesh Patel, where he repeatedly contrasts **mimicking text** with **learning from real-world experience**. ([Dwarkesh][1])
---
## How this fits his lifetime of work
Sutton’s long-standing program centers on **agents that learn from interaction**:
* **Temporal-Difference (TD) learning** formalized how to learn predictions by bootstrapping from successive predictions—one of the field’s fundamental ideas. ([Incomplete Ideas][2])
* **Dyna** (1991) unified **learning, planning, and reacting**: learn a model from experience, use it to plan (simulate), and keep improving via trial-and-error. (A minimal Dyna-Q sketch, combining a TD-style update with Dyna-style planning, follows this list.) ([ACM Digital Library][3])
* The widely used textbook ***Reinforcement Learning: An Introduction*** (Sutton & Barto) codifies RL’s agent–environment formulation, rewards, value/policy learning, and model-based vs. model-free methods. (2nd ed., 2018). ([Incomplete Ideas][4])
* **“The Bitter Lesson”** (2019) argues that **general methods that scale with computation** (search & learning) beat approaches that build in human knowledge; eventually, **learning from data/experience** wins. ([Incomplete Ideas][5])
* **“Reward is Enough”** (2021, with Silver, Singh, Precup) advances the hypothesis that maximizing reward can, in principle, produce most of intelligence’s facets—perception, memory, planning—when grounded in interaction. (This view has published counter-arguments, too.) ([ScienceDirect][6])
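To make the first two ideas concrete, here is a minimal tabular Dyna-Q sketch (a toy example added here, not code from the papers; the chain environment and hyperparameters are illustrative assumptions): `td_update` is the TD-style bootstrapped update, and the inner planning loop replays transitions from a learned model, which is the heart of Dyna.

```python
import random
from collections import defaultdict

# Toy deterministic chain: states 0..N-1, actions move left/right, reward 1 for reaching the right end.
N = 10
ACTIONS = (-1, +1)

def step(state, action):
    nxt = min(max(state + action, 0), N - 1)
    return nxt, (1.0 if nxt == N - 1 else 0.0), nxt == N - 1   # next state, reward, done

Q = defaultdict(float)   # Q[(state, action)] -> value estimate
model = {}               # Dyna's learned model: (state, action) -> (reward, next state)
alpha, gamma, eps, n_planning = 0.1, 0.95, 0.1, 20

def td_update(s, a, r, s2):
    # TD-style bootstrapped update: move Q(s, a) toward r + gamma * max_b Q(s', b).
    target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

for episode in range(200):
    s, done = 0, False
    while not done:
        # Act epsilon-greedily on current estimates, then observe what actually happens.
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        td_update(s, a, r, s2)        # learn from real experience
        model[(s, a)] = (r, s2)       # update the model from that same experience
        for _ in range(n_planning):   # plan: replay simulated experience drawn from the model
            (ps, pa), (pr, ps2) = random.choice(list(model.items()))
            td_update(ps, pa, pr, ps2)
        s = s2

print([max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N - 1)])  # greedy action per non-terminal state
```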
He also leans on **John McCarthy’s** definition of intelligence as “**the computational part of the ability to achieve goals in the world**,” which places **goal-achievement** at the center of intelligence—and therefore at the center of RL. ([Formal Reasoning Group][7])
---
## The core technical distinction he’s drawing
| Aspect | RL (Sutton’s “real AI”) | LLM pretraining (next-token prediction) |
| ----------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| **Learning signal** | **Reward** from environment (success/failure toward a **goal**). | **Cross-entropy** loss on text: predict the next token correctly. |
| **Interaction** | **Active**: agent acts, observes **what happens**, updates. | **Passive**: learn from **static corpora** (what people wrote in the past). |
| **Ground truth** | **Outcomes in the world** (or environment/simulator) define truth. | “Truth” is **how humans wrote**; feedback is linguistic correctness/plausibility. |
| **World model** | Can be **explicit** (model-based RL) or implicit via values/policies tied to **consequences**. | A statistical model of **language**; any “world knowledge” is **mediated by text**, not by interventions. |
| **Continual/online learning** | Central; **learn during life**. | Typically **frozen** after training; alignment (RLHF) is post-hoc and still text-based. |
Sutton’s claim is **not** that LLMs are useless; it’s that **their training objective is misaligned with learning about the external world’s causal dynamics and with achieving goals through action**. ([Dwarkesh][1])
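To make the table's first two rows concrete, here is a schematic, dependency-free sketch (an illustration added here, not anything from Sutton) of the two learning signals: a next-token cross-entropy computed against a static corpus, versus a REINFORCE-style policy update driven by reward that only arrives after acting. The toy vocabulary, environment, and learning rates are assumptions.

```python
import math
import random

# (a) The LLM pretraining signal: cross-entropy on the next token from a static corpus.
def next_token_loss(logits, target_id):
    # -log p(target token | context): "truth" is whatever a human wrote next.
    z = max(logits)
    log_norm = z + math.log(sum(math.exp(x - z) for x in logits))
    return -(logits[target_id] - log_norm)

# (b) An RL signal: act, observe the reward that actually arrives, and reinforce accordingly
#     (a one-step REINFORCE-style update on a tabular softmax policy).
def reinforce_update(theta, state, env_step, actions=(0, 1), lr=0.1):
    logits = [theta[(state, a)] for a in actions]
    z = max(logits)
    probs = [math.exp(x - z) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    a = random.choices(actions, weights=probs)[0]
    reward = env_step(state, a)                        # ground truth = what the environment did
    for i, b in enumerate(actions):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[i]
        theta[(state, b)] += lr * reward * grad_log_pi
    return reward

# Toy usage: a 3-token "vocabulary" for (a), and a 3-state bandit where action 1 pays off for (b).
print(next_token_loss(logits=[2.0, 0.5, -1.0], target_id=0))
theta = {(s, a): 0.0 for s in range(3) for a in (0, 1)}
for _ in range(500):
    reinforce_update(theta, random.randrange(3), lambda s, a: 1.0 if a == 1 else 0.0)
print({s: max((0, 1), key=lambda a: theta[(s, a)]) for s in range(3)})  # learned greedy action per state
```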
---
## “But LLMs use RL too, right?” The nuance
* **RLHF / RLAIF** fine-tunes LLMs with preference-based reward models (human or AI feedback). This does add an **RL loop**, but the reward is still **about text quality/behavior**, not about **state changes in the world**. It aligns style and helpfulness; it doesn’t supply the **physics or consequences** Sutton cares about. (A minimal sketch of this preference objective appears below.) ([arXiv][8])
* **Agentic LLMs**: when you couple LLMs to tools, browsers, or robots, you *approximate* ground truth by tying outputs to **observable outcomes** (task success, robot success rates). This **moves toward** Sutton’s desiderata but is **not** standard LLM pretraining. Examples include Google/DeepMind’s **SayCan** and **RT-2**, which use language models as high-level planners/controllers **grounded in robotic affordances** and sensorimotor feedback. ([SayCan][9])
**Bottom line:** Sutton’s critique targets the **core training regime** (next-token prediction on text), not the *possibility* of building **RL-grounded agents** that *use* language models. He expects **systems that learn from experience** to **supersede** large pretrained imitators. ([Dwarkesh][1])
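To see concretely why the RLHF reward is "about text," here is a minimal sketch of the Bradley–Terry-style preference objective commonly used to fit RLHF reward models: the model is trained so preferred text scores above rejected text, so the learned reward encodes human judgments of language, not state changes in an environment. The linear feature function, the data, and the finite-difference training step are invented for illustration.

```python
import math

# Illustrative reward model: a linear score over hand-made text features (purely illustrative).
def features(text):
    return [float(len(text.split())), float(text.count("please")), 1.0]

def reward(w, text):
    return sum(wi * xi for wi, xi in zip(w, features(text)))

def preference_loss(w, preferred, rejected):
    # Bradley-Terry-style objective: -log sigma(r(preferred) - r(rejected)).
    # The training signal is a human judgment about two pieces of text.
    margin = reward(w, preferred) - reward(w, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def sgd_step(w, preferred, rejected, lr=0.01, h=1e-5):
    # One gradient step on a single comparison; finite differences keep the sketch dependency-free.
    base = preference_loss(w, preferred, rejected)
    grads = []
    for i in range(len(w)):
        w_hi = w[:]
        w_hi[i] += h
        grads.append((preference_loss(w_hi, preferred, rejected) - base) / h)
    return [wi - lr * g for wi, g in zip(w, grads)]

w = [0.0, 0.0, 0.0]
for _ in range(200):
    w = sgd_step(w, "Sure, here is a careful, sourced answer.", "No.")
print(reward(w, "Sure, here is a careful, sourced answer."), reward(w, "No."))
```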
---
## Where his view meets counter-evidence or counter-arguments
1. **LLMs do encode a lot of world regularities.** Language is a lossy but vast record of the world; next-token models often **generalize** to causal/physical reasoning tests and can **simulate futures** in narrative form. Sutton would reply: without **acting** and **being corrected by outcomes**, this knowledge **isn’t anchored** to consequences. ([Dwarkesh][1])
2. **Hybrid systems increasingly succeed.** Vision-Language-Action models (e.g., **RT-2**) and frameworks like **SayCan** demonstrate that **text-trained models + embodied feedback** can do purposeful things in the world—precisely the direction Sutton champions (though he’d say: make **RL/experience** the foundation, not the add-on). (A schematic of SayCan-style selection follows this list.) ([Google DeepMind][10])
3. **“Reward is Enough” is debated.** Some argue that **scalar reward alone** is insufficient (multi-objective values, safety, ethics). So even on Sutton’s home turf, there’s an ongoing research conversation. ([arXiv][11])
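As an illustration of the grounding in point 2, here is a schematic of the SayCan-style selection rule: a language model scores how useful each skill description is for the instruction, a learned value function scores whether that skill is currently feasible (its affordance), and the agent executes the skill with the highest combined score. The helpers `llm_skill_score` and `affordance_value` are hypothetical stand-ins, not the real SayCan interfaces.

```python
from typing import Callable, Dict, List

def saycan_select(
    instruction: str,
    skills: List[str],
    llm_skill_score: Callable[[str, str], float],   # hypothetical: how useful the LM thinks the skill is
    affordance_value: Callable[[str], float],       # hypothetical: learned value fn, "can I do this now?"
) -> str:
    # SayCan-style combination: usefulness (language) x feasibility (grounded value function).
    scored: Dict[str, float] = {
        skill: llm_skill_score(instruction, skill) * affordance_value(skill)
        for skill in skills
    }
    return max(scored, key=scored.get)

# Toy usage with hard-coded stand-ins for the two scorers.
skills = ["find a sponge", "pick up the sponge", "go to the spill", "wipe the spill"]
llm = lambda instr, skill: {"find a sponge": 0.4, "pick up the sponge": 0.3,
                            "go to the spill": 0.2, "wipe the spill": 0.9}[skill]
affordance = lambda skill: 0.1 if skill == "wipe the spill" else 0.8   # can't wipe yet: no sponge in hand
print(saycan_select("clean up the spilled drink", skills, llm, affordance))
```

In this toy run the language model alone would pick "wipe the spill," while the grounded affordance estimate steers the agent to fetch a sponge first; that interplay is what "grounding in robotic affordances" buys.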
---
## Why he’s doubling down now (context)
* Sutton and Andrew Barto won the **2024 A.M. Turing Award** (announced in 2025) for their foundational work on RL, which puts a spotlight on **agents that learn from experience** versus **generative imitation**. In coverage, Sutton distinguishes **learning from people’s data** vs. **learning from one’s own life/experience**. ([AP News][12])
* His 2019 **Bitter Lesson** presaged today’s scaling-heavy paradigms, but in the recent interview he questions whether **LLMs are really “bitter-lesson-pilled,”** since they **inject human knowledge** and may hit **data limits** relative to open-ended **experiential learning**. ([Dwarkesh][1])
---
## Practical takeaways (for designing work, courses, or research)
1. **If you want “real-world” learning, give models consequences.**
   Build assignments and lab projects where agents **act** (in simulation or the physical studio) and receive **reward signals**—even simple ones. (Classic **Dyna**/TD setups scale from toy tasks to robots/simulated environments; a minimal environment-with-reward sketch follows this list.) ([ACM Digital Library][3])
2. **Treat LLMs as powerful priors, not end states.**
Use LLMs for **planning, language, and perception glue**, but **close the loop** with **environmental feedback** (robotics, web agents with success metrics, evaluators tied to measurable outcomes). ([SayCan][9])
3. **Teach the difference between imitation and experience.**
Side-by-side demos: (a) LLM answers “what to do” purely from text vs. (b) an RL agent that must learn by **trying, failing, and improving**. Students see why **ground truth via outcomes** matters. ([Incomplete Ideas][4])
4. **Make “goals” explicit.**
Even in classroom LLM projects, define **task-level rewards** (e.g., rubric-derived scoring proxies, automated checks, simulation returns) so systems are **optimizing** toward something beyond stylistic plausibility. (Note: RLHF optimizes **text preferences**, not **state changes**.) ([arXiv][8])
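A starting point for item 1 (and for the explicit rewards in item 4): a minimal, classroom-scale sketch of an environment that dispenses an explicit reward, plus an invented rubric-style automated check that turns a text answer into a scalar score. The hallway task, step cost, and rubric checks are all illustrative assumptions, not a prescription.

```python
import random

class GridHallway:
    """Minimal environment with explicit goals and consequences: reach the right end of a hallway."""
    def __init__(self, length=8):
        self.length = length
        self.reset()

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else -0.01      # small step cost makes "faster" part of the goal
        return self.pos, reward, done

def rubric_reward(answer: str) -> float:
    # Illustrative automated check (item 4): a rubric-derived scoring proxy for a text answer.
    checks = ["cites a link" if "http" in answer else "",
              "states a number" if any(c.isdigit() for c in answer) else ""]
    return sum(1.0 for c in checks if c) / len(checks)

env = GridHallway()
state, done, total = env.reset(), False, 0.0
while not done:                              # a random policy; students replace this with a learner
    state, r, done = env.step(random.choice((-1, +1)))
    total += r
print("episode return:", round(total, 2), "| rubric reward:", rubric_reward("See https://example.com, n=42."))
```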
---
## A short reading list (Sutton-centric + key debates)
* **Sutton & Barto, *Reinforcement Learning: An Introduction* (2nd ed., 2018)** – canonical RL text. ([Incomplete Ideas][4])
* **Sutton (1988), “Learning to Predict by the Methods of Temporal Differences.”** – TD learning foundations. ([Incomplete Ideas][2])
* **Sutton (1991), “Dyna: an integrated architecture for learning, planning, and reacting.”** – unifies model learning with planning. ([ACM Digital Library][3])
* **Sutton (2019), “The Bitter Lesson.”** – why general, compute-scaling methods win. ([Incomplete Ideas][5])
* **Silver, Singh, Precup, Sutton (2021), “Reward is Enough.”** – argues reward suffices to drive intelligence (plus **responses** critiquing scalar reward). ([ScienceDirect][6])
* **Dwarkesh Patel interview (2025)** – Sutton’s latest, explicit statements on **LLMs vs. RL**. ([Dwarkesh][1])
* **SayCan (2022), RT-2 (2023)** – examples of grounding language models in **robotic action**. ([arXiv][13])
---
## My synthesis
Sutton’s position is philosophically clean and technically coherent: **intelligence = goal-directed improvement from interaction.** LLMs, as trained today, **lack the action-consequence loop** that defines this kind of learning. The most compelling path forward is **hybrid**: keep the strengths of language models (prior knowledge, reasoning, interfaces) **but embed them inside agents that learn from experience**. That’s where current research in **robotics** and **tool-using web agents** seems to be heading—and it’s exactly the terrain Sutton has argued for since the beginning.
---
### Recent coverage (context)
* [AP News](https://apnews.com/article/83db773712dd3abccd21e3782d9059ec)
* [Financial Times](https://www.ft.com/content/d8f85d40-2c5b-4a2b-b113-87fa8e30f61b)
**Sources cited throughout:** Sutton interview (Sep 26, 2025), “The Bitter Lesson” (2019), RL textbook (2018), TD/Dyna papers, “Reward is Enough” (2021) and responses, RLHF/RLAIF papers, SayCan/RT-2 robotics results.
[1]: https://www.dwarkesh.com/p/richard-sutton "Richard Sutton – Father of RL thinks LLMs are a dead end"
[2]: https://incompleteideas.net/papers/sutton-88-with-erratum.pdf "Learning to predict by the methods of temporal differences"
[3]: https://dl.acm.org/doi/pdf/10.1145/122344.122377 "Dyna, an integrated architecture for learning, planning, and ..."
[4]: https://incompleteideas.net/sutton/book/the-book-2nd.html "Reinforcement Learning: An Introduction"
[5]: https://www.incompleteideas.net/IncIdeas/BitterLesson.html "The Bitter Lesson"
[6]: https://www.sciencedirect.com/science/article/pii/S0004370221000862 "Reward is enough"
[7]: https://www-formal.stanford.edu/jmc/whatisai.pdf "What is Artificial Intelligence - Formal Reasoning Group"
[8]: https://arxiv.org/abs/2203.02155 "Training language models to follow instructions with human feedback"
[9]: https://say-can.github.io/ "SayCan: Grounding Language in Robotic Affordances"
[10]: https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/ "RT-2: New model translates vision and language into action"
[11]: https://arxiv.org/abs/2112.15422 "Scalar reward is not enough: A response to Silver, Singh, Precup and Sutton (2021)"
[12]: https://apnews.com/article/83db773712dd3abccd21e3782d9059ec "AI pioneers who channeled 'hedonistic' machines win computer science's top prize | AP News"
[13]: https://arxiv.org/abs/2204.01691 "Do As I Can, Not As I Say: Grounding Language in Robotic ..."