---
title: 'World Models, From Zero to Hero'
---

# World Models, From Zero to Hero

*A field guide to one of the most overloaded terms in AI.*

---

## Preface

The phrase "world model" now refers to so many different things that the only safe assumption, when you encounter it, is that the speaker and the listener mean different things by it.

I started writing this because the noise had become hard to parse. In March 2026, Yann LeCun's AMI Labs closed a $1.03B seed to build world models. A few weeks earlier, Fei-Fei Li's World Labs raised another billion for world models. Google DeepMind's Genie 3 is a world model. NVIDIA Cosmos is a world model platform. Wayve's GAIA-2 is a world model for driving. Meta's V-JEPA 2 is a world model. OpenAI framed Sora as a world simulator. General Intuition builds world models from gaming data. Pete Florence at Generalist wrote a long post arguing that GEN-1 is *not* a world model, precisely because everyone is now calling theirs one.

Alexandre LeBrun, the CEO of AMI Labs, said it out loud to TechCrunch the week of the funding: *"In six months, every company will call itself a world model to raise funding."* It was a strange line from inside one of the companies doing exactly that, and the strangeness is the point. The label has become elastic enough that even the labs using it in good faith now have to explain, first, what they do *not* mean by it.

The problem is deeper than marketing. Strip away the companies that picked up the term loosely and the ones left are still building fundamentally different things. They differ on what the model predicts, in what space, conditioned on what, evaluated against what, and plugged into what downstream system. You can put them in the same category, but the members of that category have almost nothing mechanical in common.

This essay is an attempt to unbundle the term. It is written for people who read ML papers but do not live inside the subfield: engineers, researchers from adjacent areas, and anyone who wants to stop nodding along when "world model" appears in a sentence.

I will not pretend to be neutral. The representation-space camp, the lineage from LeCun's 2022 JEPA position paper through V-JEPA 2, strikes me as the most intellectually serious of the current bets, and I will say why. But I will steelman the other camps rather than dismiss them, because the field is genuinely uncertain and some of the strongest empirical results are coming from approaches I am skeptical of. If you read this and decide I am wrong, you will at least have a cleaner map to disagree with.

One note on sources. Most of the labs I cover have released papers, blog posts, or technical reports since 2024, and the specific claims here are traced back to those primary sources. Where I mention numbers (benchmark results, parameter counts, dataset sizes), I have tried to verify them against the original releases. Where I could not, I left them out.

---

## Part I: The Term Is Broken

In 1990, a graduate student at TU Munich named Jürgen Schmidhuber published a technical report with the title *Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments*. The title is a mouthful. The idea is simple. Train one neural network to predict the environment. Train a second to act inside it. Let the first network serve as a differentiable simulator the second can learn from. Related ideas appeared in Schmidhuber's work that decade, alongside planning inside learned representations and curiosity as an intrinsic training signal. A parallel argument came from the RL tradition a year later: Sutton's 1991 *Dyna*, which advocated for unifying learned environment models with action and planning. Much of the modern vocabulary was in place by the mid-1990s. The compute was not, and these directions sat for most of two decades.

It came back, with a name and a clean demonstration, in 2018. David Ha and Schmidhuber published *World Models* at NeurIPS. A variational autoencoder compressed raw pixels into a latent vector. A recurrent network predicted the next latent. A small controller chose actions. They trained the controller entirely inside the world model's *dream*, never letting it touch the real environment, and the policy still transferred when they plugged it back in. It worked on car racing. It worked on Doom. The title of the paper became the common name for the thing.

The image at the heart of that paper is the one the generative world model lineage has been chasing ever since: an agent learning to act by living inside a model of the world, rehearsing in imagination, then walking out the door competent. *Agents inside dreams.* Hold onto that image. Not every camp in this essay is downstream of it. The latent-space camp in particular (Part III, Camp 4) grew out of LeCun's self-supervised learning program of the 2010s, a different line of work, with its own predictive-representations thesis. The five camps below disagree about what should happen inside the agent's head. They do not all share the same intellectual ancestors.

If the term had kept the meaning Ha and Schmidhuber gave it in 2018, we would have a clean technical definition today: *a world model is a learned predictive model of environment dynamics, usable as a differentiable simulator for planning and policy learning.*

That definition still holds, more or less, in the academic literature. But in the eight years since, the term has been grabbed by at least five different research agendas, each solving a different problem, each with its own evaluation protocol, each with its own bet attached. Consider what gets called a world model today.

A video generation model like Sora produces a minute of high-fidelity video from a text prompt. You cannot interact with it. You cannot ask what happens if you push the cup that is sitting on the table. It generates *a* plausible future, not *the* future conditioned on your intervention. OpenAI's technical report titles itself *Video Generation Models as World Simulators* and cites Ha and Schmidhuber 2018 in its references. But the object is not the same object.

A spatial generation model like World Labs' Marble produces a persistent, navigable 3D scene from a text or image prompt. You can walk through it. You can export meshes. What you cannot do is act on it and watch dynamics unfold. Fei-Fei Li calls these "large world models," and she is making a coherent and interesting bet, but the bet is about spatial structure rather than temporal prediction, and it is not action-conditioned.

A latent-space predictive architecture like V-JEPA 2 is trained to predict future representations of a video, not pixels, in an abstract embedding space. You cannot look at its output, because there is no pixel output. You evaluate it by how well it transfers to downstream tasks like action anticipation, video question-answering, or robot planning. The whole point of the design is to *not* generate the kind of output the other camps are selling as their product.

A physics simulation platform like NVIDIA Cosmos is a collection of models, data pipelines, and tokenizers for generating synthetic video and training physical AI systems. It competes with Isaac Sim and with custom simulators, not with Sora or JEPA. NVIDIA calls its models "world foundation models." The term is honestly used; the category is orthogonal.

A generative simulator like DeepMind's Genie 3 generates navigable, action-conditioned environments frame by frame, in real time. You can walk through it, and the world responds to your inputs. This is closer to the classical definition than any of the others. But it still generates in pixel space, and whatever physics it has learned are emergent rather than principled.

These are not five angles on the same object. They are five different objects that share a name. They have different training data, different architectures, different evaluation metrics, different deployment paths, and different theories of what intelligence is. Treating them as a single category is not just sloppy. It hides where the interesting technical disagreements are.

The right question is not *which team has the best world model?* It is *what is a world model for?* A simulator you can train agents in? A representation learner whose outputs feed downstream systems? A data generator for physical AI? A creative tool? A scientific theory about intelligence itself? Each of those has a different right answer, and the answers do not compose.

Before we walk the camps, the technical definition has to carry some weight.

---

## Part II: What a World Model Actually Is

Start with the formal object. Two formulations dominate the literature.

The first is the classical reinforcement learning decomposition, which goes back to the POMDP literature of the 1990s and gets written down most cleanly in the Dreamer family of papers (Hafner et al., 2019 through 2025). A world model in this tradition decomposes into three learned components:

- a **transition model** $P(s_{t+1} \mid s_t, a_t)$: how the state evolves when you act,
- an **observation model** $P(o_t \mid s_t)$: how sensor readings are generated from state,
- a **reward model** $P(r_t \mid s_t)$: what outcomes are worth.

This is the workhorse formulation across most of the generative world model literature. It is probabilistic, it treats state and observation as distinct, and it couples prediction to reward.
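If the probability notation obscures the shape of the object, here is the same decomposition as a bare interface. A minimal sketch, with names and signatures of my own choosing rather than any particular paper's API:

```python
# The classical three-component decomposition as an interface sketch.
# Names and signatures are illustrative, not any specific library's API.
from typing import Protocol
import numpy as np

class TransitionModel(Protocol):
    def sample(self, s: np.ndarray, a: np.ndarray) -> np.ndarray:
        """Draw s_{t+1} from P(s_{t+1} | s_t, a_t): how state evolves when you act."""
        ...

class ObservationModel(Protocol):
    def sample(self, s: np.ndarray) -> np.ndarray:
        """Draw o_t from P(o_t | s_t): how sensor readings arise from state."""
        ...

class RewardModel(Protocol):
    def expected(self, s: np.ndarray) -> float:
        """E[r_t | s_t]: what outcomes are worth."""
        ...
```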

The second formulation is Yann LeCun's, from his 2022 position paper *A Path Towards Autonomous Machine Intelligence* and clarified in a February 2024 LinkedIn post titled *"Lots of confusion about what a world model is. Here is my definition."* LeCun writes:

> **Given**: an observation $x(t)$, a previous estimate of the state of the world $s(t)$, an action proposal $a(t)$, and a latent variable proposal $z(t)$.
>
> **A world model computes**:
> - representation: $h(t) = \mathrm{Enc}(x(t))$
> - prediction: $s(t+1) = \mathrm{Pred}(h(t), s(t), z(t), a(t))$
>
> where $\mathrm{Enc}$ is an encoder and $\mathrm{Pred}$ is a state predictor, both trainable deterministic functions. The latent variable $z(t)$ represents "the unknown information that would allow us to predict exactly what happens."

Three things distinguish LeCun's formulation. First, there is no reward model inside the world model: reward, cost, and intrinsic motivation live in separate modules of the broader architecture. Second, the predictor is deterministic, with stochasticity handled through the latent variable $z(t)$, which parameterizes the set of plausible futures. Third, and most importantly for what follows in this essay, the prediction happens over $s(t+1)$, which is a *representation*, not necessarily an observation. Whether to decode $s(t+1)$ back into pixels is a separate question. In the JEPA family it is explicitly not decoded.
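Reduced to a toy, LeCun's formulation looks like the sketch below. The linear maps, dimensions, and tanh nonlinearities are placeholders of mine, standing in for large networks; the structure (a deterministic encoder and predictor, with stochasticity carried entirely by the latent proposal $z(t)$) is the point.

```python
# Toy sketch of LeCun's formulation: deterministic Enc and Pred, with the
# unpredictability of the world pushed into the latent proposal z(t).
# The linear maps below are placeholders for large networks.
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_REP, D_STATE, D_ACT, D_Z = 64, 16, 16, 4, 8

W_enc = 0.1 * rng.standard_normal((D_REP, D_OBS))
W_pred = 0.1 * rng.standard_normal((D_STATE, D_REP + D_STATE + D_Z + D_ACT))

def enc(x: np.ndarray) -> np.ndarray:
    """h(t) = Enc(x(t)): map the observation into a representation."""
    return np.tanh(W_enc @ x)

def pred(h: np.ndarray, s: np.ndarray, z: np.ndarray, a: np.ndarray) -> np.ndarray:
    """s(t+1) = Pred(h(t), s(t), z(t), a(t)): deterministic state predictor."""
    return np.tanh(W_pred @ np.concatenate([h, s, z, a]))

# Sampling different z proposals yields different plausible futures for the
# same observation and action: a set of futures, not a single pixel target.
x_t, s_t, a_t = rng.standard_normal(D_OBS), np.zeros(D_STATE), rng.standard_normal(D_ACT)
futures = [pred(enc(x_t), s_t, rng.standard_normal(D_Z), a_t) for _ in range(3)]
```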

I'll use LeCun's formulation as the reference point in the rest of this essay. The Dreamer-style probabilistic decomposition is equivalent for most practical purposes, and you will see it in most papers. Where the two formalisms differ in ways that matter for the argument, I will flag it.

Three properties distinguish a world model, under either formulation, from the adjacent objects that get called world models.

**First, it is action-conditioned.** The prediction is explicitly conditioned on $a_t$. A world model answers the question "what happens if I do X?" This is the difference between a video model and a simulator. A video model predicts the next frame given the past; a world model predicts the next state given the past *and the intervention*. Sora can generate a jeep driving through dust. It does not accept a per-timestep control signal that would let you steer it once the generation has started. Genie 3 does, because it takes the user's input (WASD, mouse) as part of its conditioning at every frame. That distinction is not a detail. It is the definition.

**Second, it is causal rather than correlational.** A video model trained to maximize the likelihood of real-world video will learn that when a glass falls, it often shatters. It will not necessarily learn that the shattering was *caused* by the falling. The test for this is intervention under distribution shift: put the glass in a situation you never saw in training (say, falling onto a trampoline) and see whether the model predicts what physics says should happen (bouncing) or what its training data told it happens (shattering).

A 2024 paper from ByteDance Seed, *How Far Is Video Generation from World Model: A Physical Law Perspective*, ran exactly this kind of controlled study. The authors are themselves at a major video generation lab, which makes the finding harder to wave away. They showed that video models, including the state-of-the-art diffusion architectures, do not pass the test. Given unseen combinations of color, size, velocity, and shape, the models case-match the closest training example rather than applying physical laws. They generalize in-distribution, they degrade combinatorially, and they fail out-of-distribution. Scaling did not fix it in the range they tested.

**Third, it is consistent over multiple steps.** A world model has to close the loop on itself. Its prediction at time $t+1$ becomes the input to its prediction at time $t+2$, and errors compound. This is where most world models die. Genie 3 maintains consistency for "a few minutes" according to DeepMind. GAIA-2 does it for longer, in the constrained domain of driving. Dreamer 4 does it over sequences of twenty thousand or more actions in Minecraft, which is the current state of the art for long-horizon imagination. Most video models degrade within seconds of open-loop rollout, and none of them claim otherwise.

Put these three properties together and you have a clean membership test. A model is a world model if it (a) takes actions as input, (b) produces future states or observations as output, and (c) remains coherent over rollouts long enough to be useful for planning or agent training. Everything that fails one of those tests is something else: a video generator, a 3D scene generator, a representation learner, an infrastructure layer. Those things can be useful. They can even be necessary ingredients. But they are not world models, and calling them world models erases the distinction that makes the concept work in the first place.
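The third property is the one you can measure mechanically: run the model open loop on its own predictions and track how fast it drifts from a reference trajectory. A toy sketch, where `model_step` and `true_step` are hypothetical stand-ins for a trained predictor and a ground-truth environment:

```python
# Measuring multi-step consistency: feed the model its own predictions (open
# loop) and track drift against a reference trajectory. The two step functions
# are hypothetical stand-ins; the slightly-wrong model shows how even a small
# dynamics error compounds with horizon.
import numpy as np

def rollout_drift(model_step, true_step, s0, actions):
    """Per-step error of an open-loop model rollout against the reference."""
    s_model, s_true, errors = s0.copy(), s0.copy(), []
    for a in actions:
        s_model = model_step(s_model, a)   # model's own prediction fed back in
        s_true = true_step(s_true, a)      # reference trajectory
        errors.append(float(np.linalg.norm(s_model - s_true)))
    return errors

rng = np.random.default_rng(0)
true_step = lambda s, a: 0.95 * s + a
model_step = lambda s, a: 0.97 * s + a     # slightly wrong dynamics
actions = [0.1 * rng.standard_normal(3) for _ in range(50)]
drift = rollout_drift(model_step, true_step, np.ones(3), actions)  # grows with horizon
```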

### The fork in the road: state, not pixels

There is one more piece of the formal definition that turns out to do most of the work in the rest of this essay. It is the distinction between state $s(t)$ and observation $x(t)$.

In LeCun's formulation this distinction is built in: the encoder $\mathrm{Enc}$ maps observations into a latent state, and the predictor operates on that state. In the classical formulation the same structure appears as the gap between transition model $P(s_{t+1} \mid s_t, a_t)$ and observation model $P(o_t \mid s_t)$. Either way, a world model operates in state space. The state can be an abstract embedding the model has learned. Whether to *render observations at all*, to produce pixels at the end of the pipeline, is a separate engineering decision from the decision to do world modeling.

This is the fork in the road. The whole field splits along it.

On one side: pixels are the point. You should render observations because you need to see what the model thinks will happen, both for human evaluation and as a training signal that forces the model to predict the world rather than abstract around it. Without pixels, you have no idea whether the model has learned anything.

On the other side: pixels are a distraction. They force the model to spend capacity on high-entropy details (carpet textures, ripples on water, the exact pattern of leaves) that have nothing to do with intelligence. The right thing to predict is the *representation* of the next observation, in some abstract embedding space, and never bother decoding back to image space.

Both sides think the other side is confused. Neither side is wrong about what they are building. They are building different things, and the line between them is the cleanest signal of which camp a given lab belongs to. Hold this distinction in mind. The five camps in Part III sort along it.

With that in place, we can walk the taxonomy.

---

## Part III: The Five Camps

### Camp 1: Video Generation as "World Simulation"

The canonical statement is OpenAI's February 2024 technical report, *Video Generation Models as World Simulators*. The argument runs roughly like this. Sora is trained at scale on internet video. It generates coherent minute-long clips with emergent 3D consistency and object permanence. Therefore it has implicitly learned to simulate the physical world, and scale will close whatever gaps remain. The report's authors are careful; they acknowledge Sora does not correctly model glass shattering or bitten food. The thesis is that these are training-data problems rather than architectural ones.

Roughly the same position is held by Runway (GWM-1, Gen-4.5), by Kuaishou (Kling), and by the ByteDance and Chinese lab ecosystem (Seedance, Hailuo). Google's Veo 3 sits adjacent; Veo has a real claim to emergent physics understanding, and Genie 3 is explicitly built on top of Veo. The labs differ in emphasis but share the core commitment: train a large enough generative video model on a broad enough dataset, and you get a system that models the world well enough to serve as the foundation for downstream physical AI.

The argument is worth steelmanning before disagreeing with, because its strongest form is not "pixels are the goal." It is that *prediction is the universal objective*, and video is the most data-dense signal we have about the physical world. If you can accurately predict the next frame of video, you must have learned something about physics, object permanence, lighting, material properties, and agent behavior, because all of those things constrain what the next frame can be. Video generation is not the product on this view. It is a training objective that happens to produce a viewable artifact. The video is a byproduct of learning.

There is something to this. Sora did learn *something* about 3D structure and motion. Its outputs exhibit regularities that require the model to have internalized consistency across viewpoints and over time. The interpretability literature on smaller diffusion models suggests that linearly decodable representations of geometry, depth, and motion emerge in the model's internal activations, and it would be surprising if Sora did not have some version of these.

The problem is the gap between "has some representations of physics" and "can be used as a world simulator." That gap shows up in three places.

First, interactivity. Sora does not take actions as input. You can prompt it with text, but a text prompt is not an intervention; it is a description of a scenario. To convert Sora into something action-conditioned, you would have to train the model to accept a control signal at each timestep and generate the resulting next frame. This is what Genie 3 does, and it is the reason Genie 3 is a world model while Sora is not, despite Sora being much larger and generating higher-fidelity video. Interactivity is not a feature you bolt on. It is a different training objective that produces a different model.

Second, evaluation. Video models are scored on FVD, FID, aesthetic quality, user studies. Those metrics tell you how good the videos look. They tell you very little about whether the model correctly predicts physical consequences. The 2024 paper I mentioned above built a controlled test specifically to measure physical prediction: a 2D simulator governed by classical mechanics, with unlimited training data available. Diffusion video models trained on this data showed perfect in-distribution generalization, measurable but imperfect combinatorial generalization, and complete failure out-of-distribution. When asked to predict a new combination of features, the models prioritized color over shape over size over velocity. That prioritization has nothing to do with physics and everything to do with pixel statistics. This is not a damning result for video generation as a creative tool. It is a damning result for video generation as world simulation.

Third, and most importantly for the argument about scale: the paper showed that more data did not close the gap. Scaling produced tighter in-distribution fit, not better out-of-distribution extrapolation. The models case-matched their training set more precisely rather than inferring underlying laws. If the "scale closes the gap" argument were correct, we would expect OOD performance to improve with scale. It did not, at least in the regime tested.

LeCun's reply to the Sora announcement captured the asymmetry as sharply as anyone has: *"The generation of mostly realistic-looking videos from prompts does not indicate that a system understands the physical world. Generation is very different from causal prediction from a world model. The space of plausible videos is very large, and a video generation system merely needs to produce one sample to succeed."*

That is the asymmetry. A world model has to produce predictions that match what actually happens. A video generation model only has to produce predictions that are *plausible*. The hypothesis space for the latter is astronomically larger, and optimizing for plausibility does not force the model to be right. It only forces the model to be believable.

Video generation is a real technology with real uses. Training video models at scale may well produce representations that serve as useful initializations for actual world models. But the video models themselves, as of this writing, are not world models in the technical sense, and treating them as such has cost the field something in clarity.

### Camp 2: Spatial Intelligence and 3D Scene Generation

World Labs is the flagship here. Fei-Fei Li's thesis, stated cleanly in her 2025 manifesto on spatial intelligence, is that real intelligence is grounded in an understanding of 3D space (where things are, how they move, what they afford) and that the current generation of language and image models is fundamentally limited because they operate on 2D projections of a 3D world. If language models teach machines to read and write, spatial intelligence models should teach them to see and build. Their first product, Marble, generates and edits persistent 3D environments from text, images, video, or 3D layouts. It exports to Gaussian splats and meshes. It is a real product with real users.

This is a serious bet, and it is worth being precise about why I keep it separate from world models proper.

The core object Marble produces is a 3D scene, not a dynamical system. You can walk through the scene, rotate the camera, export it as a mesh. What you cannot do is act on it and watch it evolve. If you throw a ball in a Marble environment, nothing happens, because Marble does not model what would happen. The persistence Marble gives you is *spatial* (the wall is still there when you turn around) not *temporal* (the coffee is cooling). Those are different axes.

That is not a gap World Labs is racing to close. It is a design choice. Their own writing frames interactive dynamics as a future direction rather than the current product. They are attacking the spatial problem first, on the reasonable hypothesis that you cannot build a world model without solid 3D grounding, and that 3D is undersolved and worth solving in its own right.

Whether the path from spatial generation to action-conditioned world modeling is a straight line is an open question. It might be. Generating a 3D scene that is physically plausible, one where objects rest on surfaces, lighting makes geometric sense, occlusions resolve correctly, requires the model to have internalized a great deal about how the physical world is structured. If you then add a physics engine on top, or train a second model to predict how the scene evolves given actions, you might get a world model out the other end. World Labs appears to be moving in this direction.

But spatial intelligence and world modeling are not the same problem, and it is worth keeping the distinction clean even when a single company works on both. You can have excellent 3D generation without a world model (Marble today). You can have a world model without explicit 3D scene reconstruction (most of the rest of this essay). The two might converge. They have not yet.

### Camp 3: Generative World Models

This is where the interesting action has been since 2023. Generative world models commit to the full classical picture: they take actions as input, they predict future states, they render those states back into pixels or tokens that a human or agent can observe, and they close the loop to support rollouts over time. They *look like* what you picture when you hear "world model."

The modern lineage runs Ha and Schmidhuber 2018, DreamerV2 (2020), IRIS (2022), GAIA-1 (2023), DIAMOND (2024), Genie 2 and 3 (2024 to 2025), GAIA-2 (2025), Dreamer 4 (2025). The specifics differ, but the recipe is consistent. Compress raw observations into a latent space with an encoder. Predict the next latent conditioned on action. Decode back to pixels when you need to show a human, or feed a vision-based agent. Handle stochasticity by producing distributions over futures rather than point estimates.
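Schematically, the recipe looks like the sketch below. Every map is a placeholder linear layer standing in for a large network and the dimensions are invented; the point is the shape of the loop, including the decoder that the next camp will argue against.

```python
# Schematic of the generative world model recipe: encode pixels to a latent,
# predict the next latent conditioned on the action, decode back to pixels
# when a human or a vision-based agent needs to look. All maps are placeholder
# linear layers; dimensions are invented.
import numpy as np

rng = np.random.default_rng(0)
D_PIX, D_LAT, D_ACT = 3 * 64 * 64, 32, 4

W_enc = 0.01 * rng.standard_normal((D_LAT, D_PIX))
W_dyn = 0.1 * rng.standard_normal((D_LAT, D_LAT + D_ACT))
W_dec = 0.1 * rng.standard_normal((D_PIX, D_LAT))

encode = lambda pixels: np.tanh(W_enc @ pixels)
predict = lambda z, a: np.tanh(W_dyn @ np.concatenate([z, a]))
decode = lambda z: W_dec @ z    # render a frame for humans or a pixel-based agent

# Imagined rollout: dynamics stay in latent space, frames are decoded to show.
z = encode(rng.standard_normal(D_PIX))
frames = []
for _ in range(8):
    z = predict(z, rng.standard_normal(D_ACT))
    frames.append(decode(z))
```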

Architectural choices matter, and they have been churning. Early work used discrete tokens and autoregressive transformers (IRIS, Genie 1, GAIA-1), which inherit scaling laws from language modeling and handle multimodal futures naturally through token-level sampling. Later work moved toward diffusion (DIAMOND, GAIA-2), which produces higher-fidelity outputs and handles multimodal futures by denoising from noise rather than sampling from a categorical distribution. Dreamer 4 introduced a training objective called "shortcut forcing," a flow-matching extension that lets the model produce clean outputs in 4 steps instead of the 64 typical of diffusion, making real-time inference on a single GPU possible. Genie 3 stays closer to the original Genie recipe, a spatiotemporal VQ-VAE tokenizer with an autoregressive transformer on top, pushed to real-time interactivity and multi-minute coherence. There is no consensus architecture yet.

The best work in this camp is genuinely impressive. Dreamer 4 is the first agent to collect diamonds in Minecraft purely from offline data: no environment interaction during training, more than twenty thousand actions per episode, outperforming OpenAI's VPT agent with a hundred times less data. The world model was accurate enough that reinforcement learning inside it produced policies that transferred to the real game. This is the Ha and Schmidhuber thesis made to work at scale, agents learning inside their own dreams, on a task hard enough that a lot of people had written off the approach.

Genie 3 demonstrates real-time interactive generation of novel environments at 720p and 24 frames per second, with consistency maintained over minutes. Walk toward a wall and it stays where it should. Turn away from a tree and it is still there when you turn back. That kind of stability across viewpoints had been "nearly working" in world model research for years.

Wayve's GAIA-2 generates multi-camera driving scenes with fine-grained control over ego-vehicle dynamics, agent behavior, and environmental factors. This is the kind of targeted scenario generation that serves closed-loop AV validation, the long tail of situations you cannot collect enough of from real roads, like sudden cut-ins, emergency braking, pedestrians stepping off curbs. GAIA-2 generates them on demand.

All three are working systems, used in production or near it. When I talk about world models I take seriously, this is the category I mean.

The question I have about generative world models is the same question I have about video models, posed in a different register. When Dreamer 4 gets the diamond in Minecraft, what has it learned? Has it learned Minecraft's game mechanics, the block lattice, the crafting rules, the enemy behavior, in a form that would generalize to a modified Minecraft it had never seen? Or has it learned a good interpolator over the 2,500 hours of human gameplay it was trained on? The paper makes claims about generalization, and the offline-only result is striking, but the underlying question is open.

The same question applies to Genie 3. It generates environments that look physical. It maintains object permanence across occlusions. Does it know that glass shatters, or does it know that glass-shattering-videos tend to follow glass-falling-videos in its training set? Nobody knows the answer. The interpretability work needed to answer it has not been done at the scale of these models, and until it has, judgments about how much physics these systems have learned are partly aesthetic.

My read of this camp: it is producing the most spectacular demos in the field, the demos are not fraudulent (the models really do what they claim), and the underlying scientific question of whether they have learned physics or sophisticated pattern matching is still open. If they have learned physics, this is the winning approach. If they have learned pattern matching, they will hit a ceiling that scale will not crack. Both outcomes are consistent with the current evidence.

What tilts me toward skepticism, and toward the next camp, is an information-theoretic objection LeCun has been making for years. I think it is underappreciated.

### Camp 4: Latent-Space World Models and JEPA

The argument against pixel prediction is not that it doesn't work. It is that it wastes capacity on the wrong thing.

Consider what a diffusion video model is trained to minimize: some distance (L2, perceptual, whatever) between the predicted pixel distribution and the true next frame's pixel distribution. The model is penalized for getting any pixel wrong. But most pixels in a video are either unchanging across frames, or unpredictable in their fine detail (the exact pattern of leaves in wind, the precise texture of a carpet, the stochastic ripples on a pond), or only trivially related to the semantic content of the scene. The model spends enormous capacity learning to render these unpredictable details correctly, because that is what the loss is asking it to do.

Here is LeCun's framing: *"The world is unpredictable. If you try to build a generative model that predicts every detail of the future, it will fail."* More precisely, it will not fail on training data. It will get the unpredictable details statistically right, producing plausible-looking carpet texture. It just will not have learned anything useful about physics, causality, or planning from that effort. The carpet texture is noise. The model is spending capacity fitting noise.

The alternative is to predict in representation space. Don't ask the model to predict pixels. Ask it to predict the *representation* of the next observation. Train a separate encoder that takes in future frames and produces an embedding. Train a predictor that, given the past and an action, predicts that embedding. The loss is distance in embedding space, not pixel space. The carpet texture, which is high-entropy in pixel space, collapses to a low-entropy vector in a well-designed embedding. The model can ignore it and focus on what changes: the motion, the interactions, the causal structure.

This is the Joint Embedding Predictive Architecture, or JEPA. LeCun introduced it in his 2022 position paper *A Path Towards Autonomous Machine Intelligence*, which is worth reading in full. The paper is a sixty-page manifesto covering his entire vision for autonomous systems: a configurator that sets goals, a world model that predicts consequences, an actor that chooses actions, a critic that evaluates outcomes, and hierarchies of abstraction at multiple timescales. JEPA is the centerpiece because it is the component that had no existing implementation.

The first working instantiations were I-JEPA (images, 2023) and V-JEPA (video, 2024), both from Meta's FAIR lab. V-JEPA 2, released in June 2025, is the one I consider the strongest evidence to date that the approach is not just theoretically elegant but practically competitive.

V-JEPA 2 is pre-trained on over a million hours of internet video with a self-supervised masked prediction objective. No labels. No text. The model learns to predict the representations of masked spatiotemporal patches, given context patches, in an abstract embedding space produced by a second encoder that is updated via exponential moving average. This is the basic JEPA setup: context encoder, target encoder, predictor, distance loss in representation space, collapse prevented by the teacher-student asymmetry.
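A minimal sketch of that setup, with the architectures, sizes, and masking scheme collapsed to placeholders of my own choosing. The structural pieces are the ones that matter: a student encoder and predictor trained by gradient descent, a teacher encoder updated only by exponential moving average, and a loss measured in embedding space rather than pixel space.

```python
# Minimal JEPA-style training step: student encoder + predictor, EMA teacher,
# distance loss in representation space. Sizes, architectures, and the masking
# scheme are placeholders, not V-JEPA 2's actual ones.
import copy
import torch
import torch.nn as nn

D_IN, D_REP = 256, 64
student_enc = nn.Sequential(nn.Linear(D_IN, 128), nn.GELU(), nn.Linear(128, D_REP))
predictor = nn.Sequential(nn.Linear(D_REP, 128), nn.GELU(), nn.Linear(128, D_REP))
teacher_enc = copy.deepcopy(student_enc)       # EMA copy: no gradient updates
for p in teacher_enc.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(student_enc.parameters()) + list(predictor.parameters()), lr=1e-4)

def train_step(context_patches, target_patches, ema=0.998):
    pred = predictor(student_enc(context_patches))   # predict the target's representation
    with torch.no_grad():
        target = teacher_enc(target_patches)         # a representation, never pixels
    loss = nn.functional.mse_loss(pred, target)      # distance in embedding space
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                            # slow-moving teacher resists collapse
        for ps, pt in zip(student_enc.parameters(), teacher_enc.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return float(loss)

loss = train_step(torch.randn(32, D_IN), torch.randn(32, D_IN))
```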

After pretraining, the model is competitive with or state-of-the-art on a range of benchmarks. Motion understanding on Something-Something v2. Human action anticipation on Epic-Kitchens-100. Video question answering when aligned with an LLM. None of those required the model to generate pixels. The representations themselves, consumed by downstream heads, carried enough information about motion, objects, and physics to support strong task performance.

The part that made people pay attention is the robotics result. Meta fine-tuned V-JEPA 2 on just 62 hours of unlabeled robot videos from the DROID dataset, adding an action-conditioned predictor on top of the frozen encoder. This produced V-JEPA 2-AC, which was deployed zero-shot on Franka arms in two labs that had never contributed data to DROID. The robot was given a goal image and asked to plan. It used the model to predict the representations of imagined future states, the cross-entropy method to search for actions minimizing the distance between predicted and goal representations, executed the first action, replanned from the new observation, and iterated.

It worked. Pick-and-place on novel objects in unseen environments, 65 to 80 percent success rate, no task-specific training. The planning loop ran at about 16 seconds per step; the Cosmos-based pixel-space baseline took roughly 4 minutes per step for the same task. Roughly fifteen times faster.
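The planning loop itself is simple enough to sketch. Below, `predict` is a hypothetical stand-in for a trained action-conditioned predictor; the cross-entropy method is just iterative sampling and refitting of action sequences, scored by distance to the goal embedding, with only the first action executed before replanning.

```python
# Cross-entropy-method planning in representation space: sample action
# sequences, score them by distance between the predicted future embedding and
# the goal embedding, refit the sampling distribution to the best ones, repeat.
# `predict` below is a fake linear dynamics model standing in for a trained one.
import numpy as np

rng = np.random.default_rng(0)
D_REP, D_ACT, HORIZON = 32, 7, 3

def rollout_cost(z0, action_seq, z_goal, predict):
    z = z0
    for a in action_seq:
        z = predict(z, a)                             # imagined futures, never decoded
    return float(np.linalg.norm(z - z_goal))          # distance to goal representation

def cem_plan(z0, z_goal, predict, iters=5, pop=256, elite=32):
    mean, std = np.zeros((HORIZON, D_ACT)), np.ones((HORIZON, D_ACT))
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, HORIZON, D_ACT))
        costs = np.array([rollout_cost(z0, s, z_goal, predict) for s in samples])
        elites = samples[np.argsort(costs)[:elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean[0]                                    # execute first action, then replan

W = 0.1 * rng.standard_normal((D_REP, D_REP + D_ACT))
predict = lambda z, a: np.tanh(W @ np.concatenate([z, a]))
a0 = cem_plan(rng.standard_normal(D_REP), rng.standard_normal(D_REP), predict)
```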

The data efficiency is the headline number. 62 hours is not nothing, but it is nothing compared to the hundreds of thousands of hours of teleoperation data that behavior-cloning approaches have been consuming, and it is nothing compared to the internet-scale pixel-space world models. The argument for JEPA is not that it produces prettier outputs. It is that if your representations are doing the right work, you don't need as much downstream data to get competent behavior.

This is the bet AMI Labs is making, and increasingly the bet a number of academic labs are making, helped by recent theoretical work. LeJEPA, published in November 2025, replaces the EMA-teacher heuristic with explicit distribution-matching regularization and claims a mathematically principled way to train these architectures without the usual bag of tricks.

This is the camp I find most convincing. My reasons:

**The objective is epistemically correct.** Intelligence is about predicting what matters. Pixel prediction conflates what matters with what is visually present. Representation-space prediction lets the system decide, through the learning dynamics of the encoder, what is worth keeping and what is noise. This aligns with what we know about biological perception: early visual cortex does something that looks much more like predictive encoding of meaningful features than pixel-level reconstruction.

**The data efficiency argument is empirical, not just theoretical.** V-JEPA 2's 62-hour fine-tune to zero-shot robotics is the kind of result that would have seemed implausible two years ago. One data point, yes, and one should not over-extrapolate from a single lab's result. But it is the right *kind* of data point, the behavior one would predict if the theory were correct.

**The approach is compatible with the broader architecture of autonomous systems.** LeCun's 2022 vision (configurator, perceptor, world model, actor, critic, hierarchies of abstraction) requires a world model that operates in latent space, because planning over pixel-space rollouts at multiple levels of temporal abstraction is computationally intractable. You cannot do hierarchical planning in pixel space. If the long-run goal is autonomous machine intelligence, JEPA is the component you need.

The counterarguments I take seriously:

**JEPA is hard to evaluate and hard to iterate on.** When you generate pixels, you can look at the output and see if it makes sense. When you predict embeddings, you cannot. You are debugging a scalar loss and some downstream task metrics, without the human-interpretable intermediate artifact. That matters for iteration speed, which matters enormously in modern ML. The generative camp's iteration advantage is not theoretical; it is a real reason they are shipping faster.

**Collapse is a real problem.** The naive JEPA objective has a trivial solution: output the same embedding for every input, loss goes to zero, representations are useless. Every JEPA variant has an anti-collapse mechanism (EMA teachers, VICReg, stop-gradient, LeJEPA's distribution matching) and those mechanisms are finicky. The theoretical understanding of *why* they work is still developing. This is a real drag on the approach.

**Existence proofs for JEPA at scale are thinner than for generative world models.** V-JEPA 2 is impressive but it is one model from one lab. Dreamer 4, Genie 3, GAIA-2 are all shipping at scale right now, producing results other labs are reproducing. The generative camp has more miles on it. That is not proof they will win, but it is data.

**The "pixels are wasteful" argument might be less strong than it looks.** Modern generative models do not fit every pixel equally. Diffusion models learn multiscale representations; autoregressive transformers compress visual tokens. The effective objective after enough training may be closer to predicting the predictable parts of the image than the naive loss suggests. If that is true, the information-theoretic argument against pixel prediction is weaker than LeCun presents it.

Weighing all of this: JEPA is the most intellectually correct bet, and also one of the hardest to cash. It is where I expect the most important long-run progress to come from, and I would not be surprised if the next two years of shipped products come disproportionately from the generative camp while JEPA continues to mature. Research timelines do not always align with deployment timelines.

One thing the JEPA camp has not yet done publicly, and will need to do, is close the agent loop. V-JEPA 2 shows planning for short-horizon manipulation. It does not yet show extended behavior, online learning, or the sustained task performance Dreamer 4 demonstrated in Minecraft. AMI Labs is presumably working on this. Until it lands, the generative camp has the stronger end-to-end demonstrations, even if the latent camp has the stronger theoretical story.

### Camp 5: Infrastructure and Orthogonal Paradigms

Two things belong in this camp for different reasons.

**NVIDIA Cosmos** is an infrastructure play. Launched at CES 2025, it is a platform rather than a single model: a video data curation pipeline (claimed to process 20 million hours of video in 14 days on Blackwell), a family of open-weight world foundation models (Cosmos Predict, Cosmos Transfer, Cosmos Reason), a tokenizer, and fine-tuning tools. The WFMs themselves come in two architectural families, diffusion-based and autoregressive, trained on 9,000 trillion tokens from 20 million hours of real-world video spanning driving, industrial, and robotics data.

Cosmos is not trying to be the best world model. It is trying to be the platform on which other people build their world models. Jensen Huang's framing at CES was explicit: the ChatGPT moment for robotics is coming, and Cosmos is the infrastructure layer meant to democratize physical AI development. Early adopters include 1X, Agility, Figure, Skild, Waabi, and Uber. The bet is that world models, whoever builds them and whatever architecture they use, will need massive amounts of curated synthetic data, and NVIDIA should be the vendor for that data and the GPUs that generate it.

Cosmos is not a world model in the technical sense, and NVIDIA does not really claim it is; they call it a *platform* for world foundation models. But it is an important piece of the landscape because it is where other labs actually get things done. V-JEPA 2 used Cosmos as a baseline in its robotics evaluations. Many of the generative world model labs use Cosmos components somewhere in their stack. If you are trying to build in this space and you are not thinking about what NVIDIA is providing, you are missing half the picture.

The second thing in this camp is **active inference**, associated with Karl Friston's free-energy principle and commercialized at VERSES. I am flagging it not because I think it is going to win, but because it is the most coherent non-deep-learning alternative in the space, and ignoring it would leave a gap in the taxonomy.

Active inference is a theory from computational neuroscience. The claim is that biological systems act to minimize *variational free energy*, a quantity that upper-bounds the surprise they experience. Instead of maximizing reward, an agent tries to minimize the gap between its predictions and its observations, either by updating its model (perception) or by acting to make observations match predictions (action). Goals are represented as prior preferences; exploration emerges naturally because resolving uncertainty reduces expected free energy. VERSES built AXIOM on this foundation: a structured generative model where each entity is a discrete object with typed attributes and relations, inference via Bayesian message passing rather than gradient descent. It is interpretable and data-efficient. Whether it scales the way transformers have scaled is an open question, and the active inference community has been working for over a decade without producing anything close to the empirical impact of V-JEPA 2 or Dreamer 4 or Genie 3. But the conceptual clarity is real, and if deep learning hits a wall, this is one of the places the field might look.

I include it as a completeness check on the taxonomy, not as a bet.

### The five camps, side by side

The taxonomy in one table, before moving on:

| Camp | Predicts in | Action-conditioned? | Headline example | What it's actually for |
|---|---|---|---|---|
| Video generation | Pixel space | No | Sora, Veo 3, Kling | Creative tool; possibly a representation prior |
| Spatial / 3D | 3D scene space | No | World Labs Marble | Persistent navigable scenes; design and review |
| Generative WM | Pixel or token space | Yes | Genie 3, Dreamer 4, GAIA-2 | Agent training inside imagination, scenario generation |
| Latent (JEPA) | Embedding space | Yes (V-JEPA 2-AC) | V-JEPA 2 | Representation learning, sample-efficient planning |
| Infrastructure | (Platform / non-DL) | (N/A) | NVIDIA Cosmos, AXIOM | Picks-and-shovels; alternative computational substrate |

Two columns matter most. *Action-conditioned?* separates what is a world model from what is something else with an aspirational name. *Predicts in* separates the field's two real philosophical tribes: the people who believe rendering pixels is necessary, and the people who believe rendering pixels is harmful.

---

## Part IV: Agents Inside the Dream

World models are the substrate. They are not the product.

The product is an agent that can do something useful: drive a car, fold laundry, navigate a warehouse, play a game, operate a drone. Almost every lab in the world-model space is ultimately trying to build a training environment or planning substrate for agents rather than a standalone world-generation product. This is not obvious from the press coverage, because the videos are prettier than the robots. But read the papers and the agent loop is always the destination.

There are three main ways to connect a world model to an agent, and which one a lab chooses tells you a lot about what they are actually betting on.

**Train agents inside the world model via reinforcement learning in imagination.** This is the Dreamer lineage, Hafner's series from 2019 to Dreamer 4 in 2025. The agent learns a policy entirely from rollouts inside the learned world model, never touching the real environment during training. This is what Ha and Schmidhuber originally proposed in 2018. It is the cleanest instantiation of the "agents dreaming" thesis. It works when the world model is accurate enough that a policy optimized against it transfers to the real world. Dreamer 4's Minecraft diamond result is the current high-water mark. Hafner and Wilson Yan are reportedly raising for a new company, Embo, to commercialize the approach, presumably for robotics.
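A toy of this loop, with the world model and reward head frozen and the policy trained by backpropagating through the imagined rollout. A real system like Dreamer adds stochastic latents, a value function, and λ-returns; this sketch only shows where the training signal comes from: the dream, not the environment.

```python
# Policy learning in imagination: roll the policy forward inside a frozen
# learned world model, sum imagined rewards, and backpropagate through the
# rollout. All modules are tiny placeholders for illustration.
import torch
import torch.nn as nn

D_STATE, D_ACT, HORIZON = 16, 4, 15
dynamics = nn.Linear(D_STATE + D_ACT, D_STATE)    # stands in for a pretrained world model
reward_head = nn.Linear(D_STATE, 1)               # stands in for a pretrained reward model
for p in list(dynamics.parameters()) + list(reward_head.parameters()):
    p.requires_grad_(False)                        # frozen: only the policy learns

policy = nn.Sequential(nn.Linear(D_STATE, 64), nn.Tanh(), nn.Linear(64, D_ACT), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def imagination_update(s0):
    s, total_reward = s0, 0.0
    for _ in range(HORIZON):
        a = policy(s)                                        # act inside the dream
        s = torch.tanh(dynamics(torch.cat([s, a], dim=-1)))  # imagined next state
        total_reward = total_reward + reward_head(s).mean()  # imagined reward
    loss = -total_reward               # gradient flows back through the learned dynamics
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

loss = imagination_update(torch.randn(32, D_STATE))   # never touches the real environment
```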

**Use the world model as a planning oracle at inference time.** This is V-JEPA 2's approach. The world model is not used to train a policy offline; it is used online, during action selection, to predict the consequences of candidate action sequences and pick the best one. Model-predictive control with a learned dynamics model. This is simpler in some ways, because you do not need to worry about the policy transfer problem, but it puts more load on the world model's accuracy and on the search algorithm. V-JEPA 2-AC uses the cross-entropy method over short action sequences. Earlier work used shooting methods or trajectory optimization.

**Skip the world model entirely and train the agent directly on experience.** This is the VLA (Vision-Language-Action) camp. Physical Intelligence's π series, Google DeepMind's RT-2, Figure's Helix, most of the current robotics foundation models. The argument is that we do not have world models good enough to rely on for sim-to-real transfer in open-ended manipulation, and we do have enough real-world teleoperation and first-person video data to train policies directly. VLAs are pragmatic. They inherit scaling laws and infrastructure from the LLM world. They work. Physical Intelligence's release cadence over the past year, from π0 through π0.5, π0.6, π*0.6, and their memory-augmented follow-up, has been a steady march toward commercial viability on real robots, and they have gotten further than the world-model camps on any narrow metric of "what can a robot actually do right now."

The tension between these approaches is real and productive. The Dreamer and JEPA camps believe that learning dynamics is worth the investment because it unlocks things VLAs cannot do: counterfactual reasoning, long-horizon planning, data efficiency on out-of-distribution tasks. The VLA camp believes that training end-to-end on enough real data will always beat training through a learned model, because the model introduces compounding approximation error and real data does not lie. The disagreement is empirical, and the question is which of their intuitions generalizes further.

This is where Pete Florence's recent post from Generalist deserves close attention. Generalist is a robotics foundation model company whose founding team includes engineers from OpenAI, Google DeepMind, and Boston Dynamics. Their April 2026 model GEN-1 is state-of-the-art on several dexterous manipulation benchmarks. What makes Florence's post interesting is not the model. It is that he explicitly rejects both of the labels in this essay.

*"GEN-1 is not a fine-tuned vision-language model with robot actions bolted on, nor is it just a world model. It is a first-class-citizen, native foundation model for physical interaction... World models are having their moment in early 2026. VLAs had theirs from 2023 to 2025. Bandwagons are part of the nature of academic research. At Generalist, we've never referred to our models as either VLAs or world models. This is not an accident."*

His argument is worth engaging on its own terms. Florence distinguishes *goal-driven* from *idea-driven* research, picking a concrete outcome and solving whatever stands in the way versus picking a method and following its implications. He places current world-model discourse firmly in the idea-driven category, which he considers a distraction. The goal, for Generalist, is fully zero-shot robotics at high success rates. Whatever architectural choices get them there, VLA-style or world-model-style or something new, are tools, not tribes.

Florence is right about the sociology. Most of the "world model versus VLA" discussion is idea-driven. Labs pick a camp based partly on aesthetic preference, then optimize within it, and the camps are more dug in than the underlying technical questions justify. The next year will probably see a wave of hybrid architectures: world models with VLA-style action heads, VLAs with latent prediction objectives, LLM-backboned systems with generative world-model rollout components. Generalist's approach of ignoring the labels and training whatever works is probably closer to what the winning systems will look like in 2027 and 2028.

But Florence is being a little too clean. The categories are imperfect; they are not empty. Whether to learn dynamics is a real question. Whether to predict in pixel space or representation space is a real question. Whether to treat language as first-class or video as first-class is a real question. You can refuse the labels while still having answers to the questions, and Generalist clearly does: they train from scratch on physical interaction data, they do not use pixel-space video generation as a core objective, they use action conditioning. The labels are shorthand for technical commitments. Refusing the shorthand does not exempt you from the commitments underneath.

---

## Part V: Where I Land

I'll say where I come out, and what I'll be watching to find out if I'm wrong.

The camp I bet on is the representation-space camp. JEPA and its descendants feel right to me for reasons I can articulate and reasons I cannot fully articulate.

The articulable reasons are the ones in Part III: the objective is epistemically correct, the data efficiency results are the right shape, and the approach is compatible with the architecture of autonomous intelligence I think we eventually need. The harder-to-articulate reason is that when I read generative world model papers, I keep feeling that something is being glossed over. The models produce beautiful outputs, but the gap between "produces beautiful outputs" and "has understood the physical world" is not being closed, and the papers do not always seem to notice the gap. When I read JEPA papers, the authors read as aware of that gap and designing around it. That might be a bias of mine. I am flagging it as a bias.

I do not think this means the generative camp is wrong, or that their results do not matter. Dreamer 4 is the strongest agent-training result of the past year. Genie 3 is the strongest interactive world-generation result. GAIA-2 is the strongest domain-specific world model. These are real systems that do real things. If they continue to scale, if scaling fixes the OOD generalization problem, if longer rollouts remain coherent, if the pattern-matching-versus-physics question resolves in favor of physics, then the generative approach will have been right and I will have been wrong about the fundamental bottleneck.

What I am watching over the next eighteen months:

**Whether JEPA closes the agent loop convincingly.** V-JEPA 2 showed planning. It has not yet shown extended behavior, online learning, or open-ended task performance. If the JEPA camp can demonstrate something on the scale of Dreamer 4's Minecraft result, an agent that solves a hard long-horizon task entirely from representation-space rollouts, the balance of evidence tilts sharply. AMI Labs has the team and the runway to do this. I expect to see their first major results in 2026 or early 2027.

**Whether video models demonstrate out-of-distribution physics.** The 2024 paper I cited earlier showed scaling did not produce OOD generalization in the regime tested. Either someone scales past that regime and breaks the pattern, or they don't. If they do, the generative camp's argument gets much stronger. If they don't, it gets weaker. We will know soon enough. Both OpenAI and Google have strong reasons to produce and publish physics-focused evaluations of their video models, and the field is watching.

**Whether hybrid approaches dominate.** My prior is that in five years, the most important world models will not cleanly fit either camp. They will be something like JEPA-style representations with generative rollout heads for evaluation, or generative world models with JEPA-style auxiliary objectives, or something else I cannot anticipate. The pure camps today are useful for thinking about the design space; they are probably not the shape of the winners.

**Whether the robotics companies bypass world models entirely.** Generalist, Physical Intelligence, Skild, and the rest of the VLA camp are betting you can skip world modeling by training on enough real-world interaction data. If they are right, world models will be remembered as a detour rather than a path. I think this is possible. I also think that "enough" real-world interaction data for the long tail of tasks humans care about is larger than any of these companies currently has, and that the scaling wall they hit will look more like "we need to model dynamics explicitly" than like "more data fixes this." But I hold that view weakly.

**Whether someone from outside the current taxonomy surprises everyone.** Active inference is the obvious candidate. Friston's community has been working on this for longer than deep learning has existed, and if the deep learning approaches hit a wall, the conceptual inventory to replace them already exists. I would not bet on active inference. I would also not bet against "something non-obvious" in a five-year window. The field is young enough that the winning approach might not yet have a name.

If you force me to commit: I think the representation-space camp is right. Not as a matter of taste, but as a matter of what the learning problem is. Intelligence is about predicting what matters, and pixels are not the thing that matters. The generative labs are producing the flashier demos today, and they may continue to for a while, because pixel-space systems are easier to show off and easier to iterate on. But the question of whether a system has understood the world is a question about its representations, not its outputs, and the representation-space camp is the one taking that question seriously.

V-JEPA 2 is the cleanest existence proof we have so far. 62 hours of robot data, zero-shot transfer to new labs, planning in latent space at roughly fifteen times the speed of the pixel-space baseline. That is what the theory predicts should happen if the theory is right. One data point. The right kind of data point.

A second data point worth flagging, because it is a production system on a safety-critical use case rather than a lab benchmark: Nexar's BADAS 2.0, released April 2026, is a collision prediction model built on a fine-tuned V-JEPA 2 backbone. It is deployed across Nexar's fleet of 350,000 dashcams and trained on roughly two million real-world collision-risk events drawn from 200 million miles of driving. The headline numbers are 99.4% average precision with 91% early warning recall, and the comparison that matters for this essay is the one against NVIDIA Cosmos on the same task: BADAS 2.0 outperforms a 2-billion-parameter pixel-space foundation model with roughly 91× fewer parameters. The Nexar team's own framing is precisely the one Camp 4 makes, that predicting in latent space optimizes for physical causality rather than visual fidelity, and that this is what you want when the downstream task is "will this collide." It is a narrow use case, and a single deployment does not settle the argument. But it is the kind of result that shows the JEPA approach works outside academic benchmarks, on a task where false positives and missed detections have real consequences.

I am watching AMI Labs with real anticipation. If they can close the agent loop convincingly, if the next V-JEPA scales the way the first two did, if the data-efficiency story holds at the next order of magnitude, the case goes from suggestive to decisive inside two years. If it does not, I want to know that too; I have tried to make my position falsifiable in the list above. Either way, the next stretch of this field is going to be the most interesting part.

A final word on the term itself. "World model" is going to outlast this particular moment of overuse, and it should. The underlying idea, that intelligence requires an internal model of how the world changes when you act on it, is one of the deep ideas in AI, and it connects to traditions older than deep learning: cybernetics, control theory, predictive coding in neuroscience. The labs covered in this essay disagree about what the model should represent and how it should be trained. They agree on the shape of the question. Getting the question right is already most of the work. Watching how it gets answered, over the next few years, is the part I am most looking forward to.

---

## Further Reading

Papers I would pick if I had to pack a reading list. Rough chronological order within each group.

**Origins**
- Schmidhuber, *Making the World Differentiable* (1990). An early technical report on using recurrent networks as differentiable environment models for planning. [Original PDF](https://people.idsia.ch/~juergen/FKI-126-90_(revised)bw_ocr.pdf)
- Sutton, *Dyna, an Integrated Architecture for Learning, Planning, and Reacting* (1991). Not a world-model paper per se, but the same decade's argument for unifying learned models with action and planning.

**Modern revival**
- Ha and Schmidhuber, *World Models* (2018). The NeurIPS paper that popularized the term, with an iconic [interactive web version](https://worldmodels.github.io/).
- Hafner et al., DreamerV2 (2020) and DreamerV3 (2023). The RL-in-imagination lineage before it became a large-scale story.

**Generative world models**
- Micheli, Alonso, Fleuret, *Transformers are Sample-Efficient World Models* (IRIS, 2022). Autoregressive token-based world modeling for Atari.
- Hu et al., *GAIA-1* (2023) and *GAIA-2* (2025). Wayve's generative world models for autonomous driving. [GAIA-2 technical report](https://arxiv.org/abs/2503.20523).
- Alonso et al., *DIAMOND* (2024). Diffusion-based world modeling. Much subsequent open-source work builds on this architecture.
- DeepMind, *Genie 2* (2024) and *Genie 3* (2025). Interactive generative world models. [Genie 3 announcement](https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/).
- Hafner, Yan, Lillicrap, *Training Agents Inside of Scalable World Models* (Dreamer 4, 2025). The first offline Minecraft diamond result. [arXiv](https://arxiv.org/abs/2509.24527).

**Latent and representation-space world models**
- LeCun, *A Path Towards Autonomous Machine Intelligence* (2022). The position paper. Long, opinionated, worth reading in full. [OpenReview](https://openreview.net/pdf?id=BZ5a1r-kVsf).
- Assran et al., *I-JEPA* (2023). The first working JEPA instantiation.
- Bardes et al., *V-JEPA* (2024) and *V-JEPA 2* (2025). Video JEPA and zero-shot robot planning. [V-JEPA 2 paper](https://arxiv.org/abs/2506.09985).
- Balestriero et al., *LeJEPA* (2025). Distribution-matching approach to JEPA training without EMA teachers.
- Nexar, *BADAS: Context Aware Collision Prediction Using Real-World Dashcam Data* (2025). Production V-JEPA 2 deployment for automotive safety. [Paper](https://arxiv.org/abs/2510.14876).

**Video generation as "world simulation"**
- OpenAI, *Video Generation Models as World Simulators* (Sora technical report, 2024). The canonical statement of the scale-will-solve-it position.
- Kang et al., *How Far Is Video Generation from World Model: A Physical Law Perspective* (2024). The controlled study showing scaling did not produce OOD physics generalization in the regime tested.

**Spatial intelligence and 3D**
- World Labs, *Marble* (2025). Persistent 3D scene generation.
- Mildenhall et al., *NeRF* (2020) and Kerbl et al., *3D Gaussian Splatting* (2023). The neural 3D representation lineage Marble builds on.

**Physical AI infrastructure**
- NVIDIA, *Cosmos World Foundation Model Platform for Physical AI* (2025). [Technical report](https://arxiv.org/abs/2501.03575).

**Agent architectures and the "refuse the label" case**
- Brohan et al., *RT-2: Vision-Language-Action Models* (2023). The VLA lineage.
- Black et al., *π0* and related papers (2024-2025). Physical Intelligence's VLA series.
- Florence, *Going Beyond World Models & VLAs* (Generalist blog, April 2026). The goal-driven critique of both camps.

---

*This is my understanding of the field as of early 2026. It will be wrong in places, and dated in more. If you think I have gotten something importantly wrong, I want to know.*