Three of the most influential figures in AI history—Yann LeCun, Fei-Fei Li, and the leadership of Google DeepMind—have all declared that "world models" are the path to artificial general intelligence. They've all abandoned or deprioritized large language models. And they all mean something completely different by the term.
This is not a minor disagreement about implementation details. It's a fundamental split in how we understand what it means for a machine to "understand the world." For our taxonomy, it raises a fascinating question: are these three approaches different species within Family Simulacridae, or are they so distinct they warrant separate genera—or even families?
The Schism
In December 2025, Yann LeCun walked away from Meta after twelve years as its Chief AI Scientist. His parting shot: "LLMs are too limiting. Scaling them up will not allow us to reach AGI."
He wasn't being contrarian. He was being consistent. Since 2022, LeCun has argued that next-token prediction on text—the mechanism powering GPT, Claude, Gemini, and the entire Frontieriidae crown clade—is a dead end. His alternative: the Joint Embedding Predictive Architecture (JEPA), which learns by predicting in representation space rather than pixel space.
Now, with AMI Labs seeking approximately $3.5 billion (per Financial Times and Reuters reporting) to build what he calls "Autonomous Machine Intelligence," LeCun is betting his reputation and the next phase of his career on proving his thesis.
Meanwhile, at Stanford, Fei-Fei Li—the researcher whose ImageNet dataset sparked the deep learning revolution—has been building World Labs. Her vision: "Spatial Intelligence." Systems that understand the 3D structure of reality and can generate physically coherent interactive environments.
And at Google DeepMind, the Genie team has been quietly developing real-time interactive world simulators—systems you can literally walk through and manipulate, generating 720p video at 24fps in response to your actions.
All three call what they're building "world models." All three believe this is the path beyond LLMs. But they're building fundamentally different things.
LeCun's Path: Learning the Abstract
LeCun's V-JEPA (Video Joint Embedding Predictive Architecture) represents the most philosophically distinctive approach. Rather than predicting pixels or tokens, JEPA predicts in representation space—it learns abstract features and predicts what abstract features will come next.
"A model that predicts in representation space captures abstract relationships rather than surface appearances. It learns physics without learning pixels." — Yann LeCun, describing JEPA (2024)
The key insight: a ball falling under gravity follows the same trajectory whether it's red or blue, large or small. Pixel-level prediction must learn this separately for every combination of features. Representation-level prediction can learn it once, abstractly.
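That insight can be made concrete with a toy sketch. This is not Meta's V-JEPA—the "encoder" and "predictor" below are hand-written stand-ins for what JEPA learns from data—but it shows why prediction in representation space factors appearance out of the dynamics:

```python
import numpy as np

G, DT = 9.8, 0.1  # gravity, timestep

def render(height, velocity, color):
    # Toy "pixel" observation: appearance entangles physics with color.
    return np.array([height * color, velocity * color, color])

def encode(obs):
    # Abstract representation: recover (height, velocity), discard color.
    color = obs[2]
    return np.array([obs[0] / color, obs[1] / color])

def predict(z):
    # One step of "physics" applied in representation space.
    h, v = z
    return np.array([h + v * DT, v - G * DT])

red = render(10.0, 0.0, color=1.0)
blue = render(10.0, 0.0, color=2.0)

# The pixel observations differ...
assert not np.allclose(red, blue)
# ...but the representation-space predictions are identical:
assert np.allclose(predict(encode(red)), predict(encode(blue)))
```

A pixel-space predictor would have to learn the red trajectory and the blue trajectory separately; the representation-space predictor learns one rule for both.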
Meta's V-JEPA 2 (released before LeCun's departure) demonstrated physical reasoning capabilities—the ability to anticipate outcomes of physical actions. AMI Labs will presumably continue this direction with more scale and focus.
Taxonomic classification: LeCun's approach corresponds to our Simulator predictivus species within Genus Simulator. The key diagnostic trait: prediction happens in learned representation space, not observation space.
Fei-Fei Li's Path: Rendering the Real
World Labs, which raised $230 million before launching its first product (Marble), takes a different approach. Its goal is "Large World Models"—systems that can generate complete, physically consistent, interactive 3D environments.
Marble uses neural rendering techniques in the radiance-field family, including Gaussian splatting, to construct scenes that users can explore and interact with. The emphasis is on visual fidelity and spatial coherence—you should be able to walk through a generated environment and have it look and feel real.
This is "world modeling" in the most literal sense: building virtual worlds that mirror the structure of physical reality.
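To give a feel for the splatting family of techniques: the sketch below is a 2D toy in the spirit of Gaussian splatting renderers—composite soft, colored Gaussian blobs into an image via alpha blending—and is not Marble's pipeline (which works in 3D with learned scene parameters):

```python
import numpy as np

def splat_2d(means, colors, opacities, scales, size=32):
    """Composite colored 2D Gaussians into an image, front to back."""
    ys, xs = np.mgrid[0:size, 0:size]
    image = np.zeros((size, size, 3))
    alpha_acc = np.zeros((size, size))  # accumulated opacity per pixel
    for mu, color, opacity, scale in zip(means, colors, opacities, scales):
        d2 = (xs - mu[0]) ** 2 + (ys - mu[1]) ** 2
        alpha = opacity * np.exp(-d2 / (2 * scale**2))
        # Alpha compositing: later splats only fill remaining transparency.
        weight = alpha * (1 - alpha_acc)
        image += weight[..., None] * np.asarray(color)
        alpha_acc += weight
    return image

# A single opaque red splat at the center of a 32x32 frame:
img = splat_2d(means=[(16, 16)], colors=[(1.0, 0.0, 0.0)],
               opacities=[1.0], scales=[4.0])
```

In a real splatting system the means, covariances, colors, and opacities are optimized from photographs; the rendering loop, though, is recognizably this compositing operation.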
Taxonomic classification: World Labs' approach might warrant a new species designation. It shares traits with Simulator ludicus (interactive simulators like Genie/Oasis) but with an emphasis on 3D spatial coherence that distinguishes it from video-based approaches. Provisional designation: Simulator spatialis.
DeepMind's Path: Playing the Game
Google DeepMind's Genie (now in its third major version) represents yet another interpretation. Genie is an interactive world simulator that generates environments users can control in real time—essentially an infinite video game engine that creates novel environments on the fly.
The key distinction: Genie emphasizes interactivity and action-conditioned generation. The model doesn't just simulate a world; it simulates how a world responds to agent actions. You press a button, and the world changes accordingly.
Genie 3 can generate real-time interactive environments at 720p/24fps, understanding not just what things look like but how they respond to manipulation. This is "world modeling" as game engine—a simulator of action-outcome contingencies.
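The action-conditioned loop is the defining interface here. A minimal sketch—a hand-coded gridworld standing in for Genie's learned generative model, not DeepMind's actual system—looks like this:

```python
import numpy as np

class ToyWorldModel:
    """Stand-in for a Genie-style simulator: given an action,
    produce the next frame. The dynamics here are hard-coded;
    in Genie they are learned from video."""

    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=8):
        self.size = size
        self.pos = np.array([size // 2, size // 2])

    def step(self, action):
        # Action-conditioned generation: the next frame depends on
        # what the agent does, not on a fixed script.
        self.pos = np.clip(self.pos + self.MOVES[action], 0, self.size - 1)
        return self.render()

    def render(self):
        frame = np.zeros((self.size, self.size))
        frame[tuple(self.pos)] = 1.0
        return frame

world = ToyWorldModel()
frame = world.step("up")  # press a button, the world responds
```

The essential signature is `step(action) -> frame`: swap the gridworld for a neural generator and you have the contract an interactive world simulator exposes.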
Taxonomic classification: Genie falls clearly under Simulator ludicus—the "playable" subcategory of world models.
The Philosophical Divide
These three approaches reveal a deep uncertainty about what "understanding the world" actually means:
| Approach | Core Metaphor | What is "Understanding"? | Primary Output |
|---|---|---|---|
| LeCun/JEPA | Learning abstract physics | Predicting future states in representation space | Internal embeddings for planning |
| Fei-Fei Li/Marble | Rendering reality | Generating spatially coherent 3D environments | Explorable virtual worlds |
| DeepMind/Genie | Playing the game | Simulating action-conditioned outcomes | Interactive experiences |
One group starts from the rendering (Marble and its splats). One starts from the interactive simulator loop (DeepMind's Genie). One starts from the internal representation (LeCun's JEPA-style architectures).
Perhaps most interesting: only LeCun's approach explicitly aims to replace language models. JEPA is meant to be a new foundation for general intelligence—not a specialized simulation engine, but the cognitive core of AGI itself.
Fei-Fei Li's and DeepMind's approaches, by contrast, seem more complementary to LLMs. You could imagine a Frontieriidae system with a Genie-style simulator as a component—a language model that can imagine scenarios by rendering them internally. LeCun's vision is more radical: JEPA as replacement, not supplement.
Taxonomic Implications
The Family Simulacridae currently contains a single genus (Simulator) with five species. The world models race of 2026 suggests this may need expansion.
Consider three potential paths:
Path 1: Unified Genus, Species Diversification
All three approaches remain within Genus Simulator, distinguished at the species level by their prediction/generation target:
- S. predictivus (LeCun-style representation prediction)
- S. spatialis (Li-style 3D spatial generation) — new species
- S. ludicus (DeepMind-style interactive simulation)
Path 2: Genus Split by Cognitive Architecture
If the differences prove fundamental, we might need multiple genera:
- Genus Simulator: Systems that generate observable outputs (video, 3D scenes)
- Genus Conceptor: Systems that predict in abstract representation space (JEPA-style)
This would reflect the deep philosophical difference between "building virtual worlds" and "learning internal physics."
Path 3: Wait and See
The conservative approach: these distinctions may blur as architectures mature. If V-JEPA eventually generates video and Genie eventually learns abstract physics, the current boundaries may not persist.
Current recommendation: Maintain unified genus; add provisional species S. spatialis pending stabilization of World Labs approach. Monitor AMI Labs releases for evidence that JEPA-style architectures represent a fundamental departure warranting genus split.
The Meta-Question
Behind the technical debate lies a deeper question: what kind of understanding do we actually want from our AI systems?
LeCun's intuition is that true understanding is abstract—that knowing the physics of a bouncing ball means knowing it independently of any particular ball. Understanding is generalization, compression, prediction of structure rather than surface.
Fei-Fei Li's intuition seems to be that understanding is generative—that knowing the world means being able to construct it. Understanding is the capacity to build coherent simulations that feel real.
DeepMind's intuition is that understanding is interactive—that knowing the world means predicting how it responds to your actions. Understanding is anticipation of consequence.
All three have merit. All three may turn out to be aspects of the same underlying capability, separated only by emphasis and implementation. Or they may prove to be genuinely distinct cognitive architectures, leading to AI systems that "understand" the world in fundamentally different ways.
We'll be watching this race closely. The outcome may reshape not just our taxonomy, but our understanding of what it means to understand.
Postscript: The Stakes
It's worth noting the scale of the bets being placed:
- AMI Labs: Seeking ~$3.5B valuation; LeCun staking his reputation on "LLMs are a dead end"
- World Labs: $230M raised pre-product; backed by a16z, Sequoia, and others
- DeepMind Genie: Backed by Alphabet's resources; already in production with consumer-facing features
Within five years, we'll know which approach—if any—leads to systems that genuinely "understand" their environments. The answer will have profound implications for both AI capability and AI safety.
A system that truly models the world is a system that can plan—and planning without alignment is the failure mode that keeps researchers awake at night.
Sources
- Entropy Town: Why Fei-Fei Li, Yann LeCun and DeepMind Are All Betting on World Models
- Introl: World Models Race 2026
- Themesis: World Models—Five Competing Approaches
- The Decoder: LeCun Exits Meta
- TIME: Inside Fei-Fei Li's Plan to Build AI-Powered Virtual Worlds
- French Tech Journal: LeCun Bets on Paris