What Is This?
Fei-Fei Li and the World Labs team published "A Functional Taxonomy of World Models" to clean up a term that is becoming overloaded in AI.
Their core point: a world model is not one product category. It is a family of systems that model some part of the loop between an agent and its environment. Some world models produce what an agent sees. Some produce the hidden structure of the world. Some produce the next action an agent should take.
The useful taxonomy is:
- Renderers output observations: pixels, frames, scenes, views.
- Simulators output state: geometry, physics, dynamics, structure.
- Planners output actions: what an agent should do next.
The article matters because it gives a sharper vocabulary for the next phase of spatial AI. A beautiful video generator, a robotics policy, a physics engine, an explorable 3D world, and an action-conditioned simulator may all be called "world models". They are not doing the same job.
The One-Sentence Summary
World models should be understood by function: renderers show the world, simulators model the world's structure, and planners choose actions inside it.
Why It Matters
Language models learned the statistical structure of text. World models try to learn the statistical structure of space and time: how scenes persist, how objects occupy space, how light changes, how materials behave, and how actions change the future.
That shift matters because the next hard AI problems are not only verbal. Robotics, simulation, architecture, games, autonomous vehicles, digital twins, embodied agents, and creative 3D tools all need systems that understand space rather than merely describe it.
The problem is that "world model" is now used too loosely. A model that produces a gorgeous but physically impossible fire simulation is not equivalent to a simulator that respects combustion physics. A language model improvising a playable game is not equivalent to a robot planner that can recover from physical failure. A renderer can look convincing while still being structurally wrong.
The taxonomy gives a way to ask the right question: what does this model output, and who can rely on that output?
The Loop Beneath The Taxonomy
The article grounds the taxonomy in the classic agent-environment loop from reinforcement learning and partially observable Markov decision processes (POMDPs).
An agent takes an action. The action changes the hidden state of the world. The agent does not directly see the full state. It receives an observation: pixels, sensor readings, sounds, proprioception, or some other partial view. The next action depends on the observation, and the loop continues.
This distinction is the spine of the taxonomy:
- State is the underlying reality: objects, positions, velocities, geometry, materials, forces, and relevant properties.
- Observations are what an agent can perceive from its partial viewpoint.
- Actions are the interventions the agent chooses.
Most confusion around world models comes from collapsing these into one thing. A renderer models observations. A simulator models state. A planner chooses actions.
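A minimal sketch of the loop can make the three roles concrete. Everything here is a hypothetical toy, not taken from the World Labs article: a 1D world where the hidden state is a position and a goal, the observation is only the direction to the goal, and the action is a single step.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class State:
    """Hidden state of the toy world. The agent never reads this directly."""
    position: int
    goal: int


def observe(state: State) -> int:
    """Partial view: the direction to the goal (-1, 0, +1), not the positions."""
    delta = state.goal - state.position
    return (delta > 0) - (delta < 0)


def policy(observation: int) -> int:
    """Choose an action from the observation alone: step in the observed direction."""
    return observation


def transition(state: State, action: int) -> State:
    """The action changes the hidden state."""
    return State(state.position + action, state.goal)


def run_loop(state: State, max_steps: int = 20) -> Tuple[State, int]:
    """Observe, act, update state; stop when the observation says 'at goal'."""
    steps = 0
    for _ in range(max_steps):
        obs = observe(state)
        if obs == 0:
            break
        state = transition(state, policy(obs))
        steps += 1
    return state, steps
```

In this toy, `observe` plays the renderer's role, `transition` the simulator's, and `policy` the planner's. Real systems replace each function with a learned model, but the division of labour is the same.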
Renderer: A Model That Outputs Observations
A renderer creates what the world looks like from a particular viewpoint. Text-to-video models, image generators, interactive frame models, and real-time visual world generators sit here.
The contract is visual fidelity. The model is judged by whether the output looks plausible to a human observer.
Examples from the World Labs framing include cinematic video generation, Google DeepMind's Genie interactive worlds, and World Labs' RTFM, which generates frames in real time as a user interacts with an environment.
Renderers are commercially mature because the output is legible and useful. People can use them for concept art, video, game prototyping, spatial ideation, and visual storytelling.
But the weakness is clear: visual plausibility is not physical truth. A generated building can look right from a drone view and fail when you try to enter it. A generated cup can look solid and have no reliable geometry, mass, friction, or collision behaviour underneath.
A renderer answers: what would this look like?
Simulator: A Model That Outputs State
A simulator produces structure that can be inspected, computed on, and interacted with. It cares about geometry, physics, dynamics, material properties, scale, collisions, and constraints.
The contract is not just beauty. The contract is structural reliability.
That makes simulation the linchpin of the taxonomy. If a model can simulate the world, it can often derive pixels for humans and action consequences for agents. But a model that only renders pixels cannot automatically be trusted for robotics, engineering, architecture, or high-fidelity training.
This is why the World Labs article gives simulation special weight. Simulation is the bridge between renderers and planners. It is where the visual surface becomes an operational world.
Examples include physics engines, digital twins, robot training environments, autonomous vehicle scenario simulation, and generative 3D systems like Marble that output explorable spatial environments plus computable structures such as Gaussian splats and collision meshes.
The hard part is data and accuracy. Internet video is abundant; explicit 3D data with geometry, material properties, physical annotations, and robot interactions is scarce. Generative simulators also create a new failure mode: an object can look correct while having broken scale, self-intersections, impossible geometry, or physics that will mislead downstream systems.
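As a rough illustration of that failure mode, the sketch below runs two sanity checks a downstream system might apply to generated geometry: a scale check and an interpenetration check. It assumes objects are reduced to axis-aligned bounding boxes in metres. The `Box` type, function names, and thresholds are hypothetical, and real pipelines would test actual meshes rather than boxes.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box:
    """Axis-aligned bounding box, coordinates in metres (an assumed convention)."""
    name: str
    lo: Vec3  # minimum corner
    hi: Vec3  # maximum corner


def size(box: Box) -> Vec3:
    """Edge lengths along x, y, z."""
    return tuple(h - l for l, h in zip(box.lo, box.hi))


def scale_ok(box: Box, min_m: float, max_m: float) -> bool:
    """True if the largest dimension falls inside a plausible range."""
    return min_m <= max(size(box)) <= max_m


def interpenetrates(a: Box, b: Box) -> bool:
    """True if the boxes overlap with positive volume (touching faces do not count)."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))
```

A cup resting on a table passes both checks. A cup that looks identical in a render but is sunk five centimetres into the tabletop, or is eight metres tall, does not.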
A simulator answers: what is the structure of this world, and how will it evolve?
Planner: A Model That Outputs Actions
A planner chooses what to do next.
Given an observation and a goal, it outputs an action or action sequence. In robotics, this means deciding how a robot should move, grasp, push, navigate, recover, or manipulate. In virtual agents, it means deciding how to interact with a simulated or generated environment.
The World Labs article frames planners as the inverse of renderers. A renderer takes an action or prompt and produces observations. A planner takes observations and goals and produces actions.
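The inverse relationship is easiest to see as function signatures. The sketch below is an assumption-laden toy rather than anything from the article: an observation is a row of pixels with one lit cell, the renderer is action-conditioned (previous frame plus action in, next frame out), and the planner reads a frame and a goal and returns an action.

```python
from typing import Callable, Tuple

# Hypothetical stand-in types for a 1D toy world.
Observation = Tuple[int, ...]  # a row of pixels, exactly one of them lit (1)
Action = int                   # -1 = left, +1 = right, 0 = stay
Goal = int                     # index of the pixel the agent wants lit

Renderer = Callable[[Observation, Action], Observation]  # action in, observation out
Planner = Callable[[Observation, Goal], Action]          # observation in, action out


def toy_renderer(frame: Observation, action: Action) -> Observation:
    """Produce the next frame by shifting the lit pixel, clamped to the row."""
    i = frame.index(1)
    j = min(max(i + action, 0), len(frame) - 1)
    return tuple(1 if k == j else 0 for k in range(len(frame)))


def toy_planner(frame: Observation, goal: Goal) -> Action:
    """Read the lit pixel's position off the frame and step toward the goal."""
    i = frame.index(1)
    return (goal > i) - (goal < i)
```

Chaining the two closes the loop: the planner's action feeds the renderer, and the renderer's frame feeds the planner.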
Vision-language-action models, model-based control systems, robot foundation models, and World Action Models belong in this territory.
This is the most exciting and least proven category. Robotics demos have improved quickly, but many remain constrained to lab setups, short horizons, narrow sets of objects, and favourable conditions. A system that works in a video demo is not the same as a system that survives open-ended kitchens, warehouses, hospitals, construction sites, or homes.
A planner answers: what should the agent do now?
Why Simulation Is The Bottleneck
The article's strongest claim is that simulation is under-discussed relative to its importance.
Renderers get attention because they are visible. Planners get attention because they imply agency. Simulators are less glamorous, but they may determine whether world models become useful infrastructure rather than impressive media systems.
A useful simulator has to support both humans and machines:
- Humans need worlds that can be edited, navigated, inspected, designed, and trusted.
- Agents need worlds where they can train, test, fail, and learn without destroying real things.
This is why simulation-shaped systems matter for factories, warehouses, autonomous vehicles, game worlds, architecture, robotics, scientific discovery, and digital twins.
The open problems are still hard:
- 3D and robot interaction data are scarce compared with video and text.
- Sim-to-real gaps persist when simulated behaviour differs from reality.
- Multi-physics simulation across rigid bodies, deformables, fluids, cloth, and contact remains expensive.
- Generated geometry can be visually plausible but physically invalid.
- Optimization for beauty can conflict with optimization for accurate structure.
A renderer can be wrong in ways that merely look strange. A simulator can be wrong in ways that train a robot badly.
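One way to make the sim-to-real gap tangible is to roll the same initial condition through two dynamics models and measure how far the trajectories drift apart. The sketch below is a toy under stated assumptions: a sliding block, a "simulator" that ignores friction, and a second rollout with friction standing in for reality. The function names and numbers are illustrative, not measurements.

```python
from typing import List


def rollout(v0: float, decel: float, steps: int, dt: float = 0.1) -> List[float]:
    """Positions of a sliding block under constant friction deceleration (m/s^2)."""
    x, v, xs = 0.0, v0, []
    for _ in range(steps):
        v = max(v - decel * dt, 0.0)  # friction slows the block but never reverses it
        x += v * dt
        xs.append(x)
    return xs


def mean_gap(sim: List[float], real: List[float]) -> float:
    """Mean absolute position error between two equal-length trajectories."""
    return sum(abs(s - r) for s, r in zip(sim, real)) / len(sim)


sim = rollout(v0=1.0, decel=0.0, steps=10)   # simulator: frictionless
real = rollout(v0=1.0, decel=0.5, steps=10)  # stand-in for reality: with friction
gap = mean_gap(sim, real)
```

The per-step error in this toy grows over the rollout, which is why a small modelling mistake can matter more for long-horizon training than for a single predicted frame.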
The Boundaries Are Starting To Collapse
The taxonomy is useful, but the endgame is not three separate model families forever.
The same underlying knowledge supports all three functions. A model that truly understands a cup on a table should be able to:
- render the cup from a new angle
- simulate what happens if it is pushed
- plan how a hand should pick it up
This is why recent research keeps blurring the borders. Interactive renderers become action-conditioned. Simulators become generative and editable. Planners use imagined futures to choose better actions. Systems like Marble combine visual world generation with structures that physics engines can operate on. DeepMind's Genie line points toward interactive generated environments rather than passive clips.
The likely endpoint is a unified world model: one foundation model that can switch between producing observations, state, and actions depending on the downstream use case.
That is not solved. The data, architecture, evaluation, and deployment problems are different across all three outputs. But the direction is clear: spatial AI wants models that can see, build, simulate, and act in the same representational space.
Why Smart People Get This Wrong
The common mistake is judging world models by the easiest output to inspect: pixels.
Pixels are persuasive. A video can look like evidence of understanding. But a generated scene can be visually coherent and physically useless. For robotics, engineering, autonomous driving, and digital twins, the hidden state matters more than the visible surface.
The second mistake is treating planning as if it can skip simulation. A robot that selects actions without a reliable model of consequences is brittle. It may react well in narrow demos but fail under distribution shift, contact uncertainty, occlusion, or long-horizon tasks.
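The contrast can be sketched with a toy. Assume a 1D track where the agent can step one cell or jump two, with a hazard at cell 2 and the goal at cell 4. The layout, the `simulate` function, and the exhaustive search are all hypothetical illustrations. Real planners use learned models and far smarter search.

```python
from itertools import product
from typing import Optional, Tuple

ACTIONS = (1, 2)     # step one cell, or jump two
HAZARD, GOAL = 2, 4  # hypothetical toy layout


def simulate(position: int, action: int) -> Optional[int]:
    """Predicted next position, or None if the move lands on the hazard."""
    nxt = position + action
    return None if nxt == HAZARD else nxt


def reactive_policy(position: int) -> int:
    """No model of consequences: take the biggest move that does not overshoot."""
    return min(max(ACTIONS), GOAL - position)


def plan(position: int, horizon: int = 3) -> Optional[Tuple[int, ...]]:
    """Roll candidate action sequences through the simulator, shortest first,
    and return the first one that reaches the goal without hitting the hazard."""
    for length in range(1, horizon + 1):
        for seq in product(ACTIONS, repeat=length):
            pos: Optional[int] = position
            for action in seq:
                pos = simulate(pos, action)
                if pos is None:  # imagined failure: discard this sequence
                    break
            if pos == GOAL:
                return seq
    return None
```

From cell 0 the reactive policy jumps straight onto the hazard, while `plan(0)` finds a sequence that steps, jumps over it, and steps again. The planner is only as good as `simulate`: if the simulator misplaces the hazard, the imagined rollouts approve the wrong plan.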
The third mistake is assuming the terms map neatly to companies. They do not. A single company can build renderer, simulator, and planner components. A single product may expose only one surface while depending on another underneath.
The right question is not "is this a world model?" The right question is: which part of the agent-world loop does this model learn, and how reliable is that part?
How To Use This
Use the taxonomy as a filter whenever you read a world-model claim.
Ask:
- What does the model output? Pixels, state, actions, or some mixture?
- What is the reliability contract? Visual plausibility, physical accuracy, or task success?
- Who consumes the output? Humans, software, robots, or other models?
- What failure matters? Weird visuals, broken geometry, bad physics, unsafe actions, poor transfer?
- What data trained it? Internet video, 3D worlds, robot demonstrations, physics simulation, real deployment feedback?
- Can it interact? Passive generation is different from closed-loop response.
This keeps the discussion grounded. A renderer can be valuable without being a robot simulator. A planner can be impressive without being general. A simulator can be strategically central even if it is less flashy than video.
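The first two checklist questions can be written down as a lookup table. The sketch below is one reading of the taxonomy, not the article's own formalism: pairing each output with a reliability contract follows the order of the checklist above, and the names `Output`, `TAXONOMY`, and `classify` are hypothetical.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Output(Enum):
    PIXELS = "pixels"
    STATE = "state"
    ACTIONS = "actions"


# (function, reliability contract) for each kind of output.
TAXONOMY = {
    Output.PIXELS: ("renderer", "visual plausibility"),
    Output.STATE: ("simulator", "physical accuracy"),
    Output.ACTIONS: ("planner", "task success"),
}


def classify(outputs: Iterable[Output]) -> List[Tuple[str, str]]:
    """Map each declared output to the function it implies and the contract to test."""
    return [TAXONOMY[o] for o in outputs]
```

A system that outputs both pixels and state, for example, should be held to both contracts rather than judged on its visuals alone.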
Key Terms
- World model: A model that captures some useful structure of an environment, often to predict observations, states, or action consequences.
- Observation: What an agent can perceive: pixels, sensor data, audio, proprioception, or any partial view of the world.
- State: The underlying configuration of the world: objects, positions, velocities, geometry, materials, forces, and dynamics.
- Action: An intervention chosen by an agent that changes, or attempts to change, the world.
- Renderer: A model that outputs observations, usually visual frames or scenes.
- Simulator: A model or system that outputs computable state and dynamics.
- Planner: A model or system that outputs actions given observations and goals.
- POMDP: Partially observable Markov decision process. A formal model for decision-making where the agent cannot directly observe the full state of the environment.
- Sim-to-real gap: The mismatch between behaviour in simulation and behaviour in the real world.
- Spatial intelligence: The ability to perceive, reason about, generate, and act within real or virtual space.
Recall Questions
- What is the difference between state and observation?
- Why can a visually impressive renderer still be useless for robot training?
- What does a simulator output that a renderer usually does not?
- Why does the World Labs taxonomy treat planners as the inverse of renderers?
- What makes simulation the bridge between rendering and planning?
- Why are 3D and robot interaction data harder to get than internet video?
- What does the sim-to-real gap mean?
- Why is "is this a world model?" a weaker question than "what function does this world model perform?"
Best Resources To Learn More
- Fei-Fei Li and World Labs' taxonomy essay is the core source. Read it for the renderer/simulator/planner frame.
- Sutton and Barto's Reinforcement Learning: An Introduction gives the agent-environment loop that underlies the taxonomy.
- Ha and Schmidhuber's World Models is the canonical modern reinforcement-learning treatment of agents that learn inside imagined environments.
- World Labs' Marble and RTFM posts show the company's product and research direction: persistent 3D worlds and real-time frame generation.
- Google DeepMind's Genie work is useful for understanding interactive generated environments rather than passive video generation.
Sources
- Fei-Fei Li / World Labs team, "A Functional Taxonomy of World Models," World Labs, June 3, 2026. https://www.worldlabs.ai/blog/taxonomy-of-world-models
- Fei-Fei Li, X post linking to the article, June 3, 2026. https://x.com/drfeifei/status/2062247238143996275
- Fei-Fei Li, "From Words to Worlds: Spatial Intelligence is AI's Next Frontier," November 10, 2025. https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018. http://www.incompleteideas.net/book/the-book-2nd.html
- David Ha and Jürgen Schmidhuber, "World Models," 2018. https://worldmodels.github.io/
- World Labs team, "Marble: A Multimodal World Model," November 12, 2025. https://www.worldlabs.ai/blog/marble-world-model
- World Labs team, "RTFM: A Real-Time Frame Model," October 16, 2025. https://www.worldlabs.ai/blog/rtfm
- Google DeepMind, "Genie 3: A new frontier for world models," August 5, 2025. https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/