World Models for Robotics
A world model in robotics is a learned internal representation of the physical environment that allows a robot to predict the consequences of its actions before executing them. Where vision-language-action (VLA) models are reactive — they map current observations to immediate actions — world models are predictive: they simulate forward in time, letting the robot "imagine" what will happen if it pushes an object, takes a step, or reaches for a tool. This enables planning, risk assessment, and multi-step reasoning about physical interactions.
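In the abstract, a world model exposes a one-step prediction function plus an "imagination" rollout built on top of it. The following toy sketch illustrates that interface; the class name and the linear dynamics are illustrative stand-ins for a trained network, not any specific NVIDIA model:

```python
import numpy as np

class ToyWorldModel:
    """Minimal learned dynamics model: predicts the next state from
    (state, action). The fixed linear map stands in for a network
    that would be learned from data."""

    def __init__(self, A, B):
        self.A = A  # state-transition weights
        self.B = B  # action-effect weights

    def predict(self, state, action):
        # One-step prediction: s' = A s + B a
        return self.A @ state + self.B @ action

    def imagine(self, state, actions):
        """Roll the model forward through a sequence of actions
        without touching the real environment."""
        trajectory = [state]
        for a in actions:
            state = self.predict(state, a)
            trajectory.append(state)
        return trajectory

# Imagine pushing an object: state = [position, velocity]
model = ToyWorldModel(A=np.array([[1.0, 0.1], [0.0, 0.9]]),
                      B=np.array([[0.0], [0.1]]))
traj = model.imagine(np.array([0.0, 0.0]), [np.array([1.0])] * 3)
```

The `imagine` loop is what separates a world model from a reactive policy: the robot evaluates a whole action sequence before moving.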
NVIDIA Cosmos and Kosmos
NVIDIA's Cosmos world foundation models are the most commercially significant world models for robotics in 2026. Cosmos models generate photorealistic predictions of future visual states given current observations and proposed actions. They serve a dual purpose: as data generators (creating vast quantities of synthetic training trajectories for sim-to-real pipelines — 780,000 trajectories in 11 hours) and as planning modules embedded in robot control stacks.
At GTC 2026, NVIDIA announced Kosmos as an open-source vision world model alongside Nemotron (language), Alpamayo (autonomous driving), and GR00T (physical AI). The Kosmos model powers a new generation of robotics collaborations, including a partnership with Disney and Google DeepMind to create character robots using the Newton physics engine combined with Kosmos world predictions. These entertainment robots represent one facet of what Jensen Huang called "the world's first large-scale deployment of physical AI" — spanning autonomous vehicles, industrial robots, operating room assistance, and character animation.
Cosmos Transfer takes a small number of real demonstrations and generates exponentially more synthetic variations — different objects, lighting conditions, viewpoints, and physical parameters — that maintain physical plausibility. This addresses the data bottleneck that limits robot learning.
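Cosmos Transfer's actual conditioning interface is far richer than anything shown here, but the core idea — a few real demonstrations fanned out into many physically varied synthetic ones — can be sketched as parameter randomization. All parameter names below are hypothetical:

```python
import random

def augment(demo, n_variants, seed=0):
    """Fan one real demonstration out into many synthetic variants by
    randomizing scene and physics parameters (names are hypothetical)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        variants.append({
            "base_demo": demo,
            "lighting": rng.uniform(0.3, 1.0),      # relative brightness
            "viewpoint_yaw": rng.uniform(-30, 30),  # camera yaw, degrees
            "friction": rng.uniform(0.4, 1.2),      # surface friction scale
            "object_scale": rng.uniform(0.8, 1.2),  # object size multiplier
        })
    return variants

# Two real demos become a hundred physically varied synthetic ones.
synthetic = []
for i, demo in enumerate(["demo_A", "demo_B"]):
    synthetic.extend(augment(demo, 50, seed=i))
```

The multiplier is what matters: each randomized dimension compounds, which is why a small seed set of real trajectories can yield training data at the scale the text describes.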
DreamZero and GR00T N2
NVIDIA's DreamZero research, previewed at GTC 2026 as the foundation for GR00T N2, represents the next generation: a world model that does not just predict outcomes but also generates action plans. Rather than requiring a separate VLA to select actions and a world model to evaluate them, DreamZero integrates action generation and world prediction in a single model. The robot simultaneously imagines what to do and what will happen if it does it — a closed loop of imagination and planning.
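DreamZero's internals have not been published in detail, so the following is only an interface sketch of the idea: one model with an action head and a prediction head, rather than a separate policy and evaluator. The toy "network" is a linear rule that steps halfway toward a goal:

```python
import numpy as np

class JointPolicyWorldModel:
    """One model with two heads: an action head that proposes what to
    do and a prediction head that imagines the resulting state. The
    linear rules below stand in for a single trained network."""

    def __init__(self, goal):
        self.goal = goal

    def step(self, state):
        action = 0.5 * (self.goal - state)   # action head
        predicted_next = state + action      # prediction head
        return action, predicted_next

model = JointPolicyWorldModel(goal=np.array([1.0, 1.0]))
state = np.array([0.0, 0.0])
plan = []
for _ in range(5):  # imagine and plan in one closed loop
    action, state = model.step(state)
    plan.append(action)
```

Contrast this with the two-model pipeline: there is no separate call to a world model to score each candidate action — proposal and imagined outcome come out of the same forward pass.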
Large-Scale Physical AI Deployment
The robotaxi sector illustrates the scale of physical AI deployment enabled by world models. At GTC 2026, NVIDIA announced autonomous vehicle partnerships with BYD, Hyundai, Nissan, and Geely (joining existing partners Mercedes-Benz, Toyota, and GM), covering 18 million vehicles annually. These systems rely on world models to predict traffic behavior, pedestrian movement, and environmental changes — the same predict-then-act loop that powers robotic manipulation, just applied at vehicle speed on public roads.
Why World Models Matter
- Multi-step planning: A VLA can pick up a cup. A world model can plan: pick up the cup, walk to the counter, set it down, open the cabinet, put it on the shelf. Each step depends on the result of the previous step.
- Safety: Before executing an action, the robot can predict whether it will cause damage, collision, or failure. A cobot that can predict when its arm trajectory will enter a human's workspace can preemptively adjust.
- Generalization: World models capture physics, not specific trajectories. A robot that understands how gravity, friction, and rigid-body dynamics work can generalize to novel objects and situations.
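The planning and safety points above combine naturally in one sketch: a random-shooting planner that scores candidate action sequences entirely inside an imagined rollout and discards any sequence predicted to enter a keep-out region. The `dynamics` function here is a trivial stand-in for a learned world model:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    # Stand-in for a learned world model's one-step prediction.
    return state + 0.1 * action

def unsafe(state, keepout_center=np.array([0.5, 0.5]), radius=0.2):
    # Predicted entry into a human's workspace counts as unsafe.
    return np.linalg.norm(state - keepout_center) < radius

def plan(start, goal, n_candidates=200, horizon=10):
    """Random-shooting planner: imagine rollouts for many candidate
    action sequences, reject unsafe ones, keep the best survivor."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        state, safe = start.copy(), True
        for a in actions:
            state = dynamics(state, a)
            if unsafe(state):
                safe = False
                break
        if safe:
            cost = np.linalg.norm(state - goal)
            if cost < best_cost:
                best_seq, best_cost = actions, cost
    return best_seq, best_cost

seq, cost = plan(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```

Every rejected rollout is a collision that never happened on the real robot — the safety check runs in imagination, before any motor moves.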
Relationship to Physics Simulation
World models and traditional physics simulation (like NVIDIA Isaac Sim or MuJoCo) serve overlapping but distinct roles. Physics simulators are hand-engineered and accurate but slow; world models are learned from data and potentially faster. The emerging approach combines both: use physics simulation where it's reliable, use learned world models where simulation falls short, and use the discrepancy as a training signal.
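One simple instance of that combination is residual learning: run the hand-engineered simulator, observe where its predictions diverge from reality, and fit a correction on the discrepancy. A minimal sketch, with a deliberately miscalibrated friction term standing in for "where simulation falls short":

```python
import numpy as np

def physics_step(state, action, dt=0.1):
    # Hand-engineered simulator: point mass whose friction
    # coefficient is wrong for the real surface.
    pos, vel = state
    vel = vel + dt * (action - 0.1 * vel)   # assumed friction 0.1
    return np.array([pos + dt * vel, vel])

def real_step(state, action, dt=0.1):
    # Stand-in for the real system: true friction is 0.5.
    pos, vel = state
    vel = vel + dt * (action - 0.5 * vel)
    return np.array([pos + dt * vel, vel])

# Discrepancy as training signal: fit a linear correction from
# observed (sim prediction, real outcome) pairs.
states = [np.array([0.0, v]) for v in np.linspace(-2, 2, 21)]
X = np.array([physics_step(s, 0.0) for s in states])
Y = np.array([real_step(s, 0.0) for s in states])
residual, *_ = np.linalg.lstsq(X, Y - X, rcond=None)

def hybrid_step(state, action):
    sim = physics_step(state, action)
    return sim + sim @ residual   # physics where reliable + learned fix
```

The analytic simulator provides the structure; the learned residual absorbs only what the simulator gets wrong, which is far less data-hungry than learning the full dynamics from scratch.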
Further Reading
- NVIDIA Accelerates Robotics R&D with New Open Models — NVIDIA
- The State of AI Agents in 2026 — Jon Radoff