Vision-Language-Action Models
Vision-Language-Action models (VLAs) are multimodal foundation models that take visual input (camera images) and natural language instructions, and output motor actions — joint torques, gripper positions, and body movements — that a robot can execute in the physical world. They are the robotics equivalent of large language models: general-purpose learned systems that replace hand-coded control logic with data-driven behavior. VLAs are the core intelligence layer of the 2026 humanoid robot generation.
Architecture: Two Systems
The leading VLA architectures follow a dual-system design inspired by Kahneman's distinction between fast, intuitive "System 1" thinking and slow, deliberate "System 2" thinking. NVIDIA's GR00T N1 makes this explicit: System 2 is a vision-language model that reasons about the environment and the instruction, planning what to do ("pick up the red cup and place it on the shelf"). System 1 is a fast action model that translates those plans into continuous motor commands — the reflexive, sub-second control loop that keeps the robot balanced and moving smoothly. System 2 thinks; System 1 acts. The two operate at different timescales: System 2 updates plans every few seconds, while System 1 runs at hundreds of hertz.
This mirrors how a human operates: you decide to pick up a cup (slow, deliberate), but the actual hand movement — adjusting grip force, correcting for the cup's weight, compensating for your arm's momentum — happens automatically (fast, reflexive). VLAs embed both capabilities in a single model architecture.
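The two-timescale structure can be sketched as a pair of nested loops: a slow planner that refreshes the goal every few seconds, and a fast controller that emits a motor command every tick conditioned on the latest plan. This is a minimal illustration of the pattern, not any vendor's actual API — the planner and controller below are toy stand-ins for the real models.

```python
from dataclasses import dataclass

SYSTEM1_HZ = 200          # fast control loop runs at hundreds of hertz
REPLAN_EVERY_STEPS = 400  # System 2 replans every ~2 seconds at 200 Hz

@dataclass
class Plan:
    instruction: str
    subgoal: str

def system2_plan(image, instruction):
    """Slow path: a vision-language model would reason here."""
    return Plan(instruction, subgoal="reach toward target for: " + instruction)

def system1_act(plan, proprioception):
    """Fast path: map the current plan + robot state to a motor command.
    A toy proportional controller stands in for a learned action head."""
    return [0.1 * (0.0 - q) for q in proprioception]

def run(steps, instruction="pick up the red cup"):
    plan = None
    joints = [0.5, -0.3, 0.8]  # toy joint state
    history = []
    for t in range(steps):
        if t % REPLAN_EVERY_STEPS == 0:       # System 2: every few seconds
            plan = system2_plan(None, instruction)
        action = system1_act(plan, joints)    # System 1: every tick
        joints = [q + a for q, a in zip(joints, action)]
        history.append(action)
    return plan, history
```

The point of the structure is that the expensive reasoning call sits outside the inner loop, so control latency is bounded by the fast model alone.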
Key Models
Figure AI Helix: A proprietary VLA trained on 500+ hours of teleoperation data plus simulation. Powers the Figure 02 humanoid's ability to follow natural language instructions in warehouse and manufacturing settings. Helix represents the "vertical" approach: one company building both the model and the robot.
Physical Intelligence pi0: A cross-embodiment VLA designed to control any robot, not just one specific platform. Physical Intelligence's thesis is that robotic control is a foundation model problem: train on diverse manipulation data across many robot types, and the resulting model generalizes to new embodiments. Demonstrated on laundry folding, assembly, and tool use across different robot arms and hands.
NVIDIA GR00T N1/N2: The first open, commercially licensable humanoid VLA. GR00T N1.7 (March 2026) ships with generalized skills including advanced dexterous control. GR00T N2, previewed at GTC 2026, is based on DreamZero research and incorporates world models for action planning. NVIDIA's strategy: be the "Android of robotics" by providing the foundation model that every humanoid company builds on.
Google RT-2 / RT-X: Google DeepMind's Robotics Transformer 2, trained on both internet-scale vision-language data and robotic manipulation data. RT-2 demonstrated that web-scale pretraining transfers meaningfully to robot control — a robot trained partly on internet images understands concepts like "move the banana to the right" without needing robotic training data for every concept. The Open X-Embodiment dataset extends this across 22 robot platforms.
The Data Question
VLAs face a fundamental data scarcity problem that LLMs don't. The internet contains trillions of tokens of text and billions of images, but almost no robot action data. Robotic manipulation datasets are orders of magnitude smaller than language or vision datasets. The solutions being explored: imitation learning from human demonstrations, sim-to-real transfer from synthetic data, learning from internet videos of humans performing tasks, and cross-embodiment transfer where data from one robot type helps train policies for another. The scaling laws for robotic data suggest that environment and object diversity matters more than raw volume — a finding that makes data collection more tractable than it initially appears.
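The first of those strategies, imitation learning from demonstrations, reduces to supervised learning on (observation, action) pairs — behavior cloning. The sketch below shows the idea at its smallest scale, with a one-parameter linear policy standing in for a large network; the toy task ("act = 2 × observation") and all function names are invented for illustration.

```python
def fit_linear_policy(demos, lr=0.1, epochs=200):
    """Behavior cloning: fit action ~ w * obs + b to demo pairs by SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for obs, act in demos:
            err = (w * obs + b) - act  # prediction error on this demo
            w -= lr * err * obs        # gradient step on squared error
            b -= lr * err
    return w, b

# Toy demonstrations of a "move toward the target" skill: act = 2 * obs
demos = [(o / 10.0, 2 * o / 10.0) for o in range(-5, 6)]
w, b = fit_linear_policy(demos)
```

A real VLA does the same thing with teleoperation trajectories as the demos and a multimodal transformer as the policy; the diversity finding above suggests the demo set should vary environments and objects rather than just repeat one scene.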
What VLAs Change
Before VLAs, programming a robot to perform a new task meant writing new code — specifying the exact sequence of motions, the sensor thresholds, and the error recovery procedures. Each new task was a new engineering project. VLAs invert this: you show the robot what to do (via demonstration or instruction), and it generalizes. This shifts robotics from a programming discipline to a training discipline — the same paradigm shift that LLMs brought to software. The implication for the agentic economy is that physical agents can now be deployed and retrained at software speed, not hardware engineering speed.
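The contrast can be made concrete: pre-VLA, each task is a bespoke function; with a VLA, one generic call is reused for every task, and only the instruction changes. Both interfaces below are hypothetical stubs for illustration, not a real robot SDK.

```python
class Robot:
    """Stub robot that just logs commands (illustration only)."""
    def __init__(self):
        self.log = []
    def move_to(self, xyz):
        self.log.append(("move_to", xyz))
    def close_gripper(self, force):
        self.log.append(("close_gripper", force))
    def open_gripper(self):
        self.log.append(("open_gripper", None))

def pick_and_place_handcoded(robot):
    """Pre-VLA: the task is an explicit, hand-written motion sequence."""
    robot.move_to([0.4, 0.0, 0.2])
    robot.close_gripper(force=5.0)
    robot.move_to([0.1, 0.3, 0.4])
    robot.open_gripper()

def run_vla(policy, image, instruction):
    """With a VLA: one generic call, reused unchanged for every new task."""
    return policy(image, instruction)
```

Adding a new task in the first style means writing another function like `pick_and_place_handcoded`; in the second, it means supplying new demonstrations or a new instruction string — which is exactly the programming-to-training shift described above.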
Further Reading
- NVIDIA Announces Isaac GR00T N1 — NVIDIA
- The State of AI Agents in 2026 — Jon Radoff