Robotic Manipulation Datasets

Robotic manipulation datasets are curated collections of robot demonstration data — synchronized camera images, joint positions, gripper states, and force measurements recorded during manipulation tasks — that serve as training data for VLA models and other robot learning systems. They are the robotics equivalent of ImageNet (which catalyzed the deep learning revolution in computer vision) or Common Crawl (which enabled large language models). The scarcity and fragmentation of robotic data constitute the single biggest bottleneck limiting robot foundation model progress.
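A single demonstration of the kind described above can be pictured as a sequence of synchronized records. The sketch below is a hypothetical schema, not any dataset's actual format; every field name is illustrative:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimeStep:
    """One synchronized frame of a manipulation demonstration.

    Field names and shapes are illustrative, not a standard schema.
    """
    image: bytes                   # JPEG-encoded camera frame
    joint_positions: List[float]   # e.g. 7 values for a 7-DoF arm
    gripper_open: float            # 0.0 = closed, 1.0 = open
    wrench: List[float]            # wrist force/torque, 6 values
    action: List[float]            # commanded joint deltas for this step

@dataclass
class Trajectory:
    task: str                      # natural-language instruction
    robot: str                     # embodiment identifier
    steps: List[TimeStep] = field(default_factory=list)

# A minimal one-step trajectory for a pick task.
traj = Trajectory(task="pick up the red block", robot="franka_panda")
traj.steps.append(TimeStep(
    image=b"...", joint_positions=[0.0] * 7,
    gripper_open=1.0, wrench=[0.0] * 6, action=[0.01] * 7,
))
```

Real datasets store thousands to millions of such trajectories, typically in serialized episode formats rather than in-memory objects.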

The Data Problem

Language models train on trillions of tokens scraped from the internet. Vision models train on billions of images. Robotic manipulation datasets contain, at best, millions of trajectories: orders of magnitude fewer. The reason is physical: every data point requires a real robot (or a high-fidelity simulation) to actually perform an action in a real (or simulated) environment. You can't scrape robot data from the web because the web contains text and images, not motor commands. This creates a fundamental asymmetry: the AI techniques driving robot progress (transformers, foundation models, scaling) come from domains with abundant data, while robotics has scarce data.

Key Datasets

Open X-Embodiment: A Google DeepMind-led collaboration across 21 research institutions, combining demonstration data from 22 different robot embodiments (arms, hands, mobile manipulators) into a single dataset. The key insight: training a model on data from many different robot types produces a more generalizable policy than training on data from a single robot. This cross-embodiment transfer is the foundation of Physical Intelligence's pi0 model and Google's RT-X.
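One practical prerequisite for cross-embodiment training is mapping each robot's native action space into a shared format. The toy function below pads or truncates robot-specific actions into a single 7-D vector (6 end-effector channels plus a gripper channel); this convention and the padding/truncation rule are simplifying assumptions for illustration, not Open X-Embodiment's actual scheme:

```python
import numpy as np

UNIFIED_DIM = 7  # assumed convention: 6 end-effector channels + 1 gripper channel

def to_unified(action: np.ndarray, gripper: float) -> np.ndarray:
    """Pad (or truncate) a robot-specific action into the shared 7-D format."""
    ee = np.zeros(UNIFIED_DIM - 1)
    n = min(len(action), UNIFIED_DIM - 1)
    ee[:n] = action[:n]          # copy as many channels as the robot provides
    return np.concatenate([ee, [gripper]])

# A 6-DoF arm and a 4-DoF arm both land in the same action space,
# so their demonstrations can be batched into one training set.
a1 = to_unified(np.array([0.01, -0.02, 0.0, 0.1, 0.0, 0.0]), gripper=1.0)
a2 = to_unified(np.array([0.05, 0.0, -0.01, 0.2]), gripper=0.0)
```

Once all embodiments share an action format, a single policy network can be trained on the pooled data.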

DROID: A large-scale dataset of diverse robot manipulation demonstrations collected across multiple institutions, designed specifically for training generalizable manipulation policies. DROID standardizes the data format and collection protocol, making it easier to combine data from different labs.

RH20T: A large multimodal manipulation dataset that pairs robot trajectories with corresponding human demonstration videos, motivated by the insight that the internet contains billions of hours of humans performing tasks with their hands. Mapping human hand data to robot actions is nontrivial but potentially unlocks a much larger data source than robot-specific demonstrations.
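At its crudest, mapping a human hand to a robot command means reducing the hand to a wrist pose plus a pinch aperture. The sketch below is deliberately simplistic (real retargeting handles full hand kinematics and camera calibration), and the `max_aperture` value is an assumed gripper parameter:

```python
import numpy as np

def retarget(thumb_tip: np.ndarray, index_tip: np.ndarray,
             wrist_pos: np.ndarray, max_aperture: float = 0.08) -> dict:
    """Map a detected human hand to a parallel-jaw gripper command.

    Inputs are 3-D points in metres; max_aperture is the gripper's
    physical opening width (value assumed here for illustration).
    """
    pinch = float(np.linalg.norm(thumb_tip - index_tip))
    # Normalize thumb-index distance into a [0, 1] gripper opening.
    gripper = float(np.clip(pinch / max_aperture, 0.0, 1.0))
    return {"ee_position": wrist_pos.tolist(), "gripper_open": gripper}

# Fingers held fully apart map to a fully open gripper at the wrist position.
cmd = retarget(np.array([0.0, 0.0, 0.0]), np.array([0.08, 0.0, 0.0]),
               np.array([0.3, 0.0, 0.2]))
```

The hard parts this sketch omits (occlusion, differing kinematics, contact forces) are exactly why human-to-robot transfer remains an open research problem.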

Scaling Laws and Diversity

Research presented at ICLR 2026 demonstrated that robotic imitation learning follows power-law data scaling laws, but with a crucial nuance: environment and object diversity matters far more than raw volume. A policy trained on 50 demonstrations in each of 32 different environments generalizes better than one trained on 1,600 demonstrations in a single environment. This finding reshapes the data collection strategy: rather than maximizing quantity in controlled settings, the priority is maximizing diversity across settings, objects, and conditions. Four data collectors working one afternoon across diverse environments can produce policies achieving 90% success in novel settings.
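The diversity-over-volume finding can be illustrated with a toy power-law model in which environment count carries a larger exponent than per-environment demo count. Every constant below is invented for illustration and not fitted to any real study; only the qualitative comparison matters:

```python
def generalization_score(n_envs: int, demos_per_env: int,
                         alpha: float = 0.5, beta: float = 0.2) -> float:
    """Toy power-law model of policy success in novel settings.

    Environment diversity dominates (exponent alpha) while per-environment
    demo count saturates quickly (exponent beta). All constants are
    illustrative assumptions.
    """
    return 1.0 - (n_envs ** -alpha) * (demos_per_env ** -beta)

# Same total budget of 1,600 demonstrations, allocated two ways.
diverse = generalization_score(32, 50)        # 50 demos in each of 32 environments
concentrated = generalization_score(1, 1600)  # 1,600 demos in one environment
```

Under this model the diverse allocation wins decisively, mirroring the paper's 32-environment result.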

Synthetic Data and the NVIDIA Pipeline

The most scalable approach to the data bottleneck is synthetic generation via simulation. NVIDIA's Cosmos world foundation models generated 780,000 synthetic manipulation trajectories in 11 hours — equivalent to 9 months of human demonstrations. This transforms the economics: instead of hiring teleoperators, a simulation pipeline produces diverse training data at the cost of compute rather than labor. The Sim2Real-VLA model demonstrated that policies trained primarily on synthetic data can achieve robust real-world performance with minimal sim-to-real gap. The endgame is likely a mixture: human demonstrations for seed data and quality anchoring, simulation for massive scale and diversity augmentation.
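The human/synthetic mixture described above amounts to a sampling policy at training time. The sketch below draws batches at a fixed human-to-synthetic ratio; the `human_frac` value is an assumed hyperparameter, not a published recipe:

```python
import random

def sample_batch(human, synthetic, batch_size=8, human_frac=0.25, seed=0):
    """Draw a training batch mixing two data sources at a fixed ratio.

    human_frac is an illustrative hyperparameter; real pipelines tune
    mixture weights (and often anneal them) empirically.
    """
    rng = random.Random(seed)
    n_human = round(batch_size * human_frac)
    batch = [rng.choice(human) for _ in range(n_human)]
    batch += [rng.choice(synthetic) for _ in range(batch_size - n_human)]
    rng.shuffle(batch)  # avoid a fixed source ordering within the batch
    return batch

# 100 scarce human demos anchor quality; 10,000 cheap synthetic ones add scale.
human = [f"human_{i}" for i in range(100)]
synthetic = [f"sim_{i}" for i in range(10000)]
batch = sample_batch(human, synthetic)
```

Upweighting the scarce human data relative to its raw share is the standard way to keep a large synthetic corpus from drowning out the quality anchor.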