NVIDIA Unveils Open Physical AI Dataset to Advance Robotics and Autonomous Vehicle Development

Teaching autonomous robots and vehicles how to interact with the physical world requires vast amounts of high-quality data. To give researchers and developers a head start, NVIDIA is releasing a massive, open-source dataset for building the next generation of physical AI.

Announced at NVIDIA GTC, this commercial-grade, pre-validated dataset can help researchers and developers kickstart physical AI projects that can be prohibitively difficult to start from scratch. Developers can either directly use the dataset for model pretraining, testing and validation — or use it during post-training to fine-tune world foundation models, accelerating the path to deployment.

The initial dataset is now available on Hugging Face, offering developers 15 terabytes of data representing more than 320,000 trajectories for robotics training, plus up to 1,000 Universal Scene Description (OpenUSD) assets, including a SimReady collection. Dedicated data to support end-to-end autonomous vehicle (AV) development — which will include 20-second clips of diverse traffic scenarios spanning over 1,000 cities across the U.S. and two dozen European countries — is coming soon.
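For developers who want to explore the release without pulling all 15 terabytes, the Hugging Face hub client supports filtered downloads. The sketch below is illustrative only: the repository ID and file patterns are assumptions, so check the actual dataset card on Hugging Face for the real names and layout.

```python
# Minimal sketch of fetching a slice of a large Hugging Face dataset.
# Both the repo ID and the file patterns below are hypothetical --
# consult the dataset card for the real repository name and structure.
ALLOW_PATTERNS = ["*.usd", "trajectories/**"]  # assets + trajectory files only


def fetch_subset(repo_id: str = "nvidia/PhysicalAI",  # hypothetical repo ID
                 local_dir: str = "./physical_ai") -> str:
    """Download only files matching ALLOW_PATTERNS instead of the full dataset."""
    # Lazy import so the sketch can be read without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )
```

Filtering with `allow_patterns` matters here because a full snapshot of a multi-terabyte dataset is rarely what a single experiment needs.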

[Image: SimReady assets in the NVIDIA Physical AI Dataset]

This dataset will grow over time to become the world’s largest unified and open dataset for physical AI development. It could be applied to develop AI models to power robots that safely maneuver warehouse environments, humanoid robots that support surgeons during procedures and AVs that can navigate complex traffic scenarios like construction zones.

The NVIDIA Physical AI Dataset is slated to contain a subset of the real-world and synthetic data NVIDIA uses to train, test and validate physical AI for the NVIDIA Cosmos world model development platform, the NVIDIA DRIVE AV software stack, the NVIDIA Isaac AI robot development platform and the NVIDIA Metropolis application framework for smart cities.

Early adopters include the Berkeley DeepDrive Center at the University of California, Berkeley; the Carnegie Mellon Safe AI Lab; and the Contextual Robotics Institute at the University of California, San Diego.

“We can do a lot of things with this dataset, such as training predictive AI models that help autonomous vehicles better track the movements of vulnerable road users like pedestrians to improve safety,” said Henrik Christensen, director of multiple robotics and autonomous vehicle labs at UC San Diego. “A dataset that provides a diverse set of environments and longer clips than existing open-source resources will be tremendously helpful to advance robotics and AV research.”

This open dataset, comprising thousands of hours of multicamera video of unprecedented diversity, scale and geographic coverage, will particularly benefit the field of safety research by enabling new work on identifying outliers and assessing model generalization performance. The effort contributes to NVIDIA Halos, the company's full-stack AV safety system.

In addition to harnessing the NVIDIA Physical AI Dataset to help meet their data needs, developers can further boost AI development with tools like NVIDIA NeMo Curator, which processes vast datasets efficiently for model training and customization. Robotics developers can also tap the new NVIDIA Isaac GR00T blueprint for synthetic manipulation motion generation, a reference workflow built on NVIDIA Omniverse and NVIDIA Cosmos.

