Manipulate-Anything:
Automating Real-World Robots using Vision-Language Models

1University of Washington 2NVIDIA
3Allen Institute for Artificial Intelligence 4Universidad Católica San Pablo

* Equal contribution

Abstract

Large-scale endeavors like RT-1 [1] and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and they can only interact with a few object instances. We propose MANIPULATE-ANYTHING, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method operates in real-world environments without any privileged state information or hand-designed skills, and it can manipulate any static object. We evaluate our method using two setups. First, MANIPULATE-ANYTHING successfully generates trajectories for all 7 real-world and 14 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, MANIPULATE-ANYTHING's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or with data generated by VoxPoser, Scaling-up-Distilling-Down, and Code-As-Policies [5]. We believe MANIPULATE-ANYTHING can be a scalable method both for generating data for robotics and for solving novel tasks in a zero-shot setting.

Real World Experiments

[Video gallery: Manipulate-Anything applied to real-world tasks.]

Simulation

[Video gallery: zero-shot data generated by each method applied to each task, and rollouts of policies distilled from each method's generated data.]
Manipulate-Anything Framework

MANIPULATE-ANYTHING is an automated method for robot manipulation in real-world environments. Unlike prior methods, it does not require privileged state information or hand-designed skills, nor is it limited to manipulating a fixed set of object instances. It can guide a robot to accomplish a diverse set of unseen tasks on diverse objects. Furthermore, the generated data enables training behavior cloning policies that outperform policies trained with human demonstrations.

Manipulate-Anything System Overview

MANIPULATE-ANYTHING begins by feeding a scene representation and a natural language task instruction into a VLM, which identifies the relevant objects and decomposes the task into sub-tasks. For each sub-task, we provide multi-view images, verification conditions, and the sub-task goal to the action generation module, which produces a task-specific grasp pose or action code. Executing it leads to a temporary goal state, which the sub-task verification module assesses to trigger error recovery when needed. Once all sub-tasks are achieved, we filter the trajectories to retain only successful demonstrations for downstream policy training.
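To make the loop above concrete, here is a minimal sketch of the decomposition / action-generation / verification cycle. The SubTask dataclass and every callable passed in (decompose, capture_views, generate_action, execute, verify) are hypothetical placeholders for the VLM prompts and robot interfaces described above, not the authors' actual API.

    from dataclasses import dataclass
    from typing import Any, Callable, List, Optional

    @dataclass
    class SubTask:
        goal: str                           # natural-language goal for this sub-task
        verification_conditions: List[str]  # conditions checked after execution

    def manipulate_anything_loop(
        scene_obs: Any,
        task_instruction: str,
        decompose: Callable[[Any, str], List[SubTask]],       # VLM: objects + sub-tasks
        capture_views: Callable[[], List[Any]],                # multi-view images
        generate_action: Callable[[List[Any], SubTask], Any],  # grasp pose or action code
        execute: Callable[[Any], Any],                         # robot reaches a goal state
        verify: Callable[[Any, SubTask], bool],                # VLM-based sub-task check
        max_retries: int = 3,
    ) -> Optional[list]:
        """Return a successful trajectory, or None if a sub-task cannot be completed."""
        trajectory = []
        for sub_task in decompose(scene_obs, task_instruction):
            for _ in range(max_retries):
                views = capture_views()
                action = generate_action(views, sub_task)
                state = execute(action)              # temporary goal state
                trajectory.append((views, action))
                if verify(state, sub_task):          # success: move to next sub-task
                    break
                # otherwise: error recovery by re-generating and re-executing the action
            else:
                return None                          # failed demonstration, filtered out
        return trajectory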

Zero-shot empirical results

MANIPULATE-ANYTHING outperformed the other baselines on 10 out of 14 simulation tasks from RLBench. Each task was evaluated over 3 seeds to obtain the task-averaged success rate and standard deviation.

BC with zero-shot data generation methods

The behavior cloning policy trained on data generated by MANIPULATE-ANYTHING achieves the best performance on 10 out of 12 tasks compared to the other autonomous data generation baselines. For comparison, we report the success rate (%) of behavior cloning policies trained with data generated by VoxPoser and Code as Policies. Note that the RLBench baseline uses human expert demonstrations and is considered an upper bound for behavior cloning.
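As a point of reference, the sketch below shows how such generated demonstrations could be distilled into a policy with a standard behavior-cloning regression loop. The MLP architecture, hyperparameters, and flat tensor observations are illustrative assumptions; the policies evaluated here consume richer multi-view observations.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train_bc_policy(observations: torch.Tensor, actions: torch.Tensor,
                        epochs: int = 50, lr: float = 1e-4) -> nn.Module:
        """Fit a simple MLP policy to (observation, action) pairs by regression."""
        policy = nn.Sequential(
            nn.Linear(observations.shape[-1], 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, actions.shape[-1]),   # e.g. end-effector pose + gripper state
        )
        optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
        loader = DataLoader(TensorDataset(observations, actions),
                            batch_size=64, shuffle=True)
        for _ in range(epochs):
            for obs, act in loader:
                loss = nn.functional.mse_loss(policy(obs), act)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return policy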

Action Distribution

We compare the action distribution of the data generated by each method against human demonstrations collected via RLBench on the same set of tasks. We observe a high similarity between the distribution of our generated data and the human-generated data. This is further supported by the computed CD between each method's data and the RLBench data, where ours yields the lowest value (CD = 0.056).
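For reference, the sketch below computes a generic symmetric Chamfer distance between two sets of action vectors, a common way to compare such distributions. Treating the CD reported above as exactly this metric is our assumption, not something stated here.

    import numpy as np

    def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
        """Symmetric Chamfer distance between action sets a (N, D) and b (M, D)."""
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
        return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())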

BibTeX

@article{duan2024manipulate,
  title   = {Manipulate-Anything: Automating Real-World Robots using Vision-Language Models},
  author  = {Duan, Jiafei and Yuan, Wentao and Pumacay, Wilbert and Wang, Yi Ru and Ehsani, Kiana and Fox, Dieter and Krishna, Ranjay},
  journal = {arXiv preprint arXiv:2406.18915},
  year    = {2024}
}