Overview of GRAPE. Given a complex manipulation task (top), GRAPE first uses a vision-language model to decompose the task into several temporal stages and identifies the spatial keypoints essential for completing each stage's subtask. Given user-specified alignment goals, GRAPE then prompts a powerful vision-language model to obtain a cost function for each stage, where a lower cost implies higher alignment compliance. During iterative preference optimization (bottom), we sample multiple offline trajectories from the base VLA model and score each trajectory with its associated multi-stage costs; this score further incorporates the model's self-evaluation of the trajectory and a binary task-success indicator. We then rank the sampled trajectories by their scores to obtain a preference list and perform trajectory-wise preference optimization (TPO) to obtain an improved model, from which we sample further online trajectories and iterate until convergence.
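For concreteness, the following is a minimal Python sketch of one such sampling, scoring, ranking, and TPO iteration. All names here (Trajectory, Policy, tpo_update) and their signatures are hypothetical stand-ins derived from the description above, not the authors' implementation.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    actions: List   # sequence of robot actions
    success: bool   # binary task-success indicator

class Policy:
    """Stand-in for a VLA policy; rollout and self-evaluation are stubbed."""
    def rollout(self) -> Trajectory:
        raise NotImplementedError
    def self_evaluate(self, traj: Trajectory) -> float:
        raise NotImplementedError

def tpo_update(policy: Policy, ranked: List[Trajectory]) -> Policy:
    """Placeholder for the trajectory-wise preference-optimization update."""
    raise NotImplementedError

def score_trajectory(traj: Trajectory, cost_fns: List[Callable], policy: Policy) -> float:
    # Lower multi-stage cost means higher alignment compliance, so negate it.
    stage_score = -sum(c(traj) for c in cost_fns)
    # Add the model's self-evaluation and the binary task-success indicator.
    return stage_score + policy.self_evaluate(traj) + float(traj.success)

def grape_iteration(policy: Policy, cost_fns: List[Callable], n: int = 16) -> Policy:
    # 1. Sample offline trajectories from the current VLA policy.
    trajs = [policy.rollout() for _ in range(n)]
    # 2. Rank trajectories by score to form a preference list (best first).
    ranked = sorted(trajs, key=lambda t: score_trajectory(t, cost_fns, policy),
                    reverse=True)
    # 3. Run trajectory-wise preference optimization on the ranked preferences.
    return tpo_update(policy, ranked)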
Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the Simpler-Env environment. We report in-domain performance across four tasks, along with three generalization evaluations (subject, physical, and semantic), each of which incorporates multiple tasks.
Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the LIBERO environment. We report performance on four LIBERO task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long.
Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the real-world environment. We report in-domain performance across four tasks, along with five generalization evaluations (visual, subject, action, semantic, and language grounding), each of which incorporates multiple tasks. We also report the average performance across all tasks.
We select safe trajectories based on the GCPG reward, then apply TPO to the OpenVLA-SFT model to obtain GRAPE-Safety. The resulting policy learns to complete the task safely.
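As a rough illustration of this safety recipe, the snippet below filters rollouts by GCPG reward before preference training; gcpg_reward, the threshold value, and the helper names are assumptions for the sketch, not the authors' code.

def select_safe_trajectories(trajs, gcpg_reward, threshold=0.0):
    # Keep only trajectories whose GCPG reward clears the safety threshold;
    # the threshold itself is a hypothetical hyperparameter.
    return [t for t in trajs if gcpg_reward(t) >= threshold]

# Hypothetical flow: filter offline rollouts from OpenVLA-SFT, then run a TPO
# update (as in the earlier sketch) on the safe subset to obtain GRAPE-Safety.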
@misc{zhang2024grape,
  title={GRAPE: Generalizing Robot Policy via Preference Alignment},
  author={Zijian Zhang and Kaiyuan Zheng and Zhaorun Chen and Joel Jang and Yi Li and Chaoqi Wang and Mingyu Ding and Dieter Fox and Huaxiu Yao},
  year={2024},
  eprint={2411.19309},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2411.19309},
}