GRAPE: Generalizing Robot Policy via Preference Alignment

GRAPE Method

Overview of GRAPE. Given a complex manipulation task (top), GRAPE first uses a vision-language model to decompose the task into several temporal stages and to identify the spatial keypoints essential for completing each stage's subtask. Given user-specified alignment goals, GRAPE then prompts a powerful vision-language model to generate a cost function for each stage, where a lower cost implies higher compliance with the alignment goals. During iterative preference optimization (bottom), we sample multiple offline trajectories from the base VLA model and score each trajectory by combining its multi-stage costs with the model's self-evaluation of the trajectory and a binary task-success indicator. We then rank the sampled trajectories by these scores to obtain a preference list and perform trajectory-wise preference optimization to obtain an improved model, from which we sample further online trajectories and iterate until convergence.
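The sketch below illustrates one round of this loop under stated assumptions: it is not the authors' implementation, and the helpers `sample_trajectories`, `stage_costs`, `self_evaluation`, `task_success`, and the `log_prob` method on the policy are hypothetical placeholders. The scoring weights and the DPO-style trajectory-level loss are written in the generic form suggested by the description above.

```python
import torch.nn.functional as F

# Hypothetical helpers -- illustrative names, not part of any GRAPE codebase:
#   sample_trajectories(policy, task, n): roll out the VLA policy n times
#   stage_costs(traj):                    per-stage costs from the VLM-generated cost functions
#   self_evaluation(policy, traj):        policy's own score (e.g. log-likelihood) of the trajectory
#   task_success(traj):                   binary success indicator from the environment

def score_trajectory(policy, traj, lam_cost=1.0, lam_self=1.0, lam_succ=1.0):
    """Combine multi-stage costs, self-evaluation, and task success into one score."""
    cost = sum(stage_costs(traj))                # lower cost = higher alignment compliance
    return (-lam_cost * cost
            + lam_self * self_evaluation(policy, traj)
            + lam_succ * float(task_success(traj)))

def tpo_loss(policy, ref_policy, chosen, rejected, beta=0.1):
    """DPO-style trajectory-wise preference loss between a chosen and a rejected rollout."""
    logr_c = policy.log_prob(chosen) - ref_policy.log_prob(chosen)
    logr_r = policy.log_prob(rejected) - ref_policy.log_prob(rejected)
    return -F.logsigmoid(beta * (logr_c - logr_r))

def grape_iteration(policy, ref_policy, task, n_samples=8):
    """One round of sampling, scoring, ranking, and trajectory-wise preference optimization."""
    trajs = sample_trajectories(policy, task, n_samples)
    ranked = sorted(trajs, key=lambda t: score_trajectory(policy, t), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]     # best vs. worst trajectory in the ranking
    loss = tpo_loss(policy, ref_policy, chosen, rejected)
    loss.backward()                              # optimizer step omitted for brevity
    return loss
```

In practice the improved policy from one round is used to sample the online trajectories for the next, repeating until the scores stop improving.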

Experiments

Simulation Experiments

Simpler Env Experiments

Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the Simpler-Env environment. We report in-domain performance on four tasks, as well as three generalization evaluations (subject, physical, and semantic), each comprising multiple tasks.


LIBERO Experiments

Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the LIBERO environment. We report performance on four LIBERO task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long.


Real-world Experiments

Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the real-world environment. We report in-domain performance on four tasks, as well as five generalization evaluations (visual, subject, action, semantic, and language grounding), each comprising multiple tasks. We also report the average performance across all tasks.


Rollouts of GRAPE


Real-World Tasks

In-domain tasks
Language Grounding
Semantic Gen: New instructions
Action Gen: New actions (knock down)
Visual Gen: Different background
Subject Gen: New objects

Simpler-Env

In-domain tasks
Subject Gen: New objects
Semantic Gen: New instructions
Physical Gen: Different sizes/shapes

LIBERO

LIBERO-10
LIBERO-GOAL
LIBERO-Object
LIBERO-Spatial

Learned Safety Behaviors

We selected safe trajectories based on the GCPG reward, then applied TPO training to the OpenVLA-SFT model to obtain GRAPE-Safety. The resulting policy learns to complete the task safely. A sketch of this selection step follows below.
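The minimal sketch below shows how safe and unsafe rollouts could be paired for TPO training, assuming a hypothetical `gcpg_reward` scoring function, a hypothetical `collect_rollouts` helper, and the `tpo_loss` from the earlier sketch; the actual selection criteria and thresholds are the authors' and are not specified here.

```python
# Hypothetical sketch of the safety alignment step; `gcpg_reward` and
# `collect_rollouts` are placeholders, not GRAPE's actual API.

def build_safety_preferences(rollouts, safety_threshold=0.0):
    """Split rollouts into safe (chosen) and unsafe (rejected) sets by GCPG reward,
    then pair them up for trajectory-wise preference optimization (TPO)."""
    safe = [t for t in rollouts if gcpg_reward(t) >= safety_threshold]
    unsafe = [t for t in rollouts if gcpg_reward(t) < safety_threshold]
    # Pair each safe trajectory with an unsafe one as a (chosen, rejected) preference pair.
    return list(zip(safe, unsafe))

# Usage: TPO-train the OpenVLA-SFT policy on these pairs to obtain GRAPE-Safety.
# pairs = build_safety_preferences(collect_rollouts(policy, task))
# for chosen, rejected in pairs:
#     loss = tpo_loss(policy, ref_policy, chosen, rejected)
#     loss.backward()
```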

Unsafe behavior demonstrated by GRAPE
Safe behavior demonstrated by GRAPE-Safety

BibTeX

@misc{zhang2024grape,
  title={GRAPE: Generalizing Robot Policy via Preference Alignment},
  author={Zijian Zhang and Kaiyuan Zheng and Zhaorun Chen and Joel Jang and Yi Li and Chaoqi Wang and Mingyu Ding and Dieter Fox and Huaxiu Yao},
  year={2024},
  eprint={2411.19309},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2411.19309},
}