GRAPE: Generalizing Robot Policy via Preference Alignment

GRAPE Method

Overview of GRAPE. Given a complex manipulation task (top), GRAPE first uses a vision-language model to decompose the task into several temporal stages and to identify the spatial keypoints essential for completing each stage's subtask. Given user-specified alignment goals, GRAPE then prompts a powerful vision-language model to generate a cost function for each stage, where a lower cost implies higher compliance with the alignment goals. During iterative preference optimization (bottom), we sample multiple offline trajectories from the base VLA model and score each trajectory by combining its multi-stage costs with the model's self-evaluation of the trajectory and a binary task-success indicator. We then rank the sampled trajectories by these scores to obtain a preference list and perform trajectory-wise preference optimization to obtain an improved model, from which we sample further online trajectories and iterate until convergence.
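The sketch below illustrates one round of this loop under stated assumptions: it is not the authors' implementation, and the helpers `sample_trajectories`, `stage_costs`, `self_evaluation`, `task_success`, and the `log_prob` method on the policy are hypothetical placeholders. The scoring weights and the DPO-style trajectory-level loss are written in the generic form suggested by the description above.

```python
import torch.nn.functional as F

# Hypothetical helpers -- illustrative names, not part of any GRAPE codebase:
#   sample_trajectories(policy, task, n): roll out the VLA policy n times
#   stage_costs(traj):                    per-stage costs from the VLM-generated cost functions
#   self_evaluation(policy, traj):        policy's own score (e.g. log-likelihood) of the trajectory
#   task_success(traj):                   binary success indicator from the environment

def score_trajectory(policy, traj, lam_cost=1.0, lam_self=1.0, lam_succ=1.0):
    """Combine multi-stage costs, self-evaluation, and task success into one score."""
    cost = sum(stage_costs(traj))                # lower cost = higher alignment compliance
    return (-lam_cost * cost
            + lam_self * self_evaluation(policy, traj)
            + lam_succ * float(task_success(traj)))

def tpo_loss(policy, ref_policy, chosen, rejected, beta=0.1):
    """DPO-style trajectory-wise preference loss between a chosen and a rejected rollout."""
    logr_c = policy.log_prob(chosen) - ref_policy.log_prob(chosen)
    logr_r = policy.log_prob(rejected) - ref_policy.log_prob(rejected)
    return -F.logsigmoid(beta * (logr_c - logr_r))

def grape_iteration(policy, ref_policy, task, n_samples=8):
    """One round of sampling, scoring, ranking, and trajectory-wise preference optimization."""
    trajs = sample_trajectories(policy, task, n_samples)
    ranked = sorted(trajs, key=lambda t: score_trajectory(policy, t), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]     # best vs. worst trajectory in the ranking
    loss = tpo_loss(policy, ref_policy, chosen, rejected)
    loss.backward()                              # optimizer step omitted for brevity
    return loss
```

In practice the improved policy from one round is used to sample the online trajectories for the next, repeating until the scores stop improving.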

Experiments

Simulation Experiments

Simpler Env Experiments

Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the Simpler-Env environment. We report in-domain performance on four tasks, as well as three generalization evaluations (subject, physical, and semantic), each comprising multiple tasks.


LIBERO Experiments

Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the LIBERO environment. We report performance on four LIBERO task suites: LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long.


Real-world Experiments

Comparison of GRAPE with OpenVLA and Octo fine-tuned on the same data in the real-world environment. We report in-domain performance on four tasks, as well as five generalization evaluations (visual, subject, action, semantic, and language grounding), each comprising multiple tasks. We also report the average performance across all tasks.


Rollouts of GRAPE


Real-World Tasks

In-domain tasks
Language Grounding
Semantic Gen: New instructions
Action Gen: New actions (knock down)
Visual Gen: Different background
Subject Gen: New objects

Simpler-Env

In-domain tasks
Subject Gen: New objects
Semantic Gen: New instructions
Physical Gen: Different sizes/shapes

LIBERO

LIBERO-10
LIBERO-GOAL
LIBERO-Object
LIBERO-Spatial

Learned Safety Behaviors

We selected safe trajectories based on the GCPG reward, then applied TPO training to the OpenVLA-SFT model to obtain GRAPE-Safety. The resulting policy learns to complete the task safely. A sketch of this selection step follows below.
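The minimal sketch below shows how safe and unsafe rollouts could be paired for TPO training, assuming a hypothetical `gcpg_reward` scoring function, a hypothetical `collect_rollouts` helper, and the `tpo_loss` from the earlier sketch; the actual selection criteria and thresholds are the authors' and are not specified here.

```python
# Hypothetical sketch of the safety alignment step; `gcpg_reward` and
# `collect_rollouts` are placeholders, not GRAPE's actual API.

def build_safety_preferences(rollouts, safety_threshold=0.0):
    """Split rollouts into safe (chosen) and unsafe (rejected) sets by GCPG reward,
    then pair them up for trajectory-wise preference optimization (TPO)."""
    safe = [t for t in rollouts if gcpg_reward(t) >= safety_threshold]
    unsafe = [t for t in rollouts if gcpg_reward(t) < safety_threshold]
    # Pair each safe trajectory with an unsafe one as a (chosen, rejected) preference pair.
    return list(zip(safe, unsafe))

# Usage: TPO-train the OpenVLA-SFT policy on these pairs to obtain GRAPE-Safety.
# pairs = build_safety_preferences(collect_rollouts(policy, task))
# for chosen, rejected in pairs:
#     loss = tpo_loss(policy, ref_policy, chosen, rejected)
#     loss.backward()
```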

Unsafe behavior demonstrated by GRAPE
Safe behavior demonstrated by GRAPE-Safety

BibTeX

@misc{zhang2024grape,
  title={GRAPE: Generalizing Robot Policy via Preference Alignment},
  author={Zijian Zhang and Kaiyuan Zheng and Zhaorun Chen and Joel Jang and Yi Li and Chaoqi Wang and Mingyu Ding and Dieter Fox and Huaxiu Yao},
  year={2024},
  eprint={2411.19309},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2411.19309},
}