FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks

Overview of FLIP
Left: FLIP is trained on video datasets across different tasks, objects, and robots, with only one language description for each video as the goal.
Right: we train an interactive world model consisting of an action module for flow generation, a dynamics module for video generation, and a value module for assigning a value at each step. These modules can perform flow-centric model-based planning for manipulation tasks in the flow and video spaces.

Abstract

We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-CentrIc generative Planning (FLIP), a model-based planning algorithm in visual space that features three key modules: (1) a multi-modal flow generation model as the general-purpose action proposal module; (2) a flow-conditioned video generation model as the dynamics module; and (3) a vision-language representation learning model as the value module. Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return to accomplish the task. FLIP is able to synthesize long-horizon plans across objects, robots, and tasks with image flows as the general action representation, and the dense flow information also provides rich guidance for long-horizon video generation. In addition, the synthesized flow and video plans can guide the training of low-level control policies for robot execution. Experiments on diverse benchmarks demonstrate that FLIP improves both the success rate and quality of long-horizon video plan synthesis and exhibits the interactive world model property, opening up wider applications for future work.
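
To make the search procedure above concrete, the following Python sketch shows how the three modules could interact in a simple beam-search planning loop. This is a minimal illustration under assumed interfaces: the objects (action_module, dynamics_module, value_module) and their methods (sample, rollout, score) are hypothetical placeholders, not the released API, and the actual search strategy in FLIP may differ.

```python
def plan(action_module, dynamics_module, value_module,
         init_image, language_goal,
         horizon=10, num_candidates=8, beam_width=3, gamma=0.99):
    # Each beam entry: (discounted return, frames of the video plan so far, flow plan so far).
    beams = [(0.0, [init_image], [])]
    for t in range(horizon):
        expanded = []
        for ret, frames, flows in beams:
            # 1) Action module: propose candidate image flows from the latest frame.
            candidate_flows = action_module.sample(frames[-1], language_goal,
                                                   num_samples=num_candidates)
            for flow in candidate_flows:
                # 2) Dynamics module: generate the next video clip conditioned on the flow.
                clip = dynamics_module.rollout(frames[-1], flow, language_goal)
                # 3) Value module: score the new clip against the language goal.
                reward = value_module.score(clip, language_goal)
                expanded.append((ret + gamma ** t * reward,
                                 frames + list(clip), flows + [flow]))
        # Keep only the highest-return partial plans (beam search).
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    # Return the best plan found: (discounted return, video plan, flow plan).
    return max(beams, key=lambda b: b[0])
```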


Three Modules of FLIP

Action, Dynamics, and Value modules of FLIP.
Left: the tokenizing process of different modalities in training data.
Middle Left: we use a Conditional VAE to generate flows as actions. It separately generates the delta scale and directions at each query point for flow reconstruction.
Middle Right: we use a DiT model with a spatial-temporal attention mechanism for flow-conditioned video generation. Flows (and the observation history) are conditioned with cross-attention, while the language and timestep are conditioned with AdaLN-zero.
Right: the value module of FLIP. We follow the idea of LIV and use time-contrastive learning for the visual-language representation, but we treat each video clip (rather than each frame) as a state. The fine-tuned value curves of LIV and ours are shown at the bottom. Minimal code sketches of the three modules are given below.
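
For the action module, the output parameterization described above (a per-query-point delta scale and direction rather than raw displacements) can be illustrated as follows. This is a sketch under assumed tensor shapes and names (query_points, delta_scale, delta_dir); it is not the paper's code.

```python
import torch

def reconstruct_flow(query_points, delta_scale, delta_dir):
    """Illustrative flow reconstruction from CVAE decoder outputs (an assumption).
    query_points: (N, 2)    initial pixel coordinates of the tracked points
    delta_scale:  (T, N, 1) predicted displacement magnitudes per step
    delta_dir:    (T, N, 2) predicted displacement directions per step
    returns:      (T+1, N, 2) point trajectories, i.e. the image flow
    """
    direction = torch.nn.functional.normalize(delta_dir, dim=-1)   # unit directions
    deltas = delta_scale * direction                                # (T, N, 2) per-step displacements
    traj = torch.cumsum(deltas, dim=0) + query_points.unsqueeze(0)  # accumulate from the start points
    return torch.cat([query_points.unsqueeze(0), traj], dim=0)
```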
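
For the dynamics module, the conditioning scheme (flow and observation-history tokens via cross-attention, language and diffusion timestep via AdaLN-zero) follows the standard DiT recipe; a minimal block in that style might look like the sketch below. All module names and dimensions are illustrative assumptions, and the actual architecture (e.g., the spatial-temporal attention factorization) is more involved.

```python
import torch
import torch.nn as nn

class FlowConditionedDiTBlock(nn.Module):
    """Illustrative DiT-style block (an assumption, not FLIP's exact architecture):
    video tokens self-attend, cross-attend to flow/history tokens, and are modulated
    by AdaLN-zero parameters predicted from the language + timestep embedding."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # AdaLN-zero: shift/scale/gate predicted from the conditioning vector and
        # zero-initialized so that every block starts as an identity mapping.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, video_tokens, flow_tokens, cond):
        # video_tokens: (B, L, dim); flow_tokens: (B, M, dim); cond: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(video_tokens) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        video_tokens = video_tokens + gate1.unsqueeze(1) * self.self_attn(h, h, h)[0]
        h = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, flow_tokens, flow_tokens)[0]
        h = self.norm3(video_tokens) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        video_tokens = video_tokens + gate2.unsqueeze(1) * self.mlp(h)
        return video_tokens
```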
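
For the value module, the clip-level time-contrastive idea (treating each video clip, rather than each frame, as a state) can be illustrated with the simplified ranking-style objective below: in a successful video, later clips should score closer to the language goal than earlier ones. This is a stand-in for illustration only, not the exact LIV-style objective used for training.

```python
import torch
import torch.nn.functional as F

def clip_time_contrastive_loss(clip_emb, lang_emb, temperature=0.1):
    """Simplified clip-level time-contrastive loss (illustrative assumption).
    clip_emb: (T, D) L2-normalized embeddings of T consecutive clips from one video
    lang_emb: (D,)   L2-normalized embedding of the language goal
    Encourages sim(clip_j, goal) > sim(clip_i, goal) whenever j > i."""
    sims = clip_emb @ lang_emb / temperature  # (T,) similarity of each clip to the goal
    T = clip_emb.shape[0]
    losses = []
    for i in range(T):
        for j in range(i + 1, T):
            # Pairwise logistic ranking loss: penalize earlier clips scoring higher.
            losses.append(F.softplus(sims[i] - sims[j]))
    return torch.stack(losses).mean()
```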

LIBERO-LONG Planning Results

We train FLIP on LIBERO-LONG, a suite of long-horizon tabletop manipulation tasks with 50 demonstrations for each task.
The flows, videos, and value curves are all generated by FLIP.

Real World Policy Results: Tea Scooping


Real World Policy Results: Cloth Unfolding


Interactive World Model Results

We manually specify image flows for several LIBERO-LONG tasks to demonstrate the interactive property of FLIP.
Note that these flows are different from the flows in the training dataset.

Zero-Shot Results

We show the zero-shot results of FLIP trained on LIBERO-90 and tested on LIBERO-LONG.

Resisting Visual Distractions

We show how FLIP performs when we add visual distractions to the initial image.

Failure Cases

We show the failure cases of FLIP. These are caused by error accumulation during planning.

Aloha Real Results


Aloha Sim Results


FMB Benchmark Results


Cube Results


Egg Peeling Results


Folding Results


Unfolding Results


Pen Spin Results


Tying Plastic Bag Results


Fruit Peel Results


BibTeX

BibTeX content TODO