We aim to develop a model-based planning framework built on world models that scales with increasing model and data budgets and handles general-purpose manipulation tasks with only language and vision inputs.
To this end, we present FLow centrIc generative Planning (FLIP), a model-based planning algorithm in visual space that features three key modules: (1) a multi-modal flow generation model as the general-purpose action proposal module; (2) a flow-conditioned video generation model as the dynamics module; and (3) a vision-language representation learning model as the value module. Given an initial image and a language instruction as the goal, FLIP progressively searches for long-horizon flow and video plans that maximize the discounted return to accomplish the task. With image flows as a general action representation, FLIP can synthesize long-horizon plans across objects, robots, and tasks, and the dense flow information also provides rich guidance for long-horizon video generation.
In addition, the synthesized flow and video plans can guide the training of low-level control
policies for robot execution.
Experiments on diverse benchmarks demonstrate that FLIP improves both the success rate and quality of long-horizon video plan synthesis and behaves as an interactive world model, opening up wider applications for future work.
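To make the interaction of the three modules concrete, here is a minimal sketch of a beam search over flow and video plans that maximizes discounted return. The interfaces `propose_flows`, `rollout_video`, and `value_fn` are hypothetical stand-ins for FLIP's three learned modules, not its actual API, and the hyperparameters are illustrative.

```python
# Sketch of a FLIP-style beam search over flow and video plans.
# propose_flows, rollout_video, and value_fn are hypothetical placeholders
# for the action proposal, dynamics, and value modules, respectively.

def plan(init_image, instruction, propose_flows, rollout_video, value_fn,
         horizon=8, num_candidates=16, beam_width=4, gamma=0.99):
    """Search for a long-horizon flow/video plan maximizing discounted return.

    propose_flows(image, instruction, k) -> k candidate image flows
    rollout_video(image, flow)           -> predicted next frame
    value_fn(image, instruction)         -> scalar task-progress estimate
    """
    # Each beam entry: (discounted return so far, current frame, plan so far).
    beams = [(0.0, init_image, [])]
    for t in range(horizon):
        candidates = []
        for ret, frame, plan_so_far in beams:
            for flow in propose_flows(frame, instruction, num_candidates):
                next_frame = rollout_video(frame, flow)      # dynamics module
                reward = value_fn(next_frame, instruction)   # value module
                candidates.append((ret + (gamma ** t) * reward,
                                   next_frame,
                                   plan_so_far + [(flow, next_frame)]))
        # Keep only the top-scoring partial plans.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    best_return, _, best_plan = max(beams, key=lambda b: b[0])
    return best_plan, best_return
```

The returned plan is a sequence of (flow, frame) pairs that can then serve as conditioning for a low-level policy, as described below.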
We train FLIP on LIBERO-LONG, a suite of long-horizon tabletop manipulation tasks, using 50 demonstrations per task.
The flows, videos, and value curves are all generated by FLIP.
We use the flow and video plans from FLIP to train a low-level conditional diffusion policy on the LIBERO-LONG tasks, achieving better results than previous methods. Among our variants, Ours-FV performs best, showing the advantage of conditioning on both flows and videos.
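As a loose sketch of how an "Ours-FV" variant might fuse the two plan modalities into a single conditioning signal, the module below encodes a set of point flows and a target video frame and concatenates them. The encoder choices, dimensions, and fusion scheme are assumptions for illustration, not the exact published architecture.

```python
import torch
import torch.nn as nn

class FlowVideoCondition(nn.Module):
    """Fuse flow-plan and video-plan features into one conditioning vector.

    Hypothetical sketch: encoders, dimensions, and fusion are illustrative
    assumptions, not FLIP's published architecture.
    """
    def __init__(self, flow_dim=128, frame_dim=256, cond_dim=256):
        super().__init__()
        # Encode (B, N, 2) per-point 2D flow displacements with a small MLP.
        self.flow_enc = nn.Sequential(nn.Linear(2, flow_dim), nn.ReLU(),
                                      nn.Linear(flow_dim, flow_dim))
        # Encode a (B, 3, 64, 64) target frame with a tiny CNN.
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, frame_dim))
        self.fuse = nn.Linear(flow_dim + frame_dim, cond_dim)

    def forward(self, flows, frame):
        f = self.flow_enc(flows).mean(dim=1)   # (B, flow_dim), pooled over points
        v = self.frame_enc(frame)              # (B, frame_dim)
        return self.fuse(torch.cat([f, v], dim=-1))  # (B, cond_dim)
```

In such a setup, the resulting vector would condition each denoising step of the diffusion policy, for example via FiLM modulation or cross-attention.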
We manually specify image flows for several LIBERO-LONG tasks to demonstrate the interactive property of FLIP. Note that these flows differ from the flows in the training dataset.
We show zero-shot results of FLIP trained on LIBERO-90 and tested on LIBERO-LONG.
We show how FLIP performs when we add visual distractions to the initial image.
We show failure cases of FLIP, which are caused by error accumulation during planning.
BibTeX content TODO