VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning

IROS 2026
¹IIIS, Tsinghua University, ²Galaxea AI

*Equal Contribution

Abstract

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D–3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a volumetric representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, avoiding the lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by 14.8%. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds.

Method Overview

VolumeDP first lifts image features into a 3D Volumetric Representation through Cross-Attention. A Spatial Token Generation module then learns to select task-relevant voxels, such as regions around the end-effector and manipulated objects, and compresses them into compact spatial tokens. Finally, a Multi-Token Decoder conditions on the full set of spatial, language, and proprioceptive tokens to generate coherent action sequences.
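The lifting step can be illustrated with a minimal NumPy sketch of single-head cross-attention, in which each voxel holds a query vector that attends over all image patch features. This is an illustration under stated assumptions, not the released implementation: the function name `lift_to_volume`, the 4×4×4 grid, the feature width, the patch count, and the random projection matrices are all hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_to_volume(voxel_queries, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention from voxels (queries) to image patches (keys/values).

    voxel_queries: (V, d) one query vector per voxel
    image_feats:   (P, d) patch features from all cameras, flattened
    Returns voxel features (V, d) and the attention map (V, P).
    """
    q = voxel_queries @ w_q
    k = image_feats @ w_k
    v = image_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn

# All sizes below are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)
d = 16                      # feature width
grid = (4, 4, 4)            # hypothetical voxel grid resolution
V = int(np.prod(grid))      # number of voxels
P = 2 * 49                  # e.g. two cameras with 7x7 patches each

voxel_queries = rng.normal(size=(V, d))
image_feats = rng.normal(size=(P, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

volume, attn = lift_to_volume(voxel_queries, image_feats, w_q, w_k, w_v)
volume_grid = volume.reshape(*grid, d)   # features arranged on the 3D grid
```

Each row of `attn` is a distribution over image patches, so every voxel feature is a weighted mixture of 2D features. In a trained model the queries and projections would be learned, and the queries would typically encode each voxel's 3D position.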


Real-World Experiment

The real-world experiments are conducted on the Galaxea R1 Lite robot, which has two 6-DoF arms with 1-DoF grippers. Perception is provided by an RGB head camera and an RGB wrist camera. Our benchmark comprises four tasks: placing a bowl, operating a microwave, opening a door, and threading a nut onto a screw. Compared with Diffusion Policy, VolumeDP improves the average success rate from 57.5% to 76.3%.

Placing bowl

Microwave operation

Door opening

Nut onto screw


Performance on Simulation Environments

LIBERO

LIBERO Results

ManiSkill

ManiSkill Results

LIBERO-Plus

LIBERO-Plus Results

The Spatial Token Generation module is designed to suppress task-irrelevant information and retain action-critical cues in the volumetric representation. Qualitatively, the video below shows that the learned weights concentrate on the end-effector and the manipulated object, and even localize the intended grasp contact region, indicating task-relevant spatial focus.
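One simple way to picture this selection is to score every voxel, normalize the scores into weights, and keep only the top-k voxels as spatial tokens. The sketch below assumes a linear scoring vector and a hard top-k gather; the actual module may score and compress voxels differently, and the names `select_spatial_tokens`, `w_score`, and `k` are hypothetical.

```python
import numpy as np

def select_spatial_tokens(volume, w_score, k):
    """Keep the k highest-weighted voxels as spatial tokens.

    volume:  (V, d) voxel features
    w_score: (d,)   scoring vector (learned in practice, random in this sketch)
    Returns tokens (k, d), the selected voxel indices (k,), and weights (V,).
    """
    logits = volume @ w_score
    weights = np.exp(logits - logits.max())   # stable softmax over voxels
    weights /= weights.sum()
    idx = np.argsort(weights)[::-1][:k]       # indices of the k largest weights
    return volume[idx], idx, weights

# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
V, d, k = 64, 16, 8
volume = rng.normal(size=(V, d))
w_score = rng.normal(size=d)

tokens, idx, weights = select_spatial_tokens(volume, w_score, k)
```

Visualizing `weights` over the voxel grid is the kind of view in which a concentration on the end-effector and the manipulated object would be visible. Passing all `k` tokens to the decoder, rather than averaging them into one vector, matches the multi-token conditioning described in the method overview.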

BibTeX

@article{zhou2026volumedp,
  title={VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning},
  author={Zhou, Tianxing and Xue, Feiyang and Ye, Zhangchen and Yuan, Tianyuan and Zhao, Hang and Jiang, Tao},
  journal={arXiv preprint arXiv:2603.17720},
  year={2026}
}