Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)
RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective
Chenxi Wang · Hongjie Fang · Hao-Shu Fang · Cewu Lu
Precise robot manipulation requires rich spatial information in imitation learning, which remains a challenge for both 2D- and 3D-based policies. To tackle this problem, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. RISE first compresses the point cloud into tokens with a sparse 3D encoder. After sparse positional encoding is added, the tokens are featurized by a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations for each real-world task, RISE surpasses current representative 2D and 3D policies by a large margin, showing significant advantages in both accuracy and efficiency. Project website: rise-policy.github.io.
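As a rough illustration of the first two stages described in the abstract (compressing a point cloud into sparse tokens, then attaching a positional encoding derived from each token's 3D location), the following is a minimal sketch and not the RISE implementation. It makes several assumptions: voxel-grid average pooling stands in for the learned sparse 3D encoder, the encoding is a standard sinusoidal code over integer voxel indices, and it is concatenated to the toy feature rather than added. The names `voxelize`, `sinusoidal_encoding`, and `tokenize`, along with the default voxel size, are hypothetical.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group xyz points by voxel index; return one centroid per occupied voxel.

    Only occupied voxels are stored, so the result is sparse by construction.
    This is a stand-in for a learned sparse 3D encoder (assumption).
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))
    centroids = {}
    for key, pts in buckets.items():
        n = len(pts)
        centroids[key] = tuple(sum(p[i] for p in pts) / n for i in range(3))
    return centroids


def sinusoidal_encoding(coord, dim_per_axis=4, base=10000.0):
    """Sinusoidal code for an integer voxel coordinate (dim_per_axis must be even)."""
    code = []
    for c in coord:
        for k in range(dim_per_axis // 2):
            freq = 1.0 / (base ** (2 * k / dim_per_axis))
            code.append(math.sin(c * freq))
            code.append(math.cos(c * freq))
    return code


def tokenize(points, voxel_size=0.05):
    """Return a list of (voxel_index, feature) pairs, sorted by voxel index.

    Each feature is the voxel centroid followed by its positional code
    (concatenated here for readability; an assumption of this sketch).
    """
    centroids = voxelize(points, voxel_size)
    return [
        (key, list(centroid) + sinusoidal_encoding(key))
        for key, centroid in sorted(centroids.items())
    ]
```

Calling `tokenize` on a list of `(x, y, z)` tuples yields one token per occupied voxel. In a full policy, such tokens would then pass through a transformer and an action decoder, which this sketch does not attempt to model.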