
Learning Vision-Guided Quadrupedal Locomotion with Cross-Modal Transformers
Ruihan Yang · Minghao Zhang · Nicklas Hansen · Harry (Huazhe) Xu · Xiaolong Wang

We propose to solve quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs and that Transformer-based models further improve generalization across environments. Our project page with videos is at https://LocoTransformer.github.io/.
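The cross-modal fusion described in the abstract (proprioceptive states and depth observations combined as a single token sequence in a shared Transformer, with the fused features feeding a policy) can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the authors' implementation: the network sizes, the 93-dimensional proprioceptive state, the 12-dimensional action output, and all module names are assumptions made for the example.

```python
import torch
import torch.nn as nn

class LocoTransformerSketch(nn.Module):
    """Hypothetical sketch: depth-image patches and a projected
    proprioceptive state form one token sequence, fused by a shared
    Transformer encoder. All dimensions are illustrative."""

    def __init__(self, proprio_dim=93, depth_channels=1, embed_dim=128,
                 num_layers=2, num_heads=4, action_dim=12):
        super().__init__()
        # Depth image -> spatial feature map -> sequence of visual tokens.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(depth_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=4, stride=2), nn.ReLU(),
        )
        # Proprioceptive state -> a single token in the shared embedding space.
        self.proprio_encoder = nn.Linear(proprio_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.policy_head = nn.Linear(embed_dim, action_dim)

    def forward(self, depth, proprio):
        feat = self.visual_encoder(depth)                # (B, E, H', W')
        visual_tokens = feat.flatten(2).transpose(1, 2)  # (B, H'*W', E)
        proprio_token = self.proprio_encoder(proprio).unsqueeze(1)  # (B, 1, E)
        # Self-attention mixes information across both modalities.
        tokens = torch.cat([proprio_token, visual_tokens], dim=1)
        fused = self.transformer(tokens)
        # Pool over all tokens, then predict an action (e.g. joint targets).
        return self.policy_head(fused.mean(dim=1))

model = LocoTransformerSketch()
depth = torch.randn(2, 1, 64, 64)   # batch of single-channel depth images
proprio = torch.randn(2, 93)        # batch of proprioceptive states
action = model(depth, proprio)
print(action.shape)                 # torch.Size([2, 12])
```

The design point this sketch captures is that, unlike late-fusion concatenation of separately encoded features, every visual token can attend to the proprioceptive token (and vice versa) at every layer, which is the kind of cross-modal interaction the paper's Transformer fusion is built around.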

Author Information

Ruihan Yang (UC San Diego)
Minghao Zhang (Tsinghua University)
Nicklas Hansen (University of California, San Diego)
Harry (Huazhe) Xu (UC Berkeley)

I am a second-year PhD student at UC Berkeley working on RL and vision under Prof. Trevor Darrell. I also actively collaborate with Prof. Sergey Levine and Prof. Tengyu Ma.

Xiaolong Wang (UCSD)

Our group has broad interests in Computer Vision, Machine Learning, and Robotics. Our focus is on learning 3D and dynamics representations from videos and physical robotic interaction data. We explore various sources of supervision signals: the data itself, language, and common sense knowledge. We leverage these comprehensive representations to facilitate the learning of robot skills, with the goal of enabling robots to generalize and interact effectively with a wide range of objects and environments in the real physical world. Please check out our individual research topics: Self-Supervised Learning, Video Understanding, Common Sense Reasoning, RL and Robotics, 3D Interaction, and Dexterous Hand.
