Combines vision, tactile, language, and other multimodal inputs to predict robotic actions

Closes the loop from multimodal perception to robot control

Enhances the robot's reasoning and generalization abilities in complex scenarios

Tri-modality
Understanding the Laws of Physics
Proprioception
Enables precise sensing of the robot body's position and orientation
Tactile
Captures interaction forces and contact events with objects
Vision
Provides spatial information about target objects and the surrounding environment

A multimodal end-to-end model empowers robots with multi-dimensional perception and precise control, enabling autonomous execution of complex physical interaction tasks
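The tri-modal perception-to-action pipeline above can be sketched as follows. This is a minimal illustrative example, not the actual model: all dimensions, weights, and function names (`predict_action`, `D_VIS`, etc.) are hypothetical, and randomly initialized linear projections stand in for trained per-modality encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes for illustration only:
# vision, tactile, and proprioception inputs; shared embedding; 7-DoF action.
D_VIS, D_TAC, D_PROP, D_EMB, D_ACT = 64, 16, 8, 32, 7

def linear(x, w, b):
    return x @ w + b

# Random weights stand in for trained encoder and action-head parameters.
W = {k: rng.standard_normal((d, D_EMB)) * 0.01
     for k, d in [("vis", D_VIS), ("tac", D_TAC), ("prop", D_PROP)]}
b = {k: np.zeros(D_EMB) for k in W}
W_act = rng.standard_normal((3 * D_EMB, D_ACT)) * 0.01
b_act = np.zeros(D_ACT)

def predict_action(vision, tactile, proprio):
    """Embed each modality, fuse by concatenation, map to a robot action."""
    z = np.concatenate([
        np.tanh(linear(vision, W["vis"], b["vis"])),    # spatial context
        np.tanh(linear(tactile, W["tac"], b["tac"])),   # contact/force events
        np.tanh(linear(proprio, W["prop"], b["prop"])), # body pose state
    ])
    return linear(z, W_act, b_act)  # end-to-end: perception -> action

action = predict_action(rng.standard_normal(D_VIS),
                        rng.standard_normal(D_TAC),
                        rng.standard_normal(D_PROP))
print(action.shape)  # (7,)
```

Concatenation is the simplest fusion choice; real systems typically use cross-attention or token-level fusion in a transformer backbone, but the closed loop from multimodal perception to control commands has the same shape.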