Integrate visual, tactile, linguistic, and other multimodal inputs to predict action outputs

Realize a complete closed loop that maps multimodal perception inputs directly to robot control actions

Improve robots' reasoning and task-generalization ability in complex scenes
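The closed loop described above (perception in, control action out, repeated every step) can be sketched minimally. This is an illustrative toy, not any particular model: the `sense` function, feature sizes, and the randomly initialized linear policy are all hypothetical stand-ins for real sensors and a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)

def sense():
    """Hypothetical sensors: vision, tactile, and proprioception features."""
    return rng.normal(size=8), rng.normal(size=4), rng.normal(size=3)

# Random weights stand in for a trained policy: 8+4+3 fused features -> 2-D action.
W = rng.normal(0, 0.1, (15, 2))

def act(vision, tactile, proprio):
    """Map fused perception directly to a bounded control action."""
    return np.tanh(np.concatenate([vision, tactile, proprio]) @ W)

# Closed loop: each step, perception feeds directly into control.
for step in range(3):
    v, t, p = sense()
    a = act(v, t, p)
    # In a real system, `a` would be sent to the robot's actuators here.
    print(step, a.round(3))
```

The key property is that there is no hand-written intermediate representation between sensing and acting: the mapping is a single learned function evaluated inside the control loop.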

Tri-modality
Understanding the Rules of Physics
Proprioception
Enables precise sensing of the robot body's position and orientation
Tactile
Captures interaction forces and contact events with objects
Vision
Provides spatial information about target objects and the surrounding environment
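One common way to combine the three modalities is early fusion: encode each modality into a feature vector, concatenate, and pass the result through a shared network that predicts the action. The sketch below assumes hypothetical feature dimensions and uses randomly initialized weights in place of a trained policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions for each modality (illustrative only).
VISION_DIM, TACTILE_DIM, PROPRIO_DIM = 64, 16, 12
ACTION_DIM = 7  # e.g. a command for a 7-DoF arm

# Randomly initialized weights stand in for a trained two-layer policy network.
W1 = rng.normal(0, 0.1, (VISION_DIM + TACTILE_DIM + PROPRIO_DIM, 32))
W2 = rng.normal(0, 0.1, (32, ACTION_DIM))

def policy(vision, tactile, proprio):
    """Fuse the three modalities by concatenation and map to an action."""
    x = np.concatenate([vision, tactile, proprio])  # early fusion
    h = np.tanh(x @ W1)                             # shared hidden layer
    return np.tanh(h @ W2)                          # bounded action output

action = policy(rng.normal(size=VISION_DIM),
                rng.normal(size=TACTILE_DIM),
                rng.normal(size=PROPRIO_DIM))
print(action.shape)  # (7,)
```

Concatenation is the simplest fusion scheme; real multimodal models often use attention-based fusion instead, but the interface is the same: three modality streams in, one action vector out.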

A multimodal end-to-end model empowers robots with multi-dimensional perception and precise control, enabling autonomous execution of complex physical interaction tasks