Integrating multimodal inputs such as vision, touch, and language to predict action outputs

Enabling a complete closed-loop capability that directly maps multimodal perceptual inputs to robotic control actions

Enhancing the reasoning and task generalization abilities of robots in complex scenarios
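The closed-loop idea above — perceiving, acting, and perceiving again until the task completes — can be sketched minimally. Everything here (the sensor model, the proportional placeholder policy, the dynamics) is an illustrative assumption, not the document's actual system:

```python
import numpy as np

# Hypothetical closed-loop sketch: perception -> policy -> action ->
# environment, repeated over a fixed task horizon. All components are
# stand-ins for illustration only.

rng = np.random.default_rng(1)

def perceive(state):
    # Stand-in sensor model: observation is the state plus small noise.
    return state + 0.01 * rng.standard_normal(state.shape)

def policy(obs, target):
    # Placeholder policy: proportional step toward a target.
    return 0.5 * (target - obs)

def step(state, action):
    # Stand-in dynamics: the action displaces the state directly.
    return state + action

state = np.zeros(3)
target = np.ones(3)
for _ in range(20):                 # fixed task horizon
    obs = perceive(state)           # perception (multimodality abstracted away)
    action = policy(obs, target)    # action output
    state = step(state, action)     # environment responds; loop closes

print(np.allclose(state, target, atol=0.05))
```

The point of the sketch is the loop structure: control actions are computed from fresh observations at every step, so perception errors are corrected rather than accumulated.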

Tri-modality: understanding the rules of physics

Proprioception: enables precise sensing of the robot's position and orientation

Tactile: captures interaction forces and contact events with objects

Vision: provides spatial information about target objects and the surrounding environment

A multimodal end-to-end model empowers robots with multi-dimensional perception and precise control, enabling autonomous execution of complex physical-interaction tasks
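The tri-modal fusion described above can be sketched as a small early-fusion policy network: each modality is embedded, the embeddings are concatenated, and a head maps the fused features to an action. All layer sizes, feature dimensions, and the 7-DoF action space are illustrative assumptions, not the document's architecture:

```python
import numpy as np

# Hypothetical early-fusion policy over vision, tactile, and
# proprioception inputs. Dimensions are assumptions for illustration.

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

# Per-modality encoders project raw features to a shared embedding width.
DIMS = {"vision": 512, "tactile": 64, "proprio": 16}  # assumed input sizes
EMBED = 32
enc = {m: (rng.standard_normal((d, EMBED)) * 0.01, np.zeros(EMBED))
       for m, d in DIMS.items()}

# Policy head: fused embedding -> assumed 7-DoF action (arm + gripper).
ACTION_DIM = 7
w_out = rng.standard_normal((EMBED * 3, ACTION_DIM)) * 0.01
b_out = np.zeros(ACTION_DIM)

def policy(vision, tactile, proprio):
    feats = {"vision": vision, "tactile": tactile, "proprio": proprio}
    embeds = [relu(linear(feats[m], *enc[m]))
              for m in ("vision", "tactile", "proprio")]
    fused = np.concatenate(embeds)      # early fusion by concatenation
    return linear(fused, w_out, b_out)  # action output

action = policy(rng.standard_normal(512),
                rng.standard_normal(64),
                rng.standard_normal(16))
print(action.shape)  # (7,)
```

Concatenation is the simplest fusion choice; real systems often use cross-attention or token-level fusion instead, but the input/output contract — three modality streams in, one action vector out — is the same.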