Open-TeleVision provides real-time stereoscopic video streaming from the robot's perspective to a VR headset. It can also control multi-fingered robot hands, enabling more complex and precise manipulation.
A stereo camera on the robot's head streams 3D video to the VR headset.
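These notes don't specify the transport, but left/right eye images are typically packed side by side before streaming to the headset. A minimal sketch of that packing step, assuming hypothetical 720p frames:

```python
import numpy as np

def pack_stereo_frame(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Concatenate the left/right eye images side by side for the headset."""
    assert left.shape == right.shape, "eye images must match in size"
    return np.concatenate([left, right], axis=1)

# Hypothetical 720p RGB frames from the robot's head-mounted stereo camera.
left = np.zeros((720, 1280, 3), dtype=np.uint8)
right = np.zeros((720, 1280, 3), dtype=np.uint8)
frame = pack_stereo_frame(left, right)
print(frame.shape)  # (720, 2560, 3)
```

The packed frame would then be compressed and sent over the network each cycle; the headset splits it back into per-eye views.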
When the VR headset user moves, their head and hand poses change. A central server handles communication between the VR device and the robot, processing poses and video streams at 60 Hz, and retargets these human poses to the robot's joint positions.
The retargeted joint positions are sent directly to the robot for real-time control.
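A highly simplified sketch of the retarget-and-send step. The real system runs an inverse-kinematics solver on the tracked wrist/hand poses; here only the final clamping/safety step before commanding the robot is shown, with an illustrative 7-DoF arm and made-up joint limits:

```python
import numpy as np

# Illustrative joint limits for a hypothetical 7-DoF arm (radians).
JOINT_LOW = np.full(7, -2.5)
JOINT_HIGH = np.full(7, 2.5)

def retarget(ik_solution: np.ndarray) -> np.ndarray:
    """Clamp an IK solution for the tracked human pose to the robot's joint limits.

    In the real pipeline, an inverse-kinematics solver first converts the
    human wrist/hand pose into candidate joint angles; this shows only the
    limit-clamping applied before the command is sent to the robot.
    """
    return np.clip(ik_solution, JOINT_LOW, JOINT_HIGH)

raw = np.array([0.1, 3.0, -0.4, -2.9, 0.0, 1.2, 2.6])
cmd = retarget(raw)
print(cmd)  # out-of-range joints 3.0, -2.9, 2.6 are clamped to +/-2.5
```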
During demonstrations, the system records:
a) the stereo video from the robot's perspective
b) the joint positions of the robot (which correspond to the retargeted human poses)
This recorded data becomes the "Demonstration Dataset" shown in the right part of the image.
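Each recorded step pairs the observation with the action label. A minimal sketch of the episode buffer, assuming a simple list-of-dicts layout (field names are hypothetical):

```python
import time

def record_step(episode, stereo_frame_id, joint_positions):
    """Append one synchronized (observation, action) pair to the episode."""
    episode.append({
        "frame": stereo_frame_id,         # reference to the stored stereo image
        "joints": list(joint_positions),  # retargeted joint positions (action label)
        "t": time.time(),                 # capture timestamp
    })

episode = []
for step in range(3):                     # in practice this runs at 60 Hz
    record_step(episode, stereo_frame_id=step, joint_positions=[0.0] * 7)
print(len(episode))  # 3
```

A full demonstration dataset is then a collection of such episodes, one per teleoperated task attempt.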
Training the Imitation Policy:
The demonstration dataset is used to train a policy that maps the robot's observations (stereo images and joint positions) to actions. Once trained, the policy can be deployed on the robot.
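"Training" here is ordinary supervised learning on the demonstrations: regress the recorded actions from the recorded observations. A toy behavior-cloning sketch with a linear model standing in for the real network (all data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy behavior cloning: learn a linear map from observation features to actions.
# In the real system the observation is a stereo image plus proprioception and
# the model is a neural network; this only illustrates the supervised objective.
obs = rng.normal(size=(256, 16))       # demonstration observations (features)
true_W = rng.normal(size=(16, 7))
actions = obs @ true_W                 # demonstrated joint-position targets

W = np.zeros((16, 7))
for _ in range(200):
    pred = obs @ W
    grad = obs.T @ (pred - actions) / len(obs)  # gradient of the MSE loss
    W -= 0.1 * grad                             # gradient-descent step

mse = float(np.mean((obs @ W - actions) ** 2))
print(mse)  # near zero: the policy has fit the demonstrations
```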
During deployment, the robot uses its own stereo camera and joint position data as input to the trained model, which then outputs the actions for the robot to perform.
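The deployment loop can be sketched as follows. `DummyPolicy` is a stand-in for the trained model, and the shapes are illustrative:

```python
import numpy as np

class DummyPolicy:
    """Stand-in for the trained imitation policy (a neural network in practice)."""
    def __call__(self, stereo_image, joint_positions):
        # A real policy would run a forward pass on the image and joints;
        # here we echo the current joints so the loop structure is runnable.
        return np.asarray(joint_positions)

def control_step(policy, camera_frame, joints):
    """One cycle: observe, run the policy, return joint targets for the robot."""
    action = policy(camera_frame, joints)
    return action

frame = np.zeros((720, 2560, 3), dtype=np.uint8)  # packed stereo observation
joints = np.zeros(7)                              # current joint positions
action = control_step(DummyPolicy(), frame, joints)
print(action.shape)  # (7,)
```

On the robot this step repeats continuously, with each action sent to the joint controllers.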
One open question is how the teleoperator can feel the objects they are manipulating, to get a more intuitive sense of the interaction.