Can you share the following details of a vision-language-action model: text encoder, image encoder, robot arm, camera, number of tasks, number of trajectories, number of video clips, pretraining policy, fine-tuning method, model sizes,
and any other relevant details?