Mobile Aloha is a whole-body teleoperation data collection system developed by Zipeng Fu, Tony Z. Zhao, and Chelsea Finn at Stanford University.
Based on Mobile Aloha, AgileX developed Cobot Magic, which can run the complete Mobile Aloha code with a higher hardware configuration and at lower cost, and is equipped with larger-payload robotic arms and a high-computing-power industrial computer. For more details about Cobot Magic, please check the AgileX website.
Currently, AgileX has successfully completed the integration of Cobot Magic based on the Mobile Aloha source code project.
After setting up the Mobile Aloha software environment (described in the previous section), models can be trained in both the simulation environment and the real environment. The following covers data collection in the simulation environment. The data is provided by the team of Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. You can find all scripted/human demos for the simulated environments here.
After downloading, copy it to the act-plus-plus/data directory. The directory structure is as follows:
act-plus-plus/data
├── sim_insertion_human
│   ├── sim_insertion_human-20240110T054847Z-001.zip
│   ├── ...
├── sim_insertion_scripted
│   ├── sim_insertion_scripted-20240110T054854Z-001.zip
│   ├── ...
├── sim_transfer_cube_human
│   ├── sim_transfer_cube_human-20240110T054900Z-001.zip
│   ├── ...
└── sim_transfer_cube_scripted
    ├── sim_transfer_cube_scripted-20240110T054901Z-001.zip
    ├── ...
Generate episodes and render the results. In the example run below, the terminal reports 10 episodes, 2 of which are successful.
# 1 Run
python3 record_sim_episodes.py --task_name sim_transfer_cube_scripted --dataset_dir <data save dir> --num_episodes 50
# 2 Take sim_transfer_cube_scripted as an example
python3 record_sim_episodes.py --task_name sim_transfer_cube_scripted --dataset_dir data/sim_transfer_cube_scripted --num_episodes 10
# 2.1 Real-time rendering
python3 record_sim_episodes.py --task_name sim_transfer_cube_scripted --dataset_dir data/sim_transfer_cube_scripted --num_episodes 10 --onscreen_render
# 2.2 The output in the terminal shows
episode_idx=0
Rollout out EE space scripted policy
episode_idx=0 Failed
Replaying joint commands
episode_idx=0 Failed
Saving: 0.9 secs
episode_idx=1
Rollout out EE space scripted policy
episode_idx=1 Successful, episode_return=57
Replaying joint commands
episode_idx=1 Successful, episode_return=59
Saving: 0.6 secs
...
Saved to data/sim_transfer_cube_scripted
Success: 2 / 10
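The per-episode result lines in the log above can also be tallied programmatically. The following is a minimal sketch, assuming the log format shown above; the function name count_successes is hypothetical and is not part of record_sim_episodes.py. Because each episode prints one result for the EE-space rollout and one for the joint-command replay, the sketch assumes the last result line of each episode is the one that counts.

```python
import re

# Matches result lines such as "episode_idx=1 Successful, episode_return=59"
# or "episode_idx=0 Failed" (format copied from the log above).
RESULT_RE = re.compile(r"^episode_idx=(\d+) (Successful|Failed)")


def count_successes(log_text):
    """Return (num_successful, num_episodes), using the last result line per episode.

    Hypothetical helper for illustration only.
    """
    last_result = {}
    for line in log_text.splitlines():
        match = RESULT_RE.match(line.strip())
        if match:
            # A later line (the joint-command replay) overwrites the earlier one.
            last_result[int(match.group(1))] = match.group(2) == "Successful"
    return sum(last_result.values()), len(last_result)


sample_log = """episode_idx=0
Rollout out EE space scripted policy
episode_idx=0 Failed
Replaying joint commands
episode_idx=0 Failed
Saving: 0.9 secs
episode_idx=1
Rollout out EE space scripted policy
episode_idx=1 Successful, episode_return=57
Replaying joint commands
episode_idx=1 Successful, episode_return=59
Saving: 0.6 secs
"""

successes, total = count_successes(sample_log)
print(f"Success: {successes} / {total}")
```

Bare lines such as episode_idx=0 do not match the pattern, so only the lines that carry a result are counted.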
The loaded image renders as follows:
Visualize simulation data. The following figures show the images of episode0 and episode9 respectively.
The episode 0 frame from the data set is shown below; it is a case where the gripper fails to pick up the object.
The visualization of the episode 9 data shows a successful grasp.
Print the data of each joint of the robotic arms in the simulation environment. Joints 0-13 correspond to the 14 degrees of freedom of the robot arms and grippers.
The simulated-environment datasets must be downloaded first (see Data Collection).
python3 imitate_episodes.py --task_name sim_transfer_cube_scripted --ckpt_dir <ckpt dir> --policy_class ACT --kl_weight 10 --chunk_size 100 --hidden_dim 512 --batch_size 8 --dim_feedforward 3200 --num_epochs 2000 --lr 1e-5 --seed 0
# run
python3 imitate_episodes.py --task_name sim_transfer_cube_scripted --ckpt_dir trainings --policy_class ACT --kl_weight 1 --chunk_size 10 --hidden_dim 512 --batch_size 1 --dim_feedforward 3200 --lr 1e-5 --seed 0 --num_steps 2000
# During training, you will be prompted with the following content. If you do not have a W&B account, choose 3.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
After training is completed, the weights will be saved to the trainings directory. The results are as follows:
trainings
├── config.pkl
├── dataset_stats.pkl
├── policy_best.ckpt
├── policy_last.ckpt
└── policy_step_0_seed_0.ckpt
Evaluate the model trained above:
# 1 Evaluate the policy; add the --onscreen_render parameter for real-time rendering
python3 imitate_episodes.py --eval --task_name sim_transfer_cube_scripted --ckpt_dir trainings --policy_class ACT --kl_weight 1 --chunk_size 10 --hidden_dim 512 --batch_size 1 --dim_feedforward 3200 --lr 1e-5 --seed 0 --num_steps 20 --onscreen_render
The rendered result is then displayed.
1. Environment dependencies
1.1 ROS dependencies
● Default: an ubuntu20.04-noetic environment is assumed to be already configured
sudo apt install ros-$ROS_DISTRO-sensor-msgs ros-$ROS_DISTRO-nav-msgs ros-$ROS_DISTRO-cv-bridge
1.2 Python dependencies
# Enter the current workspace directory and install the dependencies listed in the requiredments.txt file (the file name as it appears in the repository).
pip install -r requiredments.txt
2. Data collection
2.1 Run ‘collect_data’
python collect_data.py -h # see parameters
python collect_data.py --max_timesteps 500 --episode_idx 0
python collect_data.py --max_timesteps 500 --is_compress --episode_idx 0
python collect_data.py --max_timesteps 500 --use_depth_image --episode_idx 1
python collect_data.py --max_timesteps 500 --is_compress --use_depth_image --episode_idx 1
After data collection is completed, the data is saved in the ${dataset_dir}/${task_name} directory.
python collect_data.py --max_timesteps 500 --is_compress --episode_idx 0
# This generates the dataset file episode_0.hdf5. The directory structure is:
collect_data
├── collect_data.py
├── data                      # --dataset_dir
│   └── cobot_magic_agilex    # --task_name
│       ├── episode_0.hdf5    # The location of the generated data set file
│       ├── episode_idx.hdf5  # idx depends on --episode_idx
│       └── ...
├── readme.md
├── replay_data.py
├── requiredments.txt
└── visualize_episodes.py
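The naming convention in the tree above can be expressed as a small helper. This is a sketch under the assumption that episodes are always stored as ${dataset_dir}/${task_name}/episode_${idx}.hdf5; the functions episode_path and next_free_episode_idx are hypothetical and are not part of collect_data.py.

```python
import os


def episode_path(dataset_dir, task_name, episode_idx):
    """Build the expected HDF5 path for one episode (layout taken from the tree above)."""
    return os.path.join(dataset_dir, task_name, f"episode_{episode_idx}.hdf5")


def next_free_episode_idx(dataset_dir, task_name):
    """Return the smallest episode index whose file does not exist yet."""
    idx = 0
    while os.path.exists(episode_path(dataset_dir, task_name, idx)):
        idx += 1
    return idx


print(episode_path("./data", "cobot_magic_agilex", 0))
```

A helper like next_free_episode_idx could be used to choose a value for --episode_idx that does not collide with a file already on disk.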
The specific parameters are shown:
The picture of data collection from the camera perspective is as follows:
Run the following code:
python visualize_episodes.py --dataset_dir ./data --task_name cobot_magic_agilex --episode_idx 0
Visualize the collected data. --dataset_dir, --task_name and --episode_idx need to be the same as when collecting the data. When you run the above code, the terminal prints the actions and a color image window is displayed. The visualization results are as follows:
After the operation is completed, the files episode_${idx}_qpos.png, episode_${idx}_base_action.png and episode_${idx}_video.mp4 are generated under ${dataset_dir}/${task_name}. The directory structure is as follows:
collect_data
├── data
│   ├── cobot_magic_agilex
│   │   └── episode_0.hdf5
│   ├── episode_0_base_action.png   # base_action
│   ├── episode_0_qpos.png          # qpos
│   └── episode_0_video.mp4         # Color video
Taking episode30 as an example, replay the collected episode30 data. The camera perspective is as follows:
The Mobile Aloha project studied different strategies for imitation learning and proposed ACT (Action Chunking with Transformers), a Transformer-based action chunking algorithm. It is essentially an end-to-end policy: it maps real-world RGB images directly to actions, so the robot learns to imitate from visual input without additional hand-engineered intermediate representations, and it predicts actions in chunks, which are combined into accurate and smooth motion trajectories.
The model is as follows:
The model can be broken down and interpreted as follows.
1. Sample data
Input: includes 4 RGB images, each image has a resolution of 480 × 640, and the joint positions of the two robot arms (7+7=14 DoF in total)
Output: The action space is the absolute joint positions of the two robots, a 14-dimensional vector. Therefore, with action chunking, the policy outputs a k × 14 tensor given the current observation (each action is defined as a 14-dimensional vector, so k actions are a k × 14 tensor)
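The shape of a chunked prediction can be illustrated with a toy rollout. This is a minimal sketch, not the project's inference loop: dummy_policy is a hypothetical stand-in that returns zeros, and the sketch assumes each predicted chunk is executed to the end before the policy is queried again, which is a simplification.

```python
STATE_DIM = 14  # 7 + 7 joint positions for the two arms, as stated above


def dummy_policy(observation, chunk_size):
    """Hypothetical stand-in for a trained policy: returns a chunk_size x STATE_DIM chunk."""
    return [[0.0] * STATE_DIM for _ in range(chunk_size)]


def rollout(num_steps, chunk_size):
    """Query the policy once per chunk and execute its actions one step at a time."""
    executed, num_queries = [], 0
    chunk = []
    for t in range(num_steps):
        if t % chunk_size == 0:
            # A new k x 14 chunk is predicted from the current observation.
            chunk = dummy_policy(observation=None, chunk_size=chunk_size)
            num_queries += 1
        executed.append(chunk[t % chunk_size])
    return executed, num_queries


actions, queries = rollout(num_steps=500, chunk_size=100)
print(len(actions), len(actions[0]), queries)
```

With 500 time steps and k = 100, the policy in this sketch is queried 5 times while 500 actions of 14 values each are executed.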
2. Infer Z
The encoder takes three inputs. The first is a [CLS] token, which consists of randomly initialized learned weights. The second is the joint positions, which are projected by a linear layer (linear layer2) from 14 dimensions to the 512-dimensional embedding space to obtain the embedded joint positions (embedded joints). The third is the k × 14 action sequence, which is projected by another linear layer (linear layer1) to the embedding dimension to obtain the embedded action sequence (k × 14 to k × 512).
These three inputs together form a sequence of (k + 2) × embedding_dimension, that is, (k + 2) × 512, which is processed by the transformer encoder. Only the first output, which corresponds to the [CLS] token, is kept; another linear network uses it to predict the mean and variance of the distribution of Z, parameterized as a diagonal Gaussian. Samples of Z are obtained with the reparameterization trick.
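The reparameterization step can be sketched in a few lines. This is an illustration only: it assumes the network outputs the log-variance rather than the variance itself (a common convention, not confirmed by the text above), and the latent dimension of 4 is a placeholder chosen for the example.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) for a diagonal Gaussian.

    sigma is recovered from the assumed log-variance as exp(0.5 * logvar).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def cvae_encoder_sequence_length(chunk_size):
    """[CLS] token + embedded joints + k embedded actions = k + 2 tokens."""
    return chunk_size + 2


rng = random.Random(0)            # fixed seed so the sketch is repeatable
mu = [0.0, 0.5, -0.5, 1.0]        # placeholder mean (latent dimension 4 is an assumption)
logvar = [0.0, -1.0, -2.0, -3.0]  # placeholder log-variance
z = reparameterize(mu, logvar, rng)
print(len(z), cvae_encoder_sequence_length(100))
```

Because the randomness enters only through eps, the sample stays differentiable with respect to the predicted mean and variance, which is the reason the trick is used.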
3. Predict an action sequence
① First, each image observation is processed by ResNet18 to obtain a feature map (15 × 20 × 728 feature maps), which is then flattened into a feature sequence (300 × 728). These features are projected to the embedding dimension (300 × 512) by a linear layer (layer5), and a 2D sinusoidal position embedding is added to preserve spatial information.
② Secondly, this operation is repeated for all 4 images, so the resulting feature sequence has dimension 1200 × 512.
③ Next, the feature sequences from each camera are concatenated and used as one of the inputs of the transformer encoder. The other two inputs, the current joint positions (joints) and the "style variable" z, are projected to 512 dimensions from their respective original dimensions (14, 15) by linear layer6 and linear layer7, respectively.
④ Finally, the encoder input of the transformer is 1202 × 512 (the feature dimension of the 4 images is 1200 × 512, the feature dimension of the joint positions is 1 × 512, and the feature dimension of the style variable z is 1 × 512).
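The token count in steps ① to ④ is simple arithmetic, which the following sketch spells out. The function name and its default arguments are illustrative and simply restate the numbers given above.

```python
def encoder_sequence_length(num_cameras=4, feat_h=15, feat_w=20, extra_tokens=2):
    """Number of 512-dimensional tokens fed to the transformer encoder."""
    tokens_per_image = feat_h * feat_w             # 15 * 20 = 300 per camera
    image_tokens = num_cameras * tokens_per_image  # 4 * 300 = 1200
    return image_tokens + extra_tokens             # plus the joints token and the z token


print(encoder_sequence_length())  # 1202
```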
The transformer decoder has two inputs:
On the one hand, the "queries" of the transformer decoder in its first layer are fixed sinusoidal position embeddings, that is, the position embeddings (fixed) shown in the lower right corner of the above figure, whose dimension is k × 512.
On the other hand, the "keys" and "values" in the cross-attention layer of the transformer decoder come from the output of the above-mentioned transformer encoder.
The transformer decoder thus predicts the action sequence given the encoder output.
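How k queries attend over the encoder output can be shown with a toy single-head cross-attention. This is a sketch of plain scaled dot-product attention only: it omits the learned query/key/value projections, multiple heads, and the rest of a real decoder layer, and the toy sizes (3 queries, 5 encoder tokens, embedding dimension 4) are placeholders for k, 1202, and 512.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        # One score per encoder token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Toy stand-ins: 3 position-embedding queries, 5 encoder output tokens.
queries = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
memory = [[0.05 * (i - j) for j in range(4)] for i in range(5)]
out = cross_attention(queries, memory, memory)
print(len(out), len(out[0]))  # 3 4
```

The output has one row per query, which is why k position-embedding queries yield k predicted actions regardless of how many tokens the encoder produced.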
By collecting data and training the above model, you can observe that the results converge.
A third view of the model inference results is as follows. The robotic arm can infer the movement of placing colored blocks from point A to point B.
Cobot Magic is a whole-body teleoperation data collection device, developed by AgileX Robotics based on the Mobile Aloha project from Stanford University. With Cobot Magic, AgileX Robotics has successfully run the open-source code from the Stanford laboratory used on the Mobile Aloha platform, in both simulation and the real environment. AgileX will continue to collect data from various motion tasks based on Cobot Magic for model training and inference. Please stay tuned for updates on GitHub. If you are interested in the Mobile Aloha project, join us on Slack and let's talk about our ideas.
About AgileX
Established in 2016, AgileX Robotics is a leading manufacturer of mobile robot platforms and a provider of unmanned system solutions. The company specializes in independently developed multi-mode wheeled and tracked wire-controlled chassis technology and has obtained multiple international certifications. AgileX Robotics offers users self-developed innovative application solutions such as autonomous driving, mobile grasping, and navigation positioning, helping users in various industries achieve automation. Additionally, AgileX Robotics has introduced research and education software and hardware products related to machine learning, embodied intelligence, and visual algorithms. The company works closely with research and educational institutions to promote robotics technology teaching and innovation.