Mobile robots such as robot vacuum cleaners use lidar for spatial mapping, path planning, and obstacle detection. Lidar struggles with solar radiation near windows, especially in this price segment. Radar sensors are unaffected by solar radiation and are now available more cheaply than lidar sensors, even for hobbyists. Much of the current research on autonomous vehicles uses a combination of 2D cameras and radar sensors. My idea is to put a 2D camera and a short-range radar on a mobile robot and utilize one of the state-of-the-art approaches from that research.
Research, Try-Outs and Decisions
Model pt_centerpoint_astyx
The model pt_centerpoint_astyx from the Vitis-AI model zoo was my first choice. I spent a lot of time figuring out how it works regarding its exact inputs, outputs and their meaning. The link to the dataset no longer works, but I finally found it on GitHub. I tried to retrain it with the sources provided by Xilinx, but in the end I couldn't manage the retraining because of several technical issues. The paper on the neural network used for the original Astyx dataset is apparently not public either; I had to request it from the author and received it after some time. Finally I found out that only cars are annotated in the Astyx dataset, and only with a few samples. Recording and annotating new data would be required, which is a huge effort, so I decided to reject this model.
Model CenterFusion (arXiv:2011.04841v1)
CenterFusion is based on CenterTrack (arXiv:2004.01177v2), which in turn is based on "Objects as Points" (arXiv:1904.07850v2) and claims to be state of the art. CenterFusion fuses camera data with a radar point cloud and is trained on the nuScenes dataset for 3D object detection. The nuScenes dataset covers at least 10 object classes, so I decided to go with it.
Selecting the radar sensor
SparkFun Pulsed Radar Breakout - A111:
I had already experimented with this radar sensor on an RPi 3. It has an SPI interface which could also be connected to a Pmod of the KV260. However, it has only one TX and one RX antenna, so to get a point cloud the sensor would either need to be rotated (like a lidar) or used in a multi-sensor array. In the meantime this sensor is no longer in production and hard to get. The successor would be the XC112 Connector Board + XR112 Radar Sensor Board, which can connect up to four XR112 boards, but this is far too expensive and would blow my budget.
TI IWR6843AOP/IWR6843AOPEVM:
This sensor has 3 TX and 4 RX antennas and delivers a radar point cloud with the out-of-the-box demo firmware on the evaluation board (IWR6843AOPEVM). The evaluation board is compact enough to mount on a mobile robot. The data can be retrieved via a serial interface, so the board can be plugged into a USB port of the KV260 and read out by the CPU. Because of this I decided to use this sensor.
Selecting the camera
I got the Kria KV260 Basic Accessory Pack from the contest and had planned to use the AR1335 for this task, but unfortunately the connector cable is too short to mount the camera in an appropriate way. I then went for the Digilent Pcam, which is similar to the RPi camera, but at that time no hardware design was available to control it and capture frames in PetaLinux. So I decided to use a cheap 1080p USB webcam. One remark: I chose a camera with a FoV of 120 degrees, which is close to the maximum FoV of the radar sensor, hoping that this will make calibration easier later.
Mobile robot hardware
I will use the base plates (3D printouts) and some of the hardware from the Zybot project. To connect the Pmod HB5 and the DC motors I will use the Pmod IOXP. In the end the experimental platform will be a sandwich of base plates, which probably needs to be enlarged because the KV260 is bigger than the Zybo. I faced a big issue here because I wanted to use the PS i2c0: I tried mapping it to the EMIO, marking it as an external port, assigning it to the Pmod IOs etc. The kernel (PetaLinux) recognized it but couldn't power it up because of a "Power Domain failure". I couldn't trace its cause and switched to the Xilinx I2C IP (AXI IIC) instead. That worked: I could control the Pmod IOXP, verified with a logic analyzer.
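Once the AXI IIC controller shows up as an I2C bus in PetaLinux, the expander can also be poked from user space for a quick check. Below is a minimal sketch using the smbus2 package; the bus number, device address and register are placeholders that have to be adapted to the actual device tree and the Pmod IOXP datasheet, not values verified on this build.

# Minimal user-space I2C sanity check (placeholder bus/address/register).
# Assumes the AXI IIC controller appears as /dev/i2c-1 and the Pmod IOXP
# expander answers at 0x34 -- both are assumptions, adjust to your device
# tree and the expander datasheet.
from smbus2 import SMBus

I2C_BUS = 1           # bus index of the AXI IIC controller (see i2cdetect -l)
IOXP_ADDR = 0x34      # placeholder 7-bit address of the expander
SOME_REGISTER = 0x00  # placeholder register from the expander datasheet

with SMBus(I2C_BUS) as bus:
    value = bus.read_byte_data(IOXP_ADDR, SOME_REGISTER)
    print(f"register 0x{SOME_REGISTER:02x} = 0x{value:02x}")
    # writing works the same way:
    # bus.write_byte_data(IOXP_ADDR, SOME_REGISTER, 0x01)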
Operating System
I decided to use PetaLinux (xilinx-k26-starterkit-2021_1). In order to run my compiled model successfully, I needed to update the Vitis-AI libraries and tools to version 2.0. For this I created a new PetaLinux project using the BSP provided on the KV260 wiki and added the Vitis-AI recipes. I also replaced the dropbear SSH server with the OpenSSH daemon so that remote debugging via the Eclipse and PyCharm IDEs works properly.
Capturing radar point cloud data
I flashed the pre-compiled "out of the box" demo (xwr6843AOP_mmw_demo.bin) from the TI Industrial mmWave toolbox and adapted the ROS1 driver from the TI mmwave_industrial_toolbox to grab the data from the UART. To find the right sensor configuration, the mmWave Demo Visualizer is used for testing directly on a PC. The saved configuration parameters can then be applied to the sensor via UART on startup. As already mentioned, the IWR6843AOPEVM is plugged into one of the KV260 USB ports and powered by it. After the configuration is applied and the sensor is started, the radar data can be obtained via UART in the format specified by TI. Depending on the configuration (without heat map) I could get a frame rate of about 30 fps.
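For a quick test outside of ROS, the same data can be grabbed with a few lines of Python. The sketch below sends a saved configuration to the CLI UART and then scans the data UART for the frame sync word of the demo output; device names, baud rates and header offsets are assumptions based on the TI mmWave SDK demo documentation and must be checked against the firmware version actually flashed.

# Rough sketch of pulling point cloud frames from the IWR6843AOPEVM over USB.
import struct
import serial

CLI_PORT = "/dev/ttyUSB0"    # configuration/CLI UART (assumed device name)
DATA_PORT = "/dev/ttyUSB1"   # data UART (assumed device name)
MAGIC = b"\x02\x01\x04\x03\x06\x05\x08\x07"  # frame sync word of the demo output

def send_config(path):
    # send the configuration saved from the mmWave Demo Visualizer line by line
    with serial.Serial(CLI_PORT, 115200, timeout=1) as cli, open(path) as cfg:
        for line in cfg:
            line = line.strip()
            if not line or line.startswith("%"):
                continue
            cli.write((line + "\n").encode())
            print(cli.read(128).decode(errors="ignore"))  # echo + "Done"

def read_frames():
    # accumulate bytes from the data UART and split them at the sync word
    with serial.Serial(DATA_PORT, 921600, timeout=1) as data:
        buf = b""
        while True:
            buf += data.read(4096)
            start = buf.find(MAGIC)
            if start < 0 or len(buf) < start + 16:
                continue
            # total packet length: little-endian uint32 after magic + version
            total_len = struct.unpack_from("<I", buf, start + 12)[0]
            if len(buf) < start + total_len:
                continue
            frame = buf[start:start + total_len]
            buf = buf[start + total_len:]
            yield frame  # TLV parsing (detected points etc.) goes here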
Quantization of the CenterFusion Model
The convolutional backbone of the network referenced by the paper uses DLA (Deep Layer Aggregation), so it could be quite a challenge in terms of run-time performance for the KV260. I was happy to see that pre-trained models already exist. Unfortunately the pre-trained models referenced by the paper mostly use "Deformable Convolution" operations, which are not supported by the Xilinx DPU IP. So I had to retrain the model. To avoid running into all possible issues at once, I decided to follow the paper and train and quantize the intermediary models too.
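Before retraining, it is worth scanning a checkpoint for modules the DPU cannot execute. The helper below is just a quick heuristic I would use for that (matching module class names against patterns such as "Deform"/"DCN"); it is not part of the CenterTrack or CenterFusion code base.

# Quick heuristic to spot DPU-unsupported layers (e.g. deformable convolutions)
# in a loaded PyTorch model before attempting quantization. The patterns are
# simple name matches and must be adapted to the DCN implementation in use.
import torch

def find_unsupported_modules(model: torch.nn.Module, patterns=("deform", "dcn")):
    hits = []
    for name, module in model.named_modules():
        cls = type(module).__name__
        if any(p in cls.lower() for p in patterns):
            hits.append((name, cls))
    return hits

# usage: for name, cls in find_unsupported_modules(model): print(name, cls)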
CenterTrack
Training
- follow the instructions on the CenterTrack GitHub repository to set up the required environment on top of the Vitis-AI docker container
- download, extract and convert the nuScenes dataset as it is described.
- change to the source folder and run the object detection training (batch_size, gpus depend on your environment):
python main.py ddd --exp_id nuScenes_3Ddetection_e140 --dataset nuscenes --batch_size 12 --gpus 0 --lr 5e-4 --num_epochs 140 --lr_step 90,120 --save_point 90,120 --dla_node conv
- afterwards, start the training with tracking:
python main.py tracking,ddd --exp_id nuScenes_3Dtracking --dataset nuscenes --pre_hm --load_model ../models/nuScenes_3Ddetection_e140.pth --shift 0.01 --scale 0.05 --lost_disturb 0.4 --fp_disturb 0.1 --hm_disturb 0.05 --batch_size 12 --gpus 0 --lr 2.5e-4 --save_point 10,20,30,60 --dla_node conv
Quantization
For this step the CenterTrackCustom repository is used. I created a Python script "centernet_quant.py" which handles the quantization process as described in UG1414 (Xilinx documentation), with nearly the same command line options.
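For reference, the core of such a script boils down to the Vitis-AI PyTorch flow from UG1414. The snippet below is a stripped-down sketch with placeholder input shapes and data handling, not the exact content of centernet_quant.py; the exported quantized model is afterwards compiled for the DPU with the Vitis-AI compiler (vai_c_xir).

# Condensed Vitis-AI PyTorch quantization flow (UG1414). Model construction,
# dataset handling and command line parsing are omitted; the input shape is a
# placeholder.
import torch
from pytorch_nndct.apis import torch_quantizer

def quantize(model, data_loader, quant_mode="calib", deploy=False):
    dummy_input = torch.randn(1, 3, 448, 800)   # placeholder input shape
    quantizer = torch_quantizer(quant_mode, model, (dummy_input,),
                                output_dir="quantize_result")
    quant_model = quantizer.quant_model

    # forward passes: many batches for calibration, a single one for test/deploy
    for i, batch in enumerate(data_loader):
        quant_model(batch["image"])
        if quant_mode == "test" and i == 0:
            break

    if quant_mode == "calib":
        quantizer.export_quant_config()
    elif deploy:
        quantizer.export_xmodel(deploy_check=False)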
For the calibration run:
python centernet_quant.py ddd --exp_id nuScenes_3Ddetection_e140 --gpus 0 --dataset nuscenes --dla_node conv --data_dir /mnt/dataset --load_model ../models/nuScenes_3Ddetection_e170.pth --batch_size 1 --quant_mode calib --num_iters 500
For test and deploy run:
python centernet_quant.py ddd --exp_id nuScenes_3Ddetection_e140 --gpus 0 --dataset nuscenes --dla_node conv --data_dir /mnt/dataset --load_model ../models/nuScenes_3Ddetection_e170.pth --batch_size 1 --quant_mode test --num_iters 1 --deploy
The quantized and compiled models can be found in the models folder. The DPU architecture is "DPUCZDX8G_ISA0_B4096_MAX_BG2" and is available when the "kv260-dpu-benchmark" application is loaded. Because of the PyTorch operation "aten::clone" the inference needs to be scheduled in two DPU graphs.
Benchmark test using PetaLinux (xilinx-k26-starterkit-2021_1, Vitis-AI 2.0):
sudo xmutil unloadapp
sudo xmutil loadapp kv260-dpu-benchmark
Accelerator loaded to slot 0
#Identify the DPU graphs
xdputil xmodel DLASegDDDTrack.xmodel -l
#Run benchmark for graph 1 with one thread
xdputil benchmark DLASegDDDTrack.xmodel -i 1 1
FPS= 16.8081 number_of_frames= 1009 time= 60.0305 seconds.
#Run benchmark for graph 7 with one thread
xdputil benchmark DLASegDDDTrack.xmodel -i 7 1
FPS= 17.683 number_of_frames= 1062 time= 60.0577 seconds.
CenterFusion
Training
- follow the instructions on the CenterFusion GitHub repository to set up the required environment on top of the Vitis-AI docker container
- reuse the nuScenes dataset and convert it again as described (the radar point cloud data is needed this time).
- change to the source folder and run the object detection training (batch_size, gpus depend on your environment):
python main.py ddd --exp_id centerfusion --shuffle_train --train_split train --val_split mini_val --val_intervals 1 --nuscenes_att --velocity --batch_size 8 --lr 2.5e-4 --num_epochs 60 --lr_step 50 --save_point 20,40,50 --gpus 0 --not_rand_crop --flip 0.5 --shift 0.1 --pointcloud --radar_sweeps 6 --pc_z_offset 0.0 --pillar_dims 1.0,0.2,0.2 --max_pc_dist 60.0 --load_model ../models/centernet_baseline_e170.pth --dla_node conv
Quantization
For this step the CenterFusionCustom repository is used. I created a Python script "centerfusion_quant.py" which handles the quantization process as described in UG1414 (Xilinx documentation), with nearly the same command line options.
For the calibration run:
python centerfusion_quant.py ddd --exp_id centerfusion --dataset nuscenes --val_split mini_val --run_dataset_eval --nuscenes_att --velocity --pointcloud --radar_sweeps 6 --max_pc_dist 60.0 --pc_z_offset -0.0 --load_model ../models/centerfusion_e60.pth --flip_test --dla_node conv --data_dir /mnt/dataset --batch_size 1 --quant_mode calib --num_workers 0 --gpus 0
For test and deploy run:
python centerfusion_quant.py ddd --exp_id centerfusion --dataset nuscenes --val_split mini_val --run_dataset_eval --nuscenes_att --velocity --pointcloud --radar_sweeps 6 --max_pc_dist 60.0 --pc_z_offset -0.0 --load_model ../models/centerfusion_e60.pth --flip_test --dla_node conv --data_dir /mnt/dataset --batch_size 1 --quant_mode test --num_workers 0 --gpus 0 --deploy
The quantized and compiled models can be found in the models folder. The DPU architecture is "DPUCZDX8G_ISA0_B4096_MAX_BG2" and is available when the "kv260-dpu-benchmark" application is loaded. Because of the PyTorch operations "aten::clone", "select", "unsqueeze" and "generate_pc_hm_custom" the inference needs to be scheduled in two DPU graphs. Compared to plain CenterTrack the CPU load will be higher because of the custom operations.
Benchmark test using PetaLinux (xilinx-k26-starterkit-2021_1, Vitis-AI 2.0):
sudo xmutil unloadapp
sudo xmutil loadapp kv260-dpu-benchmark
Accelerator loaded to slot 0
#Identify the DPU graphs
xdputil xmodel DLASecWrapper.xmodel -l
#Run benchmark for graph 3 with one thread
xdputil benchmark DLASecWrapper.xmodel -i 3 1
FPS= 16.8488 number_of_frames= 1011 time= 60.0041 seconds.
#Run benchmark for graph 10 with one thread
xdputil benchmark DLASecWrapper.xmodel -i 10 1
FPS= 9.77932 number_of_frames= 587 time= 60.0246 seconds.
Result
The benchmark of the quantized model shows that even with a high use of FPGA resources it can't reach real time (at least 30 fps). Assuming there is only one DPU core, objects will be detected with a latency of several camera frames (roughly 3-5 frames at 30 fps) after the corresponding camera/radar frame.
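As a rough sanity check of that estimate, assuming the two CenterFusion graphs run strictly back-to-back on a single DPU core and ignoring the CPU-side custom operations and pre/post-processing:

# Back-of-the-envelope latency from the measured single-thread benchmarks,
# assuming strictly sequential execution of both graphs on one DPU core.
g1_fps, g2_fps = 16.8488, 9.77932
latency_s = 1.0 / g1_fps + 1.0 / g2_fps      # ~0.16 s per processed frame
effective_fps = 1.0 / latency_s              # ~6 fps end to end (DPU only)
camera_frames = latency_s * 30               # ~5 frames of a 30 fps camera
print(f"{latency_s * 1000:.0f} ms -> {effective_fps:.1f} fps, "
      f"~{camera_frames:.1f} camera frames of delay")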
Follow Ups
- Clarify the calibration of the model regarding different camera and radar sensors compared to the dataset
- Evaluate the performance of the floating-point model compared to the quantized model
- Retrain the model with a less complex convolutional backbone to bring the inference rate up to 30 fps - reducing FPGA resource usage would also be good, to leave room for CPU-intensive non-ML calculations
- When satisfied with the model, implement the DPU scheduling, first in Python, later in C++ (a first sketch follows after this list)
- Integrate the radar sensor and the "model component" into ROS2
- Deal with the power supply
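A first sketch of how the Python-side DPU scheduling could look with VART, chaining the two DPU subgraphs with the CPU-side operations in between. The xmodel file name, the tensor scaling and the CPU step are placeholders; this is not tested code from the project.

# Sketch of chaining the two DPU subgraphs of a compiled model with VART.
# cpu_custom_ops() stands in for the operations the compiler left on the CPU
# (aten::clone, generate_pc_hm_custom, ...); int8 scaling via the tensors'
# "fix_point" attribute is omitted for brevity.
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("DLASegDDDTrack.xmodel")   # placeholder file name
dpu_subgraphs = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
                 if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]
runners = [vart.Runner.create_runner(s, "run") for s in dpu_subgraphs]

def run_dpu(runner, inputs):
    outputs = [np.empty(tuple(t.dims), dtype=np.int8, order="C")
               for t in runner.get_output_tensors()]
    job_id = runner.execute_async(inputs, outputs)
    runner.wait(job_id)
    return outputs

def cpu_custom_ops(tensors, radar_point_cloud):
    # placeholder for the CPU-only operations between the two DPU graphs
    return tensors

def infer(image_tensor, radar_point_cloud):
    mid = run_dpu(runners[0], [image_tensor])
    mid = cpu_custom_ops(mid, radar_point_cloud)
    return run_dpu(runners[1], mid)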
(In the context of the Adaptive Computing Challenge 2021) I underestimated the effort of building a fully functional mobile robot that uses a combination of a 2D camera and a radar sensor for its tasks within the contest. Especially the machine learning part consumed most of my available time. Still, I'll continue to work on this topic, because I had the idea long before the contest and the contest pushed me to finally start. The time for the challenge is over and I couldn't finish the project in time, but I have described the work done so far, the decisions I made, and the issues I faced (also reflected in the source code changes) and will face. I hope I can inspire other people and collect some feedback.