Environmental perception plays an integral part in building autonomous vehicles, self-navigating robots, and other real-world applications.
Why 3D Object Detection on Point Clouds?
While deep-learning-based 2D object detection on camera data demonstrates high accuracy, it may not be effective for tasks like localization, measuring the distance between objects, and estimating depth.
Point clouds generated by a LiDAR sensor provide the 3D information needed to localize objects and characterize their shapes more accurately. Therefore, 3D object detection on point clouds is emerging in various applications, especially autonomous driving.
Still, designing a LiDAR-based 3D object detection system is challenging. First, such systems require intensive computation for model inference. Second, because point cloud data is irregular, the processing pipeline requires both preprocessing and post-processing to deliver end-to-end perception results.
The KV260 is a perfect match for such a 3D object detection system. The expensive computation of model inference can be offloaded to and accelerated by the programmable logic of the KV260, while its powerful Arm cores handle the preprocessing and post-processing tasks.
Design Overview
We now discuss the selected deep learning model for 3D object detection on point clouds and the system overview, including software and hardware.
Network Architecture
After surveying existing works, we select the ResNet-based Keypoint Feature Pyramid Network (KFPN), the first real-time system for monocular 3D detection with state-of-the-art performance on the KITTI benchmark. In particular, we adopt its open-source PyTorch implementation on point clouds, SFA3D.
PYNQ-DPU on KV260
We use Ubuntu Desktop 20.04.3 LTS for Xilinx development boards, rather than Petalinux, as the operating system on the KV260, because Ubuntu is a convenient development environment for installing the packages required to preprocess point clouds and post-process results. In addition, the PYNQ and DPU overlay support on the KV260 spares us from designing an efficient DPU from scratch and lets us work in a Python environment. This greatly eases the migration of a CPU/GPU-based deep learning implementation to the KV260.
Get Started!
We now walk you through the detailed instructions to implement the efficient 3D object detection system on the KV260 board, covering model training, model quantization and compilation, and deployment on the KV260.
Let's get started!
Setup Environment
Install the Ubuntu image on the KV260 according to the official guide, and then install PYNQ in the Ubuntu operating system by following the instructions on GitHub. Clone all the required files and install the required packages on the board by executing the following commands.
git clone https://github.com/SoldierChen/DPU-Accelerated-3D-Object-Detection-on-Point-Clouds.git
cd DPU-Accelerated-3D-Object-Detection-on-Point-Clouds
pip install -r requirements.txt
Here, we need PyTorch 1.4, as the VART of the PYNQ DPU is v1.4.
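As a quick sanity check that the pinned framework is in place, you can print the installed PyTorch version (a minimal sketch; the exact patch version depends on your requirements.txt):
# verify that the installed PyTorch version matches the VART v1.4 flow
import torch
print(torch.__version__)  # expect a 1.4.x release
assert torch.__version__.startswith("1.4"), "PyTorch 1.4 is required for this flow"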
Data Preparation
Download the 3D KITTI detection dataset from here.
The downloaded data includes:
- Velodyne point clouds (29 GB)
- Training labels of the object data set (5 MB)
- Camera calibration matrices of the object data set (16 MB)
- Left color images of the object data set (12 GB) (for visualization purposes only)
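The training scripts expect the extracted files in a KITTI-style folder layout. The snippet below is a small sketch that checks for the folders typically read during training; the root path dataset/kitti/training is an assumption, so adjust it to your local setup:
# sketch: verify the assumed KITTI folder layout before training (root path is an assumption)
import os
dataset_root = "dataset/kitti/training"  # assumed location; adjust as needed
for sub in ("velodyne", "label_2", "calib", "image_2"):
    path = os.path.join(dataset_root, sub)
    print(f"{path}: {'found' if os.path.isdir(path) else 'MISSING'}")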
To visualize 3D point clouds with 3D boxes, let's execute:
cd model_quant_compile/data_process/
python kitti_dataset.py
To train the model, execute:
python train.py --gpu_idx 0
The command above uses one GPU for training, but distributed training is also supported. In addition, you can select either fpn_resnet or resnet as the target model. The trained model will be stored in the checkpoint folder with the name "Model_restnet/fpn_resnet_epoch_#". Depending on your hardware, the number of epochs can range from 10 to 300; the more epochs, the better the accuracy.
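Before moving on to quantization, you can quickly inspect a saved checkpoint. This is a minimal sketch; the file name below is only an example, and it assumes the checkpoint stores a plain state_dict:
# sketch: inspect a trained checkpoint (file name is an example)
import torch
ckpt_path = "checkpoints/fpn_resnet_18/Model_fpn_resnet_18_epoch_300.pth"  # example name
state = torch.load(ckpt_path, map_location="cpu")
# assuming the file stores a plain state_dict (tensor name -> tensor)
n_params = sum(v.numel() for v in state.values() if hasattr(v, "numel"))
print(f"{len(state)} entries, {n_params / 1e6:.2f}M parameters")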
Model Quantization and Compilation
Again, as the VART of PYNQ is v1.4, we need Vitis AI v1.4 instead of the latest version (v2.0) for model quantization.
# install Docker first (if not already installed)
docker pull xilinx/vitis-ai-cpu:1.4.1.978
# run the docker
./docker_run.sh xilinx/vitis-ai-cpu:1.4.1.978
We then quantize the model with the following commands. You can read the implementation details in quantize.py.
# activate the pytorch environment
conda activate vitis-ai-pytorch
# install required packages
pip install -r requirements.txt
# in quantize.py, set the default quant_mode to 'calib':
ap.add_argument('-q', '--quant_mode', type=str, default='calib', choices=['calib','test'], help='Quantization mode (calib or test). Default is calib')
# this run quantizes the example model: Model_resnet_18_epoch_10.pth
python quantize.py
# then set the default quant_mode to 'test':
ap.add_argument('-q', '--quant_mode', type=str, default='test', choices=['calib','test'], help='Quantization mode (calib or test). Default is calib')
# this run exports the quantized model
python quantize.py
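For reference, the core of quantize.py follows the standard Vitis AI 1.4 PyTorch flow built around torch_quantizer. The sketch below only shows the general pattern; the model-loading helper, the calibration loader, and the input shape are placeholders, and the script in the repository is the authoritative version:
# sketch of the Vitis AI PyTorch quantization flow (see quantize.py for the real code)
import torch
from pytorch_nndct.apis import torch_quantizer

quant_mode = "calib"                        # first run: 'calib', second run: 'test'
model = load_float_model()                  # placeholder: load Model_resnet_18_epoch_10.pth
dummy_input = torch.randn(1, 3, 608, 608)   # placeholder shape; must match the BEV map input

quantizer = torch_quantizer(quant_mode, model, (dummy_input,))
quant_model = quantizer.quant_model

# run forward passes over calibration data ('calib') or a test pass ('test')
for bev in calibration_batches():           # placeholder data loader
    quant_model(bev)

if quant_mode == "calib":
    quantizer.export_quant_config()         # writes the calibration results
else:
    quantizer.export_xmodel(deploy_check=False)  # writes the quantized xmodel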
Next, we will compile the model.
./compile.sh zcu102 build/
Don't worry about the zcu102 target: the ZCU102 shares the same DPU architecture as the KV260. You will see the following message upon successful compilation.
So far, we have obtained the compiled xmodel that can be executed by the DPU overlay on the KV260. Next, we deploy it on the board and develop the application code.
Deployment on KV260
Prerequisites
Following the official guide, we first set up the Ubuntu operating system on the KV260. Then, we install PYNQ on the board by following the PYNQ-DPU GitHub instructions.
After setting up the board, we need to install git, clone the code to the board, and copy the compiled xmodel into the project folder.
Application Code Design
The full version of the code is shown in the attachment. Here we shall introduce how to invoke and interface with the DPU for inference.
We first load the DPU overlay and the customized xmodel. Then, importantly, we must obtain the input and output tensor information so that the buffers match the expected data shapes. Here, we have one input tensor and five output tensors. Input and output buffers are allocated accordingly.
# load the DPU overlay and the compiled model
overlay = DpuOverlay("dpu.bit")
overlay.load_model("./CNN_zcu102.xmodel")
dpu = overlay.runner
# get tensor information
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()
shapeIn = tuple(inputTensors[0].dims)
outputSize = int(outputTensors[0].get_data_size() / shapeIn[0])
shapeOut = tuple(outputTensors[0].dims)
shapeOut1 = tuple(outputTensors[1].dims)
shapeOut2 = tuple(outputTensors[2].dims)
shapeOut3 = tuple(outputTensors[3].dims)
shapeOut4 = tuple(outputTensors[4].dims)
# allocate input and output buffers.
# Note the output is a list of five tensors.
output_data = [np.empty(shapeOut, dtype=np.float32, order="C"),
               np.empty(shapeOut1, dtype=np.float32, order="C"),
               np.empty(shapeOut2, dtype=np.float32, order="C"),
               np.empty(shapeOut3, dtype=np.float32, order="C"),
               np.empty(shapeOut4, dtype=np.float32, order="C")]
# the input is only one tensor.
input_data = [np.empty(shapeIn, dtype=np.float32, order="C")]
image = input_data[0]
The process of a single inference is encapsulated in the following function. Here, we permute the input tensor into the layout expected by the DPU input tensor (NHWC) and permute the output tensors back into the layout required for post-processing (NCHW). This is critical for correct results.
def do_detect(dpu, shapeIn, image, input_data, output_data, configs, bevmap, is_front):
    if not is_front:
        bevmap = torch.flip(bevmap, [1, 2])
    input_bev_maps = bevmap.unsqueeze(0).to("cpu", non_blocking=True).float()
    # permute from NCHW to the NHWC layout expected by the DPU input tensor
    input_bev_maps = input_bev_maps.permute(0, 2, 3, 1)
    image[0, ...] = input_bev_maps[0, ...]  # .reshape(shapeIn[1:])
    # run inference on the DPU and wait for completion
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
    # convert the output arrays to tensors for the following post-processing
    outputs0 = torch.tensor(output_data[0])
    outputs1 = torch.tensor(output_data[1])
    outputs2 = torch.tensor(output_data[2])
    outputs3 = torch.tensor(output_data[3])
    outputs4 = torch.tensor(output_data[4])
    # permute back from NHWC to NCHW for post-processing
    outputs0 = outputs0.permute(0, 3, 1, 2)
    outputs1 = outputs1.permute(0, 3, 1, 2)
    outputs2 = outputs2.permute(0, 3, 1, 2)
    outputs3 = outputs3.permute(0, 3, 1, 2)
    outputs4 = outputs4.permute(0, 3, 1, 2)
    outputs0 = _sigmoid(outputs0)
    outputs1 = _sigmoid(outputs1)
    # post-processing: decode and filter the detections
    detections = decode(outputs0, outputs1, outputs2, outputs3, outputs4, K=configs.K)
    detections = detections.cpu().numpy().astype(np.float32)
    detections = post_processing(detections, configs.num_classes, configs.down_ratio, configs.peak_thresh)
    return detections[0], bevmap
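As a usage illustration, the sketch below pushes a single frame through do_detect. It assumes the DPU handles and buffers set up above (dpu, shapeIn, image, input_data, output_data) and the configs object from the demo script are already in scope; the BEV-map helper, config module, and sample file path are assumptions based on the SFA3D code base, so adjust the imports to the actual module layout in the repository:
# sketch: run one frame through the DPU-backed detector (helpers and paths are assumptions)
import numpy as np
import torch
# assumed helper that turns a raw point cloud into a (3, H, W) bird's-eye-view map
from data_process.kitti_bev_utils import makeBEVMap
import config.kitti_config as cnf

lidar = np.fromfile("demo_data/000001.bin", dtype=np.float32).reshape(-1, 4)  # example file
bevmap = torch.from_numpy(makeBEVMap(lidar, cnf.boundary))
detections, bevmap = do_detect(dpu, shapeIn, image, input_data, output_data,
                               configs, bevmap, is_front=True)
# 'detections' can then be converted to boxes and drawn on the BEV map,
# as done in demo_2_sides-dpu.py and demo_front-dpu.py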
Execution on KV260
By running the following command, inference on the demo data will be executed on the DPU, and the results for the two side views will be stored in a video.
python demo_2_sides-dpu.py
The results of the front view will be stored in a video by running the following command.
python demo_front-dpu.py
The performance ranges from 10 to 20 FPS, which is 100 to 200 times faster than execution on a server-class CPU (Intel Xeon Gold 6226R). In addition, the model delivers state-of-the-art accuracy; check out the detailed evaluation here.
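If you want to reproduce the throughput number on your own board, you can time the inference loop directly. The sketch below assumes a list of preprocessed BEV maps, bevmaps, is already available (as prepared in the demo scripts) and that the DPU buffers and configs from the setup above are in scope:
# sketch: measure end-to-end FPS of the DPU-backed detector
import time

start = time.time()
for bev in bevmaps:  # 'bevmaps' assumed to hold preprocessed BEV maps
    do_detect(dpu, shapeIn, image, input_data, output_data, configs, bev, is_front=True)
elapsed = time.time() - start
print(f"{len(bevmaps) / elapsed:.1f} FPS over {len(bevmaps)} frames")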
Conclusion and Future Work
In summary, we demonstrated how easy it is to use the AMD-Xilinx DPU on the KV260 to accelerate point-cloud-based 3D object detection. To further boost performance, we plan to optimize the model inference phase by using multiple DPU instances, and the preprocessing and post-processing phases by using multithreading and batch processing.
Credits:
- LiDAR point-cloud based 3D object detection implementation with colab
- Super Fast and Accurate 3D Object Detection based on 3D LiDAR Point Clouds