Depth estimation is a key technology in autonomous driving. The distance information of obstacles can be obtained from various sensors such as lidar, monocular cameras, stereo cameras, and infrared sensors. As the most commonly used sensor in autonomous driving, cameras provide comprehensive, rich and dense information. Stereo vision-based depth estimation can accurately identify and locate both moving and static targets as well as road elements. It also obtains critical point cloud depth information for arbitrary obstacles, effectively reducing the rate of missed detections.
With the rapid development of convolutional neural network technology and the emergence of large-scale standard stereo datasets, neural network-based stereo matching algorithms have advanced rapidly, achieving higher accuracy and efficiency than traditional methods. To apply these algorithms to autonomous driving, it is important to deploy stereo matching on edge devices.
2. Why
According to statistics, about 1.25 million people die in road traffic accidents worldwide every year, equivalent to 3,500 deaths every day. If autonomous driving can be achieved, not only can these casualties be largely avoided, but problems such as traffic congestion and carbon emissions can also be effectively alleviated.
In the field of autonomous driving, the stereo vision solution, as the one closest to human perception, has always attracted much attention. It can provide very reliable high-precision dense depth information. End-to-end binocular stereo matching based on deep learning, with its powerful feature extraction capability, ensures that objects with weak or repeated textures can still yield effective, high-precision depth information, supporting the sensing requirements of intelligent driving systems in all scenarios. Furthermore, the neural network can reach higher accuracy through continuous data training. This makes CNN-based stereo applications a trend in ADAS/AD.
Compared with GPU platforms, FPGAs have the advantages of lower power, higher efficiency and more flexibility. The abundant interfaces of the FPGA make it adaptable to various autonomous driving scenarios. In this work, we chose the KV260 development board for a data-injection-based demo to verify the feasibility of deploying a CNN stereo application on the ZU5. For the final version, we built an expansion board for the ZU5 named Heimdallr-DEB to connect the FPGA with stereo cameras for a real-time demo.
3. Algorithm introduction
3.1 Introduction to FADNet
There are several CNNs for stereo matching, such as RAFT-Stereo and FADNet. Compared with the others, FADNet achieves competitive depth precision with far fewer computational operations, which makes it suitable for edge device deployment.
FADNet (A Fast and Accurate Network for Disparity Estimation) is a disparity estimation model proposed by researchers from Hong Kong Baptist University in March 2020 that balances efficiency and accuracy. FADNet maintains a high computational speed by combining 2D convolution and correlation layer operations, and also uses residual structures and multi-scale feature fusion to reduce the difficulty of training and improve the accuracy of the model. The structure of FADNet is shown in the following figure [2]:
FADNet is mainly built on the two-dimensional convolutions of an encoder-decoder architecture. It uses the structure of DispNetC, an earlier network for disparity estimation, as the backbone, and introduces residual structures and point-wise correlation operations to improve feature extraction and expression capability. The network is then trained with a multi-scale residual learning strategy, using a weighted loss schedule that moves from coarse to fine scales.
The following figure shows the results of FADNet on the KITTI dataset [2]:
In general, the structure of FADNet is deep and complex. To implement the algorithm on edge computing devices such as the KV260 board, the network needs to be simplified and adjusted to a certain extent.
In this project, we made some modifications to the FADNet model: we retained the upper part of the original model and made some structural optimizations. Afterwards, we retrained the model parameters using the existing large-scale Scene Flow driving stereo dataset together with some real-world data we manually collected and annotated. Finally, we obtained a simplified FADNet named PhiFADNet that offers both computational accuracy and real-time performance.
We can divide the PhiFADNet model into three parts: a front-end network, a matching network and a back-end network. The front-end network consists of two branches that share parameters, used to extract features from the left and right views. The model takes the left-view and right-view images as input, extracts image features through the front-end network, calculates the matching cost of the feature maps through the matching network, and finally obtains the depth map through the back-end network. The abstract structure of the model is shown in the following figure:
The main computing operations of the front-end network are convolution and downsampling: through multiple convolution and downsampling operations, the input feature map is progressively reduced to extract image features. The matching network computes the matching cost between the left and right feature maps through correlation operations. The back-end network uses the matching results from the matching network together with some intermediate features of the front-end network to achieve scale recovery and disparity refinement through operations such as convolution and deconvolution.
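To make the data flow concrete, the three-part structure can be sketched as follows (a schematic sketch only; frontend, correlation_1d, backend and max_disp are placeholder names, not the actual PhiFADNet code):

def phifadnet_forward(left, right, frontend, correlation_1d, backend, max_disp=40):
    # Shared-weight feature extraction: the same front-end runs on both views
    feat_l, skips = frontend(left)    # 'skips': intermediate features reused later
    feat_r, _ = frontend(right)
    # Matching network: cost volume over the horizontal disparity range
    cost = correlation_1d(feat_l, feat_r, max_disp)
    # Back-end: scale recovery and disparity refinement from cost + skip features
    return backend(cost, skips)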
3.3 Matching Network
The matching network, also known as the "correlation layer" in FADNet, introduces prior geometric knowledge without adding additional training parameters, improving network accuracy and efficiency.
The role of the correlation layer is to find correspondences between the left and right images. Given two multi-channel feature maps f1 and f2 with width w, height h and c channels, the correlation between two patches centered at x1 in f1 and x2 in f2 is calculated as:

c(x1, x2) = Σ_{o ∈ [−k, k] × [−k, k]} ⟨f1(x1 + o), f2(x2 + o)⟩
where ⟨·, ·⟩ is the vector inner product, k determines the kernel size of the matching cost (a square patch of side 2k + 1), and x1 and x2 are the patch centers in the two feature maps.
When the disparity search range is the entire feature map, the computational complexity of the correlation operation is very high. Therefore, a maximum disparity D is generally used to reduce the search space to a neighborhood of 2D+1 pixels horizontally and vertically. Moreover, since the input images for stereo matching are usually rectified and row-aligned, the disparity search can be constrained to the horizontal direction, and a one-dimensional correlation layer based on the vector inner product is used to calculate the similarity between the left and right feature vectors:

c(x, y, d) = (1 / Nc) · Σ_{i=1..Nc} f1_i(x, y) · f2_i(x − d, y)
where f1 and f2 are the left and right multi-channel feature maps extracted by the convolutional neural network, (x, y) are the pixel coordinates, Nc is the number of feature channels, and d ∈ [0, D − 1] is the disparity value.
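As a reference for the formula above, a direct PyTorch version of the one-dimensional correlation might look like this (a software sketch of the operation, not the deployed accelerator):

import torch

def correlation_1d(f1, f2, max_disp):
    """Matching cost volume for rectified stereo features.
    f1, f2: (N, Nc, H, W) left/right feature maps; output: (N, max_disp, H, W)."""
    n, nc, h, w = f1.shape
    cost = f1.new_zeros(n, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (f1 * f2).mean(dim=1)  # <f1(x,y), f2(x,y)> / Nc
        else:
            # compare f1 at column x with f2 at column x - d
            cost[:, d, :, d:] = (f1[:, :, :, d:] * f2[:, :, :, :-d]).mean(dim=1)
    return cost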
3.4 Algorithmic Advantage
Traditional stereo matching algorithms can be divided into two categories: local and global methods. They generally consist of four parts: matching cost calculation, matching cost aggregation, disparity regression, and disparity refinement. However, traditional algorithms either struggle to guarantee accuracy and robustness, or their computational complexity is too high to achieve real-time performance.
Stereo matching based on deep learning, especially the end-to-end algorithms represented by FADNet, can not only be trained on existing large-scale datasets but also be updated with real-world data, which allows CNN-based stereo matching to achieve excellent performance through iterative training. Therefore, compared with traditional algorithms, deep-learning-based algorithms not only have high accuracy, but also have more room for improvement: the accuracy and quality of matching can be further improved through iteration.
3.5 Model Training and QAT
First of all, we deployed the standard FADNet on the KV260-vitis platform, running the front-end and back-end networks on the DPU. By profiling the DPU, we found that the concat layers in the network had become an efficiency bottleneck because of their memory rearrangement. So we replaced all the concat layers with an add layer preceded by a 1x1 convolution layer. Furthermore, we cut off all the multi-scale predictions except the last one in our final model.
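One way to picture this replacement (a minimal PyTorch sketch with assumed channel counts; the real layer placement in PhiFADNet may differ):

import torch
import torch.nn as nn

# Original pattern: y = follow_up(torch.cat([a, b], dim=1))
# DPU-friendly pattern: project one branch with a 1x1 convolution, then add.
class FuseByAdd(nn.Module):
    def __init__(self, ch_a, ch_b):
        super().__init__()
        # 1x1 convolution maps branch b onto branch a's channel count
        self.proj = nn.Conv2d(ch_b, ch_a, kernel_size=1)

    def forward(self, a, b):
        # element-wise add avoids the memory rearrangement that concat requires
        return a + self.proj(b)

# usage with assumed channel counts: fuse = FuseByAdd(64, 32); y = fuse(feat_a, feat_b)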
Since training with only a single prediction scale converges poorly, we still used multi-scale losses in our float32 training process. First, we trained the simplified float32 FADNet with a multi-scale loss using both synthetic and real-world data. The trained float model was then passed to the QAT (Quantization Aware Training) trainer provided by Xilinx Vitis-AI. During QAT, only real-world data was used for better results in driving scenarios. QAT is the key step for good inference accuracy, since PTQ (Post-Training Quantization) significantly increased both EPE and D1-all.
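For reference, the Vitis-AI PyTorch QAT flow looks roughly like the sketch below; it is based on the pytorch_nndct QatProcessor API, and the model variable, input shape and output directory are placeholders (exact signatures depend on the Vitis-AI version):

import torch
from pytorch_nndct import QatProcessor  # Vitis-AI PyTorch quantization tool

# 'model' is the trained float32 PhiFADNet (placeholder); the input shape
# below is illustrative, not the actual network resolution.
dummy_inputs = (torch.randn(1, 3, 576, 960), torch.randn(1, 3, 576, 960))
qat_processor = QatProcessor(model, dummy_inputs, bitwidth=8)

# Train the fake-quantized model with the usual training loop,
# using only real-world driving data at this stage.
quant_model = qat_processor.trainable_model()
# ... training loop over quant_model ...

# Convert to a deployable model for DPU compilation.
deployable = qat_processor.to_deployable(quant_model, output_dir='qat_out')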
3.6 Optimized model for driving scenario
The following shows the actual performance of our model on a highway. The left side is the right camera view, and the right side is the depth map output by the model in real time.
4.1 Image Pre-processing
This module is implemented on the PL side of the KV260 board. Two identical image pre-processing modules are instantiated in the FPGA to down-sample the original input images. The image data is transferred from the CMOS sensors to the FPGA through the MIPI interface. The down-sampling functions include resize and crop, each of which can be used individually or in combination.
This module can resize and crop images of any size through parameters that can be flexibly configured at run time, so that the system can process any position in the scene at any resolution, significantly improving the flexibility and compatibility of the system.
The resize algorithm is based on bilinear interpolation and is well suited to FPGA implementation. Combined with the data bandwidth of the AXI bus and the high parallelism of the FPGA platform, this IP implements an image resize that can produce any target image smaller than or equal to the original. It has the advantages of low power consumption and easy integration with other modules. The module adopts a linear splitting method for the interpolation, which reduces the number of multiplications; this lowers the implementation area and power consumption, and the reduced operation count allows a higher system clock frequency.
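The multiplication saving from linear splitting can be checked in software: instead of expanding all four bilinear weights (eight multiplications per output pixel), the interpolation is factored into three one-dimensional lerps (a NumPy sketch of the arithmetic, not the RTL implementation):

import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Downscale a 2D image with bilinear interpolation in factored form:
    three multiplications per output pixel instead of eight."""
    img = np.asarray(img, dtype=np.float64)
    in_h, in_w = img.shape
    ys = (np.arange(out_h) + 0.5) * in_h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * in_w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, in_h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, in_w - 2)
    dy = (ys - y0)[:, None]
    dx = (xs - x0)[None, :]
    p00 = img[y0][:, x0]; p01 = img[y0][:, x0 + 1]
    p10 = img[y0 + 1][:, x0]; p11 = img[y0 + 1][:, x0 + 1]
    top = p00 + dx * (p01 - p00)      # lerp along x, top row: 1 multiply
    bot = p10 + dx * (p11 - p10)      # lerp along x, bottom row: 1 multiply
    return top + dy * (bot - top)     # lerp along y: 1 multiply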
We tried the HLS-based image pre-processing module in the kv260-vitis repository provided by Xilinx ( https://github.com/Xilinx/kv260-vitis ), but found that, to realize the required functions, it consumes a large amount of FPGA logic and storage resources, leaving insufficient resources for the rest of our system. By using our customized image pre-processing module instead, the resource consumption of the system is significantly reduced while the function is preserved.
4.2 Image Rectification
The parameters of the image rectification module are determined by Zhang's calibration method. The PS reads the stereo image data from the DDR and performs distortion correction and epipolar rectification on the images according to the rectification parameter matrices.
Image rectification makes the images from the left and right cameras satisfy the following property: the same object has the same size in both images and lies on the same horizontal line. As a result, the matching cost only needs to be computed along the horizontal direction, which also improves the final depth precision.
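On a PC, the equivalent rectification can be reproduced with OpenCV (a minimal sketch of the PS-side processing, assuming the intrinsics K1/K2, distortion D1/D2 and extrinsics R/T from Zhang's calibration are available):

import cv2

def rectify_pair(left_raw, right_raw, K1, D1, K2, D2, R, T):
    """Undistort and row-align a stereo pair; calibration comes from Zhang's method."""
    h, w = left_raw.shape[:2]
    # Compute rectification rotations and projection matrices for both cameras
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, (w, h), R, T)
    m1x, m1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
    m2x, m2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
    left = cv2.remap(left_raw, m1x, m1y, cv2.INTER_LINEAR)
    right = cv2.remap(right_raw, m2x, m2y, cv2.INTER_LINEAR)
    return left, right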
4.3 Xilinx DPU realizes front-end network and back-end network
As described in the algorithm introduction, the main operations of the front-end network are convolution and downsampling, used to extract image features; the main operations of the back-end network are convolution and deconvolution, used for disparity refinement and scale recovery. Both parts of the network can be accelerated by the Xilinx DPU IP. To achieve the best performance of the stereo application, we reduced the resources of the other modules in order to instantiate a DPU-B4096 (the largest DPU configuration) on the board.
4.4 Correlation acceleration module
The basic principle of the correlation operation (i.e. the matching network) was introduced in section 3.3 above. Since the correlation operation is a rather special calculation in the stereo matching network, the DPU cannot accelerate it. So at first, the correlation operation was implemented on the PS side of the KV260. However, in practical tests, one correlation operation takes 80 ms on the quad-core Cortex-A53 when the input image size is 1 M pixels, and 160 ms when the input image size is 2 M pixels.
Taking 1 M pixels as an example, the per-frame processing time of the stereo neural network is shown in the following figure:
It can be seen that the front-end and back-end networks implemented on the DPU run quickly, while the matching network implemented on the ARM consumes most of the processing time, greatly increasing the latency of the system. Such latency is a serious problem for autonomous driving applications!
Analyzing the operations of the matching network (the correlation layer), they consist of basic multiplications and additions. So we decided to develop a dedicated RTL kernel to accelerate the correlation operations.
The input and output of the correlation accelerator both come from the DPU. Since the data width in the DPU is 8-bit, the data precision in the correlation accelerator should also be 8-bit. To implement this, we added a scaler in the kernel to guarantee the precision of the final results. In our experiments, the scaler mechanism kept the precision loss below 0.1%.
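The scaler behavior can be modeled in a few lines (a behavioral NumPy sketch; the power-of-two shift value is an assumption chosen offline to match the DPU's fixed-point format):

import numpy as np

def rescale_to_int8(acc, shift):
    """Behavioral model of the scaler: round a wide accumulator value
    back to int8 by a power-of-two scale, then saturate."""
    rounded = (acc + (1 << (shift - 1))) >> shift   # round-to-nearest via bias
    return np.clip(rounded, -128, 127).astype(np.int8)

# e.g. int8 products summed over many channels grow well beyond int8 range,
# so a shift of ~10 brings the accumulator back into [-128, 127].
acc = np.array([51234, -48000, 300], dtype=np.int64)
print(rescale_to_int8(acc, 10))   # -> [ 50 -47   0]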
To reduce resource usage, we run the DSPs at double data rate and pack two multipliers into a single DSP. This method reduces the operation time and improves the real-time performance of the system. At the same time, we optimized the calculation order of the correlation layer to avoid repeated reads of the feature maps, saving the data bandwidth occupied by the operation.
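The "single DSP, double multiplier" idea can be sanity-checked with plain integers: two operands sharing a multiplicand are packed into one wide word so that one hardware multiply yields both products (unsigned illustration only; the real DSP48 implementation must also handle signed fields and bit widths):

def packed_double_multiply(a, b, c, shift=18):
    """Compute a*c and b*c with a single wide multiplication.
    Valid here for unsigned operands with b*c < 2**shift."""
    packed = (a << shift) + b                        # pack a and b into one word
    product = packed * c                             # one multiply, both results
    return product >> shift, product & ((1 << shift) - 1)

print(packed_double_multiply(3, 5, 7))   # -> (21, 35), i.e. 3*7 and 5*7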
After optimization, the acceleration module takes less than 5 ms for one correlation operation when the input image size is 1 M pixels, and less than 10 ms when it is 2 M pixels.
Again at 1 M pixels, after introducing the acceleration module, the per-frame processing time of the stereo neural network is:
5. System Structure
The block diagram of the system realized on the KV26-SOM is shown in the following figure.
The original stereo image data is fed to the image pre-processing module through the MIPI interface. This module is implemented in the FPGA and can resize and crop the stereo images according to parameters configured at run time. The data is then transmitted to the DDR of the PS (processing system) side through the AXI4 bus, and the ARM rectifies and transforms the images according to the calibration parameter matrices of the stereo camera. Finally, the DPU and the correlation accelerator complete the matching cost calculation and the depth calculation of the rectified stereo images.
6. Expansion board: Heimdallr-DEB
We cannot connect our stereo cameras to the Kria KV260 development board because of its hardware interfaces. To deploy our stereo application in real-world scenarios, we developed an expansion board based on the KV26 SOM schematic. The expansion board provides three GMSL interfaces feeding the MIPI controllers and two GEO GW5300 chips for image processing. The overall appearance of the expansion board is as follows:
The Phigent-Heimdallr stereo application can be deployed on a vehicle, running in real time on the Heimdallr-DEB platform. The stereo camera is mounted behind the windshield, and the Heimdallr-DEB is placed in the trunk of the car. The stereo camera transfers the image data to the board through coaxial cables; the depth map is then calculated in the FPGA and the results are transferred to the host computer. Finally, the host computer processes the depth map and displays the depth point clouds. The following are some pictures of the Phigent-Heimdallr deployment on our test car.
7. How to Run
This project has been open-sourced on GitHub: https://github.com/PhigentRobotics/Heimdallr_software
To implement the functions mentioned above, first prepare the following hardware:
- 1. a development board (KV260 or Heimdallr-DEB)
- 2. a TF SD card
- 3. a TF SD card reader
- 4. a micro USB cable and an Ethernet cable
- 5. a PC running Ubuntu 20.04
- 6. a stereo camera (for Heimdallr-DEB)
We also need to prepare some software:
- 1. a SD card flash tool, download address: https://www.balena.io/etcher/
- 2. a serial port management tool, download address: https://mobaxterm.mobatek.net/
- 3. Board image file, download address:
After preparing the above hardware and software, perform the following steps:
- 1. Install the SD card flash tool on the PC
- 2. Insert the SD card into the PC through the card reader
- 3. Select the downloaded image
- 4. Select the SD card
- 5. Start flashing
After flashing, remove the SD card and perform the following steps:
- 1. Install the SD card on the development board, and connect the development board and PC with the micro USB cable.
- 2. Open the serial port management tool on the PC side, find the corresponding serial port, such as COM4, and create a serial port session with a baud rate of 115200
- 3. Turn on the power supply of the development board, and the system will automatically log in as root
- 4. Configure the IP address of the board. Execute the command vi /etc/network/interfaces and replace the contents of the opened interfaces file with the following (just for example):
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 10.31.1.178
netmask 255.255.255.0
network 10.31.1.0
gateway 10.31.1.254
- 5. Restart the development board and connect the development board and router with an Ethernet cable.
The above steps complete the modification of the development board's IP address; you can now use the PC as a host to log in to the development board through SSH. Note that the PC and the board should be on the same LAN.
With the PC and the development board connected to the same router, execute the following code in a terminal:
# download files from github
git clone --single-branch https://github.com/PhigentRobotics/Heimdallr_software.git -b dev
cd Heimdallr_software
tar -caf board_files.tgz bin boot_fs config k26heimdallr lib models
# transfer files onto board
scp board_files.tgz petalinux@10.31.1.178:~/
# then log on the board
ssh petalinux@10.31.1.178
For KV260:
# !!! the following commands should be run on the board
sudo dnf install -y xir xir-dev vart vart-dev packagegroup-petalinux-vitisai-dev
cd /home/petalinux
tar -xf board_files.tgz
cp -r k26heimdallr /lib/firmware/xilinx/
reboot
#after reboot
xmutil unloadapp
xmutil loadapp k26heimdallr
cd /home/petalinux
sh bin/run_heimdallr-app_playback.sh
For Heimdallr-DEB:
First replace the boot partition, then copy the application onto the board:
# !!! the following commands should be run on the board
echo "backup boot partition"
cp -r /media/sd-mmcblk1p1 ~/boot_bk
rm -rf /media/sd-mmcblk1p1/*
mv ~/boot_bk /media/sd-mmcblk1p1/
ls /media/sd-mmcblk1p1/boot_bk
echo "done backup boot partition"
cd /home/petalinux
tar -xf board_files.tgz
cp -r boot_fs/* /media/sd-mmcblk1p1/
sync
md5sum ~/boot_fs/* && md5sum /media/sd-mmcblk1p1/*
cp -r k26heimdallr /lib/firmware/xilinx/
md5sum ~/k26heimdallr/* /lib/firmware/xilinx/k26heimdallr/*
reboot
#after reboot
xmutil unloadapp
xmutil loadapp k26heimdallr
cd /home/petalinux
sh bin/run_heimdallr-app.sh
To display the result, there is a GUI client for the PC (Ubuntu 20.04):
# download files from github
git clone --single-branch https://github.com/PhigentRobotics/Heimdallr_software.git -b dev
cd Heimdallr_software
tar -xf heimdallr-hmi-bin.tgz
cd heimdallr-hmi-bin
# before run it, please edit the board ip settings on data/vradar_test.json
./AppRun.sh
We recorded a short video showing how to run the Phigent Heimdallr on the KV260 board.
8. Demo Video
The following are some demos running on the Xilinx KV260 board and our Heimdallr-DEB.
Note: Because the KV260 board cannot be directly connected to the stereo camera module, the stereo application deployed on the KV260 is demonstrated by data injection from the SD card.
Simulation on Scene Flow dataset on KV260
Reality scenario input on KV260
Real-time Phigent Heimdallr application running on road on Heimdallr-DEB