Perception is a key task in autonomous driving and probably among the most discussed topics in today's technology circles.
In very simple terms, perception is the task of making sense of data. This data is typically generated by a wide range of sensors such as cameras, RADARs and LiDARs.
Among the most commonly used perception tasks are object detection and semantic segmentation.
Object detection is the task of locating and classifying objects of interest in sensor data.
Semantic segmentation is the task of labeling every pixel/element of the sensor data as belonging to one of a set of classes of interest.
Deep Neural Networks (DNNs) have become the mainstay of complex perception algorithms that would otherwise be nearly impossible to design using conventional computer vision techniques.
The basic mathematical operation that goes into any type of DNN is matrix multiplication.
As it turns out, matrix multiplication is an operation that can be heavily parallelized. This is fundamentally why GPUs are useful for deep learning.
This is where hardware acceleration can serve as a means to speed up deep learning algorithms.
This project aims to demonstrate just that, using the Ultra96v2 development board as the hardware acceleration platform.
The Ultra96v2 is based on the Xilinx Zynq MPSoC platform and the development board features several other peripherals apart from the FPGA.
The project demonstrates how object detection can be performed in LiDAR point clouds using the Ultra96v2 as the hardware acceleration platform and Vitis-AI as the software platform. It also demonstrates in an end-to-end fashion how the Ultra96v2 with the object detection model running on it can work as an edge-AI application.
The sensor data
The project uses LiDAR sensor data as input. Since state-of-the-art LiDAR sensors are very expensive, the project uses pre-recorded data from such a sensor: the KITTI dataset.
A typical scan from the sensor, known as a point cloud, is shown below.
Every point in the point cloud is represented by at least 3 numbers that correspond to the x, y and z coordinates of the point in 3D space with respect to the sensor.
We are trying to figure out where the cars and other objects are in the above point cloud and put a bounding box around each of them.
This is shown below.
The neural network used is a semantic segmentation network known as U-Net.
It is modified to predict key points, where key points are the object centers in the range image.
Once we have the key points, a simple post-processing step converts them back to 3D coordinates (x, y, z).
A model-based box fitting is then done depending on the class of each key point.
The box dimensions are found by taking the average dimensions of boxes for the different object categories in the KITTI training dataset.
For example, cars typically have box dimensions of 3.6, 1.5 and 1.8 for length, width and height respectively.
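As a hedged illustration of this model-based box fitting, the following sketch places a fixed-size, axis-aligned box around a detected object center; it is a simplified stand-in for the actual attached code, and only the car dimensions from the example above are filled in.
import numpy as np

# Average box dimensions (length, width, height) per class; only 'Car' is
# filled in from the example above, other classes would be added similarly.
AVG_BOX_DIMS = {
    'Car': (3.6, 1.5, 1.8),
}

def fit_box(center_xyz, obj_class):
    """Return the 8 corners of an axis-aligned box of average size
    for obj_class, centered at center_xyz = (x, y, z)."""
    l, w, h = AVG_BOX_DIMS[obj_class]
    x, y, z = center_xyz
    dx, dy, dz = l / 2.0, w / 2.0, h / 2.0
    return np.array([[x + sx * dx, y + sy * dy, z + sz * dz]
                     for sx in (-1, 1)
                     for sy in (-1, 1)
                     for sz in (-1, 1)])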
The following figure shows the architecture of the network.
The 3D point cloud is first converted into a 2D representation known as a range image, which makes it compatible with the U-Net architecture shown above.
Details of the conversion process can be found in the SqueezeSeg paper.
The authors also provide a pre-converted dataset.
For more details, please refer to their github page.
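As a minimal sketch of such a spherical projection (not the exact SqueezeSeg code; the 64 x 256 image size matches the network input used later, while the field-of-view limits are assumptions that would have to be matched to the actual sensor), the conversion could look like this:
import numpy as np

def pointcloud_to_range_image(points, h=64, w=256,
                              fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud into an h x w range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                # range of each point
    azimuth = np.arctan2(y, x)                        # horizontal angle
    elevation = np.arcsin(z / np.maximum(r, 1e-6))    # vertical angle
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    # Map angles to pixel coordinates
    cols = np.clip(((azimuth + np.pi) / (2 * np.pi) * w).astype(int), 0, w - 1)
    rows = np.clip(((fov_up - elevation) / (fov_up - fov_down) * h).astype(int),
                   0, h - 1)
    range_image = np.zeros((h, w), dtype=np.float32)
    range_image[rows, cols] = r    # later points overwrite earlier ones
    return range_image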
Predictions
Key points are predicted for each object, where a key point is the center of the object in its segmentation mask. The following figure shows some predicted key points.
Once key points are detected and filtered down to one key point per object, the spherical coordinates are converted back to 3D Cartesian coordinates to get the position of each key point in 3D space (in the 3D point cloud).
The math can be found here.
Also attached is the code to convert the 2D key points to 3D.
The above figure shows how the predicted 2D key points map to 3D locations in the point cloud.
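As a hedged sketch of that back-conversion (not the exact attached code; the field-of-view values mirror the assumptions made in the projection sketch above), a 2D key point at (row, col) can be mapped back to Cartesian coordinates like this:
import numpy as np

def keypoint_to_xyz(row, col, range_image, h=64, w=256,
                    fov_up_deg=3.0, fov_down_deg=-25.0):
    """Invert the range-image projection for a single key point."""
    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    azimuth = (col + 0.5) / w * 2 * np.pi - np.pi
    elevation = fov_up - (row + 0.5) / h * (fov_up - fov_down)
    r = range_image[row, col]
    x = r * np.cos(elevation) * np.cos(azimuth)
    y = r * np.cos(elevation) * np.sin(azimuth)
    z = r * np.sin(elevation)
    return x, y, z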
Once the object locations are known, a model-based box fitting is done, i.e., for each object class, the average box dimensions found from the training set are used to place 3D bounding boxes as shown below.
The original model was trained using Keras, which gives a .h5 file.
The trained model can be found attached.
Step 1: Getting the board image for Ultra96V2
The board image can be obtained from Mario's tutorial. Otherwise, we can build our own board image for the Ultra96v2.
The board image contains all necessary components to instantiate the DPU IP.
Flash it to the SD card using Etcher.
Then, copy the .dcf file from the SD card to your local file system.
Create a folder named ULTRA96V2 inside the GitHub code base used in the next step and copy the .dcf file into it.
Then create an arch.json file in the same folder.
Both can be found attached.
Power up the board, connect to the web server and configure the WiFi.
ssh into the board and, from the root folder, verify that the DPU is indeed present by running:
dexplorer --whoami
You should see an output similar to:
[DPU Core Configuration List]
DPU Core : #0
DPU Enabled : Yes
DPU Arch : B4096
DPU Target Version : v1.4.1
DPU Frequency : 300 MHz
Ram Usage : High
DepthwiseConv : Enabled
DepthwiseConv+Relu6 : Enabled
Conv+Leakyrelu : Enabled
Conv+Relu6 : Enabled
Channel Augmentation : Enabled
Average Pool : Enabled
DPU Core : #1
DPU Enabled : Yes
DPU Arch : B4096
DPU Target Version : v1.4.1
DPU Frequency : 300 MHz
Ram Usage : High
DepthwiseConv : Enabled
DepthwiseConv+Relu6 : Enabled
Conv+Leakyrelu : Enabled
Conv+Relu6 : Enabled
Channel Augmentation : Enabled
Average Pool : Enabled
[DPU Extension List]
Extension Softmax
Enabled : Yes
Step 2: Quantizing and compiling the .h5 model to .elf
Set up the local system with Vitis-AI
Clone the DenseNet example from the Vitis-AI-Tutorials GitHub repository:
https://github.com/Xilinx/Vitis-AI-Tutorials.git
This tutorial has all the files necessary to convert, quantize and compile the trained .h5 Keras model into a .elf that can be deployed on the Ultra96V2.
After cloning, replace the k_model.h5 file inside the files/build/keras_model folder with the trained model attached, or with your own custom model. The name must remain k_model.h5; this ensures that fewer changes are necessary in the following steps.
Now, change the following parameters in the 0_setenv.sh script based on your local system.
# network parameters
export INPUT_HEIGHT=64
export INPUT_WIDTH=256
export INPUT_CHAN=1
export INPUT_SHAPE=?,${INPUT_HEIGHT},${INPUT_WIDTH},${INPUT_CHAN}
export INPUT_NODE=input_1
export OUTPUT_NODE=conv2d_23/BiasAdd
export NET_NAME=u3d_kp
# training parameters
export EPOCHS=200
export BATCHSIZE=150
export LEARNRATE=0.001
# target board
sudo mkdir /opt/vitis_ai/compiler/arch/DPUCZDX8G/ULTRA96V2
sudo cp ./ULTRA96V2/arch.json /opt/vitis_ai/compiler/arch/DPUCZDX8G/ULTRA96V2
sudo cp ./ULTRA96V2/ULTRA96V2.dcf /opt/vitis_ai/compiler/arch/DPUCZDX8G/ULTRA96V2
export ARCH=/opt/vitis_ai/compiler/arch/DPUCZDX8G/ULTRA96V2/arch.json
#export ARCH=/opt/vitis_ai/compiler/arch/DPUCZDX8G/ZCU102/arch.json
# DPU mode - best performance with DPU_MODE = normal
export DPU_MODE=normal
#export DPU_MODE=debug
The above changes will automatically copy the Ultra96 board files into the Vitis-AI compiler's arch folder when the script is sourced inside the TensorFlow workspace. The Ultra96 board files should be in the same folder as this script.
For the output node, in the case of custom models, the simplest approach is to go ahead with quantization, look at the error message and modify the parameter accordingly.
Read on for an example of how to get your output node names.
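Alternatively, as a hedged sketch (assuming the TF1.x-era tf.keras used inside the Vitis-AI docker), the node names can be inspected directly from the .h5 model:
from tensorflow.keras.models import load_model

model = load_model('k_model.h5')
# These names (with the ':0' suffix dropped) are what go into
# INPUT_NODE and OUTPUT_NODE in 0_setenv.sh
print('Input node :', model.inputs[0].name)    # e.g. input_1:0
print('Output node:', model.outputs[0].name)   # e.g. conv2d_23/BiasAdd:0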
Navigate into the files folder of the cloned repo and type:
sh -x docker_run.sh xilinx/vitis-ai:latest
You should see:
Now, locally change the quantize script as shown below, replacing the input function with your own calibration data generation function. This function is basically a simple Python script that returns batches of training data, pre-processed in exactly the same way as for inference.
The function used for this project is attached.
#!/bin/bash
# Copyright 2020 Xilinx Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# quantize
quantize() {
echo "Making calibration images.."
# python tf_gen_images.py \
# --dataset=train \
# --image_dir=${QUANT}/images \
# --calib_list=calib_list.txt \
# --max_images=${CALIB_IMAGES}
# log the quantizer version being used
vai_q_tensorflow --version
# quantize
# vai_q_tensorflow quantize \
# --input_frozen_graph ${FREEZE}/${FROZEN_GRAPH} \
# --input_fn image_input_fn.calib_input \
# --output_dir ${QUANT} \
# --input_nodes ${INPUT_NODE} \
# --output_nodes ${OUTPUT_NODE} \
# --input_shapes ${INPUT_SHAPE} \
# --calib_iter 10
vai_q_tensorflow quantize \
--input_frozen_graph ${FREEZE}/${FROZEN_GRAPH} \
--input_fn datagen.get_calib_data \
--output_dir ${QUANT} \
--input_nodes ${INPUT_NODE} \
--output_nodes ${OUTPUT_NODE} \
--input_shapes ${INPUT_SHAPE} \
--calib_iter 10
}
echo "-----------------------------------------"
echo "QUANTIZE STARTED.."
echo "-----------------------------------------"
rm -rf ${QUANT}
mkdir -p ${QUANT}/images
quantize 2>&1 | tee ${LOG}/${QUANT_LOG}
rm -rf ${QUANT}/images
echo "-----------------------------------------"
echo "QUANTIZE COMPLETED"
echo "-----------------------------------------"
In the above, datagen.py is the Python file and get_calib_data is the function.
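The attached datagen.py is not reproduced here, but as a hedged sketch, a calibration input function for vai_q_tensorflow takes the calibration iteration number and returns a dict mapping the input node name to a batch of pre-processed data; the file path and batch size below are placeholders:
import numpy as np

CALIB_BATCH_SIZE = 32
# Placeholder: pre-processed range images saved as a numpy array of shape (N, 64, 256, 1)
CALIB_DATA = np.load('calib_range_images.npy')

def get_calib_data(iter):
    """Return one batch of calibration data; the dict key must match INPUT_NODE."""
    start = iter * CALIB_BATCH_SIZE
    batch = CALIB_DATA[start:start + CALIB_BATCH_SIZE]
    return {'input_1': batch}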
Type in the following commands and verify outputs:
source ./0_setenv.sh
source ./2_keras2tf.sh
A part of the output should show:
-------------------------------------
keras_2_tf command line arguments:
--keras_json:
--keras_hdf5: ./build/keras_model/k_model.h5
--tf_ckpt : ./build/tf_chkpt/tf_float.ckpt
-------------------------------------
Keras model information:
Input names : [<tf.Tensor 'input_1:0' shape=(?, 64, 256, 1) dtype=float32>]
Output names: [<tf.Tensor 'conv2d_23/BiasAdd:0' shape=(?, 64, 256, 2) dtype=float32>]
-------------------------------------
Checkpoint created : ./build/tf_chkpt/tf_float.ckpt
In the above, we can see that the output node name is : conv2d_23/BiasAdd.
NOTE: DO NOT add the ':0'
Go back to the 0_setenv.sh script and fill in the output node name.
At this stage, a frozen graph named frozen_graph.pb is available in build/freeze.
Now execute the quantization script from within the workspace using:
source ./4_quant.sh
Successful quantization should produce:
We can see part of the calibration images used.
Finally, compile:
source ./6_compile.sh
Successful compilation should produce:
Now you should see the .elf file inside build/compile.
Deployment and application execution on Ultra96v2
Now that we have the model, the next step is to copy it over to the Ultra96 and then write an application that uses it.
Copy the .elf file to any folder of your choice inside the /root folder.
The following figure shows where I placed it:
In the above, I have used VSCode to ssh into the board, which makes application development much simpler.
While there are several ways to design an IoT application for the inference, this project uses a simulated sensor, since it is not practical to get a Velodyne sensor that produces the data.
Therefore, the KITTI test/validation dataset is used again, and the host computer serves as the simulated sensor, sending data to the Ultra96 over FTP for inference.
The overall application architecture is shown below.
A simple TCP client/server socket application is used for synchronization, and a shared folder on the Ultra96 is used for data exchange.
Concretely, the host opens a TCP socket as server and waits for the Ultra96 client to connect.
The client is part of the overall application for inference.
The client then sends a start message over the socket and the host PC copies the first batch of test data to the shared folder on the Ultra96 over FTP.
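A minimal sketch of the host-side handshake is shown below; the port number and message strings are placeholders, not the exact values used in the attached notebook.
import socket

PORT = 5005   # placeholder port number

# Host PC side (server): wait for the Ultra96 client to connect
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('', PORT))
server.listen(1)
conn, addr = server.accept()

# The Ultra96 client sends a start message once its inference application is ready
msg = conn.recv(1024).decode()
if msg == 'start':
    # here the host would copy the next batch of test frames to the
    # shared folder on the Ultra96 over FTP (see the snippet below)
    conn.sendall(b'data_ready')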
The following snippet shows the data transfer.
from ftpretty import ftpretty
import numpy as np
import os
import socket

def place_test_file_in_fpga(local_file_path, host_addr, remote_path):
    """
    Copies the file at local_file_path to remote_path on the U96
    """
    # Supply the credentials (the board's default root/root login is used here)
    f = ftpretty(host_addr, "root", "root")
    # Get a file, save it locally
    # f.get('someremote/file/on/server.txt', '/tmp/localcopy/server.txt')
    # Put a local file to a remote location;
    # non-existent subdirectories will be created automatically
    f.put(local_file_path, remote_path + '/')
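For example, the host side might call it as follows once the start message arrives (the board address and paths are placeholders):
# Placeholder board IP, local test frame and shared folder on the Ultra96
place_test_file_in_fpga('./test_frames/000001.npy', '192.168.2.1', '/root/shared_data')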
The main application is attached as a jupyter notebook.
Some prediction results for detected key points are shown below.
The frame rates are shown below for inference on 50 frames with a batch size of 1.
As we can see, it achieves an impressive frame rate of 109 frames/second.
However, part of this high frame rate is also attributable to the fact that the network used is really small and accuracy was not the key concern here; hence, accuracy was not benchmarked in this project.
The final post-processing, as explained in the first section, is done on the host computer since it is not part of the inference.
Some test files are attached.
Conclusion
It was a very interesting opportunity for me to experiment with the acceleration of deep learning algorithms on hardware. Some of the pre-processing and post-processing could also be implemented as hardware IP blocks, which would make the inference even faster.
I plan to cover this later in another project.
I am sincerely grateful to Hackster, Avnet and Xilinx for giving me the opportunity and support for this project.