This project was made as a bachelor thesis at the Department of Automatic Control and Robotics, Faculty of Electrical Engineering, Automatics, Computer Science and Biomedical Engineering, AGH University of Science and Technology in Krakow, Poland, under the watchful eye of the Embedded Vision Systems Group. The main objective of this project was to integrate two usually separate components of autonomous vehicle perception, object detection and driveable area segmentation, into a single system using one deep convolutional neural network (DCNN). This system will later be used as part of a mobile robot designed for the FPT FPGA Design Competition, whose objective is to create an autonomous vehicle that uses an FPGA or FPGA System-on-Chip as its computing platform and visual data from a single CMOS/CCD camera mounted on the chassis. Two questions arise:
- How do you make a single convolutional network perform two different tasks?
- How do you make a neural network work on an FPGA?
These questions will be answered in the following tutorial.
Multi-task Neural Networks
So, why use a single neural network to solve multiple tasks when you could use multiple nets in parallel? Well, on bigger devices, like PCs equipped with powerful graphics processing units (GPUs), the advantage may not be as great, but embedded devices are a totally different story. Usually battery powered and with limited access to memory, they require a different approach to neural network modeling. Smaller, less computationally intensive models are preferred, even at the cost of sacrificing accuracy. Most systems, as in the case of autonomous vehicles, have to use models small enough to run in real time due to safety concerns. Some of the more important tasks in vehicle perception systems are object detection and driveable area segmentation. These complex tasks were usually approached separately, but recent years of research have shown promising results in solving them with a single convolutional neural network. Some examples include BlitzNet, DSPNet and, most importantly, YOLOP, which achieved almost 22 fps on an embedded GPU, the NVIDIA Jetson TX2.
How do they do it? They simply follow the time-honored proverb "free two birds with one key". Most semantic segmentation or detection models use well-known feature extractor architectures like ResNets, VGG or MobileNet as their backbones. So why not combine them into a single backbone? This drastically lowers the number of trainable model parameters and the amount of computation needed. The extracted feature maps are then propagated through a single joined head or through separate segmentation and detection heads. This simple yet effective strategy allows these models to achieve inference times comparable with state-of-the-art models like YOLO or SSD.
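To make the shared-backbone idea more concrete, here is a minimal PyTorch sketch: one ResNet18 feature extractor feeding both a small YOLO-style detection head and a small segmentation head. The class, head shapes and output sizes are illustrative assumptions, not the exact architecture used later in this project.
import torch
import torch.nn as nn
import torchvision

class MultiTaskNet(nn.Module):
    """Illustrative shared-backbone network: one feature extractor, two task heads."""
    def __init__(self, num_det_outputs=18, num_seg_classes=2):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=False)
        # shared backbone: everything up to (but not including) the classifier
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # [N, 512, H/32, W/32]
        # detection head: predicts a grid of box/class activations (YOLO-style)
        self.det_head = nn.Conv2d(512, num_det_outputs, kernel_size=1)
        # segmentation head: coarse per-pixel class scores, upsampled to the input size
        self.seg_head = nn.Sequential(
            nn.Conv2d(512, num_seg_classes, kernel_size=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        features = self.backbone(x)          # computed once, reused by both heads
        return self.det_head(features), self.seg_head(features)

# quick shape check
model = MultiTaskNet()
det, seg = model(torch.randn(1, 3, 416, 416))
print(det.shape, seg.shape)  # torch.Size([1, 18, 13, 13]) torch.Size([1, 2, 416, 416])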
Neural Networks on FPGAs
It can be said that we are living in the golden age of deep learning. Starting with AlexNet in 2012, deep learning models have been used in a vast range of tasks, from simple image classification to playing complex games like StarCraft. This, however, comes at a cost. Complicated tasks usually require larger models, which require more memory and computational power to train and run. This, in turn, increases electrical power consumption, harming our already fragile environment. This issue was summed up perfectly in the article "Deep Learning's Diminishing Returns", published in IEEE Spectrum's special report "The Great AI Reckoning", where the authors state:
We must either adapt how we do deep learning or face a future of much slower progress.
One of the ways in which we could fight this growing problem is to use more power-efficient hardware. A great candidate for this role is the Field Programmable Gate Array, or FPGA for short. FPGAs are reconfigurable, which means you can program any number of digital circuits into the FPGA's logic fabric. This allows the programmer to reduce the number of components used for their chosen computational task, maximizing throughput while minimizing power consumption by using only the components that are really needed. FPGAs use only a few watts of power for tasks that would take traditional CPUs and GPUs dozens. However, using them used to require knowledge of Hardware Description Languages (HDL), which was a big hurdle for most "traditional" programmers. Yet, thanks to the growing interest in FPGA hardware acceleration, we are fortunate enough to have High Level Synthesis (HLS) tools, like Vitis, at our disposal. They allow the user to synthesize architectures for FPGAs using well-known higher-level programming languages like C, C++ or even Python. This allows us to almost entirely "skip" the HDL part of FPGA development.
Step 1: Set Up the Kria KV260 Vision AI Starter Kit
Start by setting up the Kria KV260 Vision AI Starter Kit according to the official Getting Started Guide. It is important not to use the PetaLinux image available through the link in the guide, but the one from the official Vitis-AI git repository. This is due to a version conflict (2021.1 vs. 2020.2): at the time, models compiled in the Vitis AI docker would only run on the older version of the system. After booting up, follow the Target Setup Guide in the Vitis AI repository. You should also do the steps marked as "Optional", to make sure the Vitis AI library is installed properly.
You may also try running a few examples from the Model Zoo, also linked in the previous guide. They show a range of possible device use cases. The GUI built into PetaLinux is simple yet effective, so feel free to explore and get acquainted with the device.
Step 2: Set up the host computer
If you're using Windows, I recommend using a virtual Linux machine (for example with VMware) with Ubuntu for a smoother first experience. If using Linux, follow the instructions in the Getting Started Guide available in the Vitis AI repository. I used the CPU container, but using the GPU container is equivalent. Before running the docker, download the needed repos and set up the cross-compiler for PetaLinux 2020.2.
Pull Vitis AI repo:
git clone --recurse-submodules https://github.com/Xilinx/Vitis-AI
Install cross compiler for PetaLinux 2020.2:
cd Vitis-AI/setup/mpsoc/VART
./host_cross_compiler_setup_2020.2.sh
cd ~/
Pull Vitis AI Tutorials repo:
git clone https://github.com/Xilinx/Vitis-AI-Tutorials
Pull the Vitis AI docker container:
docker pull xilinx/vitis-ai-cpu:latest
There might be an error later, during compilation of the model, so run this to prevent it:
cd Vitis-AI-Tutorials/Design_Tutorials/09-mnist_pyt/files/build/logs
touch compile_kv260.log
cd ~/
Step 3: Compile an example
To set up a suitable pipeline for model quantization and compilation, we should start with a simple MNIST classifier example that can be quickly trained even on a CPU. I decided to use PyTorch as my Python neural network library of choice. I used the tutorial "Pytorch Flow for Vitis AI" available in the Vitis-AI-Tutorials git repo. First, we need to add a few lines of code to some of the files:
- compile.sh (Add compiler option)
elif [ $1 = kv260 ]; then
ARCH=/opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json
TARGET=kv260
echo "-----------------------------------------"
echo "COMPILING MODEL FOR KV260.."
echo "-----------------------------------------"
- target.py (Change default arguments in main)
ap.add_argument('-t', '--target', type=str, default='kv260', choices=['zcu102','zcu104','u50','vck190','kv260'], help='Target board type (zcu102,zcu104,u50,vck190,kv260). Default is kv260')
- run_all.sh (Add the compile and target for KV260)
source compile.sh kv260 ${BUILD} ${LOG}
python -u target.py --target kv260 -d ${BUILD} 2>&1 | tee ${LOG}/target_kv260.log
Now we compile the model:
- Start Vitis AI docker container
cd Vitis-AI-Tutorials/Design_Tutorials/09-mnist_pyt/files
sudo ./docker_run.sh xilinx/vitis-ai-cpu:latest
- Activate the Vitis AI PyTorch conda environment
conda activate vitis-ai-pytorch
- Run all scripts
source run_all.sh
Step 4: Send the app to the board
To run the app, you first need to send it to the board. Run the ifconfig command in the PetaLinux terminal and check the board's IP address. Now connect your computer to the same network and use scp to send the generated app to the device.
cd Vitis-AI-Tutorials/Design_Tutorials/09-mnist_pyt/files/build
scp -r ./target_kv260 root@{TARGET_IP}:~/
Step 5: Run the app!
After sending the files to the Kria KV260 Vision AI Starter Kit, you can execute the app by simply running:
cd target_kv260
python3 app_mt.py -m CNN_kv260.xmodel
on the board. If everything went right, you should now see the app being executed. It should look something like this:
Step 6: Create your own model
Now that the pipeline works, you can concentrate on creating your own model. This step differs from person to person, but some tips are universal:
- the files you need for quantization and compilation are: a .py script containing the model class and a .pth file with the model's weights.
- keep in mind that the Vitis AI docker uses PyTorch 1.4. Not a big deal, but some functions are not available compared to newer versions. Most importantly, PyTorch 1.6 switched to a new zipfile-based serialization format for .pth files that older versions cannot load, so remember to set the _use_new_zipfile_serialization flag to False if saving in newer versions.
- the model should output a tuple of tensors. They will be converted to a tuple of numpy arrays after compilation.
- if training on a GPU and compiling on a CPU, remember to move the model to the CPU before saving the weights.
- to be sure that your model will pass quantization and compilation, pass it through the torch.jit.trace() function while using the Vitis AI docker. If it runs without errors, the model should pass compilation (a short sketch of this check, together with the weight-saving tip, follows this list).
- remember to keep a keen eye on the Vitis AI documentation. The supplied table of supported layers will help you decide which ones to use in your network.
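As a quick illustration of the saving and tracing tips, here is a minimal sketch. It uses a stand-in model and a placeholder file name; replace them with your own model class and checkpoint path.
import torch
import torch.nn as nn

# stand-in for your trained network (e.g. the CNN class kept in common.py)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 2, 1))
model = model.cpu().eval()  # move to the CPU before saving if trained on a GPU

# PyTorch >= 1.6 defaults to a zipfile format that PyTorch 1.4 cannot read,
# so disable it explicitly when saving from a newer version
torch.save(model.state_dict(), "my_checkpoint.pth",
           _use_new_zipfile_serialization=False)

# quick sanity check: if tracing succeeds, the model is likely to quantize and compile
dummy = torch.randn(1, 3, 416, 416)
traced = torch.jit.trace(model, dummy)
print("trace OK, output shape:", traced(dummy).shape)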
In my case, I created a simultaneous segmentation and detection network, based on the U-Net model with ResNet18 as the backbone. The detection branch was based on YOLOv3. It was trained with the help of CUDA AMP (Automatic Mixed Precision) on a subset of BDD100K. It had 5 output tensors: three for the detection branch and two for the driveable area segmentation branch (road and lane line).
Step 7: Compile your own model
Now, with the help of our PyTorch quantization and compilation pipeline, we can easily compile the model. You just need to change three key files:
- common.py - here you enter your model class and any helper functions for evaluation or training.
- run_all.sh - we comment out the lines that start model training
# run training
# python -u train.py -d ${BUILD} 2>&1 | tee ${LOG}/train.log
- quantize.py - we need to slightly change the quantize function.
def quantize(build_dir, quant_mode, batchsize):

    dset_dir = build_dir + '/dataset'
    float_model = build_dir + '/float_model'
    quant_model = build_dir + '/quant_model'

    # use GPU if available
    if (torch.cuda.device_count() > 0):
        print('You have', torch.cuda.device_count(), 'CUDA devices available')
        for i in range(torch.cuda.device_count()):
            print(' Device', str(i), ': ', torch.cuda.get_device_name(i))
        print('Selecting device 0..')
        device = torch.device('cuda:0')
    else:
        print('No CUDA devices available..selecting CPU')
        device = torch.device('cpu')

    # random input tensor matching the network's input shape (3 x 416 x 416 per image)
    rand_in = torch.randn([batchsize, 3, 416, 416]).to("cpu")

    # load trained model
    # (when compiling in the CPU docker, device is 'cpu', matching rand_in above)
    model = CNN(torchvision.models.resnet18, pretrained=True)
    model = model.to(device)
    model.load_state_dict(torch.load(os.path.join(float_model, 'my_checkpoint_old.pth.tar')))

    # FIX: pass a random tensor through the network once before quantization,
    # otherwise the fixed-point sizes in quant_info.json are not adjusted properly
    rand_out = model(rand_in)

    # force to merge BN with CONV for better quantization accuracy
    optimize = 1

    # override batchsize if in test mode
    if (quant_mode == 'test'):
        batchsize = 1

    # trace test - if this fails, the quantizer will fail too
    trace_mod = torch.jit.trace(model, rand_in)

    # quantize
    quantizer = torch_quantizer(quant_mode, model, rand_in, output_dir=quant_model, device=torch.device("cpu"))
    quantized_model = quantizer.quant_model
    rand_out = quantized_model(rand_in)

    # HERE YOU ADD ANY EVALUATION FUNCTIONS

    # export config
    if quant_mode == 'calib':
        quantizer.export_quant_config()
    if quant_mode == 'test':
        quantizer.export_xmodel(deploy_check=False, output_dir=quant_model)

    return
An important note: there is a known issue with quantization of the model after loading weights from a file. The fixed-point representation sizes won't be adjusted properly, leaving a lot of [8, None] values in the quant_info.json file. The way to avoid this is to pass a random input tensor through the network before quantization (marked with the "FIX" comment in the code above).
Now, just one more step before running the whole pipeline.
Step 8: Change the target app
Now that the pipeline is ready, the next step is to change the app that will be run on the device (found under the name ~/my_app.py). This app preprocesses input images, creates input and output buffers and runs the DPU (Deep Learning Processing Unit) on multiple threads. In my case, I changed the following functions:
- preprocess_fn - change the way input images are processed before saving them into the input buffer (a sketch of what this can look like follows this list)
- run_DPU - here you will make the most changes, mostly regarding output buffer sizes
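For reference, below is a hedged sketch of a preprocess_fn for a 416x416 RGB input. The resize, channel order, 1/255 scaling and the fix_scale argument (assumed to be 2**fix_point of the DPU input tensor) are assumptions that must match your own training pipeline.
import cv2
import numpy as np

def preprocess_fn(image_path, fix_scale):
    # read, resize and convert the image to the layout the network was trained on
    image = cv2.imread(image_path)
    image = cv2.resize(image, (416, 416))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # scale pixels to [0, 1], then into the DPU's int8 fixed-point input range
    image = image * (1.0 / 255.0) * fix_scale
    return image.astype(np.int8)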
First, I added output scaling for all of the outputs:
'''get tensor'''
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()
input_ndim = tuple(inputTensors[0].dims)
output_ndim = tuple(outputTensors[0].dims)
# we can avoid output scaling if use argmax instead of softmax
#output_fixpos = outputTensors[0].get_attr("fix_point")
#output_scale = 1 / (2**output_fixpos)
to
'''get tensor'''
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()
input_ndim = tuple(inputTensors[0].dims)
output_ndim1 = tuple(outputTensors[0].dims)
output_ndim2 = tuple(outputTensors[1].dims)
output_ndim3 = tuple(outputTensors[2].dims)
output_ndim4 = tuple(outputTensors[3].dims)
output_ndim5 = tuple(outputTensors[4].dims)
# we can avoid output scaling if use argmax instead of softmax
output_fixpos1 = outputTensors[0].get_attr("fix_point")
output_scale1 = 1 / (2**output_fixpos1)
output_fixpos2 = outputTensors[1].get_attr("fix_point")
output_scale2 = 1 / (2**output_fixpos2)
output_fixpos3 = outputTensors[2].get_attr("fix_point")
output_scale3 = 1 / (2**output_fixpos3)
output_fixpos4 = outputTensors[3].get_attr("fix_point")
output_scale4 = 1 / (2**output_fixpos4)
output_fixpos5 = outputTensors[4].get_attr("fix_point")
output_scale5 = 1 / (2**output_fixpos5)
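The scaling above follows from the DPU's fixed-point output format: an int8 activation with fix_point f represents the value int8 * 2**(-f). A tiny worked example (the numbers are made up):
fix_point = 5                       # as reported by outputTensors[i].get_attr("fix_point")
output_scale = 1 / (2 ** fix_point) # = 0.03125
raw_value = 40                      # an int8 value read from the output buffer
print(raw_value * output_scale)     # 1.25 -> the corresponding floating-point activation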
Then I changed the outputData buffer size:
ids_max = 10
outputData = []
for i in range(ids_max):
    outputData.append([np.empty(output_ndim, dtype=np.int8, order="C")])
to
ids_max = 1
outputData = []
for i in range(ids_max):
    outputData.append([np.empty(output_ndim1, dtype=np.int8),
                       np.empty(output_ndim2, dtype=np.int8),
                       np.empty(output_ndim3, dtype=np.int8),
                       np.empty(output_ndim4, dtype=np.int8),
                       np.empty(output_ndim5, dtype=np.int8)])
And lastly, changed the way the outputs are retrieved from the DPU:
out_q[write_index] = np.argmax(outputData[index][0][j])
to
out_q1[write_index] = outputData[0][0] * output_scale1
out_q2[write_index] = outputData[0][1] * output_scale2
out_q3[write_index] = outputData[0][2] * output_scale3
out_q4[write_index] = outputData[0][3] * output_scale4
out_q5[write_index] = outputData[0][4] * output_scale5
- app - here the whole app is run; you add any post-processing here
I added new global output buffers, one for every neural network output, as well as some post-processing and saving (a minimal sketch of the post-processing follows the snippet below).
listimage=os.listdir(image_dir)
runTotal = len(listimage)
global out_q1
global out_q2
global out_q3
global out_q4
global out_q5
out_q1 = [None] * runTotal
out_q2 = [None] * runTotal
out_q3 = [None] * runTotal
out_q4 = [None] * runTotal
out_q5 = [None] * runTotal
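As an example of the kind of post-processing that can follow, here is a minimal sketch that turns one of the segmentation outputs into a mask image and saves it. Which out_q buffer holds which branch, and the exact output layout, depend on your model; the index and shapes below are assumptions.
import cv2
import numpy as np

# minimal post-processing sketch, run after all DPU threads have finished:
# treat out_q4 as one of the segmentation outputs (adjust the index to your model)
for i in range(runTotal):
    seg_out = np.squeeze(out_q4[i])                 # assumed shape: [H, W, num_classes]
    mask = np.argmax(seg_out, axis=-1)              # per-pixel class index
    mask_img = (mask * (255 // max(int(mask.max()), 1))).astype(np.uint8)
    cv2.imwrite('mask_' + listimage[i], mask_img)   # save under the input file name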
Step 10: Run your own neural network!
Now, send your app to the device as before and run it with the same command. Congratulations! The output should look something like this:
A comparison of the model's processing speed was made on three different devices.
As you may notice, there is a threefold improvement in computation speed between the NVidia GeForce GTX 1050 and the Kria KV260 Vision AI Starter Kit! This is quite interesting, considering that the Kria is a simple, power-efficient embedded FPGA device and the GPU is a standard PC graphics card. This came at a cost, however. Let's look at some output examples:
A comparison of detection and segmentation accuracy was also made:
The speedup was achieved at the cost of accuracy. The quantization process affected the detection branch more, probably because its output values were much closer to zero than those of the segmentation branch. This decrease in accuracy can be mitigated with the help of Quantization-Aware Training, which allows model parameters to be adjusted during the quantization process.
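For context, quantization-aware training inserts fake-quantization operations into the network so that the weights learn to compensate for quantization error during fine-tuning. Below is a conceptual sketch using PyTorch's eager-mode QAT API; this is only an illustration of the idea, not the Vitis AI QAT flow, which uses its own tooling.
import torch
import torch.nn as nn

# conceptual QAT sketch (eager mode): quant/dequant stubs mark where
# fake quantization is simulated so training can adapt to it
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
qat_model = torch.quantization.prepare_qat(model)

# ... run a few fine-tuning epochs here with quantization simulated ...

quantized = torch.quantization.convert(qat_model.eval())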
What comes next?
How is this project being developed further? Well:
- More models are being explored. The experience gained during this project is being used and shared with multiple other projects at my university, mainly in research on autonomous vehicles, autonomous drones and dynamic vision sensors.
- The simultaneous segmentation and detection neural network model is still being developed. The main goal is to further lower its size (currently 30 million parameters) and increase detection accuracy. The option to add a third depth estimation branch is also being considered.
- An autonomous vehicle for the FPT FPGA Design Competition is currently being developed. It will use the perception system discussed here for navigation. The vehicle will use mecanum wheels for improved mobility. This project has been awarded a research grant for Student Science Clubs by the AGH University of Science and Technology.