At the Competence Center of Intelligent Sensors and Networks (CC ISN) at Lucerne University of Applied Sciences and Arts (HSLU), parts of a design framework for the inference of neural networks on FPGAs have been developed in various research projects. The proprietary framework currently consists of a hardware accelerator called BinArray, which processes binary-approximated CNNs, and an evolutionary search algorithm that reduces the complexity of a given CNN architecture.
Meanwhile, various FPGA manufacturers such as Xilinx and Microchip have launched their own design frameworks for neural network inference development. In this work, the design frameworks Vitis AI (Xilinx) and VectorBlox (Microchip) are compared with the proprietary tools.
General Working Principle
The workflow for developing an FPGA-based deep-learning inference is similar for all design frameworks considered:
- An initial CNN is trained in a deep-learning environment (e.g. TensorFlow).
- In an optional optimization step, the model complexity is reduced (e.g. Xilinx VAI Optimizer).
- During quantization, the real-valued model parameters (usually in 32-bit floating-point representation) are transformed into a low-precision representation such as fixed-point integer, logarithmic or binary representation (e.g. Xilinx VAI Quantizer); a minimal sketch follows this list.
- An optional re-training step attempts to recover the accuracy lost due to quantization.
- The high-level model description is translated into the deployable format (e.g. Xilinx VAI Compiler).
- The inference is run on the target hardware (e.g. using Xilinx VART on the Kria KV260).
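To make the quantization step concrete, the following minimal sketch shows symmetric per-tensor quantization of a float32 weight tensor to 8-bit fixed-point integers (the representation used by the Xilinx DPU). It is an illustration only; production quantizers such as the Vitis AI quantizer additionally calibrate activations and may restrict scales to powers of two.

import numpy as np

def quantize_int8(w):
    # Map the largest weight magnitude to 127 (symmetric quantization).
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct an approximate float32 tensor from the int8 values.
    return q.astype(np.float32) * scale

w = np.random.randn(3, 3, 16, 32).astype(np.float32)  # dummy conv kernel
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))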
Vitis AI
Vitis AI is a Docker-based development environment that enables AI inference on Xilinx Alveo accelerator cards and Xilinx edge devices. This work focuses on the use of Vitis AI for edge devices (such as the Kria KV260).
Currently supported deep-learning frameworks are TensorFlow, Caffe and PyTorch. The Deep-Learning Processing Unit (DPU) is a parametrizable IP core for accelerating neural networks on Xilinx devices. DPUs for different hardware platforms and different performance goals (throughput, latency, scalability, power) are available. This work concentrates on the DPUCZDX8G for Zynq UltraScale+ MPSoCs. The DPUCZDX8G is based on 8-bit fixed-point integer representation. Various configuration options are available to tune resource usage and performance for a specific application.
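On the target, a compiled .xmodel is executed through the VART runtime. The following sketch follows the pattern used in the Vitis AI example applications; the model file name and the int8 buffer types are placeholders that depend on the compiled model.

import numpy as np
import vart, xir

# Load the compiled model and select the DPU subgraph.
graph = xir.Graph.deserialize("customcnn.xmodel")  # placeholder file name
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu = [s for s in subgraphs
       if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

runner = vart.Runner.create_runner(dpu, "run")
in_t = runner.get_input_tensors()[0]
out_t = runner.get_output_tensors()[0]

# Buffers must match the tensor shapes reported by the runner.
inp = [np.zeros(tuple(in_t.dims), dtype=np.int8)]
out = [np.zeros(tuple(out_t.dims), dtype=np.int8)]

job_id = runner.execute_async(inp, out)  # queue one inference job
runner.wait(job_id)                      # block until the DPU is done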
VectorBlox
The Software Development Kit (SDK) VectorBlox and the neural network soft IP core CoreVectorBlox from Microchip enable AI inference acceleration on PolarFire FPGAs. Parts of the tool are based on the OpenVINO toolkit, a free software package for the optimization of deep-learning models developed by Intel.
The equivalent of the Xilinx DPU is the CoreVectorBlox accelerator. The user can choose from three sizes: V250, V500 and V1000.
Proprietary Tools
The idea of the evolutionary algorithm (EA) is to find "better" networks starting from an initial, already "quite good" network. The term "better" must be defined to suit the application: it could mean fewer MAC operations, fewer clock cycles, or even higher accuracy. The EA currently works on high-level model representations in TensorFlow, so it can also be used to optimize a model for inference on Xilinx or Microchip devices.
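A highly simplified, runnable sketch of such a search loop is shown below. The dictionary-based model description, the cost model, the mutation operator and the scoring stub are all hypothetical stand-ins; the actual EA operates on TensorFlow models, and its objective is application-specific.

import copy, random

def train_and_score(candidate):
    # Placeholder: in the real flow each candidate is rebuilt, trained
    # and evaluated in TensorFlow before its fitness is computed.
    return 0.99 - 0.01 * random.random()

def macs(candidate):
    # Hypothetical cost model: per-layer MACs scale with the filter count.
    return sum(l["filters"] * l["cost_per_filter"] for l in candidate["layers"])

def mutate(candidate):
    # Hypothetical mutation: halve the filter count of one random layer.
    child = copy.deepcopy(candidate)
    layer = random.choice(child["layers"])
    layer["filters"] = max(1, layer["filters"] // 2)
    return child

def fitness(candidate):
    # "Better" must be defined for the application: here, high accuracy
    # combined with few MAC operations.
    return train_and_score(candidate) - 1e-8 * macs(candidate)

def evolve(initial, generations=20, pop_size=8):
    population = [initial]
    for _ in range(generations):
        offspring = [mutate(random.choice(population)) for _ in range(pop_size)]
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]

initial = {"layers": [{"filters": 32, "cost_per_filter": 19600},
                      {"filters": 64, "cost_per_filter": 56448}]}
best = evolve(initial)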
In contrast to the commercial accelerators, the proprietary BinArray hardware accelerator relies on a multi-level binary approximation method. Because the binary tensors consist solely of the values -1 and +1, the number of multiplications required per inference can be drastically reduced. The BinArray architecture is fully configurable through three design parameters.
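The idea behind multi-level binary approximation: a weight tensor W is approximated as a weighted sum of binary tensors, W ≈ Σ α_k·B_k with B_k ∈ {-1, +1}, so that dot products reduce to additions and subtractions, and only the few scalar multiplications by α_k remain. The greedy residual scheme below is one common way to compute such an approximation; it is an illustration, not necessarily BinArray's exact algorithm.

import numpy as np

def binary_approx(w, levels=2):
    # Greedily binarize the residual: B_k = sign(residual), with
    # alpha_k = mean(|residual|) as the least-squares optimal scale.
    residual = w.astype(np.float64)
    alphas, binaries = [], []
    for _ in range(levels):
        b = np.where(residual >= 0, 1.0, -1.0)
        alpha = np.mean(np.abs(residual))
        alphas.append(alpha)
        binaries.append(b)
        residual = residual - alpha * b
    return alphas, binaries

w = np.random.randn(64)
alphas, binaries = binary_approx(w, levels=3)
approx = sum(a * b for a, b in zip(alphas, binaries))
print("relative error:", np.linalg.norm(w - approx) / np.linalg.norm(w))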
The criteria for the comparison can be divided into quantitative and qualitative criteria. Quantitative criteria can be measured in some way; qualitative criteria cannot, but they are equally important for the success of a project.
Quantitative Criteria
- Quantized model accuracy
- Estimated throughput
- Resource usage
- Hardware efficiency
Qualitative Criteria
- Installation and setup
- Deep-learning framework interface
- Layer type support
- Quantization options
- Documentation, tutorials and examples
- Target hardware and evaluation kits
- Simulation options
Quantitative evaluation results were obtained by running the same neural network model (called the reference CNN) through all considered frameworks, after which the quantitative criteria were evaluated and compared. The well-known MNIST database of handwritten digits was used to train a CNN with four convolutional and two dense layers according to a post on Kaggle: Introduction to CNN Keras - 0.997 (top 6%). The model with 900’000 parameters and 32 million MAC operations achieved 99.52% accuracy on the test set. Training was done with TensorFlow 1.15 using data augmentation.
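For reference, the following Keras sketch builds a comparable architecture following the cited Kaggle kernel (four convolutional and two dense layers); the exact hyperparameters of the reference CNN may differ.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (5, 5), padding="same", activation="relu",
                  input_shape=(28, 28, 1)),
    layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # on the order of 900'000 parameters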
Accuracy
The trained 32-bit floating-point reference CNN was quantized using Vitis AI, VectorBlox and the proprietary binary approximation algorithm. The accuracy was determined by predicting the 10'000 test images of the MNIST database.
Throughput
Maximum throughput was estimated with:
- Custom throughput estimation tool for the proprietary hardware accelerator
- VBX inference simulator for CoreVectorBlox
- Calculation based on the model operations for the DPUCZDX8G, with a conservative efficiency estimate of 30%: estimate throughput of a model
All estimates assume that each accelerator runs at a clock frequency of 100 MHz; a worked example for the DPU follows below.
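As a worked example, the DPU estimate can be reproduced as follows, assuming that a B4096 DPU delivers a peak of 4096 operations per clock cycle, with one MAC counted as two operations (multiply and add):

peak_ops_per_cycle = 4096   # assumed peak of the B4096 configuration
clock_hz = 100e6            # common 100 MHz assumption for all accelerators
efficiency = 0.30           # conservative efficiency estimate
macs_per_inference = 32e6   # reference CNN: 32 million MACs

ops_per_inference = 2 * macs_per_inference  # one MAC = 2 ops
fps = efficiency * peak_ops_per_cycle * clock_hz / ops_per_inference
print(round(fps), "inferences/s")  # ~1920, consistent with the results below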
Hardware Utilization
Hardware utilization was estimated with:
- Custom hardware utilization tool for the proprietary hardware accelerator
- Numbers given in the CoreVectorBlox datasheet
- Numbers given in the DPUCZDX8G datasheet based on the ZCU102 platform (low RAM and DSP usage, depthwise convolution, average pooling, channel augmentation, and leaky ReLU+ReLU6 features enabled)
Because the number of inputs per LUT differs between Xilinx and Microchip devices, a conversion factor of 2/3 is introduced for CoreVectorBlox LUTs (see the sketch below). RAM was compared in kB. For the custom hardware accelerator, RAM tends to be underestimated, because system buffers are not considered for BinArray.
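A minimal sketch of this normalization (the LUT count in the usage line is a hypothetical placeholder):

def xilinx_equivalent_luts(corevectorblox_luts):
    # PolarFire LUTs have fewer inputs than UltraScale+ LUTs, hence the
    # 2/3 conversion factor introduced above.
    return corevectorblox_luts * 2 / 3

def ram_kb(ram_bits):
    # RAM usage is compared in kB.
    return ram_bits / 8 / 1024

print(xilinx_equivalent_luts(30000))  # hypothetical CoreVectorBlox LUT count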
Evaluation Results
Accuracy
Regarding accuracy, all evaluated design frameworks achieved very good results. In all cases, quantization had only a minimal effect on accuracy.
Throughput
The highest throughput with a single core is achieved with Vitis AI on the DPUCZDX8G hardware accelerator. Using the largest possible DPU size, B4096, theoretically 1900 inferences of the reference CNN can be processed per second. The maximum throughput with a single CoreVectorBlox instance is about half of that, at 996 inferences per second. The lowest throughput was observed with BinArray: with the largest configuration evaluated, the maximum throughput is 120 inferences per second.
Hardware Utilization
The proprietary BinArray hardware accelerator uses significantly fewer hardware resources than the CoreVectorBlox or the Xilinx DPU, especially when considering the MAC units (DSP slices).
Hardware Efficiency
To make the results comparable, the hardware utilization was divided by the theoretical maximum throughput. Putting resource usage and throughput in relation to each other in this way yields a measure of hardware efficiency (see the sketch below). Due to its binary approximation approach, the proprietary accelerator requires a factor of 10 or more fewer MAC units per unit of throughput.
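As a formula, the metric is simply resources per (inference/s); the resource count in the usage line below is a hypothetical placeholder, not a measured result:

def resources_per_fps(resource_count, fps):
    # Hardware efficiency metric: resources spent per (inference/s).
    # Lower is better.
    return resource_count / fps

print(resources_per_fps(64, 120.0))  # hypothetical DSP count at BinArray's 120 fps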
Qualitative Criteria
Xilinx offers the most advanced development environment of the three design frameworks considered. The range of supported target hardware, examples, tutorials and evaluation kits is outstanding. The documentation is consistent and comprehensive. The user also benefits from a large online community. The only disadvantages in terms of qualitative criteria are the lack of a simulator for estimating the performance of the neural network inference prior to hardware deployment, and the complexity of the environment due to its numerous options. The latter is also reflected in the size of the installation, which requires up to 180 GB of free disk space.
Outlook
As shown in the quantitative evaluation, the proprietary approach has important advantages, especially in terms of hardware efficiency. As a next step, the BinArray hardware accelerator will be integrated as a Xilinx XRT-compatible kernel into a platform. The quantitative results are then to be verified in hardware.
Appendix - Run custom MNIST model on Kria KV260
This little tutorial shows how easily a custom model can be run on the Kria KV260 platform. The example is based on the Vitis Design Tutorial for MNIST. The example is adapted to
- start from a model trained elsewhere (not with train.py)
- compile for Kria KV260
First, create and train a model with Keras or TensorFlow and generate a frozen graph (.pb file) of the model; a minimal sketch of this freezing step is shown below.
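The sketch targets TensorFlow 1.x / Keras; file and node names are placeholders that must be adjusted to your model.

import tensorflow as tf
from tensorflow.python.framework import graph_util

tf.keras.backend.set_learning_phase(0)              # build the graph in inference mode
model = tf.keras.models.load_model("mnist_cnn.h5")  # placeholder file name
sess = tf.keras.backend.get_session()

output_node = model.output.op.name                  # e.g. "dense_1/Softmax"
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph_def, [output_node])
tf.io.write_graph(frozen, "build/freeze", "frozen_graph.pb", as_text=False)

Then, follow the steps below to quantize and compile the model for the Kria KV260 reference design (smartcam) with the B3136 DPU: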
1. Copy the generated frozen_graph.pb file into the build/freeze folder
2. Create an arch.json in the arch folder containing the fingerprint for the B3136 architecture:
{
"fingerprint":"0x1000020F6014406"
}
All DPU-related information can easily be checked using the xdputil query command (make sure that the accelerator is loaded):
$ xdputil query
{
"DPU IP Spec":{
"DPU Core Count":2,
"DPU Target Version":"v1.4.1",
"IP version":"v3.3.0",
"generation timestamp":"2021-06-07 19-15-00",
"git commit id":"df4d0c7",
"git commit time":2106071910,
"regmap":"1to1 version"
},
"VAI Version":{
"libvart-runner.so":"Xilinx vart-runner Version: 1.4.0-84798c76e6ebb93300bf384cb56397f214676330 2021-08-24-22:43:13 ",
"libvitis_ai_library-dpu_task.so":"Xilinx vitis_ai_library dpu_task Version: 1.4.0-84798c76e6ebb93300bf384cb56397f214676330 2021-07-27 0 5:41:52 [UTC] ",
"libxir.so":"Xilinx xir Version: xir-84798c76e6ebb93300bf384cb56397f214676330 2021-08-24-20:17:04",
"target_factory":"target-factory.1.4.0 84798c76e6ebb93300bf384cb56397f214676330"
},
"kernels":[
{
"DPU Arch":"DPUCZDX8G_ISA0_B3136_MAX_BG2",
"DPU Frequency (MHz)":300,
"IP Type":"DPU",
"Load Parallel":2,
"Load augmentation":"enable",
"Load minus mean":"disable",
"Save Parallel":2,
"XRT Frequency (MHz)":300,
"cu_addr":"0xa0010000",
"cu_handle":"0xaaaafe95e130",
"cu_idx":0,
"cu_mask":2,
"cu_name":"DPUCZDX8G:DPUCZDX8G_1",
"device_id":0,
"fingerprint":"0x1000020f6014406",
"name":"DPU Core 0"
},
{
"DPU Arch":"",
"cu_addr":"0xa0020000",
"cu_handle":"0xaaaafea47220",
"cu_idx":1,
"cu_mask":1,
"cu_name":"pp_pipeline_accel:pp_pipeline_accel_1",
"device_id":0,
"fingerprint":"0x0",
"name":"DPU Core 1"
}
]
}
3. Use 4_quant.sh to quantize the frozen_graph.pb into the build/quantize folder. Adjust variables in 0_setenv.sh if required.
4. Generate a new file called 6_compile_kv260.sh which compiles against the arch.json. The compiled graph is generated in the build/compile_kv260 folder.
5. Create a new file called 7_make_target_kv260.sh and adjust it accordingly. This will copy the generated model files and the application into a folder build/target_kv260.
6. Copy the target folder onto the Kria KV260. Make sure that the accelerator is properly loaded into the slot with the xmutil loadapp command (refer to the getting started guide). If not yet done, install OpenCV:
$ sudo dnf install python3-opencv-0:4.4.0-r0.0.cortexa72_cortexa53
7. Run app_mt.py in the target folder on the Kria:
$ python3 app_mt.py
Command line options:
--image_dir : images
--threads : 1
--model : model_dir/customcnn.xmodel
Pre-processing 10000 images...
Starting 1 threads...
Throughput=1930.29 fps, total frames = 10000, time=5.1806 seconds
Correct:9953, Wrong:47, Accuracy:0.9953
8. Increase the throughput by about 800 fps by using two threads:
$ python3 app_mt.py -t 2
Command line options:
--image_dir : images
--threads : 2
--model : model_dir/customcnn.xmodel
Pre-processing 10000 images...
Starting 2 threads...
Throughput=2710.00 fps, total frames = 10000, time=3.6900 seconds
Correct:9953, Wrong:47, Accuracy:0.9953