We built an ONNX environment on the KR260 and ran ONNX Runtime.
We also introduce the use of the Vitis AI ONNX Runtime Engine (VOE).
This project is part of a subproject for the AMD Pervasive AI Developer Contest.
Be sure to check out the other projects as well.
**The main project is currently under submission.**
0. Main project << under submission
2. PYNQ + PWM(DC-Motor Control)
3. Object Detection(Yolo) with DPU-PYNQ
4. Implementation DPU, GPIO, and PWM
6. GStreamer + OpenCV with 360°Camera
7. 360 Live Streaming + Object Detect(DPU)
8. ROS2 3D Marker from 360 Live Streaming
9. Control 360° Object Detection Robot Car
10. Improve Object Detection Speed with YOLOX
11. Benchmark Architectures of the DPU
12. Power Consumption of 360° Object Detection Robot Car
13. Application to Vitis AI ONNXRuntime Engine (VOE) << this project
14. Appendix: Object Detection Using YOLOX with a Webcam
*In this subproject, we will conduct tests in a different environment from the main project as part of the benchmarking process.
Introduction
We experimented with running YOLOX using ONNX on the DPU from Python.
This project involved setting up a dedicated ONNX environment on the KR260. We also introduce the use of the Vitis AI ONNX Runtime Engine (VOE).
Using ONNX Runtime makes Vitis AI even more user-friendly. This time, we compare CPU and DPU inference using ONNX.
The ONNX and DPU (KR260) test video is shown below:
A comparison of CPU and DPU with ONNX is also shown in the video below:
Vitis AI 3.5 ONNX
Vitis AI 3.5 ONNX supports both C++ and Python. Refer to the official documentation here:
https://docs.amd.com/r/en-US/ug1414-vitis-ai/Programming-with-VOE
For YOLOX in C++, sample code is provided officially here:
https://github.com/Xilinx/Vitis-AI/tree/v3.5/examples/vai_library/samples_onnx/yolovx_nano
In this subproject, we will write and test YOLOX in Python code.
Building the Environment with PetaLinux from BSP
In this subproject, we used PetaLinux to create the OS environment on the KR260. Download the BSP file from the link below and build it with PetaLinux:
source /opt/petalinux/2023.1/settings.sh
petalinux-create -t project -s xilinx-kr260-starterkit-v2023.1-05080224.bsp
cd xilinx-kr260-starterkit-2023.1/
petalinux-build
petalinux-package --boot --u-boot --force
petalinux-package --wic --images-dir images/linux/ --bootfiles "ramdisk.cpio.gz.u-boot,boot.scr,Image,system.dtb,system-zynqmp-sck-kr-g-revB.dtb" --disk-name "sda"
Note:
PetaLinux is used because onnxruntime could not be installed in the Ubuntu environment.
Writing the Image to the SD Card
Write the SD card image (.wic) created with PetaLinux, found in the following folder:
~/xilinx-kr260-starterkit-2023.1/images/linux/
We use balenaEtcher to write it to the SD card.
The initial login name for the KR260 is "petalinux". Follow the official documentation to install Vitis AI and ONNX Runtime on the KR260:
https://docs.amd.com/r/en-US/ug1414-vitis-ai/Programming-with-VOE
wget https://www.xilinx.com/bin/public/openDownload?filename=vitis_ai_2023.1-r3.5.0.tar.gz
sudo tar -xzvf openDownload\?filename\=vitis_ai_2023.1-r3.5.0.tar.gz -C /
ls
wget https://www.xilinx.com/bin/public/openDownload?filename=voe-0.1.0-py3-none-any.whl -O voe-0.1.0-py3-none-any.whl
pip3 install voe*.whl
wget https://www.xilinx.com/bin/public/openDownload?filename=onnxruntime_vitisai-1.16.0-py3-none-any.whl -O onnxruntime_vitisai-1.16.0-py3-none-any.whl
pip3 install onnxruntime_vitisai*.whl
Installing xrt and packagegroup-petalinux-opencv
When installed according to the official documentation, importing onnxruntime in Python resulted in an error:
xilinx-kr260-starterkit-20231:~$ python3
Python 3.10.6 (main, Aug 1 2022, 20:38:21) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnxruntime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/petalinux/.local/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/home/petalinux/.local/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import ExecutionMode # noqa: F401
File "/home/petalinux/.local/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
ImportError: libglog.so.0: cannot open shared object file: No such file or directory
It seemed that xrt was not installed, so we installed the missing packages:
sudo dnf install xrt packagegroup-petalinux-opencv
After executing the above, we were able to import onnxruntime without any issues:
xilinx-kr260-starterkit-20231:~$ python3
Python 3.10.6 (main, Aug 1 2022, 20:38:21) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnxruntime
>>> exit()
Creating the DPU Environment with PetaLinux
In Ubuntu, the FPGA's DPU environment could be overlaid via the PYNQ-DPU library.
In PetaLinux, we used xmutil to prepare the DPU environment.
We prepared the necessary files, including those needed for xmutil, and placed them on GitHub.
https://github.com/iotengineer22/AMD-Pervasive-AI-Developer-Contest/tree/main/src/onnx-test
Creating the Device Tree (pl.dtbo)
Using the FPGA design file (.xsa) created with Vivado, we executed the following commands with the Vitis command-line tool (xsct) to create the "pl.dtbo" device tree overlay file:
xsct
createdts -hw amd_contest_design.xsa -zocl -platform-name mydevice -git-branch xlnx_rel_v2023.1 -overlay -compile -out mydevice
exit
dtc -@ -O dtb -o mydevice/mydevice/mydevice/psu_cortexa53_0/device_tree_domain/bsp/pl.dtbo mydevice/mydevice/mydevice/psu_cortexa53_0/device_tree_domain/bsp/pl.dtsi
mkdir dtg_output
cp mydevice/mydevice/mydevice/psu_cortexa53_0/device_tree_domain/bsp/pl.dtbo dtg_output
We used the .xsa file created in the project below.
4. Implementation DPU, GPIO, and PWM
*Chapter: Creating the DPU Project in Vivado
Creating the shell.json
We also created the "shell.json" file, which specifies the XRT_FLAT shell information required when loading the DPU:
echo '{' > shell.json
echo ' "shell_type" : "XRT_FLAT",' >> shell.json
echo ' "num_slots": "1"' >> shell.json
echo '}' >> shell.json
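For reference, the resulting shell.json contains:
{
 "shell_type" : "XRT_FLAT",
 "num_slots": "1"
}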
Using the DPU Bitstream File (dpu.xclbin)
We used the DPU bitstream file (dpu.xclbin) created during the synthesis in Vitis.
4. Implementation DPU, GPIO, and PWM
*Chapter: Running the DPU on KR260 with PYNQ
This DPU uses the B4096 architecture at 300 MHz. The file is found in the "Hardware" folder inside the Vitis project's "hw_link" folder.
Creating the vart.conf
We also created the vart.conf, which is the configuration file for Vitis AI Runtime (VART):
echo 'firmware: /home/petalinux/onnx-test/dpu.xclbin' > vart.conf
This file specifies the location where dpu.xclbin is placed.
The default path was [/run/media/mmcblk0p1/dpu.xclbin].
For testing purposes, we specified a temporary folder.
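For reference, a vart.conf pointing at the default location would consist of this single line:
firmware: /run/media/mmcblk0p1/dpu.xclbin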
Preparing the ONNX YOLOX Model
This time, we used the models provided by Vitis AI. Pre-trained and quantized models are provided as samples by Xilinx (AMD), converted from PyTorch models to ONNX:
https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo/model-list/pt_yolox-nano_3.5
Download and unzip the YOLOX sample model:
wget https://www.xilinx.com/bin/public/openDownload?filename=pt_yolox-nano_3.5.zip
unzip openDownload\?filename\=pt_yolox-nano_3.5.zip
The "yolox_nano_onnx_pt.onnx" file is in the "quantized" folder. Transfer this file to the KR260 without further compilation.
The actual program we ran is available on GitHub below.
This program is executed on the KR260.
First, we import onnxruntime and create a session to enable inference with ONNX.
Below is a portion of the program.
import onnxruntime
session = onnxruntime.InferenceSession(
    'yolox_nano_onnx_pt.onnx',
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file":"/usr/bin/vaip_config.json"}])
In the official instructions, the config file was specified as shown below, but we modified it for the KR260's configuration.
provider_options=[{"config_file":"/etc/vaip_config.json"}]
Additionally, since the model was directly converted from PyTorch to ONNX, there is no significant speedup from the conversion itself.
We kept the sigmoid and softmax processes, just as in PyTorch.
Moreover, because it is ONNX, a transpose (channel swap) step is also included, which makes the pre- and post-processing slower than in PyTorch.
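To make the flow concrete, here is a minimal sketch of running one image through the session (the image file name is illustrative; the full decoding logic lives in the program on GitHub):
import cv2
import numpy as np

# Pre-process: resize to the 416x416 model input, then swap HWC -> NCHW
img = cv2.imread('test.jpg')
resized = cv2.resize(img, (416, 416))
blob = resized.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)

# Inference: the Vitis AI execution provider dispatches the DPU subgraph
input_name = session.get_inputs()[0].name  # 'YOLOX::input_0'
outputs = session.run(None, {input_name: blob})

# Post-process: decode the three 85-channel feature maps (52x52, 26x26,
# 13x13) with grid offsets, sigmoid, and NMS, just as in the PyTorch code
print([o.shape for o in outputs])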
Testing ONNX on KR260
We transferred the files we created to the KR260.
First, we configured the DPU so it can be loaded with xmutil, creating an application called b4096_300m.
ls /lib/firmware/xilinx/
sudo mkdir /lib/firmware/xilinx/b4096_300m
sudo cp pl.dtbo shell.json /lib/firmware/xilinx/b4096_300m/
sudo cp dpu.xclbin /lib/firmware/xilinx/b4096_300m/binary_container_1.bin
ls /lib/firmware/xilinx/b4096_300m/
We also replaced the existing vart.conf with the newly created one.
sudo mv /etc/vart.conf /etc/old_vart.conf
sudo cp vart.conf /etc/
sudo reboot
From here, the steps follow the flow presented in the demo video mentioned at the beginning.
First, load the DPU application (b4096_300m).
xilinx-kr260-starterkit-20231:~$ sudo xmutil listapps
xilinx-kr260-starterkit-20231:~$ sudo xmutil unloadapp
xilinx-kr260-starterkit-20231:~$ sudo xmutil loadapp b4096_300m
xilinx-kr260-starterkit-20231:~$ sudo xmutil listapps
We run a Python program (onnx-yolox.py) in the onnx-test directory on the KR260.
xilinx-kr260-starterkit-20231:~$ cd onnx-test/
xilinx-kr260-starterkit-20231:~/onnx-test$ python3 onnx-yolox.py
When the program first runs, it compiles the model to match the DPU specifications.
Note: The initial compilation takes a few minutes. Subsequent runs will skip the compilation step, making the process faster.
During the compilation, the log shows that the program reads the loaded DPU and compiles in DPU mode:
Compile mode: dpu
Debug mode: performance
Target architecture: DPUCZDX8G_ISA1_B4096_0101000016010407
Graph name: torch_jit, with op num: 815
Begin to compile...
Once the compilation is complete, the Python program will execute.
Here is an example of the actual log output:
yolox_nano_test, in ONNX
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240707 00:25:31.234889 1121 vitisai_compile_model.cpp:242] Vitis AI EP Load ONNX Model Success
I20240707 00:25:31.234995 1121 vitisai_compile_model.cpp:243] Graph Input Node Name/Shape (1)
I20240707 00:25:31.235031 1121 vitisai_compile_model.cpp:247] YOLOX::input_0 : [-1x3x416x416]
I20240707 00:25:31.235271 1121 vitisai_compile_model.cpp:253] Graph Output Node Name/Shape (3)
I20240707 00:25:31.235306 1121 vitisai_compile_model.cpp:257] 2043 : [-1x85x52x52]
I20240707 00:25:31.235337 1121 vitisai_compile_model.cpp:257] 2269 : [-1x85x26x26]
I20240707 00:25:31.235366 1121 vitisai_compile_model.cpp:257] 2495 : [-1x85x13x13]
I20240707 00:25:32.170728 1121 pass_imp.cpp:379] save const info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/const_info_before_const_folding.txt"
I20240707 00:25:32.848812 1121 pass_imp.cpp:288] save fix info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/fix_info.txt"
I20240707 00:25:32.849267 1121 pass_imp.cpp:379] save const info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/const_info_after_const_folding.txt"
I20240707 00:25:32.851840 1121 pass_imp.cpp:406] save const info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/const.bin"
I20240707 00:25:44.541139 1121 compile_pass_manager.cpp:352] [UNILOG][INFO] Compile mode: dpu
I20240707 00:25:44.541229 1121 compile_pass_manager.cpp:353] [UNILOG][INFO] Debug mode: performance
I20240707 00:25:44.541294 1121 compile_pass_manager.cpp:357] [UNILOG][INFO] Target architecture: DPUCZDX8G_ISA1_B4096_0101000016010407
I20240707 00:25:44.549165 1121 compile_pass_manager.cpp:465] [UNILOG][INFO] Graph name: torch_jit, with op num: 815
I20240707 00:25:44.549206 1121 compile_pass_manager.cpp:478] [UNILOG][INFO] Begin to compile...
W20240707 00:27:51.125735 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = onnx::Conv_232_vaip_1, type = transpose} has been assigned to CPU.
W20240707 00:27:51.551687 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = 2495, type = transpose} has been assigned to CPU.
W20240707 00:27:51.610715 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = 2269, type = transpose} has been assigned to CPU.
W20240707 00:27:51.669713 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = 2043, type = transpose} has been assigned to CPU.
I20240707 00:28:25.202344 1121 compile_pass_manager.cpp:489] [UNILOG][INFO] Total device subgraph number 6, DPU subgraph number 1
I20240707 00:28:25.202572 1121 compile_pass_manager.cpp:504] [UNILOG][INFO] Compile done.
I20240707 00:28:25.395335 1121 anchor_point.cpp:423] before optimization:
onnx::Conv_232_vaip_1_fix <-- transpose@layoutransform --
onnx::DequantizeLinear_229 <-- identity@fuse_DPU --
onnx::DequantizeLinear_229
after optimization:
onnx::Conv_232_vaip_1_fix <-- transpose@layoutransform --
onnx::DequantizeLinear_229
I20240707 00:28:25.395582 1121 anchor_point.cpp:423] before optimization:
2495_vaip_664 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2492 <-- identity@fuse_DPU --
onnx::DequantizeLinear_2492
after optimization:
2495_vaip_664 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2492
I20240707 00:28:25.395682 1121 anchor_point.cpp:423] before optimization:
2269_vaip_697 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2266 <-- identity@fuse_DPU --
onnx::DequantizeLinear_2266
after optimization:
2269_vaip_697 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2266
I20240707 00:28:25.395769 1121 anchor_point.cpp:423] before optimization:
2043_vaip_730 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2040 <-- identity@fuse_DPU --
onnx::DequantizeLinear_2040
after optimization:
2043_vaip_730 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2040
2024-07-07 00:28:26.281134980 [W:onnxruntime:, session_state.cc:1169 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-07 00:28:26.281221749 [W:onnxruntime:, session_state.cc:1171 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I20240707 00:28:26.707309 1121 custom_op.cpp:141] Vitis AI EP running 723 Nodes
['YOLOX::input_0_dynamic_axes_1', 3, 416, 416]
YOLOX::input_0
tensor(float)
['DequantizeLinear2043_dim_0', 85, 52, 52]
['DequantizeLinear2269_dim_0', 85, 26, 26]
['DequantizeLinear2495_dim_0', 85, 13, 13]
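The input and output names, types, and shapes shown above can be queried directly from the ONNX Runtime session; a minimal sketch:
# List the model's input and output tensors from the session
for inp in session.get_inputs():
    print(inp.shape)  # ['YOLOX::input_0_dynamic_axes_1', 3, 416, 416]
    print(inp.name)   # YOLOX::input_0
    print(inp.type)   # tensor(float)
for out in session.get_outputs():
    print(out.shape)  # e.g. ['DequantizeLinear2043_dim_0', 85, 52, 52]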
We used YOLOX for object detection on a single image.
The orange ball was detected without any issues (class ID 49 corresponds to "orange" in the COCO label set).
bboxes of detected objects: [[ 473.17449951 137.78985596 812.97937012 477.59475708]
[ 0. 5.46184874 1280. 720. ]]
scores of detected objects: [0.73033565 0.20149007]
Details of detected objects: [49. 60.]
Pre-processing time: 0.0108 seconds
DPU execution time: 0.0129 seconds
Post-process time: 0.0360 seconds
Total run time: 0.0597 seconds
Performance: 16.740788045213616 FPS
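The stage timings above come from simple wall-clock measurements around each step; a minimal sketch of the pattern, where preprocess and postprocess stand in for the actual helper functions:
import time

t0 = time.time()
blob = preprocess(frame)                          # placeholder pre-processing
t1 = time.time()
outputs = session.run(None, {input_name: blob})   # DPU execution
t2 = time.time()
bboxes, scores, class_ids = postprocess(outputs)  # placeholder decode + NMS
t3 = time.time()

print(f"Pre-processing time: {t1 - t0:.4f} seconds")
print(f"DPU execution time: {t2 - t1:.4f} seconds")
print(f"Post-process time: {t3 - t2:.4f} seconds")
print(f"Total run time: {t3 - t0:.4f} seconds")
print(f"Performance: {1 / (t3 - t0)} FPS")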
Although not a precise comparison, we compared these results against the speeds measured in the Ubuntu environment described in the article below.
10. Improve Object Detection Speed with YOLOX
We conducted similar tests with TensorFlow2's YOLOv3 and PyTorch's YOLOX.
When comparing the pre-processing, DPU inference, and post-processing, the results were as follows:
For the ONNX YOLOX, no particular speed optimizations were made from the original PyTorch. Both versions of YOLOX yielded almost identical results, which was as expected.
With the capability to use ONNX, comparing the CPU and DPU on the KR260 becomes straightforward.
By modifying a single line in the same program, you can switch from DPU to CPU for inference:
providers=["CPUExecutionProvider"]
session = onnxruntime.InferenceSession(
    'yolox_nano_onnx_pt.onnx',
    # providers=["VitisAIExecutionProvider"],
    providers=["CPUExecutionProvider"],
    provider_options=[{"config_file":"/usr/bin/vaip_config.json"}])
The actual program we ran is available on GitHub below.
xilinx-kr260-starterkit-20231:~/onnx-test$ python3 onnx-cpu-yolox.py
Here is the comparison of CPU and DPU with ONNX, as shown in the test video:
The DPU inference was over 20 times faster than the CPU.
The graph below summarizes these results together with the earlier YOLOv3 and PyTorch YOLOX tests.
It clearly shows how using the DPU significantly speeds up inference.
Many thanks to the following reference articles and repositories:
https://github.com/Kazuhito00/YOLOX-ONNX-TFLite-Sample/tree/main
Summary
We built an ONNX environment on the KR260 and ran ONNX Runtime.
We also introduced the use of the Vitis AI ONNX Runtime Engine (VOE).
In the next project, we try object detection with a regular USB-connected webcam on the KR260.
14. Appendix: Object Detection Using YOLOX with a Webcam << next project