We built an ONNX environment on the KR260 and ran ONNX Runtime.
We also introduce the use of the Vitis AI ONNX Runtime Engine (VOE).
This project is part of a subproject for the AMD Pervasive AI Developer Contest.
Be sure to check out the other projects as well.
**The main project is currently under submission.**
0. Main project << under submission
2. PYNQ + PWM(DC-Motor Control)
3. Object Detection(Yolo) with DPU-PYNQ
4. Implementation DPU, GPIO, and PWM
6. GStreamer + OpenCV with 360°Camera
7. 360 Live Streaming + Object Detect(DPU)
8. ROS2 3D Marker from 360 Live Streaming
9. Control 360° Object Detection Robot Car
10. Improve Object Detection Speed with YOLOX
11. Benchmark Architectures of the DPU
12. Power Consumption of 360° Object Detection Robot Car
13. Application to Vitis AI ONNXRuntime Engine (VOE) << this project
14. Appendix: Object Detection Using YOLOX with a Webcam
*In this subproject, we will conduct tests in a different environment from the main project as part of the benchmarking process.
Introduction
We experimented with running YOLOX using ONNX on the DPU from Python.
This project involved setting up a dedicated ONNX environment on the KR260. We also introduce the use of the Vitis AI ONNX Runtime Engine (VOE).
Using ONNX Runtime makes Vitis AI even more user-friendly. This time, we compare CPU and DPU inference using ONNX.
The ONNX and DPU (KR260) test video is shown below:
A comparison of CPU and DPU with ONNX is also shown in the video below:
Vitis AI 3.5 ONNX
Vitis AI 3.5 ONNX supports both C++ and Python. Refer to the official documentation here:
https://docs.amd.com/r/en-US/ug1414-vitis-ai/Programming-with-VOE
For YOLOX in C++, sample code is provided officially here:
https://github.com/Xilinx/Vitis-AI/tree/v3.5/examples/vai_library/samples_onnx/yolovx_nano
In this subproject, we will write and test YOLOX in Python code.
Building the Environment with PetaLinux from BSP
In this subproject, we used PetaLinux to create the OS environment on the KR260. Download the BSP file from the link below and build it with PetaLinux:
source /opt/petalinux/2023.1/settings.sh
petalinux-create -t project -s xilinx-kr260-starterkit-v2023.1-05080224.bsp
cd xilinx-kr260-starterkit-2023.1/
petalinux-build
petalinux-package --boot --u-boot --force
petalinux-package --wic --images-dir images/linux/ --bootfiles "ramdisk.cpio.gz.u-boot,boot.scr,Image,system.dtb,system-zynqmp-sck-kr-g-revB.dtb" --disk-name "sda"
Note:
PetaLinux is used because onnxruntime could not be installed in the Ubuntu environment.
Writing the Image to the SD Card
Write the SD card image (.wic) created with PetaLinux, found in the following folder:
~/xilinx-kr260-starterkit-2023.1/images/linux/
We use balenaEtcher to write it to the SD card.
The initial login name for the KR260 is "petalinux". Follow the official documentation to install Vitis AI and ONNX Runtime on the KR260:
https://docs.amd.com/r/en-US/ug1414-vitis-ai/Programming-with-VOE
wget https://www.xilinx.com/bin/public/openDownload?filename=vitis_ai_2023.1-r3.5.0.tar.gz
sudo tar -xzvf openDownload\?filename\=vitis_ai_2023.1-r3.5.0.tar.gz -C /
ls
wget https://www.xilinx.com/bin/public/openDownload?filename=voe-0.1.0-py3-none-any.whl -O voe-0.1.0-py3-none-any.whl
pip3 install voe*.whl
wget https://www.xilinx.com/bin/public/openDownload?filename=onnxruntime_vitisai-1.16.0-py3-none-any.whl -O onnxruntime_vitisai-1.16.0-py3-none-any.whl
pip3 install onnxruntime_vitisai*.whl
Installing xrt and packagegroup-petalinux-opencv
When installed according to the official documentation, importing onnxruntime in Python resulted in an error:
xilinx-kr260-starterkit-20231:~$ python3
Python 3.10.6 (main, Aug 1 2022, 20:38:21) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnxruntime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/petalinux/.local/lib/python3.10/site-packages/onnxruntime/__init__.py", line 55, in <module>
raise import_capi_exception
File "/home/petalinux/.local/lib/python3.10/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import ExecutionMode # noqa: F401
File "/home/petalinux/.local/lib/python3.10/site-packages/onnxruntime/capi/_pybind_state.py", line 33, in <module>
from .onnxruntime_pybind11_state import * # noqa
ImportError: libglog.so.0: cannot open shared object file: No such file or directory
It seemed that xrt was not installed, so we installed the missing packages:
sudo dnf install xrt packagegroup-petalinux-opencv
After executing the above, we were able to import onnxruntime without any issues:
xilinx-kr260-starterkit-20231:~$ python3
Python 3.10.6 (main, Aug 1 2022, 20:38:21) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import onnxruntime
>>> exit()
Creating the DPU Environment with PetaLinux
In Ubuntu, the FPGA's DPU environment could be overlaid via the PYNQ-DPU library.
In PetaLinux, we used xmutil to prepare the DPU environment.
We prepared the necessary files, including those needed for xmutil, and placed them on GitHub.
https://github.com/iotengineer22/AMD-Pervasive-AI-Developer-Contest/tree/main/src/onnx-test
Creating the Device Tree (pl.dtbo)
Using the FPGA design file (.xsa) created with Vivado, we executed the following commands with the Vitis command-line tool (xsct) to create the "pl.dtbo" device tree overlay file:
xsct
createdts -hw amd_contest_design.xsa -zocl -platform-name mydevice -git-branch xlnx_rel_v2023.1 -overlay -compile -out mydevice
exit
dtc -@ -O dtb -o mydevice/mydevice/mydevice/psu_cortexa53_0/device_tree_domain/bsp/pl.dtbo mydevice/mydevice/mydevice/psu_cortexa53_0/device_tree_domain/bsp/pl.dtsi
mkdir dtg_output
cp mydevice/mydevice/mydevice/psu_cortexa53_0/device_tree_domain/bsp/pl.dtbo dtg_output
We used the .xsa file created in the project below.
4. Implementation DPU, GPIO, and PWM
*Chapter: Creating the DPU Project in Vivado
Creating the shell.json
We also created the "shell.json" file, which specifies the XRT_FLAT shell information required when loading the DPU:
echo '{' > shell.json
echo ' "shell_type" : "XRT_FLAT",' >> shell.json
echo ' "num_slots": "1"' >> shell.json
echo '}' >> shell.json
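For reference, the resulting shell.json contains:
{
 "shell_type" : "XRT_FLAT",
 "num_slots": "1"
}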
Using the DPU Bitstream File (dpu.xclbin)
We used the DPU bitstream file (dpu.xclbin) created during the synthesis in Vitis.
4. Implementation DPU, GPIO, and PWM
*Chapter: Running the DPU on KR260 with PYNQ
This DPU uses the B4096 architecture at 300 MHz. The file is found in the "Hardware" folder inside the Vitis project's "hw_link" folder.
Creating the vart.conf
We also created the vart.conf, which is the configuration file for Vitis AI Runtime (VART):
echo 'firmware: /home/petalinux/onnx-test/dpu.xclbin' > vart.conf
This file specifies the location where dpu.xclbin is placed.
The default path was [/run/media/mmcblk0p1/dpu.xclbin].
For testing purposes, we specified a temporary folder.
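For reference, a vart.conf pointing at the default location would consist of this single line:
firmware: /run/media/mmcblk0p1/dpu.xclbin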
Preparing the ONNX YOLOX Model
This time, we used the models provided by Vitis AI. Pre-trained and quantized models are provided as samples by Xilinx (AMD), converted from PyTorch models to ONNX:
https://github.com/Xilinx/Vitis-AI/tree/master/model_zoo/model-list/pt_yolox-nano_3.5
Download and unzip the YOLOX sample model:
wget https://www.xilinx.com/bin/public/openDownload?filename=pt_yolox-nano_3.5.zip
unzip openDownload\?filename\=pt_yolox-nano_3.5.zip
The "yolox_nano_onnx_pt.onnx" file is in the "quantized" folder. Transfer this file to the KR260 without further compilation.
The actual program we ran is available on GitHub below.
This program is executed on the KR260.
First, we import onnxruntime and create a session to enable inference with ONNX.
Below is a portion of the program.
import onnxruntime
session = onnxruntime.InferenceSession(
    'yolox_nano_onnx_pt.onnx',
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file":"/usr/bin/vaip_config.json"}])
In the official instructions, the config file was specified as shown below, but we modified it for the KR260's configuration.
provider_options=[{"config_file":"/etc/vaip_config.json"}]
Additionally, since the model was directly converted from PyTorch to ONNX, there is no significant speedup from the conversion itself.
We kept the sigmoid and softmax processes, just as in PyTorch.
Moreover, because it is ONNX, a transpose (channel swap) step is also included, which makes the pre- and post-processing slower than in PyTorch.
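To make the flow concrete, here is a minimal sketch of running one image through the session (the image file name is illustrative; the full decoding logic lives in the program on GitHub):
import cv2
import numpy as np

# Pre-process: resize to the 416x416 model input, then swap HWC -> NCHW
img = cv2.imread('test.jpg')
resized = cv2.resize(img, (416, 416))
blob = resized.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)

# Inference: the Vitis AI execution provider dispatches the DPU subgraph
input_name = session.get_inputs()[0].name  # 'YOLOX::input_0'
outputs = session.run(None, {input_name: blob})

# Post-process: decode the three 85-channel feature maps (52x52, 26x26,
# 13x13) with grid offsets, sigmoid, and NMS, just as in the PyTorch code
print([o.shape for o in outputs])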
Testing ONNX on KR260
We transferred the files we created to the KR260.
First, we configured the DPU so it can be loaded with xmutil, creating an application called b4096_300m.
ls /lib/firmware/xilinx/
sudo mkdir /lib/firmware/xilinx/b4096_300m
sudo cp pl.dtbo shell.json /lib/firmware/xilinx/b4096_300m/
sudo cp dpu.xclbin /lib/firmware/xilinx/b4096_300m/binary_container_1.bin
ls /lib/firmware/xilinx/b4096_300m/
We also replaced the existing vart.conf with the newly created one.
sudo mv /etc/vart.conf /etc/old_vart.conf
sudo cp vart.conf /etc/
sudo reboot
From here, the steps follow the flow presented in the demo video mentioned at the beginning.
First, load the DPU application (b4096_300m).
xilinx-kr260-starterkit-20231:~$ sudo xmutil listapps
xilinx-kr260-starterkit-20231:~$ sudo xmutil unloadapp
xilinx-kr260-starterkit-20231:~$ sudo xmutil loadapp b4096_300m
xilinx-kr260-starterkit-20231:~$ sudo xmutil listapps
We run a Python program (onnx-yolox.py) in the onnx-test directory on the KR260.
xilinx-kr260-starterkit-20231:~$ cd onnx-test/
xilinx-kr260-starterkit-20231:~/onnx-test$ python3 onnx-yolox.py
When the program first runs, it compiles the model to match the DPU specifications.
Note: The initial compilation takes a few minutes. Subsequent runs will skip the compilation step, making the process faster.
During the compilation, the log shows that the program reads the loaded DPU and compiles in DPU mode:
Compile mode: dpu
Debug mode: performance
Target architecture: DPUCZDX8G_ISA1_B4096_0101000016010407
Graph name: torch_jit, with op num: 815
Begin to compile...
Once the compilation is complete, the Python program will execute.
Here is an example of the actual log output:
yolox_nano_test, in ONNX
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240707 00:25:31.234889 1121 vitisai_compile_model.cpp:242] Vitis AI EP Load ONNX Model Success
I20240707 00:25:31.234995 1121 vitisai_compile_model.cpp:243] Graph Input Node Name/Shape (1)
I20240707 00:25:31.235031 1121 vitisai_compile_model.cpp:247] YOLOX::input_0 : [-1x3x416x416]
I20240707 00:25:31.235271 1121 vitisai_compile_model.cpp:253] Graph Output Node Name/Shape (3)
I20240707 00:25:31.235306 1121 vitisai_compile_model.cpp:257] 2043 : [-1x85x52x52]
I20240707 00:25:31.235337 1121 vitisai_compile_model.cpp:257] 2269 : [-1x85x26x26]
I20240707 00:25:31.235366 1121 vitisai_compile_model.cpp:257] 2495 : [-1x85x13x13]
I20240707 00:25:32.170728 1121 pass_imp.cpp:379] save const info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/const_info_before_const_folding.txt"
I20240707 00:25:32.848812 1121 pass_imp.cpp:288] save fix info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/fix_info.txt"
I20240707 00:25:32.849267 1121 pass_imp.cpp:379] save const info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/const_info_after_const_folding.txt"
I20240707 00:25:32.851840 1121 pass_imp.cpp:406] save const info to "/tmp/petalinux/vaip/.cache/00d67dd613fde6f65242578b5578aae7/const.bin"
I20240707 00:25:44.541139 1121 compile_pass_manager.cpp:352] [UNILOG][INFO] Compile mode: dpu
I20240707 00:25:44.541229 1121 compile_pass_manager.cpp:353] [UNILOG][INFO] Debug mode: performance
I20240707 00:25:44.541294 1121 compile_pass_manager.cpp:357] [UNILOG][INFO] Target architecture: DPUCZDX8G_ISA1_B4096_0101000016010407
I20240707 00:25:44.549165 1121 compile_pass_manager.cpp:465] [UNILOG][INFO] Graph name: torch_jit, with op num: 815
I20240707 00:25:44.549206 1121 compile_pass_manager.cpp:478] [UNILOG][INFO] Begin to compile...
W20240707 00:27:51.125735 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = onnx::Conv_232_vaip_1, type = transpose} has been assigned to CPU.
W20240707 00:27:51.551687 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = 2495, type = transpose} has been assigned to CPU.
W20240707 00:27:51.610715 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = 2269, type = transpose} has been assigned to CPU.
W20240707 00:27:51.669713 1121 PartitionPass.cpp:4160] [UNILOG][WARNING] xir::Op{name = 2043, type = transpose} has been assigned to CPU.
I20240707 00:28:25.202344 1121 compile_pass_manager.cpp:489] [UNILOG][INFO] Total device subgraph number 6, DPU subgraph number 1
I20240707 00:28:25.202572 1121 compile_pass_manager.cpp:504] [UNILOG][INFO] Compile done.
I20240707 00:28:25.395335 1121 anchor_point.cpp:423] before optimization:
onnx::Conv_232_vaip_1_fix <-- transpose@layoutransform --
onnx::DequantizeLinear_229 <-- identity@fuse_DPU --
onnx::DequantizeLinear_229
after optimization:
onnx::Conv_232_vaip_1_fix <-- transpose@layoutransform --
onnx::DequantizeLinear_229
I20240707 00:28:25.395582 1121 anchor_point.cpp:423] before optimization:
2495_vaip_664 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2492 <-- identity@fuse_DPU --
onnx::DequantizeLinear_2492
after optimization:
2495_vaip_664 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2492
I20240707 00:28:25.395682 1121 anchor_point.cpp:423] before optimization:
2269_vaip_697 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2266 <-- identity@fuse_DPU --
onnx::DequantizeLinear_2266
after optimization:
2269_vaip_697 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2266
I20240707 00:28:25.395769 1121 anchor_point.cpp:423] before optimization:
2043_vaip_730 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2040 <-- identity@fuse_DPU --
onnx::DequantizeLinear_2040
after optimization:
2043_vaip_730 <-- transpose@fuse_transpose --
onnx::DequantizeLinear_2040
2024-07-07 00:28:26.281134980 [W:onnxruntime:, session_state.cc:1169 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-07 00:28:26.281221749 [W:onnxruntime:, session_state.cc:1171 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I20240707 00:28:26.707309 1121 custom_op.cpp:141] Vitis AI EP running 723 Nodes
['YOLOX::input_0_dynamic_axes_1', 3, 416, 416]
YOLOX::input_0
tensor(float)
['DequantizeLinear2043_dim_0', 85, 52, 52]
['DequantizeLinear2269_dim_0', 85, 26, 26]
['DequantizeLinear2495_dim_0', 85, 13, 13]
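The input and output names, types, and shapes shown above can be queried directly from the ONNX Runtime session; a minimal sketch:
# List the model's input and output tensors from the session
for inp in session.get_inputs():
    print(inp.shape)  # ['YOLOX::input_0_dynamic_axes_1', 3, 416, 416]
    print(inp.name)   # YOLOX::input_0
    print(inp.type)   # tensor(float)
for out in session.get_outputs():
    print(out.shape)  # e.g. ['DequantizeLinear2043_dim_0', 85, 52, 52]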
We used YOLOX for object detection on a single image.
The orange ball was detected without any issues (class ID 49 corresponds to "orange" in the COCO label set).
bboxes of detected objects: [[ 473.17449951 137.78985596 812.97937012 477.59475708]
[ 0. 5.46184874 1280. 720. ]]
scores of detected objects: [0.73033565 0.20149007]
Details of detected objects: [49. 60.]
Pre-processing time: 0.0108 seconds
DPU execution time: 0.0129 seconds
Post-process time: 0.0360 seconds
Total run time: 0.0597 seconds
Performance: 16.740788045213616 FPS
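The stage timings above come from simple wall-clock measurements around each step; a minimal sketch of the pattern, where preprocess and postprocess stand in for the actual helper functions:
import time

t0 = time.time()
blob = preprocess(frame)                          # placeholder pre-processing
t1 = time.time()
outputs = session.run(None, {input_name: blob})   # DPU execution
t2 = time.time()
bboxes, scores, class_ids = postprocess(outputs)  # placeholder decode + NMS
t3 = time.time()

print(f"Pre-processing time: {t1 - t0:.4f} seconds")
print(f"DPU execution time: {t2 - t1:.4f} seconds")
print(f"Post-process time: {t3 - t2:.4f} seconds")
print(f"Total run time: {t3 - t0:.4f} seconds")
print(f"Performance: {1 / (t3 - t0)} FPS")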
Although not a precise comparison, we compared these results against the speeds measured in the Ubuntu environment described in the article below.
10. Improve Object Detection Speed with YOLOX
We conducted similar tests with TensorFlow2's YOLOv3 and PyTorch's YOLOX.
When comparing the pre-processing, DPU inference, and post-processing, the results were as follows:
For the ONNX YOLOX, no particular speed optimizations were made from the original PyTorch. Both versions of YOLOX yielded almost identical results, which was as expected.
With the capability to use ONNX, comparing the CPU and DPU on the KR260 becomes straightforward.
By modifying a single line in the same program, you can switch from DPU to CPU for inference:
providers=["CPUExecutionProvider"]
session = onnxruntime.InferenceSession(
    'yolox_nano_onnx_pt.onnx',
    # providers=["VitisAIExecutionProvider"],
    providers=["CPUExecutionProvider"],
    provider_options=[{"config_file":"/usr/bin/vaip_config.json"}])
The actual program we ran is available on GitHub below.
xilinx-kr260-starterkit-20231:~/onnx-test$ python3 onnx-cpu-yolox.py
Here is the comparison of CPU and DPU with ONNX, as shown in the test video:
The DPU inference was over 20 times faster than the CPU.
The graph below summarizes these results together with the earlier YOLOv3 and PyTorch YOLOX tests.
It clearly shows how using the DPU significantly speeds up inference.
Many thanks to the following reference articles and repositories:
https://github.com/Kazuhito00/YOLOX-ONNX-TFLite-Sample/tree/main
Summary
We built an ONNX environment on the KR260 and ran ONNX Runtime.
We also introduced the use of the Vitis AI ONNX Runtime Engine (VOE).
In the next project, we try object detection with a regular USB-connected webcam on the KR260.
14. Appendix: Object Detection Using YOLOX with a Webcam << next project