The DPU (DPUCZDX8G) comes in several different architectures.
We compared the execution speeds of eight different architectures: "B512, B800, B1024, B1152, B1600, B2304, B3136, B4096."
This comparison focuses on the execution speeds of YOLOX object detection on the KR260 platform.
Subproject for the AMD Pervasive AI Developer Contest
This project is one of the subprojects for the AMD Pervasive AI Developer Contest.
Be sure to check out the other projects as well.
**The main project is currently under submission.**
0. Main project << under submission
2. PYNQ + PWM(DC-Motor Control)
3. Object Detection(Yolo) with DPU-PYNQ
4. Implementation DPU, GPIO, and PWM
6. GStreamer + OpenCV with 360°Camera
7. 360 Live Streaming + Object Detect(DPU)
8. ROS2 3D Marker from 360 Live Streaming
9. Control 360° Object Detection Robot Car
10. Improve Object Detection Speed with YOLOX
11. Benchmark Architectures of the DPU << this project
12. Power Consumption of 360° Object Detection Robot Car
13. Application to Vitis AI ONNX Runtime Engine (VOE)
14. Appendix: Object Detection Using YOLOX with a Webcam
Please note that before running the above subprojects, you need to complete the following setup, which is the reference for this AMD contest.
https://github.com/amd/Kria-RoboticsAI
Introduction
We measured the speed of object detection using various architectures of the DPU (DPUCZDX8G).
The eight architectures compared are "B512, B800, B1024, B1152, B1600, B2304, B3136, B4096."
The execution speed of YOLOX object detection on the KR260 was compared.
The tests primarily used a 150MHz clock, with some checks at 300MHz.
Generally, larger sizes resulted in shorter inference times.
The process of creating models and running them will be explained in detail.
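As background for the comparison, the number in each architecture name is the peak number of operations per DPU clock cycle according to the DPUCZDX8G product guide. The following is a minimal sketch of the theoretical peak throughput at the two clocks used here. The helper name `peak_gops` is hypothetical, and the figures are upper bounds for a single core, not measurements.

```python
# Peak operations per clock cycle, taken from the architecture names
# (for example, B4096 -> 4096 ops/cycle). Assumption: one DPU core.
ARCH_OPS = [512, 800, 1024, 1152, 1600, 2304, 3136, 4096]


def peak_gops(ops_per_cycle: int, clock_mhz: float) -> float:
    """Theoretical peak throughput in GOPS (hypothetical helper)."""
    return ops_per_cycle * clock_mhz * 1e6 / 1e9


for ops in ARCH_OPS:
    print(f"B{ops}: {peak_gops(ops, 150):7.1f} GOPS @150MHz, "
          f"{peak_gops(ops, 300):7.1f} GOPS @300MHz")
```

The peak ratio between B512 and B4096 is 8x, while the measured DPU execution times at 150MHz reported later (0.0348 seconds and 0.0170 seconds) differ by roughly 2x, so for this model the measured time does not scale linearly with peak throughput.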
Creating Models for Each Architecture
Refer to the article below for details on how to create the files (".bit", ".xclbin", ".hwh") for each DPU architecture.
4. Implementation DPU, GPIO, and PWM
To run YOLOX object detection, you also need a compiled model (".xmodel") for each architecture. Please refer to the following article.
10. Improve Object Detection Speed with YOLOX
Generated Files for Each Architecture
The created files for each architecture ("B512, B800, B1024, B1152, B1600, B2304, B3136, B4096") are listed below.
For B512, B3136, and B4096, files were also generated with the DPU clock set to 300MHz instead of 150MHz.
DPU Configuration (dpu_conf.vh)
The DPU configuration is essentially set to default by Vitis.
The size is changed for each architecture. (Below is an example for B3136)
/*====== Architecture Options ======*/
// |------------------------------------------------------|
// | Support 8 DPU size
// | It relates to model. if change, must update model
// +------------------------------------------------------+
// | `define B512
// +------------------------------------------------------+
// | `define B800
// +------------------------------------------------------+
// | `define B1024
// +------------------------------------------------------+
// | `define B1152
// +------------------------------------------------------+
// | `define B1600
// +------------------------------------------------------+
// | `define B2304
// +------------------------------------------------------+
// | `define B3136
// +------------------------------------------------------+
// | `define B4096
// |------------------------------------------------------|
`define B3136
As the DPU size increases, BRAM alone becomes insufficient.
Therefore, URAM is enabled for B2304 and larger.
// |------------------------------------------------------|
// | If the FPGA has URAM. You can define URAM_EN parameter
// | if change, Don't need update model
// +------------------------------------------------------+
// | for zcu104 : `define URAM_ENABLE
// +------------------------------------------------------+
// | for zcu102 : `define URAM_DISABLE
// |------------------------------------------------------|
//`define URAM_DISABLE
`define URAM_ENABLE
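When building many architectures in a row, the active size define can be switched with a small script instead of by hand. The following is a minimal sketch, assuming the file contains exactly one uncommented `define B<size>` line as in the excerpt above. The helper `set_dpu_arch` is hypothetical and is not part of Vitis or the original build flow.

```python
import re


def set_dpu_arch(conf_text: str, arch: str) -> str:
    """Return conf_text with the active `define B<size> line set to arch.

    Commented lines (starting with //) are left untouched because the
    pattern only matches a define at the very start of a line.
    """
    pattern = re.compile(r"^`define[ \t]+B\d+[ \t]*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(f"`define {arch}", conf_text)
    if count != 1:
        raise ValueError(f"expected one active architecture define, found {count}")
    return new_text


# Small stand-in for dpu_conf.vh so the sketch runs on its own.
sample = "// | `define B512\n// | `define B4096\n`define B3136\n`define URAM_ENABLE\n"
print(set_dpu_arch(sample, "B4096"))
```

Remember that the size define relates to the model: after changing it, the ".xmodel" must be regenerated for the new architecture, as the comment in the file notes.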
For B4096_300MHz, the default parameters were not sufficient for building. The parameters used are listed below.
YOLOX Inference Results for Each Architecture
The inference time for object detection on a single image using YOLOX was measured. The test image shows an orange ball on a table.
The DPU execution time reported below covers only the DPU call and excludes pre-processing and post-processing, which are timed separately:
# Fetch data to DPU and trigger it
dpu_start = time.time()
job_id = dpu.execute_async(input_data, output_data)
dpu.wait(job_id)
dpu_end = time.time()
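A single timed call can be noisy, so one option is to repeat the measurement and look at the mean and the minimum. The following is a minimal sketch under that assumption. The helper `time_calls` is hypothetical and is not taken from the project repository, and a stand-in workload replaces the DPU call so the example runs without the board.

```python
import time
from statistics import mean


def time_calls(run_once, repeats: int = 10, warmup: int = 2):
    """Call run_once() repeatedly and return (mean, min) elapsed seconds."""
    for _ in range(warmup):
        run_once()  # warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return mean(samples), min(samples)


# Stand-in workload. On the board, run_once would wrap the DPU calls
# shown above, for example:
#   lambda: dpu.wait(dpu.execute_async(input_data, output_data))
avg, best = time_calls(lambda: sum(range(10000)))
print(f"mean: {avg:.6f} s, min: {best:.6f} s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.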
The actual program can be found on the following GitHub repository:
Summary of Key Findings
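- Larger architecture sizes generally result in faster DPU execution. The one exception in these measurements is B1152 (0.0276 seconds), which was slower than B1024 (0.0255 seconds).
- Increasing the DPU clock frequency also generally results in faster performance.
For B512, raising the clock from 150MHz to 300MHz cut the DPU execution time from 0.0348 seconds to 0.0224 seconds (about 36%). For B3136 and B4096 the gain was smaller: about 25% (0.0195 to 0.0147 seconds) and about 19% (0.0170 to 0.0137 seconds), respectively.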
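The effect of the clock change can be checked directly from the DPU execution times listed in the results section. The following is a small sketch of that arithmetic; the dictionary and function names are illustrative only.

```python
# DPU execution time in seconds at (150MHz, 300MHz), copied from the
# per-architecture results in this article.
dpu_time_s = {
    "B512": (0.0348, 0.0224),
    "B3136": (0.0195, 0.0147),
    "B4096": (0.0170, 0.0137),
}


def speedup(t_150: float, t_300: float) -> float:
    """How many times faster the DPU step is at 300MHz than at 150MHz."""
    return t_150 / t_300


for arch, (t_150, t_300) in dpu_time_s.items():
    print(f"{arch}: {speedup(t_150, t_300):.2f}x faster at 300MHz")
```

Doubling the clock would give 2.00x in the ideal case. All three measured ratios are well below that, and the ratio shrinks as the architecture gets larger.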
Results and Utilization on the KR260
The results and utilization rates for each architecture on the KR260 are presented below.
For more details on DPU resource utilization, refer to the official documentation.
https://docs.amd.com/r/en-US/pg338-dpu/Resource-Utilization
B512_150MHz
- Pre-processing time: 0.0078 seconds
- DPU execution time: 0.0348 seconds
- Post-process time: 0.0291 seconds
- Total run time: 0.0717 seconds
- Performance: 13.942717336382735 FPS
B800_150MHz
- Pre-processing time: 0.0079 seconds
- DPU execution time: 0.0300 seconds
- Post-process time: 0.0288 seconds
- Total run time: 0.0667 seconds
- Performance: 14.988864588247067 FPS
B1024_150MHz
- Pre-processing time: 0.0077 seconds
- DPU execution time: 0.0255 seconds
- Post-process time: 0.0289 seconds
- Total run time: 0.0620 seconds
- Performance: 16.12263694022679 FPS
B1152_150MHz
- Pre-processing time: 0.0077 seconds
- DPU execution time: 0.0276 seconds
- Post-process time: 0.0291 seconds
- Total run time: 0.0644 seconds
- Performance: 15.529397825893783 FPS
B1600_150MHz
- Pre-processing time: 0.0078 seconds
- DPU execution time: 0.0229 seconds
- Post-process time: 0.0290 seconds
- Total run time: 0.0597 seconds
- Performance: 16.762933980248828 FPS
B2304_150MHz
- Pre-processing time: 0.0079 seconds
- DPU execution time: 0.0207 seconds
- Post-process time: 0.0290 seconds
- Total run time: 0.0575 seconds
- Performance: 17.37865654573479 FPS
B3136_150MHz
- Pre-processing time: 0.0080 seconds
- DPU execution time: 0.0195 seconds
- Post-process time: 0.0307 seconds
- Total run time: 0.0583 seconds
- Performance: 17.16086428188584 FPS
B4096_150MHz
- Pre-processing time: 0.0080 seconds
- DPU execution time: 0.0170 seconds
- Post-process time: 0.0292 seconds
- Total run time: 0.0542 seconds
- Performance: 18.46564028510925 FPS
B512_300MHz
- Pre-processing time: 0.0081 seconds
- DPU execution time: 0.0224 seconds
- Post-process time: 0.0290 seconds
- Total run time: 0.0596 seconds
- Performance: 16.7869524324108 FPS
B3136_300MHz
- Pre-processing time: 0.0077 seconds
- DPU execution time: 0.0147 seconds
- Post-process time: 0.0290 seconds
- Total run time: 0.0514 seconds
- Performance: 19.447244941486304 FPS
B4096_300MHz
- Pre-processing time: 0.0079 seconds
- DPU execution time: 0.0137 seconds
- Post-process time: 0.0291 seconds
- Total run time: 0.0507 seconds
- Performance: 19.72305087933791 FPS
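To see how much of each frame is spent on the DPU, the listed times can be put side by side. The following is a minimal sketch using only the DPU execution time and total run time from the results above; the function names are illustrative. FPS recomputed from the rounded totals can differ slightly from the reported values, which were presumably computed from unrounded times.

```python
# (DPU execution time, total run time) in seconds, copied from the results above.
results = {
    "B512_150MHz": (0.0348, 0.0717),
    "B800_150MHz": (0.0300, 0.0667),
    "B1024_150MHz": (0.0255, 0.0620),
    "B1152_150MHz": (0.0276, 0.0644),
    "B1600_150MHz": (0.0229, 0.0597),
    "B2304_150MHz": (0.0207, 0.0575),
    "B3136_150MHz": (0.0195, 0.0583),
    "B4096_150MHz": (0.0170, 0.0542),
    "B512_300MHz": (0.0224, 0.0596),
    "B3136_300MHz": (0.0147, 0.0514),
    "B4096_300MHz": (0.0137, 0.0507),
}


def dpu_share(dpu_s: float, total_s: float) -> float:
    """Fraction of the total run time spent in DPU execution."""
    return dpu_s / total_s


def fps(total_s: float) -> float:
    """Frames per second implied by the total run time of one frame."""
    return 1.0 / total_s


for name, (dpu_s, total_s) in results.items():
    print(f"{name:13s} DPU share {dpu_share(dpu_s, total_s):5.1%}  "
          f"FPS {fps(total_s):5.2f}")
```

On these numbers the DPU share falls from about 49% (B512_150MHz) to about 27% (B4096_300MHz). Pre-processing and post-processing stay at roughly 0.037 seconds per frame in every configuration, so they increasingly bound the overall FPS as the DPU gets faster.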
As a reference, we measured the power consumption of B512_150MHz and B4096_150MHz.
First, we checked the power consumption settings of the FPGA with the default (typical) configuration.
The power consumption of B512_150MHz was 3.784W as shown below.
The power consumption of B4096_150MHz was 5.071W as shown below.
The actual test video is below.
We measured the power consumption using a simple current checker.
Because this is the current drawn by the whole KR260 board, the resulting value is higher than the power consumption value of the FPGA alone.
After running DPU B512, the power consumption was about 9.0W (12V, 750mA).
After loading DPU B4096, the power consumption was about 9.6W (12V, 800mA).
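The board-level figures follow from power = voltage × current. The following is a small sketch of that arithmetic, alongside the difference between the two FPGA-level values quoted earlier. The variable names are illustrative, and the two kinds of figures come from different sources, so the comparison is only indicative.

```python
def power_w(voltage_v: float, current_a: float) -> float:
    """Electrical power in watts from supply voltage and current."""
    return voltage_v * current_a


# Board-level readings from the current checker (12V supply).
board_b512 = power_w(12.0, 0.750)
board_b4096 = power_w(12.0, 0.800)

# FPGA-level values quoted above for the 150MHz builds.
fpga_b512, fpga_b4096 = 3.784, 5.071

print(f"Board: B512 {board_b512:.1f} W, B4096 {board_b4096:.1f} W, "
      f"difference {board_b4096 - board_b512:.1f} W")
print(f"FPGA:  B512 {fpga_b512:.3f} W, B4096 {fpga_b4096:.3f} W, "
      f"difference {fpga_b4096 - fpga_b512:.3f} W")
```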
We compared the execution speeds of eight different architectures: "B512, B800, B1024, B1152, B1600, B2304, B3136, B4096."
This comparison focused on the execution speed of YOLOX object detection on the KR260 platform.
In the next project, we measure the power consumption of the robot car with the KR260.