In my previous projects, I illustrated how to get the Deep Learning Processor Unit (DPU) provided with Vitis-AI up and running on the ZUBoard.
We also explored two HSIO add-on modules that could be used with the ZUBoard:
- HSIO : Dual Camera Mezzanine
- HSIO : DisplayPort with eMMC
In this project, we will look at a new HSIO add-on module that provides exciting new expansion for the ZUBoard:
- HSIO : M2 expansion
I will attempt to implement an HSIO hat-trick (hockey jargon for 3 goals).
The M2 expansion used will be the Hailo-8 AI accelerator.
The following table illustrates the Peak TOPS available with the Vitis-AI DPU on the Zynq UltraScale+ devices.
This assumes that the entire programmable logic (PL) is available to implement the Vitis-AI DPU.
In reality, however, when we add a MIPI capture pipeline for our Dual Camera HSIO, we have to scale back the DPU to leave enough resources for this peripheral logic. The following table adds a lower (more realistic) range for the Peak TOPS that would typically be available in a real design.
To be more specific, in our ZUBoard Dual Camera design, we are able to implement the B128 DPU, which has a peak of 0.038 TOPS.
By adding a Hailo-8 AI accelerator module, we can theoretically boost the Peak TOPS of the ZUBoard beyond what is available in any of the Zynq UltraScale+ devices. In fact, we are now in the realm of what is available only with the next-generation Versal AI Edge devices.
The Peak TOPS available for our ZUBoard Dual Camera design will theoretically increase by a factor of 26.0/0.038 = 684x!!!
This will, of course, have to be benchmarked on the board.
But the motivation definitely is there :)
Introducing the HSIO M2 module
The M.2 High Speed I/O module provides support for M.2 expansion.
The following connectors are supported:
- B-Key
- E-Key
The following module sizes are supported:
- B-Key : 2230-2242-2260-2280
- E-Key : 2230
The following module types are supported:
- B-Key: SSDs or accelerators via PCIe (2 lanes)
- E-Key: WiFi Modules via SDIO/UART
The Hailo-8 M.2 Module is an AI accelerator module for AI applications, compatible with NGFF M.2 form factor M, B+M and A+E keys.
The AI module is based on the 26 tera-operations per second (TOPS) Hailo-8 AI processor with high power efficiency. The M.2 AI accelerator features a full PCIe Gen-3.0 2-lane interface (4-lane in M-key module), delivering unprecedented AI performance for edge devices.
The M.2 module can be plugged into an existing edge device with M.2 socket to execute in real-time and with low power deep neural network inferencing for a broad range of market segments.
Leveraging Hailo’s comprehensive Dataflow Compiler and its support for standard AI frameworks, customers can easily port their Neural Network models to the Hailo-8 and introduce high-performance AI products to the market quickly.
Reference : https://hailo.ai/products/ai-accelerators/hailo-8-m2-ai-acceleration-module/#hailo8-m2-overview
The Hailo-8 comes in three form factors, each with a convenient starter kit:
- M-Key : 4 PCIe lanes, Size 2280, Starter Kit HM218B1C2XAE
- B+M-Key : 2 PCIe lanes, Size 2280, Starter Kit HM218B1C2ZAE
- A+E-Key : 2 PCIe lanes, Size 2230, Starter Kit HM218B1C2YAE
The highest performing module is the M-Key with 4 PCIe lanes. However, our HSIO M.2 adapter does not support this module.
For this project, I chose to use the B+M-Key with 2 PCIe lanes. Having 2 lanes instead of 4 does not affect the Peak TOPS of the core AI engine. It does, however, limit the I/O bandwidth to the core AI engine, which will affect the FPS that can be achieved for larger networks.
The starter kit comes with several thermal options. In order to choose the right solution for my use case, I registered and referred to the abundant documentation on Hailo's developer zone:
https://developer.hailo.ai/developer-zone
I chose to use the natural thermal convection solution, with the heat sink provided in the starter kit, which is sufficient for 4W operation in 25C ambient air.
For this step, I referred to the experience of my colleague Gianluca Filippini (EBV), in order to get the Hailo-8 up and running with the following milestones:
- Milestone 1 - Hailo-8 detected on PCI express bus
- Milestone 2 - Hailo-8 detected by driver and runtime
- Milestone 3 - Hailo-8 working with TAPPAS
The starting point for our design is the ZUBoard Dual Camera design, which can be re-built on a correctly installed Linux machine with Vitis 2022.2 and Vitis-AI 3.0 as follows:
git clone https://github.com/Avnet/bdf bdf
git clone -b 2022.2 https://github.com/Avnet/hdl hdl
git clone -b 2022.2 https://github.com/Avnet/petalinux petalinux
cd petalinux
./scripts/make_zub1cg_sbc_dualcam.sh
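Note that the Avnet build scripts expect the Xilinx 2022.2 tools to be available in your environment; a typical setup looks like the following (the install paths are hypothetical, adjust them to match your machine):
# Hypothetical install locations - adjust to match your machine
source /tools/Xilinx/Vitis/2022.2/settings64.sh
source /opt/petalinux/2022.2/settings.sh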
Since the first milestone is to detect the Hailo-8 module via PCI express, we need to enable this functionality in our design.
The ZUBoard Dual Camera design has the HSIO DP-eMMC populated on the J2 connector, as shown in the following image:
The HSIO M.2 module, however, must be placed on the J2 connector, since it requires the lower transceiver lanes for the PCI express functionality. For this reason, we must move the HSIO DP-eMMC module to the J1 connector, as shown below:
This starts with our Vivado project, which can be edited as follows:
cd ../hdl/projects/zub1cg_sbc_dualcam_2022_2
vivado zub1cg_sbc_dualcam.xpr &
Open the block diagram, and double-click on the PS block.
Enable the following option to get access to the PCIe Configuration.
- Switch to Advanced Mode : Enabled
Select the PCIe Configuration and make the following changes:
- Device Port Type : Root Port
Next, select the I/O Configuration, open the High Speed peripheral section, and make the following changes to enable PCIe and move DP to the upper transceivers.
Make the following modifications for PCIe:
- PCIe : Enabled
- Rootport Mode Reset : MIO30
- Reset Polarity : Active Low
- Lane Selection : x2
Make the following modifications for DisplayPort:
- DPAUX : EMIO
- Lane Selection : Dual Higher
Select the Clock Configuration, and the Input Clock tab, and make the following modifications:
- GT Lane Reference Frequency (PCIe) : REFCLK0, 100MHz
- GT Lane Reference Frequency (DisplayPort) : REFCLK1, 135MHz
Click OK to save the modifications.
Connect the new DPAUX EMIO ports as External Ports in the block diagram, as shown below:
NOTE : the dp_aux_data_oe_n port requires a polarity inversion, which can be implemented with a util_vector_logic module.
Finally, the pin constraints for the DPAUX EMIO ports must be determined from the following hardware schematics:
Add the following constraints for the DPAUX EMIO ports in the design's XDC file.
#######################################################################
# DisplayPort HPD & AUX
#######################################################################
set_property IOSTANDARD LVCMOS12 [get_ports {dp_hot_plug_detect*}]
set_property IOSTANDARD LVCMOS12 [get_ports {dp_aux_data*}]
set_property PACKAGE_PIN K1 [get_ports dp_aux_data_out_0 ]; # HP_DP_15_P
set_property PACKAGE_PIN J1 [get_ports dp_hot_plug_detect_0 ]; # HP_DP_15_N
set_property PACKAGE_PIN D2 [get_ports dp_aux_data_oe_0 ]; # HP_DP_24_P
set_property PACKAGE_PIN C2 [get_ports dp_aux_data_in_0 ]; # HP_DP_24_N
Save all modifications, and Rebuild the bitstream.
When done, run the following command in the Vivado project's TCL Console to re-generate the XSA file:
write_hw_platform -file zub1cg_sbc_dualcam.xsa -include_bit -force
validate_hw_platform zub1cg_sbc_dualcam.xsa -verbose
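If you prefer to script this step rather than using the GUI, the same commands can be run in Vivado batch mode. The following is a minimal sketch, assuming the project location used earlier and the default implementation run name:
cd hdl/projects/zub1cg_sbc_dualcam_2022_2
cat > export_xsa.tcl << 'EOF'
# open the project (if needed, also open the implemented design first: open_run impl_1)
open_project zub1cg_sbc_dualcam.xpr
write_hw_platform -file zub1cg_sbc_dualcam.xsa -include_bit -force
validate_hw_platform zub1cg_sbc_dualcam.xsa -verbose
EOF
vivado -mode batch -source export_xsa.tcl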
The device tree definition in the petalinux project needs to be modified as follows:
project-spec/meta-avnet/recipes-bsp/device-tree/files/zub1cg-sbc/system-bsp.dtsi
...
/ {
...
gtr_refclk_pcie: gtr_refclk_pcie { /* PCIe - 100MHz */
compatible = "fixed-clock";
#clock-cells = <0>;
clock-frequency = <100000000>;
};
gtr_refclk_dp: gtr_refclk_dp { /* DP - 135MHz */
compatible = "fixed-clock";
#clock-cells = <0>;
clock-frequency = <135000000>;
};
};
...
&psgtr {
clocks = <&gtr_refclk_pcie>, <&gtr_refclk_dp>;
clock-names = "ref0","ref1";
};
/*
The cells contain the following arguments.
- description: The GTR lane
minimum: 0
maximum: 3
- description: The PHY type
enum:
- PHY_TYPE_DP
- PHY_TYPE_PCIE
- PHY_TYPE_SATA
- PHY_TYPE_SGMII
- PHY_TYPE_USB
- description: The PHY instance
minimum: 0
maximum: 1 # for DP, SATA or USB
maximum: 3 # for PCIE or SGMII
- description: The reference clock number
minimum: 0
maximum: 3
*/
&zynqmp_dpsub {
phy-names = "dp-phy0","dp-phy1";
//phys = <&psgtr 1 6 0 0>, <&psgtr 0 6 1 0>;
phys = <&psgtr 3 6 0 2>, <&psgtr 2 6 1 2>;
status = "okay";
xlnx,max-lanes = <2>;
};
&zynqmp_dpdma {
status = "okay";
};
&zynqmp_dp_snd_pcm0 {
status = "okay";
};
&zynqmp_dp_snd_pcm1 {
status = "okay";
};
&zynqmp_dp_snd_card0 {
status = "okay";
};
&zynqmp_dp_snd_codec0 {
status = "okay";
};
On the command line, rebuild the petalinux project, taking into account the new hardware XSA file:
cd petalinux/projects/zub1cg_sbc_dualcam_2022_2
petalinux-config --get-hw-description=../../../hdl/projects/zub1cg_sbc_dualcam_2022_2/zub1cg_sbc_dualcam.xsa --silentconfig
petalinux-build
This will re-generate the SD card image at the following location:
petalinux/projects/zub1cg_sbc_dualcam_2022_2/images/linux/rootfs.wic
Program this new image to a 16GB (or greater) micro-SD card, and boot the ZUBoard.
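One way to program the .wic image from a Linux host is with dd; this is a generic sketch, where /dev/sdX must be replaced with your actual SD card device (double-check the device node, since dd will overwrite it):
# Write the SD card image (replace /dev/sdX with your SD card device)
sudo dd if=petalinux/projects/zub1cg_sbc_dualcam_2022_2/images/linux/rootfs.wic of=/dev/sdX bs=4M conv=fsync status=progress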
Use the "lspci" command to detect the PCI express peripherals.
root@zub1cg-sbc-dualcam-2022-2:~# lspci
00:00.0 Bridge: Xilinx Corporation Device d011
01:00.0 Co-processor: Hailo Technologies Ltd. Hailo-8 AI Processor (rev 01)
Use the "lspci -vv" variant of the command to list the details of the Hailo-8 AI Processor.
root@zub1cg-sbc-dualcam-2022-2:~# lspci -vv
00:00.0 Bridge: Xilinx Corporation Device d011
...
01:00.0 Co-processor: Hailo Technologies Ltd. Hailo-8 AI Processor (rev 01)
Subsystem: Hailo Technologies Ltd. Hailo-8 AI Processor
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 111
Region 0: Memory at 600000000 (64-bit, prefetchable) [size=16K]
Region 2: Memory at 600008000 (64-bit, prefetchable) [size=4K]
Region 4: Memory at 600004000 (64-bit, prefetchable) [size=16K]
Capabilities: [80] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [e0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [f8] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Capabilities: [108 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [110 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [128 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [200 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [300 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
We are doing great!
Next, we will start adding the yocto recipes for the Hailo-8 driver, runtime, and APIs.
Milestone 2 - Hailo-8 detected by driver and runtime
Now that we have a working petalinux project that can detect the Hailo-8 module on the PCI express bus, we can add its driver and run-time. This content is available as a yocto recipe on github, and can be obtained as follows:
cd project-spec
git clone -b honister https://github.com/hailo-ai/meta-hailo
cd ..
The gstreamer patches included with this repo do not work on our petalinux project, so we have to remove them from the following file:
project-spec/meta-hailo/meta-hailo-tappas/recipes-multimedia/gstreamer/gstreamer1.0-plugins-base_%.imx.bbappend
Remove the two patch files from the .bbappend file:
FILESEXTRAPATHS:prepend := "${THISDIR}/files:"
SRC_URI += " \
file://allocate-cached-ion-buffer.patch \
file://dont-pre-allocate-ion-buffers-gldownload-and-glupload.patch \
"
So that it reads as follows:
FILESEXTRAPATHS:prepend := "${THISDIR}/files:"
SRC_URI += " \
"
Now we need to make our petalinux project aware of these new yocto recipes. This can be done in our main "config" file:
project-spec/configs/config
...
#
# User Layers
#
CONFIG_USER_LAYER_0="${proot}/project-spec/meta-avnet"
CONFIG_USER_LAYER_1="${proot}/project-spec/meta-hailo/meta-hailo-accelerator"
CONFIG_USER_LAYER_2="${proot}/project-spec/meta-hailo/meta-hailo-libhailort"
CONFIG_USER_LAYER_3="${proot}/project-spec/meta-hailo/meta-hailo-tappas"
CONFIG_USER_LAYER_4=""
...
Next, we declare the existence of the following Hailo-8 packages in our "userrootfsconfig" file:
project-spec/meta-user/config/userrootfsconfig
...
CONFIG_hailortcli
CONFIG_libhailort
CONFIG_pyhailort
CONFIG_hailo-pci
CONFIG_hailo-firmware
Finally, we add the packages required for the driver and run-time to our project in our "rootfsconfig" file:
project-spec/configs/rootfsconfig
...
#
# User Packages
#
# CONFIG_gpio-demo is not set
# CONFIG_peekpoke is not set
CONFIG_hailortcli=y
CONFIG_libhailort=y
CONFIG_pyhailort=y
CONFIG_hailo-pci=y
CONFIG_hailo-firmware=y
...
On the command line, rebuild the petalinux project:
petalinux-build
This will re-generate the SD card image at the following location:
petalinux/projects/zub1cg_sbc_dualcam_2022_2/images/linux/rootfs.wic
Program this new image to a 16GB (or greater) micro-SD card, and boot the ZUBoard.
During boot, you will notice new messages specific to the Hailo-8 driver:
[ 7.607911] hailo: Init module. driver version 4.15.0
[ 7.608165] hailo 0000:01:00.0: Probing on: 1e60:2864...
[ 7.608174] hailo 0000:01:00.0: Probing: Allocate memory for device extension, 11592
[ 7.608209] pci 0000:00:00.0: enabling device (0000 -> 0002)
[ 7.608225] hailo 0000:01:00.0: enabling device (0000 -> 0002)
[ 7.608235] hailo 0000:01:00.0: Probing: Device enabled
[ 7.608287] hailo 0000:01:00.0: Probing: mapped bar 0 - (____ptrval____) 16384
[ 7.608301] hailo 0000:01:00.0: Probing: mapped bar 2 - (____ptrval____) 4096
[ 7.608312] hailo 0000:01:00.0: Probing: mapped bar 4 - (____ptrval____) 16384
[ 7.608324] hailo 0000:01:00.0: Probing: Setting max_desc_page_size to 4096, (page_size=4096)
[ 7.608333] hailo 0000:01:00.0: Probing: Using userspace allocated vdma buffers
[ 7.608356] hailo 0000:01:00.0: Probing: Enabled 64 bit dma
[ 7.608365] hailo 0000:01:00.0: Disabling ASPM L0s
[ 7.608377] hailo 0000:01:00.0: Successfully disabled ASPM L0s
[ 7.824552] hailo 0000:01:00.0: Firmware was loaded successfully
[ 7.851488] hailo 0000:01:00.0: Probing: Added board 1e60-2864, /dev/hailo0
The presence of the Hailo-8 driver can be confirmed with the "lsmod" command.
root@zub1cg-sbc-m2-2022-2:~# lsmod
Module Size Used by
zocl 184320 0
hailo_pci 65536 0
uio_pdrv_genirq 16384 0
dmaproxy 16384 0
Also, the "lspci -vv" command will output additional content relating to the Hailo-8 driver:
root@zub1cg-sbc-dualcam-2022-2:~# lspci -vv
...
Kernel driver in use: hailo
Kernel modules: hailo_pci
We can verify the Hailo-8 run-time with the "hailortcli" command, as shown below:
root@zub1cg-sbc-m2-2022-2:~# hailortcli fw-control identify -s 01:00.0
Executing on device: 0000:01:00.0
Identifying board
Control Protocol Version: 2
Firmware Version: 4.15.0 (release,app,extended context switch buffer)
Logger Version: 0
Board Name: Hailo-8
Device Architecture: HAILO8
Serial Number: HLLWMB0214600101
Part Number: HM218B1C2LA
Product Name: HAILO-8 AI ACCELERATOR M.2 B+M KEY MODULE
We have successfully detected the Hailo-8 AI Accelerator M.2 B+M Key module!
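As a quicker sanity check, the run-time also provides a scan command, which simply lists the detected devices (the exact output varies with the HailoRT version):
hailortcli scan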
We can already run some benchmarks with the run-time.
First, download some pre-compiled models from the Hailo model zoo:
https://github.com/hailo-ai/hailo_model_zoo/blob/master/docs/PUBLIC_MODELS.rst
If your ZUBoard is connected to the internet, this can be done directly on the embedded platform with the "wget" utility:
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/resnet_v1_50.hef
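Optionally, the downloaded HEF can be inspected before running it with the parse-hef command (shown here as a quick check; the details reported vary by HailoRT version):
hailortcli parse-hef resnet_v1_50.hef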
Then run some benchmarks with the "hailortcli" utility:
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark resnet_v1_50.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network resnet_v1_50/resnet_v1_50: 100% | 19686 | FPS: 1312.33 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network resnet_v1_50/resnet_v1_50: 100% | 19686 | FPS: 1312.34 | ETA: 00:00:00
Measuring HW Latency
Network resnet_v1_50/resnet_v1_50: 100% | 4952 | HW Latency: 2.85 ms | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 1312.35
(streaming) = 1312.35
Latency (hw) = 2.85141 ms
Device 0000:01:00.0:
Power in streaming mode (average) = 3.8952 W
(max) = 3.92493 W
root@zub1cg-sbc-dualcam-2022-2:~#
Next, we will add TAPPAS, which is the application-level software, including the programming APIs and gstreamer plug-ins.
Milestone 3 - Hailo-8 working with TAPPAS
Now that we have proven that the Hailo-8 driver and run-time are working, we can add the application layer, TAPPAS, to our project.
We declare the existence of the following Hailo-8 packages in our "userrootfsconfig" file:
project-spec/meta-user/config/userrootfsconfig
...
CONFIG_hailortcli
CONFIG_libhailort
CONFIG_pyhailort
CONFIG_hailo-pci
CONFIG_hailo-firmware
CONFIG_libgsthailo
CONFIG_libgsthailotools
CONFIG_tappas-apps
CONFIG_hailo-post-processes
Finally, we add all of the required packages, including those for TAPPAS, to our project in our "rootfsconfig" file:
project-spec/configs/rootfsconfig
...
#
# User Packages
#
# CONFIG_gpio-demo is not set
# CONFIG_peekpoke is not set
CONFIG_hailortcli=y
CONFIG_libhailort=y
CONFIG_pyhailort=y
CONFIG_hailo-pci=y
CONFIG_hailo-firmware=y
CONFIG_libgsthailo=y
CONFIG_libgsthailotools=y
CONFIG_tappas-apps=y
CONFIG_hailo-post-processes=y
...
On the command line, rebuild the petalinux project:
petalinux-build
This will re-generate the SD card image at the following location:
petalinux/projects/zub1cg_sbc_dualcam_2022_2/images/linux/rootfs.wic
Program this new image to a 16GB (or greater) micro-SD card, and boot the ZUBoard.
You will notice the presence of an "apps" directory in the root user's home directory.
root@zub1cg-sbc-dualcam-2022-2:~# ls -R apps
apps:
detection license_plate_recognition multistream_detection
apps/detection:
detection.sh resources
apps/detection/resources:
configs yolov5m_yuv.hef
apps/detection/resources/configs:
yolov5.json
apps/license_plate_recognition:
license_plate_recognition.sh resources
apps/license_plate_recognition/resources:
configs liblpr_ocrsink.so liblpr_overlay.so lpr.raw lprnet_yuy2.hef tiny_yolov4_license_plates_yuy2.hef yolov5m_vehicles_no_ddr_yuy2.hef
apps/license_plate_recognition/resources/configs:
yolov4_license_plate.json yolov5_vehicle_detection.json
apps/multistream_detection:
multi_stream_detection.sh resources
apps/multistream_detection/resources:
configs detection0.mp4 detection1.mp4 detection2.mp4 detection3.mp4 detection4.mp4 detection5.mp4 yolov5s_personface_nv12_no_ddr.hef
apps/multistream_detection/resources/configs:
yolov5_personface.json
root@zub1cg-sbc-dualcam-2022-2:~#
We can run benchmarks on the models included in these directories as well:
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli run2 set-net ./apps/detection/resources/yolov5m_yuv.hef
[HailoRT CLI] [warning] "hailortcli run2" is not optimized for single model usage. It is recommended to use "hailortcli run" command for a single model
[===================>] 100% 00:00:00
yolov5m_yuv: fps: 86.50
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark ./apps/detection/resources/yolov5m_yuv.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network yolov5m_yuv/yolov5m_yuv: 100% | 1548 | FPS: 103.19 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network yolov5m_yuv/yolov5m_yuv: 100% | 1425 | FPS: 94.93 | ETA: 00:00:00
Measuring HW Latency
Network yolov5m_yuv/yolov5m_yuv: 100% | 480 | HW Latency: 21.52 ms | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 103.193
(streaming) = 94.9366
Latency (hw) = 21.5191 ms
Device 0000:01:00.0:
Power in streaming mode (average) = 3.06434 W
(max) = 3.15601 W
The demo scripts were created for i.MX platforms, so they need to be modified for use with the ZUBoard Dual Camera design.
I have provided one such script in the attachment section of this project : zub1cg_dualcam_hailo8_detection.sh
The script accepts the following arguments:
- mode : primary (left sensor), secondary (right sensor), dual (both sensors)
- width : width of image generated by capture pipeline
- height : height of image generated by capture pipeline
- format : format (yuv, rgb) generated by capture pipeline
- sink : window, dp, fake
root@zub1cg-sbc-dualcam-2022-2:~/apps/detection# ./zub1cg_dualcam_hailo8_detection.sh --help
unknown arg --help
USAGE: zub1cg_dualcam_hailo8_detection.sh [OPTIONS]
-m|--mode mode must be 'dual', 'primary' or 'secondary'
-s|--sink sink must be 'dp', 'window' or 'fake'
-f|--format output format must be 'yuv' or 'rgb'
-w|--width output width
-h|--height output height
root@zub1cg-sbc-dualcam-2022-2:~/apps/detection#
Since the yolov5m_yuv.hef model in this example expects a 1280x720 YUV input, the script defaults to these values.
We can select to use the left, right, or both (side-by-side) sensors as input video.
We can also select to use any of the following outputs:
- window (output to desktop, lowest performance due to color format conversion)
- dp (output to native DP, better performance)
- fake (no output, highest performance)
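For example, to run detection on the right sensor with no display output, the attached script would be invoked as follows (combining the options above):
./zub1cg_dualcam_hailo8_detection.sh --mode secondary --sink fake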
First let's try outputting to the desktop:
root@zub1cg-sbc-dualcam-2022-2:~/apps/detection# export DISPLAY=:0.0
root@zub1cg-sbc-dualcam-2022-2:~/apps/detection# ./zub1cg_dualcam_hailo8_detection.sh --mode primary --sink window
WARNING: format not set: using default 'yuv' format
WARNING: output resolution not set: using default '1280x720' resolution
Run Camera with: mode=primary, sink=window, output resolution=1280x720, format=yuv
+ media-ctl -d /dev/media0 -V ''\''ap1302.0-003c'\'':2 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0000000.mipi_csi2_rx_subsystem'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0000000.mipi_csi2_rx_subsystem'\'':1 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0020000.v_proc_ss'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0020000.v_proc_ss'\'':1 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0040000.v_proc_ss'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0040000.v_proc_ss'\'':1 [fmt:UYVY8_1X16/1280x720 field:none]'
+ set +x
+ media-ctl -d /dev/media0 -l ''\''0-003c.ar0144.0'\'':0 -> '\''ap1302.0-003c'\'':0[1]'
+ media-ctl -d /dev/media0 -l ''\''0-003c.ar0144.1'\'':0 -> '\''ap1302.0-003c'\'':1[0]'
+ set +x
Detected AR0144 - disabling AWB
Detected AR0144 - setting brightness
+ gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf '!' 'video/x-raw, width=1280, height=720, format=YUY2, framerate=60/1' '!' queue leaky=downstream max-size-buffers=5 max-size-bytes=0 max-size-time=0 '!' hailonet hef-path=/home/root/apps/detection/resources/yolov5m_yuv.hef '!' queue leaky=no max-size-buffers=30 max-size-bytes=0 max-size-time=0 '!' hailofilter function-name=yolov5 config-path=/home/root/apps/detection/resources/configs/yolov5.json so-path=/usr/lib/hailo-post-processes/libyolo_post.so qos=false '!' queue leaky=no max-size-buffers=30 max-size-bytes=0 max-size-time=0 '!' hailooverlay '!' queue leaky=downstream max-size-buffers=5 max-size-bytes=0 max-size-time=0 '!' videoconvert '!' fpsdisplaysink 'video-sink='\''autovideosink'\''' text-overlay=false sync=false -v
...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 2, dropped: 0, current: 3.89, average: 3.89
The "media-ctl" and "v4l2-ctl" commands configure the following MIPI capture pipeline.
Press CTRL-C to stop the gstreamer pipeline.
As expected, this is the example with the lowest performance, since significant computation is being done on the CPU for color format conversion (i.e. videoconvert).
Next, let's try outputting directly to the DP monitor (DRM driver).
root@zub1cg-sbc-dualcam-2022-2:~/apps/detection# ./zub1cg_dualcam_hailo8_detection.sh --mode primary --sink dp
WARNING: format not set: using default 'yuv' format
WARNING: output resolution not set: using default '1280x720' resolution
Run Camera with: mode=primary, sink=dp, output resolution=1280x720, format=yuv
+ media-ctl -d /dev/media0 -V ''\''ap1302.0-003c'\'':2 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0000000.mipi_csi2_rx_subsystem'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0000000.mipi_csi2_rx_subsystem'\'':1 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0020000.v_proc_ss'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0020000.v_proc_ss'\'':1 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0040000.v_proc_ss'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0040000.v_proc_ss'\'':1 [fmt:UYVY8_1X16/1280x720 field:none]'
+ set +x
+ media-ctl -d /dev/media0 -l ''\''0-003c.ar0144.0'\'':0 -> '\''ap1302.0-003c'\'':0[1]'
+ media-ctl -d /dev/media0 -l ''\''0-003c.ar0144.1'\'':0 -> '\''ap1302.0-003c'\'':1[0]'
+ set +x
Detected AR0144 - disabling AWB
Detected AR0144 - setting brightness
trying to open device 'i915'...done
setting mode 1280x720-60.00Hz on connectors 43, crtc 41
testing 1280x720@YUYV overlay plane 39
+ gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf '!' 'video/x-raw, width=1280, height=720, format=YUY2, framerate=60/1' '!' queue leaky=downstream max-size-buffers=5 max-size-bytes=0 max-size-time=0 '!' hailonet hef-path=/home/root/apps/detection/resources/yolov5m_yuv.hef '!' queue leaky=no max-size-buffers=30 max-size-bytes=0 max-size-time=0 '!' hailofilter function-name=yolov5 config-path=/home/root/apps/detection/resources/configs/yolov5.json so-path=/usr/lib/hailo-post-processes/libyolo_post.so qos=false '!' queue leaky=no max-size-buffers=30 max-size-bytes=0 max-size-time=0 '!' hailooverlay '!' queue leaky=downstream max-size-buffers=5 max-size-bytes=0 max-size-time=0 '!' fpsdisplaysink 'video-sink='\''kmssink' plane-id=39 bus-id=fd4a0000.display 'render-rectangle="<0,0,1280,720>"'\''' fullscreen-overlay=true sync=false -v
...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0/GstTextOverlay:fps-display-text-overlay: text = rendered: 796, dropped: 0, current: 28.36, average: 25.36
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 796, dropped: 0, current: 28.36, average: 25.36
Notice that this time the performance is much better, since there are no longer any color format conversions being executed on the CPU.
If we remove the output, we will get the highest performance achievable:
root@zub1cg-sbc-dualcam-2022-2:~/apps/detection# ./zub1cg_dualcam_hailo8_detection.sh --mode primary --sink fake
WARNING: format not set: using default 'yuv' format
WARNING: output resolution not set: using default '1280x720' resolution
Run Camera with: mode=primary, sink=fake, output resolution=1280x720, format=yuv
+ media-ctl -d /dev/media0 -V ''\''ap1302.0-003c'\'':2 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0000000.mipi_csi2_rx_subsystem'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0000000.mipi_csi2_rx_subsystem'\'':1 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0020000.v_proc_ss'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0020000.v_proc_ss'\'':1 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0040000.v_proc_ss'\'':0 [fmt:UYVY8_1X16/1280x800 field:none]'
+ media-ctl -d /dev/media0 -V ''\''b0040000.v_proc_ss'\'':1 [fmt:UYVY8_1X16/1280x720 field:none]'
+ set +x
+ media-ctl -d /dev/media0 -l ''\''0-003c.ar0144.0'\'':0 -> '\''ap1302.0-003c'\'':0[1]'
+ media-ctl -d /dev/media0 -l ''\''0-003c.ar0144.1'\'':0 -> '\''ap1302.0-003c'\'':1[0]'
+ set +x
Detected AR0144 - disabling AWB
Detected AR0144 - setting brightness
+ gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf '!' 'video/x-raw, width=1280, height=720, format=YUY2, framerate=60/1' '!' queue leaky=downstream max-size-buffers=5 max-size-bytes=0 max-size-time=0 '!' hailonet hef-path=/home/root/apps/detection/resources/yolov5m_yuv.hef '!' queue leaky=no max-size-buffers=30 max-size-bytes=0 max-size-time=0 '!' hailofilter function-name=yolov5 config-path=/home/root/apps/detection/resources/configs/yolov5.json so-path=/usr/lib/hailo-post-processes/libyolo_post.so qos=false '!' queue leaky=no max-size-buffers=30 max-size-bytes=0 max-size-time=0 '!' hailooverlay '!' queue leaky=downstream max-size-buffers=5 max-size-bytes=0 max-size-time=0 '!' fpsdisplaysink 'video-sink='\''fakevideosink'\''' text-overlay=false sync=false -v
...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 1136, dropped: 0, current: 32.37, average: 31.77
This completes our Hailo-8 integration for the ZUBoard Dual Camera design.
We already know that a real application consists of more than just the deep learning models. There is also the pre-processing and post-processing to consider.
By leveraging the ZU1 device's PL for the pre-processing, and the Hailo-8 module for the deep learning models, we were able to achieve real-time performance in a realistic camera to display application.
So It Begins
Now that we have a working Hailo-8 module, the time has come to run some benchmarks. The last time I felt this level of anticipation was before the Battle of Helm's Deep...
Benchmarking is only meaningful if we have a baseline to compare against.
In this project, the baseline is our ZUBoard Dual Camera design. For this design, the B128 DPU is the DPU that fits inside the remaining resources of the PL when we have the MIPI capture pipeline for our HSIO DualCam module.
In order to run these benchmarks, I will refer back to the SD image from my previous ZUBoard 2022.2 project:
http://avnet.me/vitis-ai-3.0-robot-control
The setup for this set of benchmarks is shown below:
For convenience, we will configure the image to boot with the dualcam-dpu design:
root@zub1cg-sbc-2022-2:~# xmutil listapps
Accelerator Accel_type #slots(PL+AIE) Active_slot
avnet-zub1cg-dualcam-dpu XRT_FLAT (0+0) -1
avnet-zub1cg-dualcam XRT_FLAT (0+0) -1
avnet-zub1cg-base XRT_FLAT (0+0) -1
avnet-zub1cg-benchmark XRT_FLAT (0+0) 0,
root@zub1cg-sbc-2022-2:~# cat /etc/dfx-mgrd/default_firmware
avnet-zub1cg-benchmark
root@zub1cg-sbc-2022-2:~# echo avnet-zub1cg-dualcam-dpu > /etc/dfx-mgrd/default_firmware
root@zub1cg-sbc-2022-2:~# cat /etc/dfx-mgrd/default_firmware
avnet-zub1cg-dualcam-dpu
We also want to configure the image to use the models pre-compiled for the B128 DPU:
root@zub1cg-sbc-2022-2:~# ls -la /usr/share/vitis_ai_library
total 44
drwxr-xr-x 6 root root 4096 Nov 20 16:41 .
drwxr-xr-x 327 root root 16384 Mar 9 2018 ..
lrwxrwxrwx 1 root root 14 Nov 20 16:41 models -> models.b512-lr
drwxr-xr-x 40 root root 4096 May 6 2023 models.b128-lr
drwxr-xr-x 144 root root 12288 Apr 20 2023 models.b512-lr
drwxr-xr-x 5 root root 4096 Mar 9 2018 samples
drwxr-xr-x 71 root root 4096 Mar 9 2018 test
root@zub1cg-sbc-2022-2:~# rm /usr/share/vitis_ai_library/models
root@zub1cg-sbc-2022-2:~# ln -sf models.b128-lr /usr/share/vitis_ai_library/models
root@zub1cg-sbc-2022-2:~# ls -la /usr/share/vitis_ai_library
total 44
drwxr-xr-x 6 root root 4096 Nov 20 16:41 .
drwxr-xr-x 327 root root 16384 Mar 9 2018 ..
lrwxrwxrwx 1 root root 14 Nov 20 16:41 models -> models.b128-lr
drwxr-xr-x 40 root root 4096 May 6 2023 models.b128-lr
drwxr-xr-x 144 root root 12288 Apr 20 2023 models.b512-lr
drwxr-xr-x 5 root root 4096 Mar 9 2018 samples
drwxr-xr-x 71 root root 4096 Mar 9 2018 test
Finally, we can reboot the design:
root@zub1cg-sbc-2022-2:~# reboot
After reboot, we can verify that the ZUBoard has booted with the dualcam-dpu as follows:
root@zub1cg-sbc-2022-2:~# xmutil listapps
Accelerator Accel_type #slots(PL+AIE) Active_slot
avnet-zub1cg-dualcam-dpu XRT_FLAT (0+0) 0,
avnet-zub1cg-dualcam XRT_FLAT (0+0) -1
avnet-zub1cg-base XRT_FLAT (0+0) -1
avnet-zub1cg-benchmark XRT_FLAT (0+0) -1
root@zub1cg-sbc-2022-2:~# ls -la /usr/share/vitis_ai_library/models
lrwxrwxrwx 1 root root 14 Nov 20 16:41 /usr/share/vitis_ai_library/models -> models.b128-lr
The major challenge with benchmarking the B128 DPU is that I was not able to compile most of the Vitis-AI model zoo for this architecture.
I was able to compile the following models, which have equivalents in the Hailo model zoo:
- resnet50
- mobilenet v1
root@zub1cg-sbc-2022-2:~/benchmarking_b128# source ./benchmark_power8.sh
===========================================================
resnet50_tf2 classification classification
===========================================================
/home/root/Vitis-AI/examples/vai_library/samples/classification
-----------------------------------------------------------
./test_performance_classification resnet50_tf2 ./test_performance_classification.list -s 40 -t 1
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:43:04.299680 984 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:43:04.300307 984 benchmark.hpp:214] waiting for 0/40 seconds, 1 threads running
I1120 19:43:14.300457 984 benchmark.hpp:214] waiting for 10/40 seconds, 1 threads running
I1120 19:43:24.300678 984 benchmark.hpp:214] waiting for 20/40 seconds, 1 threads running
I1120 19:43:34.300899 984 benchmark.hpp:214] waiting for 30/40 seconds, 1 threads running
I1120 19:43:44.301200 984 benchmark.hpp:222] waiting for threads terminated
FPS=5.53132
E2E_MEAN=180676
DPU_MEAN=180023
-----------------------------------------------------------
IDLE
-----------------------------------------------------------
./test_performance_classification resnet50_tf2 ./test_performance_classification.list -s 40 -t 2
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:44:25.529937 990 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:44:25.722285 990 benchmark.hpp:214] waiting for 0/40 seconds, 2 threads running
I1120 19:44:35.722501 990 benchmark.hpp:214] waiting for 10/40 seconds, 2 threads running
I1120 19:44:45.722723 990 benchmark.hpp:214] waiting for 20/40 seconds, 2 threads running
I1120 19:44:55.722946 990 benchmark.hpp:214] waiting for 30/40 seconds, 2 threads running
I1120 19:45:05.723246 990 benchmark.hpp:222] waiting for threads terminated
FPS=5.55263
-----------------------------------------------------------
IDLE
-----------------------------------------------------------
./test_performance_classification resnet50_tf2 ./test_performance_classification.list -s 40 -t 4
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:45:47.175784 994 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:45:47.754333 994 benchmark.hpp:214] waiting for 0/40 seconds, 4 threads running
I1120 19:45:57.755023 994 benchmark.hpp:214] waiting for 10/40 seconds, 4 threads running
I1120 19:46:07.755244 994 benchmark.hpp:214] waiting for 20/40 seconds, 4 threads running
I1120 19:46:17.755471 994 benchmark.hpp:214] waiting for 30/40 seconds, 4 threads running
I1120 19:46:27.755781 994 benchmark.hpp:222] waiting for threads terminated
FPS=5.54796
-----------------------------------------------------------
IDLE
-----------------------------------------------------------
./test_performance_classification resnet50_tf2 ./test_performance_classification.list -s 40 -t 8
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:47:09.483896 1001 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:47:10.621503 1001 benchmark.hpp:214] waiting for 0/40 seconds, 8 threads running
I1120 19:47:20.621737 1001 benchmark.hpp:214] waiting for 10/40 seconds, 8 threads running
I1120 19:47:30.621934 1001 benchmark.hpp:214] waiting for 20/40 seconds, 8 threads running
I1120 19:47:40.622140 1001 benchmark.hpp:214] waiting for 30/40 seconds, 8 threads running
I1120 19:47:50.622427 1001 benchmark.hpp:222] waiting for threads terminated
FPS=5.54612
-----------------------------------------------------------
IDLE
===========================================================
mobilenet_1_0_224_tf2 classification classification
===========================================================
/home/root/Vitis-AI/examples/vai_library/samples/classification
-----------------------------------------------------------
./test_performance_classification mobilenet_1_0_224_tf2 ./test_performance_classification.list -s 40 -t 1
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:48:32.787725 1011 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:48:32.788704 1011 benchmark.hpp:214] waiting for 0/40 seconds, 1 threads running
I1120 19:48:42.788931 1011 benchmark.hpp:214] waiting for 10/40 seconds, 1 threads running
I1120 19:48:52.789491 1011 benchmark.hpp:214] waiting for 20/40 seconds, 1 threads running
I1120 19:49:02.789690 1011 benchmark.hpp:214] waiting for 30/40 seconds, 1 threads running
I1120 19:49:12.789969 1011 benchmark.hpp:222] waiting for threads terminated
FPS=20.4987
E2E_MEAN=48771.8
DPU_MEAN=48223.7
-----------------------------------------------------------
IDLE
-----------------------------------------------------------
./test_performance_classification mobilenet_1_0_224_tf2 ./test_performance_classification.list -s 40 -t 2
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:49:53.278578 1017 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:49:53.319661 1017 benchmark.hpp:214] waiting for 0/40 seconds, 2 threads running
I1120 19:50:03.319864 1017 benchmark.hpp:214] waiting for 10/40 seconds, 2 threads running
I1120 19:50:13.320076 1017 benchmark.hpp:214] waiting for 20/40 seconds, 2 threads running
I1120 19:50:23.320278 1017 benchmark.hpp:214] waiting for 30/40 seconds, 2 threads running
I1120 19:50:33.320550 1017 benchmark.hpp:222] waiting for threads terminated
FPS=20.7599
-----------------------------------------------------------
IDLE
-----------------------------------------------------------
./test_performance_classification mobilenet_1_0_224_tf2 ./test_performance_classification.list -s 40 -t 4
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:51:13.815217 1021 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:51:13.934170 1021 benchmark.hpp:214] waiting for 0/40 seconds, 4 threads running
I1120 19:51:23.934732 1021 benchmark.hpp:214] waiting for 10/40 seconds, 4 threads running
I1120 19:51:33.934931 1021 benchmark.hpp:214] waiting for 20/40 seconds, 4 threads running
I1120 19:51:43.935127 1021 benchmark.hpp:214] waiting for 30/40 seconds, 4 threads running
I1120 19:51:53.935407 1021 benchmark.hpp:222] waiting for threads terminated
FPS=20.7532
-----------------------------------------------------------
IDLE
-----------------------------------------------------------
./test_performance_classification mobilenet_1_0_224_tf2 ./test_performance_classification.list -s 40 -t 8
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1120 19:52:34.569006 1028 benchmark.hpp:187] writing report to <STDOUT>
I1120 19:52:34.851023 1028 benchmark.hpp:214] waiting for 0/40 seconds, 8 threads running
I1120 19:52:44.851577 1028 benchmark.hpp:214] waiting for 10/40 seconds, 8 threads running
I1120 19:52:54.851781 1028 benchmark.hpp:214] waiting for 20/40 seconds, 8 threads running
I1120 19:53:04.851984 1028 benchmark.hpp:214] waiting for 30/40 seconds, 8 threads running
I1120 19:53:14.852298 1028 benchmark.hpp:222] waiting for threads terminated
FPS=20.7238
-----------------------------------------------------------
IDLE
root@zub1cg-sbc-2022-2:~/Vitis-AI/examples/vai_library/samples/classification#
The power measurements were performed manually with a power meter directly at the electrical outlet.
The setup for this set of benchmarks is shown below:
Although we have already been running some benchmarks, we have yet to compare the performance with the Vitis-AI B128 DPU to validate the theoretical 684x improvement.
For this purpose, let's download more models for inference on Hailo-8:
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/mobilenet_v1.hef
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/mobilenet_v2_1.0.hef
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/mobilenet_v3.hef
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/mobilenet_v3_large_minimalistic.hef
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/ssd_mobilenet_v1.hef
wget https://hailo-model-zoo.s3.eu-west-2.amazonaws.com/ModelZoo/Compiled/v2.9.0/ssd_mobilenet_v2.hef
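To save a bit of typing, the downloaded models can also be benchmarked in a single loop; this is a small shell sketch wrapping the same hailortcli benchmark command used below:
# Benchmark every downloaded HEF in sequence
for hef in *.hef; do
    echo "===== $hef ====="
    hailortcli benchmark "$hef"
done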
Now let's run the benchmarks for the following models:
- resnet50
- mobilenet v1
- mobilenet v2
- SSD mobilenet v1
- SSD mobilenet v2
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark resnet_v1_50.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network resnet_v1_50/resnet_v1_50: 100% | 19684 | FPS: 1312.20 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network resnet_v1_50/resnet_v1_50: 100% | 19686 | FPS: 1312.33 | ETA: 00:00:00
Measuring HW Latency
Network resnet_v1_50/resnet_v1_50: 100% | 4934 | HW Latency: 2.85 ms | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 1312.21
(streaming) = 1312.35
Latency (hw) = 2.85164 ms
Device 0000:01:00.0:
Power in streaming mode (average) = 3.84106 W
(max) = 3.87137 W
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark mobilenet_v1.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network mobilenet_v1/mobilenet_v1: 100% | 52360 | FPS: 3490.48 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network mobilenet_v1/mobilenet_v1: 100% | 52360 | FPS: 3490.45 | ETA: 00:00:00
Measuring HW Latency
Network mobilenet_v1/mobilenet_v1: 100% | 9706 | HW Latency: 1.30 ms | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 3490.53
(streaming) = 3490.49
Latency (hw) = 1.29576 ms
Device 0000:01:00.0:
Power in streaming mode (average) = 3.32183 W
(max) = 3.34346 W
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark mobilenet_v2_1.0.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network mobilenet_v2_1_0/mobilenet_v2_1_0: 100% | 36669 | FPS: 2444.44 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network mobilenet_v2_1_0/mobilenet_v2_1_0: 100% | 36668 | FPS: 2444.38 | ETA: 00:00:00
Measuring HW Latency
Network mobilenet_v2_1_0/mobilenet_v2_1_0: 100% | 7114 | HW Latency: 1.86 ms | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 2444.48
(streaming) = 2444.41
Latency (hw) = 1.85706 ms
Device 0000:01:00.0:
Power in streaming mode (average) = 2.27321 W
(max) = 2.27998 W
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark ssd_mobilenet_v1.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network ssd_mobilenet_v1/ssd_mobilenet_v1: 100% | 4816 | FPS: 321.04 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network ssd_mobilenet_v1/ssd_mobilenet_v1: 100% | 5382 | FPS: 358.77 | ETA: 00:00:00
Measuring HW Latency
[HailoRT] [warning] HW Latency measurement is not supported on NMS networks
Network ssd_mobilenet_v1/ssd_mobilenet_v1: 100% | 2525 | HW Latency: NaN | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 316.58
(streaming) = 358.774
Device 0000:01:00.0:
Power in streaming mode (average) = 1.49031 W
(max) = 1.79797 W
root@zub1cg-sbc-dualcam-2022-2:~# hailortcli benchmark ssd_mobilenet_v2.hef
Starting Measurements...
Measuring FPS in hw_only mode
Network ssd_mobilenet_v2/ssd_mobilenet_v2: 100% | 1753 | FPS: 116.86 | ETA: 00:00:00
Measuring FPS and Power in streaming mode
[HailoRT] [warning] Using the overcurrent protection dvm for power measurement will disable the overcurrent protection.
If only taking one measurement, the protection will resume automatically.
If doing continuous measurement, to enable overcurrent protection again you have to stop the power measurement on this dvm.
Network ssd_mobilenet_v2/ssd_mobilenet_v2: 100% | 1752 | FPS: 116.79 | ETA: 00:00:00
Measuring HW Latency
[HailoRT] [warning] HW Latency measurement is not supported on NMS networks
Network ssd_mobilenet_v2/ssd_mobilenet_v2: 100% | 1752 | HW Latency: NaN | ETA: 00:00:00
=======
Summary
=======
FPS (hw_only) = 116.861
(streaming) = 116.79
Device 0000:01:00.0:
Power in streaming mode (average) = 1.06451 W
(max) = 1.0673 W
root@zub1cg-sbc-dualcam-2022-2:~#
The "benchmark" command only measures the latency for the model. In order to get the "overall latency", we need to use the "run" command with the "--measure-latency" and "--measure-overall-latency" options.
root@zub1cg-sbc-2022-2:~/hailo_benchmarks# hailortcli run --measure-latency --measure-overall-latency resnet_v1_50.hef
Running streaming inference (resnet_v1_50.hef):
Transform data: true
Type: auto
Quantized: true
Network resnet_v1_50/resnet_v1_50: 100% | 1654 | HW Latency: 2.85 ms | ETA: 00:00:00
> Inference result:
Network group: resnet_v1_50
Frames count: 1654
HW Latency: 2.85 ms
Overall Latency: 3.00 ms
root@zub1cg-sbc-2022-2:~/hailo_benchmarks# hailortcli run --measure-latency --measure-overall-latency mobilenet_v1.hef
Running streaming inference (mobilenet_v1.hef):
Transform data: true
Type: auto
Quantized: true
Network mobilenet_v1/mobilenet_v1: 100% | 3262 | HW Latency: 1.30 ms | ETA: 00:00:00
> Inference result:
Network group: mobilenet_v1
Frames count: 3262
HW Latency: 1.30 ms
Overall Latency: 1.51 ms
root@zub1cg-sbc-2022-2:~/hailo_benchmarks# hailortcli run --measure-latency --measure-overall-latency mobilenet_v2_1.0.hef
Running streaming inference (mobilenet_v2_1.0.hef):
Transform data: true
Type: auto
Quantized: true
Network mobilenet_v2_1_0/mobilenet_v2_1_0: 100% | 2400 | HW Latency: 1.86 ms | ETA: 00:00:00
> Inference result:
Network group: mobilenet_v2_1_0
Frames count: 2400
HW Latency: 1.86 ms
Overall Latency: 2.07 ms
root@zub1cg-sbc-2022-2:~/hailo_benchmarks# hailortcli run --measure-latency --measure-overall-latency ssd_mobilenet_v1.hef
Running streaming inference (ssd_mobilenet_v1.hef):
Transform data: true
Type: auto
Quantized: true
[HailoRT] [warning] HW Latency measurement is not supported on NMS networks
Network ssd_mobilenet_v1/ssd_mobilenet_v1: 100% | 843 | HW Latency: NaN | ETA: 00:00:00
> Inference result:
Network group: ssd_mobilenet_v1
Frames count: 843
Overall Latency: 5.91 ms
root@zub1cg-sbc-2022-2:~/hailo_benchmarks# hailortcli run --measure-latency --measure-overall-latency ssd_mobilenet_v2.hef
Running streaming inference (ssd_mobilenet_v2.hef):
Transform data: true
Type: auto
Quantized: true
[HailoRT] [warning] HW Latency measurement is not supported on NMS networks
Network ssd_mobilenet_v2/ssd_mobilenet_v2: 100% | 584 | HW Latency: NaN | ETA: 00:00:00
> Inference result:
Network group: ssd_mobilenet_v2
Frames count: 584
Overall Latency: 8.53 ms
root@zub1cg-sbc-2022-2:~/hailo_benchmarks#
The Final Verdict
For the ZUBoard Dual Camera design, the benchmarking provides comparative results between the B128 DPU and the Hailo-8 acceleration module:
- performance (FPS)
- latency (msec)
- power (W)
- performance/W (FPS/W)
The performance (FPS) results were truly impressive, with the Hailo-8 delivering an average of 200x more FPS. Although this is only 30% of the 684x increase we were expecting theoretically, this is still a very big improvement in performance.
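(As a sanity check using the numbers measured above: resnet_v1_50 goes from 5.55 FPS on the B128 DPU to roughly 1312 FPS on the Hailo-8, about 236x, while mobilenet_v1 goes from 20.76 FPS to roughly 3490 FPS, about 168x; the average of these two ratios is approximately 200x.)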
The latency (msec) results were equally impressive and unexpected, with the Hailo-8 providing 46x lower latency.
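(Again from the measurements above: resnet_v1_50 drops from roughly 180 ms on the B128 DPU to 3.00 ms overall latency on the Hailo-8, about 60x, and mobilenet_v1 from roughly 48 ms to 1.51 ms, about 32x.)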
When idle, the Hailo-8 based design consumes 0.5W more power than the B128 version. When benchmarking a model, however, the Hailo-8 module consumed 20 times more power.
The performance/W (FPS/W) results compare the relative efficiency of each solution, with the Hailo-8 being 20-24x more efficient.
I hope this project inspires you to create innovative applications on ZUBoard.
Don't forget to check out the following projects that describe how to add support for ROS2 in your petalinux 2022.2 projects:
If this project sparks other ideas or questions that you want to share with the community, let me know in the comments below.
Acknowledgements
I want to thank my co-author Gianluca Filippini (EBV) for his pioneering work with the Hailo-8 AI Accelerator, and for bringing this marvel to my attention.
I also want to thank Tom Curran for the M.2 support on ZUBoard via the M.2 HSIO module.
Version History
- 2023/11/21 - Initial Version
- 2024/01/16 - Update Hailo-8 latency results for overall latency