This project is part of a series on the subject of deploying the MediaPipe models to the edge on embedded platforms.
If you have not already read part 1 of this series, I urge you to start here:
In this project, I start by giving a recap of the challenges that can be expected when deploying the MediaPipe models, specifically for Hailo-8.
Then I will address these challenges one by one, before deploying the models with the Hailo flow.
Finally, I will perform profiling to determine if our goal of acceleration was achieved.
Hailo Flow Overview
Hailo's AI Software Suite allows users to deploy models to the Hailo AI accelerators.
In addition to the Hailo AI accelerator devices, Hailo offers a scalable range of PCIe Gen 3.0 compatible M.2 AI accelerator modules:
No tests were made with the Hailo-10 AI acceleration module due to its limited availability. This project will only cover the following Hailo AI acceleration modules:
- Hailo-8 : M.2 M Key (PCIe Gen 3.0, 4 lanes), 26 TOPS
- Hailo-8 : M.2 B+M Key (PCIe Gen 3.0, 2 lanes), 26 TOPS
- Hailo-8L : M.2 B+M Key (PCIe Gen 3.0, 2 lanes), 13 TOPS
The Hailo AI Software Suite supports the following frameworks:
- TensorFlow Lite
- ONNX
Hailo chose TensorFlow Lite, not because of its popular use for "reduced set of instructions and quantized" models, but rather because it is a "more stable" exportable format that also supports full floating-point models.
Other frameworks are indirectly supported by exporting to the TF-Lite or ONNX formats.
The deployment involves the following tasks:
- Model Parsing
- Model Optimization & Resource Allocation
- Model Compilation
The Model Parsing task translates models from industry-standard frameworks into Hailo's internal representation, saved as a Hailo Archive (HAR) file. It allows the user to identify layers, or sequences of layers, that are not supported by the compiler. This step is crucial when training our own custom model, since we can adapt the model architecture to use layers supported by the target compiler prior to training, thus saving hours (or days) in our deployment flow.
The Model Optimization and Resource Allocation tasks convert the model to the internal representation using state-of-the-art quantization, then allocate this internal representation to the available resources in the Hailo AI accelerator. In order to perform this analysis and conversion, a subset of the training dataset is required. The size of the required calibration data is typically on the order of several thousand samples.
The Model Compilation task converts the quantized model to micro-code (a HEF file) that can be run on the Hailo AI acceleration device.
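For reference, here is a minimal sketch of these three tasks using the Dataflow Compiler's Python API. The file names are placeholders, and the hailo_flow.py script discussed later in this project wraps these same calls with command-line arguments.

# Minimal sketch of the parse / optimize / compile tasks with the Hailo DFC Python API.
# File names are placeholders; hailo_flow.py (discussed later) wraps these same calls.
import numpy as np
from hailo_sdk_client import ClientRunner

runner = ClientRunner(hw_arch="hailo8")

# 1. Model Parsing : translate the TFLite model into Hailo's internal representation (HAR)
hn, npz = runner.translate_tf_model("models/palm_detection_lite.tflite", "palm_detection_lite")
runner.save_har("palm_detection_lite_hailo_model.har")

# 2. Model Optimization : quantize the model using the calibration dataset
calib_data = np.load("calib_palm_detection_192_dataset.npy")
runner.optimize(calib_data)

# 3. Model Compilation : generate the binary micro-code (HEF) for the Hailo-8
hef = runner.compile()
with open("palm_detection_lite.hef", "wb") as f:
    f.write(hef)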
In this project, we will be using version 2023-10 of the Hailo AI Software Suite Container, which includes:
- Dataflow Compiler : v3.25.0
- Hailo Model Zoo : v2.9.0
- Hailo RT : v4.15.0
- TAPPAS : v3.26.0
More information on versions and compatibility can be found on the Hailo Developer Zone.
A Note about GPUs
It is important to note that a GPU will be used by the Hailo Dataflow Compiler (DFC), specifically for the Model Optimizer task.
In my case, I have the following GPUs in my system:
- AMD GPU Radeon Pro W7900 (45GB) : not supported
- NVIDIA T400 (2GB) : supported
If, like me, your GPU does not have enough memory, you will get a message like the following:
hailo_model_optimization.acceleras.utils.acceleras_exceptions.AccelerasResourceError:
GPU memory has been exhausted.
Please try to use Fine Tune with lower batch size or run on CPU.
In this case, it is possible to reduce the batch size from the default 8 down to something less memory intensive, such as 2. Since this may affect quantization accuracy, it is best to use the maximum batch size supported by your GPU for each model.
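As a hedged example, the batch size can be lowered through the model script passed to the optimizer. The post_quantization_optimization command and its batch_size parameter shown below are my assumption of the model-script syntax and should be verified against the Dataflow Compiler User Guide for your version.

# Hedged sketch: lowering the fine-tune batch size via the model script before optimization.
# The post_quantization_optimization() command and its batch_size parameter are assumptions;
# verify them against the DFC User Guide for your version.
import numpy as np
from hailo_sdk_client import ClientRunner

runner = ClientRunner(har="palm_detection_lite_hailo_model.har")
runner.load_model_script(
    "post_quantization_optimization(finetune, policy=enabled, batch_size=2)\n"
)
runner.optimize(np.load("calib_palm_detection_192_dataset.npy"))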
Installing the Hailo AI SW Suite Docker Container
All of these tools are available in the Hailo AI SW Suite docker container, which can be launched as follows:
Start by cloning the "2023.1" branch of my "blaze_tutorial" repository:
$ git clone --branch 2023.1 --recursive https://github.com/AlbertaBeef/blaze_tutorial
Next, download the files needed to launch the Hailo AI SW Suite docker using the "hailo_ai_sw_suite_docker_download.sh" script:
$ cd blaze_tutorial/hailo-8/hailo_ai_sw_suite_docker
$ source ./hailo_ai_sw_suite_docker_download.sh
Download HailoRT PCIe driver
https://hailo.ai/?dl_dev=1&file=01bbd3ecd420f439eebf2e2d17176610
Install PCIe driver
sudo dpkg -i hailort-pcie-driver_4.15.0_all.deb
Reboot
...
Download hailo_ai_sw_suite_2023-10_docker.zip from Hailo Developer Zone
https://hailo.ai/developer-zone/software-downloads/
https://hailo.ai/?dl_dev=1&file=72aa39896d10bef11d7272c08b96f1eb
Extract docker archive
Unzip hailo_ai_sw_suite_2023-10_docker.zip
Launch docker
./hailo_ai_sw_suite_docker_run.sh
As shown above, the script provides instructions on how to download the Hailo AI SW Suite docker container and the HailoRT PCIe driver.
If you plan to run the Hailo run-time on your Linux computer (which requires an M.2 socket populated with a Hailo-8 module), you will need to install the PCIe driver.
In order to compile models for the Hailo-8 acceleration modules, you will need to download the Hailo AI SW Suite docker container from Hailo's Developer Zone.
Once downloaded, extract the docker container as follows:
$ unzip hailo_ai_sw_suite_2023-10_docker.zip
Finally, launch the Hailo AI SW Suite docker as follows:
$ ./hailo_ai_sw_suite_docker_run.sh
Challenges of deploying MediaPipe with Hailo-8
The first challenge that I encountered, in part 1, was the reality that the performance of the MediaPipe models significantly degrades when run on embedded platforms, compared to modern computers. This is the reason I am attempting to accelerate the models with the Hailo AI SW Suite.
The second challenge is the fact that Google does not provide the dataset that was used to train the MediaPipe models. Since quantization requires a subset of this training data, this presents us with the challenge of coming up with this data ourselves.
Creating a Calibration Dataset for Quantization
As described previously in the "Hailo Flow Overview" section, the quantization phase requires several hundred to several thousand data samples, ideally a subset of the training data. Since we do not have access to the training dataset, we need to come up with this data ourselves.
We can generate the calibration dataset using a modified version of the blaze_detect_live.py script (from the blaze_app_python repository), as follows (a simplified pre-processing sketch is shown after the list):
For each input image that contains at least one hand, we want to generate:
- palm detection input images : the full image, resized and padded to the model's input size
- hand landmarks input images : the cropped image of each hand, resized to the model's input size
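To illustrate the first of these, here is a simplified sketch of the resize-and-pad pre-processing and the packing of the results into a .npy file. The directory name and the 192x192 size are examples; the actual script in the repository also runs the palm detector to locate and crop hands for the landmark datasets.

# Simplified sketch: resize with preserved aspect ratio, pad to the model's square input
# size, and stack the results into a calibration .npy file. Paths/sizes are examples.
import glob
import cv2
import numpy as np

def resize_pad(image, size):
    """Resize to fit inside size x size, then pad with zeros (letterbox)."""
    h, w, _ = image.shape
    scale = size / max(h, w)
    resized = cv2.resize(image, (int(w * scale), int(h * scale)))
    padded = np.zeros((size, size, 3), dtype=np.uint8)
    padded[:resized.shape[0], :resized.shape[1], :] = resized
    return padded

samples = []
for path in glob.glob("kaggle_hand_gestures_dataset/*.jpg"):
    bgr = cv2.imread(path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    samples.append(resize_pad(rgb, 192))

calib = np.stack(samples)          # shape (N, 192, 192, 3), dtype uint8
np.save("calib_palm_detection_192_dataset.npy", calib)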
Two possible sources for input images are the following:
- Kaggle : many datasets exist, and may be reused
- Pixabay : contains several interesting videos, from which images can be extracted
For the case of Kaggle, if we take an existing dataset such as the following:
We can create a modified version of the blaze_detect_live.py script (from the blaze_app_python repository) that will scan all the images and generate a NumPy-specific binary format (*.npy) file containing our calibration data for the quantization step:
To run this script, navigate to the "blaze_app_python/calib_dataset_kaggle" directory, download the kaggle dataset to this sub-directory, and launch the script as follows:
$ python3 gen_calib_hand_dataset.py
[INFO] 2167 images found in kaggle_hand_gestures_dataset
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[INFO] calib_palm_detection_192_dataset shape = (1871, 192, 192, 3) uint8 0 255
[INFO] calib_palm_detection_256_dataset shape = (1871, 256, 256, 3) uint8 0 255
[INFO] calib_hand_landmark_224_dataset shape = (1880, 224, 224, 3) uint8 0 255
[INFO] calib_hand_landmark_256_dataset shape = (1880, 256, 256, 3) uint8 0 255
This will create the following calibration data for the 0.10 versions of the palm detection and hand landmarks models:
- calib_palm_detection_192_dataset.npy : 1871 samples of 192x192 RGB images
- calib_hand_landmark_224_dataset.npy : 1880 samples of 224x224 RGB images
I ultimately decided not to use this dataset, but have documented the process for reference, since it can be applied to any other Kaggle dataset.
If we take the case of Pixabay, we can use several videos as sources, such as the following:
- https://pixabay.com/videos/pixabay-sign-language-people-inclusion-58301/
- https://pixabay.com/videos/pixabay-sign-language-people-inclusion-58302/
- https://pixabay.com/videos/pixabay-man-living-room-faces-expression-13625/
- https://pixabay.com/videos/pixabay-man-face-expression-irritated-182353.mp4
- https://pixabay.com/videos/pixabay-hands-good-accept-vote-ok-gesture-168344/
- https://pixabay.com/videos/pixabay-girl-heart-gesture-symbol-cute-129421/
Once again, we can create a modified version of the blaze_detect_live.py script (from the blaze_app_python repository) that will scan through the videos and generate a NumPy-specific binary format (*.npy) file containing our calibration data for the quantization step:
To run this script, navigate to the "blaze_app_python/calib_dataset_pixabay" directory, download the PixaBay videos in a "videos" sub-sub-directory, and launch the script as follows:
$ python3 gen_calib_hand_dataset.py
[INFO] Start of video ./videos/pixabay-sign-language-people-inclusion-58301.mp4
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[INFO] End of video ./videos/pixabay-sign-language-people-inclusion-58301.mp4
[INFO] Collected 228 images for calib_palm_detection_192_dataset
[INFO] Collected 228 images for calib_palm_detection_256_dataset
[INFO] Collected 336 images for calib_hand_landmark_224_dataset
[INFO] Collected 336 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-sign-language-people-inclusion-58302.mp4
[INFO] End of video ./videos/pixabay-sign-language-people-inclusion-58302.mp4
[INFO] Collected 694 images for calib_palm_detection_192_dataset
[INFO] Collected 694 images for calib_palm_detection_256_dataset
[INFO] Collected 1127 images for calib_hand_landmark_224_dataset
[INFO] Collected 1127 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-man-living-room-faces-expression-136253.mp4
[INFO] End of video ./videos/pixabay-man-living-room-faces-expression-136253.mp4
[INFO] Collected 1041 images for calib_palm_detection_192_dataset
[INFO] Collected 1041 images for calib_palm_detection_256_dataset
[INFO] Collected 1818 images for calib_hand_landmark_224_dataset
[INFO] Collected 1818 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-man-face-expression-irritated-182353.mp4
[INFO] End of video ./videos/pixabay-man-face-expression-irritated-182353.mp4
[INFO] Collected 1138 images for calib_palm_detection_192_dataset
[INFO] Collected 1138 images for calib_palm_detection_256_dataset
[INFO] Collected 1933 images for calib_hand_landmark_224_dataset
[INFO] Collected 1933 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-hands-good-accept-vote-ok-gesture-168344.mp4
[INFO] End of video ./videos/pixabay-hands-good-accept-vote-ok-gesture-168344.mp4
[INFO] Collected 1361 images for calib_palm_detection_192_dataset
[INFO] Collected 1361 images for calib_palm_detection_256_dataset
[INFO] Collected 2379 images for calib_hand_landmark_224_dataset
[INFO] Collected 2379 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-girl-heart-gesture-symbol-asian-129421.mp4
[INFO] End of video ./videos/pixabay-girl-heart-gesture-symbol-asian-129421.mp4
[INFO] Collected 1577 images for calib_palm_detection_192_dataset
[INFO] Collected 1577 images for calib_palm_detection_256_dataset
[INFO] Collected 2595 images for calib_hand_landmark_224_dataset
[INFO] Collected 2595 images for calib_hand_landmark_256_dataset
[INFO] calib_palm_detection_192_dataset shape = (1577, 192, 192, 3) uint8 0 255
[INFO] calib_palm_detection_256_dataset shape = (1577, 256, 256, 3) uint8 0 255
[INFO] calib_hand_landmark_224_dataset shape = (2595, 224, 224, 3) uint8 0 255
[INFO] calib_hand_landmark_256_dataset shape = (2595, 256, 256, 3) uint8 0 255
This will create the following calibration data for the 0.10 versions of the palm detection and hand landmarks models:
- calib_palm_detection_192_dataset.npy : 1577 samples of 192x192 RGB images
- calib_hand_landmark_224_dataset.npy : 2595 samples of 224x224 RGB images
You are free to use either source described above, or use your own source as data for the quantization phase.
I have archived my exploration on this sub-topic (creating hand/face/pose datasets for various versions of models) in the following two archives:
- Kaggle : calib_dataset_kaggle.zip
- Pixabay : calib_dataset_pixabay.zip
Before we tackle the deployment flow with the Hailo AI SW Suite, it is worth taking a deeper dive into the models we will be working with. For this purpose, I will highlight the architecture of the palm detection model.
At a very high level, there are three convolutional neural network backbones that are used to extract features at three different scales. The outputs of these three backbones are combined to feed two different heads: classifiers (containing the score) and regressors (containing the bounding box and additional keypoints).
The input to this model is a 256x256 RGB image, while the outputs of the model are 2944 candidate results, each containing the following (a hedged decoding sketch is shown after the list):
- score
- bounding box (normalized to pre-determined anchor boxes)
- keypoints (7 keypoints for palm detector)
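The sketch below illustrates, in a simplified and hedged form, how one candidate is decoded against its pre-computed SSD anchor. The output ordering and scaling shown are illustrative assumptions; confirm them against the post-processing code in the blaze_app_python repository.

# Simplified, hedged sketch of decoding one candidate against its SSD anchor.
# The regressor ordering ([dx, dy, w, h, kx0, ky0, ...]) and the division by the input
# size are assumptions; confirm against the blaze_app_python post-processing code.
import numpy as np

def decode_candidate(raw_score, raw_regressor, anchor, input_size=256):
    score = 1.0 / (1.0 + np.exp(-raw_score))                 # sigmoid on the classifier output
    cx = raw_regressor[0] / input_size + anchor["x_center"]  # offsets relative to the anchor
    cy = raw_regressor[1] / input_size + anchor["y_center"]
    w  = raw_regressor[2] / input_size
    h  = raw_regressor[3] / input_size
    keypoints = [(raw_regressor[4 + 2*k] / input_size + anchor["x_center"],
                  raw_regressor[5 + 2*k] / input_size + anchor["y_center"])
                 for k in range(7)]                          # 7 palm keypoints
    return score, (cx, cy, w, h), keypoints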
Palm Detection (0.07) - Block Diagram 2 (image credit: AlbertaBeef)
The following block diagram illustrates details of the layers for the model. I have grouped together repeating patterns as "BLAZE BLOCK A", "BLAZE BLOCK B", and "BLAZE BLOCK C", showing the details only for the first occurrence.
The following block diagram is the same as the previous one, but this time showing details of the "BLAZE BLOCK B" patterns, which will require further discussion during the deployment phase.
As we saw previously in the "Hailo Flow Overview" section, the deployment phase starts with an inspection of the model in order to determine if the layers are supported by the Hailo data flow compiler (DFC).
Initial exploration reveals that the final reshape/concatenation layers of the model are not supported. This is reported as shown below for the 0.10 lite version of the palm detection model:
python3 hailo_flow.py --arch hailo8 --name palm_detection_lite --model models/palm_detection_lite.tflite --resolution 192 --process inspect
Command line options:
--arch : hailo8
--blaze : hand
--name : palm_detection_lite
--model : models/palm_detection_lite.tflite
--resolution : 192
--process : inspect
Traceback (most recent call last):
File "hailo_flow.py", line 42, in <module>
hn, npz = runner.translate_tf_model(model_path,model_name)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
return func(self, *args, **kwargs)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 902, in translate_tf_model
parser.translate_tf_model(model_path=model_path, net_name=net_name, start_node_names=start_node_names,
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/tools/subprocess_wrapper.py", line 57, in parent_wrapper
raise SubprocessTracebackFailure(*child_messages)
hailo_model_optimization.acceleras.utils.acceleras_exceptions.SubprocessTracebackFailure: Subprocess failed with traceback
Traceback (most recent call last):
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/tools/subprocess_wrapper.py", line 32, in child_wrapper
func(self, *args, **kwargs)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/parser/parser.py", line 73, in translate_tf_model
return self.parse_model_to_hn(graph, values, net_name, start_node_names, end_node_names, nn_framework)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/parser/parser.py", line 214, in parse_model_to_hn
fuser = HailoNNFuser(converter.convert_model(), valid_net_name, converter.end_node_names)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/model_translator/translator.py", line 63, in convert_model
self._create_layers()
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/model_translator/edge_nn_translator.py", line 26, in _create_layers
self._add_direct_layers()
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/model_translator/edge_nn_translator.py", line 101, in _add_direct_layers
self._layer_callback_from_vertex(vertex)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/model_translator/tflite_translator/tflite_translator.py", line 134, in _layer_callback_from_vertex
layer, consumed_vertices, activation = create_layer_from_vertex(LayerType.concat, vertex)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/model_translator/tflite_translator/tflite_layer_creator.py", line 44, in create_layer_from_vertex
return _create_concat_layer(vertex)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/model_translator/tflite_translator/tflite_layer_creator.py", line 366, in _create_concat_layer
layer = ConcatLayer.create(vertex.name, vertex.input, output_shapes=vertex.output_shapes, axis=axis)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/hailo_nn/hn_layers/concat.py", line 25, in create
layer = super(ConcatLayer, cls).create(original_name, input_vertex_order, output_shapes)
File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/hailo_nn/hn_layers/layer.py", line 88, in create
raise UnsupportedModelError(f'1D form is not supported in layer {original_name} of type '
hailo_sdk_common.hailo_nn.exceptions.UnsupportedModelError: 1D form is not supported in layer Identity_1 of type ConcatLayer.
If we trace this back to our block diagram, these are the final layers of the model, as shown in red below:
These unsupported layers are in an ideal location (first or last layers), since the bulk of the model can execute entirely on the Hailo-8 acceleration module.
With respect to the Hailo AI SW Suite, this means that we need to specify the last CONV2D layers as output layers, and implement the missing layers in our application. This is shown below for the 0.10 lite version of the palm detection model:
python3 hailo_flow.py --arch hailo8 --name palm_detection_lite --model models/palm_detection_lite.tflite --resolution 192 --process parse
Command line options:
--arch : hailo8
--blaze : hand
--name : palm_detection_lite
--model : models/palm_detection_lite.tflite
--resolution : 192
--process : parse
[INFO] start_node_names : ['input_1']
[INFO] end_node_names : ['model_1/model/classifier_palm_16_NO_PRUNING/BiasAdd;model_1/model/classifier_palm_16_NO_PRUNING/Conv2D;model_1/model/classifier_palm_16_NO_PRUNING/BiasAdd/ReadVariableOp/resource1', 'model_1/model/classifier_palm_8_NO_PRUNING/BiasAdd;model_1/model/classifier_palm_8_NO_PRUNING/Conv2D;model_1/model/classifier_palm_8_NO_PRUNING/BiasAdd/ReadVariableOp/resource1', 'model_1/model/regressor_palm_16_NO_PRUNING/BiasAdd;model_1/model/regressor_palm_16_NO_PRUNING/Conv2D;model_1/model/regressor_palm_16_NO_PRUNING/BiasAdd/ReadVariableOp/resource1', 'model_1/model/regressor_palm_8_NO_PRUNING/BiasAdd;model_1/model/regressor_palm_8_NO_PRUNING/Conv2D;model_1/model/regressor_palm_8_NO_PRUNING/BiasAdd/ReadVariableOp/resource1']
[info] Translation completed on TensorFlow Lite model palm_detection_lite
[info] Initialized runner for palm_detection_lite
[info] Saved HAR to: /local/shared_with_docker/palm_detection_lite_hailo_model.har
I used Netron to analyze the model and determine the names of these intermediate CONV2D layers.
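As an illustration of the post-processing that now moves into the application, the sketch below reshapes and concatenates the four CONV2D outputs back into the score and regressor tensors expected by the decoding code. The grid and channel shapes are my assumptions for the 192x192 "lite" palm detector (2016 anchors, 18 regression values); confirm them with Netron, and make sure the concatenation order matches the anchor generation order.

# Hedged sketch: re-creating the final reshape/concat (unsupported on the Hailo-8) in the
# application. Shapes are assumptions for the 192x192 "lite" palm detector; confirm with
# Netron, and keep the concatenation order consistent with the anchor generation order.
import numpy as np

def reassemble_palm_outputs(cls_8, cls_16, reg_8, reg_16):
    # cls_8: (N,24,24,2)  cls_16: (N,12,12,6)  reg_8: (N,24,24,36)  reg_16: (N,12,12,108)
    n = cls_8.shape[0]
    scores = np.concatenate([cls_8.reshape(n, -1, 1),
                             cls_16.reshape(n, -1, 1)], axis=1)       # (N, 2016, 1)
    regressors = np.concatenate([reg_8.reshape(n, -1, 18),
                                 reg_16.reshape(n, -1, 18)], axis=1)  # (N, 2016, 18)
    return scores, regressors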
Model Deployment
Now that we know which layers of our models are supported, and which input and output layers to specify to the Hailo AI SW Suite, we can deploy the models with scripting, using the calibration data we have prepared.
I have prepared a script for this purpose:
This script takes four (4) arguments when invoked:
- arch : architecture (e.g. hailo8, hailo8l)
- name : BlazePalm, BlazeHandLandmark, etc.
- resolution : input size (e.g. 256)
- process : inspect, parse, optimize, compile, all
The name argument indicates which model we are deploying, such as BlazePalm for the palm detector or BlazeHandLandmark for the hand landmark models. The resolution indicates the input size to the model.
These two arguments determine which calibration dataset to use for the quantization, as illustrated by the mapping below. For example:
- name=BlazePalm, size=192 => calib_palm_detection_192_dataset.npy
- name=BlazeHandLandmark, size=224 => calib_hand_landmark_224_dataset.npy
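The following hypothetical helper simply mirrors this name/resolution to calibration-dataset mapping; the actual selection logic lives in hailo_flow.py.

# Hypothetical helper mirroring the name/resolution to calibration-dataset mapping
# described above; the actual selection logic lives in hailo_flow.py.
def calib_dataset_for(name, resolution):
    if "palm" in name.lower():
        return f"calib_palm_detection_{resolution}_dataset.npy"
    return f"calib_hand_landmark_{resolution}_dataset.npy"

print(calib_dataset_for("BlazePalm", 192))          # calib_palm_detection_192_dataset.npy
print(calib_dataset_for("BlazeHandLandmark", 224))  # calib_hand_landmark_224_dataset.npy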
The process argument indicates which task to run. By default, specify "all" to parse, optimize, and compile the model. We saw the inspect task in the previous section when we analyzed our models.
I have provided a second script, which calls the hailo_flow.py script to parse, optimize, and compile the models to be deployed:
You will want to modify the following list before execution:
- model_list : specify which model(s) you want to deploy
Below is a modified version of the script that will deploy the 0.10 versions of the palm detection and hand landmarks models.
# TFLite models
model_palm_detector_v0_07=("palm_detection_v0_07","models/palm_detection_without_custom_op.tflite",256)
model_hand_landmark_v0_07=("hand_landmark_v0_07","models/hand_landmark_v0_07.tflite",256)
model_palm_detector_v0_10_lite=("palm_detection_lite","models/palm_detection_lite.tflite",192)
model_palm_detector_v0_10_full=("palm_detection_full","models/palm_detection_full.tflite",192)
model_hand_landmark_v0_10_lite=("hand_landmark_lite","models/hand_landmark_lite.tflite",224)
model_hand_landmark_v0_10_full=("hand_landmark_full","models/hand_landmark_full.tflite",224)
model_face_detector_v0_10_short=("face_detection_short_range","models/face_detection_short_range.tflite",128)
model_face_detector_v0_10_full=("face_detection_full_range","models/face_detection_full_range.tflite",192)
model_face_landmark_v0_10=("face_landmark","models/face_landmark.tflite",192)
model_pose_detector_v0_10=("pose_detection","models/pose_detection.tflite",224)
model_pose_landmark_v0_10_lite=("pose_landmark_lite","models/pose_landmark_lite.tflite",256)
model_pose_landmark_v0_10_full=("pose_landmark_full","models/pose_landmark_full.tflite",256)
model_pose_landmark_v0_10_heavy=("pose_landmark_heavy","models/pose_landmark_heavy.tflite",256)
model_list=(
model_palm_detector_v0_10_lite[@]
model_palm_detector_v0_10_full[@]
model_hand_landmark_v0_10_lite[@]
model_hand_landmark_v0_10_full[@]
)
model_count=${#model_list[@]}
#echo $model_count
# Parse, optimize, and compile each model with the Hailo flow
for ((i=0; i<$model_count; i++))
do
model=${!model_list[i]}
model_array=(${model//,/ })
model_name=${model_array[0]}
model_file=${model_array[1]}
input_resolution=${model_array[2]}
echo python3 hailo_flow.py --arch hailo8 --name ${model_name} --model ${model_file} --resolution ${input_resolution} --process all
python3 hailo_flow.py --arch hailo8 --name ${model_name} --model ${model_file} --resolution ${input_resolution} --process all | tee deploy_${model_name}.log
done
This script must be executed in the Hailo AI SW Suite docker container. Launch the docker from the "blaze_tutorial/hailo-8/hailo_ai_sw_suite_docker" directory as follows:
$ ./hailo_ai_sw_suite_docker_run.sh
If you get a message indicating that a container is already running, launch the script with the "--resume" argument as follows:
$ ./hailo_ai_sw_suite_docker_run.sh --resume
Inside the Hailo AI SW Suite docker, download the TFLite models to the "models" sub-directory, then launch the deploy_models.sh script as follows:
$ cd ../shared_with_docker
$ cd models
$ source ./get_tflite_models.sh
$ cd ..
$ source ./deploy_models.sh
When complete, the following compiled models will be located in the current directory:
- palm_detection_lite.hef
- palm_detection_full.hef
- hand_landmark_lite.hef
- hand_landmark_full.hef
For convenience, I have archived the compiled models in the following archives:
- Hailo-8 models : blaze_hailo8_models.zip (compiled with DFC v3.25.0)
- Hailo-8L models : blaze_hailo8l_models.zip (compiled with DFC v3.25.0)
One thing that is important to highlight with the Hailo flow is that the number of contexts that are used to implement a model will affect its inference performance.
The ideal scenario is that the model can be implemented with a single context. The following table illustrates the significantly better performance of the 0.10 versions of the hand landmark models (implemented with 1 context), compared to the 0.07 version (implemented with 3 contexts).
When multiple contexts are required, additional transfers will need to occur over the PCIe bus in order to perform the context switching required to execute the entire model.
It can be worth exploring compression and mixed precision, with the intent of reducing the number of contexts, in order to improve overall performance.
In my provided hailo_flow.py script, I set the compression parameter auto_4bit_weights_ratio to 0.6 (which means ~60% of the weights will be quantized into 4-bits) and ran the model optimization again. Using 4-bit weights might reduce the model's accuracy but will help to reduce the model's memory footprint, possibly reduce the number of contexts, and thus increase performance.
# The following line is needed for really small models, when the compression_level would
# otherwise be reverted back to 0. The application of the compression can be seen in the
# [info] messages: "Assigning 4bit weight to layer ..."
'model_optimization_config(compression_params, auto_4bit_weights_ratio=0.6)\n',
# Increase control utilization to reduce the number of contexts
'resources_param(strategy=greedy,max_control_utilization=0.80)\n'
The following table illustrates the significant performance increase achieved with 60% of weights being quantized to 4-bit with the 0.07 version of the palm detection model.
All this comes at the cost of increased power consumption. If we look at the performance per watt, however, the trade-off is well worth it, since we get more performance per watt for all versions of the palm detection models.
Model Execution
In order to support the Hailo-8 models, the "blaze_app_python" application was augmented with the following inference targets:
My final inference code for the Hailo-8 models can be found in the "blaze_app_python" repository, under the blaze_hailo sub-directory:
Note that the Hailo-8 inference can be run on a computer (with an M.2 socket populated with a Hailo-8 module) as well as on the Zynq UltraScale+ embedded platform (i.e. ZUBoard with M.2 HSIO and a Hailo-8 accelerator module).
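For context, here is a minimal, hedged sketch of running a compiled HEF with the HailoRT Python API. The exact API details vary between HailoRT versions, and the placeholder input frame and model path are examples; the actual inference code used by the application lives in the blaze_hailo sub-directory.

# Minimal, hedged sketch of running a compiled HEF with the HailoRT Python API
# (details vary by HailoRT version; see the blaze_hailo sub-directory for the real code).
import numpy as np
from hailo_platform import (HEF, VDevice, HailoStreamInterface, ConfigureParams,
                            InputVStreamParams, OutputVStreamParams, InferVStreams,
                            FormatType)

hef = HEF("blaze_hailo/models/palm_detection_lite.hef")
with VDevice() as device:
    params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = device.configure(hef, params)[0]
    in_params  = InputVStreamParams.make(network_group, format_type=FormatType.UINT8)
    out_params = OutputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    input_name = hef.get_input_vstream_infos()[0].name
    frame = np.zeros((1, 192, 192, 3), dtype=np.uint8)      # placeholder pre-processed image
    with network_group.activate(network_group.create_params()):
        with InferVStreams(network_group, in_params, out_params) as pipeline:
            outputs = pipeline.infer({input_name: frame})   # dict of output name -> array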
Installing the python application on ZUBoard
The python application can be accessed from the following github repository:
git clone https://github.com/AlbertaBeef/blaze_app_python
cd blaze_app_python
The python demo application requires certain packages which can be installed as follows:
pip3 install tflite_runtime matplotlib plotly kaleido numpy==1.23.2
In order to successfully use the python demo with the original TFLite models, they need to be downloaded from the Google web site:
cd blaze_tflite/models
source ./get_tflite_models.sh
cd ../..
In order to successfully use the python demo with the Hailo-8 models, they need to be downloaded as follows:
cd blaze_hailo/models
source ./get_hailo8_models.sh
unzip -o blaze_hailo8_models.zip
cp hailo8/*.hef .
cd ../..
Although I provide pre-compiled models for the face/pose detection and landmark models, only the palm detection and hand landmark models are currently working with the python demo application.
You are all set!
Launching the python application on ZUBoard
As we already saw in part 1, the python application can launch many variations of the dual-inference pipeline, which can be filtered with the following arguments:
- --blaze : hand | face | pose
- --target : blaze_tflite | ... | blaze_hailo
- --pipeline : specific name of pipeline (can be queried with --list argument)
In order to display the complete list of supported pipelines, launch the python script as follows:
root@zub1cg-sbc-2022-2:~/blaze_app_python# python3 blaze_detect_live.py --list
[INFO] user@hosthame : root@zub1cg-sbc-2022-2
[INFO] blaze_tflite supported ...
...
[INFO] blaze_hailo supported ...
...
Command line options:
--input :
--image : False
--blaze : hand,face,pose
--target : blaze_tflite,blaze_pytorch,blaze_vitisai,blaze_hailo
--pipeline : all
--list : True
--debug : False
--withoutview : False
--profilelog : False
--profileview : False
--fps : False
List of target pipelines:
...
07 hai_hand_v0_10_lite blaze_hailo/models/palm_detection_lite.hef
blaze_hailo/models/hand_landmark_lite.hef
08 hai_hand_v0_10_full blaze_hailo/models/palm_detection_full.hef
blaze_hailo/models/hand_landmark_lite.hef
...
In order to launch the Hailo-8 pipeline for hand detection and landmarks, with the DisplayPort monitor, use the python script as follows:
export DISPLAY=:0.0
python3 blaze_detect_live.py --pipeline=hai_hand_v0_10_lite
This will launch the 0.10 (lite) version of the models, compiled for Hailo-8, as shown below:
The previous video plays in real time (it has not been accelerated). It shows the frame rate to be approximately 19 fps when no hands are detected (one model running: palm detection), approximately 12 fps when one hand has been detected (two models running: palm detection and hand landmarks), and approximately 8 fps when two hands have been detected (three models running: palm detection and two hand landmarks).
It is worth noting that this is running with a single-threaded python script. There is an opportunity for increased performance with a multi-threaded implementation. While the graph runner is waiting for transfers from one model's sub-graphs, another (or several other) model(s) could be launched in parallel...
There is also an opportunity to accelerate the rest of the pipeline with C++ code...
Benchmarking the models on ZUBoard
For reasons which I have not resolved, the "--profileview" argument does not work well on the ZUBoard, so we will use the "--profilelog" argument instead.
The profiling functionality uses a test image that can be downloaded from Google as follows:
source ./get_test_images.sh
The following commands can be used to generate profile results for the hai_hand_v0_10_lite pipeline using the Hailo-8 models, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=hai_hand_v0_10_lite --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_zuboard_hai_hand_v0_10_lite.csv
The following commands can be used to generate profile results for the tfl_hand_v0_10_lite pipeline using the TFLite models, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=tfl_hand_v0_10_lite --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_zuboard_tfl_hand_v0_10_lite.csv
The same is done for the hai_hand_v0_10_full and tfl_hand_v0_10_full pipelines.
The results of all .csv files were averaged, then plotted using Excel.
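The averaging step could also be scripted; a small sketch is shown below. The column names in blaze_detect_live.csv are not reproduced here, so all numeric columns are simply averaged per file.

# Small sketch of averaging the profiling CSVs before plotting. Since the exact column
# names of blaze_detect_live.csv are not shown here, all numeric columns are averaged
# per source file.
import glob
import pandas as pd

frames = []
for csv_file in glob.glob("blaze_detect_live_zuboard_*.csv"):
    df = pd.read_csv(csv_file)
    df["source"] = csv_file            # keep track of which pipeline the rows came from
    frames.append(df)

summary = pd.concat(frames).groupby("source").mean(numeric_only=True)
print(summary)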
Here are the profiling results for the 0.10 versions of the models deployed with Hailo-8, in comparison to the reference TFLite models:
Here are the profiling results for the 0.07 versions of the models deployed with Hailo-8, in comparison to the reference TFLite models:
Again, it is worth noting that these benchmarks have been taken with a single-threaded python script. There is additional opportunity for acceleration with a multi-threaded implementation. While the graph runner is waiting for transfers from one model's sub-graphs, another (or several other) model(s) could be launched in parallel...
There is also an opportunity to accelerate the rest of the pipeline with C++ code...
Setting up the various Hailo-8 platforms
In order to get a better sense of the acceleration achieved with the Hailo-8 acceleration module, I decided to perform similar profiling on the following platforms:
- Raspberry Pi5 : quad-Cortex-A76 ARM processors / Hailo-8L (B+M, 2 lanes)
- ZUBoard : dual-Cortex-A53 ARM processor / Hailo-8 (B+M Key, 2 lanes)
- ZCU104 : quad-Cortex-A53 ARM processors / Hailo-8 (M Key, 4 lanes)
- HP Z4 G4 Workstation : Intel Xeon (3.6GHz) / Hailo-8 (M Key, 4 lanes)
In order to setup the RPI5, please refer to the following documentation:
I purchased a pre-assembled kit from CanaKit, and followed their getting started instructions:
My understanding is that the RPI5 Hailo-8L integration was performed with Hailo AI SW Suite v2024-04, with models compiled with DFC v3.27.0. In preparation for this, I have compiled the Hailo-8L models using DFC v3.27.0:
- Hailo8L models : blaze_models_hailo8l_dfc_v3.27.0.zip (compiled with DFC v3.27.0)
In order to setup the ZUBoard, please refer to the detailed instructions in my previous project:
In order to setup the UltraZed-EV, I started with the following PCIe enabled design:
- [Github] Avnet/hdl/uz7ev_evcc_nvme
- [Github] Avnet/petalinux/uz7ev_evcc_nvme
I then added the meta-hailo recipes to the PetaLinux project, as described in the previous Hackster project for ZUBoard, and attached the Hailo-8 module using the Opsero M.2 M-key Stack FMC:
- [Opsero] M.2 M-key Stack FMC
In order to set up the HP Z4 G4, I simply inserted the Hailo-8 module into the M.2 socket:
In order to determine the acceleration achieved on each platform, the reference TFLite models needed to be profiled as well:
Next, I profiled the 0.07 and 0.10 versions of the models deployed with Hailo-8, and compared with the reference TFLite models:
If we plot the execution times for the Hailo-8 models for each platform, we get the following results:
If we analyze these results per platform, we can observe the following acceleration:
If we plot the acceleration ratios of the execution times for the Hailo-8 models with the most acceleration (versions 0.07), for each platform, we get the following results:
The uncontested winner in terms of performance is the modern workstation (HP Z4 G4). Both its TFLite and Hailo-8 models have the smallest execution times. If we consider acceleration, however, there is little gain offered by the Hailo-8 acceleration module, since the models already perform very well on the CPU.
If we consider the acceleration achieved, the Zynq UltraScale+ platforms rise above the others; on the UltraZed-EV platform specifically, the hand landmarks model runs 29X faster.
If we run benchmarks with Hailo's "hailortcli" utility, as follows:
hailortcli run blaze_hailo/models/{model}.hef
We observe the following performance in FPS for each of the palm detection and hand landmarks models on the various platforms:
We can see the significant performance differences between the various Hailo acceleration modules:
- Hailo-8L (B+M Key, 2 lanes) - 13 TOPS
- Hailo-8 (B+M Key, 2 lanes) - 26 TOPS
- Hailo-8 (M Key, 4 lanes) - 26 TOPS
The first design decision would obviously be to choose the Hailo-8 module over the Hailo-8L module.
The next design decision would be to favor a 4-lane PCIe interface over a 2-lane PCIe interface. Although my results show no advantage with these relatively small models, execution of larger models would result in more context switching, and thus lower performance on a 2-lane implementation.
Known Issues
The current version of this project has models for the following pipelines:
- palm detection : working
- hand landmarks : working
- face detection : status unknown
- face landmarks : status unknown
- pose detection : status unknown (pose detection model does not build)
- pose landmarks : status unknown
The following platforms have been tested:
- Raspberry Pi 5 AI Kit, with Hailo-8L (B+M Key, 2 lanes) : working
- ZUBoard, with Hailo-8 (B+M Key, 2 lanes) : working
- ZCU104, with Hailo-8 (M Key, 4 lanes) : working
- HP Z4 G4 Workstation, with Hailo-8 (M Key, 4 lanes) : working
I hope this project will inspire you to implement your own custom application.
What applications would you like to see built on top of these foundational MediaPipe models?
Do you have a Raspberry Pi 5 AI Kit (with Hailo-8L module)? If so, would you be willing to perform tests with these accelerated models?
Let me know in the comments...
Acknowledgements
I want to thank my co-author Gianluca Filippini (EBV) for his pioneering work with the Hailo-8 AI Accelerator module, and for bringing this marvel to my attention. His feedback, guidance, and insight have been invaluable.
I also want to thank Jeff Johnson (Opsero) for his M.2 M-Key Stack FMC, which was indispensable for testing the Hailo-8 module on the UltraZed-7EV FMC Carrier Card:
- [Opsero] M.2 M-key Stack FMC
- 2024/09/02 - Initial Version
- 2024/09/14 - Updated with RPI5 AI Kit results
- 2024/09/19 - Updated with Performance (FPS) results, taken with hailortcli
- [Google] MediaPipe Solutions Guide : https://ai.google.dev/edge/mediapipe/solutions/guide
- [Hailo] Hailo AI SW Suite Documentation : https://hailo.ai/products/hailo-software/hailo-ai-software-suite
- [Hailo] Hailo Dataflow Compiler (DFC) User Guide, v3.25.0
- [Hailo] Hailo RT User Guide, v4.15.0
- [Hailo] Hailo TAPPAS User Guide v3.26.0
- [Hailo] Hailo Developer Zone : https://hailo.ai/developer-zone
- [Opsero] M.2 M-key Stack FMC
- [AlbertaBeef] blaze_app_python : AlbertaBeef/blaze_app_python
- [AlbertaBeef] blaze_tutorial : AlbertaBeef/blaze_tutorial