This project is part of a series on the subject of deploying the MediaPipe models to the edge on embedded platforms.
If you have not already read part 1 of this series, I urge you to start here:
In this project, I start by giving a recap of the challenges that can be expected when deploying the MediaPipe models, specifically with Vitis-AI 3.5.
Then I will address these challenges one by one, before deploying the models with Vitis-AI 3.5.
Finally, I will perform profiling to determine if our goal of acceleration was achieved.
Vitis-AI 3.5 Flow Overview
AMD's Vitis-AI 3.5 workflow allows users to deploy models to AMD's embedded SoC devices, such as the Zynq UltraScale+ and Versal AI Edge.
This workflow supports the following frameworks:
- TensorFlow
- TensorFlow 2
- PyTorch
The deployment involves the following tasks:
- Model Inspection (not shown in flow diagram)
- AI Quantization
- AI Optimization (optional)
- AI Compilation
The Model Inspection task allows the user to identify layers, or sequences of layers, in the model that are not supported by the compiler. This step is crucial when training our own custom model, since we can adapt the model architecture to use layers supported by the target compiler prior to training, saving hours (or days) in our deployment flow.
The AI Quantization task analyzes the dynamic range of each layer in the model and converts the data types from floating point to fixed point. The DPU engine we are targeting supports 8-bit integers. In order to perform this analysis and conversion, a subset of the training dataset is required; the amount of calibration data needed is on the order of several hundred to a few thousand samples.
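As a rough illustration (not the exact algorithm used by the Vitis-AI quantizer), 8-bit symmetric quantization maps each value to an integer using a scale derived from the observed dynamic range:

# Toy illustration of symmetric 8-bit quantization (not the exact Vitis-AI algorithm).
import numpy as np

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

activations = np.array([-1.73, -0.02, 0.5, 2.41], dtype=np.float32)
scale = np.abs(activations).max() / 127.0       # scale chosen from the observed dynamic range
q = quantize_int8(activations, scale)
print(q, dequantize(q, scale))                  # small rounding error vs the original values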
As illustrated in the previous workflow graph, in certain cases (one of which I ran into was the MobileNet-V2 classification model), post-training quantization may not yield sufficient accuracy, and quantization-aware training (QAT) is required. Fortunately, this was not the case for the MediaPipe models.
The AI Optimization task prunes weights that have minimal impact on the model's accuracy, thus reducing the compute requirements of the model. This task is strongly recommended for large models, such as ResNet, that were architected for accuracy and contain significant redundancy. For smaller, compact models, such as MobileNet, that were architected to be resource efficient, the Optimizer will not provide much gain and is thus not required. For this project, we will not be using the AI Optimizer.
The AI Compilation task converts the quantized model to micro-code that can be run on the DPU engine.
Installing the Vitis-AI 3.5 Docker Container
All of these tools are available in AMD's Vitis-AI Docker Container, which can be launched as follows:
Start by cloning the "2023.1" branch of my "blaze_tutorial" repository:
$ git clone --branch 2023.1 --recursive https://github.com/AlbertaBeef/blaze_tutorial
Next, download the files needed to launch the Vitis-AI docker using the "prepare_docker.sh" script:
$ cd blaze_tutorial/vitis-ai
$ source ./prepare_docker.sh
$ cd ..
Download the Vitis-AI docker container for PyTorch as follows:
$ docker pull xilinx/vitis-ai-pytorch-cpu
Finally, launch the Vitis-AI docker as follows:
$ ./vitis-ai/docker_run.sh xilinx/vitis-ai-pytorch-cpu
$ cd vitis-ai
Challenges of deploying MediaPipe with Vitis-AI 3.5
The first challenge, which I encountered in part 1, is that the performance of the MediaPipe models degrades significantly when they are run on embedded platforms rather than modern computers. This is the reason I am attempting to accelerate the models with Vitis-AI.
The second challenge is the fact that Google does not provide the dataset that was used to train the MediaPipe models. Since quantization requires a subset of this training data, this presents us with the challenge of coming up with this data ourselves.
The third challenge encountered when deploying the MediaPipe models to Vitis-AI 3.5 is that TFLite is not supported. This forces us to consider the use of an alternate framework, and the additional problem of model conversion.
Converting the TFLite models to PyTorch
The choice of converting the models to PyTorch was motivated mainly by existing efforts in the open-source community.
I was fortunate to find the following resources which convert the TFLite models to PyTorch:
- [Vidur Satija] BlazePalm : vidursatija/BlazePalm
- [Matthijs Hollemans] BlazeFace-PyTorch : hollance/BlazeFace-PyTorch
- [Zak Murez] MediaPipePyTorch : zmurez/MediaPipePytorch
The three GitHub repositories seem to be inter-related, and build on top of each other.
The BlazePalm repository provides converted models for the palm detection and hand landmark models only, as well as the conversion script as a Jupyter notebook.
Similarly, the BlazeFace-PyTorch repository provides the converted face models, as well as the conversion script.
The MediaPipePyTorch repository provides the converted models for each of the hands, face, and pose models, but does not provide any of the conversion scripts.
The disadvantage of reusing this work is that it only works with the 0.07 versions of the models (the latest versions are 0.10). I tried to convert the 0.10 versions of the models with the provided conversion scripts, but this did not work; more investigation and work would be needed to make this happen.
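For readers curious how these conversion scripts work, the general idea is to read the trained weights out of the TFLite flatbuffer and copy them, tensor by tensor, into a PyTorch module with a matching architecture. A heavily simplified sketch of that idea (the tensor names and parameter mapping shown here are illustrative, not taken from the actual scripts):

# Simplified sketch of the TFLite-to-PyTorch weight transfer idea (names are illustrative).
import numpy as np
import tensorflow as tf
import torch

interpreter = tf.lite.Interpreter(model_path="palm_detection.tflite")
interpreter.allocate_tensors()
details = {d["name"]: d["index"] for d in interpreter.get_tensor_details()}

def get_weight(name):
    return interpreter.get_tensor(details[name])

state_dict = {}
# Each entry maps a PyTorch parameter to a TFLite tensor; conv kernels need a layout
# change from TFLite's (out, H, W, in) to PyTorch's (out, in, H, W).
kernel = get_weight("conv2d/Kernel")                    # illustrative tensor name
state_dict["backbone1.0.weight"] = torch.from_numpy(np.transpose(kernel, (0, 3, 1, 2)))
state_dict["backbone1.0.bias"] = torch.from_numpy(get_weight("conv2d/Bias"))
# ...repeat for every layer, then load into the PyTorch model:
# model.load_state_dict(state_dict)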
We already saw in part 1 that differences are to be expected between version 0.07 and 0.10 of the models:
Nevertheless, I decided to go ahead with the PyTorch version of the 0.07 models for this project.
Creating a Calibration Dataset for Quantization
As described previously in the "Vitis-AI 3.5 Flow Overview" section, the quantization phase requires several hundred to a few thousand data samples, ideally a subset of the training data. Since we do not have access to the training dataset, we need to come up with this data ourselves.
We can generate the calibration dataset using a modified version of the blaze_app_python.py script, as follows:
For each input image that contains at least one hand, we want to generate:
- palm detection input images : image resized and padded to the model's input size
- hand landmarks input images : cropped image of each hand, resized to the model's input size
Two possible sources for input images are the following:
- Kaggle : many datasets exist, and may be reused
- Pixabay : contains several interesting videos, from which images can be extracted
For the case of Kaggle, if we take an existing dataset such as the following:
We can create a modified version of the blaze_detect_live.py script (from the blaze_app_python repository) that will scan all the images and generate a NumPy-specific binary format (*.npy) file containing our calibration data for the quantization step:
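The actual script is in the repository; conceptually, it letterboxes each image that contains a detected hand to the model's input resolution, stacks the samples, and saves them as a .npy file. A rough sketch of that idea (directory names are illustrative, and the hand-detection filtering is reduced to a comment):

# Rough sketch of the calibration-data generation (directory names are illustrative;
# see gen_calib_hand_dataset.py in the repository for the real script).
import glob
import cv2
import numpy as np

def letterbox(image, size):
    """Resize to fit in a size x size square, padding the remainder with zeros."""
    h, w, _ = image.shape
    scale = size / max(h, w)
    new_w, new_h = min(size, round(w * scale)), min(size, round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))
    padded = np.zeros((size, size, 3), dtype=np.uint8)
    padded[:new_h, :new_w, :] = resized
    return padded

samples = []
for path in glob.glob("kaggle_hand_gestures_dataset/*.jpg"):
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    # In the real script, the image is only kept if the palm detector finds a hand.
    samples.append(letterbox(img, 256))

calib = np.stack(samples)   # shape (N, 256, 256, 3), dtype uint8
np.save("calib_palm_detection_256_dataset.npy", calib)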
To run the actual script, navigate to the "blaze_app_python/calib_dataset_kaggle" directory, download the Kaggle dataset into this sub-directory, and launch the script as follows:
$ python3 gen_calib_hand_dataset.py
[INFO] 2167 images found in kaggle_hand_gestures_dataset
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[INFO] calib_palm_detection_192_dataset shape = (1871, 192, 192, 3) uint8 0 255
[INFO] calib_palm_detection_256_dataset shape = (1871, 256, 256, 3) uint8 0 255
[INFO] calib_hand_landmark_224_dataset shape = (1880, 224, 224, 3) uint8 0 255
[INFO] calib_hand_landmark_256_dataset shape = (1880, 256, 256, 3) uint8 0 255
This will create the following calibration data for the 0.07 versions of the palm detection and hand landmarks models:
- calib_palm_detection_256_dataset.npy : 1871 samples of 256x256 RGB images
- calib_hand_landmark_256_dataset.npy : 1880 samples of 256x256 RGB images
I ultimately decided to not use this dataset, but documented the process for reference, which can be applied to any other Kaggle dataset.
If we take the case of Pixabay, we can use several videos as sources, such as the following:
- https://pixabay.com/videos/pixabay-sign-language-people-inclusion-58301/
- https://pixabay.com/videos/pixabay-sign-language-people-inclusion-58302/
- https://pixabay.com/videos/pixabay-man-living-room-faces-expression-136253/
- https://pixabay.com/videos/pixabay-man-face-expression-irritated-182353/
- https://pixabay.com/videos/girl-hug-cheerful-gesture-cute-129420/
Once again, we can create a modified version of the blaze_detect_live.py script (from the blaze_app_python repository) that will scan through the videos and generate a NumPy-specific binary format (*.npy) file containing our calibration data for the quantization step:
To run this script, navigate to the "blaze_app_python/calib_dataset_pixabay" directory, download the Pixabay videos into a "videos" sub-directory, and launch the script as follows:
$ python3 gen_calib_hand_dataset.py
[INFO] Start of video ./videos/pixabay-sign-language-people-inclusion-58301.mp4
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[INFO] End of video ./videos/pixabay-sign-language-people-inclusion-58301.mp4
[INFO] Collected 228 images for calib_palm_detection_192_dataset
[INFO] Collected 228 images for calib_palm_detection_256_dataset
[INFO] Collected 336 images for calib_hand_landmark_224_dataset
[INFO] Collected 336 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-sign-language-people-inclusion-58302.mp4
[INFO] End of video ./videos/pixabay-sign-language-people-inclusion-58302.mp4
[INFO] Collected 694 images for calib_palm_detection_192_dataset
[INFO] Collected 694 images for calib_palm_detection_256_dataset
[INFO] Collected 1127 images for calib_hand_landmark_224_dataset
[INFO] Collected 1127 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-man-living-room-faces-expression-136253.mp4
[INFO] End of video ./videos/pixabay-man-living-room-faces-expression-136253.mp4
[INFO] Collected 1041 images for calib_palm_detection_192_dataset
[INFO] Collected 1041 images for calib_palm_detection_256_dataset
[INFO] Collected 1818 images for calib_hand_landmark_224_dataset
[INFO] Collected 1818 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-man-face-expression-irritated-182353.mp4
[INFO] End of video ./videos/pixabay-man-face-expression-irritated-182353.mp4
[INFO] Collected 1138 images for calib_palm_detection_192_dataset
[INFO] Collected 1138 images for calib_palm_detection_256_dataset
[INFO] Collected 1933 images for calib_hand_landmark_224_dataset
[INFO] Collected 1933 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-hands-good-accept-vote-ok-gesture-168344.mp4
[INFO] End of video ./videos/pixabay-hands-good-accept-vote-ok-gesture-168344.mp4
[INFO] Collected 1361 images for calib_palm_detection_192_dataset
[INFO] Collected 1361 images for calib_palm_detection_256_dataset
[INFO] Collected 2379 images for calib_hand_landmark_224_dataset
[INFO] Collected 2379 images for calib_hand_landmark_256_dataset
[INFO] Start of video ./videos/pixabay-girl-heart-gesture-symbol-asian-129421.mp4
[INFO] End of video ./videos/pixabay-girl-heart-gesture-symbol-asian-129421.mp4
[INFO] Collected 1577 images for calib_palm_detection_192_dataset
[INFO] Collected 1577 images for calib_palm_detection_256_dataset
[INFO] Collected 2595 images for calib_hand_landmark_224_dataset
[INFO] Collected 2595 images for calib_hand_landmark_256_dataset
[INFO] calib_palm_detection_192_dataset shape = (1577, 192, 192, 3) uint8 0 255
[INFO] calib_palm_detection_256_dataset shape = (1577, 256, 256, 3) uint8 0 255
[INFO] calib_hand_landmark_224_dataset shape = (2595, 224, 224, 3) uint8 0 255
[INFO] calib_hand_landmark_256_dataset shape = (2595, 256, 256, 3) uint8 0 255
This will create the following calibration data for the 0.07 versions of the palm detection and hand landmarks models:
- calib_palm_detection_256_dataset.npy : 1577 samples of 256x256 RGB images
- calib_hand_landmark_256_dataset.npy : 2595 samples of 256x256 RGB images
You are free to use either source described above, or use your own source as data for the quantization phase.
I have archived my exploration on this sub-topic (creating hand/face/pose datasets for various versions of models) in the following two archives:
- Kaggle : calib_dataset_kaggle.zip
- Pixabay : calib_dataset_pixabay.zip
Before we tackle the deployment flow with Vitis-AI, it is worth taking a deeper dive into the models we will be working with. For this purpose, I will highlight the architecture of the palm detection model (0.07 version).
At a very high level, three convolutional neural network backbones are used to extract features at three different scales. The outputs of these three backbones are combined to feed two different heads : classifiers (containing the score) and regressors (containing the bounding box and additional keypoints).
The input to this model is a 256x256 RGB image, and the output is a set of 2944 candidate results, each containing the following (a decoding sketch follows the list):
- score
- bounding box (normalized to pre-determined anchor boxes)
- keypoints (7 keypoints for palm detector)
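The scores and boxes are not directly usable; each candidate has to be decoded against a pre-computed anchor, and the score squashed with a sigmoid. Here is a simplified sketch of that decoding step (the actual post-processing in blaze_app_python also decodes the 7 keypoints and applies non-max suppression; shapes assume the 0.07 palm detector):

# Simplified sketch of decoding the 2944 raw candidates against their anchors.
import numpy as np

def decode_boxes(raw_boxes, anchors, scale=256.0):
    """raw_boxes: (2944, 18) regressor output, anchors: (2944, 4) as (x_center, y_center, w, h)."""
    boxes = np.zeros((raw_boxes.shape[0], 4), dtype=np.float32)
    boxes[:, 0] = raw_boxes[:, 0] / scale * anchors[:, 2] + anchors[:, 0]   # x center
    boxes[:, 1] = raw_boxes[:, 1] / scale * anchors[:, 3] + anchors[:, 1]   # y center
    boxes[:, 2] = raw_boxes[:, 2] / scale * anchors[:, 2]                   # width
    boxes[:, 3] = raw_boxes[:, 3] / scale * anchors[:, 3]                   # height
    return boxes

def decode_scores(raw_scores):
    """Squash the classifier output into probabilities."""
    return 1.0 / (1.0 + np.exp(-np.clip(raw_scores, -100.0, 100.0)))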
The following block diagram illustrates details of the layers for the model. I have grouped together repeating patterns as "BLAZE BLOCK A", "BLAZE BLOCK B", and "BLAZE BLOCK C", showing the details only for the first occurrence.
The following block diagram is the same as the previous one, but this time shows the details of the "BLAZE BLOCK B" patterns, which will require further discussion during the deployment phase.
As we saw previously in the "Vitis-AI 3.5 Flow Overview" section, the deployment phase starts with an inspection of the model in order to determine whether its layers are supported by the Vitis-AI compiler.
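This inspection can be scripted with the vai_q_pytorch Inspector API inside the Vitis-AI PyTorch docker. Here is a minimal sketch (the target architecture string, file names, and model class follow the converted PyTorch repositories and should be adapted to your model and DPU):

# Minimal sketch of model inspection with the vai_q_pytorch Inspector
# (run inside the Vitis-AI 3.5 PyTorch docker).
import torch
from pytorch_nndct.apis import Inspector
from blazepalm import BlazePalm          # converted PyTorch model definition

model = BlazePalm()
model.load_state_dict(torch.load("blazepalm.pth", map_location="cpu"))
model.eval()

dummy_input = torch.randn(1, 3, 256, 256)          # NCHW, matching the model's input resolution

# The target can be specified by DPU architecture name (here, the ZCU104's B4096 DPU).
inspector = Inspector("DPUCZDX8G_ISA1_B4096")
inspector.inspect(model, (dummy_input,), device=torch.device("cpu"),
                  output_dir="inspect", image_format="png")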
The following Jupyter notebook contains the results of my initial model inspection:
This notebook, executed within the Vitis-AI 3.5 docker container for PyTorch, produced the following results, which indicate that three specific instances of the PAD layer are not supported:
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
**************************************************
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: null
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA1_B4096
[UNILOG][INFO] Graph name: BlazePalm, with op num: 730
[UNILOG][INFO] Begin to compile...
[UNILOG][WARNING] xir::Op{name = BlazePalm__BlazePalm_Sequential_backbone1__BlazeBlock_9__ret_51, type = pad-fix} has been assigned to CPU: [DPU does not support CONSTANT mode. (only support SYMMETRIC mode for all devices and CONSTANT mode for some DPUv4e devices)].
[UNILOG][WARNING] xir::Op{name = BlazePalm__BlazePalm_Sequential_backbone1__BlazeBlock_17__ret_103, type = pad-fix} has been assigned to CPU: [DPU does not support CONSTANT mode. (only support SYMMETRIC mode for all devices and CONSTANT mode for some DPUv4e devices)].
[UNILOG][WARNING] xir::Op{name = BlazePalm__BlazePalm_Sequential_backbone2__BlazeBlock_0__ret_155, type = pad-fix} has been assigned to CPU: [DPU does not support CONSTANT mode. (only support SYMMETRIC mode for all devices and CONSTANT mode for some DPUv4e devices)].
[UNILOG][INFO] Total device subgraph number 10, DPU subgraph number 4
[UNILOG][INFO] Compile done.
...
If we trace this back to our block diagram, these are PAD layers contained in our "BLAZE BLOCK B" patterns, as shown in red below:
The model contains several other PAD layers which have supported configurations, with padding in the first (top/bottom) and second (left/right) dimensions. The PAD layers which are not supported are those which are performing the padding in the third (front/back) dimension, as shown below:
These three unsupported layers result in the generation of a model implemented with 7 sub-graphs : 4 running on the DPU, and the 3 PAD layers running on the CPU. This is not ideal, since the transfer of results between the CPU and DPU will affect performance.
Before continuing with the deployment, I attempted to change the model to be fully supported by the Vitis-AI compiler. The main challenge in doing this is that I cannot re-train the model (which is the recommended way forward when unsupported layers need to be handled); the modifications I make must preserve the weights of the trained model.
My first attempt was to re-write the PAD layer as a concatenation layer:
This first modification attempt resulted in more unsupported layers, so it did not bring any advantage over the original model.
As a reference, the following jupyter notebook contains the results of my re-write attempt #1:
My second attempt was to re-write the math such that the PAD layer is avoided:
This second modification attempt also resulted in more unsupported layers, so it did not bring any advantage over the original model.
As a reference, the following jupyter notebook contains the results of my re-write attempt #2:
Ultimately, I abandoned my attempts to avoid the unsupported PAD layer, and moved on with the original PyTorch model.
Fortunately, running unsupported layers on the CPU is one of the options supported by the Vitis-AI workflow, as shown in the following flow diagram:
We will use the Custom OP flow, which allows certain layers to be executed on the CPU.
Model Deployment
The model that we deployed in the previous section did not yield good accuracy, because we used random values as calibration data for the quantization phase. This was fine for exploring the model architecture, but not for our final models.
Now that we are satisfied with our model architectures, we can deploy them with scripting, using the calibration data we have prepared.
I have prepared a script for this purpose:
This script takes three (3) arguments when invoked:
- name : BlazePalm, BlazeHandLandmark, etc...
- resolution : input size (e.g. 256)
- process : inspect, quantize, all
The name argument indicates which model we are deploying, such as BlazePalm for the palm detector or BlazeHandLandmark for the hand landmark models. The resolution indicates the input size to the model.
These two arguments will determine which calibration dataset to use for the quantization. For example:
- name=BlazePalm, size=256 => calib_palm_detection_256_dataset.npy
- name=BlazeHandLandmark, size=256 => calib_hand_landmark_256_dataset.npy
The process argument indicates which task to run: specify "inspect" to run only the inspection, or "quantize" or "all" to quantize the model. The model compilation must then be invoked using the vai_c_xir command after quantization.
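Under the hood, the quantize step follows the standard vai_q_pytorch flow. A condensed sketch is shown below (simplified, not the actual script; file and class names follow the converted PyTorch repositories, and the input normalization depends on the model):

# Condensed sketch of the quantization flow used for each model (simplified; see
# vitisai_pytorch_flow.py in the repository for the actual script).
import numpy as np
import torch
from pytorch_nndct.apis import torch_quantizer
from blazepalm import BlazePalm                     # converted PyTorch model definition

model = BlazePalm()
model.load_state_dict(torch.load("blazepalm.pth", map_location="cpu"))
model.eval()

calib = np.load("calib_palm_detection_256_dataset.npy")    # (N, 256, 256, 3) uint8
dummy = torch.randn(1, 3, 256, 256)

# Pass 1 : calibration - run the calibration samples through the quantized model.
quantizer = torch_quantizer("calib", model, (dummy,), output_dir="quantize_result")
quant_model = quantizer.quant_model
with torch.no_grad():
    for img in calib[:1000]:
        # convert to NCHW float; adjust the scaling to whatever range the model expects
        x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).float() / 255.0
        quant_model(x)
quantizer.export_quant_config()

# Pass 2 : test - re-create the quantizer and export the .xmodel for vai_c_xir.
quantizer = torch_quantizer("test", model, (dummy,), output_dir="quantize_result")
quant_model = quantizer.quant_model
with torch.no_grad():
    quant_model(dummy)
quantizer.export_xmodel(deploy_check=False)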
I have provided a second script, which calls the vitisai_pytorch_flow.py script to quantize each model to be deployed, and then invokes the compilation command for each DPU architecture being targeted.
You will want to modify the following lists before execution:
- model_list : specify which model(s) you want to deploy
- dpu_arch_list : specify which DPU architecture(s) you want to target
Below is a modified version of the script that will deploy the palm detection and hand landmarks models for the B512 and B128 DPU architectures.
# PyTorch models
model_palm_detector_v0_07=("BlazePalm", 256)
model_hand_landmark_v0_07=("BlazeHandLandmark", 256)
model_face_detector_v0_07_front=("BlazeFace", 128)
model_face_detector_v0_07_back=("BlazeFaceBack",256)
model_face_landmark_v0_07=("BlazeFaceLandmark", 192)
model_pose_detector_v0_07=("BlazePose", 128)
model_pose_landmark_v0_07=("BlazePoseLandmark", 256)
model_list=(
model_palm_detector_v0_07[@]
model_hand_landmark_v0_07[@]
)
# Versal AI Edge
dpu_c20b14=("C20B14","./arch/C20B14/arch-c20b14.json")
dpu_c20b1=("C20B1","./arch/C20B14/arch-c20b1.json")
# Zynq-UltraScale+
dpu_b4096=("B4096","./arch/B4096/arch-zcu104.json")
dpu_b3136=("B3136","./arch/B3136/arch-kv260.json")
dpu_b2304=("B2304","./arch/B2304/arch-b2304-lr.json")
dpu_b1152=("B1152","./arch/B1152/arch-b1152-hr.json")
dpu_b512=("B512","./arch/B512/arch-b512-lr.json")
dpu_b128=("B128","./arch/B128/arch-b128-lr.json")
#
dpu_arch_list=(
dpu_b512[@]
dpu_b128[@]
)
model_count=${#model_list[@]}
#echo $model_count
dpu_arch_count=${#dpu_arch_list[@]}
#echo $dpu_arch_count
# Model loop : quantize each model, then compile it for each target DPU architecture
for ((i=0; i<$model_count; i++))
do
    model=${!model_list[i]}
    model_array=(${model//,/ })
    model_name=${model_array[0]}
    input_resolution=${model_array[1]}
    echo python3 vitisai_pytorch_flow.py --name ${model_name} --resolution ${input_resolution} --process all
    python3 vitisai_pytorch_flow.py --name ${model_name} --resolution ${input_resolution} --process all | tee deploy_${model_name}_quantize.log
    # rename the quantized BlazeFace outputs for the back-camera (BlazeFaceBack) model
    if [ ${model_name} == "BlazeFaceBack" ]
    then
        mv quantize_result/BlazeFace_int.pt quantize_result/BlazeFaceBack_int.pt
        mv quantize_result/BlazeFace_int.onnx quantize_result/BlazeFaceBack_int.onnx
        mv quantize_result/BlazeFace_int.xmodel quantize_result/BlazeFaceBack_int.xmodel
    fi
    # DPU architecture loop
    for ((j=0; j<$dpu_arch_count; j++))
    do
        dpu=${!dpu_arch_list[j]}
        dpu_array=(${dpu//,/ })
        dpu_arch=${dpu_array[0]}
        dpu_json=${dpu_array[1]}
        echo vai_c_xir -x ./quantize_result/${model_name}_int.xmodel -a ${dpu_json} -o ./models/${model_name}/${dpu_arch} -n ${model_name}
        vai_c_xir -x ./quantize_result/${model_name}_int.xmodel -a ${dpu_json} -o ./models/${model_name}/${dpu_arch} -n ${model_name} | tee deploy_${model_name}_compile.log
    done
done
This script must be executed in the Vitis-AI 3.5 docker container for PyTorch. Launch the docker from the "blaze_tutorial" directory as follows:
$ ./vitis-ai/docker_run.sh xilinx/vitis-ai-pytorch-cpu
Inside the Vitis-AI 3.5 docker, launch the deploy_models.sh script as follows:
$ cd vitis-ai
$ source ./deploy_models.sh
When complete, the compiled models will be located in the following sub-directories:
- models/BlazePalm/B128/BlazePalm.xmodel
- models/BlazePalm/B512/BlazePalm.xmodel
- models/BlazeHandLandmark/B128/BlazeHandLandmark.xmodel
- models/BlazeHandLandmark/B512/BlazeHandLandmark.xmodel
For convenience, I have archived the compiled models for Vitis-AI in the following archive:
- Vitis-AI models : blaze_vitisai_compiled_models.zip
In order to execute the models, we need to consider that each one of our models has multiple sub-graphs, as shown below:
The CPU sub-graphs (implementing the unsupported PAD layer) are running on the Zynq UltraScale+ ARM processor(s) in the PS. The DPU sub-graphs are running on the DPU engine that is implemented in the PL. The transfers between the CPU and DPU sub-graphs occur on high-speed AXI interconnect.
Vitis-AI provides two methods for accomplishing the scheduling of these multiple sub-graphs:
- manual scheduling : user invokes each of the DPU sub-graphs individually, and explicitly handles the transfers to/from each DPU sub-graph
- graph runner : user invokes the model as a single entity, and the graph runner handles the sub-graph details, including transfers
I went through the exercise of implementing both versions, and now appreciate the convenience provided by the graph runner.
For reference, here are links to initial versions of model execution using manual scheduling versus using the graph runner:
The take-away is that the code for manual scheduling is much more complex than the graph runner version.
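As a rough illustration of the graph runner approach, a minimal Python inference loop looks something like this (simplified from the Vitis-AI graph_runner examples; pre-processing, input quantization, and output decoding are omitted):

# Minimal sketch of running a multi-subgraph .xmodel with the graph runner.
import numpy as np
import xir
import vitis_ai_library

graph = xir.Graph.deserialize("models/BlazePalm/B512/BlazePalm.xmodel")
runner = vitis_ai_library.GraphRunner.create_graph_runner(graph)

input_buffers = runner.get_inputs()
output_buffers = runner.get_outputs()

input_shape = tuple(input_buffers[0].get_tensor().dims)            # e.g. (1, 256, 256, 3)
fix_point = input_buffers[0].get_tensor().get_attr("fix_point")    # scale for float -> int8

# Copy the pre-processed (and quantized) image into the input tensor buffer.
input_data = np.asarray(input_buffers[0])
input_data[0] = np.zeros(input_shape[1:], dtype=input_data.dtype)  # placeholder image

# The graph runner schedules the DPU and CPU sub-graphs, and the transfers between them.
job_id = runner.execute_async(input_buffers, output_buffers)
runner.wait(job_id)

raw_output = np.asarray(output_buffers[0])                         # decode scores/boxes from here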
For the purpose of this use case (support for PyTorch models, and Vitis-AI models), the "blaze_app_python" application was augmented with the following inference targets:
My final inference code for the Vitis-AI models can be found in the "blaze_app_python" repository, under the blaze_vitisai sub-directory:
Note that the PyTorch inference can be run on a computer, while the Vitis-AI inference must be run on the Zynq UltraScale+ embedded platform (i.e. the ZUBoard).
Implementing Custom OPs
Since we are using the Custom OP flow and know that certain layers will be executed on the CPU, we need to provide the code for these unsupported layers.
Our models have the following unsupported layers:
- Palm Detector : pad-fix
- Hand Landmarks : pad-fix, eltwise-fix
Fortunately for us, these layers already exist among the graph runner's supported CPU tasks. They can be found at the following locations on our embedded platform:
- pad-fix : /usr/lib/libvart_op_imp_pad-fix.so
- eltwise-fix : /usr/lib/libvart_op_imp_eltwise-fix.so
If we try to run our models with the graph runner "as is", however, we get the following run-time errors from these existing CPU tasks:
pad-fix
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0326 20:22:02.581987 19129 pad-fix.cpp:46]
Check failed: input_shape[3] == output_shape[3] (32 vs. 64)
*** Check failure stack trace: ***
Aborted
eltwise-fix
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0326 20:19:26.838560 19071 eltwise-fix.cpp:35]
unsupported eltwise type: DIV
*** Check failure stack trace: ***
Aborted
For the case of the pad-fix layer, it does not support padding in the third dimension.
I have modified the following original code:
To support padding in the third dimension in a modified version in my repo:
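Conceptually, the extra case the modified pad-fix has to handle is constant padding of the channel (last) dimension of an NHWC tensor. In NumPy terms, for the 32-versus-64 channel mismatch reported above:

# Conceptual NumPy equivalent of the extra case added to pad-fix:
# constant-pad the channel (last) dimension of an NHWC tensor, e.g. 32 -> 64 channels.
import numpy as np

x = np.random.randint(-128, 128, size=(1, 32, 32, 32), dtype=np.int8)   # N, H, W, C=32
extra_channels = 64 - x.shape[3]
y = np.pad(x, ((0, 0), (0, 0), (0, 0), (0, extra_channels)), mode="constant")
print(y.shape)   # (1, 32, 32, 64) - matches the 32-vs-64 mismatch in the check failure above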
For the case of the eltwise-fix layer, it supports ADD and MULT operations, but not DIV.
I have modified the following original code:
To support the DIV operation in a modified version in my repo:
The face landmarks model has an additional unsupported layer called "prelu". When attempting to run this model, the following run-time error will occur:
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0821 17:15:49.156904 1126 op_imp.cpp:110] [UNILOG][FATAL][VAILIB_CPU_RUNNER_OPEN_LIB_ERROR][dlopencan not open lib!] lib=libvart_op_imp_prelu.so;error=libvart_op_imp_prelu.so: cannot open shared object file: No such file or directory;op=xir::Op{name =
BlazeFaceLandmark__BlazeFaceLandmark_Sequential_backbone1__PReLU_1__ret_9, type = prelu}
*** Check failure stack trace: ***
Aborted
Unfortunately, there is no existing code for this CPU task, so a completely new library named libvart_op_imp_prelu.so would have to be created. I did not implement this layer; let me know in the comments if you are interested in using the accelerated face pipeline.
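For reference, the operation such a library would have to implement is simple: PReLU is a leaky rectifier with a learned per-channel slope (alpha). A NumPy illustration:

# What a libvart_op_imp_prelu.so CPU task would have to compute: PReLU with a learned
# per-channel slope (alpha), shown here in NumPy for an NHWC tensor.
import numpy as np

def prelu(x, alpha):
    # alpha has one entry per channel (last dimension in NHWC layout)
    return np.where(x > 0, x, alpha * x)

x = np.array([[-2.0, -0.5, 1.0, 3.0]], dtype=np.float32)
alpha = np.array([0.25, 0.25, 0.25, 0.25], dtype=np.float32)
print(prelu(x, alpha))   # [[-0.5 -0.125 1. 3.]]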
Installing the Python application on ZUBoard
The Python application can be accessed from the following GitHub repository:
git clone https://github.com/AlbertaBeef/blaze_app_python
cd blaze_app_python
The Python demo application requires certain packages, which can be installed as follows:
pip3 install tflite_runtime matplotlib plotly kaleido numpy==1.23
In order to successfully use the Python demo with the original TFLite models, they need to be downloaded from the Google web site:
cd blaze_tflite/models
source ./get_tflite_models.sh
cd ../..
In order to successfully use the Python demo with the Vitis-AI models, they need to be downloaded as follows:
cd blaze_vitisai
source ./models/get_dpu_models.sh
unzip -o blaze_vitisai_compiled_models.zip models/Blaze*
cd ..
Although I provide pre-compiled models for the face/pose detection and landmark models, only the palm detection and hand landmark models are currently working with the Python demo application.
Finally, we also need to install the modified versions of the CPU tasks required by the graph runner. This can be done as follows:
cd blaze_vitisai/cpu_tasks
cd op_pad-fix
source ./install_pad-fix_custom.sh
cd ..
cd op_eltwise-fix
source ./install_eltwise-fix_custom.sh
cd ..
cd ../..
You are all set !
Launching the Python application on ZUBoard
As we already saw in part 1, the Python application can launch many variations of the dual-inference pipeline, which can be filtered with the following arguments:
- --blaze : hand | face | pose
- --target : blaze_vitisai | {others not listed for now}
- --pipeline : specific name of pipeline (can be queried with --list argument)
In order to display the complete list of supported pipelines, launch the Python script as follows:
root@zub1cg-sbc-2022-2:~/blaze_app_python# python3 blaze_detect_live.py --list
[INFO] user@hosthame : root@zub1cg-sbc-2022-2
[INFO] blaze_tflite supported ...
[INFO] blaze_pytorch NOT supported ...
[INFO] blaze_vitisai supported ...
[INFO] DPU Architecture : B512
...
Command line options:
--input :
--image : False
--blaze : hand,face,pose
--target : blaze_tflite,blaze_pytorch,blaze_vitisai,blaze_hailo
--pipeline : all
--list : True
--debug : False
--withoutview : False
--profilelog : False
--profileview : False
--fps : False
List of target pipelines:
...
04 vai_hand_v0_07 blaze_vitisai/models/blazepalm/B512/blazepalm.xmodel
blaze_vitisai/models/blazehandlandmark/B512/blazehandlandmark.xmodel
...
In order to launch the Vitis-AI pipeline for hand detection and landmarks, with output to the DisplayPort monitor, launch the Python script as follows:
export DISPLAY=:0.0
python3 blaze_detect_live.py --pipeline=vai_hand_v0_07
This will launch the 0.07 version of the model, compiled for Vitis-AI, as shown below:
The previous video has not been accelerated. It shows the frame rate to be approximately 12 fps when no hands are detected (one model running : palm detection), and approximately 8 fps when one hand has been detected (two models running : palm detection and hand landmarks).
It is worth noting that this is running with a single-threaded python script. There is an opportunity for increased performance with a multi-threaded implementation. While the graph runner is waiting for transfers from one model's sub-graphs, another (or several other) model(s) could be launched in parallel...
There is also an opportunity to accelerate the rest of the pipeline with C++ code...
Benchmarking the models on ZUBoard
For reasons I have not yet resolved, the "--profileview" argument does not work well on the ZUBoard, so we will use the "--profilelog" argument instead.
The profiling functionality uses a test image that can be downloaded from Google as follows:
source ./get_test_images.sh
The following commands can be used to generate profile results for the vai_hand_v0_07 pipeline using the Vitis-AI models, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=vai_hand_v0_07 --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_zuboard_vai_hand_v0_07_b512.csv
The following commands can be used to generate profile results for the tfl_hand_v0_07 pipeline using the TFLite models, and the test image:
rm blaze_detect_live.csv
python3 blaze_detect_live.py --pipeline=tfl_hand_v0_07 --image --withoutview --profilelog
mv blaze_detect_live.csv blaze_detect_live_zuboard_tfl_hand_v0_07.csv
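If you prefer to post-process the logs in Python, the two .csv files can be averaged and compared with a few lines of pandas. A rough sketch (the column names and numeric handling are assumptions; check the header of the generated .csv files):

# Rough sketch for averaging the profiling logs with pandas.
# The column handling below is an assumption - check the header of the blaze_detect_live .csv files.
import pandas as pd

vai = pd.read_csv("blaze_detect_live_zuboard_vai_hand_v0_07_b512.csv")
tfl = pd.read_csv("blaze_detect_live_zuboard_tfl_hand_v0_07.csv")

# Average the numeric (latency) columns for each run, then compare.
vai_mean = vai.mean(numeric_only=True)
tfl_mean = tfl.mean(numeric_only=True)
print(pd.DataFrame({"tflite": tfl_mean, "vitis-ai": vai_mean, "speedup": tfl_mean / vai_mean}))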
The results of both .csv files were averaged, then plotted using Excel:
We can see that the Vitis-AI models provide the following performance increase:
- palm detection (model only) : 14x faster
- hand landmark (model only) : 10x faster
- palm detection + hand landmark (full pipeline) : 8x faster
Again, it is worth noting that these benchmarks have been taken with a single-threaded python script. There is additional opportunity for acceleration with a multi-threaded implementation. While the graph runner is waiting for transfers from one model's sub-graphs, another (or several other) model(s) could be launched in parallel...
There is also an opportunity to accelerate the rest of the pipeline with C++ code...
Benchmarking the models for various Vitis-AI 3.5 platforms
In order to get a better sense of the acceleration achieved with Vitis-AI 3.5, I decided to perform similar profiling on the following platforms:
- ZUBoard : dual-Cortex-A53 ARM processor / DPU (B512 and B128)
- ZCU104 : quad-Cortex-A53 ARM processors / DPU (dual-B4096)
- VEK280 : dual-Cortex-A72 ARM processors / DPU (C20B1 and C20B14)
In order to determine the acceleration achieved on each platform, the reference TFLite models needed to be profiled as well. Although we are only accelerating version 0.07 of the palm detection + hand landmarks pipeline, I have profiled the 0.07 and 0.10 versions of the models for reference:
Next, I profiled the 0.07 versions of the models deployed with Vitis-AI, and compared with the reference TFLite models:
If we analyze these results per platform, we can observe the following acceleration:
If we plot the acceleration ratio of the execution times for the Vitis-AI models with respect to the TFLite models for each platform, we get the following results:
The uncontested winner is the VEK280, even with the smallest version of its DPU engine (C20B1). Both of its TFLite and Vitis-AI models have the fastest performance.
However, if we look at acceleration ratio, we can observe the greatest acceleration on the ZCU104 platform. Specifically 26X faster for the palm detection model.
This is not the performance I was targeting, but significant acceleration nonetheless !
Known Issues
The current version of this project has models for the following pipelines:
- palm detection : working
- hand landmarks : working
- face detection : working
- face landmarks : not working (due to missing "prelu" cpu task)
- pose detection : status unknown (I was not able to find the TFLite version of the 0.07 model)
- pose landmarks : status unknown
The following platforms have been tested:
- ZUBoard with B128 : working
- ZUBoard with B512 : working
- ZCU104 with dual-B4096 : working
- VEK280 with C20B1 : working
- VEK280 with C20B14 : not working due to unsupported batch size of 1 (code needs to be modified to provide batch size of 14)
I hope this project will inspire you to implement your own custom application.
What applications would you like to see built on top of these foundational MediaPipe models ?
Do you have a Vitis-AI 3.0 or Vitis-AI 3.5 platform on which you would like to try these models ?
Let me know in the comments...
Acknowledgements
I want to thank Google (https://www.hackster.io/google) for making the following available publicly:
- MediaPipe
A big thanks to AMD for their Vitis-AI 3.5 framework:
- [AMD] Vitis-AI 3.5 Documentation : https://xilinx.github.io/Vitis-AI/3.5/html/index.html
I want to thank the following developers for making their TFLite to PyTorch conversion work available publicly:
- [Vidur Satija] BlazePalm : vidursatija/BlazePalm
- [Matthijs Hollemans] BlazeFace-PyTorch : hollance/BlazeFace-PyTorch
- [Zak Murez] MediaPipePyTorch : zmurez/MediaPipePytorch
- 2024/08/26 - Initial Version
- [Google] MediaPipe Solutions Guide : https://ai.google.dev/edge/mediapipe/solutions/guide
- [Vidur Satija] BlazePalm : vidursatija/BlazePalm
- [Matthijs Hollemans] BlazeFace-PyTorch : hollance/BlazeFace-PyTorch
- [Zak Murez] MediaPipePyTorch : zmurez/MediaPipePytorch
- [AMD] Vitis-AI 3.5 Documentation : https://xilinx.github.io/Vitis-AI/3.5/html/index.html
- [AlbertaBeef] blaze_app_python : AlbertaBeef/blaze_app_python
- [AlbertaBeef] blaze_tutorial : AlbertaBeef/blaze_tutorial