Introduction: The latest deep learning models combined with state-of-the-art hardware can perform human pose estimation (HPE) in real time. HPE refers to the estimation of a kinematic model of the human body from image data. The Interdisciplinary Center for Artificial Intelligence (ICAI) at the University of Applied Sciences Eastern Switzerland (OST), in collaboration with VRM Switzerland, has adapted the well-known OpenPose network for HPE and made it computationally more efficient. For this project, we call this network ICAIPose. The current prototype, which uses a multi-camera system with a first-class but computationally intensive deep learning model, runs with sufficient performance on multiple graphics processing units (GPUs).
For the broad adoption of HPE in the therapeutic environment, a small and cost-effective system is desired. Therefore, the pose-tracking system should run on edge devices.
Aim: In this project, ICAIPose should be implemented on the FPGA edge device Kria KV260 from AMD-Xilinx. As ICAIPose was designed for GPUs, the effort required to run such a network on an FPGA and the resulting performance impact are of major interest.
Approach: The application requires a camera interface and a deep learning processing unit. To test these hardware parts, a given example project that uses them was run on the Kria board first. Then, AMD-Xilinx's Vitis AI is used to compile the ICAIPose network, with minor adjustments, for the Deep-Learning Processor Unit (DPU) on the FPGA. The included Vitis AI Runtime Engine with its Python API communicates with the DPU via an embedded Linux running on the FPGA's microprocessor.
Conclusion: ICAIPose is a very large neural network, requiring more than 100 GOps to process one frame. Nevertheless, a throughput of 8 frames per second could be achieved on the KV260. The GPU-based NVIDIA Jetson Xavier NX, which costs more than twice as much as the Kria board, achieves a similar frame rate.
The successful implementation of ICAIPose on an edge device with promising performance opens the field for broad applications in the therapeutic environment.
The Vitis AI framework from AMD-Xilinx has been extensively tested and shows its strengths, but also some teething problems. For running deep neural networks on FPGAs, Vitis AI is a framework with a good trade-off between development time and performance. It should be considered before implementing hardware-accelerated algorithms in HDL or HLS.
Prerequisites
- Linux host PC with Vitis AI installed
- Knowledge of the Vitis AI workflow
- Internet access for the KV260
The usual output of an HPE network is a set of confidence maps, one for each keypoint of the human pose. For single-person HPE, the maximum of each confidence map is located and the corresponding keypoint is assigned to that position.
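As a minimal sketch (assuming the confidence maps are available as a NumPy array of shape (H, W, K); the array and function names are illustrative and not part of ICAIPose's actual API), the keypoints can be extracted like this:
import numpy as np

def extract_keypoints(confidence_maps):
    # confidence_maps: array of shape (H, W, K), one map per keypoint
    keypoints = []
    for k in range(confidence_maps.shape[-1]):
        cmap = confidence_maps[..., k]
        row, col = np.unravel_index(np.argmax(cmap), cmap.shape)
        keypoints.append((row, col, cmap[row, col]))
    return keypoints

# Example with random data standing in for a network output
maps = np.random.rand(64, 64, 17)
print(extract_keypoints(maps)[:3])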
The camera interface is an important part of the design. With the Kria KV260 Basic Accessory Pack, a small camera is included.
AMD-Xilinx provides an example application as part of the Kria™ KV260 Vision AI Starter Kit Applications.
The block design of the smart camera application shows that the hardware platform contains everything needed for this project, including the hardware interface for the camera and the DPU. This example application can be used as a base design to run custom Vitis AI models with the camera.
First, all versions are carefully checked to make sure they match:
- For Vitis AI 1.4 and earlier versions, the board image for the KV260 is 2020.2.
- This requires using the smartcam app, which also uses the 2020.2 board image (not the latest version).
- The Vitis AI version of the 2020.2 smartcam platform is Vitis AI 1.3.0.
Follow these instructions to install the smartcam app on the KV260 (up to and including section 5).
Connect the KV260 board to a local network using the ethernet port.
While being connected via UART/JTAG, check the assigned IP address of the ethernet (eth0) port.
ifconfig
The output of that command will look similar to this:
eth0 Link encap:Ethernet HWaddr 00:0a:35:00:22:01
inet addr:152.96.212.163 Bcast:152.96.212.255 Mask:255.255.255
inet6 addr: fe80::20a:35ff:fe00:2201/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:67 errors:0 dropped:0 overruns:0 frame:0
TX packets:51 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9478 (9.2 KiB) TX bytes:5806 (5.6 KiB)
Interrupt:44
In this case, the IP address is 152.96.212.163. Use this address to connect from a host PC (connected to the same network as the KV260) to the KV260 via SSH.
ssh petalinux@<ip-address>
To run all Vitis AI examples, some further installations have to be done on the KV260. Ensure that the device is connected to the internet.
X11-Forwarding
sudo dnf install packagegroup-petalinux-x11
Set the display environment
export DISPLAY=:0.0
Vitis AI
sudo dnf install packagegroup-petalinux-vitisai
OpenCV
sudo dnf install packagegroup-petalinux-opencv
Tar
sudo dnf install xz
Vitis AI Runtime
sudo wget https://www.xilinx.com/bin/public/openDownload?filename=vitis-ai-runtime-1.3.0.tar.gz
sudo tar -xzvf openDownload\?filename\=vitis-ai-runtime-1.3.0.tar.gz
cd vitis-ai-runtime-1.3.0/aarch64/centos/
sudo bash setup.sh
The Vitis AI Runtime (VART) expects the location of a DPU xclbin file in the vart.conf file. In this setup, the corresponding xclbin file is the one from the smartcam app. Update vart.conf and the xclbin file with the following commands.
echo "firmware: /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin" | sudo tee /etc/vart.conf
sudo cp /lib/firmware/xilinx/kv260-smartcam/kv260-smartcam.xclbin /usr/lib/
sudo mv /usr/lib/kv260-smartcam.xclbin /usr/lib/dpu.xclbin
Before a Vitis AI example can be run, the corresponding smartcam application must be loaded (after every boot).
sudo xmutil unloadapp
sudo xmutil loadapp kv260-smartcam
When the KV260-smartcam app is loaded, the camera can be tested with X11-forwarding with the following GStreamer command:
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! ximagesink
If a display is connected via HDMI (1920x1200 in this case), the camera can be tested as well. Please change the width and height parameters according to your connected display.
gst-launch-1.0 mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=1920, height=1200, format=NV12, framerate=30/1 ! kmssink driver-name=xlnx plane-id=39 sync=false fullscreen-overlay=true
Vitis AI Model Zoo
In the next project step, the camera system is tested in combination with Vitis AI. With the wide collection of pre-trained neural networks from the Vitis AI Model Zoo, an example can be chosen.
Hourglass is an HPE network with the following properties:
cf_hourglass_mpii_256_256_10.2G_2.0
- Description: Pose Estimation Model with Hourglass
- Input size: 256x256
- Float ops: 10.2G
- Task: pose estimation
- Framework: caffe
- Prune: 'no'
- Newest version: Vitis AI 2.0
The precompiled version of the model for the KV260 has been compiled with Vitis AI 2.0 for the DPU configuration B4096. We use the DPU configuration B3136. Therefore, the hourglass model has to be recompiled with the caffe workflow for the corresponding DPU and the correct Vitis AI version 1.3.0 (Docker image: xilinx/vitis-ai-cpu:1.3.411).
The DPU fingerprint and the corresponding arch.json file can be found in the smartcam documentation.
{
"fingerprint":"0x1000020F6014406"
}
The newly compiled model files can then be copied onto the KV260.
The test application (test_video_hourglass) provided by the Vitis AI Library is used to run the model. To do so, use the prebuilt files provided in this project or compile the test application for the KV260 with the cross-compilation system environment on the host PC (follow the Vitis AI instructions).
Download and extract the prebuilt files on the KV260.
wget https://github.com/Nunigan/HardwareAcceleratedPoseTracking/raw/main/prebuilt.tar.xz
tar -xf prebuilt.tar.xz
Go to the hourglass folder
cd prebuilt/hourglass/
The GStreamer string from the camera interface is used as the input device. With the following command, the program runs with two threads.
./test_video_hourglass hourglass_kv.xmodel "mediasrcbin media-device=/dev/media0 v4l2src0::io-mode=dmabuf v4l2src0::stride-align=256 ! video/x-raw, width=256, height=256, format=NV12, framerate=30/1 ! videoconvert ! appsink" -t 2
Hourglass runs at 30 fps. Note that the limiting factor is the camera, not the neural network.
Vitis AI
The main part of the project was to take a neural network designed for a conventional GPU implementation with Tensorflow and try to run it on an FPGA.
ICAIPose is a fairly large network with about 11 million learnable parameters and 103 GOps to process an image.
The original network is composed of the following layers:
Conv2D
PReLU activation function
Concatenate
UpSampling2D
DepthwiseConv2D
MaxPooling2D
For use with Vitis AI, one has to check whether all layers of the neural network are supported by Vitis AI (see the corresponding user guide).
All layers but the PReLU activation function are supported. The Parametric ReLU is very similar to the Leaky ReLU function (see the following figure), except that the leak term is a learnable parameter. Vitis AI supports Leaky ReLU with a fixed leak term of 0.1.
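A minimal sketch of this substitution (assuming a plain Keras layer stack; the layer sizes are illustrative and not ICAIPose's actual architecture):
import tensorflow as tf
from tensorflow import keras

x_in = keras.Input(shape=(256, 256, 3))

# Original style: PReLU with a learnable leak term (not supported by the DPU)
x = keras.layers.Conv2D(64, 3, padding="same")(x_in)
x = keras.layers.PReLU(shared_axes=[1, 2])(x)

# DPU-friendly style: Leaky ReLU with the fixed leak term 0.1 supported by Vitis AI
y = keras.layers.Conv2D(64, 3, padding="same")(x_in)
y = keras.layers.LeakyReLU(alpha=0.1)(y)

model = keras.Model(x_in, [x, y])
model.summary()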
Some challenges came with the introduction of the Leaky ReLU activation function. Unfortunately, Leaky ReLU is not supported in the Tensorflow 2 (TF2) workflow of Vitis AI 1.3 (it is supported by newer versions). Therefore, the Tensorflow 1 (TF1) workflow is used.
ICAIPose is written and trained in the TF2 (Keras) framework. A model saved as a Keras h5 file can be converted into a TF1 checkpoint and meta graph with the following code.
# Run with TensorFlow 1.x (e.g. TF 1.15), as used by the Vitis AI TF1 workflow
import tensorflow as tf
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Disable training-specific behavior (dropout, batch-norm updates)
tf.keras.backend.set_learning_phase(0)

# Load the Keras model saved from TF1
loaded_model = tf.keras.models.load_model('model.h5')

print('Keras model information:')
print(' Input names :', loaded_model.inputs)
print(' Output names:', loaded_model.outputs)
print('-------------------------------------')

tfckpt = 'train/tfchkpt.ckpt'

# Grab the TF1 session that backs the Keras model
tf_session = tf.keras.backend.get_session()

# Write out the TensorFlow checkpoint & meta graph
saver = tf.compat.v1.train.Saver()
save_path = saver.save(tf_session, tfckpt)
print(' Checkpoint created :', tfckpt)
Even though Keras is supported by both TF1 and TF2, a Keras network saved in TF2 and then converted to TF1 cannot be processed by Vitis AI. Therefore, the network is saved in Keras from TF1.
Since the model has been previously trained in TF2 (Keras), the weights must first be exported:
#Running in TF2
model = keras.models.load_model("trained_model.h5")
model.save_weights("weights.h5")
and then imported in TF1:
# Running in TF1.15
# Rebuild the ICAIPose architecture here (same layers as in TF2) so that
# `model` exists, then load the exported weights:
##########################################
model.load_weights("weights.h5")
model.save("TF1_model.h5")
After the conversion to the TF1 files, the normal Vitis AI flow (freezing, quantization and compilation) can be used.
A test dataset is used for quantization. Save the dataset in the validation_data folder.
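The TF1 quantizer expects a Python calibration function that feeds batches of preprocessed images to the input node. The sketch below assumes PNG images in the validation_data folder, a 256x256 RGB input node named input_1 and a simple 1/255 normalization (all illustrative assumptions, not the exact ICAIPose preprocessing):
# input_fn.py -- calibration input function for the TF1 quantization step (sketch)
import glob
import cv2
import numpy as np

INPUT_NODE = "input_1"        # assumed input node name
IMAGE_DIR = "validation_data"
BATCH_SIZE = 8

images = sorted(glob.glob(IMAGE_DIR + "/*.png"))

def calib_input(iter):
    # Return a feed dict {input node: batch} for calibration iteration `iter`
    batch = []
    for path in images[iter * BATCH_SIZE:(iter + 1) * BATCH_SIZE]:
        img = cv2.imread(path)
        img = cv2.resize(img, (256, 256))
        batch.append(img.astype(np.float32) / 255.0)  # assumed normalization
    return {INPUT_NODE: np.array(batch)}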
The code to run ICAIPose is adapted from an example by Mario Bergeron on GitHub. It uses the standard VART API for Python. The script run_ICAIPose_FPGA.py is used to run the model on the FPGA in a multi-threaded environment.
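For orientation, the core of such a script looks roughly like the following single-threaded sketch (assuming the Vitis AI 1.3 Python VART/XIR API; buffer handling and preprocessing are simplified placeholders, see run_ICAIPose_FPGA.py for the actual implementation):
import numpy as np
import xir
import vart

# Load the compiled model and pick the DPU subgraph
graph = xir.Graph.deserialize("own_network_256_KV260.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = [s for s in subgraphs
                if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu_subgraph, "run")

# Allocate input/output buffers with the shapes the DPU expects
input_tensor = runner.get_input_tensors()[0]
output_tensor = runner.get_output_tensors()[0]
input_data = [np.zeros(tuple(input_tensor.dims), dtype=np.float32)]
output_data = [np.zeros(tuple(output_tensor.dims), dtype=np.float32)]

# Fill input_data[0] with a preprocessed camera frame here, then run the DPU
job_id = runner.execute_async(input_data, output_data)
runner.wait(job_id)
confidence_maps = output_data[0]  # post-process into keypoints as shown earlier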
The following library must be installed to use the script:
sudo pip3 install imutils
Go to the folder prebuilt/ICAIPose and run the script.
python3 run_ICAIPose_FPGA.py <number of threads> <model_file>
For example
python3 run_ICAIPose_FPGA.py 3 own_network_256_KV260.xmodel
Close the window by pressing the letter 'q' on the host PC. The throughput performance is printed to the terminal.
Or with an HDMI display:
python3 run_ICAIPose_FPGA_hdmi.py <number of threads> <model_file> <Display Width> <Display Height>
For example
python3 run_ICAIPose_FPGA_hdmi.py 3 own_network_256_KV260.xmodel 1920 1200
It is recommended to run the application over HDMI; depending on the network connection, the X11-forwarded video stream may appear to have a lower throughput than the network actually achieves.
Results
It is interesting to know how fast the network runs on the FPGA, but it is also important to know whether HPE performance is lost due to the quantization.
Throughput Performance
ICAIPose (256x256, 103 GOps): 8 fps
With the B3136 DPU and a clock frequency of 300 MHz, the theoretical peak throughput is 940 GOps/s. Therefore, the result is in the expected range (recall: 103 GOps per image, i.e. at most 940 / 103 ≈ 9 frames per second).
For comparison, the NVIDIA Jetson Xavier NX, which is more expensive than a KV260 and has a significantly higher theoretical throughput (21 TOps), reached the same throughput of 8 fps.
Human Pose Estimation Performance
A dataset that provides more than 2000 images and the corresponding ideal confidence maps is used to test the HPE performance.
The Mean Squared Error (MSE) of the normalized confidence maps is computed by squaring the difference between every pixel and averaging the result. An example is shown in the following image: the mean of the image on the left is the MSE for the given input.
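A small sketch of this metric (the array names are illustrative):
import numpy as np

def confidence_map_mse(predicted, reference):
    # Mean squared error between two normalized confidence maps
    diff = predicted.astype(np.float32) - reference.astype(np.float32)
    return float(np.mean(diff ** 2))

# Example with random maps standing in for network output and ground truth
pred = np.random.rand(64, 64, 17)
ref = np.random.rand(64, 64, 17)
print(confidence_map_mse(pred, ref))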
We can now compare the MSE of the quantized and the float network. As additional information, the MSE of the original network with the PReLU activation function is shown.
The MSE of all images:
Float: 0.8109
Quantized INT8: 0.9332
PReLU: 0.9348
The change from PReLU to the Leaky ReLU activation function even improved the network performance a bit. The quantization has an impact on the MSE, but a rather small one: the quantized network performs as well as the unquantized PReLU network.
The Vitis AI framework from AMD-Xilinx was extensively tested and showed its strengths as well as some teething troubles. Changing the target device from GPU to FPGA was possible without losing significant performance, even though the FPGA board is cheaper. Vitis AI allows designing an efficient deep neural network application for an FPGA without knowledge of HDL or HLS.
The Kria KV260 Vision AI Starter Kit is a great choice to start with Vitis AI. The provided camera can easily be used within the PetaLinux environment.
Acknowledgment
Special thanks to the ICAI and VRM Switzerland for providing a trained version of ICAIPose.
Thanks to the Institute of Microelectronics and Embedded Systems for supporting this challenge as part of a student project.
Revision History
- 3/14/2022 - Initial release