This guide provides detailed instructions on implementing face detection and face tracking in Python on the Ultra96-V2 platform.
This tutorial builds on top of the following “Vitis-AI 1.1 flow for Avnet Vitis Platforms” two-part tutorial:
https://www.hackster.io/AlbertaBeef/vitis-ai-1-1-flow-for-avnet-vitis-platforms-part-1-007b0e
https://www.hackster.io/AlbertaBeef/vitis-ai-1-1-flow-for-avnet-vitis-platforms-part-2-f18be4
Although this tutorial specifically targets the Ultra96-V2, it can be adapted to any of the following platforms:
- Ultra96-V2 Development Board
- UltraZed-EV SOM (7EV) + FMC Carrier Card
- UltraZed-EG SOM (3EG) + IO Carrier Card
- UltraZed-EG SOM (3EG) + PCIEC Carrier Card
In this tutorial, we will build the following AI pipeline, implemented in Python, which can serve as a basis for future algorithm exploration.
There are many algorithms that can be used for face detection:
- Haar Cascade
- Histogram of Oriented Gradients (HOG) + Support Vector Machines (SVM)
- Deep Neural Networks (DNN) : Single-Shot Detectors (SSD), DenseBox, etc…
There are also several algorithms that can be used for object tracking:
- Optical flow
- Kalman filtering
- Meanshift / Camshift
One possible strategy when combining a “detection” algorithm with a “tracking” algorithm is to take advantage of the fact that “tracking” algorithms are generally faster than “detection” algorithms. Such an implementation could perform one “detection” iteration, followed by several “tracking” iterations, in order to make the best use of limited compute resources, as sketched below.
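As a rough illustration of that strategy (which, again, is not the one used in this tutorial), the scheduling could look something like the following sketch, where capture_frame(), detect_faces(), tracker and display() are hypothetical placeholders for a camera source, a detector, a tracker and a display sink:

# Sketch of the "detect once, then track for several frames" scheduling.
# capture_frame, detect_faces, tracker and display are hypothetical placeholders.
DETECT_EVERY_N = 10   # run the expensive detector every N frames

frame_count = 0
while True:
    frame = capture_frame()
    if frame_count % DETECT_EVERY_N == 0:
        # slow "detection" pass: find faces and (re-)initialize the tracker
        boxes = detect_faces(frame)
        tracker.init(frame, boxes)
    else:
        # fast "tracking" pass: update the existing boxes
        boxes = tracker.update(frame)
    display(frame, boxes)
    frame_count += 1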
In this tutorial, I will take a different strategy. Since we already have an optimized face detection algorithm (DenseBox) that runs real-time, I will perform the face “detection” on each frame, and use a much simpler tracking algorithm, also executed on each frame.
The object tracking algorithm I decided to use is a simple centroid tracker, implemented by Adrian Rosebrock at PyImageSearch:
Adrian Rosebrock, “Simple Object Tracking with OpenCV”, PyImageSearch, https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/, accessed 15 June 2020
As mentioned previously, this tutorial will reuse the existing pre-optimized densebox model for face detection. We already saw this example in the “Vitis-AI 1.1 flow for Avnet Vitis platforms” tutorial. This time, we will be calling it from a Python script instead of a C++ application. The motivation for using Python is simply that the Python language is widely used in industry for quick algorithm exploration. There is an incredible wealth of Python packages and examples that can be used to quickly prototype an idea.
This tutorial will go through the following steps:
- Step 0 – Overview of the Python scripts
- Step 1 – Create the SD card image for the Vitis-AI 1.1 enabled platform
- Step 2 – Install the tutorial files and required packages
- Step 3 – Execute the face detection and tracking Python scripts
Python implementation of Face Detection
Vitis-AI 1.1, from Xilinx, provides a development flow for AI inference on Xilinx devices. This flow includes an AI engine, called the DPU (Deep-Learning Processing Unit), along with an API for Linux applications, called VART.
This VART API is available for C++ applications, as well as Python scripts.
Most of the provided examples are in C++, while two of the classification examples are provided in Python:
- inception-v1
- resnet50
There is, however, no Python example provided for the following face detection model:
- densebox
Since the best way to understand an API is to write code that makes use of it, I embarked on the task of writing a Python version of the face detection example, making use of the densebox model from the Model Zoo.
As it turns out, I found a verification script in the Model Zoo for the cf_densebox_wider_360_640_1.11G model:
models/cf_densebox_wider_360_640_1.11G/code/test/visualTest/detect.py
This script was used as a reference for the implementation of my code.
To get started, the general format of a Python example, making use of the VART API, is the following:
dpu = runner.Runner("vitis_rundir")[0]
""" Prepare input/output buffers """
...
""" Execute model on DPU """
job_id = dpu.execute_async( inputData, outputData )
dpu.wait(job_id)
""" Retrieve output results """
...
The first line initializes the VART API and specifies the directory where the meta-data for the model can be found. In our Vitis-AI 1.1 platform, we are interested in the 640x360 version of the densebox model, which is located in the “/usr/share/vitis_ai_library/models/densebox_640_360” directory, and has the following content:
/usr/share/vitis_ai_library/models/densebox_640_360/
│
│ densebox_640_360.elf
│ densebox_640_360.prototxt
│ meta.json
The “meta.json” file contains the meta-data for the model:
{
"target": "DPUv2",
"lib": "libvart-dpu-runner.so",
"filename": "densebox_640_360.elf",
"kernel": [ "densebox_640_360" ],
"config_file": "densebox_640_360.prototxt"
}
This file indicates that we are targeting the ”DPUv2” hardware core, using the “libvart-dpu-runner.so” API. This model has a single kernel, which is an uninterrupted sequence of layers making up the CNN model. The kernel name is “densebox_640_360”, and the executable code/data for this kernel is contained in the “densebox_640_360.elf” binary.
The prototxt file indicates important pre-processing information (mean values, scaling factor), which we will need for our Python implementation.
model {
  name : "dense_box_640x360"
  kernel {
    name: "tiling_v7_640"
    mean: 128.0
    mean: 128.0
    mean: 128.0
    scale: 1.0
    scale: 1.0
    scale: 1.0
  }
  model_type : DENSE_BOX
  dense_box_param {
    num_of_classes : 2
    nms_threshold: 0.3
    det_threshold: 0.9
  }
}
The first step is to pre-process the input image, and prepare the input/output buffers:
""" Image pre-processing """
# normalize
img = img - 128.0
# resize
img = cv2.resize(img,(inputWidth,inputHeight))
""" Prepare input/output buffers """
inputData = []
inputData.append(np.empty((inputShape), dtype=np.float32,order='C'))
inputImage = inputData[0]
inputImage[0,...] = img
outputData = []
outputData.append(np.empty((output0Shape), dtype=np.float32,order='C'))
outputData.append(np.empty((output1Shape), dtype=np.float32,order='C'))
The pre-processing involves subtracting the mean (128), scaling (in this case 1.0, so not performed), and resizing the incoming image to the model’s input size of inputWidth x inputHeight (640 x 360).
Next, an input buffer needs to be prepared that contains the input image, and two output buffers need to be allocated for the following two outputs:
- bounding boxes : 160 x 90 x {xmin, ymin, xmax, ymax}
- scores : 160 x 90 x {score0, score1}
Notice that the model output consists of 2D grids of results which are 4x smaller (in both width and height) than the input image.
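The inputShape, output0Shape, and output1Shape variables used in the snippet above, as well as the sizes used further below, are derived from the runner's tensor meta-data. The following is a minimal sketch of how these can be queried, following the pattern of the Vitis-AI 1.1 VART Python samples; the exact tensor attribute names and the NHWC dimension ordering are assumptions and may differ between Vitis-AI releases:

""" Query tensor geometry from the runner (assumes NHWC dimension ordering) """
inputTensors  = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()

inputShape  = tuple(inputTensors[0].dims)        # e.g. (1, 360, 640, 3)
inputHeight = inputTensors[0].dims[1]
inputWidth  = inputTensors[0].dims[2]

output0Shape  = tuple(outputTensors[0].dims)     # bounding boxes, e.g. (1, 90, 160, 4)
output1Shape  = tuple(outputTensors[1].dims)     # scores,         e.g. (1, 90, 160, 2)
output0Height = outputTensors[0].dims[1]
output0Width  = outputTensors[0].dims[2]
output0Size   = output0Height * output0Width * outputTensors[0].dims[3]
output1Size   = outputTensors[1].dims[1] * outputTensors[1].dims[2] * outputTensors[1].dims[3]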
The model is executed on the DPU, then the two output results are retrieved from memory.
""" Execute model on DPU """
job_id = dpu.execute_async( inputData, outputData )
dpu.wait(job_id)
""" Retrieve output results """
outputData0 = outputData[0].reshape(1,output0Size)
bboxes = np.reshape( outputData0, (-1, 4) )
#
outputData1 = outputData[1].reshape(1,output1Size)
scores = np.reshape( outputData1, (-1, 2))
The bounding box coordinates are relative to each grid position, so they need to be post-processed by adding the absolute coordinates of each grid position to the bounding box results. The following Python code implements this in a vectorized style in order to keep performance optimal:
""" Get original face boxes """
gy = np.arange(0,output0Height)
gx = np.arange(0,output0Width)
[x,y] = np.meshgrid(gx,gy)
x = x.ravel()*4
y = y.ravel()*4
bboxes[:,0] = bboxes[:,0] + x
bboxes[:,1] = bboxes[:,1] + y
bboxes[:,2] = bboxes[:,2] + x
bboxes[:,3] = bboxes[:,3] + y
The two score results for each grid location correspond to the scores associated with a bounding box being not present, and being present. These two results need to be normalized into a probability distribution which sums to 1.0. We use the softmax function to perform this step. More specifically, we use a specialized version of softmax, softmax_2, that performs the 2-class normalization for all 160x90 grid locations. Once normalized, we only keep the results for which the probability of a bounding box being present is above a certain threshold.
""" Run softmax """
softmax = softmax_2( scores )
""" Only keep faces for which prob is above detection threshold """
prob = softmax[:,1]
keep_idx = prob.ravel() > self.detThreshold
bboxes = bboxes[ keep_idx, : ]
bboxes = np.array( bboxes, dtype=np.float32 )
prob = prob[ keep_idx ]
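The softmax_2 helper is defined in the facedetect.py module. As a minimal, vectorized sketch of what such a 2-class softmax can look like (the actual implementation in facedetect.py may differ in details):

import numpy as np

def softmax_2(data):
    """ 2-class softmax, applied row-wise to an (N, 2) array of raw scores """
    # subtract the row-wise maximum for numerical stability
    e = np.exp(data - np.max(data, axis=1, keepdims=True))
    # normalize each row into a probability pair that sums to 1.0
    return e / np.sum(e, axis=1, keepdims=True)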
At this point, many bounding boxes remain, most of which are duplicates (overlapping detections) of each other. In order to remove these duplicates, the non-maximum suppression (NMS) algorithm is used, which measures the overlap (intersection over union, or IoU) of each bounding box with respect to the others.
""" Perform Non-Maxima Suppression """
face_indices = []
if ( len(bboxes) > 0 ):
    face_indices = nms_boxes( bboxes, prob, self.nmsThreshold )
faces = bboxes[face_indices]
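The nms_boxes helper is also part of the facedetect.py module. For reference, here is a minimal sketch of a greedy non-maximum suppression in the same vectorized style; the actual helper may differ in details:

import numpy as np

def nms_boxes(boxes, scores, nms_threshold):
    """ Greedy non-maximum suppression (minimal sketch).
        boxes  : (N, 4) array of [xmin, ymin, xmax, ymax]
        scores : (N,)   array of detection probabilities
        Boxes overlapping a kept box by more than nms_threshold IoU are dropped. """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]           # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the best remaining box with all the other candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # keep only the candidates whose overlap is below the threshold
        order = order[np.where(iou <= nms_threshold)[0] + 1]
    return keep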
The final step is to scale the coordinates of the detected faces back to the original input image size. For this step, I did not code in a vectorized style, but the performance impact should be negligible, since there will typically be fewer than a dozen faces.
# extract bounding box for each face
for i, face in enumerate(faces):
    xmin = max( face[0] * scale_w, 0 )
    ymin = max( face[1] * scale_h, 0 )
    xmax = min( face[2] * scale_w, imgWidth )
    ymax = min( face[3] * scale_h, imgHeight )
    faces[i] = ( int(xmin), int(ymin), int(xmax), int(ymax) )
It is left as an exercise to the reader to re-code this section in a vectorized form. Please share your implementation in the comments below.
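As one possible starting point, a vectorized version of the scaling step could look like the following sketch (assuming faces is the (N, 4) float32 array produced by the NMS step above):

# Vectorized scaling of the face coordinates back to the original image size,
# clipped to the image boundaries (a sketch of one possible implementation)
faces[:, 0] = np.clip(faces[:, 0] * scale_w, 0, imgWidth)
faces[:, 1] = np.clip(faces[:, 1] * scale_h, 0, imgHeight)
faces[:, 2] = np.clip(faces[:, 2] * scale_w, 0, imgWidth)
faces[:, 3] = np.clip(faces[:, 3] * scale_h, 0, imgHeight)
faces = faces.astype(np.int32)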
All of the above code has been encapsulated in the following class:
vitis_ai_vart/facedetect.py
This makes the demo scripts simpler to write and to read. As an example, here is an excerpt of what the code looks like for the face detection example:
...
import runner
from vitis_ai_vart.facedetect import FaceDetect
...
# Initialize Vitis-AI/DPU based face detector
dpu = runner.Runner("/usr/share/vitis_ai_library/models/densebox_640_360")[0]
dpu_face_detector = FaceDetect(dpu,detThreshold,nmsThreshold)
dpu_face_detector.start()
...
while True:
    ...
    faces = dpu_face_detector.process(frame)
    ...
# Stop the face detector
dpu_face_detector.stop()
del dpu
Face Tracking
In order to implement face tracking, the face detection is followed by a simple centroid-based object tracking algorithm. For each detected face, the centroid of the bounding box is calculated and tracked from frame to frame. For more details on this tracking implementation, refer to the original tutorial on PyImageSearch.com:
Adrian Rosebrock, “Simple Object Tracking with OpenCV”, PyImageSearch, https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/, accessed 15 June 2020
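To give an idea of how the two pieces fit together, here is a simplified sketch of combining the DPU-based face detector with the centroid tracker, using the CentroidTracker interface from PyImageSearch (its update() method returns a dictionary of object IDs mapped to centroids). The actual demo scripts add capture, display and clean-up code around this loop:

import cv2
from pyimagesearch.centroidtracker import CentroidTracker

# Create the tracker once, before the capture loop
tracker = CentroidTracker(maxDisappeared=20)

while True:
    ...
    # Detect faces with the DPU-based detector (see the excerpt above)
    faces = dpu_face_detector.process(frame)

    # Associate the new bounding boxes with existing object IDs
    objects = tracker.update(faces)

    # Annotate each tracked face with its ID at the centroid position
    for (objectID, centroid) in objects.items():
        cv2.putText(frame, "ID {}".format(objectID), (centroid[0] - 10, centroid[1] - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        cv2.circle(frame, (centroid[0], centroid[1]), 4, (0, 255, 0), -1)
    ...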
Single-Threaded vs Multi-Threaded
The main flow for the single-threaded demo scripts is illustrated in the following simplified timing diagram, where “Worker” refers to our “Face Detection” and “Face Tracking” application examples.
The maximum frame rate of the single-threaded implementation is limited by the total execution time of the entire application pipeline: Capture + Worker + Display. This is not ideal, since the CPU may sit idle while waiting for a new frame from the USB camera, and will certainly be waiting for the DPU to finish execution of each kernel.
In order to increase the frame rate, a multi-threaded implementation is also provided which breaks down the application into three main tasks:
- CaptureTask
- WorkerTask
- DisplayTask
The tasks communicate with each other via synchronized queues.
The next simplified diagram illustrates how each of the Tasks can start execution in parallel.
A single thread each is provided for the CaptureTask and the DisplayTask, while a user-configurable number of threads is provided for the WorkerTask, allowing several DPU requests to be pipelined.
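The following is a simplified sketch of that structure, using Python's threading and queue modules. The names (captureQueue, workerTask, cam, dpu_face_detector, num_threads) are illustrative only, and details such as giving each worker thread its own detector instance, preserving frame order, and shutting down cleanly are omitted from this sketch:

import threading
import queue

captureQueue = queue.Queue(maxsize=4)   # frames waiting for the worker(s)
displayQueue = queue.Queue(maxsize=4)   # processed frames waiting for display

def captureTask(cam):
    while True:
        ret, frame = cam.read()
        if not ret:
            break
        captureQueue.put(frame)          # blocks if the workers fall behind

def workerTask(face_detector):
    while True:
        frame = captureQueue.get()
        faces = face_detector.process(frame)     # DPU inference
        displayQueue.put((frame, faces))
        captureQueue.task_done()

def displayTask():
    while True:
        frame, faces = displayQueue.get()
        # ... draw the bounding boxes and display the frame ...
        displayQueue.task_done()

# one capture thread, one display thread, and several worker threads
threads  = [threading.Thread(target=captureTask, args=(cam,), daemon=True),
            threading.Thread(target=displayTask, daemon=True)]
threads += [threading.Thread(target=workerTask, args=(dpu_face_detector,), daemon=True)
            for _ in range(num_threads)]
for t in threads:
    t.start()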
Step 0 – Overview of the Python scripts
The following Python scripts are provided with this tutorial, each in a single-threaded and a multi-threaded (_mt) version:
- avnet_passthrough.py / avnet_passthrough_mt.py : camera capture and display only (no AI processing)
- avnet_face_detection.py / avnet_face_detection_mt.py : face detection
- avnet_face_tracking.py / avnet_face_tracking_mt.py : face detection + centroid-based tracking
Step 1 – Create the SD card image for the Vitis-AI 1.1 enabled platform
For detailed instructions on creating a Vitis-AI 1.1 enabled platform, refer to the “Vitis-AI 1.1 flow for Avnet Vitis platforms” two-part tutorial:
https://www.hackster.io/AlbertaBeef/vitis-ai-1-1-flow-for-avnet-vitis-platforms-part-1-007b0e
https://www.hackster.io/AlbertaBeef/vitis-ai-1-1-flow-for-avnet-vitis-platforms-part-2-f18be4
For quick instructions on creating a Vitis-AI 1.1 enabled Ultra96-V2 platform, follow these steps:
1) Download and extract the following pre-built SD card image
- ULTRA96V2 : http://avnet.me/avnet-ultra96v2-vitis-ai-1.1-image (MD5SUM = 7f54ceed152a0c704f5da18c4738b3fc)
2) Program the “Avnet-ULTRA96V2-Vitis-AI-1-1-2020-05-15.img” image to a 16GB micro SD card using Balena Etcher
3) Download the following solution archive for the Vitis-AI 1.1 tutorial, and extract to the micro SD card’s BOOT partition:
- COMMON : http://avnet.me/Avnet-COMMON-Vitis-AI-1-1-image (MD5SUM = 464ecc94368d1cb7deb184b653e740a1)
4) Boot the Ultra96-V2 board with the SD card
5) Perform the following configuration for the WAYLAND desktop, which will allow you to change the resolution of the monitor (for details, refer to the Vitis-AI 1.1 tutorial)
$ cd /mnt/runtime/WAYLAND
$ source ./install.sh
$ cat weston_append.ini >> /etc/xdg/weston/weston.ini
$ source ./change_resolution_1920x1080.sh
6) Perform the following configuration for the VART runtime
$ cd /mnt/runtime/VART
$ source ./install.sh
$ cd /mnt/runtime
$ dpkg -i vitis_ai_model_ULTRA96V2_2019.2-r1.1.1.deb
To verify the Vitis-AI 1.1 enabled platform, perform the following steps:
7) Change to a lower resolution, such as 1280x720
$ source /mnt/runtime/WAYLAND/change_resolution_1280x720.sh
8) Define the DISPLAY environment variable
$ export DISPLAY=:0.0
9) Run the C++ version of the face detection sample
$ cd /mnt/Vitis-AI-Library/overview/samples/facedetect
$ ./test_video_facedetect densebox_640_360 0
Step 2 – Install the tutorial files and required packages
The files for this tutorial can be found on the Avnet github repository:
https://github.com/Avnet/face_py_vart
With an ethernet connection enabled on your Ultra96-V2 embedded platform, use git to clone the tutorial repository:
$ cd /mnt
$ git clone https://github.com/Avnet/face_py_vart
If you do not have an ethernet connection available on your embedded platform, download the tutorial files from the repository on another machine, then copy them to the BOOT partition of your SD card, under a directory called “face_py_vart”.
This tutorial has the following project structure:
/mnt/face_py_vart/
│
│ avnet_face_detection.py
│ avnet_face_detection_mt.py
│ avnet_face_tracking.py
│ avnet_face_tracking_mt.py
│ avnet_passthrough.py
│ avnet_passthrough_mt.py
│
├───pyimagesearch
│ │ centroidtracker.py
│ │ __init__.py
│ └───__pycache__
│
└───vitis_ai_vart
│ facedetect.py
│ __init__.py
└───__pycache__
The “avnet_*.py” scripts are the main demo scripts.
The “vitis_ai_vart” directory contains the Python implementation of the VART based face detection.
The “pyimagesearch” directory contains the reused centroid tracking code from PyImageSearch.com.
The following Python package is required for this tutorial.
With an ethernet connection enabled on your Ultra96-V2 embedded platform, use “pip3 install …” to install the Python package:
$ pip3 install imutils
If you do not have an ethernet connection available on your embedded platform, download the “imutils-0.5.3.tar.gz” package from the following web site:
Then, copy it to your SD card, and install it using the following command:
$ pip3 install imutils-0.5.3.tar.gz
Step 3 – Execute the face detection and tracking Python scripts
Python scripts for the following three demos have been provided, each with a single-threaded and a multi-threaded implementation:
- passthrough (capture and display only, no AI processing)
- face detection
- face tracking
To use the Python scripts for this tutorial, navigate to the “face_py_vart” directory that you extracted on the BOOT partition of the SD card:
$ cd /mnt/face_py_vart
In order to get help on how to use each of the demo Python scripts, use the “-h” option, as described below:
$ python3 avnet_passthrough_mt.py -h
usage: avnet_passthrough_mt.py [-h] [-i INPUT] [-t THREADS]
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input camera identifier (default = 0)
  -t THREADS, --threads THREADS
                        number of worker threads (default = 4)
$ python3 avnet_face_detection_mt.py -h
usage: avnet_face_detection_mt.py [-h] [-i INPUT] [-d DETTHRESHOLD] [-n NMSTHRESHOLD] [-t THREADS]
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input camera identifier (default = 0)
  -d DETTHRESHOLD, --detthreshold DETTHRESHOLD
                        face detector softmax threshold (default = 0.55)
  -n NMSTHRESHOLD, --nmsthreshold NMSTHRESHOLD
                        face detector NMS threshold (default = 0.35)
  -t THREADS, --threads THREADS
                        number of worker threads (default = 4)
$ python3 avnet_face_tracking_mt.py -h
usage: avnet_face_tracking_mt.py [-h] [-i INPUT] [-d DETTHRESHOLD] [-n NMSTHRESHOLD] [-t THREADS]
optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input camera identifier (default = 0)
  -d DETTHRESHOLD, --detthreshold DETTHRESHOLD
                        face detector softmax threshold (default = 0.55)
  -n NMSTHRESHOLD, --nmsthreshold NMSTHRESHOLD
                        face detector NMS threshold (default = 0.35)
  -t THREADS, --threads THREADS
                        number of worker threads (default = 4)
Face Detection
To launch the multi-threaded face detection Python script:
$ python3 avnet_face_detection_mt.py -i 0 -d 0.55 -n 0.35 -t 4
Face Tracking
To launch the multi-threaded face tracking Python script:
$ python3 avnet_face_tracking_mt.py -i 0 -d 0.55 -n 0.35 -t 4
Experimenting with Parameters
If you are getting duplicate detected ROIs for the same face, you can try increasing the detection threshold, detThreshold, to a higher value using the “-d 0.90” command line argument.
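For example, to relaunch the multi-threaded face detection script with a stricter detection threshold:
$ python3 avnet_face_detection_mt.py -i 0 -d 0.90 -n 0.35 -t 4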
The centroid-based object tracker allows a detected face to disappear for a certain number of frames, so that a face that is momentarily “lost” keeps its associated “id” when it re-appears. The number of frames during which a face is tolerated as being temporarily lost, maxDisappeared, is defined in the “pyimagesearch/centroidtracker.py” script:
class CentroidTracker():
    def __init__(self, maxDisappeared=20):
Summary
This tutorial described how to access the pre-trained densebox model from the Xilinx Model Zoo for face detection in Python.
This Python-based example was augmented with a simple object tracking algorithm.
Finally, a multi-threaded implementation was used to make better use of the CPU and the DPU (hardware AI engine), in order to achieve higher throughput.
I hope this tutorial will serve as a foundation for additional exploration!