Face swap is a fun machine vision application that employs several machine learning and vision algorithms to swap two faces in a video. In this project we will start by implementing this fun application on a CPU, then use software profiling to look for parts of the computational pipeline that can be accelerated on an FPGA, and if we find any suitable parts, we'll accelerate them to improve performance.
Some of the things you'll learn throughout this project:
- A cool face swap algorithm
- How to profile your application in the Vitis IDE to find compute-intensive tasks
- How to use Vitis AI Library models to accelerate your application on FPGA
The target device used for the project is a Xilinx Zynq MPSoC ZCU104 evaluation board, but the same flow, with minor differences, can be employed on other devices.
As a hardware engineer, I was personally not very familiar with most of the vision algorithms used in face swapping, and it was fun and exciting to learn about them, so I hope you enjoy doing this project as much as I did.
Now let's get to work!
Step 1: Understanding the face swap algorithm
First, we need to choose a proper algorithm to implement. Face swapping is not a new or state-of-the-art application and there are several algorithms for it out there, the most powerful ones probably being those based on deepfakes. But due to their complexity, deepfake algorithms don't seem suitable for edge SoC devices. So to keep the project simple, I'm going for a simpler face swap algorithm, which I found in this learnopencv.com post by Satya Mallick.
To explain the algorithm in short: it first finds the faces in each frame using a face detection algorithm, then it feeds the detected faces to a 68-point face landmark detector to find 68 points on each face. Then the convex hull containing these 68 points is calculated (which is basically the contour of the face). After that, the Delaunay triangulation of these 68 points is computed (which breaks the face into small triangles that together make up its shape). Once the triangles are found, the triangles of the first face are warped onto the convex hull of the second face and vice versa, swapping the faces. Finally, for the last touch, we use seamless cloning to make the warped faces look more realistic and clean.
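To make this concrete, below is a minimal C++ sketch of two OpenCV building blocks at the heart of this pipeline: the Delaunay triangulation and the final seamless-cloning step. This is not the project's actual code; all names are illustrative, and the per-triangle warping is only summarized in a comment.
#include <opencv2/opencv.hpp>
#include <vector>

// Delaunay triangulation of the 68 landmarks; each cv::Vec6f holds the
// three (x, y) vertices of one triangle.
std::vector<cv::Vec6f> triangulate(const cv::Size& frameSize,
                                   const std::vector<cv::Point2f>& landmarks) {
  cv::Subdiv2D subdiv(cv::Rect(0, 0, frameSize.width, frameSize.height));
  for (const auto& p : landmarks) subdiv.insert(p);
  std::vector<cv::Vec6f> triangles;
  subdiv.getTriangleList(triangles);
  return triangles;
}

// Blend `warpedFace` (the other face, already warped triangle by triangle
// with cv::getAffineTransform + cv::warpAffine) into `frame` over the
// convex hull of this face's landmarks.
cv::Mat blendSwappedFace(const cv::Mat& frame, const cv::Mat& warpedFace,
                         const std::vector<cv::Point2f>& landmarks) {
  // Convex hull of the landmarks -- essentially the face contour.
  std::vector<cv::Point2f> hull;
  cv::convexHull(landmarks, hull);

  // Binary mask covering the face region inside the hull.
  std::vector<cv::Point> hullInt(hull.begin(), hull.end());
  cv::Mat mask = cv::Mat::zeros(frame.size(), CV_8UC1);
  cv::fillConvexPoly(mask, hullInt, cv::Scalar(255));

  // Seamless cloning adapts the pasted face to the target's lighting,
  // which is what makes the swap look clean.
  cv::Rect box = cv::boundingRect(hullInt);
  cv::Mat blended;
  cv::seamlessClone(warpedFace, frame, mask, (box.tl() + box.br()) / 2,
                    blended, cv::NORMAL_CLONE);
  return blended;
}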
To run the described algorithm on the CPU, I've used dlib's HOG face detector for face detection and dlib's shape_predictor_68_face_landmarks model for landmark detection. The rest of the process is done using the OpenCV C++ library.
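As a reference for what those two stages look like in code, here is a small self-contained sketch of detection and landmark extraction with dlib (the file names are placeholders, and the real application of course processes video frames in a loop rather than a single image):
#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

int main() {
  // dlib's HOG-based face detector and the 68-point shape predictor.
  dlib::frontal_face_detector detector = dlib::get_frontal_face_detector();
  dlib::shape_predictor predictor;
  dlib::deserialize("shape_predictor_68_face_landmarks.dat") >> predictor;

  cv::Mat frame = cv::imread("frame.png");
  dlib::cv_image<dlib::bgr_pixel> img(frame);  // wrap the OpenCV image for dlib

  // Detect faces, then locate the 68 landmarks on each one.
  for (const dlib::rectangle& face : detector(img)) {
    dlib::full_object_detection shape = predictor(img, face);
    for (unsigned i = 0; i < shape.num_parts(); ++i)
      cv::circle(frame, cv::Point(shape.part(i).x(), shape.part(i).y()),
                 2, cv::Scalar(0, 255, 0), -1);
  }
  cv::imwrite("landmarks.png", frame);
  return 0;
}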
Step 2: Running the algorithm on CPU and profiling it
----------------------------------------Running on x86 (Optional)-------------------------------
This project targets the Xilinx Zynq MPSoC platform, but you can run the application on x86 CPUs too. To compile and run the application on an x86 processor, you need to first clone both dlib and my git repository. Also make sure that you have the OpenCV library installed on your machine. The SSE and AVX flags are optional and are used for better performance.
$ cd $workspace
$ git clone https://github.com/davisking/dlib.git
$ git clone https://github.com/Ali-Flt/vitis_projects.git
$ cd vitis_projects/face_swap/without_vitis_ai/src
$ g++ -std=c++17 -O3 -I$workspace/dlib/ face_swap_sw.cpp source.cpp `pkg-config --cflags --libs opencv` -lpthread -lX11 -msse -msse2 -msse4.2 -mavx -o face_swap_sw.o
# download the shape predictor model for face landmark detection
$ wget https://github.com/davisking/dlib-models/raw/master/shape_predictor_68_face_landmarks.dat.bz2
$ bzip2 -d shape_predictor_68_face_landmarks.dat.bz2
$ ./face_swap_sw.o ../../data/video_input_640x360.webm
You can stop the application by interrupting it with Ctrl+C; the output file will then be created in the same directory.
----------------------------------------Running on Zynq's APU------------------------------------
To run the application on the Zynq's processing system (APU), we use the Vitis software development tools. If you have not already created a Vitis platform ready for creating new applications, make sure to check out my previous project where I explain exactly how to do it. I'm going to use the same Vitis platform with the Vitis AI sysroot we built in that project.
So to start, open Vitis in the same workspace directory where you created your platform and application projects. If you followed the last post, it should be this directory (referred to below as $platform_dir):
$workspace/Vitis-Tutorials/Vitis_Platform_Creation/..../ref_files/step3_pfm/platform_repo/zcu104_custom_platform
- Now, if you have already created a system project (dpu_trd_system), right-click on it and select New -> Application Project. (If you haven't, just create a new application project using your custom Vitis software platform.)
- From the platform selection window, select the custom platform we created before and click next.
- In the next window, select your system project under the "Select a system project" panel and enter "face_swap" in the "Application project name" field.
- In the next window, don't change anything; the sysroot (Vitis AI sysroot), Image, and Rootfs paths should be loaded automatically if you chose your system project correctly in the previous window.
- Finally click finish to create the application project.
- Right-click on the src folder of your application project and select "Import sources".
- From the $workspace/vitis_projects/face_swap/without_vitis_ai/src folder, import all .cpp and .hpp files.
- Right-click on your application project and select "C/C++ build settings".
- Under Includes, add $workspace/dlib to the include paths. (dlib needs to be downloaded first, as described in the previous section.)
- Under Optimization, set Optimization Level to O3.
- Under Miscellaneous, add -std=c++17 to the other flags.
- Under Libraries, make sure all of the following libraries are added:
xilinxopencl opencv_photo opencv_xphoto opencv_highgui vitis_ai_library-dpu_task X11 vitis_ai_library-xnnpp vitis_ai_library-model_config vitis_ai_library-math vart-util xir vitis_ai_library-facedetect json-c glog opencv_core opencv_videoio opencv_imgproc opencv_imgcodecs pthread rt dl crypt stdc++
- Click apply and close.
- Right-click on the application project and click Build project.
--------------------------------------------Using TCF Profiler---------------------------------------
Now that the application is built, let's run it. But we don't just want to run it; we also want to profile it at runtime so that we can find the compute-intensive tasks to accelerate, right? We can do that with the TCF Profiler in Vitis, which profiles the code while it runs in debug mode. Using this tool is explained in this video by Xilinx.
Note: Here I assume that your target platform (ZCU104 or another board) is booted up as explained in the previous post and that an IP address is assigned to it on the network. You can use the "ifconfig" command to assign an IP address to your board if DHCP is not enabled on the network. I will be referring to the board's IP address as "board_ip".
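For example, from the board's serial console (the interface name and address here are placeholders; pick a free address on your subnet):
root@petalinux:~# ifconfig eth0 192.168.1.100 netmask 255.255.255.0 up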
- First copy an input video file and the dlib shape predictor model file to the board's home directory.
$ cd $workspace
# download the shape predictor model if you haven't already
$ wget https://github.com/davisking/dlib-models/raw/master/shape_predictor_68_face_landmarks.dat.bz2
$ bzip2 -d shape_predictor_68_face_landmarks.dat.bz2
$ scp vitis_projects/face_swap/data/video_input_640x360.webm shape_predictor_68_face_landmarks.dat root@board_ip:~
- In the Vitis IDE, click on the triangle next to the debug icon and select "Debug Configurations".
- Click on "Single Application Debug" to create a new configuration.
- Set the debug type to "Linux Application Debug".
- Create a new connection and use board_ip in the "Host" field. Leave the port at its default value (1534).
- Under the Application tab, use the compiled application file as the "Local File Path". It should be located here:
$platform_dir/face_swap/Hardware/face_swap
- Use "/home/root/face_swap" as the "Remote File Path" and "/home/root" as the "Working Directory".
- Click "Apply" and then "Debug".
If the board and network connection are set up correctly, the debug session should start after a few minutes.
- From the Vitis IDE top panel, select Window -> Show View... -> TCF Profiler.
- In the TCF Profiler window, click "Start", check both "Aggregate per function" and "Enable stack tracing", and click OK.
- Click the "Resume" icon in the top panel to run the application for a while. The longer you let the application run, the more accurate the profiling results will be.
- After a while, click "Suspend" to stop the application and analyze the profiling results.
As can be seen in the results, the most compute-intensive task was the first element of the pipeline, namely dlib's face detector. So let's try to accelerate the face detection on the FPGA.
Step 3: Using Vitis AI to accelerate the application
So in the previous section we found out that our application's bottleneck is the face detector. But wait... doesn't the Vitis AI Library have two ready-to-use face detection models with 700-1000 FPS performance? Why not use one of those instead of dlib's compute-intensive HOG face detector that runs on the CPU?
To do this, I've created a struct called "FaceSwap" in the face_swap.hpp file. This struct has a "run" method that receives an image (or a batch of images) and performs the face swap on it in place, meaning that the output of the face swap overwrites the input image. Please read the file contents to get a better understanding; the sketch below gives a rough idea of its shape.
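Only the struct name, the in-place "run" behavior, and the densebox_640_360 model (which we download later in this step) come from the project; the member names and the elided swap body here are illustrative:
#include <memory>
#include <vector>
#include <vitis/ai/facedetect.hpp>
#include <dlib/image_processing.h>
#include <dlib/opencv.h>
#include <opencv2/opencv.hpp>

struct FaceSwap {
  std::unique_ptr<vitis::ai::FaceDetect> detector =
      vitis::ai::FaceDetect::create("densebox_640_360");
  dlib::shape_predictor predictor;

  FaceSwap() {
    dlib::deserialize("shape_predictor_68_face_landmarks.dat") >> predictor;
  }

  // The swap is done in place: the result overwrites `image`.
  cv::Mat run(cv::Mat& image) {
    auto detected = detector->run(image);         // face detection on the DPU
    if (detected.rects.size() < 2) return image;  // need two faces to swap

    // densebox reports boxes in relative coordinates; scale them to pixels.
    std::vector<dlib::rectangle> faces;
    for (const auto& r : detected.rects)
      faces.emplace_back(static_cast<long>(r.x * image.cols),
                         static_cast<long>(r.y * image.rows),
                         static_cast<long>((r.x + r.width) * image.cols),
                         static_cast<long>((r.y + r.height) * image.rows));

    // The 68-point landmarks still come from dlib, running on the APU.
    dlib::cv_image<dlib::bgr_pixel> img(image);
    auto shape0 = predictor(img, faces[0]);
    auto shape1 = predictor(img, faces[1]);

    // ... triangulate, warp, and seamless-clone as in Step 1,
    //     writing the result back into `image` ...
    return image;
  }
};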
This way I can use the same functions used for the Vitis AI demos (main_for_jpeg_demo, main_for_video_demo, ...), which are defined in the demo.hpp file of the Vitis AI Library. This is especially useful for video applications because "main_for_video_demo" handles the multithreading, so you don't need to worry about it.
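With that in place, main.cpp shrinks to roughly the following. This is a sketch: the exact main_for_video_demo overload and the process-result callback signature depend on your Vitis AI Library version, so treat it as illustrative rather than the project's exact code.
#include <memory>
#include <vitis/ai/demo.hpp>
#include "face_swap.hpp"

int main(int argc, char* argv[]) {
  // demo.hpp spawns the decode/run/encode threads for us; we only hand it
  // a factory for our FaceSwap "model" and a trivial callback that passes
  // the already-swapped frame through to the video writer.
  return vitis::ai::main_for_video_demo(
      argc, argv, [] { return std::make_unique<FaceSwap>(); },
      [](cv::Mat& image, const cv::Mat& swapped, bool is_jpeg) {
        return swapped;
      });
}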
Also note that the "main_for_video_demo" function doesn't write the output video to a file; instead it shows the output on the display connected to the device. Because I wanted file output, I had to modify the demo.hpp file. Note that because the ZCU104 board's DPU doesn't support batch input mode, I have also set the "SAMPLES_ENABLE_BATCH" parameter to 0. You can also modify the GStreamer pipeline of the video writer to get your desired output format.
From:
DEF_ENV_PARAM(DEBUG_DEMO, "0")
DEF_ENV_PARAM(DEMO_USE_VIDEO_WRITER, "0")
DEF_ENV_PARAM_2(
DEMO_VIDEO_WRITER,
"appsrc ! videoconvert ! queue ! kmssink "
"driver-name=xlnx plane-id=39 fullscreen-overlay=false sync=false",
std::string)
DEF_ENV_PARAM(DEMO_VIDEO_WRITER_WIDTH, "640")
DEF_ENV_PARAM(DEMO_VIDEO_WRITER_HEIGHT, "480")
DEF_ENV_PARAM(SAMPLES_ENABLE_BATCH, "1");
DEF_ENV_PARAM(SAMPLES_BATCH_NUM, "0");
To:
DEF_ENV_PARAM(DEBUG_DEMO, "0")
DEF_ENV_PARAM(DEMO_USE_VIDEO_WRITER, "1")
DEF_ENV_PARAM_2(
DEMO_VIDEO_WRITER,
"appsrc ! video/x-raw ! queue ! mux. avimux name=mux ! filesink location=output.avi",
std::string)
DEF_ENV_PARAM(DEMO_VIDEO_WRITER_WIDTH, "640")
DEF_ENV_PARAM(DEMO_VIDEO_WRITER_HEIGHT, "360")
DEF_ENV_PARAM(SAMPLES_ENABLE_BATCH, "0");
DEF_ENV_PARAM(SAMPLES_BATCH_NUM, "0");
So you can simply copy the already-modified demo.hpp file from my git repository to the Vitis AI sysroot:
$ cd $workspace
$ cp vitis_projects/face_swap/with_vitis_ai/cortexa72-cortexa53-xilinx-linux/usr/include/vitis/ai/demo.hpp $sysroot/usr/include/vitis/ai/
- Now open Vitis IDE again and remove the "face_swap_sw.cpp" and "face_swap_sw.hpp" files from the src folder.
- From "$workspace/vitis_projects/face_swap/with_vitis_ai/" import all the cpp and hpp sources to the src folder.
- Right-click on the application project and click Build project
Note: If you haven't already installed the Vitis AI runtime on the target board, do it as described in Step 6 of my last post.
- Download the Vitis AI face detection model compiled for the ZCU104 and put it into the "/usr/share/vitis_ai_library/models" directory on the board:
# clone the Vitis AI repository if you haven't already
$ cd $workspace
$ git clone https://github.com/Xilinx/Vitis-AI.git
$ cd Vitis-AI/models/AI-Model-Zoo/model-list/cf_densebox_wider_360_640_1.11G_1.4
$ cat model.yaml
......
- name: densebox_640_360
type: xmodel
board: zcu102 & zcu104 & kv260
download link: https://www.xilinx.com/bin/public/openDownload?filename=densebox_640_360-zcu102_zcu104_kv260-r1.4.0.tar.gz
checksum: 101bce699b9dada0e97fdf0c95aa809f
......
$ wget https://www.xilinx.com/bin/public/openDownload?filename=densebox_640_360-zcu102_zcu104_kv260-r1.4.0.tar.gz -O densebox_640_360-zcu102_zcu104_kv260-r1.4.0.tar.gz
$ tar -xzvf densebox_640_360-zcu102_zcu104_kv260-r1.4.0.tar.gz
$ scp -r densebox_640_360 root@board_ip:/usr/share/vitis_ai_library/models/
- Copy the application binary to the board and run it. (If the input video file and dlib's shape predictor file are not in the board's home directory, copy them as described before.)
$ scp $platform_dir/face_swap/Hardware/face_swap root@board_ip:~
- Connect to the board via serial port or SSH and run the application with four threads.
$ ssh root@board_ip
root@petalinux:~# ./face_swap video_input_640x360.webm -t 4
- Let the application run for a while, then stop it with Ctrl+C.
- copy the "output.avi" file created in the same directory to the host machine and view it.
$ scp root@board_ip:~/output.avi $workspace
As you can see, we have improved performance by five times by running the face detection on the DPU.
As you noticed, though, the final result still has a low frame rate of about 5 FPS. This is mainly because, even with face detection accelerated, the face landmark detection and face swapping steps are too heavy for the APU of our Zynq MPSoC chip. But as far as the acceleration itself goes, we did a pretty good job with a 5x FPS boost.
Possible Improvements
In order to improve the results, you could find a 68-point face landmark model suitable for integration with Vitis AI and use it instead of dlib's shape predictor model. I've actually searched the internet for such models but didn't find a suitable one, so if you know of any 68-point face landmark model suitable for Vitis AI, please share it in the comments.
I've also reviewed the stages after landmark detection (Delaunay triangulation, warp transform, and seamless clone) for compute-intensive tasks, but didn't find anything worth accelerating on the FPGA; they are mostly memory-intensive tasks that wouldn't benefit from parallelism. But if you have any suggestions for improvement in that area, make sure to share them in the comments.
Conclusion
I hope this project helps you get a better idea of how to utilize heterogeneous devices such as Zynq MPSoCs to achieve better performance in your applications.
I would love to hear your thoughts and feedback on any of my projects in the comments section. Also, if you have any ideas for future content you would like to see, make sure to share them in the comments. :)
Acknowledgments
For creating the Vitis AI struct in Step 3, I used Mario Bergeron's Vitis AI examples as a reference. Thanks to him for his great work; you should definitely check out his projects.
Thanks also to Satya Mallick for his cool OpenCV projects, such as the face swap algorithm.