The goal of this project was to create a system that businesses and schools can use to monitor mask usage in public places, to identify where people are congregating and where stronger enforcement is required. However, the system must also preserve people's privacy: the data must be anonymized and aggregated when it is logged, without storing exact data about specific individuals. Finally, it must be able to connect to existing security systems (where installed) or run as a standalone system using generic cameras (for places that lack a sufficient security system).
Overview
This system centers on the Xilinx ZCU104 board, running a PetaLinux image with Vitis AI and the software needed to access the VCU. It streams video, either from connected USB cameras or as compressed video from networked IP cameras, and feeds it into a face-detection neural network that was quantized and compiled using Vitis AI. (Note that this network only detects the locations of faces; it does not attempt to identify a face or save any information about it, thereby preserving privacy.) The patches this network extracts (the faces themselves) are cropped out, resized, and fed into a second, very small neural network that classifies each face by whether a mask is being worn, plus an experimental feature that detects whether a mask is being worn incorrectly (i.e. not covering the nose). The results are annotated back onto the video stream as simple colored boxes, logged to a spreadsheet for later analysis, and streamed back out so they can be viewed through a standard security-camera remote interface.
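To make the per-frame flow concrete, here is a minimal Python sketch of the detect-crop-classify-annotate loop described above. The `detect_faces` and `classify_mask` functions are hypothetical stand-ins for the two DPU-accelerated networks, and the CSV layout is illustrative; the point is that only anonymous, aggregated counts ever get logged.

```python
import csv
import time

import cv2

# BGR colors for the annotation boxes: green = masked, red = unmasked,
# yellow = mask worn incorrectly (experimental).
COLORS = {"masked": (0, 255, 0), "unmasked": (0, 0, 255), "incorrect": (0, 255, 255)}

def process_frame(frame, detect_faces, classify_mask, log_writer):
    """Annotate one frame in place and log aggregate counts only."""
    counts = {label: 0 for label in COLORS}
    # detect_faces is assumed to return bounding boxes as (x, y, w, h).
    for (x, y, w, h) in detect_faces(frame):
        # Crop the face patch and resize to the classifier's 32x32x3 input.
        patch = cv2.resize(frame[y:y + h, x:x + w], (32, 32))
        label = classify_mask(patch)  # assumed to return a COLORS key
        counts[label] += 1
        # Draw a simple colored box; the face pixels themselves are
        # never written anywhere, preserving privacy.
        cv2.rectangle(frame, (x, y), (x + w, y + h), COLORS[label], 2)
    # Log one aggregate row per frame (anonymized by construction).
    log_writer.writerow([time.time(), counts["masked"],
                         counts["unmasked"], counts["incorrect"]])
    return frame

# Hypothetical usage: append aggregate rows to the analysis spreadsheet.
with open("mask_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    # frame = process_frame(frame, detect_faces, classify_mask, writer)
```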
Training process
The main inference pipeline has two parts. One is the DenseNet model that detects faces (the heavy part of the workload, which requires the DPU to be usable at all); the other is the tiny conv-net that classifies whether a face is masked (it takes a 32x32x3 image and outputs a classification). The DenseNet is based on a standard face-detection DenseNet model, and the training code for the face-mask classifier can be found in the training directory of the code repository.
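For reference, a network of this kind might look like the Keras sketch below. The exact architecture lives in the repository's training directory; this version is an assumption that just illustrates the 32x32x3 input and the three-class output (masked, unmasked, incorrectly worn).

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 3  # masked, unmasked, incorrectly worn (experimental)

def build_mask_classifier():
    # A deliberately tiny conv-net: two conv/pool stages, then a small
    # dense head. Small enough to run per-face on every frame.
    return tf.keras.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_mask_classifier()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# After training, the model is quantized and compiled for the DPU with
# the Vitis AI toolchain before deployment on the ZCU104.
```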
Inference on FPGA
The model runs on the FPGA by starting a GStreamer pipeline that pulls frames from either a local camera or an IP camera, then running the face-detect and mask-classify binaries (with the output of face-detect piped into mask-classify). This starts a remote-streaming server with the annotated stream, which can be connected back into an existing security system.
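The sketch below shows one way the streaming endpoints could be wired up with GStreamer through OpenCV. The device paths, addresses, and exact element chains are illustrative assumptions, not the project's actual pipelines; `omxh264dec`/`omxh264enc` are the VCU-backed OpenMAX elements available on ZCU104 PetaLinux images.

```python
import cv2

# Input: either a local USB camera...
USB_PIPELINE = ("v4l2src device=/dev/video0 ! videoconvert "
                "! video/x-raw,format=BGR ! appsink")
# ...or compressed H.264 from a networked IP camera, decoded on the VCU.
# (The RTSP address is a placeholder.)
IP_PIPELINE = ("rtspsrc location=rtsp://192.168.1.64/stream ! rtph264depay "
               "! h264parse ! omxh264dec ! videoconvert "
               "! video/x-raw,format=BGR ! appsink")
# Output: re-encode the annotated frames on the VCU and stream them out.
OUT_PIPELINE = ("appsrc ! videoconvert ! omxh264enc ! h264parse "
                "! rtph264pay ! udpsink host=192.168.1.10 port=5000")

cap = cv2.VideoCapture(USB_PIPELINE, cv2.CAP_GSTREAMER)
out = cv2.VideoWriter(OUT_PIPELINE, cv2.CAP_GSTREAMER, 0, 30.0, (1280, 720))

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # In the real system, face-detect and mask-classify run here and
    # draw the colored annotation boxes before re-encoding.
    out.write(cv2.resize(frame, (1280, 720)))

cap.release()
out.release()
```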