Hi, in this project I want to show you how to build a pizza slice angle calculator based on an FPGA-accelerated Yolov3 network. The pizza toppings are analyzed by a Yolov3 network, and from the Yolo output a pepperoni heatmap is calculated. This heatmap is used to calculate the perfect angle for slicing the pizza into two fair parts. The best cutting angle is then projected next to the pizza. An Ultra96-V2 board from Avnet performs all calculations, and a GoPro Hero camera captures images of the pepperoni pizza. The camera and projector are connected to the Ultra96-V2 board. I used a custom HDMI board, but you can use a USB webcam for image capturing instead. The incoming camera images are resized by a programmable logic (PL) accelerator, and the Yolov3 object detection network, running on the Xilinx Deep Learning Processor Unit (DPU), performs the pizza and pepperoni detection.
The project documentation is divided into three parts: "Getting Started", "From Data set to DPU elf" and the "Slicer program". The "Getting Started" section describes the steps needed to get a Yolo-based pizza slicer running on an Ultra96-V2 board. The section "From Data set to DPU elf" describes the whole development process from a custom data set to a DPU elf file. This part of the project was the most time-consuming, as the Xilinx documentation and examples only covered parts of the development process. At the moment, this part is under construction, but in combination with the project source code, it is a good starting point for your own custom Xilinx DPU projects. Just a few hours before this project was released on hackster, LogicTronix released this tutorial, which covers some parts of the Yolov3-to-DPU-elf compilation. Last but not least, you can find a detailed description of the Python code that implements the pizza slicer application in the section "Slicer program".
Getting Started

- Download the PYNQ v2.6 image for the Ultra96-V2
- Install PYNQ-DPU following this hackster tutorial
- To improve OpenCV speed, I rebuilt OpenCV with Tengine support. This improves the overall performance on ARM processor-based systems.
- Clone my project repo to build the FPGA hardware. Since I used a custom HDMI board, the pizza-base-without-hdmi.tcl script creates the Ultra96-V2 design with the PL image resizer and DPU but without the HDMI input/output.
- Run the compile-caffe.sh script from a Vitis AI docker container. The script uses the pre-trained Caffe model; I also put the test images for DPU quantization and compilation in the repo.
- Copy the FPGA *.bin file and the DPU *.elf file to the board. Put everything in the same directory as the pizza slicer Jupyter notebook.
- Use the pizza slicer Jupyter notebook from my repo to test the Yolov3 network and slice some pizza.
Sadly, the Xilinx Model Zoo doesn't provide a pre-trained network for pizza detection, so I had to build the network from scratch. The following diagram shows the steps needed to get a Yolo network running on a Xilinx DPU.
The workflow in detail:
- Images: First, I had to collect some images of pepperoni pizza. I started with 22 images from Google and later added some real images from the GoPro. All images are resized and cropped to 320x320 pixels.
- Image Annotation: A great tool for image annotation is Microsoft VoTT. For the pizza slicer application, bounding boxes for pizza and pepperonis are used. VoTT can export the dataset in different formats; I used the Pascal VOC format for export.
- Data augmentation: Training a network with only 22 images does not produce robust inference. To improve the training data set and robustness, the data augmentation workflow from Roboflow is used. This is an easy way to modify a small dataset and add some preprocessed images for training. With a free account, which is limited to 1000 images per dataset, it's possible to build a small dataset. For this use case, rotation and saturation are great methods to simulate the real-world scenario. The cool thing is that Roboflow updates the annotations during preprocessing. My final dataset contains around 800 pizza images; slowly I'm getting hungry... The preprocessed dataset needs to be exported as Pascal VOC (for Xilinx DPU compilation) and as a Yolo dataset (for training). The ability to generate multiple output formats is another great Roboflow feature.
- Network training: The Yolo network is trained with darknet. For training, a Google Colab GPU notebook is used; the training notebook is available in my GitHub repo. I added support for yolov2-tiny, yolov3-tiny and yolov4-tiny training in my Colab notebook. After training, you can compare the different networks and their outputs. The pre-trained network used for this project is available in my GitHub repo.
- Darknet to Caffe conversion: To use the Yolo network with a Xilinx DPU, the network must be converted to Caffe format. This step is tricky: there is nearly no documentation on how to convert the darknet output to a Caffe model that is compatible with the Xilinx Vitis AI compiler. Xilinx offers a xilinx-caffe version inside the Vitis-AI repo. I built a docker container that builds this xilinx-caffe version; this container is able to convert a darknet model to a Caffe model. To convert the darknet model (*.cfg and *.weights), use the convert.py script, which is located in Vitis-AI/AI-Model-Zoo/caffe-xilinx/scripts/. If everything works, you get a *.prototxt and a *.caffemodel file.
- Compile DPU elf: The final step, quantization and compilation of the DPU elf file, needs some preparation: a Vivado hardware project that includes a DPU IP, the Pascal VOC validation data set, the Caffe network and the Vitis-AI docker container. This is only valid for a Yolo network on the Ultra96-V2 board DPU; for other network types the requirements can differ. The following steps are performed inside the Vitis AI docker container. Before we are able to compile the model, we have to change the first "layer" of the *.prototxt file (see the project GitHub repo and the Xilinx tutorial). The quantization step needs valid network input data. To get the image loader of the first layer running, we need images in the MS COCO dataset format. Luckily, Roboflow can export the dataset in COCO format (see the step "Data augmentation"). For quantization, the test images from Roboflow can be used; Roboflow automatically splits the dataset into test, train and valid parts. The complete code for compilation, the test images and the DPU config are part of my repo. Besides the network layer adaption, the DPU compiler needs some information about the DPU that is implemented in the hardware design. This configuration is stored in a JSON file, which is divided into two parts: one for the CPU arch and DPU type, the other for the DPU config (cores, softmax, etc.). The JSON file can be generated with the DLet tool, which converts the *.hwh file into a JSON file usable for compilation; for more information, see this Xilinx forum post. My custom JSON file is part of my repo. If you have prepared everything, you can quantize and compile the DPU elf from your custom network. If everything succeeds, the dpu.elf file is generated. This file can be loaded with the PYNQ-DPU Python framework.
The last section explains the pizza slicer code. The program is written in Python 3 and uses a Jupyter notebook to control the program.
The first three cells cover all Python imports and FPGA / DPU file loading.
Generate slicing masks. The problem of finding the best angle is solved geometrically. I used 180 masks, each of which cuts the 320 x 320 image into two parts. The commented-out code enables matplotlib plots that draw the slicing masks.
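A minimal numpy sketch of such mask generation (the function name and implementation are my assumption, not the notebook's actual code): each mask keeps one half-plane of the image, defined by a line through the center at the given angle.

```python
import numpy as np

def make_slicing_masks(size=320, n_angles=180):
    """Generate binary masks that split a size x size image into two
    halves along lines through the center, one mask per degree."""
    yy, xx = np.mgrid[0:size, 0:size]
    cx = cy = (size - 1) / 2.0
    masks = np.empty((n_angles, size, size), dtype=np.uint8)
    for a in range(n_angles):
        theta = np.deg2rad(a)
        # Signed distance from the cutting line; one side is kept (1),
        # the other is zeroed out.
        side = (xx - cx) * np.sin(theta) - (yy - cy) * np.cos(theta)
        masks[a] = (side >= 0).astype(np.uint8)
    return masks
```

Because the mask rotates around the array center, 180 one-degree steps are enough to cover every possible cutting line.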
Loading anchors for the Yolov3 network. Yolov3 uses a fixed set of anchors to generate the bounding boxes; these anchors are used in the evaluation function.
The evaluation function converts the Yolov3 output into boxes, classes and scores. This is needed because all subsequent steps expect the output in bounding-box format.
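To illustrate what such an evaluation step does, here is a hedged sketch of the standard Yolov3 decoding for a single grid cell and anchor. The function name and exact tensor layout are my assumptions; the notebook's real evaluation function works over all cells and anchors at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_cell(raw, anchor, cell_x, cell_y, grid_size, img_size=320):
    """Decode one raw YOLO prediction (tx, ty, tw, th, obj, *cls) for a
    single anchor in a single grid cell into (box, score, class_id)."""
    tx, ty, tw, th, obj = raw[:5]
    cls_scores = sigmoid(np.asarray(raw[5:]))
    stride = img_size / grid_size
    # Box center: cell offset plus sigmoid-squashed prediction, scaled
    # back to image pixels.
    bx = (cell_x + sigmoid(tx)) * stride
    by = (cell_y + sigmoid(ty)) * stride
    # Box size: anchor dimensions scaled by the exponential prediction.
    bw = anchor[0] * np.exp(tw)
    bh = anchor[1] * np.exp(th)
    score = sigmoid(obj) * cls_scores.max()
    box = (bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2)
    return box, float(score), int(cls_scores.argmax())
```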
HDMI setup for camera input and projector output. The code is compatible with the PYNQ HDMI interface. A custom in/output can be set up at this point.
The input image from the camera is cropped in the first step to select the region of interest (ROI). My GoPro camera outputs a 720p image, so the ROI is set to 440 x 440 pixels; cropping is done in software. This is useful if your camera is not fixed to the projector and some adjustment is needed to fit the pizza into the middle of the ROI. The ROI image data is transferred to the FPGA image resizer via DMA. To enable data transfer from Python memory to PL memory, pynq.buffer is used. The output of the PL image resizer is used as DPU input. The source code for the PL resizer is forked from this PYNQ-Helloworld GitHub repo. The PL resizer can handle custom input/output image sizes, which are set via registers. Four register values must be set:

- 0x10: input image height
- 0x18: input image width
- 0x20: output image height
- 0x28: output image width
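A small sketch of this register setup, using the offsets stated above. The helper name is my invention, and the commented-out `resizer_ip.write()` call assumes PYNQ's usual register-write interface for memory-mapped IP.

```python
# Register offsets of the PL resizer, taken from the text above.
RESIZER_REGS = {"in_h": 0x10, "in_w": 0x18, "out_h": 0x20, "out_w": 0x28}

def resizer_config(in_h, in_w, out_h, out_w):
    """Return the (offset, value) pairs needed to configure the resizer,
    e.g. scaling the 440 x 440 ROI down to the 320 x 320 DPU input."""
    return [
        (RESIZER_REGS["in_h"], in_h),
        (RESIZER_REGS["in_w"], in_w),
        (RESIZER_REGS["out_h"], out_h),
        (RESIZER_REGS["out_w"], out_w),
    ]

# On the board, each pair would be written to the IP core, e.g. via
# PYNQ's register interface (hypothetical handle `resizer_ip`):
# for offset, value in resizer_config(440, 440, 320, 320):
#     resizer_ip.write(offset, value)
```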
DPU init and DPU task start. The evaluation function converts the raw Yolov3 network output to bounding-box format.
The output of the evaluation function is passed to a sorting step. Duplicate pizza detections are removed; only the detection with the highest pizza confidence is kept. Pepperoni detections are used if their confidence is higher than 0.5.
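This filtering step could look like the following sketch (function name, class indices and the exact output format are my assumptions):

```python
def sort_detections(boxes, classes, scores,
                    pizza_cls=0, pepperoni_cls=1, thresh=0.5):
    """Keep only the highest-confidence pizza detection and all
    pepperoni detections above the confidence threshold."""
    best_pizza = None
    pepperonis = []
    for box, cls, score in zip(boxes, classes, scores):
        if cls == pizza_cls:
            # Deduplicate pizza detections: keep the most confident one.
            if best_pizza is None or score > best_pizza[1]:
                best_pizza = (box, score)
        elif cls == pepperoni_cls and score > thresh:
            pepperonis.append(box)
    return best_pizza, pepperonis
```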
Based on the sorted detections, a virtual pizza is drawn into a 320 x 320 uint8 array. The bounding box coordinates are used to draw an ellipse inside each bounding box; the inside of the ellipse is filled with 1, the outside with 0. Every pepperoni approximation is added to the 320 x 320 base array. The base array itself is the pizza bounding box approximation, and pepperoni detections outside the pizza base array are ignored. The result is a pepperoni heatmap.
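A pure-numpy sketch of this elliptic approximation (the notebook may use OpenCV drawing functions instead; function name and signature are my assumptions):

```python
import numpy as np

def add_pepperoni(heatmap, box):
    """Approximate a pepperoni bounding box by its inscribed ellipse and
    add it to the heatmap; overlapping slices accumulate higher values."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    rx = max((x1 - x0) / 2.0, 1.0)
    ry = max((y1 - y0) / 2.0, 1.0)
    yy, xx = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    # Standard ellipse equation: points with value <= 1 lie inside.
    inside = ((xx - cx) / rx) ** 2 + ((yy - cy) / ry) ** 2 <= 1.0
    heatmap[inside] += 1
    return heatmap
```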
Pepperoni heatmap: overlapping pepperoni slices are also simulated (brighter areas). That is one reason why an object detection network is used instead of a segmentation network: with the object detection network and the elliptic area approximation, overlapping pepperonis can be simulated.
3D pepperoni heatmap: the overlapping pepperoni areas appear as spikes. This plot shows very clearly how the heatmap data is combined after the elliptic area approximation and layer adding.
After calculating the pepperoni heatmap, the best cutting angle must be determined. This problem is solved by applying a mask to the pepperoni heatmap. The mask sets half of the heatmap values to zero. After applying the mask, all elements are summed up, and this value is compared to half of the sum of all elements in the original pepperoni heatmap. The binary search algorithm tries to minimize the difference between the masked sum and half the heatmap sum. I used the fact that we only have to search from 0 to 180 degrees, because the mask rotates around the center of the array. This kind of search is very effective and reduces the number of calculations immensely.
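A simplified exhaustive version of this search as a sketch (the notebook reportedly narrows the search with a binary-search-style strategy; this version just checks every mask, and the function name is my assumption):

```python
import numpy as np

def best_cut_angle(heatmap, masks):
    """Return the index of the mask whose kept half captures the amount
    of pepperoni closest to half the total, i.e. the fairest cut."""
    target = heatmap.sum() / 2.0
    # For each candidate angle, sum the pepperoni weight on one side of
    # the cut and measure how far it is from a perfect 50/50 split.
    diffs = [abs(float((heatmap * mask).sum()) - target) for mask in masks]
    return int(np.argmin(diffs))
```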
Visualization for the best cutting angle in 3D
In the last step, the output image for "Cut Here" and the cutting arrows are calculated.