Agriculture needs technological solutions to address the labour shortage and to answer society's demand for more sustainable practices. Precision agriculture and robotics are one path to meet these needs. Martins, R. et al. (DOI:10.1007/978-3-030-30241-2_14) present an example of a robotic monitoring solution for precision agriculture, in which a spectrometer mounted on a mobile manipulator is used, together with computer vision, to monitor the grape/leaf maturation state. This monitoring task, like many other agricultural tasks such as harvesting or spraying, needs computer vision algorithms for fruit detection that are both accurate and fast enough to run in real-time.
To solve the problem above, the authors propose the use of the RetinaNet ResNet 50 object detector trained on a dataset of trunks and bunches of grapes in different growing stages. With this strategy, we can approach problems like robot localisation in the vineyard based on landmarks (vine trunks and grapes) and fruit localisation for tasks such as yield assessment, spraying, monitoring or harvesting.
This solution uses a custom-acquired, manually labelled and preprocessed dataset to train an object detector. The RetinaNet ResNet 50 model, implemented with the TensorFlow 2 framework, was selected for this project. The trained model is quantised and compiled using Vitis AI and executed on the Xilinx Kria KV260 starter kit and the ZCU104 Development Board.
Requirements
For a better understanding of the solution for the stated problem, the use case diagram in Figure 2 was designed. A single camera captures frames from the environment and sends them to an FPGA board. The board preprocesses each frame and sends it to the DPU for inference, i.e. to detect and localise the objects (grapes and trunks) in the scene. The DPU then returns the inference results, and the CPU of the FPGA decodes them and merges them with the original input image.
A better understanding of the previous diagram can be found in the image of Figure 3.
For this work, we used a manually labelled dataset of vine trunks in different seasons of the year and green grapes in different growing stages. The augmented dataset comprises 428 498 images. The dataset is publicly available at https://doi.org/10.5281/zenodo.5114141
The stated dataset has the following classes:
- trunk
- medium_grape_bunch
- tiny_grape_bunch
The images were acquired from multiple sources, such as stereo cameras (ZED camera, Intel RealSense), monocular cameras (Raspberry Pi Camera HQ), and thermal cameras (FLIR). In this way, we get a robust dataset built from different sources and containing images with different characteristics:
- Raspberry Pi Camera HQ (https://www.raspberrypi.com/products/raspberry-pi-high-quality-camera/)
- ZED Camera (https://www.stereolabs.com/zed/)
- FLIR Camera (https://www.flir.com/products/m232/)
In all cases, the cameras were mounted on the mobile robot AgRob v16 (http://agrob.inesctec.pt ; DOI:10.3390/agronomy11091890) in different poses. The Raspberry Pi Camera HQ was used to acquire high-quality images at a low framerate, i.e. about 3 FPS. In this way, we got many pictures of the scene while avoiding very similar pictures. The FLIR camera was used with a similar strategy to acquire thermal images, and the same methodology was used for the stereo camera, from which only images of the left lens were acquired.
All the frames were merged together to create an extensive dataset. The different sets of the dataset (train, validation and test) were built by random sampling of images, as sketched below.
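A minimal sketch of how such a random split could be produced; the split proportions, paths and file naming here are assumptions for illustration, not the authors' exact script.

```python
import random
from pathlib import Path

random.seed(42)  # assumed: a fixed seed for a reproducible split
images = sorted(p.stem for p in Path("Aveleda/JPEGImages").glob("*.jpg"))
random.shuffle(images)

# assumed 80/10/10 proportions
n_train = int(0.8 * len(images))
n_val = int(0.1 * len(images))
splits = {
    "train.txt": images[:n_train],
    "validation.txt": images[n_train:n_train + n_val],
    "test.txt": images[n_train + n_val:],
}

main_dir = Path("Aveleda/ImageSets/Main")
main_dir.mkdir(parents=True, exist_ok=True)
for name, ids in splits.items():
    (main_dir / name).write_text("\n".join(ids) + "\n")
```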
Different augmentation procedures were applied to this dataset, namely rotation, scaling, flipping, translation, and multiplication. These augmentations were applied to all the images in the dataset (see the sketch after this list):
- Rotation (Random rotation between -30º and +30º)
- Scaling (Random scaling between 50% and 150% of the original image)
- Flipping
- Translation (Random translation between +30% and -30% on the x and y-axis)
- Multiply (multiply the pixel values by a random factor to make the images darker or lighter)
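A sketch of this augmentation pipeline using the imgaug library; the choice of imgaug and the multiplication range are assumptions, while the rotation, scaling and translation ranges follow the list above.

```python
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

augmenter = iaa.Sequential([
    iaa.Affine(rotate=(-30, 30)),                                        # random rotation
    iaa.Affine(scale=(0.5, 1.5)),                                        # random scaling
    iaa.Fliplr(0.5),                                                      # horizontal flip
    iaa.Affine(translate_percent={"x": (-0.3, 0.3), "y": (-0.3, 0.3)}),   # translation
    iaa.Multiply((0.7, 1.3)),                                             # assumed range: darker/lighter images
])

def augment(image, boxes):
    """Apply the pipeline to an image and its bounding boxes (x1, y1, x2, y2)."""
    bbs = BoundingBoxesOnImage(
        [BoundingBox(x1=x1, y1=y1, x2=x2, y2=y2) for (x1, y1, x2, y2) in boxes],
        shape=image.shape,
    )
    image_aug, bbs_aug = augmenter(image=image, bounding_boxes=bbs)
    bbs_aug = bbs_aug.remove_out_of_image().clip_out_of_image()
    return image_aug, [(b.x1, b.y1, b.x2, b.y2) for b in bbs_aug]
```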
For training and evaluation purposes, the dataset was divided into three sets: training set, validation set and test set. The model was trained on the training set and evaluated on the validation set during training. The final assessment of the network was made on the test set.
The dataset is organised in a structure similar to the Pascal VOC dataset. The files train.txt, test.txt, and validation.txt list the images and annotations in each set.
Aveleda
├── Annotations
├── ImageSets
│   └── Main
│       ├── test.txt
│       ├── train.txt
│       └── validation.txt
└── JPEGImages
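For reference, a short sketch of reading one split file and its Pascal VOC style annotations; the helper names and the assumption that split files list image IDs are illustrative, not taken from the authors' code.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DATASET_ROOT = Path("Aveleda")

def load_split(split_name):
    """Return the image IDs listed in ImageSets/Main/<split_name>.txt."""
    split_file = DATASET_ROOT / "ImageSets" / "Main" / f"{split_name}.txt"
    return [line.strip() for line in split_file.read_text().splitlines() if line.strip()]

def load_annotation(image_id):
    """Parse one VOC XML annotation into (class_name, xmin, ymin, xmax, ymax) tuples."""
    tree = ET.parse(DATASET_ROOT / "Annotations" / f"{image_id}.xml")
    objects = []
    for obj in tree.findall("object"):
        name = obj.find("name").text  # trunk, medium_grape_bunch or tiny_grape_bunch
        box = obj.find("bndbox")
        objects.append((
            name,
            float(box.find("xmin").text), float(box.find("ymin").text),
            float(box.find("xmax").text), float(box.find("ymax").text),
        ))
    return objects

train_ids = load_split("train")
print(len(train_ids), load_annotation(train_ids[0]))
```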
Model definition
To detect the fruits and the trunks in open-field scenes, we used the RetinaNet ResNet 50 object detector (DOI:10.1109/TPAMI.2018.2858826), inspired by the Keras implementation by Srihari Humbarwadi (https://keras.io/examples/vision/retinanet/). Because TensorFlow 2 for Vitis AI is only compatible with the functional API for deep learning models, we translated the original subclassed model to the functional API.
The compilation of the deep learning model to the xmodel file format has some restrictions, the most important being that each ReLU operation needs to be fused with a preceding operation. Because we also need the raw output of the Conv2D at P6, this fusion is not directly possible. We can overcome this problem by duplicating the Conv2D at P6, so that one copy is exported as the P6 output and the other is fused with the ReLU.
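A minimal sketch of this workaround in the Keras functional API, following the P6/P7 head of the Keras RetinaNet example; layer names and filter counts are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_p6_p7(c5):
    """c5: last backbone feature map. Returns the P6 and P7 pyramid levels."""
    # P6 is both a pyramid output and the input to P7. The DPU compiler fuses
    # each ReLU with the preceding operation, so the P6 convolution is duplicated:
    # one copy is exported as the P6 feature map, the other feeds the ReLU for P7.
    p6_out = layers.Conv2D(256, 3, strides=2, padding="same", name="p6_out")(c5)
    p6_for_p7 = layers.Conv2D(256, 3, strides=2, padding="same", name="p6_for_p7")(c5)
    p7_out = layers.Conv2D(256, 3, strides=2, padding="same", name="p7_out")(
        layers.ReLU(name="p6_relu")(p6_for_p7)
    )
    return p6_out, p7_out
```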
The full definition of the model can be found in retinanet.py.
After performing the required changes in the file, it is enough to execute the Python script that trains the model. The trainer loads the weights previously trained on the ImageNet dataset into the backbone, while the classifier and regressor weights are randomly initialised from a normal distribution.
$ python3 train.py
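A short sketch of this initialisation scheme, following the Keras RetinaNet example this project is based on; the prior-probability bias and the anchor/class counts are assumptions carried over from that example.

```python
import numpy as np
import tensorflow as tf

# Backbone with ImageNet-pretrained weights (downloaded automatically by Keras).
backbone = tf.keras.applications.ResNet50(
    include_top=False, input_shape=[None, None, 3], weights="imagenet"
)

# Classification/regression head kernels start from a normal distribution.
kernel_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
# Bias prior from the RetinaNet paper / Keras example (assumed here).
prior_probability = tf.constant_initializer(-np.log((1 - 0.01) / 0.01))

classification_head = tf.keras.layers.Conv2D(
    9 * 3, 3, padding="same",            # 9 anchors x 3 classes (assumed)
    kernel_initializer=kernel_init,
    bias_initializer=prior_probability,
)
```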
We used the focal loss to train the model, as coded by Srihari Humbarwadi. The loss function is minimised by stochastic gradient descent (SGD). Because the loss function has two hyperparameters (alpha and gamma) and SGD has two hyperparameters (learning rate and momentum), we used Keras Tuner, with the Hyperband algorithm, to compute the values that best minimise the loss function.
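A sketch of how such a search could be wired with Keras Tuner's Hyperband; the search ranges, the build_retinanet helper and a RetinaNetLoss wrapper exposing alpha and gamma are assumptions, not the authors' exact code.

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    alpha = hp.Float("alpha", 0.1, 0.9, step=0.05)          # focal loss class balance
    gamma = hp.Float("gamma", 1.0, 4.0, step=0.5)            # focal loss focusing factor
    lr = hp.Float("learning_rate", 1e-4, 1e-1, sampling="log")
    momentum = hp.Float("momentum", 0.5, 0.99)

    model = build_retinanet(num_classes=3)                   # assumed helper from retinanet.py
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=momentum),
        loss=RetinaNetLoss(num_classes=3, alpha=alpha, gamma=gamma),  # assumed focal-loss wrapper
    )
    return model

tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=30,
                     directory="tuner_logs", project_name="retinanet_aveleda")
tuner.search(train_dataset, validation_data=validation_dataset, epochs=30)
best_hp = tuner.get_best_hyperparameters(1)[0]
```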
The folder structure for training, quantising and compiling the model is organised as follows:
training
├── aveleda
│   ├── aveleda
│   │   └── Aveleda
│   │       ├── Annotations
│   │       ├── ImageSets
│   │       │   └── Main
│   │       │       ├── test.txt
│   │       │       ├── train.txt
│   │       │       └── validation.txt
│   │       └── JPEGImages
│   ├── aveleda.py
│   ├── label_map.pbtxt
│   └── labels.list
├── dump_model.py
├── evaluate_voc.py
├── exporter.py
├── exporter_qat.py
├── post_quantizer_ft.py
├── post_quantizer.py
├── retinanet.py
├── train.py
└── train_quantised.py
Quantising the model
The result of the previous step is a floating-point model. However, the Vitis AI compiler requires an int8 model, i.e. a quantised model with 8-bit integers. To obtain the quantised model, Vitis AI offers three strategies:
- Post-training quantisation
- Post-training quantisation with finetuning
- Quantisation aware training
Post-training quantisation produced very good results, so it was the strategy used in this project. The quantisation can be run with the following command:
$ python3 post_quantizer.py
This script loads the previously trained model from the checkpoints folder and quantises it, given a calibration set. For the current case, we used the training set as the calibration set. Other checkpoints can be used by changing the file checkpoint inside the checkpoints folder.
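A minimal sketch of what post_quantizer.py likely does, assuming the standard Vitis AI TensorFlow 2 post-training quantisation flow; the checkpoint filename and the calib_dataset variable are assumptions.

```python
import tensorflow as tf
from tensorflow_model_optimization.quantization.keras import vitis_quantize

# Load the floating-point model trained in the previous step (assumed filename).
float_model = tf.keras.models.load_model("checkpoints/retinanet_float.h5")

# calib_dataset: a representative batch of preprocessed training images used
# only to calibrate the quantisation ranges (here the training set is used).
quantizer = vitis_quantize.VitisQuantizer(float_model)
quantized_model = quantizer.quantize_model(calib_dataset=calib_dataset)

# This quantized.h5 file is the input expected by vai_c_tensorflow2 below.
quantized_model.save("quantized.h5")
```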
Compiling the model for the DPU on the AMD-Xilinx FPGA requires a specific compiler, as mentioned previously. AMD-Xilinx developed vai_c_tensorflow2, which converts Keras H5 models to xmodel files.
$ vai_c_tensorflow2 \
-m quantized.h5 \
-a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json \
-o compiled \
-n retinanet_resnet50_aveleda_384_384_3
Bear in mind that the compiler is not compatible with all kinds of operations. In particular, it cannot convert the operations related to preprocessing and post-processing. Therefore, the output of the quantisation process should be the raw model, composed of:
- Backbone network (ResNet 50)
- Feature Pyramid Network (FPN) layers
- Bounding Boxes regressor
- Classes classifier
After obtaining the compiled model for the DPU, move the xmodel file to the FPGA memory. On the FPGA, the authors use the folder structure indicated below. The folder aveleda_test contains the images that will be used for inference. The postprocessing.py file has the routines for the post-processing stage of the inference, i.e. decoding the bounding boxes and filtering them (see the sketch after the folder tree).
.
├── aveleda_test
│   └── <images>.jpg
├── postprocessing.py
├── retinanet_benchmark.py
└── retinanet_inference.py
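A sketch of the kind of routines postprocessing.py provides: decoding the raw box regressions against the anchors and filtering overlapping detections with non-maximum suppression, in plain NumPy. The box-variance values follow the Keras RetinaNet example and are an assumption here.

```python
import numpy as np

BOX_VARIANCE = np.array([0.1, 0.1, 0.2, 0.2], dtype=np.float32)

def decode_boxes(anchors, deltas):
    """anchors in (cx, cy, w, h), deltas as encoded regressions -> (x1, y1, x2, y2)."""
    deltas = deltas * BOX_VARIANCE
    cxcy = deltas[:, :2] * anchors[:, 2:] + anchors[:, :2]
    wh = np.exp(deltas[:, 2:]) * anchors[:, 2:]
    return np.concatenate([cxcy - wh / 2.0, cxcy + wh / 2.0], axis=1)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```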
The current state of this implementation allows us to run inference and output the detection results (retinanet_inference.py) and to benchmark the network performance using multithreading (retinanet_benchmark.py).
$ python3 retinanet_benchmark.py
This script, retinanet_benchmark.py, assesses the performance of the network. Change lines 131 and 132 to set the correct test images folder and the desired number of threads, and change line 137 to set the correct path to the xmodel.
$ python3 retinanet_inference.py
This script, retinanet_inference.py, detects the objects (grapes and trunks) in the images of a folder and outputs the results to a JSON file. Change the same lines to set the correct configuration. For this script, the number of threads cannot be set.
The results below assess the model on the test set of the dataset. They appear reasonable and representative of real-world scenarios. The results were obtained using the COCO evaluation metrics (https://cocodataset.org/#detection-eval), a confusion matrix, and other measures.
The network has good overall performance, although some additional work should be done to improve its precision. The network can successfully detect most of the fruits and trunks; in many situations, rather than missing a fruit, it produces multiple detections of the same fruit.
It is worth noting that, during this network assessment, some labelling errors were also found in the dataset, which may bias the results. Among these labelling mistakes are missing annotations and wrong annotations.
The implemented neural network was assessed on the AMD-Xilinx ZCU104 Development Board and the AMD-Xilinx Kria KV260 starter kit. The ZCU104 has 4 CPU (Central Processing Unit) cores and 2 DPU (Deep Learning Processing Unit) cores. While using a single thread, the program uses a single DPU core, but with two or more threads it always uses both DPU cores. Efficiency improves with more threads because not the whole program can be executed on the DPU: the preprocessing and post-processing stages run on the CPU. With four threads, the deep learning model can process images in real-time.
Unlike the ZCU104, the AMD-Xilinx KV260 starter kit only has space to instantiate a single DPU core. Therefore, the level of parallelisation is lower, leading to an increase in the inference time. The inference time could only be decreased on a bigger FPGA device with enough resources to instantiate more DPU cores.
Figure 1 shows some sample images with the results of the inference process.
A video showing the system performance can be viewed at DOI:10.5281/zenodo.6402573
Conclusion
During this work, the authors assessed the capacity of cost-effective FPGAs to process images and detect grapes in real-world scenarios. Both the ZCU104 and the KV260 were evaluated running RetinaNet ResNet50 v1 and performed reasonably. The neural network detected the objects in the images with a Recall of about 90% and a Precision of about 50%, corresponding to an F1 score of about 70%. The FPGA boards behave similarly, but because the ZCU104 has one more DPU core than the KV260, it reached faster inference speeds. The detection speed ranges between 8.38 and 25 FPS.
This project aimed to implement and accelerate an object detector capable of detecting trunks and highly occluded fruits under different weather conditions. A large, augmented dataset with many objects per class was used. Besides, the objects were captured using different cameras and during different periods of the year. The fruits were captured in a natural, wine-producing vineyard to obtain the most realistic samples of bunches of grapes. These grapes are green and very similar to the background. The chosen object detector needs to be compatible with the DPU, so we isolated the incompatible layers (preprocessing and post-processing) and reworked the remaining layers to make them fully compatible. Both preprocessing and post-processing were implemented in Python: the preprocessing uses OpenCV to convert and prepare the input image, and the post-processing uses the NumPy library to optimise the code. Further optimisations of this block could be made, for instance by implementing it in the programmable logic of the FPGA.
Further work concerns integrating the inference system into a robotic system using ROS2 (https://www.ros.org/) and optimising the deep learning model by pruning. The Robot Operating System (ROS) is the most used middleware for robotics development in research and is becoming a standard in industry. Therefore, most robots under development use ROS or ROS2 (the new generation of ROS). Implementing these routines in a ROS2-compatible way is a major step towards robotics integration. Besides, AgRob v16 is a robot under development to operate in vineyards, and it is implemented in ROS. Pruning is a strategy to identify and remove unimportant nodes that slow the network down, without severely reducing its performance.
Other studies should also be performed. Different network backbones, such as ResNet101 or VGG, may improve accuracy, while others, such as MobileNets, may make the network faster. The use of Binary Neural Networks (BNNs) is becoming a trend in the literature for fast target assessment. Due to the characteristics of FPGAs, BNNs can be especially relevant because bitwise operations are straightforward to implement. Finally, neural networks usually have many hyperparameters that can and should be optimised, and other optimisation algorithms, such as genetic algorithms, could be valuable for this.