Agriculture needs technological solutions to address the labour shortage and to answer society's demand for more sustainable practices. Precision agriculture and robotics are one path to meet these needs. Martins, R. et al. (DOI:10.1007/978-3-030-30241-2_14) present an example of a robotic monitoring solution for precision agriculture, in which a spectrometer mounted on a mobile manipulator is used, together with computer vision, to monitor the grape/leaf maturation state. This monitoring task, like many other agricultural tasks such as harvesting or spraying, needs computer vision algorithms for fruit detection that are both accurate and fast enough to run in real-time.
To solve the problem above, the authors propose the use of the RetinaNet ResNet 50 object detector trained on a dataset of trunks and bunches of grapes in different growing stages. With this strategy, we can approach problems like robot localisation in the vineyard based on landmarks (vine trunks and grapes) and fruit localisation for tasks such as yield assessment, spraying, monitoring or harvesting.
This solution uses a custom-acquired, manually labelled and preprocessed dataset to train an object detector. The RetinaNet ResNet 50 model, implemented with the TensorFlow 2 framework, was selected for this project. The trained model is quantised and compiled using Vitis AI and executed on the Xilinx Kria KV260 starter kit and the ZCU104 Development Board.
Requirements
For a better understanding of the solution for the stated problem, the use case diagram in Figure 2 was designed. A single camera captures frames from the environment and sends them to an FPGA board. The board preprocesses each frame and sends it to the DPU for inference, i.e. to detect and localise the objects (grapes and trunks) in the scene. The DPU then returns the inference results, and the CPU of the FPGA decodes them and merges them with the original input image.
A better understanding of the previous diagram can be found in the image of Figure 3.
For this work, we used a manually labelled dataset of vine trunks in different seasons of the year and green grapes in different growing stages. The augmented dataset comprises 428 498 images. The dataset is publicly available at https://doi.org/10.5281/zenodo.5114141
The stated dataset has the following classes:
- trunk
- medium_grape_bunch
- tiny_grape_bunch
The images were acquired from multiple sources, such as stereo cameras (ZED camera, Intel RealSense), monocular cameras (Raspberry Pi Camera HQ), and thermal cameras (FLIR). In this way, we get a robust dataset built from different sources and containing images with different characteristics:
- Raspberry Pi Camera HQ (https://www.raspberrypi.com/products/raspberry-pi-high-quality-camera/)
- ZED Camera (https://www.stereolabs.com/zed/)
- FLIR Camera (https://www.flir.com/products/m232/)
In all cases, the cameras were mounted on the mobile robot AgRob v16 (http://agrob.inesctec.pt ; DOI:10.3390/agronomy11091890) in different poses. The Raspberry Pi Camera HQ was used to acquire high-quality images at a low framerate, i.e. about 3 FPS. In this way, we got many pictures of the scene while avoiding very similar pictures. The FLIR camera was used with a similar strategy to acquire thermal images, and the same methodology was used for the stereo camera, from which only images of the left lens were acquired.
All the frames were merged together to create an extensive dataset. The different sets of the dataset (train, validation and test) were built by random sampling of images, as sketched below.
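A minimal sketch of how such a random split could be produced; the split proportions, paths and file naming here are assumptions for illustration, not the authors' exact script.

```python
import random
from pathlib import Path

random.seed(42)  # assumed: a fixed seed for a reproducible split
images = sorted(p.stem for p in Path("Aveleda/JPEGImages").glob("*.jpg"))
random.shuffle(images)

# assumed 80/10/10 proportions
n_train = int(0.8 * len(images))
n_val = int(0.1 * len(images))
splits = {
    "train.txt": images[:n_train],
    "validation.txt": images[n_train:n_train + n_val],
    "test.txt": images[n_train + n_val:],
}

main_dir = Path("Aveleda/ImageSets/Main")
main_dir.mkdir(parents=True, exist_ok=True)
for name, ids in splits.items():
    (main_dir / name).write_text("\n".join(ids) + "\n")
```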
Different augmentation procedures were applied to this dataset, namely rotation, scaling, flipping, translation, and multiplication. These augmentations were applied to all the images in the dataset (see the sketch after this list):
- Rotation (Random rotation between -30º and +30º)
- Scaling (Random scaling between 50% and 150% of the original image)
- Flipping
- Translation (Random translation between +30% and -30% on the x and y-axis)
- Multiply (multiply the pixel values by a random factor to make the images darker or lighter)
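A sketch of this augmentation pipeline using the imgaug library; the choice of imgaug and the multiplication range are assumptions, while the rotation, scaling and translation ranges follow the list above.

```python
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

augmenter = iaa.Sequential([
    iaa.Affine(rotate=(-30, 30)),                                        # random rotation
    iaa.Affine(scale=(0.5, 1.5)),                                        # random scaling
    iaa.Fliplr(0.5),                                                      # horizontal flip
    iaa.Affine(translate_percent={"x": (-0.3, 0.3), "y": (-0.3, 0.3)}),   # translation
    iaa.Multiply((0.7, 1.3)),                                             # assumed range: darker/lighter images
])

def augment(image, boxes):
    """Apply the pipeline to an image and its bounding boxes (x1, y1, x2, y2)."""
    bbs = BoundingBoxesOnImage(
        [BoundingBox(x1=x1, y1=y1, x2=x2, y2=y2) for (x1, y1, x2, y2) in boxes],
        shape=image.shape,
    )
    image_aug, bbs_aug = augmenter(image=image, bounding_boxes=bbs)
    bbs_aug = bbs_aug.remove_out_of_image().clip_out_of_image()
    return image_aug, [(b.x1, b.y1, b.x2, b.y2) for b in bbs_aug]
```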
For training and evaluation purposes, the dataset was divided into three sets: training set, validation set and test set. The model was trained on the training set and evaluated on the validation set during training. The final assessment of the network was made on the test set.
The dataset is organised in a structure similar to the Pascal VOC dataset. The files train.txt, test.txt, and validation.txt list the images and annotations in each set.
Aveleda
├── Annotations
├── ImageSets
│   └── Main
│       ├── test.txt
│       ├── train.txt
│       └── validation.txt
└── JPEGImages
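For reference, a short sketch of reading one split file and its Pascal VOC style annotations; the helper names and the assumption that split files list image IDs are illustrative, not taken from the authors' code.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DATASET_ROOT = Path("Aveleda")

def load_split(split_name):
    """Return the image IDs listed in ImageSets/Main/<split_name>.txt."""
    split_file = DATASET_ROOT / "ImageSets" / "Main" / f"{split_name}.txt"
    return [line.strip() for line in split_file.read_text().splitlines() if line.strip()]

def load_annotation(image_id):
    """Parse one VOC XML annotation into (class_name, xmin, ymin, xmax, ymax) tuples."""
    tree = ET.parse(DATASET_ROOT / "Annotations" / f"{image_id}.xml")
    objects = []
    for obj in tree.findall("object"):
        name = obj.find("name").text  # trunk, medium_grape_bunch or tiny_grape_bunch
        box = obj.find("bndbox")
        objects.append((
            name,
            float(box.find("xmin").text), float(box.find("ymin").text),
            float(box.find("xmax").text), float(box.find("ymax").text),
        ))
    return objects

train_ids = load_split("train")
print(len(train_ids), load_annotation(train_ids[0]))
```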
Model definition
To detect the fruits and the trunks in open-field scenes, we used the RetinaNet ResNet 50 object detector (DOI:10.1109/TPAMI.2018.2858826), inspired by the Keras implementation by Srihari Humbarwadi (https://keras.io/examples/vision/retinanet/). Because TensorFlow 2 for Vitis AI is only compatible with the functional API for deep learning models, we translated the original subclassed model to the functional API.
The compilation of the deep learning model to the xmodel file format has some restrictions, the most important being that each ReLU operation needs to be fused with a preceding operation. Because we also need the raw output of the Conv2D at P6, this fusion is not directly possible. We can overcome this problem by duplicating the Conv2D at P6, so that one copy is exported as the P6 output and the other is fused with the ReLU.
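A minimal sketch of this workaround in the Keras functional API, following the P6/P7 head of the Keras RetinaNet example; layer names and filter counts are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_p6_p7(c5):
    """c5: last backbone feature map. Returns the P6 and P7 pyramid levels."""
    # P6 is both a pyramid output and the input to P7. The DPU compiler fuses
    # each ReLU with the preceding operation, so the P6 convolution is duplicated:
    # one copy is exported as the P6 feature map, the other feeds the ReLU for P7.
    p6_out = layers.Conv2D(256, 3, strides=2, padding="same", name="p6_out")(c5)
    p6_for_p7 = layers.Conv2D(256, 3, strides=2, padding="same", name="p6_for_p7")(c5)
    p7_out = layers.Conv2D(256, 3, strides=2, padding="same", name="p7_out")(
        layers.ReLU(name="p6_relu")(p6_for_p7)
    )
    return p6_out, p7_out
```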
The full definition of the model can be found in retinanet.py.
After performing the required changes in the file, it is enough to execute the Python script that trains the model. The trainer loads the weights previously trained on the ImageNet dataset into the backbone, while the classifier and regressor weights are randomly initialised from a normal distribution.
$ python3 train.py
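A short sketch of this initialisation scheme, following the Keras RetinaNet example this project is based on; the prior-probability bias and the anchor/class counts are assumptions carried over from that example.

```python
import numpy as np
import tensorflow as tf

# Backbone with ImageNet-pretrained weights (downloaded automatically by Keras).
backbone = tf.keras.applications.ResNet50(
    include_top=False, input_shape=[None, None, 3], weights="imagenet"
)

# Classification/regression head kernels start from a normal distribution.
kernel_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
# Bias prior from the RetinaNet paper / Keras example (assumed here).
prior_probability = tf.constant_initializer(-np.log((1 - 0.01) / 0.01))

classification_head = tf.keras.layers.Conv2D(
    9 * 3, 3, padding="same",            # 9 anchors x 3 classes (assumed)
    kernel_initializer=kernel_init,
    bias_initializer=prior_probability,
)
```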
We used the focal loss to train the model, as coded by Srihari Humbarwadi. The loss function is minimised by stochastic gradient descent (SGD). Because the loss function has two hyperparameters (alpha and gamma) and SGD has two hyperparameters (learning rate and momentum), we used Keras Tuner, with the Hyperband algorithm, to compute the values that best minimise the loss function.
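A sketch of how such a search could be wired with Keras Tuner's Hyperband; the search ranges, the build_retinanet helper and a RetinaNetLoss wrapper exposing alpha and gamma are assumptions, not the authors' exact code.

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    alpha = hp.Float("alpha", 0.1, 0.9, step=0.05)          # focal loss class balance
    gamma = hp.Float("gamma", 1.0, 4.0, step=0.5)            # focal loss focusing factor
    lr = hp.Float("learning_rate", 1e-4, 1e-1, sampling="log")
    momentum = hp.Float("momentum", 0.5, 0.99)

    model = build_retinanet(num_classes=3)                   # assumed helper from retinanet.py
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=momentum),
        loss=RetinaNetLoss(num_classes=3, alpha=alpha, gamma=gamma),  # assumed focal-loss wrapper
    )
    return model

tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=30,
                     directory="tuner_logs", project_name="retinanet_aveleda")
tuner.search(train_dataset, validation_data=validation_dataset, epochs=30)
best_hp = tuner.get_best_hyperparameters(1)[0]
```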
The folder structure for training, quantising and compiling the model is organised as follows:
training
├── aveleda
│   ├── aveleda
│   │   └── Aveleda
│   │       ├── Annotations
│   │       ├── ImageSets
│   │       │   └── Main
│   │       │       ├── test.txt
│   │       │       ├── train.txt
│   │       │       └── validation.txt
│   │       └── JPEGImages
│   ├── aveleda.py
│   ├── label_map.pbtxt
│   └── labels.list
├── dump_model.py
├── evaluate_voc.py
├── exporter.py
├── exporter_qat.py
├── post_quantizer_ft.py
├── post_quantizer.py
├── retinanet.py
├── train.py
└── train_quantised.py
Quantising the model
The result of the previous step is a floating-point model. However, the Vitis AI compiler requires an int8 model, i.e. a quantised model with 8-bit integers. To obtain the quantised model, Vitis AI offers three strategies:
- Post-training quantisation
- Post-training quantisation with finetuning
- Quantisation aware training
Post-training quantisation produced very good results, so it was the strategy used in this project. The quantisation can be run with the following command:
$ python3 post_quantizer.py
This script loads the previously trained model from the checkpoints folder and quantises it, given a calibration set. For the current case, we used the training set as the calibration set. Other checkpoints can be used by changing the file checkpoint inside the checkpoints folder.
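A minimal sketch of what post_quantizer.py likely does, assuming the standard Vitis AI TensorFlow 2 post-training quantisation flow; the checkpoint filename and the calib_dataset variable are assumptions.

```python
import tensorflow as tf
from tensorflow_model_optimization.quantization.keras import vitis_quantize

# Load the floating-point model trained in the previous step (assumed filename).
float_model = tf.keras.models.load_model("checkpoints/retinanet_float.h5")

# calib_dataset: a representative batch of preprocessed training images used
# only to calibrate the quantisation ranges (here the training set is used).
quantizer = vitis_quantize.VitisQuantizer(float_model)
quantized_model = quantizer.quantize_model(calib_dataset=calib_dataset)

# This quantized.h5 file is the input expected by vai_c_tensorflow2 below.
quantized_model.save("quantized.h5")
```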
Compiling the model for the DPU on the AMD-Xilinx FPGA requires a specific compiler, as mentioned previously. AMD-Xilinx developed vai_c_tensorflow2, which converts Keras H5 models to xmodel files.
$ vai_c_tensorflow2 \
-m quantized.h5 \
-a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json \
-o compiled \
-n retinanet_resnet50_aveleda_384_384_3
Bear in mind that the compiler is not compatible with all kinds of operations. In particular, it cannot convert the operations related to preprocessing and post-processing. Therefore, the output of the quantisation process should be the raw model, composed of:
- Backbone network (ResNet 50)
- Feature Pyramid Network (FPN) layers
- Bounding Boxes regressor
- Classes classifier
After obtaining the compiled model for the DPU, move the xmodel file to the FPGA memory. On the FPGA, the authors use the folder structure indicated below. The folder aveleda_test contains the images that will be used for inference. The postprocessing.py file has the routines for the post-processing stage of the inference, i.e. decoding the bounding boxes and filtering them (see the sketch after the folder tree).
.
├── aveleda_test
│   └── <images>.jpg
├── postprocessing.py
├── retinanet_benchmark.py
└── retinanet_inference.py
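A sketch of the kind of routines postprocessing.py provides: decoding the raw box regressions against the anchors and filtering overlapping detections with non-maximum suppression, in plain NumPy. The box-variance values follow the Keras RetinaNet example and are an assumption here.

```python
import numpy as np

BOX_VARIANCE = np.array([0.1, 0.1, 0.2, 0.2], dtype=np.float32)

def decode_boxes(anchors, deltas):
    """anchors in (cx, cy, w, h), deltas as encoded regressions -> (x1, y1, x2, y2)."""
    deltas = deltas * BOX_VARIANCE
    cxcy = deltas[:, :2] * anchors[:, 2:] + anchors[:, :2]
    wh = np.exp(deltas[:, 2:]) * anchors[:, 2:]
    return np.concatenate([cxcy - wh / 2.0, cxcy + wh / 2.0], axis=1)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```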
The current state of this implementation allows us to run inference and output the detection results (retinanet_inference.py) and to benchmark the network performance using multithreading (retinanet_benchmark.py).
$ python3 retinanet_benchmark.py
This script, retinanet_benchmark.py, assesses the performance of the network. Change lines 131 and 132 to set the correct test images folder and the desired number of threads, and change line 137 to set the correct path to the xmodel.
$ python3 retinanet_inference.py
This script, retinanet_inference.py, detects the objects (grapes and trunks) in the images of a folder and outputs the results to a JSON file. Change the same lines to set the correct configuration. For this script, the number of threads cannot be set.
The results below assess the model on the test set of the dataset. They appear reasonable and representative of real-world scenarios. The results were obtained using the COCO evaluation metrics (https://cocodataset.org/#detection-eval), a confusion matrix, and other measures.
The network has good overall performance, although some additional work should be done to improve its precision. The network can successfully detect most of the fruits and trunks; in many situations, rather than missing a fruit, it produces multiple detections of the same fruit.
It is worth noting that, during this network assessment, some labelling errors were also found in the dataset, which may bias the results. Among these labelling mistakes are missing annotations and wrong annotations.
The implemented neural network was assessed on the AMD-Xilinx ZCU104 Development Board and the AMD-Xilinx Kria KV260 starter kit. The ZCU104 has 4 CPU (Central Processing Unit) cores and 2 DPU (Deep Learning Processing Unit) cores. While using a single thread, the program uses a single DPU core, but with two or more threads it always uses both DPU cores. Efficiency improves with more threads because not the whole program can be executed on the DPU: the preprocessing and post-processing stages run on the CPU. With four threads, the deep learning model can process images in real-time.
Unlike the ZCU104, the AMD-Xilinx KV260 starter kit only has space to instantiate a single DPU core. Therefore, the level of parallelisation is lower, leading to an increase in the inference time. The inference time could only be decreased on a bigger FPGA device with enough resources to instantiate more DPU cores.
Figure 1 shows some sample images with the results of the inference process.
A video showing the system performance can be viewed at DOI:10.5281/zenodo.6402573
Conclusion
During this work, the authors assessed the capacity of cost-effective FPGAs to process images and detect grapes in real-world scenarios. Both the ZCU104 and the KV260 were evaluated running RetinaNet ResNet50 v1 and performed reasonably. The neural network detected the objects in the images with a Recall of about 90% and a Precision of about 50%, corresponding to an F1 score of about 70%. The FPGA boards behave similarly, but because the ZCU104 has one more DPU core than the KV260, it reached faster inference speeds. The detection speed ranges between 8.38 and 25 FPS.
This project aimed to implement and accelerate an object detector capable of detecting trunks and highly occluded fruits under different weather conditions. A large, augmented dataset with many objects per class was used. Besides, the objects were captured using different cameras and during different periods of the year. The fruits were captured in a natural, wine-producing vineyard to obtain the most realistic samples of bunches of grapes. These grapes are green and very similar to the background. The chosen object detector needs to be compatible with the DPU, so we isolated the incompatible layers (preprocessing and post-processing) and reworked the remaining layers to make them fully compatible. Both preprocessing and post-processing were implemented in Python: the preprocessing uses OpenCV to convert and prepare the input image, and the post-processing uses the NumPy library to optimise the code. Further optimisations of this block could be made, for instance by implementing it in the programmable logic of the FPGA.
Further work concerns integrating the inference system into a robotic system using ROS2 (https://www.ros.org/) and optimising the deep learning model by pruning. The Robot Operating System (ROS) is the most used middleware for robotics development in research and is becoming a standard in industry. Therefore, most robots under development use ROS or ROS2 (the new generation of ROS). Implementing these routines in a ROS2-compatible way is a major step towards robotics integration. Besides, AgRob v16 is a robot under development to operate in vineyards, and it is implemented in ROS. Pruning is a strategy to identify and remove unimportant nodes that slow the network down, without severely reducing its performance.
Other studies should also be performed. Different network backbones, such as ResNet101 or VGG, may improve accuracy, while others, such as MobileNets, may make the network faster. The use of Binary Neural Networks (BNNs) is becoming a trend in the literature for fast target assessment. Due to the characteristics of FPGAs, BNNs can be especially relevant because bitwise operations are straightforward to implement. Finally, neural networks usually have many hyperparameters that can and should be optimised, and other optimisation algorithms, such as genetic algorithms, could be valuable for this.