Precision farming is an innovative field that has been capturing the attention of researchers, innovators, and end-users alike. Farming applications such as robotic harvesting and intelligent spraying systems demand the use of intelligent vision solutions (IVS). By necessity, these systems are often deployed remotely. In practice, this means embedded platforms with limited resources and no access to the cloud. This constrained environment, where embedded devices are solely responsible for data collection and processing, is called the edge. Running state-of-the-art IVS involves complex artificial intelligence models and computer vision algorithms. Being able to accelerate these at the edge provides significant advantages in terms of power usage, frame rate, and model accuracy.
We propose using the You Only Look Once version 3 (YOLOv3) convolutional neural network (CNN) in combination with other computer vision techniques for fruit detection at the edge. Our customised version of YOLOv3 is accelerated on the Xilinx ZCU104 development board. Our results show a five-fold performance acceleration when the model is implemented on the programmable logic of a Xilinx ZU7EV device using the Xilinx DPU IP and Vitis AI.
1 - Introduction
Between 2018 and 2019 the Utilised Agricultural Area (UAA) in the UK increased by 1.0% to 17.5 million hectares; it now covers 72% of the country [1]. Every year, the agricultural industry needs to recruit tens of thousands of skilled seasonal labourers to work this expanding farmland [3]. The problem is that few Britons want to work in this sector. On top of this, Brexit brings new challenges: changes in law and the regulatory environment are making it more difficult for farmers to find the workforce they need from overseas [3]. To address this problem, the UK government is investing in the development of robotic farming applications to make the UK self-sufficient in terms of food production [2].
Most farming tasks are essentially repetitive and therefore lend themselves to automation, but they rely on the accurate detection and grasping of fruit. This makes them ideal candidates for robotic applications. In this project, we present an application that detects fruit (apples and oranges) at the edge (where the data is generated) using the Xilinx ZCU104 board.
2 - Brief Literature Overview
Brandenburg et al. [1] reported that while each picking task is different, a robot's computer vision performance has a disproportionately large impact on its success, although one could argue that the end-effector and gripping strategy play an equally important role in precision farming. The initial step is the visual identification of the fruit: without a reliable and robust way to do this, any subsequent task will be prone to failure. The contribution of our work therefore focuses on this initial computer vision step. We analyse how a state-of-the-art AI algorithm (YOLOv3) running on the Xilinx ZCU104 development board can improve fruit classification, providing an essential function for fruit-picking robots.
A fruit-grasping algorithm includes five distinct phases: 1) fruit classification: use a CNN to classify the fruits in the image; 2) pose estimation: estimate the best pose from which to pick the object; 3) path searching: search for the best path given the current positions of the objects in the scene; 4) path planning: predict the trajectories of moving objects and estimate the optimal path while avoiding collisions; and 5) grasping: grasp the target fruit.
In our work, we focus on phase 1, fruit classification, as we are currently interested in counting the fruit and estimating the potential yield.
Please refer to the websites below to learn more about YOLOv3 and Vitis.
- YOLOv3 Website: https://pjreddie.com/darknet/yolo/
- YOLOv3 Research Paper: https://pjreddie.com/media/files/papers/YOLOv3.pdf
- Strawberry Detection Using a Heterogeneous Multi-Processor Platform: https://arxiv.org/pdf/2011.03651.pdf
- Vitis-AI User Guide: https://www.xilinx.com/support/documentation/sw_manuals/vitis_ai/1_0/ug1414-vitis-ai.pdf
The objective and aims are as follows:
Objective:
- Detect oranges and apples at the edge.
Aims:
- Design and implement a custom YOLOv3 model for detecting oranges and apples.
- Accelerate the custom YOLOv3 model using the Xilinx DPU configured for the ZCU104 Zynq UltraScale+ evaluation board.
The tasks were extracted from the list of requirements using MoSCoW analysis and are shown in Table 1.
Table 1: Requirements list
Table 2 shows the project risks identified for this project.
Table 2: Risk analysis
The Gantt chart is shown in Table 3.
Table 3: Project Gantt Chart
3.3 - Setup the environment
The environment setup is presented in this section.
3.3.1 System overview
To accelerate a custom YOLOv3 model on the MPSoC device, we need to complete the following steps:
- Obtain the images for training
- Annotate the images using a YOLO-compatible annotation tool
- Train the YOLOv3 neural network using Darknet
- Convert the YOLOv3 weights file to a TensorFlow model
- Run the Vitis-AI toolchain on the converted TensorFlow model, exporting the files needed for the ZCU104
- Run the inference on the ZCU104
You can see the visual representation in the figure below:
The following system components were used:
- ZCU104 Development Board
- Video Camera (See3Cam included in ZCU104 Kit should do)
- MicroSD Memory Card (16GB minimum recommended)
Multiple datasets were used to train the YOLOv3 model. The custom apple dataset was recorded by bardsley-england.com.
The orange dataset was taken from various online sources (including YouTube, random Google Images, and Google Open Dataset).
- Google Open Dataset: https://storage.googleapis.com/openimages/web/index.html
The training process requires the use of images of fruit captured in a variety of scenarios. This ensures that the final neural network is not overly specialised, and is capable of generalising better to the detection of fruit in other situations. Furthermore, the authors annotated the datasets themselves to ensure only high-quality data was used in the training process. Although these annotation and training processes were time-consuming, it was time well spent. High-quality training data is key to improving model accuracy and performance.
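As an illustration of the data involved, Darknet-style YOLO annotations use one text file per image, where each line holds a class ID followed by the bounding-box centre, width, and height, all normalised to the range 0-1. The minimal C++ sketch below parses such a file; the file name and the two-class mapping used in the comments are illustrative assumptions rather than details of our exact dataset.
// Minimal sketch: parsing a Darknet/YOLO annotation file.
// Assumed format per line: <class_id> <x_center> <y_center> <width> <height>,
// all coordinates normalised to [0, 1]. The file name and class mapping are
// illustrative assumptions, not the exact layout of our dataset.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct Annotation {
    int class_id;     // e.g. 0 = apple, 1 = orange (assumed mapping)
    float x_center;   // normalised centre x
    float y_center;   // normalised centre y
    float width;      // normalised box width
    float height;     // normalised box height
};

std::vector<Annotation> loadAnnotations(const std::string& path) {
    std::vector<Annotation> boxes;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream iss(line);
        Annotation a;
        if (iss >> a.class_id >> a.x_center >> a.y_center >> a.width >> a.height) {
            boxes.push_back(a);
        }
    }
    return boxes;
}

int main() {
    // "apple_0001.txt" is a hypothetical label file accompanying apple_0001.jpg.
    for (const auto& a : loadAnnotations("apple_0001.txt")) {
        std::cout << "class " << a.class_id << " centre (" << a.x_center << ", "
                  << a.y_center << ") size " << a.width << " x " << a.height << "\n";
    }
    return 0;
}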
3.3.3 Neural Network Design
YOLOv3 is the convolutional neural network used in this project to perform real-time object detection. YOLOv3 works by taking a single image, dividing it into a grid of regions, and predicting bounding boxes and class probabilities for each region. These bounding boxes are then weighted by their predicted probabilities.
The detection API returns these results as ratios of the image dimensions; multiplying them by the image width and height gives the pixel coordinates of the bounding boxes, together with the corresponding class ID number.
auto results = yolo->run(img);                 // pass the image to the YOLO API and obtain the results
for (auto& box : results.bboxes) {             // for each detected bounding box
    int label  = box.label;                    // class label of the detection
    float xmin = box.x * img.cols;             // minimum x coordinate in pixels
    float ymin = box.y * img.rows;             // minimum y coordinate in pixels
    float xmax = xmin + box.width  * img.cols; // maximum x: add the box width
    float ymax = ymin + box.height * img.rows; // maximum y: add the box height
    // ... draw the box and label on the image (see the sketch after the caption)
}
Code snapshot 1: calculating the Cartesian pixel coordinates used to draw the bounding boxes.
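Because we are primarily interested in counting fruit for yield estimation, the detections can also be filtered by confidence and tallied per class. The sketch below illustrates this idea; it assumes the result type vitis::ai::YOLOv3Result and a per-detection score field, as exposed by the Vitis-AI Library headers, and the 0.5 confidence threshold is an arbitrary illustrative choice.
// Illustrative sketch: filter detections by confidence and count fruit per class.
// Assumes the Vitis-AI Library YOLOv3 result type and its per-box `score` field;
// the 0.5 threshold is an arbitrary choice for illustration.
#include <iostream>
#include <map>
#include <vitis/ai/yolov3.hpp>

std::map<int, int> countDetections(const vitis::ai::YOLOv3Result& results,
                                   float min_score = 0.5f) {
    std::map<int, int> counts;            // class label -> number of detections
    for (const auto& box : results.bboxes) {
        if (box.score >= min_score) {     // keep only confident detections
            ++counts[box.label];
        }
    }
    return counts;
}

// Usage: print how many objects of each class were found in the current frame.
// auto counts = countDetections(yolo->run(img));
// for (const auto& [label, n] : counts) std::cout << "class " << label << ": " << n << "\n";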
3.3.4 Neural Network Training
The network was trained on the assembled dataset using an AWS P3 instance with NVIDIA Tesla V100 GPUs.
The figure below shows that we were using the latest NVIDIA driver (at the time of writing) and V100 GPUs with CUDA version 11. The screenshot was taken during training, when around 115,000 MiB of video memory was in use.
To achieve better detection performance, our starting point was the pre-trained YOLOv3 network, with weights available at https://pjreddie.com/darknet/yolo/.
Training was carried out using the AlexeyAB Darknet trainer (https://github.com/AlexeyAB).
The training steps are as follows:
- Move the images to the AWS instance
- Move the annotation files to the AWS instance
- Run the training command
- Download the trained weights from the AWS instance and run the inference
We trained the neural network for 4,000 iterations, following the YOLOv3 guidance of roughly 2,000 iterations per class (we have two classes: apples and oranges).
You can see the training loss curve in the figure below:
After the network was trained, we ran inference tests on images that were not included in the training set to measure the precision of our neural network.
We achieved good detection performance: the mean average precision (mAP) of our neural network is 88.10%.
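For reference, detection metrics such as mAP decide whether a predicted box matches a ground-truth box using the intersection-over-union (IoU) between them. The sketch below computes IoU for two axis-aligned boxes in pixel coordinates; it is a generic illustration of the metric, not the exact evaluation code we used, and the example boxes are hypothetical.
// Generic sketch: intersection-over-union (IoU) between two axis-aligned boxes
// given as (xmin, ymin, xmax, ymax) in pixels. Illustrates how detection metrics
// such as mAP match predictions to ground truth; not our exact evaluation code.
#include <algorithm>
#include <iostream>

struct Box { float xmin, ymin, xmax, ymax; };

float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.xmin, b.xmin);   // left edge of the intersection
    float iy1 = std::max(a.ymin, b.ymin);   // top edge of the intersection
    float ix2 = std::min(a.xmax, b.xmax);   // right edge of the intersection
    float iy2 = std::min(a.ymax, b.ymax);   // bottom edge of the intersection
    float iw  = std::max(0.0f, ix2 - ix1);  // intersection width (0 if disjoint)
    float ih  = std::max(0.0f, iy2 - iy1);  // intersection height (0 if disjoint)
    float inter = iw * ih;
    float areaA = (a.xmax - a.xmin) * (a.ymax - a.ymin);
    float areaB = (b.xmax - b.xmin) * (b.ymax - b.ymin);
    float uni   = areaA + areaB - inter;    // union of the two boxes
    return uni > 0.0f ? inter / uni : 0.0f;
}

int main() {
    Box pred {100, 120, 220, 260};   // hypothetical predicted box
    Box truth{110, 130, 230, 250};   // hypothetical ground-truth box
    std::cout << "IoU = " << iou(pred, truth) << "\n";   // about 0.73 for these boxes
    return 0;
}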
After the network was trained, the weights were converted to TensorFlow files using the open-source DW2TF software available at https://github.com/jinyu121/DW2TF. This was necessary because, at the time of writing, the Vitis-AI toolchain only supports TensorFlow and Caffe models.
The DW2TF converter works by taking the Darknet weights and configuration file and creating the equivalent TensorFlow network from the provided data.
The TensorFlow model was then put through the Vitis-AI compilation toolchain, available on Docker Hub. The tool converts the 32-bit floating-point values to the int8 type, providing a significant reduction in memory and power usage without any significant degradation in detection accuracy. This process is termed quantization. To calibrate the quantization appropriately, a small subset of our dataset was used.
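To make the idea concrete, the sketch below shows a simple symmetric int8 quantization of a handful of floating-point values, where a single scale factor maps the observed range onto [-127, 127]. This only illustrates the general principle; the Vitis-AI quantizer's internal calibration scheme is more sophisticated, and the example values are made up.
// Conceptual sketch of symmetric int8 quantization: map float values onto
// [-127, 127] using a single scale factor derived from the data. Illustrates
// the principle only; the Vitis-AI quantizer works differently internally.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<float> weights = {0.31f, -1.27f, 0.05f, 0.98f, -0.44f};  // example values

    // Choose the scale so that the largest magnitude maps to 127.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    float scale = max_abs / 127.0f;

    for (float w : weights) {
        int8_t q   = static_cast<int8_t>(std::lround(w / scale));  // quantize
        float back = q * scale;                                    // dequantize
        std::cout << w << " -> " << static_cast<int>(q)
                  << " (reconstructed " << back << ")\n";
    }
    return 0;
}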
3.3.5 Running the Inference
After the neural network was trained and converted using the Vitis-AI compilation tool, the model was moved to the ZCU104 and inference was run on video using a simple C++ test programme.
First, the image was loaded from the camera (or an image file) into memory. Then the image was passed to the YOLO detection API, which is part of the Xilinx Vitis-AI Library. Inference details were then obtained from the API. For each detection, the appropriate bounding boxes and class labels were drawn on top of the original image. Finally, the resultant image was either saved to file or displayed to the user.
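A minimal sketch of this loop is shown below, assuming the Vitis-AI Library YOLOv3 API (vitis::ai::YOLOv3) and OpenCV; the model name "yolov3_fruit" and the camera index are illustrative assumptions rather than our exact configuration.
// Minimal sketch of the video inference loop, assuming the Vitis-AI Library
// YOLOv3 API and OpenCV. The model name "yolov3_fruit" and camera index 0
// are illustrative assumptions, not our exact configuration.
#include <opencv2/opencv.hpp>
#include <string>
#include <vitis/ai/yolov3.hpp>

int main() {
    auto yolo = vitis::ai::YOLOv3::create("yolov3_fruit");  // load the compiled model
    cv::VideoCapture cap(0);                                // open the camera
    cv::Mat img;
    while (cap.read(img)) {                                 // grab the next frame
        auto results = yolo->run(img);                      // run inference on the DPU
        for (const auto& box : results.bboxes) {            // draw every detection
            cv::Rect r(box.x * img.cols, box.y * img.rows,
                       box.width * img.cols, box.height * img.rows);
            cv::rectangle(img, r, cv::Scalar(0, 255, 0), 2);
            cv::putText(img, std::to_string(box.label), r.tl(),
                        cv::FONT_HERSHEY_SIMPLEX, 0.8, cv::Scalar(0, 255, 0), 2);
        }
        cv::imshow("fruit detection", img);                 // display the annotated frame
        if (cv::waitKey(1) == 27) break;                    // exit on Esc
    }
    return 0;
}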
You can see the visual representation of the inference process below:
3.4 Results analysis
The testing phase was completed and documented in the test-case scenario tables below:
Table 5: Tests scenarios
You can also take a look at some video inference examples:
Conclusion and future work
The key goal of this project was to enable fruit detection at the edge using FPGA technology. After some consideration, we opted to use a Xilinx MPSoC device, specifically the ZCU104 FPGA development board.
Accelerating YOLOv3 models on the Xilinx Vitis-AI platform provides a fast and power-efficient AI classification method that can be used in various fields, ranging from agriculture to industry. In this regard, our system fulfils all aspects of adaptive compute acceleration, providing high-performance inference at the edge: it has low power requirements and enables rapid changes in workflow by making it easy to train and deploy a new, improved model.
Our first step was to obtain datasets of oranges and apples and annotate them appropriately. We then used these high-quality datasets to train the YOLOv3 CNN on a GPU (CUDA architecture). This process was repeated iteratively, incrementally improving the performance of the CNN.
Once we had a successful CNN running on a GPU, our next task was to use the Vitis-AI tool to accelerate the model to run on the ZCU104; we concluded our work with successful system testing.
Overall, the system performed as expected, providing adequate inference speed at the edge whilst being very power efficient. This makes our system a viable candidate for future use in the field of agriculture.
Further work is still needed, however. As the image above shows, the bounding boxes are sometimes offset from their target, which is not the case with the demo models provided with Xilinx Vitis-AI; we have yet to find the cause of this artefact. We also plan to optimise the ZCU104 implementation by using the Vitis-AI optimiser to prune the network without significantly degrading performance, and to thoroughly test the network in real-world farming applications.
[1] Department for Environment, Food and Rural Affairs, UK Government. Agriculture in the United Kingdom 2019. Available online: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/904024/AUK_2019_27July2020.pdf, last accessed: 28/11/2020.
[2] A. Angus, P. J. Burgess, J. Morris, and J. Lingard. Agriculture and land use: Demand for and supply of agricultural commodities, characteristics of the farming and food industries, and implications for land use in the UK. Land Use Policy, 26(SUPPL. 1):230–242, 2009.
[3] Robert Bogue. Fruit picking robots: has their time come? Industrial Robot, 47(2):141–145, 2020.
Acknowledgements:
We would like to thank:
1) Samuel Brandenburg, who kindly shared his Major Project's source code and documentation while finishing an MSc degree at Nottingham Trent University.
2) Adam Slate at bardsley-england.com for sharing the apple dataset.