YoloV3 Architecture
YOLOv3 (You Only Look Once version 3) is an advanced real-time object detection algorithm. It is designed to quickly and accurately detect objects in images or video frames. YOLOv3 builds upon the success of its predecessor, YOLOv2, and introduces several enhancements to improve detection accuracy and performance.
One notable improvement in YOLOv3 is the use of a feature pyramid network (FPN), which enables the model to extract features at multiple scales. By combining high-resolution and low-resolution feature maps, YOLOv3 can effectively detect objects of various sizes.
Another key feature of YOLOv3 is the integration of a multi-scale detection approach. Instead of processing images at a fixed resolution, YOLOv3 operates on three different scales. This allows the model to detect both small and large objects accurately.
YOLOv3 also adopts a deeper neural network architecture known as Darknet-53. This architecture, based on residual connections, consists of 53 convolutional layers. The increased depth enables YOLOv3 to learn more complex representations and capture finer details in the input images.
In summary, YOLOv3 is a state-of-the-art object detection algorithm that combines feature pyramid networks, multi-scale detection, and deeper neural network architecture to achieve high accuracy and real-time performance.
To explore the full YOLOv3 model code, please refer to the GitHub repository linked as YoloV3 main model. You can also follow the ReadMe.md at https://github.com/LogicTronix/Vitis-AI-Reference-Tutorials/tree/main/Quantizing-Compiling-YOLOv3-Pytorch-with-DPU-Inference.
1. Darknet-53
Darknet-53 is a deep convolutional neural network architecture used in the YOLOv3 object detection algorithm. It serves as the backbone network for feature extraction in YOLOv3.
There are no max-pooling layers in YOLOv3; instead, downsampling is done with strided convolutional layers. Using a convolutional layer instead of a max-pooling layer has a significant advantage: the convolutional layer learns features from the whole specified area, whereas max-pooling keeps only the maximum value from that area, so other features may be lost.
Darknet-53 also uses residual connections, also known as skip connections, to allow information to flow through the network more easily, preventing the degradation of the network's performance as it gets deeper. These connections skip over several layers and pass the information to subsequent layers, helping to alleviate the vanishing gradient problem and enabling the network to learn more effectively.
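As a rough, hedged illustration (not the actual Darknet-53 source), a Darknet-style residual block and a strided-convolution downsampling layer can be sketched in PyTorch as follows; the channel counts are arbitrary choices for the example:

import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Illustrative Darknet-style residual block: 1x1 reduce -> 3x3 expand, plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # Skip connection: output = input + residual branch
        return x + self.block(x)

# Downsampling uses a strided 3x3 convolution instead of max-pooling
downsample = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)

block = DarknetResidual(64)
print(block(torch.randn(1, 64, 52, 52)).shape)  # torch.Size([1, 64, 52, 52])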
2. Multi-scale Detection
YOLOv3 divides the input image into a grid and assigns each grid cell the responsibility of detecting objects. Instead of processing the entire image at a fixed resolution, YOLOv3 processes the image at three different scales: small, medium, and large.
At each scale, YOLOv3 predicts bounding boxes, objectness scores, and class probabilities. This enables the model to detect objects that may appear small or large in the original image. By analyzing the image at multiple scales, YOLOv3 can capture both fine-grained details of small objects and contextual information of large objects.
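The article refers to a snippet of the model's forward pass here; as a hedged, toy stand-in (not the repository's actual ModelMain code), the same three-output structure can be reproduced like this:

import torch
import torch.nn as nn

class TinyYoloLikeModel(nn.Module):
    """Toy model that mimics only the three-scale output structure of YOLOv3."""
    def __init__(self, num_outputs=255):  # 255 = 3 anchors * (5 + 80 COCO classes)
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)                        # 416 -> 208
        self.down = nn.ModuleList(
            [nn.Conv2d(32, 32, 3, stride=2, padding=1) for _ in range(4)])          # 104, 52, 26, 13
        self.head13 = nn.Conv2d(32, num_outputs, 1)   # predictions on the 13x13 grid
        self.head26 = nn.Conv2d(32, num_outputs, 1)   # predictions on the 26x26 grid
        self.head52 = nn.Conv2d(32, num_outputs, 1)   # predictions on the 52x52 grid

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for layer in self.down:
            x = layer(x)
            feats.append(x)
        out13 = self.head13(feats[3])   # coarsest grid -> large objects
        out26 = self.head26(feats[2])   # medium grid -> medium objects
        out52 = self.head52(feats[1])   # finest grid -> small objects
        return out13, out26, out52

outputs = TinyYoloLikeModel()(torch.randn(1, 3, 416, 416))
print([tuple(o.shape) for o in outputs])  # [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)]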
From this code snippet, it is clear that the model with input x has 3 outputs, because predictions are made three times, each on a different grid size: 13, 26, and 52. The output from the 13x13 grid is responsible for detecting larger objects, the 26x26 grid detects medium objects, and the 52x52 grid detects smaller objects. Hence, objects of all sizes are detected by YOLOv3, which also improves accuracy.
3. Anchor Boxes
In YOLOv3, each grid cell in the input image is responsible for predicting a fixed number of bounding boxes. The number of anchor boxes assigned to each grid cell determines the detection capability of the model. For instance, if there are three anchor boxes assigned to each grid cell, YOLOv3 will predict three bounding boxes at each grid cell.
Anchor boxes in YOLOv3 are defined based on prior knowledge about the dataset and the expected object shapes. Typically, a set of anchor boxes with different aspect ratios and sizes is chosen to cover a wide range of object shapes and sizes commonly found in the dataset.
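The next two paragraphs describe how the raw predictions are decoded against the anchors. A simplified, hedged sketch of that decoding is shown below; the tensor layout [batch, anchors, grid, grid, 5 + classes], the function name, and the anchor values are illustrative assumptions rather than the repository's exact code:

import torch

def decode_boxes(raw, anchors, stride):
    """Illustrative decoding of raw predictions into boxes on one grid scale.

    raw:     tensor of shape [batch, num_anchors, grid, grid, 5 + num_classes]
    anchors: list of (w, h) pairs in pixels for this scale
    stride:  input pixels per grid cell (e.g. 32 for the 13x13 grid at 416x416)
    """
    batch, num_anchors, grid, _, _ = raw.shape
    grid_y, grid_x = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    anchor_w = torch.tensor([a[0] for a in anchors]).view(1, num_anchors, 1, 1)
    anchor_h = torch.tensor([a[1] for a in anchors]).view(1, num_anchors, 1, 1)

    # Center x, y: sigmoid keeps the offset inside the cell, then add the cell index
    cx = (torch.sigmoid(raw[..., 0]) + grid_x) * stride
    cy = (torch.sigmoid(raw[..., 1]) + grid_y) * stride
    # Width, height: exponential keeps them positive, scaled by the anchor size
    w = torch.exp(raw[..., 2]) * anchor_w
    h = torch.exp(raw[..., 3]) * anchor_h
    conf = torch.sigmoid(raw[..., 4])     # objectness score
    cls = torch.sigmoid(raw[..., 5:])     # class probabilities

    # Concatenate everything into the final prediction tensor
    return torch.cat([cx.unsqueeze(-1), cy.unsqueeze(-1),
                      w.unsqueeze(-1), h.unsqueeze(-1),
                      conf.unsqueeze(-1), cls], dim=-1)

raw = torch.randn(1, 3, 13, 13, 85)  # dummy predictions for the 13x13 grid
boxes = decode_boxes(raw, anchors=[(116, 90), (156, 198), (373, 326)], stride=32)
print(boxes.shape)  # torch.Size([1, 3, 13, 13, 85])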
The predicted height and width are passed through an exponential function so that their values are always positive. They are then multiplied by the anchor's height and width to obtain the final bounding box, since they are predicted relative to the anchor boxes.
The code snippet above shows that the x and y coordinates are passed through a sigmoid function so that they lie in the range 0 to 1 relative to the grid cell, while the height and width are passed through an exponential function and multiplied by the anchor's height and width. All of these are concatenated to obtain the predicted bounding box.
Quantizing Yolov3 Pytorch with Vitis AI 3.0
Quantizing YOLOv3 in PyTorch involves converting the model's parameters from floating-point to lower-precision fixed-point or integer representations to reduce memory and computation requirements while preserving acceptable accuracy.
We will be using the Vitis AI 3.0 GPU docker for quantization; quantization can also be performed in the CPU docker of Vitis AI. The GPU docker needs to be built locally, while the CPU docker can simply be pulled and used. Quantization in the GPU docker is faster than in the CPU docker.
Steps for quantization:
- Load the model
- Quantization → quant model
- Forward pass
- Handle the quant model
To access the codebase for YOLOv3 quantization, please refer to the following GitHub repository: YoloV3 Quantization.
Step 1: Load the model
The float-trained model needs to be loaded such that it can be quantized.
state_dict = torch.load(config["pretrain_snapshot"])
model.load_state_dict(state_dict)
First, the code snippet loads a PyTorch model's state dictionary from the file specified in the configuration. The second line then loads that state dictionary into the PyTorch model. The model variable represents the neural network we want to load the parameters into; by calling load_state_dict, we transfer the trained weights and parameters from the saved state dictionary to the model.
Step 2: Quantization (Generating quant model)
The quantization is performed using the torch_quantizer function from the pytorch_nndct library, which is provided by Vitis AI.
from pytorch_nndct.apis import torch_quantizer
input = torch.randn([batch_size, 3, 416, 416])
quantizer = torch_quantizer(quant_mode, model, (input), device=device, quant_config_file=config_file, target=target)
quant_model = quantizer.quant_model
The code snippet above creates a quantizer object by passing the parameters to the torch_quantizer function of the pytorch_nndct library.
The torch_quantizer function takes in:
● quant_mode which can be either “calib” or “test”
● Model which is the float model that we loaded
● Input, which is the dummy input required for the float model, with batch_size images, each having 3 color channels (RGB) and a resolution of 416 pixels in height and 416 pixels in width
● The device represents the "cpu" or "cuda" on which the quantization should be performed.
● quant_config_file is a configuration file containing settings and options for quantization. In the “calib” step the configuration file is empty, so the default configuration is used; in the “test” step, the configuration file exported from the “calib” step is used.
● target is a platform or hardware for which the quantized model is intended.
After creating the quantizer object, the model is quantized using the settings and options specified in quant_config_file. The quantized model is stored in the quant_model variable and typically has lower-precision weights and activations than the original model.
Step 3: Forward pass
A forward pass is the process of propagating input data through the network's layers in order to compute an output. During this forward pass, the input data flows through the layers of the model, undergoes various mathematical operations (such as matrix multiplications, activation functions, and pooling), and eventually produces a prediction or output.
# Forward -- Dry Run
input_data = torch.randn([batch_size, 3, 416, 416]).to(device)
quant_model(input_data)
As shown in the code above, a dry run (forward pass) is performed through quant_model. This is a crucial step; if it is not performed, the quantizer throws an error.
From UG1414 [https://docs.xilinx.com/r/en-US/ug1414-vitis-ai/Error-Codes]:
Error code ID: QUANTIZER_TORCH_NO_FORWARD
Error message: torch_quantizer.quant_model FORWARD function must be called before exporting quantization result. Please refer to example code at https://github.com/Xilinx/Vitis-AI/blob/master/src/Vitis-AI-Quantizer/vai_q_pytorch/example/resnet18_quant.py.
Step 4: Handling quantization result
The quantization result is handled based on the two different modes of quantization: calibration (quant_mode == 'calib') and testing (quant_mode == 'test').
# Handle quantization result
if quant_mode == 'calib':
    quantizer.export_quant_config()
if quant_mode == 'test':
    quantizer.export_torch_script(verbose=False)
    quantizer.export_xmodel(deploy_check=True, dynamic_batch=True)
- Calibration Mode (quant_mode == 'calib'): In this mode, the script is configuring the quantizer to perform model calibration and exporting the quantization configuration. Model calibration is a process where you collect statistics about the model's inputs (e.g., min and max values of activations) to later quantize the model effectively.
- quantizer.export_quant_config(): This function exports the quantization configuration obtained during calibration. This configuration is crucial for quantizing the model correctly during deployment.
- Testing Mode (quant_mode == 'test'): In this mode, the script is performing quantization testing and exporting the quantized model in different formats. This mode can also be used to validate the quantized model's performance before deploying it in a production environment.
- quantizer.export_torch_script(verbose=False): This function exports the quantized model as a TorchScript (.pt) file. TorchScript is a way to serialize PyTorch models and is necessary for inference on quantized models, since Vitis AI only supports loading TorchScript quantized models for inference.
- quantizer.export_xmodel(deploy_check=True, dynamic_batch=True): This function exports the quantized model in the "xmodel" format. The deploy_check and dynamic_batch parameters enable checks for deployment readiness and support for dynamic batch sizes. The quantized xmodel is necessary to generate a compiled xmodel for a given target.
Inference of a quantized model involves using a neural network model that has been quantized, typically from floating-point precision to lower precision (e.g., fixed-point or integer), for the purpose of reducing memory and computation requirements while maintaining acceptable accuracy.
From UG1414:
● You can run the quantized model in TorchScript format, which is a .pt file, in the PyTorch framework. The pytorch_nndct module has to be imported before inference because it sets up the quantized operators used in the model.
● So far, the XIR-format quantized model cannot be run by any tool.
import torch
import pytorch_nndct
# Load the model
quantized_model = torch.jit.load('quantized_result/ModelMain_int.pt')
# Feed input data to quantized model and do inference
output = quantized_model(input)
The code snippet above loads the quantized TorchScript model using torch.jit.load. This quantized model can then be used for inference with appropriate input.
For YoloV3, inference on an image uses the preprocessed image as the input to the quantized model, as shown in the sketch below.
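A minimal sketch of such an inference call might look like the following; the image path is a placeholder, the model path follows the earlier example, and the NCHW transpose matches the (1, 3, 416, 416) dummy input used during quantization:

import cv2
import numpy as np
import torch
import pytorch_nndct  # registers the quantized operators before inference

# Model path follows the earlier example; the image path is a placeholder
quantized_model = torch.jit.load('quantized_result/ModelMain_int.pt')

image = cv2.imread('test.jpg', cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = cv2.resize(image, (416, 416), interpolation=cv2.INTER_LINEAR)
image = image.astype(np.float32) / 255.0
image = np.transpose(image, (2, 0, 1))                 # HWC -> CHW for the PyTorch model
input_tensor = torch.from_numpy(image).unsqueeze(0)    # (1, 3, 416, 416), matching the dummy input

with torch.no_grad():
    outputs = quantized_model(input_tensor)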
To access the codebase for performing inference with the quantized YOLOv3 model in TorchScript, please refer to the following GitHub repository: Quantized YoloV3 Torch Script Inference.
Quantized YOLOv3 Inference: Object Detection on Example Images:
Compiling a quantized model refers to the process of preparing a quantized neural network model for deployment on a target platform or hardware. The quantized model is compiled for the target hardware or accelerator. This involves adapting the model to take full advantage of the hardware's capabilities, such as vectorized instructions, hardware accelerators, or specialized memory layouts. The compiled model is integrated into the inference pipeline of the target application.
Once compiled and integrated, the quantized model is ready for deployment on the target platform or device. It can be used to make predictions or perform computations efficiently, taking advantage of the reduced precision and hardware optimizations.
For PyTorch, the quantizer NNDCT outputs the quantized model directly in the XIR format (an .xmodel file); vai_c_xir then compiles this quantized .xmodel into the deployable compiled .xmodel.
Use vai_c_xir to compile the quantized model:
vai_c_xir --xmodel /PATH/TO/quantized.xmodel --arch /PATH/TO/arch.json --output_dir /OUTPUTPATH --net_name netname
The compiled YOLOv3 quantized model is available for download from the following GitHub repository: Compiled Model.
In addition to the compiled model, you may need the following files:
- md5sum.txt: A checksum file to ensure the integrity of your model files.
- meta.json: A JSON file containing metadata and configuration information about the model.
You can access these files in the same GitHub repository for your convenience.
DPU Inference
DPU (Deep Learning Processing Unit) inference refers to the process of running deep learning models on specialized hardware accelerators known as DPUs. DPUs offer significant performance advantages over general-purpose CPUs or GPUs for inference tasks. They are designed to maximize throughput and minimize power consumption, making them ideal for edge devices and real-time applications.
For DPU inference, the model is converted from its original deep learning framework format (e.g., TensorFlow, PyTorch) into a format suitable for deployment on the target hardware, using a largely framework-independent flow.
Preparing the Board - Kria KV260:
Setup the KV260 board with Vitis AI 3.0 prebuilt image from: https://xilinx.github.io/Vitis-AI/3.0/html/docs/quickstart/mpsoc.html#setup-the-target
DPU Inference with vai_runtime:
During inference, the vai_runtime library takes care of loading the compiled model onto the FPGA or SoC and executing the model to perform AI inference on the input data.
Steps for DPU inference:
- Deserialize xmodel
- Get child subgraph
- Create Runner
- Preprocessing the input
- Main Execution Logic
- Post processing the outputs
- Visualising the result
To access the codebase for performing inference with the compiled YOLOv3 model (XIR format) on a DPU, please refer to the following GitHub repository: Inference on DPU.
Step 1: Deserialize xmodel
Deserializing an xmodel is the process of loading a machine learning model that has been serialised or saved in a specific format, often for the purpose of inference or further training.
g = xir.Graph.deserialize(argv[1])
The code snippet deserializes the DPU graph from the file specified by the command-line argument `argv[1]`.
Step 2: Get child subgraph
Getting a child subgraph extracts a smaller, self-contained portion of a larger computational graph in the context of deep learning or graph-based processing.
subgraphs = get_child_subgraph_dpu(g)
The code snippet extracts the DPU subgraphs from the graph.
from typing import List

def get_child_subgraph_dpu(graph: "Graph") -> List["Subgraph"]:
    assert graph is not None, "'graph' should not be None."
    root_subgraph = graph.get_root_subgraph()
    assert (root_subgraph is not None), "Failed to get root subgraph of input Graph object."
    if root_subgraph.is_leaf:
        return []
    child_subgraphs = root_subgraph.toposort_child_subgraph()
    assert child_subgraphs is not None and len(child_subgraphs) > 0
    return [cs for cs in child_subgraphs
            if cs.has_attr("device") and cs.get_attr("device").upper() == "DPU"]
The provided function `get_child_subgraph_dpu` is a Python function that takes a graph object as input and returns a list of subgraphs that represent DPU (Deep Learning Processor Unit) kernels within the input graph.
Walking through the function above step by step with the code snippet:
● First of all,
`root_subgraph = graph.get_root_subgraph()`
- This line retrieves the root subgraph of the input `graph`
● If,
`if root_subgraph.is_leaf: return []`
- This line checks whether the `root_subgraph` is a leaf subgraph (has no children).
- If it is a leaf, it means there are no child subgraphs, so the function returns an empty list `[]`.
● Else,
`child_subgraphs = root_subgraph.toposort_child_subgraph()`
- This line retrieves a list of child subgraphs of the `root_subgraph`.
- The `toposort_child_subgraph` function is a method of the `Subgraph` class that returns the child subgraphs in topological order.
● Then,
`return [cs for cs in child_subgraphs if cs.has_attr("device") and cs.get_attr("device").upper() == "DPU"]`
- This line is a list comprehension that filters the `child_subgraphs` list to include only those subgraphs that represent DPUs.
- It checks whether each child subgraph has an attribute named "device" and if its value (converted to uppercase) is equal to "DPU".
- If these conditions are met, the child subgraph is included in the resulting list.
Step 3: Create Runner
A DPU runner, typically used in the context of Xilinx Deep Processing Units (DPUs), is a software component responsible for executing machine learning models or inference tasks on DPUs. A DPU runner acts as an interface between the host CPU and the DPU hardware. It handles tasks such as model loading, input data preprocessing, and model execution on the DPU.
"""Creates DPU runner, associated with the DPU subgraph."""
dpu_runners = vart.Runner.create_runner(subgraphs[0], "run")
The code snippet creates a DPU (Deep Learning Processing Unit) runner for a DPU subgraph using the Vitis AI runtime (vart) library. The DPU runner allows you to execute the DPU subgraph on a compatible hardware accelerator.
Step 4: Preprocessing the Input
Preprocessing prepares an image for further processing, such as feeding it into a machine learning model or neural network for tasks like image classification or object detection.
# Preprocessing
image_path = argv[2]
image = cv2.imread(image_path, cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = cv2.resize(image, (config["img_w"], config["img_h"]),
interpolation=cv2.INTER_LINEAR)
image = image.astype(np.float32)
image /= 255.0
In the given code snippet,
● The image path is taken from the command-line argument argv[2].
● Then the image is read from the specified image_path using OpenCV's cv2.imread function with the cv2.IMREAD_COLOR flag, which loads the image in color (3 channels: BGR).
● The image from the BGR color space (commonly used by OpenCV) is converted to the RGB color space.
● Then the image is resized to the dimensions specified by config["img_w"] (width) and config["img_h"] (height) using the cv2.resize function with the cv2.INTER_LINEAR interpolation method, which typically produces smoother results than other methods.
● The image data type is converted to float32.
● The final step of preprocessing is to normalise the pixel values in the image to the range [0, 1] by dividing each pixel value by 255.0.
Note: The preprocessed image has shape (416, 416, 3) for YoloV3, i.e. a height and width of 416x416 with 3 channels (RGB). Unlike the floating-point model (inference on GPU), the shape must not be transposed into (3, 416, 416); the board inference uses the compiled xmodel, which requires input of shape (416, 416, 3).
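For clarity, the shape handling for the two paths can be sketched as follows (image stands for the preprocessed array from the snippet above; the placeholder array is only there so the fragment runs standalone):

import numpy as np

# Placeholder standing in for the (416, 416, 3) float32 array produced by the preprocessing above
image = np.zeros((416, 416, 3), dtype=np.float32)

# DPU / compiled xmodel path: keep NHWC layout, just add a batch dimension
dpu_input = np.expand_dims(image, axis=0)                      # (1, 416, 416, 3)

# GPU / float PyTorch model path (for comparison only): NCHW layout
gpu_input = np.transpose(image, (2, 0, 1))[np.newaxis, ...]    # (1, 3, 416, 416)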
Step 5: runYolo (Main Execution Logic)
● Input tensor / Output tensor
# get the model input tensor
inputTensors = dpu_runner_tfYolo.get_input_tensors()
# get the model output tensor
outputTensors = dpu_runner_tfYolo.get_output_tensors()
The code snippet extracts the input tensors and output tensors from the runner; these are required to prepare the input and output buffers that are finally passed to execution.
For example, the inputTensors looks like:
inputTensor[0]: {name: 'ModelMain__input_0_fix',
shape: [1, 416, 416, 3],
type: 'xint8',
attrs: {'location': 1,
'ddr_addr': 1264, 'bit_width': 8,
'round_mode': 'DPU_ROUND',
'reg_id': 2,
'fix_point': 4,
'if_signed': True}}
● Prepare batch input/output
outputHeight_0 = outputTensors[0].dims[1]
outputWidth_0 = outputTensors[0].dims[2]
outputChannel_0 = outputTensors[0].dims[3]
outputHeight_1 = outputTensors[1].dims[1]
outputWidth_1 = outputTensors[1].dims[2]
outputChannel_1 = outputTensors[1].dims[3]
outputHeight_2 = outputTensors[2].dims[1]
outputWidth_2 = outputTensors[2].dims[2]
outputChannel_2 = outputTensors[2].dims[3]
runSize = 1
shapeIn = (runSize,) + tuple([inputTensors[0].dims[i] for i in range(inputTensors[0].ndim)][1:])
Yolov3 has 3 outputs and hence 3 outputTensors. The code snippet above extracts the height, width, and number of channels for the three output tensors: outputTensors[0], outputTensors[1], and outputTensors[2]. For inputTensors, it constructs a tuple called shapeIn with dimensions [batchSize, height, width, channel].
'''prepare batch input/output '''
outputData = []
inputData = []
outputData.append(np.empty((runSize,outputHeight_0,outputWidth_0,outputChannel_0), dtype = np.float32, order = 'C'))
outputData.append(np.empty((runSize,outputHeight_1,outputWidth_1,outputChannel_1), dtype = np.float32, order = 'C'))
outputData.append(np.empty((runSize,outputHeight_2,outputWidth_2,outputChannel_2), dtype = np.float32, order = 'C'))
inputData.append(np.empty((shapeIn), dtype = np.float32, order = 'C'))
In the code snippet above, runSize is a scalar value indicating the batch size. The code creates empty NumPy arrays for both outputData and inputData and appends them to the respective lists. Since Yolov3 has 3 outputs for the different grid sizes, the outputData list gets one empty array per output, and the inputData list gets one empty array for the input image. The dimensions of each input or output are [batchSize, height, width, channel].
● Execute async
job_id = dpu_runner_tfYolo.execute_async(inputData, outputData)
dpu_runner_tfYolo.wait(job_id)
The code snippet executes the YOLO model asynchronously using the DPU runner. It takes inputData as input and produces results that are stored in outputData. inputData contains the input data for the model, and outputData is where the model's output will be stored.
The result of dpu_runner_tfYolo.execute_async(inputData, outputData) is assigned to job_id. This job_id represents a unique identifier for this particular execution of the model.
dpu_runner_tfYolo.wait(job_id) waits for the execution of the YOLO model with the specified job_id to complete.
Step 6: Post processing the Outputs
Non-Maximum Suppression (NMS) is a post-processing algorithm commonly used in object detection tasks, including YOLO, to filter out redundant and overlapping bounding box predictions. The goal of NMS is to select the most relevant and accurate bounding boxes while suppressing redundant detections.
Input:
● Set of bounding boxes with their corresponding confidence scores.
Algorithm:
1. Sort the bounding boxes based on their confidence scores in descending order.
2. Initialize an empty list to store the selected bounding boxes.
3. While there are still bounding boxes remaining in the sorted list:
- Select the bounding box with the highest confidence score (top box).
- Add the top box to the list of selected bounding boxes.
4. For each remaining bounding box in the sorted list:
- Calculate the Intersection over Union (IoU) with the top box.
- If the IoU is above a predefined threshold (e.g., 0.5), discard the bounding box as it overlaps significantly with the top box.
- If the IoU is below the threshold, keep the bounding box as a separate detection.
5. Repeat steps 3-4 until all bounding boxes in the sorted list are processed.
Output:
● The list of selected bounding boxes that have passed the non-maximum suppression.
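A simplified Python/NumPy sketch of this greedy procedure is shown below. It is an illustration under the assumption that boxes are given as corner coordinates with confidence scores; it is not the repository's exact implementation and ignores per-class handling for brevity:

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  float array of shape (N, 4) as (x1, y1, x2, y2)
    scores: float array of shape (N,)
    Returns the indices of the boxes that are kept.
    """
    order = scores.argsort()[::-1]           # sort by confidence, highest first
    keep = []
    while order.size > 0:
        top = order[0]                       # box with the highest remaining score
        keep.append(int(top))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection rectangle between the top box and all remaining boxes
        x1 = np.maximum(boxes[top, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[top, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[top, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[top, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_top = (boxes[top, 2] - boxes[top, 0]) * (boxes[top, 3] - boxes[top, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_top + area_rest - inter)
        # Keep only boxes that do not overlap the top box too much
        order = rest[iou <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms(boxes, scores))  # [0, 2] -- the second box is suppressed by the first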
Intersection over Union (IoU):
Intersection over Union (IoU) is a metric commonly used to evaluate the overlap between two bounding boxes or regions of interest (ROIs). It measures the spatial agreement between the predicted bounding box and the ground truth bounding box. IoU is widely used in object detection tasks, including non-maximum suppression (NMS) and evaluating the performance of object detection models.
The IoU is calculated as the ratio of the intersection area to the union area of two bounding boxes. Here's the formula for calculating IoU:
IoU = (Area of Intersection) / (Area of Union)
Suppose,
predicted: (x,y,w,h) = (0.338,0.4667, 0.184, 0.106)
actual: (x,y,w,h) = (0.546, 0.481, 0.136, 0.130)
Formula to calculate corners:
x1 = x - w/2
x2 = x + w/2
y1 = y - h/2
y2 = y + h/2
Then,
pred(x1,y1, x2,y2) = (0.246, 0.4137, 0.43, 0.5197)
actual(x1,y1,x2,y2) = (0.478, 0.416,0.614,0.546)
Formula to calculate the intersection rectangle (take the larger of the two left/top edges and the smaller of the two right/bottom edges, clamping negative widths or heights to zero):
x1 = max(pred_x1, actual_x1) y1 = max(pred_y1, actual_y1)
x2 = min(pred_x2, actual_x2) y2 = min(pred_y2, actual_y2)
intersection = max(0, x2 - x1) * max(0, y2 - y1)
Then,
x1 = max(0.246, 0.478) = 0.478 x2 = min(0.43, 0.614) = 0.43
y1 = max(0.4137, 0.416) = 0.416 y2 = min(0.5197, 0.546) = 0.5197
Here x2 - x1 = -0.048 is negative, so these two boxes do not overlap and intersection = 0.
Formula to calculate area:
Box_area = abs((x2-x1) * (y2-y1))
So,
pred_area = 0.184 * 0.106 = 0.019504
actual_area = 0.136 * 0.130 = 0.01768
Formula to calculate IoU:
IoU = intersection / (pred_area + actual_area - intersection)
Thus, for this pair of boxes,
IoU = 0 / (0.019504 + 0.01768 - 0) = 0
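The same calculation as a small code sketch (the second pair of boxes is a made-up overlapping example, added only to show a non-zero IoU):

def iou_xywh(box_a, box_b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    def corners(b):
        x, y, w, h = b
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2
    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    # Intersection rectangle, clamped at zero when the boxes do not overlap
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# Boxes from the worked example above: they do not overlap, so IoU is 0
print(iou_xywh((0.338, 0.4667, 0.184, 0.106), (0.546, 0.481, 0.136, 0.130)))  # 0.0
# A made-up overlapping pair, just to show a non-zero value
print(iou_xywh((0.50, 0.50, 0.20, 0.20), (0.55, 0.52, 0.20, 0.20)))           # ~0.51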
Step 7: Visualising the result
for x1, y1, x2, y2, conf, cls_conf, cls_pred in detections:
    color = bbox_colors[int(cls_pred)]

    # Rescale coordinates to original dimensions
    ori_h, ori_w, _ = im.shape
    pre_h, pre_w = config["img_h"], config["img_w"]
    box_h = ((y2 - y1) / pre_h) * ori_h
    box_w = ((x2 - x1) / pre_w) * ori_w
    y1 = (y1 / pre_h) * ori_h
    x1 = (x1 / pre_w) * ori_w

    # Create a Rectangle patch
    cv2.rectangle(im, (int(x1), int(y1)), (int(x1 + box_w), int(y1 + box_h)), color, 2)

    # Add label
    label = classes[int(cls_pred)]
    cv2.putText(im, label, (int(x1), int(y1) - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
● detections contains the objects detected in the given image by the YoloV3 model.
● x1, y1, x2, y2, conf, cls_conf, and cls_pred are unpacked from the detection. These variables represent the coordinates of a bounding box, confidence scores, and class predictions for the detected object.
● A color is assigned based on the cls_pred (class prediction).
● The coordinates of the bounding box (x1, y1, x2, y2) are rescaled to match the original dimensions of the image (ori_h, ori_w). This is necessary because object detection is often performed on resized images, and these coordinates need to be adjusted to the original image size.
● A rectangle is drawn on the image using cv2.rectangle. This rectangle represents the detected bounding box. The color variable determines the color of the rectangle.
● A label is added to the image using cv2.putText. This label represents the class of the detected object. The label is placed just above the detected object's bounding box, and the color variable determines the color of the label.
DPU Inference Results from Kria KV260 Board: Object Detection on Example Image
Git Repo (again): https://github.com/LogicTronix/Vitis-AI-Reference-Tutorials/tree/main/Quantizing-Compiling-YOLOv3-Pytorch-with-DPU-Inference
Even though we are now talking about Yolov7 and Yolov8, understanding the complete flow of "Quantizing, Compiling and DPU-Board Inference" can help you port newer Yolo versions or other CNNs to an MPSoC board (or Kria KV260) with Vitis AI.
Kudos to our team member, Jinu Nyachhyon, for creating this tutorial!
Thanks to our Machine Learning Lead, Dikesh Shakya Banda for planning this in-depth tutorial!
For any queries, you can also raise Git issues or you can contact us at: info@logictronix.com.