Modern technology can recognize objects and living beings in real time, and this ability no longer surprises anyone.
But how does it actually work?
In this article, I will explain the basics of object detection using the Raspberry Pi AI Camera as an example.
As usual, this article is also available on YouTube:
Three Methods of Object Detection
Object detection is a part of computer vision: it identifies and locates objects in images and video.
There are three main detection methods.
- Region-Based: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN
- Single-Shot: YOLO (all versions), SSD, EfficientDet
- Transformer-Based: DETR, Deformable DETR
Let's briefly discuss each of them. As an example, we will run detection on the following image:
Region-Based Object Detection Method
The region-based process is also called the sliding window method.
It checks content inside a small part of the image and then shifts the window to the next position.
If a part of an object (or a whole object) is inside the window, the algorithm marks this area and then moves to the next position.
If there are several object parts (or features) in the window, it marks each of them.
It checks the whole image step by step.
When the image is processed, the algorithm counts the marked parts and groups them into objects.
As a result, there will be boxes with classes of detected objects.
In real life, the sliding window will be smaller and provide more precise detections, like this:
This method is highly accurate and works well in most cases.
On the flip side, it's a demanding process: it does a lot of computations and has a relatively slow processing speed.
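To make the idea more concrete, here is a minimal Python sketch of a sliding-window scan. The classify_window function is a hypothetical placeholder for whatever classifier scores each crop:

import numpy as np

def classify_window(crop):
    # Hypothetical classifier: in a real pipeline this would be a CNN
    # returning a (class_name, score) pair for the given crop.
    return "car", float(np.random.rand())

def sliding_window_detect(image, window=64, stride=32, threshold=0.8):
    detections = []
    height, width = image.shape[:2]
    # Shift the window across the whole image, step by step.
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            label, score = classify_window(crop)
            if score >= threshold:
                # Mark this area: the box is (x, y, width, height).
                detections.append((label, score, (x, y, window, window)))
    # A real detector would then group the marked parts into objects.
    return detections

print(len(sliding_window_detect(np.zeros((480, 640, 3), dtype=np.uint8))))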
Single-Shot Object Detection Method
In this detection process, the image is divided into a grid, and all the cells are processed simultaneously.
A popular model of this type is YOLO, which stands for:
- You
- Only
- Look
- Once
And it's literally what happens under the hood. The image is processed in one iteration.
As a result, there will be boxes with classes of detected objects.
The method is quick and less demanding, which makes it more suitable for edge devices like the Raspberry Pi.
However, this picture isn't entirely accurate.
If the objects overlap too heavily, some cars may be missed or grouped into one. Like this:
Or this:
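To illustrate why heavy overlap is a problem for grid-based detectors, here is a toy sketch (not how YOLO is actually implemented): each object's center is assigned to a grid cell, and when two centers land in the same cell, one of them is lost.

def assign_to_grid(centers, image_size=(640, 480), grid=7):
    # Toy illustration: one detection slot per grid cell.
    cell_w = image_size[0] / grid
    cell_h = image_size[1] / grid
    cells = {}
    for cx, cy in centers:
        key = (int(cx // cell_w), int(cy // cell_h))
        # Two object centers in the same cell: the second overwrites the first,
        # so one of the overlapping objects is effectively missed.
        cells[key] = (cx, cy)
    return cells

# Two overlapping cars share a cell, so only two of the three objects survive.
print(len(assign_to_grid([(100, 100), (110, 105), (500, 300)])))  # prints 2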
Transformer-Based Object Detection Method
This type of network extracts far more information from a picture than the previous two. It captures complex spatial relationships and the global context of the image, then wraps the results in bounding boxes and returns them.
DETR is one of the most precise object detection methods because it eliminates the need for traditional anchor boxes and region proposals, directly predicting bounding boxes and classes using transformers. Its use of attention mechanisms allows it to understand global context, leading to better performance on complex scenes. Additionally, DETR can handle object relationships and occlusions more effectively, resulting in higher precision compared to traditional methods.
However, like anything, this comes at a price: it is the most resource-demanding method, so its use for real-time object detection on edge devices like the Raspberry Pi can be limited.
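If you want to try DETR yourself, a rough sketch using the Hugging Face transformers library on a desktop machine (not on the IMX500 camera itself) could look like this; the image path is just a placeholder:

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

image = Image.open("street.jpg")  # placeholder test image

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw transformer outputs into boxes, labels, and scores.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())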
Which model should be used on the Raspberry Pi AI Camera?
There is no universal network that is best for everything. Each of these models is suited to different situations and tasks.
We have a variety of models:
- R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN.... - Region-Based
- YOLO (all versions), SSD, EfficientDet, NanoDet - Single-Shot
- DETR, Deformable DETR - Transformer-Based
For the Raspberry Pi AI Camera, the most suitable approach is using single-shot models like YOLO (all versions), SSD, EfficientDet, and NanoDet. Sony suggests this as well.
All these models are available on their GitHub:
https://github.com/raspberrypi/imx500-models
I've prepared a small example repository that has a Python script and some installation instructions:
https://github.com/Nerdy-Things/raspberry-pi-ai-camera-object-detection-zero-hero
To make it work, we need to clone the repository and download the detection models:
git clone https://github.com/Nerdy-Things/raspberry-pi-ai-camera-object-detection-zero-hero.git
cd raspberry-pi-ai-camera-object-detection-zero-hero
git submodule init && git submodule update
The repository contains the following files:
The file ./install.sh installs all the required dependencies. The most important ones are:
sudo apt install -y imx500-all
sudo apt install -y python3-opencv python3-munkres
The script recognition.py mainly contains an example from the official Raspberry Pi repository. The important part of the code is at the bottom of the file:
if __name__ == "__main__":
    model = "./imx500-models/imx500_network_yolov8n_pp.rpk"
    # model = "./imx500-models/imx500_network_ssd_mobilenetv2_fpnlite_320x320_pp.rpk"
    # model = "./imx500-models/imx500_network_efficientdet_lite0_pp.rpk"
    # model = "./imx500-models/imx500_network_nanodet_plus_416x416.rpk"
    # model = "./imx500-models/imx500_network_nanodet_plus_416x416_pp.rpk"

    # This must be called before instantiation of Picamera2
    imx500 = IMX500(model)
    intrinsics = imx500.network_intrinsics
    if not intrinsics:
        intrinsics = NetworkIntrinsics()
        intrinsics.task = "object detection"
    elif intrinsics.task != "object detection":
        print("Network is not an object detection task", file=sys.stderr)
        exit()

    # Defaults
    if intrinsics.labels is None:
        with open("assets/coco_labels.txt", "r") as f:
            intrinsics.labels = f.read().splitlines()
    intrinsics.update_with_defaults()

    picam2 = Picamera2(imx500.camera_num)
    config = picam2.create_preview_configuration(
        controls={},
        buffer_count=12
    )

    imx500.show_network_fw_progress_bar()
    picam2.start(config, show_preview=False)

    if intrinsics.preserve_aspect_ratio:
        imx500.set_auto_aspect_ratio()

    last_results = None
    picam2.pre_callback = draw_detections
    labels = get_labels()

    while True:
        metadata = picam2.capture_metadata()
        last_results = parse_detections(metadata)
        if len(last_results) > 0:
            for result in last_results:
                label = f"{int(result.category)} {labels[int(result.category)]} ({result.conf:.2f})"
                print(f"Detected {label}")
Let's break it down.
The script starts with a model definition:
model = "./imx500-models/imx500_network_yolov8n_pp.rpk"
# model = "./imx500-models/imx500_network_ssd_mobilenetv2_fpnlite_320x320_pp.rpk"
# model = "./imx500-models/imx500_network_efficientdet_lite0_pp.rpk"
# model = "./imx500-models/imx500_network_nanodet_plus_416x416.rpk"
# model = "./imx500-models/imx500_network_nanodet_plus_416x416_pp.rpk"
As you can see, we are already familiar with those names.
Then, I create an object for a specific device and pass the model as an argument. In my case, the device is the Sony IMX500.
imx500 = IMX500(model)
The next few lines define the network settings (intrinsics).
intrinsics = imx500.network_intrinsics
if not intrinsics:
    intrinsics = NetworkIntrinsics()
    intrinsics.task = "object detection"
elif intrinsics.task != "object detection":
    print("Network is not an object detection task", file=sys.stderr)
    exit()

# Defaults
if intrinsics.labels is None:
    with open("assets/coco_labels.txt", "r") as f:
        intrinsics.labels = f.read().splitlines()
intrinsics.update_with_defaults()
Then, I create a camera object and a preview configuration. In controls={}, I can tune some camera controls, but it can stay empty for now.
picam2 = Picamera2(imx500.camera_num)
config = picam2.create_preview_configuration(
    controls={},
    buffer_count=12
)
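If you do want to tune something, the controls dictionary accepts standard Picamera2 controls. For example, capping the frame rate might look like this (30 fps is just an arbitrary value):

config = picam2.create_preview_configuration(
    controls={"FrameRate": 30},  # example; any supported Picamera2 control can go here
    buffer_count=12
)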
The Python script uploads the model to the camera. This line shows the progress bar when it happens.
imx500.show_network_fw_progress_bar()
The following line starts the recognition process.
picam2.start(config, show_preview=False)
From that point on, the AI camera will capture and recognize images while the script is running.
The pre_callback hook is used to draw bounding boxes on the picture from the camera.
picam2.pre_callback = draw_detections
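The actual draw_detections function lives in the repository, but a simplified sketch of what such a pre-callback can look like (based on the picamera2 examples, assuming result.box holds x, y, width, height) is:

import cv2
from picamera2 import MappedArray

def draw_detections(request):
    # Simplified sketch; the real function in the repo also handles
    # coordinate scaling and other details.
    if last_results is None:
        return
    with MappedArray(request, "main") as m:
        for result in last_results:
            x, y, w, h = result.box  # assumed (x, y, width, height)
            label = f"{labels[int(result.category)]} ({result.conf:.2f})"
            cv2.rectangle(m.array, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(m.array, label, (x + 5, y + 15),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)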
Finally, we’ve reached the most essential part.
In this infinite loop, I receive the recognition results.
while True:
    metadata = picam2.capture_metadata()
    last_results = parse_detections(metadata)
The capture_metadata method retrieves the camera's current state, providing a large JSON with many parameters. Here's an example of what we receive from it.
Then, I feed this metadata to the parse_detections method, which returns a simplified JSON:
[
    {
        "box": [34, 0, 605, 472],
        "category": 0.0,
        "conf": 0.8515625
    }
]
As you can see, it’s an array of objects, each representing a bounding box.
The box field indicates where the object is located in the image.
The category field specifies what the object is; for instance, in labels.json, zero (or the first position) corresponds to a person.
And that's exactly who I am: a person 😁
Lastly, the confidence (conf) field shows how confident the model is about the detection, with a value between zero and one.
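Since conf ranges from zero to one, it is often useful to drop weak detections before acting on them. A tiny example inside the loop (0.5 is an arbitrary threshold):

# Keep only detections the model is reasonably sure about.
CONF_THRESHOLD = 0.5
confident = [r for r in last_results if r.conf > CONF_THRESHOLD]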
The final lines in the script print the detection results to the console.
if len(last_results) > 0:
    for result in last_results:
        label = f"{int(result.category)} {labels[int(result.category)]} ({result.conf:.2f})"
        print(f"Detected {label}")
That’s it! With this understanding of how it works and how to use it, anyone can build their project with an AI camera.
Thanks for reading, and have a wonderful day!