Modern technology can recognize objects and living beings in real time, and this ability no longer surprises anyone.
But how does it actually work?
In this article, I will explain the basics of object detection using the Raspberry Pi AI Camera as an example.
As usual, this article is also available on YouTube:
Three Methods of Object Detection
Object detection is a part of computer vision: it identifies and locates objects in images and video.
There are three main detection methods.
- Region-Based: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN
- Single-Shot: YOLO (all versions), SSD, EfficientDet
- Transformer-Based: DETR, Deformable DETR
Let's briefly discuss each of them. As an example, we will run detection on the following image:
Region-Based Object Detection Method
The region-based process is also called the sliding window method.
It checks content inside a small part of the image and then shifts the window to the next position.
If a part of an object (or a whole object) is inside the window, the algorithm marks this area and then moves to the next position.
If there are several object parts (or features) in the window, it marks each of them.
It checks the whole image step by step.
When the image is processed, the algorithm counts the marked parts and groups them into objects.
As a result, there will be boxes with classes of detected objects.
In real life, the sliding window will be smaller and provide more precise detections, like this:
This method is highly accurate and works well in most cases.
On the flip side, it's a demanding process: it does a lot of computations and has a relatively slow processing speed.
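To make the idea more concrete, here is a minimal Python sketch of a sliding-window scan. The classify_window function is a hypothetical placeholder for whatever classifier scores each crop:

import numpy as np

def classify_window(crop):
    # Hypothetical classifier: in a real pipeline this would be a CNN
    # returning a (class_name, score) pair for the given crop.
    return "car", float(np.random.rand())

def sliding_window_detect(image, window=64, stride=32, threshold=0.8):
    detections = []
    height, width = image.shape[:2]
    # Shift the window across the whole image, step by step.
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            label, score = classify_window(crop)
            if score >= threshold:
                # Mark this area: the box is (x, y, width, height).
                detections.append((label, score, (x, y, window, window)))
    # A real detector would then group the marked parts into objects.
    return detections

print(len(sliding_window_detect(np.zeros((480, 640, 3), dtype=np.uint8))))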
Single-Shot Object Detection Method
In this detection process, the image is divided into a grid, and all the cells are processed simultaneously.
A popular model of this type is YOLO, which stands for:
- You
- Only
- Look
- Once
And it's literally what happens under the hood. The image is processed in one iteration.
As a result, there will be boxes with classes of detected objects.
The method is quick and less demanding, which makes it more suitable for edge devices like the Raspberry Pi.
However, this picture isn't entirely accurate.
If the objects overlap too heavily, some cars may be missed or grouped into one. Like this:
Or this:
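To illustrate why heavy overlap is a problem for grid-based detectors, here is a toy sketch (not how YOLO is actually implemented): each object's center is assigned to a grid cell, and when two centers land in the same cell, one of them is lost.

def assign_to_grid(centers, image_size=(640, 480), grid=7):
    # Toy illustration: one detection slot per grid cell.
    cell_w = image_size[0] / grid
    cell_h = image_size[1] / grid
    cells = {}
    for cx, cy in centers:
        key = (int(cx // cell_w), int(cy // cell_h))
        # Two object centers in the same cell: the second overwrites the first,
        # so one of the overlapping objects is effectively missed.
        cells[key] = (cx, cy)
    return cells

# Two overlapping cars share a cell, so only two of the three objects survive.
print(len(assign_to_grid([(100, 100), (110, 105), (500, 300)])))  # prints 2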
Transformer-Based Object Detection Method
This type of network extracts far more information from a picture than the previous two. It captures complex spatial relationships and the global context of the image, then wraps the results in bounding boxes and returns them.
DETR is one of the most precise object detection methods because it eliminates the need for traditional anchor boxes and region proposals, directly predicting bounding boxes and classes using transformers. Its use of attention mechanisms allows it to understand global context, leading to better performance on complex scenes. Additionally, DETR can handle object relationships and occlusions more effectively, resulting in higher precision compared to traditional methods.
However, like anything, this comes at a price: it is the most resource-demanding method, so its use for real-time object detection on edge devices like the Raspberry Pi can be limited.
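If you want to try DETR yourself, a rough sketch using the Hugging Face transformers library on a desktop machine (not on the IMX500 camera itself) could look like this; the image path is just a placeholder:

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

image = Image.open("street.jpg")  # placeholder test image

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw transformer outputs into boxes, labels, and scores.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())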
Which model should be used on the Raspberry Pi AI Camera?
There is no universal network that is best for everything. Each of these models is suited to different situations and tasks.
We have a variety of models:
- R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN.... - Region-Based
- YOLO (all versions), SSD, EfficientDet, NanoDet - Single-Shot
- DETR, Deformable DETR - Transformer-Based
For the Raspberry Pi AI Camera, the most suitable approach is using single-shot models like YOLO (all versions), SSD, EfficientDet, and NanoDet. Sony suggests this as well.
All these models are available on their GitHub:
https://github.com/raspberrypi/imx500-models
I've prepared a small example repository that has a Python script and some installation instructions:
https://github.com/Nerdy-Things/raspberry-pi-ai-camera-object-detection-zero-hero
To make it work, we need to clone the repository and download the detection models:
git clone https://github.com/Nerdy-Things/raspberry-pi-ai-camera-object-detection-zero-hero.git
cd raspberry-pi-ai-camera-object-detection-zero-hero
git submodule init && git submodule update
The repository contains the following files:
The file ./install.sh installs all the required dependencies. The most important ones are:
sudo apt install -y imx500-all
sudo apt install -y python3-opencv python3-munkres
The script recognition.py mainly contains an example from the official Raspberry Pi repository. The important part of the code is at the bottom of the file:
if __name__ == "__main__":
    model = "./imx500-models/imx500_network_yolov8n_pp.rpk"
    # model = "./imx500-models/imx500_network_ssd_mobilenetv2_fpnlite_320x320_pp.rpk"
    # model = "./imx500-models/imx500_network_efficientdet_lite0_pp.rpk"
    # model = "./imx500-models/imx500_network_nanodet_plus_416x416.rpk"
    # model = "./imx500-models/imx500_network_nanodet_plus_416x416_pp.rpk"

    # This must be called before instantiation of Picamera2
    imx500 = IMX500(model)
    intrinsics = imx500.network_intrinsics
    if not intrinsics:
        intrinsics = NetworkIntrinsics()
        intrinsics.task = "object detection"
    elif intrinsics.task != "object detection":
        print("Network is not an object detection task", file=sys.stderr)
        exit()

    # Defaults
    if intrinsics.labels is None:
        with open("assets/coco_labels.txt", "r") as f:
            intrinsics.labels = f.read().splitlines()
    intrinsics.update_with_defaults()

    picam2 = Picamera2(imx500.camera_num)
    config = picam2.create_preview_configuration(
        controls={},
        buffer_count=12
    )

    imx500.show_network_fw_progress_bar()
    picam2.start(config, show_preview=False)

    if intrinsics.preserve_aspect_ratio:
        imx500.set_auto_aspect_ratio()

    last_results = None
    picam2.pre_callback = draw_detections
    labels = get_labels()

    while True:
        metadata = picam2.capture_metadata()
        last_results = parse_detections(metadata)
        if len(last_results) > 0:
            for result in last_results:
                label = f"{int(result.category)} {labels[int(result.category)]} ({result.conf:.2f})"
                print(f"Detected {label}")
Let's break it down.
The script starts with a model definition:
model = "./imx500-models/imx500_network_yolov8n_pp.rpk"
# model = "./imx500-models/imx500_network_ssd_mobilenetv2_fpnlite_320x320_pp.rpk"
# model = "./imx500-models/imx500_network_efficientdet_lite0_pp.rpk"
# model = "./imx500-models/imx500_network_nanodet_plus_416x416.rpk"
# model = "./imx500-models/imx500_network_nanodet_plus_416x416_pp.rpk"
As you can see, we are already familiar with those names.
Then, I create an object for a specific device and pass the model as an argument. In my case, the device is the Sony IMX500.
imx500 = IMX500(model)
The next few lines define the network settings (intrinsics).
intrinsics = imx500.network_intrinsics
if not intrinsics:
    intrinsics = NetworkIntrinsics()
    intrinsics.task = "object detection"
elif intrinsics.task != "object detection":
    print("Network is not an object detection task", file=sys.stderr)
    exit()

# Defaults
if intrinsics.labels is None:
    with open("assets/coco_labels.txt", "r") as f:
        intrinsics.labels = f.read().splitlines()
intrinsics.update_with_defaults()
Then, I create a camera object and a preview configuration. In controls={}, I can tune some camera controls, but it can stay empty for now.
picam2 = Picamera2(imx500.camera_num)
config = picam2.create_preview_configuration(
    controls={},
    buffer_count=12
)
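If you do want to tune something, the controls dictionary accepts standard Picamera2 controls. For example, capping the frame rate might look like this (30 fps is just an arbitrary value):

config = picam2.create_preview_configuration(
    controls={"FrameRate": 30},  # example; any supported Picamera2 control can go here
    buffer_count=12
)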
The Python script uploads the model to the camera. This line shows the progress bar when it happens.
imx500.show_network_fw_progress_bar()
The following line starts the recognition process.
picam2.start(config, show_preview=False)
From that point on, the AI camera will capture and recognize images while the script is running.
The pre_callback hook is used to draw bounding boxes on the picture from the camera.
picam2.pre_callback = draw_detections
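The actual draw_detections function lives in the repository, but a simplified sketch of what such a pre-callback can look like (based on the picamera2 examples, assuming result.box holds x, y, width, height) is:

import cv2
from picamera2 import MappedArray

def draw_detections(request):
    # Simplified sketch; the real function in the repo also handles
    # coordinate scaling and other details.
    if last_results is None:
        return
    with MappedArray(request, "main") as m:
        for result in last_results:
            x, y, w, h = result.box  # assumed (x, y, width, height)
            label = f"{labels[int(result.category)]} ({result.conf:.2f})"
            cv2.rectangle(m.array, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(m.array, label, (x + 5, y + 15),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)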
Finally, we’ve reached the most essential part.
In this infinite loop, I receive the recognition results.
while True:
    metadata = picam2.capture_metadata()
    last_results = parse_detections(metadata)
The capture_metadata method retrieves the camera's current state, providing a large JSON with many parameters. Here's an example of what we receive from it.
Then, I feed this metadata to the parse_detections method, which returns a simplified JSON:
[
    {
        "box": [34, 0, 605, 472],
        "category": 0.0,
        "conf": 0.8515625
    }
]
As you can see, it’s an array of objects, each representing a bounding box.
The box field indicates where the object is located in the image.
The category field specifies what the object is; for instance, in labels.json, zero (or the first position) corresponds to a person.
And that's exactly who I am: a person 😁
Lastly, the confidence (conf) field shows how confident the model is about the detection, with a value between zero and one.
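Since conf ranges from zero to one, it is often useful to drop weak detections before acting on them. A tiny example inside the loop (0.5 is an arbitrary threshold):

# Keep only detections the model is reasonably sure about.
CONF_THRESHOLD = 0.5
confident = [r for r in last_results if r.conf > CONF_THRESHOLD]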
The final lines in the script print the detection results to the console.
if len(last_results) > 0:
    for result in last_results:
        label = f"{int(result.category)} {labels[int(result.category)]} ({result.conf:.2f})"
        print(f"Detected {label}")
That’s it! With this understanding of how it works and how to use it, anyone can build their project with an AI camera.
Thanks for reading, and have a wonderful day!