Scene perception during indoor activities is a difficult problem for visually impaired people. Technology that provides hints and target identification can be of great help.
To address this, we propose a solution: use computer vision to identify discrete objects in an indoor scene and announce them to the user with voice prompts. From these prompts, the user can build a perception of the indoor scene and locate a target object. For example, if a knife and fork are detected, the user is likely in a dining room or kitchen. The reported coordinates give the user the approximate location of the target object so they can take further action.
The implementation consists of three modules: the first is the object recognition module, which detects and classifies targets in the scene through the camera; the second is the analysis module, which analyzes the recognition results and generates a report; and the third is the voice module, which reads the results to the user through voice prompts.
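To make the scene-inference idea concrete, a minimal sketch of an object-to-room lookup could look like the following; the table entries are hypothetical examples, not part of the firmware described below.

#include <string.h>

// Hypothetical example: map a detected object class to a likely room,
// so the voice prompt can hint at the scene, not just the object.
const char *roomForObject(const char *object)
{
  if (strcmp(object, "fork") == 0 || strcmp(object, "knife") == 0)
    return "kitchen or dining room";
  if (strcmp(object, "toothbrush") == 0)
    return "bathroom";
  if (strcmp(object, "bed") == 0)
    return "bedroom";
  return "unknown room";
}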
1) Grove Vision AI V2
It is an MCU-based vision AI module powered by Arm Cortex-M55 & Ethos-U55. It supports TensorFlow and PyTorch frameworks and is compatible with Arduino IDE. With the SenseCraft AI algorithm platform, trained ML models can be deployed to the sensor without the need for coding. It features a standard CSI interface, an onboard digital microphone and an SD card slot, making it highly suitable for various embedded AI vision projects.
2) OV5647 FOV Camera
This camera module uses a fisheye lens to achieve a 62° field of view and an OV5647 sensor with a 2592 x 1944 active pixel array, and it supports the Raspberry Pi 3B+/4B.
3) XIAO ESP32S3 module
The Seeed Studio XIAO series are diminutive development boards that share a similar hardware structure and are literally thumb-sized. The code name "XIAO" represents half of its character, "tiny"; the other half is "puissant". The Seeed Studio XIAO ESP32S3 Sense integrates a camera sensor, a digital microphone, and SD card support. Combining embedded ML computing power with photography capability, this development board is a great tool for getting started with intelligent voice and vision AI.
4) Gravity Text-to-Speech module
The Text-to-Speech module enables easy integration of voice functionality into various projects. It supports both Chinese and English, providing clear and natural pronunciation. The module can also broadcast the current time and environmental data, and when combined with a speech recognition module, it enables conversational interactions with your projects. The module uses I2C and UART communication modes with a Gravity interface, ensuring compatibility with most mainstream controllers such as Arduino, micro:bit, FireBeetle series, LattePanda, and Raspberry Pi. It also includes a built-in speaker, eliminating the need for an external one. Primary applications include robot voice, voice broadcast, voice prompts, and text reading.
1) Step 1: Select Model and Deploy
Luckily, we can find many pre-trained models in the SenseCraft AI Model Assistant.
The Fork Detection model, based on the Swift-YOLO algorithm, is designed for the Seeed Studio Grove Vision AI (V2) device to accurately detect and recognize forks. Focused on fork recognition, it can accurately detect and tag forks in real-time video streams, and it offers high compatibility and stability on the Grove Vision AI (V2). Deployment is extremely simple, requiring only a few straightforward steps with no complex configuration or debugging. The model suits multiple application scenarios, including dining management, retail, kitchen organization, and hospitality. For instance, in dining management it can monitor fork inventory and cleanliness, ensuring proper table setup and guest satisfaction; in retail, it can optimize fork display and stock levels. Evaluation metrics: mAP50: 0.53981, mAP50-95: 0.33936.
Connect the Grove Vision AI V2 board and download this model to it.
2) Step 2: Connect Analysis Module
Connect the Grove Vision AI (WE2) module to the default I2C interface of your Arduino board (the XIAO ESP32S3 module) using the 4-pin cable. Make sure each wire goes to the correct pin; a quick I2C scan (see the sketch after the pin list) can confirm the module is visible on the bus.
- SCL -> SCL (Grove Vision AI WE2)
- SDA -> SDA (Grove Vision AI WE2)
- VCC -> VCC (Grove Vision AI WE2, 3.3V)
- GND -> GND (Grove Vision AI WE2)
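The sketch below is a generic Arduino I2C scanner, not specific to this project; the address at which the Grove Vision AI V2 responds (often reported as 0x62) depends on its firmware, so treat the expected address as an assumption and check the Seeed documentation.

#include <Wire.h>

void setup()
{
  Wire.begin();
  Serial.begin(9600);
  Serial.println("Scanning I2C bus...");
  // Probe every 7-bit address and report the ones that acknowledge.
  for (uint8_t addr = 1; addr < 127; addr++)
  {
    Wire.beginTransmission(addr);
    if (Wire.endTransmission() == 0)
    {
      Serial.print("Device found at 0x");
      Serial.println(addr, HEX);
    }
  }
  Serial.println("Scan done.");
}

void loop() {}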
The XIAO ESP32S3 module receives messages from the Grove Vision AI V2 board over the I2C protocol, builds an analysis report, and sends it to the TTS module in real time.
3) Step 3: Connect TTS Module
Connect the XIAO ESP32S3 module to the Gravity Text-to-Speech module over UART or I2C (see the figure below; note the onboard switch that selects UART or I2C mode, and make sure its position matches your wiring). The analysis result can then be heard as synthesized speech.
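Before integrating everything, the TTS module can be tested on its own. Here is a minimal sketch, assuming the module's mode switch is set to I2C and the DFRobot_SpeechSynthesis library listed in the next section is installed; if the wiring and mode are correct, the test phrase should be spoken aloud.

#include "DFRobot_SpeechSynthesis.h"

DFRobot_SpeechSynthesis_I2C ss;

void setup()
{
  // Initialize the synthesizer over I2C (mode switch must be on I2C).
  ss.begin();
  ss.setVolume(5);
  ss.speak(F("Text to speech module ready"));
}

void loop() {}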
1) Receive Grove Vision AI V2 messages, analyze, and report
First, install the required libraries:
- Seeed_Arduino_SSCMA v1.0.0
- ArduinoJson v7.1.0
- DFRobot_SpeechSynthesis v1.0.1
Since we only care about high-confidence objects, we set the score threshold to 0.6; this parameter can be adjusted in practice. For each object above the threshold, the report includes the target class, the score, and the center point (x, y) of the bounding box. Continuous reporting would be too noisy, so a 3-second delay reduces the report frequency.
#include <Arduino.h>
#include <Seeed_Arduino_SSCMA.h>
#include "DFRobot_SpeechSynthesis.h"

// Only objects scoring above this threshold are reported; tune in practice.
#define THRESHOLD 0.6

DFRobot_SpeechSynthesis_I2C ss;
SSCMA AI;

void setup()
{
  // Initialize the Grove Vision AI V2 module.
  AI.begin();
  // Initialize the speech synthesis module.
  ss.begin();
  // Set speech volume.
  ss.setVolume(5);
  // Set speech speed.
  ss.setSpeed(5);
  // Use a female voice.
  ss.setSoundType(ss.eFemale1);
  // Set speech tone.
  ss.setTone(5);
  // Speak in English word mode.
  ss.setLanguage(ss.eEnglishl);
  // Initialize serial for debugging and monitoring.
  Serial.begin(9600);
  Serial.println("Object Detect App start!");
}

void loop()
{
  String analysis = "Detected ";
  int reported = 0;
  // Invoke inference once, without filtering, and fetch the image.
  if (!AI.invoke(1, false, true))
  {
    for (int i = 0; i < AI.boxes().size(); i++)
    {
      // Skip low-confidence detections.
      if (AI.boxes()[i].score > THRESHOLD)
      {
        analysis += String(reported++);
        analysis += String(" target is ");
        analysis += String(AI.boxes()[i].target);
        analysis += String(" score is ");
        analysis += String(AI.boxes()[i].score);
        analysis += String(" position is ");
        // Report the center point of the bounding box.
        analysis += String(AI.boxes()[i].x + AI.boxes()[i].w / 2);
        analysis += String(" ");
        analysis += String(AI.boxes()[i].y + AI.boxes()[i].h / 2);
      }
    }
    // Speak only when at least one object passed the threshold.
    if (reported > 0)
    {
      ss.speak(analysis.c_str());
    }
  }
  // Wait 3 seconds between reports to keep the voice output from
  // becoming too noisy.
  delay(3000);
}
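Raw pixel coordinates can be hard to interpret by ear. As a possible refinement (an assumption, not part of the code above), the center x could be translated into a coarse spoken direction before being appended to the report; frameWidth here is a hypothetical parameter standing for whatever image width the deployed model reports boxes in.

#include <Arduino.h>

// Hypothetical helper: convert a bounding-box center x into a spoken
// direction, which is easier to follow by ear than a raw pixel value.
String directionWord(int centerX, int frameWidth)
{
  if (centerX < frameWidth / 3)
    return String("on the left");
  if (centerX < 2 * frameWidth / 3)
    return String("in the middle");
  return String("on the right");
}

In the loop above, analysis += directionWord(AI.boxes()[i].x + AI.boxes()[i].w / 2, frameWidth); could then replace the raw coordinate strings.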