What is your project about?
Voice of the Eye: Vision-O-Bot is a smart assistive device designed specifically to improve the daily life of visually impaired people. It consists of a high-resolution camera embedded in the frame of a pair of glasses and combines object detection, an AI assistant, text recognition, and text-to-speech (TTS). The system allows users to locate obstacles, read text in regional languages, and hold natural conversations with an AI assistant. It is designed to improve safety, independence, and overall quality of life by addressing key challenges that visually impaired people face in their daily activities.
Why did you decide to make it?
Existing solutions such as guide sticks are useful, but they have serious limitations in meeting the needs of visually impaired individuals in areas such as obstacle detection, text reading, and AI interaction. This is why we designed Vision-O-Bot as a more comprehensive and proactive solution that goes beyond traditional aids. Our goal is technological empowerment: greater independence and safety for users, so they can move through their environment with more confidence and assurance. This approach attempts to bridge a gap that existing assistive devices cannot, and thereby to significantly enhance the user's experience.
How does it work?
Real-time Object Detection with ESP32 CAM and YOLO
The real-time object detection system uses the YOLO (You Only Look Once) model, loaded from pre-trained weights, to identify and track objects in the video stream produced by an ESP32 CAM module. Each frame is acquired over the camera's URL stream, processed by YOLO to detect objects, and annotated with bounding boxes, class names, and detection confidence scores. Detected object names are then converted into audible alerts with the Google Text-to-Speech (gTTS) library. The distance of each detected object from the camera is estimated from its known real-world width and calibration values, and the measured distance is spoken aloud as real-time feedback. If an object comes too close to the camera, the system warns the user. This detection and feedback loop continues until the user stops the system.
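To make the loop concrete, here is a minimal Python sketch of how such a pipeline could be wired up. It is not the project's exact code: the ultralytics YOLO package stands in for whichever YOLO weights are actually loaded, and the stream URL, focal length, and known object widths are placeholder calibration values; playsound is just one possible way to play the gTTS audio.

```python
# Minimal sketch of the detect-measure-announce loop (assumed values marked).
import cv2
from gtts import gTTS
from playsound import playsound
from ultralytics import YOLO

STREAM_URL = "http://192.168.1.50:81/stream"       # assumed ESP32-CAM endpoint
FOCAL_LENGTH_PX = 615.0                            # from a one-time calibration
KNOWN_WIDTH_CM = {"person": 45.0, "chair": 50.0}   # example real-world widths
ALERT_DISTANCE_CM = 100.0

def speak(message: str, path: str = "alert.mp3") -> None:
    """Convert a text alert to speech with gTTS and play it."""
    gTTS(text=message, lang="en").save(path)
    playsound(path)

model = YOLO("yolov8n.pt")          # pre-trained weights
cap = cv2.VideoCapture(STREAM_URL)  # frames arrive over the URL stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Run detection; each box carries a class id and a confidence score.
    for box in model(frame)[0].boxes:
        name = model.names[int(box.cls)]
        conf = float(box.conf)
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{name} {conf:.2f}", (x1, y1 - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        # distance = (known width * focal length) / apparent pixel width;
        # a real system would throttle these announcements between frames.
        if name in KNOWN_WIDTH_CM and x2 > x1:
            distance = KNOWN_WIDTH_CM[name] * FOCAL_LENGTH_PX / (x2 - x1)
            speak(f"{name} at {distance:.0f} centimeters")
            if distance < ALERT_DISTANCE_CM:
                speak(f"Warning, {name} is very close")
    cv2.imshow("Vision-O-Bot", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # loop runs until the user quits
        break

cap.release()
cv2.destroyAllWindows()
```

The distance formula only needs a one-time calibration: photograph an object of known width at a known distance and solve for the focal length in pixels.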
Text Recognition
The text recognition system combines several models and techniques to recognize different types of text in images from the ESP32 CAM. The system captures frames from the video stream and first processes them with OCR to detect printed text. If OCR finds no text, it falls back to a pre-trained CNN model for synthetic text; handwritten text is handled by another CNN model, and digits are recognized by a dedicated model trained on the handwritten digits of the MNIST dataset. Each recognized text type, whether printed, synthetic, handwritten, or digits, is read out to the user as audio. This layered approach provides reliable recognition for different forms of text and presents the result both visually and audibly.
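The fallback chain could look something like the sketch below, assuming the same ESP32 CAM stream URL as in the detection sketch. pytesseract stands in for the OCR stage, the model filename is hypothetical, and only the digit fallback is shown in full; the synthetic-text and handwriting CNN stages would follow the same load/preprocess/predict pattern.

```python
# Condensed sketch of the OCR-first, CNN-fallback recognition chain.
import cv2
import pytesseract
from gtts import gTTS
from playsound import playsound
from tensorflow.keras.models import load_model

digit_model = load_model("mnist_digits_cnn.h5")  # hypothetical filename

def speak(message: str, path: str = "speech.mp3") -> None:
    """Read recognized text aloud with gTTS."""
    gTTS(text=message, lang="en").save(path)
    playsound(path)

def read_digit(gray) -> str:
    """MNIST-style fallback: scale to 28x28 and classify the digit."""
    img = cv2.resize(gray, (28, 28)).astype("float32") / 255.0
    probs = digit_model.predict(img.reshape(1, 28, 28, 1), verbose=0)
    return str(int(probs.argmax()))

def recognize(frame) -> str:
    """Try printed-text OCR first, then fall back to the CNN recognizers."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    printed = pytesseract.image_to_string(gray).strip()
    if printed:
        return printed
    # The synthetic-text and handwriting CNNs would slot in here, each
    # tried in turn with the same pattern as the digit model below.
    return read_digit(gray)

cap = cv2.VideoCapture("http://192.168.1.50:81/stream")  # assumed URL
ok, frame = cap.read()
cap.release()
if ok:
    text = recognize(frame)
    if text:
        speak(text)
```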
Conversational AI Assistant
The Conversational AI Assistant combines speech recognition, text-to-speech, and generative AI to interact with the user. It listens for voice commands through a microphone and transcribes the user's speech with the speech_recognition library. The transcribed question is then sent to Google's Gemini API, which is configured with safety settings and generation parameters to keep the output appropriate and relevant. The generated response is converted to speech with the gTTS library so the user can hear it. The conversation ends when the keyword "goodbye" is heard, which keeps the interface smooth and user-friendly. The assistant can be applied in a wide range of scenarios while keeping its responses intelligent and engaging.
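A minimal sketch of such a voice loop is shown below. It assumes the google-generativeai and speech_recognition packages, a GEMINI_API_KEY environment variable, and example model-name and generation parameters; the project's actual safety settings would be passed through the same generate_content call.

```python
# Minimal sketch of the listen-generate-speak loop with a "goodbye" stop word.
import os
import google.generativeai as genai
import speech_recognition as sr
from gtts import gTTS
from playsound import playsound

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed env var
model = genai.GenerativeModel("gemini-1.5-flash")      # assumed model name
recognizer = sr.Recognizer()

def speak(message: str, path: str = "reply.mp3") -> None:
    """Turn the generated reply into audible speech with gTTS."""
    gTTS(text=message, lang="en").save(path)
    playsound(path)

while True:
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        question = recognizer.recognize_google(audio)  # transcribe speech
    except sr.UnknownValueError:
        continue                                       # nothing intelligible
    if "goodbye" in question.lower():
        speak("Goodbye!")                              # stop keyword
        break
    reply = model.generate_content(
        question,
        generation_config={"temperature": 0.7,         # example parameters
                           "max_output_tokens": 200},
    )
    speak(reply.text)
```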
Videos and Images:
Video URL: https://drive.google.com/drive/folders/16XA6vSIgE9SoAUZB_anERhibN0fRy7ff?usp=sharing