ThirdEye - Eyes for those who cannot see

Team FutureVision: Saketh Gurram, Ansh Saraiya, Charvikrama K S

Created July 28, 2024

A quick, accurate and real-time voice-assisted object detection solution for people who cannot see using Gemini and Vitis AI.

Things used in this project

Hardware components

AMD Kria™ KR260 Robotics Starter Kit

Raspberry Pi USB Plug and Play Desktop Microphone

M5Stack ESP32 Camera Module Development Board

Software apps and online services

Snappy Ubuntu Core

Google Gemini API

Story

The most difficult part of this project, as every member of Team FutureVision will agree, was deciding how our work could bring a change to our own lives and, more importantly, to the lives of those around us. One of us then pointed out that AI and robotics are of little use if they are not helpful and available to all. With that social impact in mind, we came up with the project "ThirdEye".

As of 2020, there were 43 million visually impaired people worldwide. But what if they could use their other senses to "see"? Just as prosthetics have become common replacements for absent limbs, we hope to realize a similar concept using glasses fitted with a camera, a microphone and earphones.

Given our budget and time constraints, we chose an ESP32 camera as the portable camera because it is smaller than other low-cost alternatives. The feed from the camera is picked up using the OpenCV library and sent to the Kria KR260 FPGA board. On the board, we used the Gemini API, together with the speech_recognition library, to implement an object and voice detection algorithm.
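The listing further down opens a local capture device with cv2.VideoCapture(0). As a minimal sketch of how a network feed from the ESP32 camera could be read instead, the example below builds a stream URL and grabs a single frame with OpenCV. It assumes the camera firmware serves an MJPEG stream over HTTP; port 81 and the /stream path are assumptions borrowed from the common ESP32 CameraWebServer example and may not match the M5Stack firmware. The helper names build_stream_url and grab_frame are hypothetical.

```python
def build_stream_url(host: str, port: int = 81, path: str = "/stream") -> str:
    """Return the HTTP URL of the camera's MJPEG stream (assumed layout)."""
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}{path}"


def grab_frame(url: str):
    """Open the stream with OpenCV and return one frame, or None on failure."""
    import cv2  # imported lazily so the URL helper works without OpenCV

    cap = cv2.VideoCapture(url)
    try:
        ok, frame = cap.read()
        return frame if ok else None
    finally:
        # Always release the stream, even if reading fails
        cap.release()
```

Calling grab_frame(build_stream_url("192.168.4.1")) would then return one frame as a NumPy array, where 192.168.4.1 is only a placeholder for the camera's actual address.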

Figure 1

Figure 1 shows the Kria KR260 in action with the following connections:

Ethernet
Microphone
Web Camera
Display Port
Power Source

The algorithm runs in a loop, listening for a keyword spoken in a sentence; in our case the keyword is "product". Upon detecting the keyword, the OpenCV library captures a single frame and the object detection model is run on the image. A detailed report about the product in the image is then read out to the user through the headphones, giving them a gist of the obstacles in front of them.

Once the report has been read out, the loop resumes and waits for the keyword again.
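The keyword check in the listing below is a plain substring test, so a sentence containing "production" would also trigger a capture. As a minimal sketch of a stricter check, the hypothetical helper contains_keyword matches the keyword only as a whole word, ignoring case. It is an illustration of one possible refinement, not part of the project's code.

```python
import re


def contains_keyword(transcript: str, keyword: str = "product") -> bool:
    """Return True if `keyword` occurs in `transcript` as a whole word."""
    # \b anchors the match at word boundaries; re.escape keeps the
    # keyword literal even if it contains regex metacharacters.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, transcript, flags=re.IGNORECASE) is not None
```

With this helper, "What Product is this?" triggers a capture while "the production line" does not. Note that the plural "products" is also rejected, which may or may not be the desired behavior.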

For a demonstration of the project, kindly visit the following URL.

Third Eye Gemini API

import cv2
import google.generativeai as genai
import os
import pyttsx3
import speech_recognition as sr

# Configure the Google Generative AI API.
# Read the key from the environment rather than hardcoding it in the source.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")

engine = pyttsx3.init()

# Set properties for the speech
engine.setProperty('rate', 180)    # Speed of speech
engine.setProperty('volume', 1)    # Volume level (0.0 to 1.0)

# Check if espeak is available and set properties
if 'espeak' in engine.getProperty('voices')[0].id:
    engine.setProperty('voice', 'en-us')
    engine.setProperty('pitch', 70)  # Set pitch (0-100)

# Function to capture a frame and save it as an image
def capture_frame():
    ret, frame = cap.read()
    if ret:
        image_path = "captured_frame.jpg"
        cv2.imwrite(image_path, frame)
        return image_path
    return None

# Function to upload an image and get the response from the AI model
def identify_object(image_path):
    sample_file = genai.upload_file(path=image_path, display_name="Captured Frame")
    response = model.generate_content([sample_file, "what product is this and where can I buy it"])
    return response.text

# Initialize webcam
cap = cv2.VideoCapture(0)

# Initialize the recognizer
recognizer = sr.Recognizer()

# Function to listen for the keyword and capture a frame
def listen_for_keyword():
    with sr.Microphone() as source:
        print("Listening for the keyword 'product'...")
        recognizer.adjust_for_ambient_noise(source)
        while True:
            try:
                audio = recognizer.listen(source)
                speech_text = recognizer.recognize_google(audio)
                print("You said: " + speech_text)
                if "product" in speech_text.lower():
                    image_path = capture_frame()
                    if image_path:
                        result = identify_object(image_path)
                        print("Identified Object:", result)
                        sentences = result.split('. ')
                        for sentence in sentences:
                            engine.say(sentence)
                            engine.runAndWait()
            except sr.UnknownValueError:
                print("Could not understand audio")
            except sr.RequestError as e:
                print("Could not request results; {0}".format(e))

# Start listening for the keyword; stop with Ctrl+C
try:
    listen_for_keyword()
except KeyboardInterrupt:
    print("Stopping.")
finally:
    # Release the capture device on exit
    cap.release()
    cv2.destroyAllWindows()
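The listing splits the model's reply on '. ' before handing each piece to the speech engine, so a reply that ends its sentences with '!' or '?', or separates them with line breaks, is spoken as one long chunk. As a minimal sketch of a more tolerant approach, the hypothetical helper split_sentences breaks the text after any sentence-ending punctuation that is followed by whitespace. It is a simple heuristic and will still split after abbreviations such as "e.g.".

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    # Split on whitespace that directly follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces, e.g. when the input is an empty string.
    return [part for part in parts if part]
```

Each returned sentence could then be passed to engine.say() in place of the pieces produced by result.split('. ').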
