Saketh Gurram, Ansh Saraiya, Charvikrama K S
Created July 28, 2024

ThirdEye - Eyes for those who cannot see

A fast, accurate, real-time voice-assisted object detection solution for people who cannot see, built with Gemini and Vitis AI.

Things used in this project

Hardware components

AMD Kria™ KR260 Robotics Starter Kit
Raspberry Pi USB Plug and Play Desktop Microphone
ESP32 Camera Module Development Board
M5Stack ESP32 Camera Module Development Board

Software apps and online services

Snappy Ubuntu Core
Google Gemini API


Third Eye Gemini API

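The attached script is a Python program that runs the project. Install the following dependencies before running it:
1. pyttsx3
2. espeak (system package, used by pyttsx3 for speech output on Linux)
3. google-generativeai
4. opencv-python (imported in the script as cv2)
5. SpeechRecognition (imported in the script as speech_recognition)
6. PyAudio (needed by SpeechRecognition for microphone input)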
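
Before running the script, it can help to confirm that the Python modules it imports are available. The following is a minimal sketch, not part of the original project: the helper name missing_modules and the REQUIRED_MODULES list are hypothetical, and the module names are taken from the script's import statements. It does not check for espeak, which is a system package rather than a Python module.

```python
import importlib.util

# Import names used by the script. These can differ from the pip package
# names (for example, opencv-python is imported as cv2).
REQUIRED_MODULES = ["cv2", "google.generativeai", "pyttsx3", "speech_recognition"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (such as "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

If the check reports missing modules, install the corresponding packages from the list above and run it again.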

Run the script using python3 "file_name".
import cv2
import google.generativeai as genai
import os
import pyttsx3
import speech_recognition as sr

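# Configure the Google Generative AI API
# Assumption: the API key is stored in the GOOGLE_API_KEY environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])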

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")

engine = pyttsx3.init()

# Set properties for the speech
engine.setProperty('rate', 180)    # Speed of speech
engine.setProperty('volume', 1)    # Volume level (0.0 to 1.0)

# Check if espeak is available and set properties
if 'espeak' in engine.getProperty('voices')[0].id:
    engine.setProperty('voice', 'en-us')
    engine.setProperty('pitch', 70)  # Set pitch (0-100)

# Function to capture a frame and save it as an image
def capture_frame():
    ret, frame = cap.read()
    if ret:
        image_path = "captured_frame.jpg"
        cv2.imwrite(image_path, frame)
        return image_path
    return None

# Function to upload an image and get the response from the AI model
def identify_object(image_path):
    sample_file = genai.upload_file(path=image_path, display_name="Captured Frame")
    response = model.generate_content([sample_file, "what product is this and where can I buy it"])
    return response.text

# Initialize webcam
cap = cv2.VideoCapture(0)

# Initialize the recognizer
recognizer = sr.Recognizer()

# Function to listen for the keyword and capture a frame
def listen_for_keyword():
    with sr.Microphone() as source:
        print("Listening for the keyword 'product'...")
        while True:
            try:
                audio = recognizer.listen(source)
                speech_text = recognizer.recognize_google(audio)
                print("You said: " + speech_text)
                if "product" in speech_text.lower():
                    image_path = capture_frame()
                    if image_path:
                        result = identify_object(image_path)
                        print("Identified Object:", result)
                        # Queue the response sentence by sentence, then speak it
                        sentences = result.split('. ')
                        for sentence in sentences:
                            engine.say(sentence)
                        engine.runAndWait()
            except sr.UnknownValueError:
                print("Could not understand audio")
            except sr.RequestError as e:
                print("Could not request results; {0}".format(e))

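# Start listening for the keyword (press Ctrl+C to stop)
try:
    listen_for_keyword()
except KeyboardInterrupt:
    print("Stopping...")
finally:
    # When everything is done, release the capture
    cap.release()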
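
The keyword check and the sentence splitting used in listen_for_keyword() can be tried out without a microphone, camera, or API key. The following is a minimal sketch, not the authors' code: the helper names contains_keyword and split_sentences are hypothetical. It assumes whole-word matching is wanted, which is stricter than the script's substring test (the script would also trigger on "production" or "products"), and it splits on '.', '!' and '?' rather than only on '. '.

```python
import re


def contains_keyword(speech_text, keyword="product"):
    """Return True if `keyword` occurs as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, speech_text, flags=re.IGNORECASE) is not None


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace; drop empties."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    print(contains_keyword("What PRODUCT is this"))
    print(split_sentences("This is a mug. You can buy it online!"))
```

Keeping these two steps as plain functions makes it possible to adjust the trigger word or the splitting rule and test the change in isolation before running the full voice loop.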


