E-Guide - A Helping Hand for the Visually Impaired
Picture yourself in a familiar environment: your home, your office, or any place filled with the things you use every day, such as your phone, your keys, or a cup of coffee. Now close your eyes and try to locate those same items. What starts as a game quickly becomes frustrating; every reach feels forced and random. This is a challenge that millions of visually impaired people face every single day. Welcome to E-Guide, a personal assistant built to give the visually impaired community an easier, more independent life. With E-Guide, you have an assistant that describes your surroundings and guides you to the objects you need.
Build2gether 2.0 Challenge
That is the idea behind the Build2gether 2.0 Inclusive Innovation Challenge, which focuses on successful innovation through cooperation. It is a special challenge where innovators from across the globe come together to address real societal problems with technology. Think of it as a game in which people do not just build projects, they build solutions that genuinely improve people's lives.
Inspired by Build2gether, I tried to shift the spotlight away from how my solution could be implemented and toward the people who would actually be using it. That perspective pushed me to build something that solves a real need, a firm step toward a world in which people can move through life without fear or hesitation.
Problem Identification
The problem I'm addressing with E-Guide is one that affects millions of people worldwide: the daily struggle of living with blindness. Everyday tasks can be mentally and physically exhausting for the visually impaired; it can take ages to find a misplaced object or simply to move from one part of a room to another. Conventional assistive devices such as canes and guide dogs help with mobility, but they do not help locate a specific object or describe the surroundings. As a result, visually impaired people can end up feeling entirely dependent on others, when accessibility should mean the ability to function on one's own. I saw this as a major challenge in need of a solution and set out to build one. The idea behind E-Guide grew from the need to give visually impaired individuals the means to engage more fully with the world around them. E-Guide combines voice recognition, object detection, and audio feedback to turn the inconvenience of not seeing into the advantage of knowing, preserving the user's independence.
Developing a Solution
Solution Overview: E-Guide is a wearable cap that looks and feels like a normal cap, comfortable, safe, and fashionable, while being enriched with the technology needed to assist the visually impaired in daily activities. Only the camera is visible at the front; the wiring is neatly sewn into the fabric of the cap. The aim was to build a device that does not signal any impairment on the part of the user and that can be worn in any environment.
Technical Approach: At the core of E-Guide is a machine learning pipeline based on deep learning models that analyze visual data from a webcam mounted on the cap. This allows the system to detect objects in real time across a variety of environments.
User Interface and Experience: The system announces each recognized object and its location out loud. The interface is designed to be friendly and adjustable, so it can be tailored to individual users and to different types and degrees of vision loss.
How It Works
E-Guide works through a camera mounted on the cap that captures images of the surroundings. These images are analyzed by machine learning algorithms to recognize the objects in them. The system then delivers audio feedback to the user through headphones, explaining which objects have been recognized and where they are located.
Additionally, the direction of the identified object is described using the clock-face method, and distance is indicated with beeps whose frequency increases as the object gets closer. The user can also choose how informative the feedback should be by selecting a basic or a detailed level: with the basic option the feedback omits color detection, while with the detailed option it also announces the color of the object. A simplified sketch of this feedback logic is shown below; the full implementation is covered step by step in the build instructions.
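As an illustration of this logic only, here is a tiny, hypothetical sketch of how the spoken message and the beep choice could be composed from the direction, distance, and verbosity values (the function name and file names here are placeholders, not the exact ones used in the project code later):

# Illustrative sketch: compose the feedback for one detected object.
def compose_feedback(name, direction, distance, color, verbosity):
    # The detailed level adds the detected color to the spoken message.
    if verbosity == "detailed":
        message = f"{name} at {direction}, {distance}, color {color}"
    else:
        message = f"{name} at {direction}, {distance}"
    # Closer objects are paired with faster beeps.
    beep = {"near": "beeps_fast.wav", "medium": "beeps_medium.wav"}.get(distance, "beeps_slow.wav")
    return message, beep

print(compose_feedback("cup", "3 o'clock", "near", "red", "detailed"))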
Build Instructions
Below are instructions on how to prototype, assemble, and deploy the E-Guide cap. They assume you are familiar with the Raspberry Pi 4 Model B and comfortable working with Python and machine learning.
The complete project code and 3D files can be found in this project's attachments and on GitHub.
E-Guide Hardware
To build the E-Guide device, you need the following components:
Raspberry Pi 4 Model B
Camera: a NoIR camera, a standard Pi camera, or a USB webcam
Power supply or power bank
Headphones
The E-Guide setup mainly consists of a Raspberry Pi 4 connected to a webcam and headphones for interaction with the user. Here is a circuit diagram showing the connections described above.
For the audio output to work in the E-Guide, you need to include sound files in .wav format alongside the Python script.
Download and unzip the project beeps sound bundle from the following link on GitHub: Beeps_sound.zip.
Download and unzip the COCO model. I chose the coco_ssd_mobilenet_v1_1.0_quant_2018_06_29 version. For convenience, I also uploaded it to my GitHub, so you can download and unzip it from there if you prefer.
Let's go through the project code step by step, explaining the implementation process in detail:
Step 1: Importing Necessary Packages
import os
import argparse
import cv2
import numpy as np
import sys
import time
from threading import Thread
import importlib.util
import pyttsx3
import speech_recognition as sr
import pygame
from pygame import mixer
import webcolors
from scipy.spatial import KDTree
I started by importing all the necessary Python packages. These include essential libraries like os, sys, and time for handling file paths, system operations, and time management, respectively. I used argparse for parsing command-line arguments, cv2 (OpenCV) for video processing, and numpy for numerical operations. I also imported threading to handle parallel tasks and importlib.util for dynamic module loading.
Additionally, pyttsx3 and speech_recognition were imported for text-to-speech and voice recognition functionalities. Finally, I used pygame for playing audio feedback, webcolors for color name detection, and KDTree from scipy.spatial for efficiently finding the closest color name.
Step 2: Setting Up the Color Recognition Function
def get_color_name(rgb_tuple):
    css3_db = webcolors.CSS3_HEX_TO_NAMES
    names = []
    rgb_values = []
    for color_hex, color_name in css3_db.items():
        names.append(color_name)
        rgb_values.append(webcolors.hex_to_rgb(color_hex))
    kdt_db = KDTree(rgb_values)
    distance, index = kdt_db.query(rgb_tuple)
    return names[index]
Next, I created a function called get_color_name that uses the KDTree algorithm to find the closest color name corresponding to an RGB value. I accessed a dictionary of CSS3 color names and their hexadecimal values from the webcolors library.
I then converted these hexadecimal values to RGB and stored them in a list. Using KDTree, I constructed a tree for fast nearest-neighbor lookup and returned the name of the closest matching color.
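As a quick sanity check of this helper, an RGB value of pure red should resolve to the CSS3 name "red":

# Example call: (255, 0, 0) is closest to the CSS3 color "red".
print(get_color_name((255, 0, 0)))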
Step 3: Setting Up Text-to-Speech (TTS)
def mySpeak(message):
    engine = pyttsx3.init()
    engine.setProperty('rate', 150)
    engine.say('{}'.format(message))
    engine.runAndWait()
I implemented a mySpeak function that uses the pyttsx3 library to convert text to speech. I initialized the TTS engine, set the speaking rate to 150 words per minute for clarity, and instructed the engine to speak the given message. This function was used throughout the project to provide audio feedback to the user.
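One thing to note: this version creates a new pyttsx3 engine on every call, which keeps the function simple. If the per-call startup delay is noticeable on the Raspberry Pi, one possible variant (shown here only as a sketch, not the code used in this project) initializes the engine once and reuses it:

# Sketch: initialize a single TTS engine once and reuse it for every message.
_engine = pyttsx3.init()
_engine.setProperty('rate', 150)

def mySpeak_reuse(message):
    _engine.say(str(message))
    _engine.runAndWait()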
Step 4: Defining Beep Files for Audio Feedback
BEEP_FAST = "/home/pi/Desktop/beeps_fast.wav"
BEEP_MEDIUM = "/home/pi/Desktop/beeps_medium.wav"
BEEP_SLOW = "/home/pi/Desktop/beeps_slow.wav"
I defined file paths for three different beep sounds (BEEP_FAST, BEEP_MEDIUM, and BEEP_SLOW) that were used later in the code to indicate the distance of the detected object. These beeps provided auditory cues to the user, corresponding to near, medium, or far distances.
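As an optional sanity check that the downloaded .wav files are reachable at these paths, you can play one of them directly with pygame (the mixer is also initialized in Step 9 below):

# Optional check: play the fast beep once to confirm the file path is valid.
mixer.init()
mixer.music.load(BEEP_FAST)
mixer.music.play()
time.sleep(1)  # give the short beep time to finish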
Step 5: Creating a Function for Voice-Activated Object Detection
def detect_object_by_voice():
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()
    with microphone as source:
        print("Listening...")
        mySpeak("Please say the object you want to detect")
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        object_name = recognizer.recognize_google(audio)
        print(f"You said: {object_name}")
        mySpeak(f"Searching for {object_name}")
        return object_name.lower()  # Convert to lowercase for consistency
    except sr.UnknownValueError:
        print("Could not understand audio")
        mySpeak("Sorry, I could not understand. Please try again.")
        return None
    except sr.RequestError as e:
        print(f"Could not request results; {e}")
        mySpeak("Sorry, I'm having trouble processing your request.")
        return None
I developed the detect_object_by_voice function, which prompts the user to say the name of the object they want to detect. Using the speech_recognition library, the function listens to the user's voice, recognizes the spoken words using Google's speech recognition API, and returns the object name in lowercase. If the system fails to understand the user, it provides appropriate feedback and prompts them to try again.
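Because the function returns None when nothing was understood, the caller should check the return value before using it. A small, hypothetical usage example (not part of the original listing):

# Example call: returns e.g. "cup" on success, or None if nothing was understood.
spoken = detect_object_by_voice()
if spoken is None:
    mySpeak("Let's try that again.")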
Step 6: Implementing Functions for Direction and Distance Estimation
def get_direction(x, width):
    if x < width / 12:
        return "1 o'clock"
    # (similar checks for other directions)
    else:
        return "12 o'clock"

def get_distance(y, height):
    if y < height / 3:
        return "near"
    elif y > 2 * height / 3:
        return "far"
    else:
        return "medium"
I created two helper functions, get_direction and get_distance, to determine the object's position in the frame. The get_direction function divided the frame horizontally into 12 regions (like the face of a clock) and returned the corresponding direction based on the object's x-coordinate. The get_distance function divided the frame vertically into three regions and returned "near", "medium", or "far" based on the object's y-coordinate.
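The listing above elides the intermediate checks with a comment. One way the full clock-face mapping could be written, shown here as my assumption about what those omitted branches do rather than the project's exact code, is to compute the region index directly:

# Hypothetical expansion of get_direction: split the frame width into twelve
# equal regions and label them as clock positions, left to right.
def get_direction_full(x, width):
    region = min(int(x / (width / 12)), 11)  # 0..11, clamped at the right edge
    labels = ["1 o'clock", "2 o'clock", "3 o'clock", "4 o'clock",
              "5 o'clock", "6 o'clock", "7 o'clock", "8 o'clock",
              "9 o'clock", "10 o'clock", "11 o'clock", "12 o'clock"]
    return labels[region]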
Step 7: Setting Up Voice-Activated Verbosity Level Selection
def get_verbosity_level():
    recognizer = sr.Recognizer()
    microphone = sr.Microphone()
    with microphone as source:
        print("Listening for verbosity level...")
        mySpeak("Please say the verbosity level. Options are: basic or detailed.")
        recognizer.adjust_for_ambient_noise(source)
        aud2 = recognizer.listen(source)
    try:
        verbosity = recognizer.recognize_google(aud2)
        print(f"You said: {verbosity}")
        mySpeak(f"You said {verbosity}")
        if verbosity.lower() in ["basic", "detailed", "detail", "details"]:
            return verbosity.lower()
        else:
            mySpeak("Invalid option. Please say either basic or detailed.")
            return get_verbosity_level()
    except sr.UnknownValueError:
        print("Could not understand audio")
        mySpeak("Sorry, I could not understand the audio.")
        return get_verbosity_level()
    except sr.RequestError as e:
        print(f"Could not request results; {e}")
        mySpeak(f"Sorry, I could not request results; {e}")
        return get_verbosity_level()
I created the get_verbosity_level function, which prompts the user to select between "basic" and "detailed" verbosity levels. The function uses speech recognition to capture the user's preference and repeats the process if it doesn't understand the user's response. The selected verbosity level determines the amount of information provided during object detection (e.g., with or without color details).
Step 8: Creating the VideoStream Class
class VideoStream:
    def __init__(self, resolution=(640, 480), framerate=30):
        self.stream = cv2.VideoCapture(0)
        ret = self.stream.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*'YUYV'))
        ret = self.stream.set(3, resolution[0])
        ret = self.stream.set(4, resolution[1])
        (self.grabbed, self.frame) = self.stream.read()
        self.stopped = False

    def start(self):
        Thread(target=self.update, args=()).start()
        return self

    def update(self):
        while True:
            if self.stopped:
                self.stream.release()
                return
            (self.grabbed, self.frame) = self.stream.read()

    def read(self):
        return self.frame

    def stop(self):
        self.stopped = True
I designed a VideoStream class to handle video streaming from the camera. This class allows the camera feed to be read in a separate thread, ensuring that the video stream remains smooth and responsive. The start method initiates the thread, while the read method returns the current frame. The stop method stops the video stream.
Step 9: Initializing Pygame for Audio Feedback
pygame.mixer.init()
I initialized pygame.mixer to handle audio feedback for object detection. This initialization allows the program to play sound files like beeps during execution.
Step 10: Parsing Command-Line Arguments
parser = argparse.ArgumentParser()
parser.add_argument('--modeldir', help='Folder the .tflite file is located in', required=True)
parser.add_argument('--graph', help='Name of the .tflite file, if different than detect.tflite', default='detect.tflite')
parser.add_argument('--labels', help='Name of the labelmap file, if different than labelmap.txt', default='labelmap.txt')
parser.add_argument('--threshold', help='Minimum confidence threshold for displaying detected objects', default=0.5)
parser.add_argument('--resolution', help='Desired webcam resolution in WxH. If the webcam does not support the resolution entered, errors may occur.', default='1280x720')
parser.add_argument('--edgetpu', help='Use Coral Edge TPU Accelerator to speed up detection', action='store_true')
args = parser.parse_args()
I used argparse to create a command-line interface for the project, allowing users to specify parameters like the model directory, TensorFlow Lite model file, label map file, confidence threshold, and webcam resolution. This flexible setup makes it easy to configure the project for different environments and hardware.
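For example, the detector might be launched like this (the script name e_guide.py is only a placeholder; use whatever you named the main script):

python3 e_guide.py --modeldir=coco_ssd_mobilenet_v1_1.0_quant_2018_06_29 --threshold=0.5 --resolution=1280x720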
Step 11: Loading the TensorFlow Lite Model and Labels
MODEL_NAME = args.modeldir
GRAPH_NAME = args.graph
LABELMAP_NAME = args.labels
min_conf_threshold = float(args.threshold)
resW, resH = args.resolution.split('x')
imW, imH = int(resW), int(resH)
use_TPU = args.edgetpu

# NOTE: the path and interpreter-import lines below are not shown in the original
# snippet; they are the standard way to resolve PATH_TO_CKPT / PATH_TO_LABELS
# and to import Interpreter for this kind of TFLite detection script.
CWD_PATH = os.getcwd()
PATH_TO_CKPT = os.path.join(CWD_PATH, MODEL_NAME, GRAPH_NAME)
PATH_TO_LABELS = os.path.join(CWD_PATH, MODEL_NAME, LABELMAP_NAME)

# Load the label map
with open(PATH_TO_LABELS, 'r') as f:
    labels = [line.strip() for line in f.readlines()]
if labels[0] == '???':
    del (labels[0])

# Import the TensorFlow Lite interpreter (tflite_runtime if available, otherwise full TensorFlow)
pkg = importlib.util.find_spec('tflite_runtime')
if pkg:
    from tflite_runtime.interpreter import Interpreter
else:
    from tensorflow.lite.python.interpreter import Interpreter

interpreter = Interpreter(model_path=PATH_TO_CKPT)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
height = input_details[0]['shape'][1]
width = input_details[0]['shape'][2]
floating_model = (input_details[0]['dtype'] == np.float32)
input_mean = 127.5
input_std = 127.5
outname = output_details[0]['name']
if 'StatefulPartitionedCall' in outname:
    boxes_idx, classes_idx, scores_idx = 1, 3, 0
else:
    boxes_idx, classes_idx, scores_idx = 0, 1, 2
I loaded the TensorFlow Lite model and label map specified by the user. The label map contains the class names for object detection, while the TensorFlow Lite interpreter processes the input and output tensors. This setup is essential for performing object detection and classification using the specified model.
Step 12: Starting the Video Stream and Processing Frames
videostream = VideoStream(resolution=(imW, imH), framerate=30).start()
time.sleep(1)

while True:
    frame1 = videostream.read()
    frame = frame1.copy()
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_resized = cv2.resize(frame_rgb, (width, height))
    input_data = np.expand_dims(frame_resized, axis=0)
    if floating_model:
        input_data = (np.float32(input_data) - input_mean) / input_std
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()
    boxes = interpreter.get_tensor(output_details[boxes_idx]['index'])[0]
    classes = interpreter.get_tensor(output_details[classes_idx]['index'])[0]
    scores = interpreter.get_tensor(output_details[scores_idx]['index'])[0]
    for i in range(len(scores)):
        if (scores[i] > min_conf_threshold) and (scores[i] <= 1.0):
            ymin = int(max(1, (boxes[i][0] * imH)))
            xmin = int(max(1, (boxes[i][1] * imW)))
            ymax = int(min(imH, (boxes[i][2] * imH)))
            xmax = int(min(imW, (boxes[i][3] * imW)))
            object_name = labels[int(classes[i])]
            # Sample the average color from the RGB frame so the channel order
            # matches what get_color_name / webcolors expect
            object_color = get_color_name(frame_rgb[ymin:ymax, xmin:xmax].mean(axis=0).mean(axis=0))
            cx = int((xmin + xmax) / 2)
            cy = int((ymin + ymax) / 2)
            direction = get_direction(cx, imW)
            distance = get_distance(cy, imH)
I started the live video feed and began processing frames in real time. Each frame was converted to RGB, resized to the input size expected by the model, and then expanded into a 4-D input tensor.
The TensorFlow Lite interpreter then performed object detection, returning bounding boxes, class indices, and confidence scores. For each detected object whose confidence score was above the threshold, I computed its bounding box coordinates, average color, estimated direction, and distance.
Step 13: Providing Audio Feedback
            # This block continues inside the detection loop from Step 12, at the
            # level of the per-object confidence check.
            if object_name == target_object:
                if verbosity == "detailed":
                    mySpeak(f"{object_name} at {direction}, {distance}, color {object_color}")
                else:
                    mySpeak(f"{object_name} at {direction}, {distance}")
                if distance == "near":
                    mixer.music.load(BEEP_FAST)
                elif distance == "medium":
                    mixer.music.load(BEEP_MEDIUM)
                else:
                    mixer.music.load(BEEP_SLOW)
                mixer.music.play()
                cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (10, 255, 0), 2)
                cv2.putText(frame, object_name, (xmin + 10, ymin + 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)
                cv2.putText(frame, f"Color: {object_color}", (xmin + 10, ymin + 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)

    cv2.imshow('Object detector', frame)
    if cv2.waitKey(1) == ord('q'):
        break

cv2.destroyAllWindows()
videostream.stop()
To close the loop, I used an if statement to compare each detected object with the target object the user had requested. Depending on the selected verbosity level, the system then reported either detailed or basic information about the object's direction, distance, and color.
A beep corresponding to the object's distance was played, and I also drew the bounding box, the object's name, and its color directly on the video stream for confirmation purposes. The loop stopped when the user pressed the 'q' key, which closed the window and halted the video stream.
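The loop above references target_object and verbosity without showing where they are set. Presumably they are obtained from the voice-interaction helpers before the main loop starts; a minimal sketch of that setup (my assumption, not the original listing) would be:

# Assumed setup before the detection loop from Step 12: choose the verbosity
# level first, then keep asking until the user names a target object.
verbosity = get_verbosity_level()
target_object = None
while target_object is None:
    target_object = detect_object_by_voice()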
Prototype Testing
As part of the evaluation process, E-Guide was tested in a number of settings to get a sense of how effective it is. Testing involved:
- Indoor Environments: Testing in a typical household setup, using objects easily found in any home.
- Low-Light Conditions: Using a night-vision camera to identify objects and targets in dark environments.
- Multiple Object Detection: Evaluating the system's ability to recognize a particular object when other objects are present around it.
To use the E-Guide:
- Wear the Cap: Put on the cap so that the camera faces what you would be looking at.
- Power On: Power on the Raspberry Pi to start the system and let it boot up completely.
- Voice Command: Clearly speak the name of the object you want to find.
- Receive Feedback: Listen to the audio feedback for direction, distance, and color information.
- Repeat: If the object is not found, try again or search for another object.
Future enhancements for the E-Guide project include:
- Enhanced Object Recognition: Expanding the object database and improving recognition accuracy.
- Integration with Smart Home Devices: Allowing the E-Guide to communicate with other smart devices.
- Real-Time Navigation Assistance: Extending the functionality so users can move from one environment to another with ease.
- Customizable Voice Commands: Letting users define their own words for controlling the device by voice.